Genetic Mapping and Marker Assisted Selection
Basics, Practice and Benefits

N. Manikanda Boopathi
Plant Molecular Biology &
Bioinformatics
Tamil Nadu Agricultural University
Coimbatore, TN, India

ISBN 978-81-322-0957-7
ISBN 978-81-322-0958-4 (eBook)
DOI 10.1007/978-81-322-0958-4
Springer New Delhi Heidelberg New York Dordrecht London

Library of Congress Control Number: 2012954276

© Springer India 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material
supplied specifically for the purpose of being entered and executed on a computer system, for
exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is
permitted only under the provisions of the Copyright Law of the Publisher's location, in its
current version, and permission for use must always be obtained from Springer. Permissions for
use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable
to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility
for any errors or omissions that may be made. The publisher makes no warranty, express or
implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Current trends in agricultural biotechnological tools clearly show that the
genes or regulatory elements controlling agronomically important traits
remain unknown and, possibly, will remain mysterious for some time. For the
moment, marker assisted selection (MAS) is considered to be an efficient
supplementary tool to conventional plant breeding, since other techniques
such as genetic engineering are limited in their ability to transfer
the large number of genes residing in quantitative trait loci (QTL). Plant
scientists will continue to use QTL maps and markers that tag and manipulate
the genes of interest for many years to come.
Despite the importance of the subject, I have found it difficult, ever since my
graduation, to find a book that explains the basics and procedures of genetic
mapping and MAS. On the other hand, I could always find a large body of
advanced literature on every aspect of MAS in the latest journals. That is the
reason I started to write this small introductory book. I am very sure that
what I have tried to show in this book is just a single cup of water taken from
the genetic mapping and MAS pond. Further, I am fully aware that it is not
possible to list each and every aspect of MAS and its contributors, even if
I were to work on it for years. Because of the rapid developments in genetical
and statistical methodologies in MAS, anyone can easily find missing
component(s) even in a complete index of MAS prepared by a subject
specialist. The simple idea behind writing this book is to introduce the basic
concepts and protocols for practising MAS in crop plants with suitable
examples. There are different roads to reach the destination.
I just stand at a junction with a comprehensive map, trying to explain all
the possible routes, their rewards and restrictions. And of course, you can
find your own way. Hence, readers are requested to refer to the bibliography
to get more information on the given topics and find an appropriate design of
an MAS programme for their targeted crop and trait.
I further request your feedback, suggestions and critical comments on this
work to improve the quality and usage of this book.
I sincerely apologise for not having cited all the authors who have contributed
a lot to this field. This is mainly due to space limitations and not to any other
intention. I also wish to thank and acknowledge all my teachers, guides,
colleagues and friends whom I have had the good fortune to associate with
during my research period.


I greatly appreciate and thank Springer for publishing this work.


I affectionately dedicate this book to my dearly loved son, Sri Ezhilalan
Boopathi, who has forgone all his quality time with me.

N. Manikanda Boopathi
Coimbatore, 20th November, 2012
nmboopathi@tnau.ac.in
www.sites.google.com/sites/drnmboopathi

Contents

1 Germplasm Characterisation: Utilising
the Underexploited Resources...................................................... 1
Phenotyping for Morphological and Agronomic Characters .......... 2
Case Study in Rice Germplasm Characterisation
for Drought Resistance............................................................... 2
Traits Useful for Characterisation .............................................. 3
Allele Mining .................................................................................. 5
Genetic Diversity and Clustering .................................................... 8
Software ..................................................................................... 9
Principle Behind the Genetic Diversity
Analysis ...................................................................................... 9
Principle of Measuring Goodness of Fit
of a Classification ....................................................................... 10
Genetic Diversity Analysis Using Molecular Markers ................... 10
Parental Selection............................................................................ 20
Bibliography ................................................................................... 20
Literature Cited ............................................................................... 20
Further Readings ............................................................................. 20
2 Mapping Population Development .............................................. 23
Mapping Population and Its Importance
in Genetic Mapping......................................................................... 23
Selfing and Crossing Techniques in Crop Plants ............................ 27
F2 Progenies .................................................................................... 27
F2-Derived F3 (F2:3) Populations ...................................................... 28
F2 Intermating Populations or Immortalised F2 Populations........... 28
DH Lines ......................................................................................... 29
BC Progenies .................................................................................. 29
RILs................................................................................................. 30
NILs, Exotic Libraries and Advanced
Backcross Populations .................................................................... 30
Four-Way Cross Populations........................................................... 31
Multi-Cross Populations ................................................................. 31
Nested Association Mapping Populations ...................................... 32
Natural Populations......................................................................... 33
Chromosome-Specific Genetic Stocks
for Linkage Mapping ...................................................................... 34
Bulk Segregant Analysis ................................................................. 34
Combining Markers and Populations.............................................. 35
Characterisation of Mapping Populations....................................... 35
Choice of Mapping Populations...................................................... 35
Challenges in Mapping Population Development
and Solutions to These Challenges ................................................. 35
Bibliography ................................................................................... 37
Literature Cited ............................................................................... 37
Further Readings ............................................................................. 37
3 Genotyping of Mapping Population ............................................ 39
Markers and Its Importance ............................................................ 39
Morphological Markers .................................................................. 39
Biochemical Markers or Isozymes.................................................. 40
Principle ..................................................................................... 40
Electrophoresis ........................................................................... 41
Chromatography......................................................................... 42
Gel Filtration .............................................................................. 42
Immunochemistry ...................................................................... 42
Catalysis ..................................................................................... 43
Genome Structure and Organisation ............................................... 43
Chromosome Structure............................................................... 45
Mitochondrial DNA ................................................................... 45
Chloroplast DNA........................................................................ 46
Molecular Markers .......................................................................... 46
Restriction Fragment Length Polymorphism (RFLP)..................... 51
PCR-Based Techniques ................................................................... 51
Arbitrarily Primed PCR-Based Markers ......................................... 54
Random Amplified Polymorphic DNA (RAPD)........................ 54
Arbitrarily Primed Polymerase Chain Reaction
(AP-PCR) and DNA Amplification
Fingerprinting (DAF) ................................................................. 54
Amplified Fragment Length Polymorphism (AFLP) ................. 55
Sequence-Specific PCR-Based Markers ......................................... 55
Microsatellite-Based Marker Technique .................................... 56
Inter-Simple Sequence Repeats (ISSR) ..................................... 60
Single-Nucleotide Polymorphism (SNPs).................................. 61
Single-Feature Polymorphism (SFP) ......................................... 61
Sequence-Characterised Amplified Regions (SCAR) ................ 62
Cleaved Amplified Polymorphic Sequences (CAPS)................. 62
Randomly Amplified Microsatellite
Polymorphisms (RAMP)............................................................ 63
Sequence-Related Amplified Polymorphism (SRAP)................ 64
Target Region Amplification Polymorphism (TRAP) ................ 64
Single-Strand Conformation Polymorphism (SSCP) ................. 64
Transposable Elements (TE)-Based Molecular Markers ................ 65
Retrotransposon-Based Molecular Markers ............................... 66
Diversity Array Technology (DArT) ............................................... 68
Intron-Targeted Intron-Exon Splice Conjunction
(IT-ISJ) Marker ............................................................................... 68
Restriction Site Associated DNA (RAD) Markers.......................... 69
RNA-Based Molecular Markers ..................................................... 69
cDNA-AFLP .............................................................................. 70
RNA Fingerprinting by Arbitrarily Primed PCR
(RAP-PCR) ................................................................................ 70
cDNA-SSCP ............................................................................... 70
Role of Genomics ........................................................................... 70
Selection of Marker Technology ..................................................... 74
Research Problem....................................................................... 74
The Number of Loci and/or Alleles ........................................... 75
Discrimination Level .................................................................. 75
Mode of Inheritance ................................................................... 75
Quality of DNA .......................................................................... 75
Expertise Required ..................................................................... 75
Costs ........................................................................................... 75
Speed .......................................................................................... 76
Reproducibility........................................................................... 76
PCR Versus Non-PCR Techniques ............................................. 76
Marker Genotyping and Scoring ..................................................... 76
Analysing the Genotype Score: Chi-Square Test ............................ 77
χ² Test to Analyse the Segregation Ratio
Using the Program ANTMAP......................................................... 78
Bibliography ................................................................................... 78
Literature Cited ............................................................................... 78
Further Readings ............................................................................. 80
4 Linkage Map Construction .......................................................... 81
Basics of Genetic/Linkage Mapping:
Mendelian Ratios, Meiosis, Crossing Over
and Partial Linkage ......................................................................... 81
Mapping Functions ......................................................................... 87
Mapping of Genetic Markers: Practical Considerations ................. 89
Testing for Linkage: LOD Scores ................................................... 90
Grouping, Ordering and Spacing .................................................... 90
Sources of Error .............................................................................. 92
Chromosomal Assignment .............................................................. 94
Allopolyploidy and Autopolyploidy ............................................... 94
Bridging Linkage Maps to Develop Unified
Linkage Maps.................................................................................. 95
Bibliography ................................................................................... 108
Literature Cited ............................................................................... 108
Further Readings ............................................................................. 108
5 Phenotyping ................................................................................... 109
Phenotyping Versus QTL Mapping................................................. 109
Need for Precise Phenotyping......................................................... 110
Phenotyping for Biotic Stress ......................................................... 111
Phenotyping for Abiotic Stress ....................................................... 112
Heritability of Phenotypes .............................................................. 113
Statistical Analysis of Phenotypic Data: Simple Statistics,
Heritability Estimation and Correlation .......................................... 115
Bibliography ................................................................................... 115
Literature Cited ............................................................................... 115
Further Readings ............................................................................. 115
6 QTL Identification ........................................................................ 117
QTL: A Prelude ............................................................................... 117
Single-Marker Analysis (SMA) ...................................................... 119
Interval Mapping ............................................................................. 120
Multiple QTL and Methods to Detect Multiple QTL ..................... 124
Composite Interval Mapping .......................................................... 124
Multiple Trait Mapping ................................................................... 125
Testing for Linked QTL Versus Pleiotropic QTL ........................... 125
Multiple Interval Mapping (MIM) or Multiple QTL Mapping....... 125
Statistical Significance .................................................................... 140
Permutation Testing ........................................................................ 140
Bootstrapping .................................................................................. 141
Permutation Versus Bootstrapping
and Other Methods.......................................................................... 141
QTL × QTL Interaction: Impact of Epistasis................................... 142
QTL × Environment Interaction ...................................................... 143
Congruence of QTL: Across the Environments and
Across the Genetic Backgrounds Is the Key in MAS ..................... 144
Meta-QTL Analysis ........................................................................ 144
Concluding Remarks on QTL Methods .......................................... 145
Alternatives in Classical QTL Mapping ......................................... 146
Bulked Segregant Analysis and Selective Genotyping ................... 146
Genomics-Assisted Breeding .......................................................... 146
Array Mapping ................................................................................ 147
Association Mapping ...................................................................... 148
Nested Association Mapping .......................................................... 151
EcoTILLING................................................................................... 152
Challenges in QTL Mapping .......................................................... 153
Confronts with Mapping Populations ........................................ 153
Markers and Its Implications ...................................................... 155
Segregation Distortion................................................................ 155
Phenotyping................................................................................ 156
Statistical Issues ......................................................................... 157
Practical Utility .......................................................................... 161
Bibliography ................................................................................... 162
Literature Cited ............................................................................... 162
Further Readings ............................................................................. 163
7 Fine Mapping ................................................................................ 165
Need for Fine Mapping or High-Resolution Mapping ................... 165
Types of Molecular Markers Suitable for Fine Mapping ................ 166
Physical Mapping and Its Role in Fine Mapping............................ 166
Comparative Mapping..................................................................... 167
Genetical Genomics/eQTL Mapping .............................................. 168
Map-Based Cloning ........................................................................ 170
Validation of QTLs ......................................................................... 171
Testing the Markers in Related Germplasm Accessions ................. 171
Bibliography ................................................................................... 172
Literature Cited ............................................................................... 172
Further Readings ............................................................................. 172
8 Marker-Assisted Selection ............................................................ 173
Advantages of MAS ........................................................................ 173
Limitations in MAS ........................................................................ 175
Prerequisites for an Efficient Marker-Assisted
Selection Program ........................................................................... 175
Procedure for a Generalised MAS Program for Selection
from Breeding Lines/Populations ................................................... 176
Marker-Assisted Backcross Breeding ............................................. 177
Gene Pyramiding or Stacking ......................................................... 181
Accelerated Methods of Gene Pyramiding ..................................... 181
Marker-Assisted Recurrent Selection (MARS) .............................. 181
Advanced Backcross (AB)-QTL Analysis ...................................... 184
Mapping-As-You-Go (MAYG) ....................................................... 184
Application of Markers in Germplasm Storage,
Evaluation and Use ......................................................................... 184
Resources for MAS on the Web ...................................................... 185
Bibliography ................................................................................... 185
Literature Cited ............................................................................... 185
Further Readings ............................................................................. 186
9 Success Stories in MAS................................................................. 187
Tomato ............................................................................................ 187
Maize............................................................................................... 188
Wheat .............................................................................................. 188
Rice ................................................................................................. 188
Barley .............................................................................................. 189
Soybean ........................................................................................... 189
Varieties Released Through MAS ................................................... 189
Hybrids Developed Through MAS ................................................. 190
MAS in Multinational Companies .................................................. 190
Contrasting Stories .......................................................................... 190
Conclusions and Future Prospects .................................................. 190
Bibliography ................................................................................... 191
Literature Cited ............................................................................... 191
Further Readings ............................................................................. 192
10 Curtain Raiser to Novel MAS Platforms .................................... 193
Current Techniques in Molecular, Biochemical
and Physiological Studies and Its Integration into MAS ................ 193
Molecular Techniques ..................................................................... 193
Expression Profiling ........................................................................ 193
cDNA Library Construction............................................................ 195
Differential Display and Representational
Difference Analysis ......................................................................... 196
Subtractive Hybridisation ............................................................... 196
Microarray....................................................................................... 199
Types of DNA Chips and Their Production ............................... 200
Hybridisation and Detection Methods ....................................... 200
1. DNA Sequencing by Hybridisation........................................ 201
2. Single Nucleotide Polymorphisms and Point Mutations ....... 202
3. Functional Genomics ............................................................. 202
4. Reverse Genetics .................................................................... 202
5. Diagnostics and Genetic Mapping ......................................... 203
6. Genomic Mismatch Scanning ................................................ 203
7. DNA Chips and Agriculture ................................................... 203
8. Proteomics .............................................................................. 204
9. Nucleic Acid Sequencing ....................................................... 204
Second-Generation DNA Sequencing ........................................ 205
454 Pyrosequencing ................................................................... 206
Illumina Genome Analyser ........................................................ 206
AB SOLiD.................................................................................. 207
Microchip-Based Electrophoretic Sequencing........................... 209
Sequencing by Hybridisation ..................................................... 210
Sequencing in Real Time ........................................................... 210
Targeted Capture of Genomic Subsets ....................................... 211
Handling and Storage of Sequence Information ........................ 212
Predicting Function from Sequence ........................................... 213
Homology Searches ................................................................... 213
Other Sequence Comparisons Strategies ................................... 214
Serial Analysis of Gene Expression (SAGE) .................................. 215
cDNA-AFLP ................................................................................... 217
RFLP-Coupled Domain-Directed Differential
Display (RC4D) .............................................................................. 219
Gene Tagging by Insertional Mutagenesis ...................................... 219
T-DNA Tag ................................................................................. 220
Transposon Tags ......................................................................... 220
Post-transcriptional Gene Silencing................................................ 221
MicroRNAs ..................................................................................... 221
Biochemical Techniques ................................................................. 222
Plant Proteomics ............................................................................. 222
Why Proteomics? ............................................................................ 224
Types of Proteomics ........................................................................ 225
Protein Expression Proteomics .................................................. 225
Structural Proteomics ................................................................. 225
Functional Proteomics................................................................ 225
Protein Analysis .............................................................................. 225
One- and Two-Dimensional Gel Electrophoresis ........................... 225
Alternatives to Electrophoresis in Proteomics ................................ 227
Acquisition of Protein Structure Information ................................. 227
Edman Sequencing ..................................................................... 227
Mass Spectrometry ..................................................................... 228
Types of Mass Spectrometers ......................................................... 230
Peptide Fragmentation .................................................................... 231
De Novo Peptide Sequence Information ......................................... 231
Uninterpreted MS/MS Data Searching ........................................... 231
Proteomics Approach to Protein Phosphorylation .......................... 232
Phosphoprotein Enrichment ............................................................ 232
Phosphorylation Site Determination
by Edman Degradation ................................................................... 233
Phosphorylation Site Determination
by Mass Spectrometry..................................................................... 233
Metabolite Profiling Technologies .................................................. 234
Physiological Techniques ................................................................ 234
Near-Infrared (NIR) Spectroscopy.................................................. 236
Canopy Spectral Reflectance (SR) and Infrared
Thermography (IRT) ....................................................................... 236
Estimation of Compatible Solutes .................................................. 236
Genomics-Assisted Breeding .......................................................... 237
Functional Markers ......................................................................... 238
Comparative Genomics ................................................................... 239
Identification of Novel Molecular Networks
and Construction of New Metabolic Pathway ................................ 240
Bioinformatics for MAS ................................................................. 241
Bibliography ................................................................................... 243
Literature Cited ............................................................................... 243
Further Readings ............................................................................. 244
11 Recent Advances in MAS in Major Crops .................................. 245
Rice ................................................................................................. 245
Rice and Drought ....................................................................... 246
Mechanisms of Drought Resistance in Rice .............................. 246
Phenology................................................................................... 246
Root System ............................................................................... 247
Osmotic Adjustment ................................................................... 247
Dehydration Tolerance ............................................................... 248
Shoot-Related Drought-Resistance Traits .................................. 248
Genetic Linkage Map in Rice .................................................... 250
QTL Mapping of Drought-Resistance
Traits in Rice .............................................................................. 250
Rice Subspecies and Habitat ...................................................... 256
Marker-Aided Selection and Near-Isogenic Lines
for Drought-Resistance Improvement ........................................ 257
Target Population of Environment and Molecular
Breeding ..................................................................................... 257
Concluding Remarks on MAS in Rice
for Water-Limited Environments................................................ 258
Cotton.............................................................................................. 259
Status of Cotton Molecular Marker Technology ........................ 260
Molecular Markers and Polymorphism in Cotton ...................... 260
Simple Sequence Repeats (SSRs) in Cotton .............................. 260
Cotton Linkage Maps ................................................................. 262
QTL Mapping for Yield and Fibre Quality
Traits in Cotton........................................................................... 262
Specific Challenges in Cotton MAS .......................................... 263
Confronts with Mapping Population .......................................... 263
QTL × Environment Analysis ..................................................... 263
Incongruence Among QTL Studies............................................ 264
Complexities in Integration of Functional
Genomics with QTL................................................................... 264
Alternatives and Future Perspectives ......................................... 264
Meta-analysis of QTL: Synergy Through Networks.................. 264
Map-Based Cloning ................................................................... 265
Cotton Genome Sequencing....................................................... 265
Advances in Functional Genomics............................................. 265
System Quantitative Genetics: Bridging
Subdisciplines ............................................................................ 266
Association Mapping and Alternatives ...................................... 266
Improved Databases ................................................................... 266
Concluding Remarks for MAS in Cotton................................... 267
Mungbean ....................................................................................... 267
Genetic Diversity and Linkage Mapping
in Mungbean............................................................................... 268
QTL Mapping in Mungbean ...................................................... 268
Legume Comparative Genomics and Its Importance
in Mungbean MAS ..................................................................... 269
Concluding Remarks for MAS in Mungbean ............................ 270
Tomato ............................................................................................ 271
Conventional Breeding and Tomato Improvement .................... 271
Biotechnology and Tomato Breeding ......................................... 272
MAS for Bacterial Spot Resistance............................................ 273
MAS for Tomato Yellow Leaf Curl Virus Resistance ................ 274
MAS for Other Economic Traits ................................................ 275
MAS for Genetic Improvement
of Fruit Quality Traits................................................................. 275
Fine Mapping and Characterisation
of Fruit-Size QTL....................................................................... 276
Concluding Remarks for MAS in Tomato ................................. 276
Hot Pepper ...................................................................................... 277
Progress in MAS in Hot Pepper ................................................. 277
Concluding Remarks on MAS in Hot Pepper ............................ 278
Bibliography ................................................................................... 278
Literature Cited ............................................................................... 278
Further Reading .............................................................................. 280
12 Future Perspectives in MAS ......................................................... 281
MAS in Orphan Crops .................................................................... 283
MAS in Developing Countries ........................................................ 285
Community Efforts in Developing Countries
and Their Implications in MAS ...................................................... 286
Field and Laboratory Infrastructure Improvement.......................... 288
Lessons Learnt and Concluding Remarks....................................... 289
Bibliography ................................................................................... 290
Literature Cited ............................................................................... 290
Further Readings ............................................................................. 290

About the Author................................................................................... 293


1 Germplasm Characterisation: Utilising the Underexploited Resources

In any given geographical region, farmers cultivate only a small set of crop varieties over a long period of time. Modern plant breeding programs have also resulted in a severe genetic bottleneck. As a consequence, reduced genetic diversity is widespread among crop plants, and it is considered detrimental to future farming. This is because continuous use of the same cultivars usually leads to at least (1) extensive persistence of existing pests and diseases (as well as emergence of new ones) in the given crop species and (2) loss of landraces and wild species of the given crop plants (otherwise referred to as genetic erosion). Owing to ever-increasing population growth and the continuous shrinking of farmland, farmers are forced to cultivate crop plants under a wide range of latitudes and longitudes. This requires crop plants that can tolerate variations in light, temperature, water and nutrients, besides the occurrence of the particular pests and diseases that challenge crop production in these environments. Conventional breeding approaches, such as phenotypic selection of desirable types among the breeding materials, have contributed considerably to the genetic improvement of crops. However, only a few genetically improved lines are available to meet such challenges. The main limitation that prevents further progress through conventional breeding methods is the lack of adequate genetic, biochemical and molecular knowledge on the expression of traits that are beneficial to crop cultivation and production.

Most agronomically and economically important traits are quantitative in nature and have complex inheritance. Thanks to the developments in nucleic acid characterisation and manipulation, it is now possible to genetically analyse and manipulate such quantitative traits using quantitative trait loci (QTL) mapping and marker-assisted selection (MAS). Thus, advances in molecular marker technologies have opened the door to new techniques for the construction and screening of breeding populations, which increase the efficiency of selection and accelerate the rate of genetic gain. By employing genetic and QTL mapping, a marker can either be located within the gene of interest or be linked to a gene determining a trait of interest. Consequently, MAS can be executed as selection for a trait based on genotype, using associated markers, rather than on the phenotype of the trait. This book is designed to describe the basics of genetic and QTL mapping using molecular markers and the practice of MAS in crop plants with step-by-step procedures. In general, a MAS scheme for the genetic improvement of crop plants for a given trait involves (1) characterisation of germplasm for the trait of interest, (2) selection of extremely diverse parents, (3) development of a mapping population, (4) selection of appropriate combinations of molecular markers and genotyping of the parents and mapping population, (5) construction of a genetic or linkage map, (6) phenotyping of the mapping population for the selected trait, (7) QTL analysis by combining the data obtained from steps 5 and 6, (8) fine mapping and validation of QTLs and (9) executing MAS for the target trait. This first chapter therefore describes the first vital step in MAS: characterisation of germplasm.
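The idea of selecting on genotype rather than on phenotype can be illustrated with a minimal Python sketch. It assumes a single hypothetical marker tightly linked to the target gene, with allele 'A' (inherited from the donor parent) taken to be the favourable one. The plant IDs, the genotypes and the function name select_by_marker are invented for illustration and do not come from any MAS software.

```python
# Hypothetical genotypes of five progeny plants at one marker linked to the
# target gene. 'A' = allele from the donor parent (assumed favourable),
# 'B' = allele from the recurrent parent.
progeny_genotypes = {
    "P01": ("A", "A"),
    "P02": ("A", "B"),
    "P03": ("B", "B"),
    "P04": ("B", "A"),
    "P05": ("B", "B"),
}


def select_by_marker(genotypes, favourable="A", min_copies=1):
    """Return the IDs of plants carrying at least `min_copies` of the favourable allele."""
    return sorted(
        plant_id
        for plant_id, alleles in genotypes.items()
        if alleles.count(favourable) >= min_copies
    )


# Plants carrying the favourable allele in at least one copy.
carriers = select_by_marker(progeny_genotypes)
# Plants homozygous for the favourable allele.
homozygotes = select_by_marker(progeny_genotypes, min_copies=2)
```

In this sketch the plants are chosen from marker data alone, without any field score for the trait. In practice the reliability of such a choice depends on how tightly the marker is linked to the gene, which is what the mapping steps listed above are meant to establish.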

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_1, Springer India 2013

Traditional collections, exotic accessions and the wild species of crop plants, which are maintained in germplasm banks, possess excellent tolerance to the biotic and abiotic stresses that are prevalent in the existing and new crop production environments described above. Such germplasm collections provide potential resources for future crop improvement programs designed to cope with the many biotic and abiotic stresses. Hence, it is important to characterise and understand the genetic variation that exists in germplasm for its effective and proficient utilisation in crop breeding programs using MAS. Characterisation of germplasm facilitates identification and selection of beneficial genes or alleles in the related wild species and landraces via MAS. It involves screening each entry for morphological and agronomic characters using a standard descriptor list. As many characteristics as possible should be recorded using coded qualitative scores. Further, gathering passport data (such as country, site and location of collection) permits selection of germplasm on a geographical basis. In addition, a range of molecular markers (e.g. isozymes, RAPD, AFLP and microsatellites) are also used for classification of germplasm, and these data are useful for more detailed genetic diversity analysis. Thus, screening thousands of accessions for pest and disease resistance and tolerance to different abiotic stresses, together with systematic studies of the wild species and molecular studies of genetic diversity, provides data on species taxonomy and genetic relationships. Based on this information, a core set of germplasm entries can be chosen for the selection of parents. Knowledge of genetic diversity and relationships among the elite breeding materials constituting the germplasm (see below) can have a significant impact on the selection of parents in a crop improvement program. Selection of parents is also imperative in QTL mapping (see below).

Phenotyping for Morphological and Agronomic Characters

The most salient hurdle to the effective utilisation of germplasm in the development of improved crop cultivars is the difficulty of accurately phenotyping the germplasm. Combining precise phenotyping of germplasm with dissection of the genetic and functional basis of yield and other agronomically and/or economically important traits under various biotic and abiotic stresses would give unprecedented ways to characterise crop germplasm. Thus, precise phenotyping is the first key step, and its successful completion would ensure a better germplasm characterisation. To this end, it is imperative to have knowledge of the factors that affect the quality of phenotypic data and to define the nomenclature and mechanisms of crop productivity under different climatic and stress conditions. All these limiting factors should be addressed adequately for the target crop and trait. There is no general procedure that fits all crops and all target traits; it varies from crop to crop (and even within a species) and from trait to trait. As an example, a detailed phenotyping procedure in rice for characterising germplasm for one of the most important abiotic stresses, drought, is elucidated below. Many of the concepts presented here are equally useful for drought-resistance screening in other crops.

Case Study in Rice Germplasm Characterisation for Drought Resistance

Realisation of the Essential Requirements
It has long been realised that the release of rice cultivars with enhanced resistance to drought and with high yield stability is essential to ensure food security in the twenty-first century, owing to the frequent occurrence and severity of water stress around the world. Hence, we need to genetically tailor new cultivars that can withstand drought and closely related environmental constraints such as high temperature, salinity and nutrient deficiency. In the past, traditional breeding strategies have shown several promising achievements. However, progress has been slow on several occasions, mainly due to lack of knowledge of drought-resistance mechanisms and of appropriate screening methods and strategies, poor heritability of traits under water stress in the field, lack of
comprehensive interpretation of results from molecular, biochemical, physiological, genetic and agronomic perspectives, etc. Hence, before proceeding further, it is important to set the scene on long-term and short-term objectives.

As stated earlier, we should first describe the nomenclature and the mechanism of expression of the target trait. In agriculture, the term drought generally refers to a condition in which the amount of water available via rainfall and/or irrigation is insufficient to meet the transpiration needs of the crop. Plants adopt different mechanisms to withstand and mitigate the negative effects of such water deficit. In general, there are traits that (1) help plants to survive under drought stress and (2) mitigate yield losses in crops exposed to water stress. Therefore, it is essential to judge the overall phenotypic value of a given germplasm accession in terms of yield under water stress in the given environment. In other words, the knowledge generated by any drought-related study should address its impact on yield and its component traits, either directly or indirectly. Several comprehensive reviews, dedicated volumes and book chapters have addressed the mechanisms underlying drought resistance and the breeding strategies that can improve yield under water stress (please see Further Readings). Provided below is a very simple synopsis of this knowledge and its application in characterising rice germplasm for drought resistance in a laboratory that has minimum facilities.

To begin well, the most critical step is to define the environment to which the breeding program is targeted (sometimes referred to as the target population of environments). Each crop is grown in a complex set of socio-physical and biological environments, and no two environments are identical, even on the same farm. The identification and characterisation of a target environment is facilitated by the use of historic records of weather data, past cropping patterns, etc. Simulation models can also be used to describe the target environment by the frequency of occurrence of water stress and based on the soil moisture profile. This helps to shortlist the type (e.g. early/mid/terminal water stress), severity (e.g. mild/moderate/severe) and duration (e.g. short/long duration) of water stress in the given environment. This also helps to describe other associated stresses such as high temperature, dry and high-speed winds and nutrient deficiency. Another key point in characterising the germplasm within the given environment is the observation of genotype-by-environment interactions on the expression of yield traits. This observation may include additional environmental factors such as rainfall pattern; maximum and minimum temperature; relative humidity; soil physical (e.g. texture), chemical (e.g. presence of heavy metals or other toxic elements) and biological factors (e.g. beneficial and harmful microbial load); diseases (e.g. foliar diseases); pests/beneficial insects (e.g. pollinators); and parasites. Thus, it is nearly impossible to find a single environment that represents the target population of environments. An ideal strategy would be to phenotype for drought tolerance and yield stability across a broad range of sites within the given environment, with at least three replications in a Latin square design, which effectively takes care of field heterogeneity. During the past decades, it has been repeatedly shown in several crops that multi-environment trials are instrumental in increasing yield potential under drought. Thus, it is essential to define the set of environments, fields and seasons in which the given germplasm entry is expected to do well before beginning genetic mapping and MAS.

Traits Useful for Characterisation

Considering that farmers ultimately harvest grain in rice, it is vital to interpret cause-and-effect relationships (usually with correlation studies) between morpho-physio-agronomical traits and grain yield (or other economic traits in the case of other crops) under drought conditions. It should be noted that the sign and magnitude of this relationship are not universal and can change widely according to the frequency, timing and intensity of water stress periods. Thus, the traits that have potential for characterising rice germplasm for improving yield under water-limited conditions
should be genetically (i.e. causally) correlated with yield and should preferably have higher heritability than yield (see chapter 5 for heritability calculation). Presence of sufficient genetic variability and lack of yield penalties under favourable conditions are considered additional features of these traits. Ideally, measurement of such trait(s) must be non-destructive (i.e. use a small number of plants or plant samples), rapid (e.g. without lengthy procedures to calibrate sensors to individual plants), accurate and inexpensive and, finally, should reflect the long-term ecophysiological performance of the crop. Such traits should be cheaper and easier to measure than grain yield under stress. The reader will now realise the difficulty of identifying such a trait, since no single trait can satisfy all the above requirements. Very often, experiments are lost due to pest or erratic weather damage before final yield is recorded. In such conditions, these traits are useful. Based on the peer-reviewed literature, careful testing under different experimental procedures and personal experience, the following traits are listed as potential candidates for characterising rice germplasm. As a caution, it should be noted that these traits are not final and are not suitable for all water-limited environments. Readers are requested to finalise the traits based on the target environment, breeding objective, etc. However, the concept and procedure of characterising plant germplasm described here are the same for all plants. Sampling bias can be avoided by ensuring that random representative plants are selected for measurement of traits in each plot. It is highlighted again that the secondary traits (those other than grain yield) should always be associated with yield (i.e. show good statistical correlations), and this is essential in drawing any final conclusion on germplasm characterisation.

Early Vigour
Several physiological and biochemical studies have shown that selecting germplasm accessions that show early and vigorous establishment leaves stored water available for later developmental stages, when soil moisture becomes progressively exhausted and increasingly limiting for yield. On the other hand, excessively vigorous leaf development could cause early depletion of soil moisture. Thus, the optimal degree of vigour should be selected, and besides genetic potential, it also depends on the characteristics of the given environment. Keeping all this in mind, each accession of the rice germplasm should be screened by counting the number of days required to germinate and to develop a particular leaf area under field conditions.

Flowering Time
Another critical factor that optimises adaptation (and produces better yield) under low water availability is flowering time. It has been established in almost all crops that there is a positive association between yield and flowering time across different levels of water availability. Days to 50% flowering can be phenotyped quite easily and effectively under both irrigated control and water-stressed experimental conditions, and it can be used as a valuable trait in drought tolerance breeding programs. Flowering delay (= days to flowering under stress conditions − days to flowering under irrigated control) could serve as a potential additional trait alongside days to 50% flowering.

Chlorophyll Concentration, Leaf Rolling and Leaf Drying
The traits that have been phenotyped to indirectly estimate photosynthetic potential (a critical element that decides final yield) are chlorophyll concentration, leaf rolling and leaf drying, all of which are interconnected. Total chlorophyll, its individual components and the chlorophyll stability index can be measured under both normal and water-stressed conditions. Similarly, leaf rolling and leaf drying need to be scored around midday, following the prescribed procedures.

Grain Yield
The main objective of a drought tolerance breeding program is to develop a variety that produces higher yield than currently available varieties in the given environment under the types of drought stress that occur most frequently.
Further, if water stress does not occur in some years, that variety should also produce high yields in the absence of stress. Thus, from the farmer's viewpoint, a drought-tolerant variety is one that produces higher yield relative to other cultivars under drought stress and produces sustainable yield under normal conditions. Hence, all the protocols and strategies that focus on breeding for drought tolerance should be designed in this light. To increase the efficiency of direct selection for yield, it is essential to ensure that (1) the testing environment is a true representation of the target environments; (2) large numbers of germplasm entries (usually >500) are screened in order to increase the selection intensity; (3) drought stress is managed uniformly across the trials, sites and seasons with reasonable levels of replication (it has been noticed that increasing the number of locations is more effective than increasing the number of replications within a location); and (4) the best experimental design is used to address field variation.

The traits mentioned above are very far from being exhaustive. Therefore, the above-said and other traits should be used as selection criteria for yield cautiously and only after defining the target environment. Irrespective of the procedures used and the experimental designs employed, each phenotyping score might have a specific background, and hence results should be inferred accordingly in characterising the germplasm. Availability of a good record of meteorological parameters (rainfall, temperatures, wind, evapotranspiration, light intensity and relative humidity) allows meaningful interpretation of the results. Collection of meaningful phenotypic data in field experiments greatly depends on experimental design, heterogeneity of experimental conditions between and within experimental units, size of the experimental unit and number of replicates, number of sampled plants within each experimental unit and genotype × environment × management interactions. Further variations due to phenology (duration of each developmental phase) and other environmental stresses should also be considered while evaluating the germplasm. Poor attention to these factors may lead to erroneous conclusions, particularly in terms of interpreting cause-and-effect relationships between yield and drought tolerance traits.

Allele Mining

Allele mining refers to the identification of naturally occurring allelic variation at agronomically important genetic loci (otherwise called genes). This can be performed using a variety of approaches including mutant screening, QTL and AB-QTL analysis, association mapping and genome-wide survey for the signature of artificial selection (each method is described in detail in subsequent chapters). Though several methods have been described, efficient extraction and exploitation of the adaptive variation and valuable traits present in the germplasm is yet to be achieved. For example, several traditional and improved cultivars from drought-prone areas have some tolerance to reproductive-stage drought stress, but they have rarely been used in molecular breeding programs. A more extensive survey of such germplasm may lead to the identification of new germplasm entries carrying superior alleles for agronomic and economic crop traits. Such unique alleles can be integrated into molecular crop breeding programs that aim to combat pests and diseases; to promote yield, quality or nutritional properties; or to improve abiotic stress tolerance.

Thus, a successful allele mining procedure is highly dependent on the use of diverse germplasm collections, especially those rich in wild species. This is because the majority of allelic variation at the gene(s) of interest is largely assumed to occur in the wild relatives of a crop (i.e. not in the cultivated crop varieties) due to the unavoidable loss of variation during the domestication process. Several efforts have been made to identify useful new alleles that are present in the wild gene pool in almost all crop plants. Despite those efforts, unfortunately, entire germplasm collections have not yet been efficiently characterised for their novel phenotypes due to several challenges, including lack of resources for evaluating huge collections. Alternatively, core collection of germplasm has been proposed
as materials for allele mining. A representative subset of the complete collection of germplasm that has been optimised to contain maximal diversity in a minimal number of accessions is referred to as a core collection. Thus, while maintaining maximum allelic diversity at loci controlling traits of interest, core collections help in the integration of novel useful alleles into molecular or conventional breeding programs by reducing the number of accessions. This will lead to the development of broad and diversified elite breeding lines with superior yield and enhanced adaptation to diverse environments.

The best core collections can be constituted by assembling a wide range of evidence on diversity and subsequently sampling those accessions that are representative of this diversity. One such simple generic factor is geographic origin. Conventional accessions from different parts of the world usually have had an independent history of domestication for thousands of years and are therefore likely to show differences across the genome. Construction of such a core collection can uncover at least the majority of new alleles in a relatively small number of accessions. On the other hand, one key point to be remembered is that even a carefully constructed core collection will not allow discovery of the complete list of alleles in all possible combinations. Hence, it is essential to screen the whole germplasm. When cheaper and faster technologies for allele mining are developed, this effort would not be a titanic task.

To this end, large-scale genome sequencing projects and functional genomic efforts on several major food crops provide a directory of all the genes in the given crop with their functions. Though this information has been generated using a reference crop cultivar or accession, it can also be extended to other varieties/species for allele mining. This is possible because of genome synteny and gene sequence conservation among species. Several approaches have been designed to isolate novel alleles from related species and genera using this sequence information, and this would result in direct access to key alleles conferring resistance to biotic stresses, tolerance to abiotic stresses, greater nutrient use efficiency, enhanced yield and improved quality and nutrition. One such technique, which employs a simple polymerase chain reaction (PCR; refer to box 3.1 in chapter 3) strategy to isolate useful alleles from rice germplasm, is given in Box 1.1 as an example. It is also worth mentioning here the role of EcoTILLING in allele mining. A variant of targeting induced local lesions in genomes (TILLING), known as EcoTILLING, was developed to identify multiple types of polymorphisms in germplasm collections or breeding materials (Comai et al. 2004). EcoTILLING allows characterisation of natural alleles at a specific locus across several germplasm entries in a rapid and affordable way (see chapter x for more details).
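The notion of a core collection described above (maximal diversity in a minimal number of accessions) can be sketched as a simple greedy selection in Python. This is only an illustrative heuristic under stated assumptions, not the procedure used by any gene bank: each accession is assumed to be already genotyped and is represented as a set of allele labels, and the accession names, allele labels and the function greedy_core_collection are all hypothetical.

```python
def greedy_core_collection(accession_alleles, max_size):
    """Greedily pick accessions, each time taking the one that adds the most
    alleles not yet covered. Ties are broken by accession name."""
    selected = []
    covered = set()
    remaining = dict(accession_alleles)
    while remaining and len(selected) < max_size:
        # max() keeps the first best item, so sorting gives a stable tie-break.
        best = max(sorted(remaining), key=lambda acc: len(remaining[acc] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining accession contributes a new allele
        selected.append(best)
        covered |= gain
        del remaining[best]
    return selected, covered


# Hypothetical allele calls ("locus:allele") for five accessions.
accessions = {
    "ACC1": {"locusA:1", "locusB:1"},
    "ACC2": {"locusA:1", "locusB:2"},
    "ACC3": {"locusA:2", "locusB:1", "locusC:1"},
    "ACC4": {"locusA:1", "locusB:1"},
    "ACC5": {"locusA:3", "locusB:2", "locusC:1"},
}

core, alleles_captured = greedy_core_collection(accessions, max_size=3)
```

With these made-up data, three of the five accessions are enough to capture every allele present, which is the point of a core collection. As noted in the text, a real core collection still cannot represent all alleles in all possible combinations, and real data would usually combine marker evidence with passport information such as geographic origin.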

Box 1.1 Rapid and Inexpensive Strategy for Allele Mining in Rice
There are >100,000 germplasm accessions/entries deposited at the International Rice Gene Bank, IRRI, the Philippines. Each genotype has an estimated ~50,000 genes. Every gene has an unknown number of alleles, and each allele may change the way rice adapts, grows, looks or tastes. Hence, understanding the function of each allele is of the utmost importance for future rice breeding. The publicly available rice genome sequence database and the physical map location of each rice gene (refer, for example, to the International Rice Genome Sequencing Project (IRGSP) home page at http://rgp.dna.affrc.go.jp/IRGSP/download.html or Gramene at http://www.gramene.org/resources/) form the base for allele mining. The first step in allele mining is deciding which part of the genome we should explore. In other words, allele mining can be conducted on specific genes that are involved in the particular
mechanism of phenotypic trait expression. Usually, allelic differences (also called allelic
polymorphism) result from differences in intron and exon sequences or in the regulatory
regions of the given gene. For example, the genes involved in abiotic stress tolerance (such
as genes coding for heat-shock proteins, transcription factors and late embryogenesis
abundant proteins) can be fished out from the genome sequence, and primers that specifically
flank the conserved genic regions can be designed. Primer3 is the most frequently used
freely available online software (http://frodo.wi.mit.edu/) for primer design. We need to
paste the target sequence in FASTA format in the box provided, and by clicking the PICK
PRIMER button, we obtain appropriate primers that flank the target sequence. Since the
selected genes are members of a multi-gene family, the members may have conserved genic
sequences. In general, members of a multi-gene family are dispersed around the genome or
may have remained as tandem repeats within a single genetic locus. Thus, these primers can
be used in PCR-based allele mining, which provides an opportunity to test the evolutionary
range over cultivated rice and its relatives. To increase the efficiency of identifying
polymorphic alleles, it is better to design primers in the 5′ or 3′ untranslated regions of
the selected genes, since these DNA sequences have been shown to vary more within a
multi-gene family than coding sequences do. Thus, it is important to strike a balance
between targeting the conserved genic sequence and maintaining the genetic variation. Once
the candidate gene(s) have been explored, new alleles of the selected candidate gene(s)
should be discovered in the germplasm collection. The search should not start with the
first accession and work through the collection. Such an effort would be inefficient, since
the second accession might be similar to the first accession at the given loci. Hence,
analysing the second accession would not yield any additional information. Instead, we need
to employ a subset of highly distinctive accessions, namely, core collections (see the text
for more information on core collection).
The PCR product amplified using the primers designed on the above principle represents
either the entire allele or a functional component of the allele (depending on the primer
design strategy employed). If it is a component of the gene, the full-length gene should be
amplified with the same strategy explained above. The identified polymorphic allele needs to
be sequenced, and at the end of this experiment, we can identify, isolate and characterise
the novel alleles of genes that are candidates for the target trait (in this case, abiotic
stress tolerance). Since we have data on field-based phenotyping of the given rice
germplasm, we can group those accessions that have similar alleles and tolerance levels.
The strategy that associates alleles or genomic regions with the given phenotype using
linkage disequilibrium or association mapping is described separately in detail (see
chapter 6). Briefly, association mapping assumes that an allele responsible for the
expression of a phenotype, along with the markers that flank the allelic locus, will be
inherited as a block. Hence, use of such flanking markers, or of the allelic sequence itself
as a marker, will predict the performance of a progeny that expresses the favourable
phenotype. We can also proceed further in characterising the key biochemical and
physiological mechanisms of tolerance using functional genomics tools. Thus, upon complete
characterisation of these alleles, a molecular backcross breeding strategy can be employed
to transfer the useful allele into an elite variety. Developing such new combinations of
useful alleles from different genes in different accessions will lead to the breeding of a
novel variety that meets farmers' and consumers' needs. However, this technique has
some drawbacks: (1) lack of specificity during

primer annealing may lead to amplification of non-specific PCR products, (2) PCR will
usually not be successful for distantly related genera due to poor conservation of primer
sequences and (3) when the length of the gene sequence is beyond the limit of PCR, it would
be difficult to proceed further with complete allelic characterisation using this strategy;
alternatively, PCR walking would be useful in mining such alleles.
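
The grouping step mentioned in Box 1.1 (accessions with similar alleles and tolerance
levels) can be illustrated with a minimal Python sketch. The accession names, allele
sequences and tolerance scores below are invented purely for illustration, and the helper
names group_by_allele and summarise are hypothetical; real data would come from sequencing
and field phenotyping.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: accession -> (allele sequence at a candidate gene, tolerance score)
accessions = {
    "ACC01": ("ATGGCTA", 7.5),
    "ACC02": ("ATGGCTA", 8.0),
    "ACC03": ("ATGACTA", 3.0),
    "ACC04": ("ATGACTA", 2.5),
    "ACC05": ("ATGGCTT", 6.0),
}

def group_by_allele(data):
    """Collect (accession, score) pairs under each distinct allele sequence."""
    groups = defaultdict(list)
    for name, (allele, score) in data.items():
        groups[allele].append((name, score))
    return dict(groups)

def summarise(groups):
    """For each allele, return (sorted accession names, mean tolerance score)."""
    return {
        allele: (sorted(name for name, _ in members),
                 mean(score for _, score in members))
        for allele, members in groups.items()
    }

summary = summarise(group_by_allele(accessions))
for allele, (names, avg_score) in sorted(summary.items()):
    print(allele, names, avg_score)
```

Alleles whose carriers share a consistently high mean score would be the first candidates
for the association analysis described in chapter 6; the sketch itself makes no statistical
claim.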

Genetic Diversity and Clustering

The study of genetic diversity existing in the germplasm (i.e. investigation of genetic
variation among individuals or groups of individuals) is usually a collective process.
Several methods and strategies are available to study the germplasm in terms of genetic
diversity, which is essential to reveal the genetic relationships among the germplasm
entries. Precise estimation of genetic relationships depends on sampling strategies, use
of several data sets, choice of genetic distance estimates, clustering procedures or other
multivariate methods, etc. Thus, careful combination of these features and use of
appropriate statistical programs and strategies are the key to such data analysis (refer to
Mohammadi and Prasanna 2003 for further details). In general, germplasm data comprise
numerical measurements and combinations of different types of variables. Pedigree data,
passport data, morphological data, biochemical data, storage protein data and, more
recently, DNA-based marker data are being used to reliably estimate genetic relationships
in crop plants (for details on markers and their application, see chapter 3). The selection
of data sets is decided by the objective of the experiment, the level of resolution
required, the availability of resources and infrastructure facilities, and the impact of
operational, cost and time constraints. Each data type provides a specific type of
information. For example, when we use molecular data, genetic distance or similarity or
relationship among individuals of the given germplasm is usually calculated as a
quantitative measure that differentiates the two individuals at the sequence or allelic
frequency level. A wide range of genetic distance measurement methods is available, and the
method used is largely determined by the software tool we employ for the analysis. Among the
genetic distance measurement methods, modified Rogers genetic distance (GD_MR) is the most
frequently used measure. There are several constraints in employing the data for the
analysis of genetic distance. One frequently occurring problem concerns molecular marker
data. When certain genotypes do not show any amplification for some marker alleles, it is
often difficult to decide whether such lack of amplification is due to null alleles or to
failure of the molecular experiment. In such cases (i.e. when we are not sure about the null
status of a genotype at this specific marker locus), it should be treated as missing data
during genetic distance measurements; otherwise it will lead to erroneous inference. It
should also be noted that the use of dominant and co-dominant types of marker can influence
the genetic distance measurements due to unknown statistical distributions. To overcome
this limitation, several alternatives, including the bootstrapping method, have been
proposed in certain statistical software. When a scientist wishes to use more than one
genetic distance measure to analyse the data set, it is essential to understand the
correspondence between the matrices derived from those measures. To reliably test this
correspondence, the well-known Mantel test can be used, and it has been widely followed in
crop plants. Resampling techniques such as bootstrapping and jackknife are also used
predominantly in recent publications, particularly in relation to the application of marker
data in genetic diversity analysis. Especially, to find the smallest set of
markers that can provide an accurate assessment

of genetic relationships among the germplasm entries, resampling techniques have provided
useful measures. The latest versions of statistical programs used in genetic diversity
analysis (see below) have these features. Interpreting the resampling techniques is also
simple. For example, a simple rule of thumb is that internal tree branches that have >70%
bootstrap support are likely to be correct at the 95% probability level.
When the sample size of the germplasm increases, it is important to classify and order
genetic variability among germplasm by using established multivariate statistical
algorithms such as cluster analysis, principal component analysis, principal coordinate
analysis and multidimensional scaling. Multivariate analytical techniques simultaneously
analyse multiple measurements on each individual of the germplasm and analyse the genetic
diversity irrespective of the data set (i.e. morphological, biochemical or molecular data
can be used). This book focuses only on the clustering method (especially on salient
statistical methodologies and other considerations with respect to this method), which is
described in Box 1.2.

Software

Numerous software programs are available for assessing genetic diversity, such as Arlequin,
DnaSP, PowerMarker, MEGA2, PAUP, TFPGA, GDA, GENEPOP, NTSYSpc, Structure, Gene Strut,
POPGENE, MacClade, PHYLIP, SITES, CLUSTALW and MALIGN. Most of them are freely available on
the World Wide Web. Most of the programs perform similar tasks, with the main differences
being in the user interface, type of data input and output, and platform. Thus, the choice
of program depends largely on individual preference.

Principle Behind the Genetic Diversity Analysis

When a rectangular data matrix X (n × p) is prepared (where the n rows correspond to n
different genetic objects and the p columns correspond to p different types of phenotypic
and/or binary molecular data), the term genetic diversity among the n genetic objects
refers to grouping of the n objects into an appropriate number of classes (usually less
than n), such that the objects within classes are relatively homogeneous with respect to the
p variables. The statistical techniques of classification and ordination are used for
grouping the n entities based on the p types of phenotypic and/or binary molecular data.
Application of these techniques requires an a priori selection of an appropriate
quantitative measure of proximity (similarity/dissimilarity/distance) among the given
entities. Following the selection of an appropriate proximity measure, the data matrix
X (n × p) is converted to a square proximity matrix M (n × n) of n rows and n columns
corresponding to the n genetic entities. Implementation of an appropriate sequential
agglomerative hierarchical nonoverlapping (SAHN) classification technique and an
appropriate ordination technique on the proximity matrix M (n × n) yields a dendrogram and
a two- or three-dimensional ordination plot, respectively. The dendrogram and the
ordination plot, which are the graphical end products of classification and ordination,
elucidate the underlying structure of genetic diversity among the n genetic objects. In
general, SAHN clustering takes a dissimilarity matrix D (n × n) = {d_ij} as input data.
Initially, the two closest objects are joined based on their d_ij values, giving (n − 1)
clusters, one of which contains two objects while the others have a single member. In each
succeeding step, the two closest clusters are merged. But to do so, we need an appropriate
definition of dissimilarity between clusters based on the dissimilarity between their
constituent objects. This is the point at which different SAHN methods differ. There are
several SAHN methods, including the unweighted pair group method using arithmetic averages
(UPGMA; a compromise between single and complete linkage that is preferred for its robust
nature), the single linkage method, the complete linkage method, Ward's method (useful for
continuous variables such as plant height and yield) and weighted average linkage (WPGMA).
Other SAHN methods that are rarely used in practice are centroid (UPGMC), median
(WPGMC), and flexible. SAHN classification

results are represented by a 2-D diagram known as a dendrogram. The dendrogram depicts the
fusion of objects/clusters at each step of the analysis along with a numerical measure of
(dis)similarity. Hierarchical clustering methods are either agglomerative or divisive.
Agglomerative methods proceed by a series of successive fusions of n objects into groups.
Divisive methods proceed by separating n objects into successively finer groups. Groupings
or divisions produced by a hierarchical method are final; thus, defects in clusters, once
introduced, cannot be repaired. Agglomerative methods are more widely used than divisive
methods. Single linkage, complete linkage, centroid, Ward's and group average are the most
widely used agglomerative clustering methods. The group average method, also called the
average linkage or UPGMA method, has been widely used for germplasm analysis in plant
breeding. The interaction between clustering method and data structure can be significant.
The aim of cluster analysis is to find an optimum tree (or phenogram or dendrogram) or set
of clusters. Hierarchical algorithmic clustering methods are used to represent distance
matrices as ultrametric trees. If the distances are ultrametric, then the fit of the data
to an ultrametric tree is exact. If the distances are not ultrametric, then the fit of the
data to an ultrametric tree is not exact. The reliability of the estimated diversity
elucidated by a dendrogram and/or an ordination plot depends on many factors. However, the
most critical factor is the accuracy with which the phenotypic and molecular scores in the
data matrix X (n × p) are recorded and estimated.

Principle of Measuring Goodness of Fit of a Classification

When genetic diversity analysis is done with more than one statistical software (see
above), comparison of dendrograms, with each other or with their proximity matrices, is
required for validation of clustering results. For example, we may like to test whether
different subsets of the p variables or different clustering methods applied to the same
data provide similar results. Statistical measures to address such questions include
cophenetic correlation and Mantel's permutation test. These are implemented in the
statistical programs themselves (e.g. in NTSYSpc). There are other measures such as the
kappa coefficient, Rand index, adjusted Rand index and BC coefficient, but they are rarely
employed. A cophenetic matrix of cophenetic values is generated from the dendrogram to
compute the cophenetic correlation. Values of cophenetic correlation above 0.80 indicate a
good agreement (see Box 1.2). The Mantel test provides a measure of statistical
significance for the observed cophenetic correlation. When the same n objects are
separately clustered using phenotypic and molecular data, results can be synthesised into a
single consensus dendrogram using strict consensus or majority consensus rules (refer to
the NTSYSpc manual for performing such analysis). The strict consensus rule delivers a
consensus dendrogram in which every subset is present in each individual constituent
dendrogram. In a majority consensus dendrogram, every subset is present in a majority of the
individual constituent dendrograms. Before attempting to obtain a consensus dendrogram, it
may be useful to first compute cophenetic correlations to get an idea of the extent to
which the constituent dendrograms represent similar results. Bootstrap can be used to
assess the reliability of results produced by a dendrogram. WinBoot performs bootstrap on
binary data to determine confidence limits of a UPGMA-based dendrogram.

Genetic Diversity Analysis Using Molecular Markers

Success of any crop breeding program is based on (1) the knowledge of and (2) the
availability of genetic variability for efficient selection. Genetic similarity (or genetic
distance) estimates among genotypes are helpful in at least two ways: (1) selecting
parental combinations for creating segregating populations so as to maintain genetic
diversity in a breeding program and (2) the classification of germplasm into heterotic
groups for hybrid crop breeding. Establishment of heterotic groups can be based on
geographical origin, agronomical traits, pedigree data or molecular marker data. Before
the use of molecular markers, genetic diversity was estimated from pedigree or

agronomic and morphological characteristics. However, the estimates based on pedigree
information are generally overestimated and often found unrealistic. Similarly, the
morphologically based genetic diversity estimates suffer from the drawback that
morphological characteristics are limited in number and are influenced by the environment.
Therefore, neither pedigree-based nor morphologically based estimates may reflect the
actual genetic difference of the studied populations. On the other hand, molecular markers
are not influenced by the environment, are likely to reflect true genetic similarity (or
dissimilarity) and do not require previous pedigree information, which is valuable for
crops where pedigree information is lacking. Various types of molecular markers are
available for genome analysis. Simple sequence repeats (SSRs) in particular have been
reported to be very useful for analysing the structure of germplasm collections as these
are abundant, co-dominant, multi-allelic, highly polymorphic and chromosome specific. SSR
markers have been extensively used in genetic diversity studies in many plants including
wheat, pearl millet, sorghum, triticale, cotton, rice and maize. There are also other types
of DNA- and RNA-based markers that have shown their potential utility in genetic diversity
analysis (see chapter 3 for a more detailed description of markers). However, molecular
markers should be used with caution in genetic diversity analysis because of the following
issues.
1. There are two approaches that are commonly used in studies of genetic diversity within
and among populations or groups of individuals using molecular markers. In the first,
allele frequencies over a number of polymorphic loci are determined, and parameters based
on the allele frequencies are used for partitioning genetic variation into components for
variation within and between units. This approach may be chosen when dominant markers (such
as RAPDs, AFLPs and ISSRs) are applied to haploid individuals or co-dominant markers (such
as allozymes, RFLPs and SSRs) are used with haploid or diploid species with the assumption
of no linkage between loci. With dominant markers, individuals that are heterozygous for a
DNA band at a specific position cannot be distinguished with certainty from individuals
that are homozygous for that band (see chapter 3).
In the second approach, a genetic dissimilarity matrix is constructed using molecular data
from all possible pairwise combinations of individuals and is used for characterising
population structure based on the relative affinities of each tested individual. This
approach requires proper methods for assessing dissimilarity between individuals, and it is
particularly useful in the case of possible linkages between different loci. The choice of
a suitable index of similarity is a very important and decisive point for determining true
genetic dissimilarity between individuals, clustering and analysing diversity within
populations and studying relationships between populations. This is because different
dissimilarity indices may yield contrary outcomes. Many researchers have preferred, for
various well-documented reasons, to use the second approach either alone or in combination
with the first approach. However, the bases for choosing the most appropriate coefficient
of dissimilarity depending on the type of marker and the ploidy of the organism in question
have not received sufficient attention in published research articles.
2. Molecular markers are commonly used to characterise genetic diversity within or between
populations or groups of individuals because they typically detect high levels of
polymorphism. Furthermore, RAPDs and AFLPs are efficient in allowing multiple loci to be
analysed for each individual in a single gel run. In analysing banding patterns of
molecular markers, the data typically are coded as (0,1)-vectors, 1 indicating the presence
and 0 indicating the absence of a band at a specific position in the gel. With diploid
organisms and co-dominant markers, the banding patterns may be translated to homozygous or
heterozygous genotypes at each locus, and the allelic structure derived is utilised for
comparison between individuals. Several measures including the Dice (Nei and Li), Jaccard
and simple match (or the squared Euclidean distance) coefficients are commonly employed in
the analyses of similarity of individuals (binary patterns) in the absence of
knowledge of ancestry of all individuals in the

populations. These similarity coefficients are defined differently and therefore they may
yield different results for both the qualitative and quantitative relationships between
individuals. Although these coefficients may not yield identical results, most published
studies do not offer any rationale to support the choice of the coefficient that was used
in relation to the type of marker evaluated or the ploidy and mating system of the organism
being studied. Each of these factors may influence how accurately the direct application of
a given similarity coefficient to the (1,0)-vectors will reflect the true genetic
similarity of any pair of individuals. In most published studies, the similarity
coefficient used was apparently chosen simply because it was used in an earlier publication
or it is available in the software package used to analyse the data. In some cases, two or
three similarity coefficients are used with the same data set with the expectation that if
the results are robust, the different coefficients should reveal essentially the same
patterns of diversity. If two similarity coefficients reveal somewhat different patterns of
relationships between individuals, there is hardly any rationale presented to suggest which
pattern is more valid, and often only one of the patterns is presented in the publication.
As a general rule, we should expect an appropriate similarity coefficient to produce a
consistent measure of the proportion of differentiating factors showing similarity between
any pair of individuals relative to the total number of factors in which differences could
have been detected if the individuals showed no detectable similarity. That is, the
similarity coefficient employed should accurately reflect our best understanding of the
phenotypes observed and the genetic basis for them.
3. With co-dominant markers, each recognisable allele at a given locus is ordinarily
associated with a single band at a unique position in the gel. Thus, in the case of diploid
organisms for a given locus, a homozygote will have one band and a heterozygote will have
two. Null alleles (no band) are rarely seen. Therefore, the shared absence of a band at a
specific position should not be considered in measures of similarity with co-dominant
markers. Clearly with co-dominant markers, the genetic similarities between pairs of
individuals cannot be characterised simply in terms of the proportion of bands that are
shared between two individuals. Also, if there are multiple alleles per locus, as expected
for SSRs, which are highly variable, the total number of bands expressed by all the
individuals in a sample will likely be much greater than the number of loci involved.
Therefore, the banding profiles should be adjusted to represent the allelic patterns of
individuals across all loci studied and to represent the total number of loci and the
number of shared alleles rather than the total number of bands and the number of shared
bands, respectively, and the adjusted values should be employed for measuring similarity
between individuals.
4. For dominant markers, it is generally assumed that each band represents a different
locus and that the alternative to a band at the gel position characteristic of that locus
is the absence of a band anywhere in the gel. Thus, for dominant markers, there is a direct
identity assumed between the number of unique bands observed and the number of identifiable
loci for the sample of individuals. On the other hand, the interpretation of shared
absences of specific bands by two individuals may depend on the degree of genetic
similarity among individuals within the sample. That is, the interpretation may be
different when the individuals are drawn from different taxa in a phylogenetic tree than
when the individuals are all from closely related populations of a single species.
5. The key problem with analysis of genetic relationships between individuals with
molecular markers is measuring their dissimilarity. There are no acceptable universal
approaches for assessing genetic dissimilarity between individuals based on molecular
markers. Different dissimilarity measures are relevant to, and should be used with,
multi-locus dominant and co-dominant DNA markers as well as with diploid (polyploid) and
haploid individuals. The Dice dissimilarity index is suitable for haploids with co-dominant
molecular markers, and it can be applied directly to (0,1)-vectors
representing multi-locus multi-allelic

banding profiles of individuals. None of the Dice, Jaccard and simple mismatch coefficients
is appropriate for diploids (polyploids) with co-dominant markers, because there is no way
for direct processing of fingerprint profiles. By transforming multi-allelic banding
patterns at each locus into the corresponding homozygous or heterozygous states, a new
measure of dissimilarity within loci needs to be used and may be expanded for measuring
dissimilarity between multi-locus states of two individuals by averaging across all
co-dominant loci tested. The simple mismatch coefficient can be considered as the most
suitable measure of dissimilarity between banding patterns of closely related haploid
forms, whereas for distantly related haploid individuals, the Jaccard dissimilarity is
recommended. In general, no suitable method for measuring genetic dissimilarity between
diploids with dominant markers can be proposed. Therefore, analyses of genetic
dissimilarity between diploid (polyploid) organisms with dominant markers should be viewed
with caution unless the organism is highly inbred and therefore highly homozygous.
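
The point that the Dice, Jaccard and simple matching coefficients are defined differently,
and can therefore rank the same pair of individuals differently, can be seen in a minimal
Python sketch. The formulas are the standard ones for (0,1)-vectors; the band profiles are
invented for illustration, and coding an uncertain score as None (treated as missing data
and skipped) is an assumption of this sketch rather than a convention of any particular
package.

```python
def band_counts(x, y):
    """Count a (1,1), b (1,0), c (0,1) and d (0,0) positions, skipping missing scores."""
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:   # missing data: excluded from the comparison
            continue
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

def dice(x, y):
    """Dice (Nei and Li) similarity: 2a / (2a + b + c); shared absences ignored."""
    a, b, c, _ = band_counts(x, y)
    return 2 * a / (2 * a + b + c)

def jaccard(x, y):
    """Jaccard similarity: a / (a + b + c); shared absences ignored."""
    a, b, c, _ = band_counts(x, y)
    return a / (a + b + c)

def simple_matching(x, y):
    """Simple matching similarity: (a + d) / (a + b + c + d); shared absences count."""
    a, b, c, d = band_counts(x, y)
    return (a + d) / (a + b + c + d)

# Hypothetical band profiles for two individuals; None marks an uncertain score.
ind1 = [1, 1, 0, 0, 1, 0, 0, 0]
ind2 = [1, 0, 0, 0, 1, 1, 0, None]

print("Dice:", round(dice(ind1, ind2), 3))
print("Jaccard:", round(jaccard(ind1, ind2), 3))
print("Simple matching:", round(simple_matching(ind1, ind2), 3))
```

Only the simple matching coefficient uses the shared absences (d), which is why its value
depends on how many bands neither individual carries. The functions assume at least one
informative position; they would divide by zero for two profiles with no bands at all.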

Box 1.2 Cluster Analysis


Cluster analysis refers to mathematically grouping (or clustering) the individuals of the
germplasm based on their similar characteristics. Thus, individuals within a cluster show
high internal homogeneity and individuals in different clusters exhibit high external
heterogeneity. Broadly, there are two types of clustering strategies. One is the
distance-based method (in which a pairwise distance matrix is used, leading to a graphical
representation such as a tree or dendrogram) and the other is the model-based method, such
as parametric models (inferences on each cluster and their relationships are obtained by
maximum likelihood or Bayesian methods). It has been established that the latter method is
innovative and useful due to the constraints associated with the former method with respect
to multi-locus genotypic data. However, at present, the distance-based methods are most
frequently used, and the step-by-step procedure for cluster analysis using this method is
explained hereunder.
Hierarchical and nonhierarchical methods are commonly used in distance-based cluster
analysis, and hierarchical clustering methods are most commonly employed in the analysis of
genetic diversity in crop plants. These methods proceed either by a series of successive
mergers (called the agglomerative hierarchical method) or by successive divisions of a
group of individuals (see above). The most similar individuals are grouped first, and these
initial groups are merged according to their similarities. Among the various agglomerative
hierarchical methods, the unweighted paired group method using arithmetic averages (UPGMA)
is the most commonly adopted clustering algorithm, followed by Ward's minimum variance
method. Nonhierarchical clustering procedures do not involve the construction of a
dendrogram, and hence they can be done using statistical software such as SAS or SPSS.
However, this method is not usually followed in crops, primarily due to the lack of prior
information about the optimal number of clusters required for accurate assignment of
individual objects. Among the different types of clustering methods (such as UPGMA,
unweighted paired group method using centroids (UPGMC), single linkage, complete linkage
and median), UPGMA dendrograms have been used extensively in published reports since they
provide consistency in grouping germplasm objects with relationships computed from
different data types. However, despite some advantages of UPGMA, a single clustering method
might not be useful or effective in uncovering genetic
relationships, and it would be desirable to

analyse the congruence among results obtained by different clustering procedures. The
efficiency of different clustering algorithms can be estimated by calculating the
cophenetic correlation coefficient (see above). It is a product moment correlation
coefficient measuring agreement between the dissimilarity or similarity indicated by a
phenogram or dendrogram as output of the analysis and the distance or similarity matrix as
input of cluster analysis. Using this coefficient value, the degree of fit of the
dendrogram can be subjectively graded as 0.9 ≤ r, very good fit; 0.8 ≤ r < 0.9, good fit;
0.7 ≤ r < 0.8, poor fit; and r < 0.7, very poor fit. At the same time, it should be kept in
mind that a low coefficient score does not mean that the dendrogram has no use. A poor
coefficient value only indicates that some distortion might have occurred. It is also worth
noting that whatever algorithm is used for dendrogram construction, in order to assess the
reliability of the nodes, it is essential to carry out bootstrapping of the allele
frequencies followed by calculation of genetic distances.
Therefore, while studying genetic diversity in crop plants, it is vital to decide the
following points: (1) careful and effective use of different types of data variables such
as continuous, discrete, ordinal, multistate and binomial; (2) use of multiple data sets
such as morphological, biochemical and molecular data; and (3) appropriate selection of
clustering algorithms. Depending on the genetic materials being analysed and the objectives
of the experiment, different strategies need to be formulated (since there is no single
strategy that addresses all the issues in genetic diversity analysis), and hence readers
are requested to refer to the bibliography to proceed further in their crop and materials
of interest.
There are many statistical packages available for analysing genetic diversity (see above
and Labate 2000). There is still a need for developing comprehensive and easy-to-use
statistical packages that provide an integrated study of genetic diversity at various
levels. However, because of user-friendliness and availability of several features,
NTSYSpc (F. J. Rohlf, State University of New York, Stony Brook, USA) and PHYLIP
(J. Felsenstein, University of Washington, Seattle, USA) have been extensively employed in
publications. The procedure for employing NTSYSpc for genetic diversity analysis using
molecular marker data is provided below.
The computer software NTSYSpc (Numerical Taxonomy and multivariate analysis SYStem) is a
system of program modules used to discover and describe the patterns of biological
diversity that can be demonstrated in a set of multivariate data. There are modules in
NTSYSpc that perform cluster analysis. The first crucial step in genetic diversity analysis
using marker (or DNA fingerprinting) data is the measurement of similarity among germplasm
entries. When DNA profiles of two individual plants are compared, a certain number of bands
will be common (or shared or monomorphic) between the two DNA profiles (even by chance).
The number or proportion of common bands is expected to be larger if the two individuals
are biologically related. It is therefore important to objectively measure the expected
degree of similarity due to chance or relatedness. Hierarchical clustering (which is used
in the procedure below) not only provides information about the objects that belong to each
cluster but also gives us an idea about which ones are closest to each other and how
dissimilar they are from the other objects in the cluster. Subsequently, such analysis is
used for phylogenetic tree estimation, which is then visualised as a graphical dendrogram.
This entire process involves first computing a matrix of similarity coefficients for all
pairs of OTUs (operational taxonomic units) and then performing the actual cluster analysis
based on the similarity index by UPGMA. The resulting

dendrogram provides a good estimate of the phylogeny of a particular group of organisms. As an example, the use of the modules SIMQUAL (for similarity matrix construction), SAHN (for sequential, agglomerative, hierarchical and nested clustering methods) and TREE (which displays the tree from a cluster analysis as a dendrogram) to perform phylogenetic tree (dendrogram) estimation is explained hereunder. However, there are several other computational modules included in NTSYSpc. Detailed technical descriptions of the modules (including equations for the operations and the various coefficients) are provided in the help file. NTSYSpc is not limited to just the analyses mentioned in this box. The modules can be used in sequence to build many other types of analyses. For example, Gower's principal coordinates analysis can be carried out by using the SIMINT, DCENTER and EIGEN modules. CONSENSUS computes a consensus tree for two or more trees (such as multiple tied trees from SAHN or trees from two different methods), and several consensus indices are also computed to measure the degree of agreement between trees. COPH produces a cophenetic value matrix (matrix of ultrametric values) from a tree matrix produced by the SAHN program; this matrix can be used by the MXCOMP program to measure the goodness of fit of a cluster analysis to the similarity or dissimilarity matrix on which it was based.
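The two-stage process described above (a similarity matrix followed by UPGMA clustering) can be sketched outside NTSYSpc as follows. This is only a minimal illustration in plain Python, not the NTSYSpc implementation: the function names, the three band profiles and the choice of the Dice coefficient are assumptions made for the example.

```python
from itertools import combinations

def simple_matching(a, b):
    """Simple matching: (shared presences + shared absences) / number of bands."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def dice(a, b):
    """Dice coefficient: 2*n11 / (2*n11 + n10 + n01); shared absences are ignored."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    denom = 2 * n11 + mismatches
    return 2 * n11 / denom if denom else 0.0

def upgma(names, profiles, coeff):
    """Unweighted average-linkage clustering on a similarity coefficient.

    Returns the merges in order as (cluster_a, cluster_b, average_similarity).
    """
    sim = {frozenset((a, b)): coeff(profiles[a], profiles[b])
           for a, b in combinations(names, 2)}

    def avg(c1, c2):
        # Mean similarity over all pairs of members from the two clusters.
        return sum(sim[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg(*pair))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical 0/1 band scores for three individuals (illustration only).
profiles = {
    "ind1": [1, 1, 0, 1, 0, 1],
    "ind2": [1, 1, 0, 1, 1, 1],
    "ind3": [0, 0, 1, 0, 1, 0],
}
merges = upgma(list(profiles), profiles, dice)
for left, right, similarity in merges:
    print(left, right, round(similarity, 3))
```

The order of the merges and their similarity levels are the information a dendrogram displays; drawing the tree itself is left to a plotting tool.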

Preparation of Input Data File

Scoring of Data from Gel Matrix

[Figure: gel image showing the banding pattern of a co-dominant marker at locus A for Individual 1, Individual 2 and Individual 3, with a ladder]

                      Individual 1   Individual 2   Individual 3
Scoring by band
  Locus A, band A1    1              1              0
  Locus A, band A2    0              1              1
Scoring by genotype
  Locus A             1,0            1,1            0,1
  Genotypes           A1A1           A1A2           A2A2

With a co-dominant marker (see chapter 3), all three genotypic classes (the two homozygotes and the heterozygote) can be observed. In the drawing above, a gel image with the banding pattern of a co-dominant marker for a single locus of a diploid organism is given. We need to score the bands in the gel and convert them to numerical data. To do so, each band size (the bands in the same row) is scored and transformed to a 1 if it is present or to a 0 if it is absent. We can do it by band or by genotype, as in the table. This is because the analysis of genetic diversity involves the quantification of diversity and of the relationships within and between populations and/or individuals, and the display of those relationships. To do this kind of analysis, molecular data are usually handled as binary data. Molecular data can be usefully complemented with morphological or evaluation data. To do so, these types of variables can be transformed to



binary variables. For a gel image with the banding pattern of a co-dominant marker with three alleles (A1, A2 and A3) or multiple alleles in a diploid sample, each band (each row) needs to be scored independently and transformed to a score of 1 if present or a score of 0 if not. It is wise to score co-dominant markers as allele frequencies, since scoring as presence/absence may cause loss of genetic information. Alternatively, use of a large number of markers with such scoring would solve this issue.

With a dominant marker (see chapter 3), only two genotypic classes can be observed: AA + Aa and aa. That is, one of the homozygote classes is confounded with the heterozygote (as shown in the gel picture below, the banding pattern for AA or Aa will look like Individual 1). Thus, the gel image with the banding pattern of a dominant marker for a single locus will show either one band or no band for each individual. The bands are scored in a way similar to that for the co-dominant marker, where bands are converted to a score of 1 if present or 0 if not.

[Figure: gel image showing the banding pattern of a dominant marker at locus A for Individual 1 and Individual 2, with a ladder]

             Individual 1   Individual 2
Locus A      1              0
Genotypes    AA or Aa       aa
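The two scoring schemes above can be reproduced with a few lines of code. The sketch below is only an illustration: the function names and the representation of a genotype as a pair of allele labels are assumptions, not part of NTSYSpc.

```python
def score_by_band(genotypes, alleles):
    """Presence (1) / absence (0) of each allele band in each individual."""
    return {allele: [1 if allele in g else 0 for g in genotypes] for allele in alleles}

def score_dominant(genotypes, dominant_allele="A"):
    """Dominant marker: a band (1) appears whenever the dominant allele is present."""
    return [1 if dominant_allele in g else 0 for g in genotypes]

# Co-dominant example from the table: A1A1, A1A2 and A2A2.
codominant = [("A1", "A1"), ("A1", "A2"), ("A2", "A2")]
band_scores = score_by_band(codominant, ["A1", "A2"])
print(band_scores)

# Dominant example: Aa cannot be told apart from AA; aa shows no band.
dominant_scores = score_dominant([("A", "a"), ("a", "a")])
print(dominant_scores)
```

The rows of `band_scores` correspond to the "scoring by band" rows of the co-dominant table, and `dominant_scores` to the Locus A row of the dominant table.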

Creation of Data Files for NTSYSpc

NTSYSpc files are ordinary ASCII files. A file for an initial data matrix may be prepared with an editor or any word processor that can save pure ASCII text. Free format is used for all the entries in the data matrix. This means that at least one blank space is required between numbers; tab characters will not work. Alternatively, an Excel sheet (from MS Office) can also be used to prepare the data file, and this can be imported into NTSYSpc using the NTedit program.
For each of the basic file formats (rectangular, symmetric, diagonal, tree and graph), the NTedit program displays an appropriate arrangement of the cells in the spreadsheet. Though any one of the above-said file formats can be employed, use of NTedit ensures that the files are formatted correctly; however, data cannot be exported to an Excel spreadsheet. Start NTedit by clicking on the program icon and then use the drop-down File menu to load an existing data file or files. Once NTedit is started, data can be entered or corrected in any of the cells of the spreadsheet. Rows and columns can be deleted or inserted within the table by clicking on the appropriate menu choices. Addition or deletion of rows and columns should be done by entering new values in the edit boxes displaying the current number of rows and columns. The numerical code used to indicate the missing values in the data can be entered or changed. Make sure this field is blank (not zero) if there is no missing value. It is essential to check for missing data, which should not exceed 5%, since missing data can distort analyses.



Tips to Prepare Data File
1. The qualitative or quantitative data pertaining to each individual (or population) may be prepared in an Excel sheet in the following format.

   1             12     13     1      9
                 SSR1   SSR2   SSR3   SSR4
   Individual1   0      1      0      1
   Individual2   1      1      0      1

   Note:
   First column first row: type of matrix (1 for rectangular matrix; 2 for similarity matrix)
   Second column first row: number of the markers scored in this analysis
   Third column first row: number of accessions
   Fourth column first row: presence of missing value (0 if there is no missing value; 1 if there is any missing value)
   Fifth column first row: the value given for missing value (if any)
   First column second row: leave it empty
   Second column onwards, second row: marker (or quantitative trait) names in each column
   First column third row onwards: name of the accessions in the entire column (it is better to restrict the marker name and accession name to eight characters)
   Second column third row onwards: marker score for each accession for the corresponding marker.
2. Save the Excel file as *.txt (text tab delimited file) and import this file through NTedit.

Construction of Dendrogram and Genetic Diversity Analysis

1. Open the NTSYS program.
2. Go to NTedit if you have your file in Excel format.
3. Point the cursor to select File > Import > Excel using DDE.
4. This opens a new pop-up menu in which you have to browse for your Excel file to open it in the NTedit window.
5. Save this file in *.NTS format by specifying an appropriate file name.
6. Close this NTedit window and open the NTSYSpc window.
7. Select the Similarity icon, and in this window, select SIMQUAL, which calculates the similarity index from qualitative data (zero and one data; e.g. the data file prepared as above). If the data is in allele frequency format, select SIMGEN. If you have the data file in quantitative measures, then select SIMINT, which calculates the similarity index using interval data (such as plant height).
8. This leads to a new pop-up menu. In the input file pointer, double click to browse for the data file that has been saved using the NTedit program.
9. If you have saved the accessions in the rows, then select BY ROW column. If you have saved the data as per the format described in this exercise, DO NOT SELECT ROW option.
10. In the next row, you will find the coefficient parameter, for which a range of arguments is given. The default coefficient is SM, which denotes the simple matching coefficient. The coefficient proposed by Dice (DICE) is the preferred argument. Please click the help icon to get more information on the parameters/arguments and references therein.
11. Specify the output file (e.g. file number 2) by double clicking the corresponding column and using the browser.
12. Clicking Compute opens a new report listing window, which contains the information on the data input file, the output file, the parameter you have selected for the coefficient, the matrix type, etc.



13. Close this window and the similarity window and select the CLUSTERING icon.
14. In the new pop-up menu, select SAHN; input the file by double clicking on the argument column and browsing for the file that you have saved in step 11 (file number 2).
15. Specify the new output tree file (e.g. file number 3) in the argument of the next row by double clicking.
16. Select the clustering method, nature of tie and maximum number of ties. The rest can be left at the default values if you don't have any preferences.
17. A report listing window similar to the one in step 12 will appear, which contains all the calculations.
18. Close this window.
19. In the clustering window, you can now find the dendrogram symbol (a red-coloured icon) below the Compute button; select that tree icon.
20. This produces a picture of the dendrogram obtained from the input file in a new tree plot window. The dendrogram is usually plotted with distance or similarity on the horizontal axis and germplasm entries on the vertical axis. If the number of individuals shown is found to be low, use the Options menu to increase the number of clusters/individuals per page.
21. You can edit this picture using plot options or copy the metafile and paste it in a PowerPoint slide. Before editing the PowerPoint picture, ungroup the picture you have saved.
22. The file number 2, which can be opened as a text file, contains the coefficient values for each individual with respect to the other individuals, and this can be used for interpretation of results.

Interpretation of Results

When you have completed clustering with a number of procedures, the obvious next step is finding the consensus clusters. Sometimes, some of the germplasm entries show up in different clusters when different procedures are employed. It is very difficult to assign these entries to a proper cluster; it may require some additional information (such as pedigree and region of origin) to assign them to the appropriate cluster. Bootstrapping can be used to ensure that enough markers were employed to sample the genetic diversity and that the resulting dendrogram is statistically sound. A bootstrapping program (available in WinBoot) can repeat the cluster analysis many times and return a dendrogram in which the clusters are defined by the number of times the individuals within the cluster were found together in each analysis. This number can be used as a confidence limit of the clusters within the dendrograms. It is generally believed that 400 repetitions of the analysis must be done to ensure a bootstrap accuracy of 95%; similarly, 2,000 repetitions must be done to ensure an accuracy of 99%.
Often one wishes to test whether one set of relationships among a set of objects is independent of another. For example, one may wish to test whether the degree of morphological difference between samples is related to the geographical distances between the sampled populations. A simple way to do this is by the use of the Mantel test. The test assumes that the two matrices have been obtained independently. However, one cannot use it to test two or more matrices where one of them has been derived from the other.

Partitioning Variation in the Germplasm

Yet another critical step in a diversity analysis is to investigate the variation present in the germplasm, that is, not to visualise



relationships between individuals but simply to see the overall breakdown of variation in the sample. Usually, analysis of molecular variance (AMOVA) is used for this analysis, which is very similar to the ANOVA procedure.
It is also useful to measure the richness of alleles for each marker or the information that each marker imparts to the study in discriminating each individual. The usefulness of such a study is affected by the number of alleles, the frequency of alleles, etc. To this end, there are three measures that are frequently used: polymorphic information content (PIC), allelic richness and discriminatory power of the markers. Allelic richness can be calculated using the LCDMW package (http://www.cimmyt.cgiar.org/ABC/Protocols/manualABC.html). PIC is calculated from the number of alleles (or bands) that a marker has and the frequency of each of the alleles in the studied germplasm. Since a marker with fewer alleles (or bands) has less power to distinguish the entries that constitute the germplasm, markers possessing higher PIC values are usually preferred. The formula used to calculate PIC is

PIC = 1 - \sum_{i=1}^{n} P_i^{2},

where P_i is the frequency of the ith allele and n is the number of alleles of the marker. This can be calculated simply by using an Excel spreadsheet as shown below.

Data Sheet Preparation and PIC Calculation
Enter the marker allelic data as presence (1) or absence (0) of each allele for each entry of the germplasm. It is important to change the score 1 to 2 if the entry is homozygous for that allele; otherwise the score 1 should be retained if the entry is heterozygous, that is, if there is another allele present for that marker in the given entry. For example, in the case of SSRs, we can sum over all alleles for each SSR to make sure the sum is a maximum of 2 in every individual for every SSR (see the Sum row of the table below). Thus, we can ensure that the data were not mis-scored in any individual, as every individual will have two alleles for every SSR. An example of a gel matrix of an SSR profile (which produced four different alleles (a, b, c and d) in the given five individuals) and its respective data sheet is given below for easy understanding.
[Figure: gel image of the SSR1 profile for Individual 1 to Individual 5, with a ladder]

Allele   Ind1   Ind2   Ind3   Ind4   Ind5   Freq*   Freq2**
SSR1a    1      1      0      0      0      2/5     (2/5)^2 = 0.16
SSR1b    0      0      0      0      1      1/5     (1/5)^2 = 0.04
SSR1c    0      1      1      0      0      2/5     (2/5)^2 = 0.16
SSR1d    0      0      0      1      0      1/5     (1/5)^2 = 0.04
Sum      1      2      1      1      1              0.40
PIC                                                 0.60

Freq*: frequency of allele = number of individuals having this allele/total number of individuals
Freq2**: (frequency of allele)^2
PIC = 1 - sum
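The spreadsheet calculation above can also be checked with a short script. This is a minimal sketch that follows the definitions given in the table footnotes (frequency = number of individuals carrying the allele divided by the total number of individuals); the function names are assumptions made for the example.

```python
def allele_frequencies(band_rows):
    """Frequency of each allele = individuals carrying it / total individuals."""
    n_individuals = len(next(iter(band_rows.values())))
    return {allele: sum(1 for score in scores if score > 0) / n_individuals
            for allele, scores in band_rows.items()}

def pic(band_rows):
    """PIC = 1 - sum of squared allele frequencies."""
    return 1 - sum(f ** 2 for f in allele_frequencies(band_rows).values())

def alleles_per_individual(band_rows):
    """Column sums; each value should not exceed 2 for a diploid entry."""
    return [sum(column) for column in zip(*band_rows.values())]

# Data sheet for SSR1 as given in the table (Ind1 to Ind5).
ssr1 = {
    "SSR1a": [1, 1, 0, 0, 0],
    "SSR1b": [0, 0, 0, 0, 1],
    "SSR1c": [0, 1, 1, 0, 0],
    "SSR1d": [0, 0, 0, 1, 0],
}
sums = alleles_per_individual(ssr1)
print(sums)                 # per-individual allele counts (the Sum row)
print(round(pic(ssr1), 2))  # PIC value for SSR1
```

The column sums serve as the mis-scoring check described in the text, and the PIC value agrees with the 0.60 shown in the table.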

Parental Selection

A successful crop breeding program depends on careful selection of parents that complement each other for the given trait and yield. Thus, choosing parents is one of the most important steps in a breeding program. Although breeders have different approaches for parental selection, all the strategies share a common feature: selected parents should be as diverse as possible at the phenotypic and genotypic level. At least one locally adapted, popular cultivar is used as one parent to ensure the recovery of a high proportion of progenies with adaptation and quality that are acceptable to farmers and end users. Each parent should complement the weakness of the other parent. For instance, when we select parents for drought tolerance breeding, it is better to avoid parents that are highly drought susceptible but genetically diverse. In such cases, use of improved modern varieties as one of the parents may offer many disease-, insect- and abiotic stress-tolerant genes. Thus, thorough phenotyping and genetic diversity analysis will help to identify the most appropriate parental lines for biparental or multiparental crosses to produce new segregating populations (discussed in chapter 2) suitable for high-resolution genetic map construction and efficient quantitative trait loci (QTL) discovery.

Bibliography

Literature Cited

Comai L, Young K, Till BJ et al (2004) Efficient discovery of DNA polymorphisms in natural populations by Ecotilling. Plant J 37:778-786
Labate JA (2000) Software for population genetic analysis of molecular marker data. Crop Sci 40:1521-1528
Mohammadi SA, Prasanna BM (2003) Analysis of genetic diversity in crop plants - salient statistical tools and considerations. Crop Sci 43:1235-1248

Further Readings

Alpert P (2006) Constraints of tolerance: why are desiccation-tolerant organisms so small or rare? J Exp Biol 209:1575-1584
Araus JL, Slafer GA, Royo C, Serret MD (2008) Breeding for yield potential and stress adaptation in cereals. Crit Rev Plant Sci 27:377-412
Baker FWG (ed) (1989) Drought resistance in cereals. CAB Publishing, Wallingford, 222 pp
Bhullar NK, Zhang Z, Wicker T, Keller B (2010) Wheat gene bank accessions as a source of new alleles of the powdery mildew resistance gene Pm3: a large scale allele mining project. BMC Plant Biol 10:88
Blum A (2011) Plant breeding for water-limited environments. Springer, New York
Boyer JS, Westgate ME (2004) Grain yields with limited water. J Exp Bot 55:2385-2394
Ceccarelli S, Grando S (1996) Drought as a challenge for the plant breeder. Plant Growth Reg 20:149-155
Chaves MM, Oliveira MM (2004) Mechanisms underlying plant resilience to water deficits: prospects for water-saving agriculture. J Exp Bot 55:2365-2384
Farooq M, Wahid A, Kobayashi N, Fujita D, Basra SMA (2009) Plant drought stress: effects, mechanisms and management. Agric Sustain Dev 29:185-212
Fischer KS, Lafitte R, Fukai S, Atlin G, Hardy B (2003) Breeding rice for drought prone environments. The International Rice Research Institute, Los Baños, 98 pp
Fukai S, Cooper M (1995) Development of drought-resistant cultivars using physiomorphological traits in rice. Field Crop Res 40:67-86
Kamoshita A, Babu RC, Boopathi NM, Fukai S (2008) Phenotypic and genotypic analysis of drought-resistance traits for development of rice cultivars adapted to rainfed environments. Field Crop Res 109:1-23
Kumar A, Bernier J, Verulkar S, Lafitte HR, Atlin GN (2008) Breeding for drought tolerance: direct selection for yield, response to selection and use of drought-tolerant donors in upland and lowland-adapted populations. Field Crop Res 107:221-231
Lafitte HR, Li ZK, Vijayakumar CHM, Gao YM, Shi Y, Xu JL, Fu BY, Ali AJ, Domingo J, Maghirang R, Mackill DJ (2006) Breeding for resistance to abiotic stresses in rice: the value of quantitative trait loci. In: Lamkey KR, Lee M (eds) Plant breeding: the Arnel R. Hallauer international symposium. Blackwell, Ames, pp 201-212
Monneveux P, Ribaut JM (eds) (2011) Drought phenotyping in crops: from theory to practice. Available at Generation Challenge Program website www.generationcp.org
Morison JIL, Baker NR, Mullineaux PM, Davies WJ (2008) Improving water use in crop production. Philos Trans R Soc B Biol Sci 363:639-658
Nguyen HT, Babu RC, Blum A (1997) Breeding for drought resistance in rice: physiology and molecular genetics considerations. Crop Sci 37:1426-1434
Passioura JB (2007) The drought environment: physical, biological and agricultural perspectives. J Exp Bot 58:113-117
Reynolds M, Tuberosa R (2008) Translational research impacting on crop productivity in drought-prone environments. Curr Opin Plant Biol 11:171-179
Ribaut JM (ed) (2006) Drought adaptation in cereals. The Haworth Press Inc, Binghamton, 642 pp
Richards RA (2008) Genetic opportunities to improve cereal root systems for dryland agriculture. Plant Prod Sci 11:12-16
Torres R, Mackill D (2006) Improvement of rice drought tolerance through backcross breeding: evaluation of donors and selection in drought nurseries. Field Crop Res 97:77-86
Tuberosa R, Salvi S (2007) Dissecting QTLs for tolerance to drought and salinity. In: Jenks MA, Hasegawa PM, Jain M (eds) Advances in molecular breeding toward drought and salt tolerant crops. Springer, Dordrecht, pp 381-411
2 Mapping Population Development

Mapping Population and Its Importance in Genetic Mapping

The principle of genetic mapping is mainly based on sampling recombination frequency for the given genes (or markers) that are available in the mapping population. A mapping population consists of individual progenies that originate from two parents of one species or related species. Hence, the first step in linkage or genetic map construction is development of a mapping population. Mapping populations are considered key genetic tools/resources in linkage map construction since they are used to identify genetic loci that influence the expression of phenotypes and to determine the recombination distance between loci.
In diverse crops of the same species, the genes (or markers), represented by alternative allelic forms, are arranged in a fixed linear order on the chromosomes. Linkage values among these gene or marker loci are estimated based on recombination events between alleles of different loci, and such linkage relationships along all the chromosomes offer a genetic map of the crop (see chapter 4 for more details). However, genetic maps are not sufficient to explain the complexity of genome organisation, since they are based on recombination events, the frequency of which varies greatly along the chromosomes. At the same time, knowledge of the genetic map and cytogenetic map forms the foundation for physical map construction. An integrated map thus provides a detailed view of genome structure and offers efficient ways for positional cloning of genes or genome sequencing. Hence, mapping populations are the basic tools for understanding the effect of selected genetic factors and the organisation of the genome of a species as a whole. They are the backbone of genomics research that aims to decipher large, complex genomes at the nucleotide sequence level.
Generally, in conventional genetic mapping and QTL analysis, the mapping population is developed from parents that are highly homozygous (inbreds are usually homozygous in nature). The key phase in the development of the mapping population is the selection of two genetically divergent parents (see chapter 1), which should show clear phenotypic differences for the trait of interest. It is also desirable to choose parents that are as diverse as possible for a number of economically and agronomically important traits, so that the same mapping population can be used to identify QTLs for several traits. In addition, it is essential to have significant trait heritability. Both monogenic (governed by a single gene) and polygenic (governed by several genes) traits can be mapped when the two parents are extremely different for these traits. It is expected that the more the parental lines differ, the more genetic factors will be described for the trait in the segregating population and the easier their identification will be. Due to intensive breeding and pedigree selection, genetic variability within the gene pools of the relevant crops is at risk, and hence the contribution of wild species is of high value at this point. At the same

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice 23
and Benefits, DOI 10.1007/978-81-322-0958-4_2, Springer India 2013

time, the parents should not be too genetically distant, because this helps to reduce sterility of the progenies and segregation distortion during linkage analysis.
Several types of mapping population such as F2 progenies, F2 immortal populations, backcross (BC) progenies, recombinant inbred lines (RILs), double haploids (DHs), near isogenic lines (NILs) and nested association mapping (NAM) populations have been utilised in this regard. It should be noted that each population type possesses its own rewards and restrictions, and hence selection of the population type is critical for successful genetic mapping. Both F2 and BC populations are the simplest and easiest to construct, but they are highly heterozygous and cannot be propagated indefinitely through seeds. They can temporarily be used to construct a preliminary linkage map. Alternatively, RILs, NILs and DHs are permanent populations since they are homozygous or true breeding lines that can be multiplied and reproduced without any genetic change. Thus, these populations represent eternal resources for mapping, and seed from individual RI or DH lines can be exchanged among different laboratories for further linkage analysis or addition of more markers to existing maps, which ensures that all collaborators examine identical material.
The type of mapping population to be used depends on the reproductive mode of the given crop. For self-pollinating species, F2 progenies and RILs are used; for self-incompatible species, highly heterozygous progenies, that is, F1 populations, are mostly the tools of choice. BC progenies and DHs can be employed for both types of plants. If pure lines cannot be generated from a species due to self-incompatibility or inbreeding depression, heterozygous parental plants are used to derive mapping populations such as F1 and BC progenies. This is the case for several tree species (such as apple, pear and grape) and for potato. To maintain the identity of the F1 genotypes of the mapping population, parental lines and each of their F1 or BC progenies are propagated clonally. In cross-pollinating species, the situation is more complicated since most of these species do not tolerate inbreeding. Many cross-pollinating plant species are also polyploids (i.e. they contain several sets of chromosome pairs). Mapping populations used for mapping cross-pollinating species may be derived from a cross between a heterozygous parent and a haploid or homozygous parent. For example, in both of the cross-pollinating species white clover (Trifolium repens L.) and ryegrass (Lolium perenne L.), F1 generation mapping populations were successfully developed by pair crossing heterozygous parental plants that were distinctly different for important traits associated with plant persistence and seed yield.
There is no specific study that pinpoints the ideal number of individuals in a given population that are required to establish an accurate genetic map. The precision with which genetic distance is measured in a genetic map is directly related to the number of individuals that constitute the given mapping population. For example, if only 20 individuals are studied and no recombinants are found between the given two markers, then the distance between these two markers would be noted as 0 cM (see chapter 4 for details on genetic distance calculation). On the other hand, analysis of another 80 individuals in the same population may reveal recombinants, and hence the distance between the same two markers would be >0 cM depending on the number of recombinants identified. In general, segregating progenies consisting of 50-250 individuals may be sufficient to construct the initial skeletal linkage map; however, a larger population size (say >1,000) is needed for high resolution or fine mapping. It has been shown in several studies that more accurate maps were obtained when a large population size and co-dominant markers were employed, whereas a small population size provided several fragmented linkage groups and inaccurate locus order (discussed in chapter 4). It was also noticed that maximum genetic information can be obtained from an F2 population using co-dominant markers. Dominant markers supply as much information as co-dominant markers in RILs, NILs and DHs since all loci in these populations are homozygous or nearly so. It is important to note that RILs, NILs and DHs may be powerful tools for QTL detection on some occasions, but offer no information on QTL dominance relationships. Characteristics of major types of mapping populations used in genetic mapping studies are described in Table 2.1.
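The effect of population size on precision in the 20-individual example above can be made concrete with a small calculation. The sketch below is an illustration under a stated assumption that is not in the text: each individual contributes one informative meiosis (as in a backcross or DH population), so the number of recombinants follows a binomial distribution. The function names are hypothetical.

```python
def recombination_fraction(recombinants, n_individuals):
    """Observed recombination fraction between two markers."""
    return recombinants / n_individuals

def upper_bound_when_none_observed(n_individuals, confidence=0.95):
    """One-sided upper confidence bound on the true recombination fraction r
    when zero recombinants are seen: solves (1 - r) ** n = 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n_individuals)

for n in (20, 100, 1000):
    bound = upper_bound_when_none_observed(n)
    print(n, recombination_fraction(0, n), round(bound, 4))
```

With 20 individuals the observed fraction is 0, yet a true recombination fraction of more than 0.1 cannot be ruled out; the bound shrinks steadily as more individuals are analysed, which is why fine mapping needs much larger populations.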
Table 2.1 Characteristics of major types of mapping populations used in genetic mapping studies

Development procedure
  F2 progenies: Parent (x) Parent -> F1 (s) -> F2
  BC progenies: Parent (x) Parent -> F1 (x) Parent -> BC
  DH lines: Parent (x) Parent -> F1 -> Anther culture -> DH lines
  RILs: Parent (x) Parent -> F1 (s) -> SSD -> F6 or more -> RILs
  NILs: Parent (x) Parent -> F1 (x) Parent -> BC continues with Parent up to BC6 -> (s) two generations -> NILs

Particulars                                     F2 progenies   BC progenies   DH lines   RILs   NILs
Number of generations required to make          2              2              2          6-8    9
Number of informative gametes per individual    2              1              1          1      1
Number of recombinant events per gamete         x              x              x          2x     x
Number of possible genotypes per locus          3              2              2          2      2

Merits
  F2 progenies: Best population for preliminary mapping. Requires less time for development. Can be developed with minimum effort when compared to other populations. The degree of dominance can be estimated.
  BC progenies: Requires less time to be developed. The populations can be further utilised for marker-assisted backcross breeding.
  DH lines: DHs are a permanent mapping population and hence can be replicated and evaluated over locations and years and maintained without any genotypic change. Useful for mapping both qualitative and quantitative characters. Instant production of homozygous lines, thus saving time. Epistasis can be detected.
  RILs: Once homozygosity is achieved, RILs can be propagated indefinitely without further segregation. Since RILs are an immortal population, they can be replicated over locations and years. RILs, being obtained after several cycles of meiosis, are very useful in identifying tightly linked markers. RIL populations obtained by selfing have twice the amount of observed recombination between very closely linked markers as compared to a population derived from a single cycle of meiosis. Epistasis can be detected.
  NILs: NILs are an immortal mapping population. Suitable for tagging qualitative and quantitative traits. NILs are quite useful in functional genomics. Epistasis can be detected.

Demerits
  F2 progenies: Linkage established using an F2 population is based on one cycle of meiosis. F2 populations are of limited use for fine mapping. Quantitative traits cannot be precisely mapped using an F2 population, as each individual is genetically different and cannot be evaluated in replicated trials over locations and years; thus, the effect of G x E interaction or epistatic interaction on the expression of quantitative traits cannot be precisely estimated. Not a long-term population; impossible to construct an exact replica or increase seed amount.
  BC progenies: They are not immortal. The recombination information in the case of backcrosses is based on only one parent.
  DH lines: Recombination from the male side alone is accounted for. Since it involves in vitro techniques, relatively more technical skills are required in comparison with the development of other mapping populations. Often suitable culturing methods/haploid production methods are not available for a number of crops, and different crops differ significantly in their tissue culture response. Further, anther culture-induced variability should be taken care of.
  RILs: Requires many seasons/generations to develop. Developing RILs is relatively difficult in crops with high inbreeding depression.
  NILs: Require many generations for development. Directly useful only for molecular tagging of the gene concerned, but not for linkage mapping. Linkage drag is a potential problem in constructing NILs, which has to be taken care of.

Particulars                            F2 progenies   BC progenies   DH lines   RILs   NILs
Inheritance of dominant markers        3:1            1:0 (a)        1:1        1:1    1:1
Inheritance of co-dominant markers     1:2:1          1:1 (a)        1:1        1:1    1:1

x crossing, s selfing, SSD single seed descent method, BC backcross
(a) However, backcross with recessive parent (B2) or testcross would segregate in a ratio of 1:1 irrespective of the nature of marker
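The expected inheritance ratios in Table 2.1 are usually checked against observed marker counts with a chi-square goodness-of-fit test before a marker is used for mapping. The following is a minimal sketch of that check; the function name and the observed counts are hypothetical, and the critical values are the standard chi-square values at the 5% level for 1 and 2 degrees of freedom.

```python
def chi_square_for_ratio(observed, ratio):
    """Chi-square statistic comparing observed class counts with an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Standard 5% critical values of the chi-square distribution.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991}

# Hypothetical F2 counts for a co-dominant marker (expected 1:2:1).
codominant_stat = chi_square_for_ratio([22, 55, 23], [1, 2, 1])
# Hypothetical F2 counts for a dominant marker (expected 3:1).
dominant_stat = chi_square_for_ratio([70, 30], [3, 1])

print(round(codominant_stat, 2), codominant_stat < CRITICAL_5_PERCENT[2])
print(round(dominant_stat, 2), dominant_stat < CRITICAL_5_PERCENT[1])
```

A statistic below the critical value means the counts are consistent with the expected ratio; markers that deviate significantly show segregation distortion.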

Selfing and Crossing Techniques in Crop Plants

In a crop improvement program, selfing and crossing are the two paramount procedures. Success of mapping population development largely depends on perfect execution of selfing and crossing procedures. The exact procedures used to ensure self- or cross-pollination of specific plants depend on the floral structure and the method of pollination. Generally, accomplishing cross-pollination in a strictly self-pollinating species is more difficult, because it is not easy to prevent the self-pollination that occurs inside the unopened flowers. In contrast, self-pollination in cross-pollinating species is simple. In the selfing of cross-pollinated species, it is essential that the flowers are bagged or otherwise protected to prevent natural cross-pollination. The structure of the flowers in a species determines the manner of pollination. For these reasons, it is always better to become acquainted with the flowering habit of the crop during mapping population development.

In the case of wheat, rice, barley, groundnut, etc., the plant is permitted to self-pollinate and the seeds are harvested. It is necessary to know the mode of pollination. If the extent of natural cross-pollination is high, the flowers should be protected by bagging, which prevents foreign pollen from reaching the stigma. Seed set is frequently reduced in ear heads enclosed in bags because of excessive temperature and humidity inside the bags. In crops like cotton, which have larger flowers, the petals may be folded down over the sexual organs and fastened, so that pollen and pollen-carrying insects are excluded. This is simply achieved by closing the flower bud with cotton lint. In certain legumes that are pollinated mostly by insects, the plants may be caged to prevent insect pollination. In maize, a paper bag is placed over the tassel to collect pollen and the cob is bagged to protect it from foreign pollen. The pollen collected from the tassel is transferred to the cob.

Removal of stamens or anthers, or killing the pollen of a flower, without affecting the female reproductive organ is known as emasculation. In bisexual flowers, emasculation is essential to prevent self-pollination. In monoecious plants, male flowers are removed (e.g. castor, coconut) or the male inflorescence is removed (e.g. maize). In species with large flowers (e.g. cotton, pulses), hand emasculation is accurate and adequate. For other crops, several other methods of emasculation are followed, e.g. the suction method, hot water or cold water treatment, alcohol treatment, use of genetic or cytoplasmic male sterility lines, employing protogyny (e.g. cumbu) and use of gametocides (e.g. ethrel, sodium methyl arsenate and zinc methyl arsenate are used in rice; maleic hydrazide is used in cotton and wheat). Immediately after emasculation, the flower or inflorescence is enclosed in a suitable bag of appropriate size to prevent random cross-pollination. The pollen grains collected from the desired male parent should be transferred to the emasculated flower. This is normally done in the morning hours during anthesis. The flowers are bagged immediately after artificial crossing and should be tagged in pencil with appropriate information such as the date and the name of the cross combination.

F2 Progenies

Development of F2 progenies is the simplest and most rapid method when compared to other mapping population types. This is the population in which the foundations of Mendelian laws were first established. Usually, two pure lines that result from natural or artificial inbreeding are selected as parents (Fig. 2.1). Alternatively, two doubled haploid lines can be used as parents to avoid any residual heterozygosity. Crossing such parents produces fertile progenies, which are called the F1 generation. If the parental lines are true homozygotes, all individuals of the F1 generation will have the same genotype and a similar phenotype, as per Mendel's law of uniformity. Each F1 plant is then selfed to produce an F2 population that segregates for the given trait. Thus, the F2 population is the outcome of one meiosis, during which the genetic material is recombined. The expected segregation ratio for each co-dominant marker is 1:2:1.
Fig. 2.1 Schematic illustration that explains development of commonly used mapping populations in genetic mapping. [Figure not reproduced. Recoverable labels: a female parent (elite line) and a male parent (donor parent), with genotypes aaBB and AAbb, are crossed to give the hybrid F1 (AaBb). Anther culture of the F1 gives haploids (AB, Ab, aB, ab), and chromosome doubling by colchicine treatment gives doubled haploids (AABB, AAbb, aaBB, aabb). Selfing the F1 gives F2 and F3, and SSD (each plant contributes a single offspring to the next generation) leads to F7 recombinant inbred lines (RILs). Repeated crossing to the female parent (elite line) gives BC1F1, BC2F1 and BC4F1, and selfing gives BC4F2 near isogenic lines (NILs).] X refers to crossing, S refers to selfing, SSD single seed descent method
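
The selfing and backcrossing routes in Fig. 2.1 can be followed with simple expectations. For a locus that is heterozygous in the F1, each generation of selfing is expected to halve the heterozygous fraction, and each backcross to the recurrent parent is expected to halve the donor share of the genome. The sketch below is not part of the original text. It tabulates these expectations assuming no selection, and the function names are illustrative.

```python
# Expected heterozygosity under selfing and expected donor genome share
# under backcrossing (illustrative sketch; assumes no selection).


def expected_heterozygosity(generation):
    """Expected heterozygous fraction at a locus in generation F<generation>.

    The F1 of two pure lines is fully heterozygous (1.0); each round of
    selfing halves the expected heterozygous fraction.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or higher")
    return 0.5 ** (generation - 1)


def expected_donor_fraction(backcrosses):
    """Expected donor genome share in BC<backcrosses>F1 (0 means the F1).

    The F1 carries half of its genome from the donor; each backcross to
    the recurrent parent halves that share.
    """
    if backcrosses < 0:
        raise ValueError("number of backcrosses cannot be negative")
    return 0.5 ** (backcrosses + 1)


if __name__ == "__main__":
    for n in range(2, 8):  # F2 to F7, the selfing route to RILs in Fig. 2.1
        het = expected_heterozygosity(n)
        print(f"F{n}: heterozygous {het:.2%}, homozygous {1 - het:.2%}")
    for n in range(1, 5):  # BC1F1 to BC4F1, the backcross route to NILs
        print(f"BC{n}F1: donor genome {expected_donor_fraction(n):.2%}")
```

Under these assumptions, the F7 lines in the figure are expected to be about 98.4% homozygous, and a BC4F1 plant is expected to carry about 3.1% donor genome.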

The three classes in this 1:2:1 ratio are, in order, homozygous like the female parent, heterozygous and homozygous like the male parent (see chapter 3). The main limitation of an F2 population is that it cannot be easily preserved, because F2 plants are frequently not immortal and the F3 plants that result from their selfing are genetically not identical. Alternatively, crops that can be multiplied as clones using tissue culture can be propagated and regrown whenever needed. Another way is to maintain the F2 population in pools of F3 plants. Traits can be evaluated in hybrids, and testcross plants can be constructed by crossing each F2 individual with a common tester genotype. Ideally, different common testers should produce corresponding results, to exclude the specific effects of one particular tester genotype. As a compromise between resolution of linked loci and cost, a preliminary genome-wide map can be produced with 200 F2 individuals. However, for higher resolution, as required for positional cloning of genes, F2 progenies of several thousand individuals are required (see chapter 7).

F2-Derived F3 (F2:3) Populations

An F2:3 population is obtained by selfing the F2 individuals for a single generation. It is suitable for specific situations where mapping of recessive genes that underlie the quantitative trait of interest is required. The F2:3 family can be used for reconstituting the genotype of the respective F2 plant, if needed, by pooling the DNA from plants in the family. However, the main limitation is that, like F2 populations, it is not an immortal population and hence cannot be used for replicated experiments to validate the results.

F2 Intermating Populations or Immortalised F2 Populations

Random intermating of F2 populations has been suggested for obtaining precise estimates of recombination frequencies between tightly linked loci. Immortalised F2 populations can be developed by paired crossing of randomly chosen RILs derived from a cross, in all possible combinations excluding reciprocals. The set of RILs used for crossing, along with the F1s produced, provides a true representation of all possible genotype combinations (including the heterozygotes) expected in the F2 of the cross from which the RILs are derived. The RILs can be maintained by selfing, and the required quantity of F1 seed can be produced at will by fresh hybridisation. This population therefore provides an opportunity to map heterotic QTLs and interaction effects from multi-location data.
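
The paired-crossing scheme described for immortalised F2 populations (all possible combinations of the chosen RILs, excluding reciprocals) amounts to listing every unordered pair of lines, which gives n(n - 1)/2 crosses for n RILs. The following is a minimal sketch that is not part of the original text; the RIL names and the function name are hypothetical.

```python
# Enumerate paired crosses among RILs in all combinations, excluding
# reciprocals (illustrative sketch; line names are hypothetical).
from itertools import combinations


def paired_crosses(rils):
    """Return every unordered pair of distinct RILs as (female, male) tuples.

    Each pair appears once, so the reciprocal cross (male, female) is
    excluded, as is the crossing of a line with itself.
    """
    return list(combinations(rils, 2))


if __name__ == "__main__":
    chosen = ["RIL1", "RIL2", "RIL3", "RIL4", "RIL5"]
    crosses = paired_crosses(chosen)
    for female, male in crosses:
        print(f"{female} x {male}")
    n = len(chosen)
    print(f"{len(crosses)} crosses for {n} RILs (n(n-1)/2 = {n * (n - 1) // 2})")
```

The number of crosses grows quadratically with the number of RILs (for example, 4,950 crosses for 100 RILs), which is one practical reason a subset of lines is chosen.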
However, in a simulation study, sampling effects due to small population sizes in the intermating generations were found to abolish the advantages of random intermating that were reported in previous theoretical studies considering an infinite population size. Frisch and Melchinger (2008) proposed a mating scheme for intermating with planned crosses that yields more precise estimates than those under random intermating. Mapping populations generated with a mating scheme with independent recombinations have the same properties as mapping populations derived from large random-mating populations. Hence, such a mating scheme guarantees the maximum possible information content in the mapping population while reducing the effort of employing large intermating populations.

DH Lines

Doubled haploid (DH) lines contain two identical sets of chromosomes in their cells. They are completely homozygous, as only one allele is available for all the genes. Usually, DH lines are produced from haploid lines. These haploid lines either occur spontaneously (e.g. rapeseed and maize) or can be induced artificially (Fig. 2.1). Haploid plants are usually smaller and less vigorous than diploids and are nearly sterile. Haploids can be induced by culturing immature anthers on special media, and haploid plants can later be regenerated from the haploid cells of the gametophyte. Alternatively, microspore culture can be employed. As a rare event, in some of the haploid plants, the chromosome number doubles spontaneously, which leads to DH plants. Such lines can also be obtained artificially by colchicine treatment of haploid plants. Colchicine prevents the formation of the spindle apparatus during mitosis and thus inhibits the separation of chromosomes, leading to DH plants. If callus is induced in haploid plants, a doubling of chromosomes often occurs spontaneously during endomitosis and DH lines can be regenerated via somatic embryogenesis. On the other hand, in vitro culture conditions may decrease the genetic variability of regenerated materials to be used for genetic mapping. DH lines are also the product of one meiotic cycle and hence comparable to F2 in terms of recombination information. Despite this, DH lines are used as a permanent resource for genetic mapping and are ideal crossing partners in the production of mapping populations, since they have no residual heterozygosity.

BC Progenies

To analyse the specific genes or other regulatory DNA elements derived from one parent (i.e. the donor parent) in the background of another parent (i.e. the recurrent, or elite, parent), the hybrid F1 plant is backcrossed to the recurrent parent (Fig. 2.1). Two key features best describe BC progenies: unlinked donor fragments are separated by segregation, and linked donor fragments are minimised by recombination with the recurrent parent. In order to reasonably reduce the number and size of donor fragments, backcrossing is repeated. With each round of backcrossing, the proportion of the donor genome is reduced by 50%. Sometimes the backcrossing process can be accelerated by the use of recurrent parent-specific markers (referred to as background markers; discussed in detail in chapter 3). With each round of backcrossing, the number and size of genomic fragments of the donor parent are reduced until a single gene (or other regulatory DNA element) differentiates the BC progeny from the recurrent parent. That particular progeny is later screened for the trait introduced by the donor. In the event of dominant expression of traits, the progeny can be screened directly; on the other hand, recessive expression of traits requires the testing of selfed progeny of each BC progeny. BC progenies that are identical except for a few donor loci are called near isogenic lines (NILs) and are discussed separately (see below). A BC progeny incorporating a fragment of genomic DNA from a very distantly related species is called an introgression line, while a BC progeny incorporating genetic material from a different variety is called an inter-varietal substitution line. At this point, it should be noted that recombination is reduced in
interspecific hybrids with respect to intraspecific hybrids, since variations in DNA will lead to reduced pairing of the chromosomes during meiosis. This contributes to linkage drag, which can be explained as the situation in which larger than expected donor fragments are retained during backcross breeding. Thus, linkage drag can cause undesirable effects in addition to introgression of the trait of interest.

RILs

Recombinant inbred lines (RILs) are the homozygous selfed or sib-mated progeny of the individuals of an F2 population (Fig. 2.1). The use of the RIL concept in genetic mapping was originally developed for mouse. Nearly 20 generations of sib mating are required to reach useful levels of homozygosity in animals. However, in plants, RILs with more than 98% homozygosity are produced by selfing within eight or nine generations (unless the species is completely self-incompatible). Self-pollination allows production of RILs in a relatively short period of time. In fact, in some of the strict self-pollinating crops, almost complete homozygosity can be reached within six generations. Development of RILs usually follows a single-seed descent method, since during the selfing process one seed of each line is the source for the next generation. The bulk method and pedigree methods without selection can also be used for production of RILs. In RILs, alleles derived from either parent alternate along each chromosome. In each generation, meiotic events lead to further recombination and reduce heterozygosity until completely homozygous RILs with fragments of either parental genome are achieved. Since recombination can no longer change the genetic constitution of RILs, further segregation in the progeny of such lines is absent. Because of this, RILs are considered a permanent resource that can be replicated indefinitely and shared among many research groups. Another advantage of RILs is that they can be used to construct a higher-resolution genetic map than F2 populations, and hence, the map positions of even tightly linked markers can be determined. This is because the degree of recombination is higher compared to F2 populations. RILs also equalise marker types like DH lines; the genetic segregation ratio for both dominant and co-dominant markers is 1:1. RILs developed through brother-sister mating require more time than those developed through selfing. Twice as many inbred lines are required when they are developed through brother-sister mating rather than selfing, particularly when linkage is not very tight.

NILs, Exotic Libraries and Advanced Backcross Populations

Development of near isogenic lines (NILs) involves several generations of backcrossing. Backcrossing is executed with the help of molecular markers, since markers can be used to recover the maximum amount of the recurrent genome. Two additional rounds of self-fertilisation are required at the end of the backcrossing process in order to fix the donor segments and to visualise traits that are caused by recessive genes (Fig. 2.1). Generally, it is assumed that if two NILs differ in phenotypic performance, it might be the effect of the alleles carried by the introgressed DNA fragment in the given NIL. Thus, NILs constitute powerful tools in the functional analysis of the underlying genes. Particularly, they are valuable for those species for which no transformation protocol is established to produce transgenics for the alleles of interest. In addition, genomic rearrangements, which may occur during transformation, are also avoided in NILs.

Usually desirable positive alleles (e.g. disease resistance, quality parameters) are found in distantly related or wild species, and those alleles can be introduced into the local elite cultivar through backcrossing. If the trait to be introduced is already known, the backcrossing can be expedited directly via marker-assisted selection. However, the potential of wild species to influence the expression of quantitative traits is often not assessed. To this end, backcross breeding is a method to identify single genomic components contributing to the phenotype. In such cases,
NILs are developed by an advanced backcross program (i.e. mapping population development and QTL identification are carried out simultaneously and the phenotypic effects of the QTLs are assayed; first described by Tanksley and his research team (1996) in tomato; see chapter 8). A collection of introgression lines, each harbouring a different fragment of genomic DNA, can be generated to assess the effects of small chromosomal introgressions at a genome-wide level. Such collections are referred to as exotic libraries, and they are developed through recurrent backcrossing and marker-assisted selection for six generations, followed by self-fertilisation for two more generations to generate plants homozygous for the introgressed DNA fragments. Thus, NILs, after the advanced backcross program, will resemble the cultivated parent, but introgressed fragments with even subtle phenotypic effects can be easily identified. The introgressed fragments can be clearly defined by the use of molecular markers.

Four-Way Cross Populations

The majority of the genetic maps in crops were constructed using mapping populations derived from either interspecific or intraspecific single-cross hybridisation. Due to the lower level of within-species and between-species polymorphism, most of the maps have included only a relatively small portion of the genome. For example, even a joint map from different mapping populations has shown 31% coverage of the cotton genome. If such a poor-coverage genetic map is used for QTL mapping, only a small portion of the genome will be explored and large amounts of QTL information will not be revealed. Use of four parents of a double cross (otherwise referred to as a four-way cross) has been shown to increase the density of genetic maps (Qin et al. 2008). The F1s derived from two different single-cross hybridisation programs are crossed to generate four-way cross populations. The initial parental polymorphism survey should include all four parents. If a locus screened for polymorphism is homozygous in both F1 parents, this locus is excluded from linkage analysis because the alleles do not segregate in the four-way cross population. The markers can have Mendelian segregation ratios of 1:1, 1:2:1, 3:1 and 1:1:1:1 in a four-way cross population. Since a four-way cross involves four inbred lines (L1, L2, L3 and L4), the polymorphic markers identified between L1 and L2 or between L3 and L4 can be employed to develop the genetic map. If only two parents were employed for mapping, half of the polymorphic markers would be homozygous and could not be used in linkage analysis. Thus, a four-way cross can increase the density of the linkage map, and in some cases, it can counteract the lower levels of polymorphism found in certain crops. Further, use of a four-way cross can potentially reduce the type II error caused by a random sampling of parents and increase the probability of detecting QTL (see chapter x) if they segregate in one single-line cross but not in the other single-line cross. In contrast to a single cross, in which only two alleles are involved, a four-way cross can have a maximum of four alleles. Because of this, the additive and dominance effects in a four-way cross are defined differently from a simple cross to accommodate different inbred lines. When only two different alleles exist among four inbred parents, the additive and dominance effects of alleles have a common mean with those of alleles identified in a single-cross population. If the allele of one parent differs from that of the other three parents at one locus, a four-way cross population is analogous to a BC population.

Multi-Cross Populations

The features of the genetic structure of RILs can be studied using two-, four- and eight-way crosses following either selfing or sib mating. Though eight-way cross RILs have been successfully shown in mouse, they are yet to be demonstrated in major crops. Interestingly, there are several contrasting features between the nested association mapping (NAM) strategy (explained below) and eight-way cross RILs. In maize, which has very low linkage disequilibrium and tremendous genetic
diversity, the main point in RIL generation for NAM development is to capture a large array of alleles by using many founders, rapid production of RILs and minimised physiological variation by crossing to a reference line. In contrast, the mouse has low diversity and high linkage disequilibrium, but the eight-way cross produces more recombinations per line, which helps compensate for the high linkage disequilibrium, and the mixing ensures that a fuller range of epistatic interactions is produced. For example, 5,000 maize RILs capture ~200,000 independent recombination breakpoints, compared to 135,000 breakpoints in the 1,000 mouse RILs from an eight-way cross. Thus, previous studies of genetic designs with multiple line crosses have shown an improved power and mapping resolution over a single population. Nevertheless, their importance in genetic mapping is yet to be clearly demonstrated in crops.

Nested Association Mapping Populations

Linkage mapping focuses on the development of large families from two inbred lines to detect QTLs. However, slow progress has been made in identifying completely characterised QTLs because of limitations in the scope of allelic diversity and resolution in available genetic resources. Particularly, the poor resolution of the QTLs is mainly due to the limited number of recombination events that occur during population development. Association mapping takes advantage of the extensive recombination accumulated over a long history, as linkage disequilibrium generally decays within 2 kb (see chapter 6). Nevertheless, because of the requirement for a large number of highly polymorphic molecular markers and the confounding effects of population structure, whole-genome association analysis is difficult in crop plants. To circumvent these problems, a nested association mapping (NAM) population can be constructed to enable high power and high resolution by capturing the best features of both linkage and association mapping through joint linkage-association analysis. The genetic structure of a NAM population is a reference design of 25 families of 200 RILs per family. NAM has been successfully implemented in maize using the inbred B73 as the reference line (because of its use for the public physical map and for the maize sequencing project). The other 25 parents (called founder lines) were independent of any specific phenotype and represented diverse germplasm lines (that were collected from all over the world to maximise the genetic diversity of the RIL families). The NAM strategy addresses complex trait dissection at a fundamental level by generating a common mapping resource to efficiently exploit genetic, genomic and systems biology tools. The original procedure proposed by McMullen et al. (2009) involves the following steps: (a) selection of diverse founders and developing a large set of related mapping progenies (preferably RILs for robust phenotypic trait collection), (b) either sequencing completely or densely genotyping the founders, (c) genotyping a smaller number of tagging markers on both the founders and the progenies to define the inheritance of chromosome segments and to project the high-density marker information from the founders to the progenies, (d) phenotyping progenies for various complex traits and (e) conducting genome-wide association analysis relating phenotypic traits with projected high-density markers of the progenies. When compared to the conventional linkage mapping procedure, NAM has the advantages of (1) lower sensitivity to genetic heterogeneity, (2) higher power, (3) higher efficiency in using the genome sequence or dense markers and (4) maintaining high allele richness due to diverse founders.

Thus, NAM aims to create an integrated mapping population specifically designed for a full genome scan with high power for detection of QTLs with different effects. In NAM, each RIL progeny represents a mosaic of chromosome segments derived from either one of the diverse founders or the common parent. With the scores of common parent-specific markers (markers for which the reference line has rare alleles) in RILs, the marker or sequence information nested between two flanking common parent-specific markers can be predicted for RILs on the basis of the marker or genome sequence available for the founders. By choosing diverse founders, linkage disequilibrium
within these chromosome segments resulting from historical or evolutionary recombination is mostly preserved in RILs, due to the small probability of recombination within short genetic distances between flanking common parent-specific markers. The potentially confounding effects of genes outside of a specific segment being tested are minimised across all the RILs via the reshuffling of the parental genomes by the recent recombinations during RIL development. All the immortal mapping populations used in publications have a maximum of 400 lines, which limits their mapping power and coverage of allelic diversity. Further, because of genetic heterogeneity, QTL mapped in a single two-parent population often have little application to QTL segregating in other populations, limiting the scope of inference of QTL studies and the use of MAS in crops. In NAM, the polymorphisms within the tagging molecular markers can be tested more directly because high-density markers on founders can be obtained, and this information can be projected onto the progeny through flanking common parent-specific markers. Thus, rather than inferring multiple alleles at each testing locus as in previous methods, NAM reduces the testing to exact biallelic contrasts across the whole population. Therefore, the advantages of designed mapping populations from linkage analysis and of high resolution from association mapping are integrated in NAM through development of a large number of RILs from diverse founders. While common parent-specific markers allow the prediction of transmission of chromosome segments in RILs, the short range of linkage disequilibrium within these segments across the diverse founders enables improved mapping resolution. The genetic background effect of these parental founders on mapping individual QTL, which is a limiting factor for association mapping, is systematically reduced by reshuffling the genomes of the two parents of each cross during RIL development as well as by the combined analysis of all the RILs across all 25 crosses. At the same time, a balanced design with well-chosen diverse founders in NAM, if possible for a particular species, would provide higher power and finer resolution than exploiting an existing pedigree. Further, as in association mapping, the mapping resolution offered by NAM largely depends on the linkage disequilibrium among the founder individuals. Rapid decay of linkage disequilibrium, over 2 kb, has been noticed across genetically diverse species. Given the diversity of the founders and the rapid linkage disequilibrium decay within 2 kb, mapping resolution for NAM is expected to be high.

Natural Populations

The main limitations of experimental mapping populations are that they are laborious, time consuming and require great care and effort in construction. The natural variation existing among individuals of one species can also be exploited for genetic mapping. In the case of crops, germplasm entries consisting of different breeding materials and wild species can fulfil this purpose. It has been shown that such natural populations can be used to map complex traits that are influenced by the action of many genes in a quantitative way. However, it is important that such a collection of different germplasm accessions should contain a whole range of phenotypes for a given trait. More importantly, the availability of extreme phenotypes of interest is valuable. The basic premise of this idea is that genomic fragments naturally present in a particular genotype are transmitted as non-recombining blocks and that molecular markers can easily follow the inheritance of such blocks. These blocks are called haplotypes, and their existence reveals a state of linkage disequilibrium (LD) among allelic variants of tightly linked genes (explained in detail in Chapter 6). Usually, an association between a marker and a trait exists if one marker allele or haplotype is significantly associated with a particular phenotype when studied in unrelated genotypes (such as a natural population). The main strength of this approach is that it does not require the construction of mapping populations. Particularly, for self-pollinating crops, inbred individuals of natural ecotypes are effectively immortal, and phenotyping needs to be performed only once. In addition, natural populations are
particularly informative because usually more than two alleles exist for each marker locus. Since unrelated natural populations are genetically separated by many generations, the corresponding large number of meiotic events leads to a high rate of recombination. Therefore, if LD blocks exist, the loci that influence the expression of a trait can be mapped with high precision (sometimes largely exceeding the resolution of F2 populations). However, such an association study requires thorough statistical assessment of the relatedness and population structure; the reasons for such analysis are given in chapter 6.

Chromosome-Specific Genetic Stocks for Linkage Mapping

Chromosome-specific tools or genetic stocks allow a segregating population to be genotyped in a way that each chromosome is directly scanned for linkage. There are several such tools; one kind is mutant lines with one or more visible mapped mutations. As stated earlier, the distances in genetic maps are based on recombination frequencies (refer chapter 4 for details). However, recombination frequencies are not equally distributed all over the genome. For example, in heterochromatic regions such as the centromeres, reduced recombination frequencies are usually noticed. In such situations, cytogenetic maps can provide complementary information since they are based on the fine physical structure of chromosomes. The chromosomes are visualised under (fluorescent or phase contrast) microscopes and can be characterised by specific staining (e.g. Giemsa C) patterns or by morphological structures such as the centromeres, the nucleolus-organising regions (NOR), the telomeres and knobs, which are heritable heterochromatic regions of particular shape. Cytogenetic maps provide information on the association of linkage groups with chromosomes and the orientation of the linkage groups with respect to chromosome morphology. It is worth mentioning here that the anonymous molecular markers (see chapter 3) are assigned to a particular chromosome based on such cytogenetic stocks. In several crops, lines carrying chromosome deletions, translocation breakpoints or monosomics/trisomics/nullisomics have been generated for this purpose. Thus, numerical aberrations in chromosome numbers, together with marker data, could clearly help in identification of chromosomes.

Alternatively, defined translocation breakpoints can also localise probes to specific regions on the arms of chromosomes by using techniques that can localise nucleic acids in situ on the chromosomes. At the pachytene stage (during the meiotic prophase), the chromosomes are generally 20 times longer than at mitotic metaphase. During this time, chromosomes display a differentiated pattern of brightly fluorescing heterochromatin segments. It is possible to identify all chromosomes based on chromosome length, centromere position, heterochromatin patterns and the positions of repetitive sequences (such as 5S rDNA, 45S rDNA) using fluorescence in situ hybridisation (FISH). The recent refinement in multicolour FISH even allows the mapping of single-copy sequences. Thus, cytogenetic maps developed using FISH can provide complementary information for the assembly of a physical map by positioning bacterial artificial chromosome clones and other DNA sequences along the chromosomes (discussed in detail in chapter 7).

Bulk Segregant Analysis

Besides the above-mentioned populations, the bulk segregant analysis (BSA) approach is frequently used in gene tagging or identifying major QTLs. BSA is based on the principle of isogenic lines, and this concept was introduced by Michelmore et al. in 1991, in lettuce, for identifying genes associated with downy mildew resistance. In BSA, two parents (say, a resistant and a susceptible one) showing a high degree of molecular polymorphism and contrast for the target trait are crossed, and the F1 is selfed to generate an F2 population. In the F2, individual plants are phenotyped for resistance and susceptibility. Usually, the DNA isolated from ten plants in each group is pooled to constitute resistant and susceptible bulks. The resistant parent, susceptible
parent, resistant bulk and susceptible bulk are surveyed for polymorphism using molecular markers. A marker showing polymorphism between the parents as well as between the bulks is considered putatively linked to the target trait and is then used for mapping with individual F2 plants. Conceptually, the genetic constitution of the two bulks is similar except for the genomic region associated with the target trait. Hence, in principle, they serve the purpose of isogenic lines. It has been observed over experiments that when ten plants are sampled in each group to constitute the bulk, the probability that a marker polymorphic between the parents as well as the bulks is not linked to the target trait is extremely low. Hence, ten plants are usually used for constituting the bulks. However, this number may vary depending upon the type of mapping population used. Using BSA, markers can be reliably identified in a 0- to 25-cM window on either side of the locus of interest. Further, this method can be applied iteratively, in the sense that new bulks can be constructed based on each new marker that is linked more closely to the gene. The linkage of each marker with the tagged locus is verified by analysing single plants of the segregating population.

Combining Markers and Populations

The genetic segregation ratio at a marker locus is jointly determined by the nature of the marker (dominant or co-dominant; see chapter 3 for definition and details) and the type of mapping population (Table 2.1). Therefore, a thorough understanding of the nature of markers and mapping populations is crucial for any mapping project. Mapping populations such as RILs and DHs equalise marker types because the parental alleles are fixed in homozygous condition at each marker locus. These populations give a 1:1 segregation ratio at a marker locus irrespective of the genetic nature of the marker, while an F2 population segregates in a 1:2:1 ratio for a co-dominant marker and in a 3:1 ratio for a dominant marker. The statistical analysis of marker data will vary depending upon the segregation pattern.

Characterisation of Mapping Populations

Precise genotypic and phenotypic characterisation of the mapping population is vital for the success of any mapping project. Since the molecular genotype of an individual is independent of the environment, it is not influenced by G × E interaction. However, the trait phenotype can be influenced by the environment, particularly in the case of quantitative characters. Therefore, to obtain a valid marker-trait association, it becomes important to estimate the trait value precisely by evaluating the genotypes in multi-location tests over seasons and/or years using immortal mapping populations.

Choice of Mapping Populations

It is evident from the foregoing discussion that short-term mapping populations such as F2, backcross and the conceptual near isogenic lines developed through the BSA approach can be a good starting point in molecular mapping, while long-term mapping populations such as RILs, NILs and DHs must be developed and properly characterised with respect to the traits of importance for global mapping projects. As a matter of fact, the development and phenotypic characterisation of mapping populations should become an integral part of the ongoing breeding programs in important crops. At this point, the role of geneticists and plant breeders becomes crucial to reap the benefits of genetic mapping.

Challenges in Mapping Population Development and Solutions to These Challenges

As described in chapter 1, a loss in genetic diversity inevitably causes problems in breeding for new varieties, and this has been repeatedly shown in several crops (well-known examples are tomato and cotton). This erosion in genetic diversity created a bottleneck. Breeding methods such as single-seed descent and pedigree selection also
promote genetic uniformity. In self-compatible species, an even further decrease in genetic diversity can be expected since the mode of reproduction plays a major role in the maintenance of genetic variability. In such cases, the use of landraces that are not genetically uniform is one option to increase genetic polymorphism and is essential for introducing new genetic factors into the breeding pool of the crop. Another problem that is often found in genetic mapping is distorted segregation. A significant deviation from the expected segregation ratio in a given marker-population combination is referred to as segregation distortion. There are several reasons for segregation distortion, including gamete/zygote lethality, meiotic drive/preferential segregation, sampling/selection during population development and differential responses of parental lines to tissue culture in the case of DHs (more details are given in chapter 4). Segregation distortion can also be specific to some markers in an otherwise normal mapping population. It is common in plants that one allelic class is underrepresented due to dysfunction of the gametes concerned. This can occur in pollen, in megaspores or in both. It can be explained either by the selective abortion of male and female gametes or by the selective fertilisation of particular gametic genotypes. A selection process during seed development, seed germination and plant growth can also be a causative agent. Gametophyte loci leading to distorted segregation have been identified in rice and other crops. They are supposed to be responsible for the partial or total elimination of gametes carrying one of the parental alleles. Thus, a marker locus linked to a gametophyte locus (also referred to as a gamete eliminator or pollen killer) can also show distorted segregation. Self-incompatibility loci, which prevent self-pollination, are another important direct cause of distorted segregation. Therefore, breeding programs that aim at the generation of specific recombinants are directly affected if one locus is close to a region affected by segregation distortion.

Detection of QTLs is often limited by several factors such as genetic properties of QTLs, environmental effects, population size and experimental error. Hence, it is desirable to independently confirm QTL-mapping studies. Such confirmation studies may involve independent mapping populations constructed from the same parental genotypes or from genotypes closely related to those used in the primary QTL-mapping study. Sometimes, larger population sizes may also be used. Furthermore, some recent studies have proposed that QTL positions and effects should be evaluated in independent populations, because QTL mapping based on typical population sizes results in a low power of QTL detection and a large bias of QTL effects. Unfortunately, due to constraints such as lack of research funding and time, and perhaps a lack of understanding of the need to confirm results, QTL-mapping studies are rarely confirmed. Validation of conserved QTLs across populations has not been conclusive so far because the majority of the QTL studies were derived from small or mortal (F2 or BC) populations. Compared with F2 or BC populations, homozygous immortalised RILs constitute the preferred material for QTL mapping in many crops. When n pairs of genes segregate independently, the number of different gametes is 2^n, while the number of possible genotypes in an F2 is 3^n; that is, with doubled haploids or RILs, fewer individuals need to be screened (and this is economically very important when using molecular markers) to cover a similarly wide spectrum of recombinants, and more accurate estimates of the location of the QTL can be obtained with less variance.

For RILs or DHs, the power of detecting a given quantitative trait locus is clearly related to its relative contribution to the heritability of the character (refer to chapter 5). The power of the test was about 90% for heritabilities of QTL. To obtain a similar power for backcrosses, the heritability attributable to the individual quantitative trait locus should be around 14%. For a given type of gene action, it seems that DHs have a similar power to an F2. However, if dominance is present, DHs or RILs will only detect the additive component of a particular quantitative trait locus. This could be very important for QTLs showing overdominant (or pseudo-overdominant) effects. The major technical advantage of DHs or RILs, independent of any effect of replication on the required number of offspring, lies in the fact that the lines can be reproduced independently and continuously evaluated with respect to additional quantitative traits and markers, with all the information being cumulative. If the effect of replication is taken into account, replicated progenies can bring about a major reduction in the number of lines that need to be scored. Reductions are greatest when the heritability of the trait is low, under the assumption of co-dominance at all QTLs. In this situation of low heritability, MAS is much more efficient than phenotypic selection.

RILs have not been widely utilised in crops except in some cases, mainly due to long development timelines and difficulties in producing sufficient seed. Though there is no clear rule for the precise population size required for QTL analysis, it is increasingly believed that sampling limited numbers of progeny (say <200) in mapping studies tends to cause a skewed distribution of QTL effects and the identification of only a limited number of QTLs, even if many genes with equal and small effects actually control the trait. Further, in several published reports, the number of linkage groups exceeds the gametic chromosome number, and numerous linkage groups are yet to be associated with specific chromosomes, mainly due to a lack of informative markers and the use of small sample sizes. In most of the published genetic maps, the markers were not uniformly spaced over many linkage groups. This has been attributed to these regions being heterochromatic or gene rich. Clusters of markers with very limited recombination are frequently present, which may be indicative of QTL-rich (gene-rich) regions.

Consideration must be given to the source of parents (adapted vs. exotic) used in developing the mapping population. Chromosome pairing and recombination rates can be severely disturbed (suppressed) in wide crosses, which generally yield greatly reduced linkage distances. Wide crosses will usually provide segregating populations with a relatively large array of polymorphism when compared with progeny segregating in a narrow cross (adapted × adapted). To have significant value in a crop improvement program, a map made from a wide cross must be collinear (i.e. the order of loci should be similar) with a map constructed using adapted parents.

Thus, before starting a mapping population development program, the points discussed above need to be critically evaluated depending on the type of crop pollination, the nature of the marker types, the availability of resources, the genetics of the trait under investigation, etc.

Bibliography

Literature Cited

Broman KW (2005) The genomes of recombinant inbred lines. Genetics 169:1133-1146
Frisch M, Melchinger EA (2008) Precision of recombination frequency estimates after random intermating with finite population sizes. Genetics 178:597-600
McMullen MD et al (2009) Genetic properties of maize nested association mapping population. Science 325:737-740
Michelmore RW, Paran I, Kesseli RV (1991) Identification of markers linked to disease resistance genes by bulked segregant analysis: a rapid method to detect markers in specific regions by using segregating populations. Proc Natl Acad Sci USA 88:9828-9832
Qin H, Guo W, Zhang YM, Zhang T (2008) QTL mapping of yield and fiber traits based on a four-way cross population in Gossypium hirsutum L. Theor Appl Genet 117:883-894
Tanksley SD, Nelson JC (1996) Advanced backcross QTL analysis: a method for the simultaneous discovery and transfer of valuable QTLs from unadapted germplasm into elite breeding lines. Theor Appl Genet 92:191-203
Yu J et al (2008) Genetic design and statistical power of nested association mapping in maize. Genetics 178:539-551

Further Readings

McCouch SR, Kochert G, Yu ZH, Wang ZY, Khush GS, Tanksley SD, Coffman RW (1988) Molecular mapping of rice chromosomes. Theor Appl Genet 76:815-829
Rao SQ, Xu SZ (1998) Mapping quantitative trait loci for ordered categorical traits in four-way crosses. Heredity 81:214-224
Xu S (1996) Mapping quantitative trait loci using four-way crosses. Genet Res 68:175-181
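
Two calculations discussed in this chapter lend themselves to a short worked sketch in Python. The first is the reasoning behind using about ten plants per bulk in bulk segregant analysis: for a dominant marker that is not linked to the target gene in an F2, the chance that one bulk shows the band while the other lacks it is 2(1 - q^n)q^n, where q = 1/4 is the frequency of band-absent plants and n is the bulk size. The second is a chi-square goodness-of-fit test of observed marker classes against the expected ratio (1:2:1, 3:1 or 1:1), which is one common way to flag segregation distortion. The function names, the example counts and the fixed 5% critical values are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction

# Chi-square critical values at the 5% level, keyed by degrees of freedom
# (standard table values; only 1 and 2 d.f. are needed for these ratios).
CHI2_CRIT_05 = {1: 3.841, 2: 5.991}


def bsa_false_positive(n_plants, absent_freq=Fraction(1, 4)):
    """Chance that an UNLINKED dominant marker distinguishes two F2 bulks.

    One bulk must contain at least one band-carrying plant while every
    plant in the other bulk lacks the band; either bulk may be the
    band-less one, hence the factor 2.
    """
    all_absent = absent_freq ** n_plants
    return float(2 * (1 - all_absent) * all_absent)


def segregation_chi_square(observed, expected_ratio):
    """Return (chi-square statistic, True if distorted at the 5% level)."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, stat > CHI2_CRIT_05[df]


if __name__ == "__main__":
    # Bulk size of ten plants, as suggested in the text.
    print(bsa_false_positive(10))
    # Hypothetical F2 counts for a co-dominant marker (expected 1:2:1).
    print(segregation_chi_square([52, 98, 50], [1, 2, 1]))
    # Hypothetical RIL counts (expected 1:1) showing strong distortion.
    print(segregation_chi_square([70, 30], [1, 1]))
```

With ten plants per bulk, the first probability is of the order of two in a million, which is consistent with the statement that a marker polymorphic between the bulks is very unlikely to be unlinked. The chi-square sketch assumes that every expected class count is large enough for the approximation to hold.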
Genotyping of Mapping Population
3

Markers and Their Importance

The basic principle of plant breeding for the genetic improvement of crop plants relies mainly on the selection of superior progenies from the available population based on the traits of interest (such as higher yield, improved nutritional quality, or a colour or fragrance preferred by the consumer). In general, such traits are not measured directly on the plants; instead, they are inferred from other markers or tags that are closely linked to the trait of interest. For example, rice yield is determined by a higher number of productive tillers, number of grains/spikelet, etc.; other classical examples are traits such as pea seed size, colour and plant height, used by Mendel. Such tags, which are used to select the superior progenies from a heterogeneous population, are called markers. These markers are useful in an array of plant breeding and genetics studies including:
1. Genetic relatedness and diversity
2. Population genetics
3. Studying polymorphism in landraces, cultivars and germplasm
4. Identification of cultivars and taxonomy
5. Phylogenetic studies
6. Studying domestication and evolution
7. Gene flow and introgression
8. Comparative mapping
9. Gene mapping and identification
10. Genetic improvement of crop plants
11. Detecting somaclonal variation
12. Evaluating germplasm for useful genes
13. Pedigree analysis
14. Hybrid identification

The following section describes two different classes of markers that have been used in plant breeding from time immemorial, and the later part of this chapter describes the importance of molecular markers in characterising or genotyping mapping populations for genetic and QTL mapping.

Morphological Markers

During the early days of plant breeding, breeders used to cross and select the progeny based on certain neutral characteristics, since these easily recognisable characteristics most probably coincide with the expression of agronomically and economically important traits. Therefore, those visibly observable characteristics are used to mark or tag the desired (or sometimes undesired) progeny in the population, and they are called phenotypic or morphological markers. Genetically, their function is based on linkage between the genes for these characteristics and the agronomic trait. This concept of using markers in genetics dates back to as early as the nineteenth century. Gregor Mendel used phenotype-based genetic markers in his experiments. It is also interesting to note that phenotype-based genetic markers in Drosophila led to the establishment of the theory of genetic linkage by Alfred Henry Sturtevant at Dr. Morgan's laboratory (the details of linkage mapping are discussed in

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_3, Springer India 2013
chapter 4). In 1913, A. H. Sturtevant generated the first genetic map using six morphological traits (termed factors at that time) in the fruit fly (Drosophila melanogaster). Similarly, Karl Sax produced evidence for genetic linkage between a qualitative and a quantitative trait (seed colour and seed size, respectively) in the common bean (Phaseolus vulgaris).

To be useful in genetic analysis, a heritable morphological characteristic has to exist in at least two alternative forms or phenotypes (e.g. the tall or short stems studied by Mendel in pea plants). Consequently, several such visibly observable phenotypes were used as markers to construct genetic maps in earlier days. As stated, the first genetic map was developed in the fruit fly and showed the positions of body colour, eye colour, wing shape and other such traits. However, it was soon recognised that there were only a limited number of visual phenotypes, and that such morphological markers are identified very infrequently and are mostly not available in every segregating population. Further, in many cases, the genetic analysis was intricate since a single phenotype could be affected by more than one gene. Though morphological markers are easily examined, they are frequently affected by the environment. Some of them appear late in plant development (e.g. flower colour), making early scoring impossible. In addition, a given morphological marker can affect other morphological markers or traits of interest in breeding programs because of pleiotropic gene action. Because of these features, morphological markers found limited application in plant breeding. To make genetic maps more comprehensive, it was necessary to find phenotypes that were more distinctive and less complex. As innovations in protein science developed, a new set of markers, namely, protein or enzyme-based markers (isoenzymes), was introduced in this context.

Biochemical Markers or Isozymes

Enzymes, a type of protein, usually act as catalysts, and each enzyme is highly specific to a particular biochemical reaction. The term isozymes was coined by Markert and Møller in 1959 to describe the different molecular forms of proteins that exhibit the same enzymatic specificity. The terms isozyme and isoenzyme have been used interchangeably. However, the Standing Committee on Enzymes of the International Union of Biochemistry prefers the term isoenzyme. Specific enzymes isolated from different species may possess wide variations in their physical and catalytic properties and are called heteroenzymes since they have different origins (hence, the term differs from isoenzymes). The term isoenzyme is restricted to those forms of an enzyme with similar enzymatic activity occurring within a single species, as a result of the presence of more than one structural gene. The multiple forms of enzymes are also divided into two main classes according to how they are coded: allozymes (enzymes coded by different alleles at one gene locus) and isozymes (enzymes coded by alleles at more than one gene locus). However, the term isoenzymes refers to both classes. Allozymes are controlled by co-dominant alleles, which means that homozygotes (all alleles at a locus are the same) can be distinguished from heterozygotes (the parents of the individual have contributed different alleles to that locus). For monomeric enzymes (i.e. consisting of a single polypeptide), plants that are homozygous at a given locus will produce one band, whereas heterozygous individuals will produce two. For dimeric enzymes (i.e. consisting of two polypeptides), homozygous plants will again produce one band, whereas heterozygous individuals will produce three because of random association of the polypeptides. Multimeric enzymes also exist in which the polypeptides are specified by different loci. The formation of isozymic heteromers can thus considerably complicate gel banding patterns (discussed below). As per the definition, the following categories are regarded as isozymes: (1) genetically independent proteins arising from the presence of multiple gene loci, (2) enzyme variants arising from allelic genes at a particular gene locus (these isoenzymes are called allelozymes) and (3) heteropolymers (noncovalent hybrid molecules of two or more different polypeptide chains).

Principle

The use of isoenzymes (also called protein markers) in plant breeding is based on the principle that allelic
variation exists for many different proteins. For instance, two alleles of malic dehydrogenase can perform the same enzymatic function, but the electrophoretic mobility of the two may differ (i.e. the proteins encoded by the two alleles would not migrate to the same location in the gel). Thus, the procedure to identify isoenzyme variation is simple. A crude protein extract is made from tissues (such as leaves or flowers). The extracts are then separated by electrophoresis in a starch or polyacrylamide gel. The gel is then placed in a solution that contains the reagents required for the enzymatic activity of the isoenzyme being investigated. Further, the solution contains a dye that the isoenzyme converts into a coloured product that stains the gel at the position of the protein, so that the allelic variants of the protein can be visualised on the gel.

Since isoenzymes catalyse the same reaction, they should be closely related forms of proteins. Hence, it is possible to explain the specific chemical and biological properties of individual isoenzymes in terms of their physicochemical structure. The amino acid sequence of an enzyme is predetermined and is built from the 20 standard amino acids. It is well known that the amino acid sequence (primary structure) of the polypeptide chain predetermines the secondary, tertiary and quaternary structure of the protein. The spatial conformation dictated by the secondary, tertiary and quaternary structures is of paramount importance in determining the specific and unique properties of an enzyme. Hence, any change that modifies the secondary, tertiary or quaternary structure would produce different but closely related forms of an enzyme. This may be due to some modification in protein structure such as small changes in amino acid sequence, amidation of carboxyl groups, conjugation with small molecules, polymerisation and folding of the same primary structure in different ways. The spatial conformation of lipoproteins and glycoproteins may also be modified by small changes in the covalently linked prosthetic groups. If isoenzymes are closely related forms of a protein, it is necessary to propose a mechanism that permits structural variation in the protein but allows retention of enzymatic activity. It is now well established that there can be large regions of the protein that are unnecessary for enzyme activity. Therefore, by modifying the nonessential regions of the structure, a wide variety of protein molecules could exist with similar enzymatic activities. It is also known that most enzymes are made up of more than one polypeptide or subunit. Hence, variation in one subunit may cause a structural difference in the enzyme. Further, random association of subunits may also lead to isoenzyme formation.

Numerous studies have shown that isoenzymes may arise as artefacts during the course of purification. How can one determine whether the isoenzymes present in a tissue homogenate are real or are artefacts that have arisen during isolation and purification? The usual precaution against artefacts is to establish the existence of isoenzymes by as many techniques as possible, that is, different isolation procedures, assorted purification techniques and different detection methods. It is also essential to demonstrate that the isoenzymes do not arise from one another during the experimental procedures (discussed below). Clear evidence for isoenzymes occurring within the tissue may be obtained by demonstrating a definite structural difference or by showing that the isoenzymes are synthesised under independent genetic control.

Diverse types of experimental protocols are available to detect and distinguish isoenzymes. However, each procedure has its own advantages and limitations, which are discussed below.

Electrophoresis

Electrophoresis is presently the most powerful analytical technique available to separate isoenzymes. Its scope of application has been broadened tremendously in recent years by simplification of the apparatus and by the development of synthetic support media, which have shortened the time of analysis. The theory underlying electrophoresis is simple. Direct current is used to separate the individual isoenzymes (electrophoretic mobility) by taking advantage of the differences in net charge of each isoenzyme. Changes in electrophoretic
mobility may result from the substitution of a single amino acid. Thus, the altered electrophoretic mobility reflects a change in the net charge of the protein molecule, which occurs when the substituted amino acid carries a charge different from that of the one it replaces. The most widely used electrophoretic technique involves zymogram display of isoenzymes, which utilises zone electrophoresis followed by histochemical staining methods to locate the zones of enzyme activity directly in the supporting medium (starch or polyacrylamide). While the zymogram method is very sensitive and convenient, it must be noted that several sources of error are inherent in the staining techniques. Since isoenzymes frequently differ in catalytic properties, certain isoenzymes may fail to react with the detecting stain because they are not at optimal conditions or because they have lower specific activity. Hence, extreme caution must be exercised in interpreting the electrophoretic results.

Chromatography

Chromatographic techniques represent the second most powerful tool available for the separation of isoenzymes. This approach is desirable when it is necessary to isolate isoenzymes as a preparative step. When separating isoenzymes by chromatographic procedures, it is necessary to establish their validity as chromatographic entities. This can be achieved by re-chromatography of the isolated peaks under the original conditions, in which each peak should emerge in the effluent profile at its original position. However, the appearance of false components is a characteristic feature of protein chromatography, and it is for this reason that most investigators use a variable gradient device for eluting proteins from a column.

Gel Filtration

Gel filtration or molecular sieving is carried out on various cross-linked dextran polymers (e.g. Sephadexes) or cross-linked polyacrylamide polymers (e.g. Bio-Gel P). However, the use of dextran gels offered a convenient way of determining the molecular weights of many proteins, even though anomalous results may be obtained if the protein forms a complex with the gel or contains an appreciable amount of carbohydrate. It has also been reported that gel filtration is unsatisfactory for estimating the molecular weight of glycoproteins. A number of cases have been reported where enzyme fragments produced by proteolytic digestion still retain part of their enzymatic activity. These observations illustrate the possibility that, in some instances, isoenzymes may arise from endogenous proteolytic action during enzyme purification. Gel filtration would appear to provide a convenient technique to detect this type of artefact. Gel filtration may also be employed to examine the possibility that an isoenzyme might have arisen as a result of the dissociation of one or more subunits from the parent enzyme.

Immunochemistry

When a foreign protein, an antigen, is injected into a suitable animal, the animal produces a specific protein called an antibody. The antibody may combine with the antigen to produce a visible precipitate. Antibodies are highly specific in their activity. Injection of a homogeneous isoenzyme can result in the formation of a single type of antibody, which gives no precipitation reaction with other isoenzymes. If one isoenzyme is shown to be immunologically different from another, it can be said unequivocally that the two are structurally different. On the other hand, if two isoenzymes give the same immunological reaction to a given antibody, it can only be said that they may be identical. The immuno-electrophoresis technique combines the principles of zone electrophoresis with those of immunochemical analysis, thus making it possible to establish the immunochemical relationship between electrophoretically dissimilar components. Selective staining may be carried out in the gel medium, thereby adding a third dimension to the analysis.
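
The allozyme banding patterns read from a zymogram can be summarised with a small Python sketch. It rests on a simplified model that is an assumption here rather than a statement from the text: one locus with two co-dominant alleles, both subunit types produced in equal amounts, subunits associating at random, and every subunit combination resolving as a separate band. Under that model a heterozygote for an enzyme with n subunits shows n + 1 bands with binomial relative intensities, while a homozygote shows one band. The function name `expected_bands` is hypothetical.

```python
from math import comb


def expected_bands(n_subunits, heterozygous):
    """Relative band intensities expected on a zymogram for one locus.

    n_subunits   -- polypeptides per active enzyme (1 = monomer, 2 = dimer, ...)
    heterozygous -- True if the plant carries two different alleles

    Assumes two alleles, equal expression and random subunit association.
    """
    if n_subunits < 1:
        raise ValueError("an enzyme needs at least one subunit")
    if not heterozygous:
        # Only one kind of subunit exists, so only one band can form.
        return [1]
    # k subunits of one allelic type and n - k of the other: C(n, k) ways.
    return [comb(n_subunits, k) for k in range(n_subunits + 1)]


if __name__ == "__main__":
    for name, n in [("monomer", 1), ("dimer", 2), ("tetramer", 4)]:
        print(name, "homozygote:", expected_bands(n, False),
              "heterozygote:", expected_bands(n, True))
```

In practice the observed pattern can deviate from these ratios, for example when the isoenzymes differ in specific activity or staining conditions, as cautioned under Electrophoresis.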
Catalysis

Isoenzymes may differ from one another in a variety of properties, including affinity for substrates, behaviour towards coenzyme analogues, sensitivity to inhibitors or denaturing agents, amino acid composition and sequence, pH optimum and isoelectric point (pI), thermal stability, Vmax and/or regulatory properties, and specific activity. In order to investigate the catalytic properties of isoenzymes, it is imperative that each one be purified to a state approaching homogeneity. Since this criterion is difficult to achieve, reliable studies in this area of investigation are limited.

Obvious limitations of the above procedures for isoenzyme detection are the development of reagent systems (so far, only about 50 different reagent systems have been developed, and each plant species requires a specific modification) and tissue variability (some enzymes are better expressed in roots, whereas others are best sampled in leaves). About 90 isoenzyme systems have been used for plants, with isozyme loci being mapped in many cases. As a consequence of this small number of isozymes, the percentage of genome coverage is inadequate for a thorough study of genetic diversity. Further, since differential expression of the genes may occur at different developmental stages or in different tissues, the same type of material must be used for all experiments. Other issues in interpreting isoenzyme banding patterns are the quaternary structure of the enzymes (whether monomeric, dimeric, etc.), whether the plant is homozygous or heterozygous at each gene locus, the number of gene loci, the number of alleles per locus and how the genes are inherited. In order to overcome these limitations, the next generation of markers, namely, DNA or molecular markers, was introduced after the 1950s, since the properties of nucleic acids were elucidated during this period. Of late, the paramount role of molecular markers in plant breeding (when compared with the other two types of markers) has been documented in almost all crop plants. Before getting into the basics and details of molecular markers, it is appropriate to introduce genome structure and organisation in crop plants.

Genome Structure and Organisation

The genome is the sum of the entire DNA of an individual or a species. It includes the entire DNA, not just the genes. For simple viruses, with a single nucleic acid molecule, the genome is obvious, although of course for RNA viruses, it is RNA rather than DNA. For haploid prokaryotes, it is also straightforward, except for the plasmids, one copy of which is counted in the genome. For eukaryotes, one haploid copy of the DNA of each of the diploid pairs of chromosomes (the autosomes) is included, plus one copy of the DNA of the sex chromosomes. Thus, the female and male genomes will differ if there is a difference in sex chromosomes. One copy of the DNA from any organelles other than the nucleus, such as the mitochondria and chloroplasts, should also be included. The majority of the DNA of a genome is not in the genes themselves and their known associated regulatory sequences. While the phenomenon of gene regulation is beginning to be understood, little is known of the significance of the majority of the non-genic DNA, or whether it has any function other than acting as a spacer between genes. In most species, a large fraction of the DNA consists of repeated sequences that can cause genetic recombination and unequal crossing over (discussed in chapter 4), resulting in genomic rearrangements, but their overall significance is not understood.

The main role of the genome is to provide gene products, but in many genomes, only 1% or so of the DNA is transcribed and translated during normal cellular activities. Striking evidence suggests that the actual coding capacity is likely to be relatively constant among plants. For example, when the genomes of Arabidopsis and maize were compared with the sequence information obtained from cDNAs, it indicated that both genomes code for essentially the same number of genes, although the genome sizes differ by more than an order of magnitude. Similarly, maize and sorghum are closely related plants, and both have ten chromosomes in the haploid set, but the maize genome is more than three times the size of that of sorghum. When DNA fragments from maize were used in hybridisation
analyses with sorghum sequences, homology was shared predominantly by low copy number sequences and unique sequences. In fact, several of the genes in sorghum show the same chromosomal arrangement as their counterparts in maize. From these and similar analyses, the extra DNA that accounts for the difference in maize and sorghum genome size apparently comprises mostly non-coding repetitive sequences between genes. This finding supports the conclusion that the majority of nuclear DNA may play a supporting role in the structure and organisation of the genome but does not contribute directly to its protein-coding capacity.
The size of the nuclear genome varies among organisms. The DNA content of haploid eukaryotic cells (referred to as the C value) ranges from 10^7 to 10^11 base pairs (bp). Although it has been assumed that organism complexity correlates roughly with genome size (humans have larger genomes than most insects, and insects have larger genomes than fungi), this correlation is by no means universal. For example, some amphibians have genomes almost 50 times larger than that of humans, and cartilaginous fish generally have larger genomes than bony fish. The lack of a direct relationship between genome size and organism complexity is called the C-value paradox. We have no satisfactory explanation yet for the C-value paradox, but in plants, at least, we know that genome size can to some degree be attributed to repetitive DNA and duplicated genomes (due to polyploidy).
The general nature of the eukaryotic genome can be assessed by the kinetics with which denatured DNA reassociates. The extent of the reassociation reaction depends on the product of DNA concentration (C0) and time of incubation (t), usually described simply as the Cot value. A useful parameter is derived by considering the conditions when the reaction is half complete, at time t1/2. The value required for half reassociation is called the Cot1/2. Since the Cot1/2 is the product of the concentration and the time required to proceed halfway, a greater Cot1/2 implies a slower reaction and thereby fewer copies of similar sequences available to reassociate. The Cot1/2 of a reaction therefore indicates the total length of different sequences that are present. This is described as the complexity, usually given in base pairs. The renaturation of the DNA of any genome (or part of a genome) should display a Cot1/2 that is proportional to its complexity. Thus, the complexity of any DNA can be determined by comparing its Cot1/2 with that of a standard DNA of known complexity. Usually E. coli DNA is used as a standard. Its complexity is taken to be identical with the length of the genome (implying that every sequence in the E. coli genome of 4.2 × 10^6 bp is unique).
From the perspective of genetics, a major difference between prokaryotic and eukaryotic cells is that a eukaryote has a nuclear envelope, which surrounds the genetic material to form a nucleus and separates the DNA from the other cellular contents. In prokaryotic cells, the genetic material is in close contact with other components of the cell, a property that has important consequences for the way in which genes are controlled. Another fundamental difference between prokaryotes and eukaryotes lies in the packaging of their DNA. In eukaryotes, DNA is closely associated with a special class of proteins, the histones, to form tightly packed chromosomes. This complex of DNA and histone proteins is termed chromatin, which is the stuff of eukaryotic chromosomes. Histone proteins limit the accessibility of enzymes and other proteins that copy and read the DNA, but they enable the DNA to fit compactly into the nucleus. Eukaryotic DNA must separate from the histones before the genetic information in the DNA can be accessed. Prokaryotes, however, do not possess histones, so their DNA does not exist in the highly ordered, tightly packed arrangement found in eukaryotic cells. The copying and reading of DNA are therefore simpler processes in eubacteria.
Genes of prokaryotic cells are generally on a single, circular molecule of DNA, the chromosome of the prokaryotic cell. In eukaryotic cells, genes are located on multiple, usually linear DNA molecules (multiple chromosomes). Eukaryotic cells therefore require mechanisms that ensure that a copy of each chromosome is faithfully transmitted to each new cell. This generalisation (a single, circular chromosome in prokaryotes and multiple, linear chromosomes in eukaryotes) is not always true. A few
bacteria have more than one chromosome, and important bacterial genes are frequently found on other DNA molecules called plasmids. Furthermore, in some eukaryotes, a few genes are located on circular DNA molecules found outside the nucleus, such as in mitochondria and chloroplasts.
Each eukaryotic species has a characteristic number of chromosomes per cell: potatoes have 48 chromosomes, fruit flies have 8 and humans have 46. There appears to be no special relationship between the complexity of an organism and its number of chromosomes per cell. In most eukaryotic cells, there are two sets of chromosomes. The presence of two sets is a consequence of sexual reproduction; one set is inherited from the male parent and the other from the female parent. Each chromosome in one set has a corresponding chromosome in the other set, together constituting a homologous pair.

Chromosome Structure

The chromosomes of eukaryotic cells are larger and more complex than those found in prokaryotes, but each unreplicated chromosome nevertheless consists of a single molecule of DNA. Although linear, the DNA molecules in eukaryotic chromosomes are highly folded and condensed; if stretched out, some human chromosomes would be several centimetres long, thousands of times longer than the span of a typical nucleus. To package such a tremendous length of DNA into this small volume, each DNA molecule is coiled again and again and tightly packed around histone proteins, forming the rod-shaped chromosomes. Most of the time, the chromosomes are thin and difficult to observe, but before cell division, they condense further into thick, readily observed structures; it is at this stage that chromosomes are usually studied in genetic mapping.
A functional chromosome has three essential elements: a centromere, a pair of telomeres and origins of replication. The centromere is the attachment point for spindle microtubules, which are the filaments responsible for moving chromosomes during cell division. The centromere appears as a constricted region that often stains less strongly than does the rest of the chromosome. Before cell division, a protein complex called the kinetochore assembles on the centromere, to which spindle microtubules later attach. Chromosomes without a centromere cannot be drawn into the newly formed nuclei; these chromosomes are lost, often with calamitous consequences for the cell. Telomeres are the natural ends, the tips, of a linear chromosome; they serve to stabilise the chromosome ends. If a chromosome breaks, producing new ends, these ends have a tendency to stick together, and the chromosome is degraded at the newly broken ends. Telomeres provide chromosome stability. Origins of replication are the sites where DNA synthesis begins; they are not easily observed by microscopy. In preparation for cell division, each chromosome replicates, making a copy of itself. These two initially identical copies, called sister chromatids, are held together at the centromere. Each sister chromatid consists of a single molecule of DNA.

Mitochondrial DNA

In animals and most fungi, the mitochondrial genome consists of a single, highly coiled, circular DNA molecule (mtDNA). Plant mitochondrial genomes often exist as a complex collection of multiple circular DNA molecules. Each mitochondrion contains multiple copies of the mitochondrial genome, and a cell may contain many mitochondria. Like eubacterial chromosomes, mtDNA lacks the histone proteins normally associated with eukaryotic nuclear DNA. The guanine–cytosine (GC) content of mtDNA is often sufficiently different from that of nuclear DNA that mtDNA can be separated from nuclear DNA by density gradient centrifugation. Mitochondrial genomes are small compared with nuclear genomes and vary greatly in size among different organisms. Most of this size variation is in non-coding sequences such as introns and intergenic regions. Flowering plants (angiosperms) have the largest and most complex mitochondrial genomes known; their mitochondrial genomes
range in size from 186,000 bp in white mustard to 2,400,000 bp in muskmelon. Even closely related plant species may differ greatly in the sizes of their mtDNA. Part of the extensive size variation in the mtDNA of flowering plants can be explained by the presence of large direct repeats, which constitute large parts of the mitochondrial genome. Crossing over between these repeats can generate multiple circular chromosomes of different sizes. The mitochondrial genome in turnip, for example, consists of a master circle of 218,000 bp that has direct repeats. Homologous recombination between the repeats can generate two smaller circles of 135,000 bp and 83,000 bp. Other species contain several direct repeats, providing possibilities for complex crossing-over events that may increase or decrease the number and sizes of the circles.

Chloroplast DNA

Geneticists have long recognised that many traits associated with chloroplasts exhibit cytoplasmic inheritance, indicating that these traits are not encoded by nuclear genes. In 1963, chloroplasts were shown to have their own DNA. Among different plants, the chloroplast genome ranges in size from 80,000 to 600,000 bp, but most chloroplast genomes range from 120,000 to 160,000 bp. Chloroplast DNA (cpDNA) is usually contained on a single, double-stranded DNA molecule that is circular, is highly coiled and lacks associated histone proteins. As in mtDNA, multiple copies of the chloroplast genome are found in each chloroplast, and there are multiple organelles per cell; so there are several hundred to several thousand copies of cpDNA in a typical plant cell.

Molecular Markers

A molecular marker is defined as a particular segment of DNA that differs among individuals at the nucleotide level. Molecular markers may or may not correlate with phenotypic expression of a trait. Molecular markers offer numerous advantages over conventional morphological markers and isoenzymes. They are stable and detectable in all tissues regardless of growth, differentiation, development and status of the cell. Further, they are not confounded by environmental, pleiotropic and epistatic effects. The publication of Botstein et al. in 1980 on the construction of genetic maps using restriction fragment length polymorphism (RFLP) was the first reported molecular marker technique for the detection of DNA polymorphism. After the invention of the polymerase chain reaction (PCR; see Box 3.1), several PCR-based markers were developed. Thus, the basic techniques used to identify such third-generation markers can be classified into two categories: (1) non-PCR-based or hybridisation-based techniques and (2) PCR-based techniques. Depending on the need and modifications in the techniques, a second generation of advanced molecular markers has been developed, and they are discussed in the following sections.
Though there are several marker techniques available at this point, it is essential to consider that an ideal molecular marker technique for genetic mapping should meet at least the following criteria: (1) be polymorphic and evenly distributed throughout the genome; (2) provide adequate resolution of genetic differences; (3) generate multiple, independent and reliable markers; (4) be simple, quick and inexpensive; (5) need small amounts of tissue and DNA samples; (6) have linkage to distinct phenotypes; and (7) require no prior information about the genome of an organism. Unfortunately, no molecular marker technique is ideal for every situation. Techniques differ from each other with respect to important features such as genomic abundance, level of polymorphism detected, locus specificity, reproducibility, technical requirements and cost. The following sections describe the principle of each marker technique, its advancement and its applications in plant breeding. Table 3.1 gives a comprehensive view of marker techniques, their applications and limitations. The details/features of each marker class are furnished in Table 3.2.
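As a toy illustration of the definition given at the start of this section (a DNA segment that differs among individuals at the nucleotide level), the following minimal Python sketch lists the positions at which two already-aligned sequences differ. The function name `polymorphic_sites` and the two example sequences are hypothetical and are not taken from the text; real marker discovery works on alignments of many individuals and must handle gaps.

```python
from __future__ import annotations


def polymorphic_sites(seq_a: str, seq_b: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, base in A, base in B) wherever two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1)
        if a != b
    ]


# Hypothetical alleles from two individuals; they differ at a single nucleotide.
individual_1 = "ATGCCGTA"
individual_2 = "ATGTCGTA"

if __name__ == "__main__":
    print(polymorphic_sites(individual_1, individual_2))
```

A segment with no differing positions across the individuals studied would be monomorphic and therefore of no use as a marker in that population.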
Box 3.1 PCR


For DNA marker analysis, it is essential to have a large quantity of a specific DNA fragment, and scientists found it difficult to produce such quantities before the 1980s. It was Dr. Kary Banks Mullis who provided the solution to this limitation in the form of the polymerase chain reaction (PCR), and he received the Nobel Prize in Chemistry in 1993 for this invention. Since then, this process has been regarded as one of the scientific techniques of the twentieth century with immense potential, and it is now very hard to find a molecular laboratory without a PCR machine. PCR utilises the ability of DNA polymerase to synthesise a new strand of DNA complementary to the given template strand. Since DNA polymerase can add a nucleotide only onto a pre-existing 3′-OH group, it needs a primer to which it can add the first nucleotide. This requirement makes it possible to amplify the target region of the DNA template. At the end of the PCR reaction, the specific target DNA sequence (in our case, the marker region) will have accumulated in billions of copies (Fig. 3.1).

Fig. 3.1 Exponential amplification of target sequence using PCR. Panel labels: denaturation (94–96 °C), annealing (30–55 °C), extension (72 °C); exponential amplification over 30–35 cycles (2nd cycle, 8 copies; 3rd cycle, 16 copies; 35th cycle, 2^36 copies)
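The doubling shown in Fig. 3.1 can be sketched numerically. The short Python example below is an idealised model, not a description of any particular instrument: it assumes the number of double-stranded target molecules grows as N = N0 × (1 + E)^n, where the per-cycle efficiency E is a hypothetical parameter (E = 1 means perfect doubling). The function name `pcr_copies` is likewise an assumption, and the counts may differ from the labels in the figure, which may count strands differently.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised copy number after `cycles` rounds: N = N0 * (1 + efficiency) ** cycles."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    # Perfect doubling from a single template molecule.
    for n in (1, 2, 3, 30, 35):
        print(f"cycle {n:>2}: {pcr_copies(1, n):.3g} copies")
    # A less efficient reaction (assumed 80% efficiency per cycle) for comparison.
    print(f"30 cycles at E = 0.8: {pcr_copies(1, 30, 0.8):.3g} copies")
```

Even with perfect doubling, 30 cycles from one molecule give about 10^9 copies, which is consistent with the "billions of copies" mentioned above; real reactions plateau earlier as reagents become limiting.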

How It Works?

The PCR reaction requires the following components:
DNA Template: It is the sample DNA that contains the target sequence. At the beginning of the reaction, high temperature is applied to the original double-stranded DNA molecule to separate the strands from each other, and this process is termed denaturation.
DNA Polymerase: It is a type of enzyme that synthesises new strands of DNA complementary to the template. The first and most commonly used enzyme is Taq DNA polymerase (isolated from Thermus aquaticus). Alternatively, Pfu DNA polymerase (obtained from Pyrococcus furiosus) is used widely because of its higher fidelity when copying DNA. Although these enzymes are subtly different, they both have two capabilities that



make them suitable for PCR: (1) they can generate new strands of DNA using a DNA template and primers and (2) they are heat resistant. Generally, the DNA polymerases of eukaryotes break down at temperatures well below 95 °C, the temperature necessary to separate two complementary strands of DNA in a test tube. Hence, the DNA polymerases most often used in PCR come from the above-said thermophilic microbes, which live in hot environments such as hot springs. Such enzymes can survive near-boiling temperatures and work quite well at 72 °C.
Primers: They are short pieces of single-stranded DNA that are complementary to the 3′ ends of the target region on each template strand. Depending on the marker class, we need to provide either a single primer (in the case of RAPD, ISSR, etc.) or two primers (forward and reverse primers; in the case of SSR, CAPS, etc.). The polymerase begins synthesising new DNA from the 3′ end of the primer. Through complementary base pairing, one primer attaches to the top strand at one end of the target DNA and the other attaches to the bottom strand at the other end. In most cases, since the primers are more than 20 bases long, they target just a single locus in the entire genome.
Nucleotides (dNTPs or Deoxynucleotide Triphosphates): They are single units of the bases A, T, G and C, which are essentially the building blocks for synthesising new DNA strands.
Buffers and sterile water: These are added to the PCR mix to maintain the pH, to counter other deleterious effects of the chemical reaction that affect the PCR and to maintain the optimum activity of the enzyme. Divalent ions such as Mg2+ are also supplied since they are cofactors for the DNA polymerase.
PCR Program: PCR relies on thermal cycling, consisting of 30–40 cycles of repeated heating and cooling of the reaction for DNA melting and enzymatic replication of the DNA (Fig. 3.1). A PCR program contains a minimum of five different steps, each characterised by a specific temperature. This is done on an automated PCR thermal cycler or PCR machine.

Step 1: Initialisation or Initial Denaturation
This step consists of heating the reaction to a temperature of 94–96 °C (or 98 °C if extremely thermostable polymerases are used), which is held for 1–9 min. At this temperature, almost all the DNA is denatured by disruption of the hydrogen bonds between complementary bases, yielding single-stranded DNA molecules.

Step 2: Denaturation
It usually consists of heating the reaction to 94–98 °C for 20–30 s.

Step 3: Annealing
The reaction temperature is lowered to 50–65 °C for 20–40 s, allowing annealing of the primers to the single-stranded DNA template. Typically, the annealing temperature is about 3–5 °C below the melting temperature (Tm) of the primers used (the melting temperature can be obtained from the data sheet provided by the commercial company that synthesised the primer). Stable DNA–DNA hydrogen bonds are formed only when the primer sequence very closely matches the template sequence. The polymerase binds to the primer–template hybrid and is ready to begin new DNA strand synthesis.

Step 4: Extension or Elongation
The temperature of this step is fixed depending on the type of DNA polymerase used in the PCR mix. For example, Taq polymerase has its optimum activity temperature at 75–80 °C, and commonly a temperature of 72 °C is used with this enzyme. At this step, the DNA polymerase synthesises a new DNA strand complementary to the DNA template strand by adding dNTPs that are complementary



to the template in the 5′ to 3′ direction. This is done by condensing the 5′-phosphate group of the dNTPs with the 3′-hydroxyl group at the end of the nascent (extending) DNA strand. Thus, the polymerase enzyme adds dNTPs from 5′ to 3′, reading the template from the 3′ to the 5′ side, to make two double-stranded molecules. The extension time depends both on the DNA polymerase used and on the length of the DNA fragment to be amplified. As a rule of thumb, at its optimum temperature, the DNA polymerase will polymerise about 1,000 bases per minute.
Steps 2–4 are repeated for 30–35 cycles. Under optimum conditions, that is, if there are no limitations due to limiting substrates or reagents, at each extension step, the amount of DNA target is doubled, leading to exponential (geometric) amplification of the specific DNA fragment.

Optional Step: Final Elongation
This single step is occasionally performed at a temperature of 70–74 °C for 5–15 min after the last PCR cycle to ensure that any remaining single-stranded DNA is fully extended.

Step 5: Final Hold
This step is set at 4–15 °C for an indefinite time and may be employed for short-term storage of the PCR products.

Tips to Improve the PCR

The requirement of an optimal PCR reaction is to amplify a specific locus without any unspecific by-products. Therefore, annealing needs to take place at a sufficiently high temperature to allow only the perfect template–primer matches to occur in the reaction. For any given primer pair, the PCR program can be selected based on the composition (GC content) of the primers and the length of the expected PCR product. In the majority of cases, the products expected to be amplified are relatively small (from 0.1 to 3 kb). The activity of the Taq polymerase is about 1,000 nucleotides/min at optimal temperature (72–78 °C), and the extension time in the reaction can be calculated accordingly. As the activity of the enzyme may not always be optimal during the reaction, an easy rule is to consider an extension time (in minutes) equal to the number of kb of the product to be amplified (e.g. 1 min for a 1 kb product, 2 min for a 2 kb product).
Many researchers use a 2–5-min first denaturing step before the actual cycling starts. This is supposed to help denature the target DNA better (especially hard-to-denature templates, as found in polyploids). Also, a final extension time of 5–10 min is described in many reports (to finish the elongation of many or most PCR products initiated during the last cycle). A denaturing time of 20–50 s is sufficient to achieve good PCR products during the cyclic process. A long denaturing time will expose Taq polymerase to high temperatures for a long time and hence may decrease the activity of the enzyme.
The annealing temperature can be chosen based on the melting temperature of the primers. A simple procedure is to use an initial annealing temperature of 54 °C (usually good for most primers with a length of 20 bp or more). The annealing temperature should not be much lower unless you have designed the primer from a heterologous sequence. If unspecific products result, this temperature should be increased. If the reaction is specific (only the expected product is synthesised), the melting temperature can be used as it is. Gradient PCR can be employed



with different annealing temperatures when the primers are designed from heterologous systems. To calculate Tm for duplex DNA of <50 bp, use the following simple rule:
Count the number of A or T and of G or C
Add 2 °C for each A or T
Add 4 °C for each G or C
In general, 30 cycles are sufficient for a usual PCR reaction. Little or no quantitative change (i.e. in the relative amounts of PCR products) was observed when increasing the number of cycles from 30 to 45. Little quantitative gain was noticed when increasing the number of cycles up to 60.
Like a simple PCR, multiplex reactions should be done at a stringent enough temperature, allowing amplification of all loci of interest without any by-products. Although many individual loci can be specifically amplified at an annealing temperature of 56–60 °C, experiments showed that lowering the annealing temperature by 4–6 °C was required for the same loci to be co-amplified in multiplex mixtures. Due to differences in base composition, length of product or secondary structure, some loci are more efficiently amplified than others. When many loci are simultaneously amplified (multiplexed), the more efficiently amplified loci will negatively influence the yield of product from the less efficient loci. This phenomenon is due in part to the limited supply of enzyme and nucleotides in the PCR reaction. Therefore, during the multiplex procedure, a sufficient quantity of PCR components should be added.
While people typically measure DNA quantity in ng, the relevant unit is actually moles, that is, how many copies of the sequence will anneal with the primers. Thus, the amount of DNA in ng that you need to add is a function of its complexity. In theory, a single molecule of DNA can be used in PCR, but normally people use between 1,000 and 100,000 molecules for eukaryotic nuclear DNA. Both DNA template quality and PCR product size affect the amount of DNA added to the PCR mix. If the DNA possesses very high molecular weight (such as in polyploids) and/or the PCR product length is short (e.g. an SSR), less DNA can be used since a higher fraction of the molecules will contain the annealing sites for both the forward and reverse primers. If the DNA is degraded and you want to amplify a large product, it may not work, but the same DNA may be fine for amplifying SSRs.
Standard Mg2+ concentration is 2 mM, but sometimes the concentration needs to be raised (rarely lowered) to get a PCR to work. Raising Mg2+ lowers specificity and is roughly comparable to lowering the annealing temperature. It may cause multiple bands to appear (or, occasionally, disappear).
It is better to heat up the thermocycler block to a high temperature (>100 °C) before starting the PCR program. This is not a true hot start, but it may improve the specificity of the reaction.
Nested PCR can be employed using the primary PCR product as template with new forward and reverse primers that are designed internal to the original ones. It will eliminate extra bands if the first PCR is messy and produce a robust band where the first PCR is weak or even invisible. Besides, this method saves genomic DNA.
Enzymes are expensive and perishable. It is better to follow all the rules that specify the usage of enzymes (such as storage at −20 °C in a frost-free freezer in 50% glycerol and wearing gloves when handling the enzymes). Before you open a new tube of enzyme, first spin it briefly as there is often enzyme in the cap. This is particularly true



for temperature-sensitive enzymes that may be put in ice: enzyme in the cap does not stay cold. Spin tubes as necessary to keep enzyme at the bottom. Avoid trying to measure out minute quantities of enzyme, as the 50% glycerol storage buffer makes this impossible. When pipetting enzyme from a stock tube, place the end of the tip just far enough into the enzyme to get what you need, and do not plunge the tip way down into the solution, as the outside of the tip will become covered with enzyme and your measurement will be off. Whenever possible, make a cocktail of enzyme, buffer, water, etc., and aliquot this as appropriate. Do not add enzyme to unbuffered water, which will denature it. Mix water and buffer first, place on ice, then add enzyme. The volume of the enzyme should be less than one-tenth of the final volume of the reaction mixture, as too much glycerol can interfere with enzyme activity.
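Several rules of thumb from Box 3.1 (2 °C per A/T and 4 °C per G/C for Tm, annealing a few degrees below Tm, and about 1 min of extension per kb) can be combined into a rough program outline. The Python sketch below is illustrative only: the function names, the 5 °C annealing offset, the specific hold times and the two example primers are assumptions chosen from within the ranges quoted in the box, and the simple 2 + 4 rule is only an approximation for short oligonucleotides.

```python
import math


def wallace_tm(primer: str) -> int:
    """Approximate Tm (deg C): 2 for each A or T plus 4 for each G or C."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G and T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def extension_seconds(product_bp: int) -> int:
    """Extension time from the 1 min per kb rule, rounded up to whole seconds."""
    return math.ceil(product_bp * 60 / 1000)


def sketch_program(forward: str, reverse: str, product_bp: int, offset: int = 5) -> list:
    """Return (step, temperature in deg C, seconds) tuples; all values are assumed examples."""
    tm = min(wallace_tm(forward), wallace_tm(reverse))  # the lower Tm limits annealing
    return [
        ("initial denaturation", 94, 180),
        ("denaturation", 94, 30),           # cycled step
        ("annealing", tm - offset, 30),     # cycled step
        ("extension", 72, extension_seconds(product_bp)),  # cycled step
        ("final extension", 72, 300),
        ("hold", 4, None),                  # indefinite
    ]


if __name__ == "__main__":
    fwd = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer
    rev = "GGCCTTAAGGCCTTAAGGCC"  # hypothetical 20-mer
    for step in sketch_program(fwd, rev, product_bp=1500):
        print(step)
```

If unspecific products appear with such a program, the box suggests raising the annealing temperature, which here would mean choosing a smaller `offset`.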

Restriction Fragment Length Polymorphism (RFLP)

In RFLP, DNA polymorphism is detected by hybridising a chemically labelled DNA probe to a Southern blot of sample DNA that has been digested with restriction endonucleases (Botstein et al. 1980). Thus, RFLP generates a differential banding profile arising from nucleotide substitutions or DNA rearrangements such as insertions or deletions, or from single-nucleotide polymorphisms in the recognition sites of the restriction enzymes (Fig. 3.2). Further, the detection of polymorphism also depends on the use of a DNA probe. The DNA probe is a radioactively labelled DNA sequence that hybridises with one or more fragments of the restriction enzyme-digested DNA sample after they have been separated by gel electrophoresis. Short, single or low copy genomic DNA or cDNA clones are typically used as RFLP probes. Thus, RFLP is specific to the probe and restriction enzyme combination and hence results in a unique banding pattern characteristic of a specific genotype at a specific locus. RFLP markers are relatively highly polymorphic, co-dominantly inherited and highly reproducible. Because of their presence throughout the plant genome, high heritability and locus specificity, the RFLP markers are considered superior. The method also provides the opportunity to screen numerous samples simultaneously. DNA blots can be analysed repeatedly by stripping and re-probing (usually eight to ten times) with different RFLP probes. However, RFLP is not widely used in linkage mapping since it is time consuming, involves expensive and radioactive/toxic reagents and requires a large quantity of high-quality genomic DNA. The requirement of prior sequence information for probe generation further increases the complexity of the methodology. These limitations led to the conceptualisation of a new set of less technically complex methods that are based on PCR.

PCR-Based Techniques

After the invention of polymerase chain reaction (PCR) technology (Mullis and Faloona 1987; Box 3.1), a large number of approaches for the generation of a variety of molecular markers were described and used in genetic mapping. This is primarily due to its obvious simplicity and high probability of success. Further, the usage of random primers overcame the limitation of prior sequence knowledge for PCR analysis and facilitated the development of genetic markers for a range of purposes. PCR-based techniques can further be subdivided into two subcategories: (1) arbitrarily primed PCR-based techniques or sequence-nonspecific techniques and (2) sequence-targeted PCR-based techniques.
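As a toy illustration of the RFLP principle described above (a change in a restriction enzyme recognition site alters the fragment lengths), the Python sketch below digests two short linear sequences in silico. It is a simplification under stated assumptions: the sequences and the function name `digest` are hypothetical, only the top strand is searched, the EcoRI site GAATTC (cut after the first G) is used as the example enzyme, and the probe hybridisation step that decides which fragments are actually seen on a Southern blot is not modelled.

```python
from __future__ import annotations


def digest(seq: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Fragment lengths after cutting a linear sequence at every occurrence of `site`."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)  # cut position within the recognition site
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical alleles: allele_b carries a single base change (A to G)
# that destroys the second recognition site.
allele_a = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TTTT"
allele_b = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAGTTC" + "TTTT"

if __name__ == "__main__":
    print("allele A fragments:", digest(allele_a))
    print("allele B fragments:", digest(allele_b))
```

The two alleles give different fragment sets from the same enzyme, and a heterozygote would show both sets, which is why the text describes RFLP markers as co-dominantly inherited.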
Table 3.1 Properties, advantages and limitations of markers used in genetic mapping
Marker class | Principle and mode of inheritance | Advantages | Disadvantages | Applications
Morphological markers | Differences in phenotypic expression of the given trait (e.g. petal colour) | Simple to assay; lowest-cost protocol | Limited in number; laborious and time-consuming procedures | Conventional plant breeding program
Isozymes | Differences in isoenzymes that are detected by gel electrophoresis and specific staining; co-dominant inheritance | Robust and highly reproducible; suitable for estimating a wide range of population genetics parameters and for genetic mapping | Relatively few biochemical assays available to detect enzymes; phenotype-based analysis | Genetic diversity; map construction
RFLP | Differences in the presence or absence of recognition sites in the target region; co-dominant inheritance | Robust and reliable; transferable across laboratories | Time consuming, laborious and harmful; large amount of DNA is required; limited polymorphism, especially in related species | Map construction; hybrid fixation; genetic diversity
RAPD | Differences in primer annealing sites; dominant inheritance | Quick, simple and inexpensive; small amount of DNA is required; multiple loci from a single primer | Poor reproducibility; generally not transferable | Genetic diversity; saturation mapping
SSR | Differences in number of repeats of microsatellite motifs; co-dominant inheritance | Technically simple, robust and reliable; transferable between populations | Time- and cost-intensive initial establishment; usually requires polyacrylamide gel electrophoresis, which is labour intensive | Linkage and QTL mapping; marker-assisted selection; hybrid fixation
AFLP | Differences in the presence or absence of recognition sites and differences in the primer annealing sites; dominant inheritance | Multiple loci; high levels of polymorphism | Large amount of DNA required; complicated methodology | Saturation mapping; genetic diversity
SNP | Differences in the sequence at the single-nucleotide level; co-dominant inheritance | Extremely degraded DNA samples can be used; most common in genome; multiplexing hundreds of markers in a single chip is possible | Each marker has fewer alleles; mixture interpretation is more difficult; requires costly equipment to assay | Fine mapping; map-based cloning
Table 3.2 Comparison of features of different types of markers

Features | Isozymes | RFLP | RAPD | AFLP | SSR | ISSR | SCAR | CAPS | SNP
Origin | Genic | Anonymous/genic | Anonymous | Anonymous | Anonymous/genic | Anonymous | Anonymous/genic | Anonymous/genic | Anonymous/genic
Maximum theoretical number of possible loci in analysis | Limited by the number of enzyme genes and histochemical enzyme assays available | Limited by the restriction site (nucleotide) polymorphism | Limited by the size of genome and by nucleotide polymorphism | Limited by the restriction site (nucleotide) polymorphism | Limited by the size of genome and number of simple repeats in a genome | Limited by the size of genome and by nucleotide polymorphism | Limited by the size of genome | Limited by the size of genome | Limited by the size of genome
Number of loci | 30–50 | 100s | Unlimited | Unlimited | 10s | 1,000s | 10s | 10s | 10s
Null alleles | Rare | Rare to extremely rare | Not applicable (presence/absence type of detection) | Not applicable (presence/absence type of detection) | Occasional to common | Not applicable (presence/absence type of detection) | Rare to extremely rare | Rare to extremely rare | Rare to extremely rare
Degree of polymorphism | Low | Low–medium | Low–medium | Low–medium | Medium–high | Low–medium | Medium–high | Medium–high | Medium–high
Amount of DNA sample required | Few mg of tissue | ~10 µg | ~10 ng | ~25 ng | ~50 ng | ~10 ng | ~25 ng | ~25 ng | ~50 ng
Ease of assay | Easy | Difficult | Easy | Moderate | Easy to moderate | Easy | Moderate | Easy | Easy
Can be automated? | Difficult | Difficult | Yes | Yes | Yes | Yes | Yes | Semi-automated | Yes
Equipment cost | Cheap | Expensive | Moderate | Expensive | Expensive | Moderate | Moderate | Moderate | Expensive
Development cost | Cheap | Expensive | Moderate | Moderate | Very expensive | Moderate | Expensive | Moderate | Expensive
Assay cost | Cheap | Expensive | Moderate | Moderate | Moderate | Moderate | Moderate | Moderate | Moderate to expensive
Transferability | Across families and genera | Across genera | Within species | Within species | Within genus or species | Within genus or species | Within genus or species | Within genus or species | Within genus or species
Reproducibility | Very high | High to very high | Low to medium | Medium to high | Medium to high | Low to medium | High | High | Medium to high
Genome and QTL-mapping potential | Limited | Good | Very good | Very good | Good | Very good | Limited | Limited | Very good
Comparative mapping potential | Excellent | Good | Very limited | Very limited | Good | Very limited | Limited | Limited | Limited
Fig. 3.2 Schematic description of RFLP marker development
(Steps shown: DNA isolated from individuals A and B; restriction digestion; gel electrophoresis; transfer of digested DNA fragments to a membrane (Southern blotting); radioactive DNA probe binds to specific DNA fragments; autoradiography, with X-ray film sandwiched to the membrane to detect the radioactive pattern.)

Arbitrarily Primed PCR-Based Markers

Random Amplified Polymorphic DNA (RAPD)

The basis of the RAPD technique is differential PCR amplification of genomic DNA using short random oligonucleotide sequences (mostly ten bases long) (Fig. 3.3). A differential banding pattern usually arises from rearrangements or deletions at or between oligonucleotide primer binding sites in the genome (Williams et al. 1991). As the approach requires no prior knowledge of the genome being analysed, it can be employed across species using universal primers. Results obtained in plants indicate that RAPDs are dominant, highly polymorphic and informative, and that they complement RFLP markers. RAPD markers offer many advantages, such as a higher frequency of polymorphism, rapidity, technical simplicity, use of fluorescence, feasibility of automation and the requirement of only a few nanograms of DNA. Because of these advantages, RAPD markers have potential application in crop improvement by locating and manipulating genes of interest, identification of somatic hybrids, evaluation and conservation of genetic resources, DNA profiling, population genetics and gene mapping. However, the limitation associated with RAPD technology is inconsistency: PCR reactions are very sensitive to factors such as annealing temperature, template DNA concentration and Mg2+ ion concentration, and hence results are difficult to reproduce even within the same laboratory. Further, as several discrete loci in the genome are amplified by each primer, scoring is complicated. Since they are dominant markers, RAPD profiles cannot be used to distinguish heterozygous from homozygous individuals. Hence, RAPD markers, although useful for genetic studies, should be used with caution. Paran and Michelmore (1993) were able to separate RAPD fragments and clone and sequence those fragments after reamplification. These sequence data were used to design longer PCR primers specific to particular RAPD fragments and to use PCR to consistently produce specific RAPD fragments from genomic DNA (this technique is therefore also called sequence-tagged site, STS). This method eliminates the reproducibility problem associated with RAPD analysis. Reamplification from genomic DNA and subsequent sequencing of the PCR products also allow the identification of any artefacts in RAPD technology.

Arbitrarily Primed Polymerase Chain Reaction (AP-PCR) and DNA Amplification Fingerprinting (DAF)

These techniques are independently developed methodologies, which are variants of RAPD. For AP-PCR (Welsh and McClelland 1990), a single primer (about 10–15 nucleotides long) is used. The technique involves amplification for the initial two PCR cycles at low stringency. Thereafter, the remaining cycles are carried out at higher stringency by increasing the annealing temperature. This variant of RAPD was not very popular as
it involved autoradiography, but it has since been simplified, as the fragments can now be fractionated using agarose gel electrophoresis. The DAF technique uses single arbitrary primers shorter than ten nucleotides for amplification (Caetano-Anollés and Bassam 1993), and the amplicons are analysed on polyacrylamide gels with silver staining.

Fig. 3.3 Schematic description of RAPD marker
(Panels a and b show three loci, 1 to 3, targeted by a random 10 bp oligonucleotide primer; for simplicity only 3 loci are described in the genome. A single base change (x) destroys the target sequence for primer binding, and hence this locus will not amplify from individual B. PCR amplification of the target gene and agarose gel electrophoresis then reveal the difference between individuals A and B.)

Amplified Fragment Length Polymorphism (AFLP)

To overcome the limitation of reproducibility associated with RAPD, AFLP technology (Vos et al. 1995) was developed. It combines the power of RFLP with the flexibility of PCR-based technology by ligating primer recognition sequences (adaptors) to the restricted DNA and selectively amplifying restriction fragments by PCR using a limited set of primers (Fig. 3.4). The primer pairs used for AFLP usually produce 50–100 bands per assay. The number of amplicons per AFLP assay is a function of the number of selective nucleotides in the AFLP primer combination, the selective nucleotide motif, GC content and physical genome size and complexity. The AFLP technique generates fingerprints of any DNA regardless of its source and without any prior knowledge of DNA sequence. Most AFLP fragments correspond to unique positions on the genome and hence can be exploited as landmarks in genetic and physical mapping. The technique can also be used to distinguish closely related individuals at the subspecies level and to map genes. Applications for AFLP in plant mapping include establishing linkage groups in crosses, saturating regions with markers for map-based gene cloning efforts and assessing the degree of relatedness or variability among cultivars. For high-throughput screening, fluorescence-tagged primers are also used for AFLP analysis. The amplified fragments are detected on denaturing polyacrylamide gels using an automated ALF-DNA sequencer with the fragment option (Huang and Sun 1999).

Sequence-Specific PCR-Based Markers

With the advent of high-throughput sequencing technology, abundant information on DNA sequences for the genomes of many plant species has been generated. For the crops where the genome sequencing projects have not yet been
started, large collections of expressed sequence tags (ESTs) are available in public domains. Functional genomics approaches through ESTs offer great scope in the development of gene-based markers for molecular breeding of complex traits. They also provide better knowledge on the activity of genes involved in pest and disease resistance and tolerance to environmental stresses and promise to increase productivity and yield. ESTs have been generated and thousands of sequences have been annotated as putative functional genes using powerful bioinformatics tools. In order to correlate DNA sequence information with particular phenotypes, sequence-specific molecular marker techniques have been designed. The following sections describe such marker techniques in detail.

Fig. 3.4 Schematic representation of AFLP protocol
(Steps shown: recognition sites of MseI (TTAA/AATT) and EcoRI (GAATTC/CTTAAG); digestion of genomic DNA with EcoRI and MseI and ligation of EcoRI and MseI adaptors to the restriction products; pre-amplification with unlabelled primers having a single selective nucleotide; final selective amplification with AFLP primers having 2-3 selective nucleotides, in which the EcoRI-specific primers are labelled; polyacrylamide gel electrophoresis and scoring of the AFLP profile of individuals A and B. AFLP primers consist of three parts: a core sequence (proprietary, not revealed to the public), an enzyme-specific sequence and a selective extension sequence. For simplicity only a few bands are shown; actually there will be 50-100 bands per assay.)

Microsatellite-Based Marker Technique

Microsatellites, also called short tandem repeats (STR), simple sequence repeats (SSR) or sequence-tagged microsatellite sites (STMS), are monotonous repetitions of very short nucleotide motifs (usually one to five base pairs). They occur as interspersed repetitive elements in all eukaryotic
genomes (Tautz and Renz 1984). Variation in the number of tandemly repeated units is mainly due to strand slippage during DNA replication, where the repeats allow matching via excision or addition of repeats. As slippage in replication is more likely than point mutations, microsatellite loci tend to be hypervariable. The regions flanking the microsatellites are generally conserved among species or even among genera, and PCR primers complementary to the flanking regions are used to amplify SSR-containing DNA fragments. The length of the amplified fragment will vary according to the number of repeat units (Fig. 3.5). Microsatellite assays show extensive inter-individual length polymorphisms during PCR analysis of unique loci using discriminatory primer sets. The PCR amplification protocols used for microsatellites employ locus-specific primer pairs that are either unlabelled or carry one radiolabelled or fluorolabelled primer. Analysis of unlabelled PCR products is carried out using polyacrylamide or agarose gels. The employment of fluorescent-labelled microsatellite primers and laser detection (as available in automated sequencers) in genotyping procedures has significantly improved throughput and automation. However, due to the high price of the fluorescent label, which must be carried by one of the primers in the primer pair, the assay becomes costly. Alternatively, Schuelke (2000) introduced a novel procedure in which three primers are used for the amplification of a defined microsatellite locus: a sequence-specific forward primer with an M13(−21) tail at its 5′ end, a sequence-specific reverse primer and the universal fluorescent-labelled M13(−21) primer. This technique has proved to be simple and less expensive.

Microsatellites are highly popular genetic markers because of their co-dominant inheritance, high abundance, enormous extent of allelic diversity and the ease of assessing SSR size variation by PCR with pairs of flanking primers. The reproducibility of microsatellites is such that they can be used efficiently by different research laboratories to produce consistent data, and hence they are considered the markers of choice in many crop-breeding programs. Besides, these markers have high information content, locus specificity and ease of automation for high-throughput screening. Thus, the advent of SSR or microsatellite markers has brought a new, user-friendly and highly polymorphic class of genetic markers to many plant species. However, the higher development cost and effort required to obtain working SSR primers for a given species has restricted their use to only a few of the agriculturally important crops.

Fig. 3.5 Schematic representation of microsatellite or SSR marker development
(The SSR motif is (AT)10 in individual A, (AT)5 in individual B and (AT)20 in individual C. Forward and reverse primers flank the corresponding SSR or microsatellite motif. After PCR amplification and gel electrophoresis, the differing number of repeats reveals the polymorphism: individual B has only 5 repeats and hence its PCR product moves very rapidly, whereas the PCR product of C moves slowly because of its larger size (20 repeats).)
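The link between repeat number and band position shown in Fig. 3.5 can be made concrete with a small calculation. The following is a minimal sketch only: the flank lengths (60 bp and 55 bp, primers included) are hypothetical values chosen for illustration and do not come from any real locus.

```python
# Illustrative sketch: expected SSR amplicon sizes for the alleles of Fig. 3.5.
# LEFT_FLANK_BP and RIGHT_FLANK_BP are hypothetical flank lengths (primers
# included); real values depend on where the primers are placed.

LEFT_FLANK_BP = 60
RIGHT_FLANK_BP = 55


def amplicon_size(motif: str, repeats: int) -> int:
    """Return the PCR product length (bp) for a perfect repeat of `motif`."""
    return LEFT_FLANK_BP + len(motif) * repeats + RIGHT_FLANK_BP


# Repeat numbers of the (AT)n motif in individuals A, B and C (Fig. 3.5).
alleles = {"A": 10, "B": 5, "C": 20}
sizes = {name: amplicon_size("AT", n) for name, n in alleles.items()}

# Smaller fragments migrate faster through the gel, so sort by size.
migration_order = sorted(sizes, key=sizes.get)

for name in migration_order:
    print(f"Individual {name}: (AT){alleles[name]} -> {sizes[name]} bp")
```

Because the flanks are shared by all alleles, only the repeat block changes the product length, which is why the three individuals separate on the gel in the order B, A, C.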
Microsatellites are classified into different types: (1) based on the number of nucleotides per repeat (such as mononucleotide (A)n, dinucleotide (CA)n, trinucleotide (CGT)n, tetranucleotide (CAGA)n, pentanucleotide (AAATT)n and hexanucleotide (CTTTAA)n, where n is the number of repeats), (2) based on the arrangement of nucleotides in the repeat motifs (such as pure or perfect or simple perfect (CA)n, simple imperfect (AAC)n ACT (AAC)n, compound or simple compound (CA)n (GA)n and interrupted or imperfect or compound imperfect (CCA)n TT (CGA)n) and (3) based on the location of SSRs in the genome (such as nuclear (nuSSRs), chloroplastic (cpSSRs) and mitochondrial (mtSSRs)). Several methods have been pursued to develop SSR markers, including analysis of SSR-enriched small insert genomic DNA libraries, SSR mining from ESTs and large insert bacterial artificial chromosome derivation by end sequence analysis. As an example, mining of SSRs from an EST database is described in Box 3.2.
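The first classification criterion, the number of nucleotides per repeat unit, lends itself to a short illustration. The sketch below scans a sequence for perfect repeats with a regular expression and labels each hit by motif length. It is an assumption-laden toy, not the algorithm of any published SSR tool: the function name, the default threshold of four repeats and the example sequence are all hypothetical, and imperfect or compound repeats are not handled.

```python
import re

# Names for the repeat-unit classes listed in the text (1 to 6 nucleotides).
MOTIF_CLASS = {
    1: "mononucleotide", 2: "dinucleotide", 3: "trinucleotide",
    4: "tetranucleotide", 5: "pentanucleotide", 6: "hexanucleotide",
}


def find_perfect_ssrs(sequence: str, min_repeats: int = 4):
    """Return (motif, repeats, class, start) for each perfect SSR found.

    The lazy quantifier makes the regex try the shortest motif first, so
    CACACACA is reported as (CA)4 rather than (CACA)2.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    pattern = re.compile(rf"([ACGT]{{1,6}}?)\1{{{min_repeats - 1},}}")
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        repeats = len(match.group(0)) // len(motif)
        hits.append((motif, repeats, MOTIF_CLASS[len(motif)], match.start()))
    return hits


# Hypothetical sequence containing a (CA)5 and a (CGT)4 repeat.
est = "GGT" + "CA" * 5 + "TTG" + "CGT" * 4 + "AAC"
ssrs = find_perfect_ssrs(est)
for motif, repeats, motif_class, start in ssrs:
    print(f"({motif}){repeats} {motif_class} at position {start}")
```

Start positions are 0-based. Raising `min_repeats` makes the scan more stringent, which is the kind of parameter a user would normally set in a dedicated SSR identification tool.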

Box 3.2 Practising Genotyping of a Mapping Population with SSR Markers

DNA-based marker techniques such as RFLP, RAPD, SSR and AFLP are routinely used in genetic studies, and their advantages as well as limitations have long been realised. Among these markers, SSRs are the molecular breeder's marker of choice. SSRs exist throughout the whole genome of an organism in both non-coding and coding regions. In the past, genomic SSRs (gSSRs) were developed by isolating and sequencing clones containing putative SSR regions, together with designing and testing flanking primers. However, expressed sequence tag (EST)-derived SSRs have some intrinsic advantages over gSSRs because they are present in expressed regions of the genome. In recent years, great efforts have been made to develop gSSRs and EST-SSRs for several crops, and they have been widely used in genetic mapping.

The ESTs containing at least four di-, tri-, tetra-, penta- or hexanucleotide repeats (EST-SSRs) in the crop of interest can be identified using the SSR identification tool (SSRIT) available at http://www.gramene.org/db/markers/ssrtool. The procedure is simple: enter or paste the EST sequence into the text area and select the parameters to identify SSR motifs. Once an EST sequence containing an SSR motif is identified, primers that flank the given SSR motif are to be identified. Such primers can be designed for the flanking regions of the SSR using the web-based software Primer3 v 0.4.0 or higher versions of this program (http://frodo.wi.mit.edu/primer3/). Primers can be designed based on the criteria of 50% GC content, a minimum melting temperature of 50 °C and absence of secondary structure, or other parameters as per the requirement. Primers ranging from 18 to 27 nucleotides in length with amplified products of 100–400 bp can be picked and used for primer synthesis. If possible, primers may be designed within the 5′ or 3′ untranslated region (UTR) (or near the start or stop codon within coding DNA) closest to the repeat motif and/or at the start of the intron (as intronic polymorphic primers) to increase the polymorphic information content. Once the primers are synthesised, PCR (see Box 3.1) can be executed to amplify the SSR motifs from the template DNA samples.

For SSR analysis, three electrophoresis methods are currently employed to determine the length polymorphisms: polyacrylamide gel electrophoresis, MetaPhor agarose gel electrophoresis and automated capillary electrophoresis (see Box 3.4), and all these methods produce comparable and reproducible results. Polyacrylamide gel electrophoresis (PAGE) is the most common method and gives excellent results. The amplification products in polyacrylamide gels are typically visualised with radioactive labelling, fluorescent dye labelling or silver staining. However, these visualisation techniques require either expensive or hazardous radioactive chemicals and are time consuming. On the other hand, capillary electrophoresis can be performed more quickly and is well suited to high-throughput analysis. Capillary electrophoresis with the CEQ™ 8000 Genetic Analysis System, QIAxcel System and ABI 3130xl DNA sequencer can easily separate products and determine allelic size, but it is more expensive and requires more sophistication and expertise. MetaPhor agarose gel electrophoresis (MAGE) is another approach to separate alleles of microsatellite markers. MetaPhor agarose (FMC or Cambrex Corporation, USA) is an intermediate melting temperature agarose (75 °C) that provides twice the resolution capabilities of the finest-sieving agarose products. Using submarine gel electrophoresis, MetaPhor agarose gives high-resolution separation of 20–800 bp DNA fragments that differ in size by 2%, which approximates the resolution of polyacrylamide gels. MetaPhor agarose gels (2–4%) made in either TAE or TBE and stained with ethidium bromide are ideal for resolving SSRs.
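The primer criteria given in Box 3.2 (length of 18 to 27 nucleotides, about 50% GC, melting temperature of at least 50 °C, product of 100 to 400 bp) can be checked with a few lines of code before ordering oligonucleotides. The sketch below is a rough screen under stated assumptions: the melting temperature uses the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is only an approximation and is not the calculation performed by Primer3; the 40 to 60% GC tolerance window, the function names and the example primer are hypothetical; secondary structure is not assessed.

```python
def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Approximate melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def meets_box_criteria(primer: str, product_bp: int,
                       gc_window=(40.0, 60.0), min_tm=50.0) -> bool:
    """Screen one primer against the Box 3.2 criteria.

    gc_window is an assumed tolerance around the 50% GC target.
    """
    return (18 <= len(primer) <= 27
            and gc_window[0] <= gc_content(primer) <= gc_window[1]
            and wallace_tm(primer) >= min_tm
            and 100 <= product_bp <= 400)


# Hypothetical 20-mer with 50% GC and an expected 250 bp product.
candidate = "ATGCATGCATGCATGCATGC"
print(gc_content(candidate), wallace_tm(candidate),
      meets_box_criteria(candidate, product_bp=250))
```

A screen like this only filters out obvious failures; dedicated primer design software remains the appropriate tool for the final choice.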

Organelle Microsatellites

Plant organelle genomes such as chloroplast DNA and mitochondrial DNA have been increasingly applied to study population genetic structure and phylogenetic relationships in plants. Due to their uniparental mode of transmission (Box 3.3), chloroplast and mitochondrial genomes exhibit different patterns of genetic differentiation compared to nuclear alleles. Thus, for a comprehensive understanding of plant population differentiation and evolution, three interrelated genomes must be considered.

Chloroplast Microsatellites

Numerous studies have shown that chloroplast microsatellites consisting of several relatively short mononucleotide stretches (such as (dA)n and (dT)n) are ubiquitous and polymorphic. Chloroplast genome-based markers uncover genetic discontinuities and distinctiveness among or between taxa with slight morphological differentiation, which sometimes cannot be revealed by nuclear DNA markers. The conservation and homology of sequence in the chloroplast genome makes it possible to compare genes across the plant kingdom and examine phylogenetic relationships in taxa that have diverged for hundreds of thousands to millions of years. Chloroplast microsatellites are now becoming firmly established as a high-resolution tool for examining patterns of cytoplasmic variation in a wide range of plant species. Chloroplast microsatellites are particularly effective markers for studying mating systems, gene flow via both pollen and seeds and uniparental lineage. Chloroplast microsatellite-based markers have been used for the detection of hybridisation and introgression and the analysis of the genetic diversity and phylogeography of plant populations. One limitation of the approach is the need for sequence data for primer construction. Primer sequences flanking chloroplast microsatellites are usually inferred from fully or partially sequenced chloroplast genomes. In general, these primer pairs produce polymorphic PCR fragments from the species of origin and their close relatives, but transferability to more distant taxa is limited. Attempts to design universal primers to amplify chloroplast microsatellites have resulted in a set of consensus chloroplast microsatellite primers that aims at amplifying cpSSR regions in the chloroplast genome of dicotyledonous angiosperms (Weising and Gardner 1999).

Mitochondrial (mt) Microsatellites

In contrast to animal mtDNA, which typically has a size of 10 MDa per mitochondrial genome, plant mtDNA is far more complex. For example, the maize mitochondrial genome has been estimated to be 320 MDa. In addition to larger size, plant
Box 3.3 Features of Molecular Markers

Dominant, Co-dominant and Cytoplasmic or Uniparentally Inherited Markers

For diploid organisms (organisms harbouring two copies of each chromosome), the exact genotype of each individual comprises the two alleles it carries at the given marker. In contrast, for markers such as RAPD, AFLP and ISSR, it is only possible to describe whether the given marker allele (e.g. A) is present or not at the given locus. Therefore, in such cases, one cannot distinguish the heterozygous genotype (Aa) from the homozygous genotype (AA). This genotyping method clearly incurs a loss of information, and such markers are referred to as dominant markers. Alternatively, SSRs, RFLPs, etc., are called co-dominant markers since they can distinguish a heterozygote (two bands for Aa, i.e. the bands produced by AA and aa co-occur) from each of the homozygotes AA and aa (a single band of different size for AA and aa) (Fig. 3.6).

Fig. 3.6 Diagrammatic explanation for dominant (a) and co-dominant (b) marker that reveals homozygotes (AA or aa) and heterozygotes (Aa)

Dominant markers allow the analysis of many loci per experiment without requiring any prior information on their sequence. For predominantly self-fertilising species, heterozygosity could be disregarded, and allele frequencies can be considered equal to observed frequencies. In contrast, co-dominant markers allow analysis of only one locus per experiment, and hence the amount of data per assay is usually lower. Nevertheless, they are more informative since the allelic variations of that locus can be distinguished. As a consequence, we can identify the linkage groups shared between different genetic maps. However, it is imperative to know the sequence of the particular locus precisely. There is yet another feature of molecular markers that is worth mentioning here: cytoplasmic markers, which are uniparentally inherited (either maternally or paternally). Mitochondrial- and chloroplast-specific SSR or SNP markers are placed under this category (refer to the text for detail), and use of such markers requires adequate caution during linkage mapping.

Polymorphism Information Content (PIC)

PIC value is commonly used in genetics as a measure of polymorphism for a marker locus used in linkage analysis. It is the probability that one could identify which marker allele of a parent has been inherited by the offspring. PIC can be calculated as described in Chap. 1 or using the freely available program CERVUS v2.0. The PIC value for co-dominant markers ranges from 0 to 1.0, and for dominant markers it has a maximum value of 0.5.
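As an illustration of how PIC responds to allele frequencies, the sketch below implements the widely used formula of Botstein et al. (1980), PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², for co-dominant loci. Whether Chap. 1 uses exactly this form is not shown here, so treat it as an assumption. The second function uses the simplified expression 2f(1 − f), often applied to presence/absence (dominant) data, where f is the frequency of the band; it is likewise an assumption rather than the author's stated method, and both function names are hypothetical.

```python
from itertools import combinations


def pic_codominant(allele_freqs):
    """PIC of a co-dominant locus from its allele frequencies (Botstein form)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction for matings between identical heterozygotes,
    # which are only partially informative.
    correction = sum(2 * (p * p) * (q * q)
                     for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - correction


def pic_dominant(band_freq):
    """Simplified PIC for a presence/absence marker with band frequency f."""
    return 2 * band_freq * (1 - band_freq)


print(pic_codominant([0.5, 0.5]))   # two equally frequent alleles
print(pic_codominant([0.25] * 4))   # four equally frequent alleles
print(pic_dominant(0.5))            # most informative dominant marker
```

Adding equally frequent alleles raises the co-dominant PIC, whereas the presence/absence expression peaks at f = 0.5, in line with the maximum of 0.5 quoted for dominant markers.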

mtDNA is characterised by molecular heterogeneity observed as classes of circular chromosomes that vary in size and relative abundance. There are only a few reports that describe the utilisation of mtSSRs in plant species.

Inter-Simple Sequence Repeats (ISSR)

ISSR involves amplification of DNA segments present at an amplifiable distance between two identical microsatellite repeat regions that are oriented in opposite directions. The technique uses microsatellites as primers in a single-primer PCR targeting multiple genomic loci to amplify mainly inter-simple sequence repeats of different sizes. The microsatellite repeats used as primers for ISSRs can be dinucleotide, trinucleotide, tetranucleotide or pentanucleotide. The primers used can be either unanchored or, more usually, anchored at the 3′ or 5′ end with 1–4 degenerate bases extended into the flanking sequences. Thus, the principle is similar to RAPD; however, ISSRs use longer primers (15–30 mers) as compared to RAPD primers (10 mers), which permit the use of a high annealing temperature leading to higher stringency. The annealing temperature depends on the GC content of the primer used and ranges from 45 to 65 °C. The amplified products are usually 200–2,000 bp long and amenable to detection by both agarose and polyacrylamide gel electrophoresis (PAGE). ISSRs exhibit the specificity of microsatellite markers but need no sequence information for primer synthesis, thus enjoying the advantage of random markers. The primers are not proprietary and can be synthesised by anyone. The technique is simple and quick, and the use of radioactivity is not essential. ISSR markers usually show high polymorphism, although the level of polymorphism has been shown to vary with the detection method used. PAGE in combination with radioactivity was shown to be most sensitive, followed by PAGE with silver nitrate staining and then agarose gel with ethidium bromide detection. As with RAPDs, reproducibility, dominant inheritance and homology of co-migrating amplification products are the main limitations of ISSRs. ISSRs segregate mostly as dominant markers, although co-dominant segregation has been reported in some cases. There is also a possibility, as in RAPD, that fragments with the same mobility originate from non-homologous regions.

Single-Nucleotide Polymorphism (SNPs)

Variations at single-nucleotide level in the genome sequence of individuals of a population are known as SNPs (Jordan and Humphries 1994). They constitute the most abundant molecular markers in the genome and are widely distributed throughout genomes, although their occurrence and distribution vary among species. The SNPs are usually more prevalent in the non-coding regions of the genome. Within the coding regions, an SNP is either non-synonymous and results in an amino acid sequence change or it is synonymous and hence does not alter the amino acid sequence. However, synonymous changes can modify mRNA splicing and thus sometimes result in phenotypic differences. Improvements in sequencing technology and the availability of an increasing number of EST sequences have made direct analysis of genetic variation at the DNA sequence level possible. The majority of SNP genotyping assays are based on one or two of the following molecular mechanisms: allele-specific hybridisation, primer extension, oligonucleotide ligation and invasive cleavage. High-throughput genotyping methods, including DNA chips, allele-specific multiplex PCR and primer extension approaches, make SNPs especially attractive as genetic markers. Because of these technological improvements, SNPs are highly suitable for automation and are used for construction of ultra-high-density genetic maps.

Single-Feature Polymorphism (SFP)

The basis of genome-wide polymorphism discovery by SFP is the principle that a sequence which perfectly matches a feature/probe sequence present on a gene chip or microarray may hybridise with greater affinity than one with a mismatch. The polymorphism of the two sequences, originating from two different varieties or genotypes, results in differential hybridisation intensity, and this property associated with sequence characteristics functions as a molecular marker popularly known as SFP. Such genetic differences between genotypes at sequence level are of two kinds: single-nucleotide polymorphisms (SNPs) and insertions/deletions (INDELs). These assays are done by labelling genomic DNA (target) and hybridising to arrayed oligonucleotide probes that are complementary
to target. Either type of variation can potentially influence the hybridisation of target to 25-mer oligonucleotides. Each SFP is scored by the presence or absence of a hybridisation signal with its corresponding oligonucleotide probe on the array. Thus, a polymorphism detected by a single probe in an oligonucleotide array is called an SFP, where a feature refers to a probe in the array. Since it is amenable to microarray-based genotyping, it is highly suitable for high-throughput genotyping. For genotyping large populations, the cost per individual is more critical than the cost per data point. Spotted oligonucleotide microarrays have the potential to provide low-cost genotyping platforms. Polymorphisms within a transcribed sequence are of particular interest because they may reflect variation in biological function.

Sequence-Characterised Amplified Regions (SCAR)

In order to utilise markers identified by arbitrary marker techniques (such as RAPD, AFLP, ISSR) for map-based cloning and/or efficient marker-assisted selection (MAS), identification of an unambiguous single locus is a must. In addition, the arbitrary marker techniques are sensitive to changes in the reaction conditions. In order to bridge the gap between the ability to obtain markers linked to a gene of interest in a short time and the use of these markers for map-based cloning approaches and for routine MAS, the SCAR marker technique was developed and applied. The SCARs are PCR-based markers that represent genomic DNA fragments at genetically defined loci. SCARs are identified by PCR amplification using sequence-specific oligonucleotide primers (Paran and Michelmore 1993). Development of SCARs involves cloning the amplified products of arbitrary marker techniques and then sequencing the two ends of the cloned products. The sequence is thereafter used to design specific primer pairs of 15–30 bp which amplify single major bands of a size similar to that of the cloned fragment. Polymorphism is either retained as the presence or absence of amplification of the band or can appear as length polymorphisms that convert dominant arbitrary-primed marker loci into co-dominant SCAR markers. As SCARs are primarily defined genetically, they can be used both as physical landmarks in the genome and as genetic markers. Co-dominant SCARs are more informative for genetic mapping than dominant arbitrary-primed molecular markers, as they can be used to screen pooled genomic libraries by PCR and for physical mapping, defining locus specificity as well as comparative mapping and homology studies among related plant species. Thus, SCARs have several advantages over RAPD or AFLP: (1) higher reproducibility resulting from longer primers and higher annealing temperature and (2) the possibility of changing dominant markers to co-dominant markers.

However, cloning and sequencing are still laborious in SCAR development. To avoid this problem, the extended random primer amplified region (ERPAR) has been developed (Wang et al. 2000). Similar to SCAR, an ERPAR uses specific primer pairs derived from RAPD primers by adding bases sequentially to their 3′ ends. The extension of primers is a continuous procedure of adding bases and screening primer pairs. Because longer primers are designed without sequence information, cloning and sequencing are not needed. ERPAR has the same advantages as SCAR; in addition, it eliminates the tedious work involved in SCAR development. Thus, it is a universal and efficient approach to convert a RAPD marker into a stable marker.

Cleaved Amplified Polymorphic Sequences (CAPS)

The CAPS marker technique provides a way to utilise the DNA sequences of mapped RFLP markers to develop PCR-based markers, thereby eliminating the tedious DNA blotting (Komori and Nitta 2005). Therefore, CAPS are also known as PCR-RFLP markers. CAPS detect the restriction fragment length polymorphisms caused by single base changes such as SNPs and insertions/deletions, which modify restriction endonuclease recognition sites in PCR amplicons. The CAPS assays are performed by digesting locus-specific

Fig. 3.7 Schematic illustration of CAPS: the target gene is PCR-amplified from individuals A and B and the PCR products are digested with a restriction enzyme; the presence or absence of the restriction site helps in identifying the polymorphism.
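The logic of Fig. 3.7 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two 46-bp amplicons are invented toy sequences, EcoRI (GAATTC, cutting after the first G) is chosen arbitrarily as the enzyme, and the function names `digest` and `caps_bands` are hypothetical. It shows why CAPS markers are co-dominant: the heterozygote displays both the cut and the uncut bands.

```python
def digest(amplicon, site, cut_offset):
    """Return fragment lengths after cutting at every occurrence of `site`.

    `cut_offset` is the cut position inside the recognition site
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    pos = amplicon.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = amplicon.find(site, pos + 1)
    bounds = [0] + cuts + [len(amplicon)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def caps_bands(alleles, site="GAATTC", cut_offset=1):
    """Band sizes expected on the gel for a plant carrying the given alleles."""
    bands = set()
    for allele in alleles:
        bands.update(digest(allele, site, cut_offset))
    return sorted(bands)


# Toy amplicons (assumed for illustration): allele_a carries the EcoRI site,
# allele_b has a single base change (A -> G) that destroys it.
flank = "ATGC" * 5
allele_a = flank + "GAATTC" + flank
allele_b = flank + "GAGTTC" + flank

print(caps_bands([allele_a, allele_a]))  # homozygous, site present: [21, 25]
print(caps_bands([allele_a, allele_b]))  # heterozygous: [21, 25, 46]
print(caps_bands([allele_b, allele_b]))  # homozygous, site absent: [46]
```

In a real assay the band sizes would be read from an agarose or polyacrylamide gel rather than computed, and very small fragments may not be visible.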

PCR amplicons with one or more restriction enzymes, followed by separation of the digested DNA on agarose or polyacrylamide gels (Fig. 3.7). The primers are synthesised based on the sequence information available in databanks of genomic or cDNA sequences or from cloned RAPD bands.

The CAPS analysis is versatile and can be combined with single-strand conformational polymorphism (SSCP; see below), SSR, SCAR, AFLP or RAPD analysis to increase the possibility of finding DNA polymorphisms. The CAPS markers are co-dominant and locus specific and have been used to distinguish between plants that are homozygous or heterozygous for alleles. Thus, CAPS proves useful for genotyping, positional or map-based cloning and molecular identification studies where sequence-based identification is not feasible. The technique is, however, limited to mutations that create or disrupt a restriction enzyme recognition site. To overcome this limitation, Michaels and Amasino (1998) proposed a variant of the CAPS method called derived cleaved amplified polymorphic sequence (dCAPS). In dCAPS analysis, a restriction enzyme recognition site that includes the SNP is introduced into the PCR product by a primer containing one or more mismatches to the template DNA. The modified PCR product is then subjected to restriction enzyme digestion, and the presence or absence of the SNP is determined by the resulting restriction pattern. The method is simple, relatively inexpensive and utilises the ubiquitous technologies of PCR, restriction digestion and agarose gel analysis. This technique has proved useful for following known mutations in segregating populations and for positional cloning of new genes in plants.

Randomly Amplified Microsatellite Polymorphisms (RAMP)

Microsatellite-based markers show a high degree of allelic polymorphism, but they are labour intensive. On the other hand, RAPD markers are inexpensive but exhibit a low degree of polymorphism. To compensate for the weaknesses of these two approaches, a technique termed RAMP was developed (Wu et al. 1994). The technique involves a radiolabelled primer consisting of a 5′ anchor and 3′ repeats, which is used to amplify genomic DNA in the presence or absence of RAPD primers. The resulting products are resolved using denaturing polyacrylamide gels, and as the repeat primer is labelled, only the amplification products derived from the anchored primer are detected. The melting temperatures of the anchored primers are usually 10–15°C higher than those of the

RAPD primers; thus, at a higher annealing temperature, only the anchored primer would anneal efficiently, whereas in PCR cycles at a low annealing temperature, both the anchored microsatellite and the RAPD primers would anneal. The PCR program was therefore modified to switch between high and low annealing temperatures during the reaction. Most fragments obtained with RAMP primers alone disappear when RAPD primers are included, and different patterns are obtained with the same RAMP primer and different RAPD primers, indicating that RAPD primers compete with the RAMP primer during the low annealing temperature cycle. RAMP has been successfully employed in plant genetic diversity studies.

Sequence-Related Amplified Polymorphism (SRAP)

The aim of the SRAP technique (Li and Quiros 2001) is the amplification of open reading frames (ORFs). It is based on two-primer PCR amplification. The technique uses primers of arbitrary sequence, which are 17–21 nucleotides in length. It uses pairs of primers with AT- or GC-rich cores to amplify intragenic fragments for polymorphism detection. The primers consist of the following elements: (1) core sequences, which are 13–14 bases long, where the first 10 or 11 bases starting at the 5′ end are sequences of no specific constitution (filler sequences), followed by the sequence CCGG in the forward primer and AATT in the reverse primer, and (2) three selective nucleotides at the 3′ end, following the core. The filler sequences of the forward and reverse primers must be different from each other and can be 10 or 11 bases long. For the first five cycles, the annealing temperature is set at 35°C. The following 35 cycles are run at 50°C. The amplified DNA fragments are fractionated on denaturing acrylamide gels and detected by autoradiography or silver staining. SRAP combines simplicity, reliability and a moderate throughput ratio and facilitates sequencing of selected bands. SRAP targets coding sequences in the genome and results in a moderate number of co-dominant markers. Sequencing demonstrated that SRAP polymorphism results from two events: fragment size changes due to insertions and deletions, which could lead to co-dominant markers, and nucleotide changes leading to dominant markers. The SRAP marker system has been adapted for a variety of purposes in different crops, including map construction, gene tagging and genetic diversity studies.

Target Region Amplification Polymorphism (TRAP)

The TRAP technique (Hu and Vick 2003) is a rapid and efficient PCR-based technique, which utilises bioinformatics tools and EST database information to generate polymorphic markers around targeted candidate gene sequences. The technique uses two primers (18 nucleotides in length) to generate markers. One of the primers, the fixed primer, is designed from the targeted EST sequence in the database; the second primer is an arbitrary primer with either an AT- or GC-rich core to anneal with an intron or exon. As the TRAP technique can be used to generate markers for specific gene sequences, it is useful for genotyping germplasm and generating markers associated with desirable agronomic traits in crop plants for marker-assisted breeding. The technique has also been effectively used in fingerprinting, in estimating genetic diversity and in mapping QTL.

Single-Strand Conformation Polymorphism (SSCP)

Single-strand conformation polymorphism is the mobility shift analysis of single-stranded DNA sequences on neutral polyacrylamide gel electrophoresis, to detect polymorphisms produced by differential folding of single-stranded DNA due to subtle differences in sequence (often a single base pair) (Orita et al. 1989). In the absence of a complementary strand, the single strand experiences intra-strand base pairing, resulting in loops and folds that give it a unique 3-D structure

Fig. 3.8 Schematic representation of SSCP: the target gene is PCR-amplified from individuals A and B and denatured to produce single strands (or denatured products from A and B are pooled); differences in DNA sequence, or internal sequence polymorphisms in the PCR products from the two genomes, cause differential folding of the single-stranded DNA, and the differential conformation leads to differences in gel mobility.

which can be considerably altered by a single base change, resulting in differential mobility (Fig. 3.8). The SSCP analysis proves to be a powerful tool for assessing the complexity of PCR products, as the two DNA strands from the same PCR product often run separately on SSCP gels, thereby providing opportunities (1) to score a polymorphism and (2) to resolve internal sequence polymorphisms in some PCR products from identical places in the two parental genomes. The PCR-based SSCP analysis is a rapid, simple and sensitive technique for detection of various mutations, including single-nucleotide substitutions, insertions and deletions, in PCR-amplified DNA fragments. The technique is similar to RFLP in that it can also decipher the allelic variants of inherited and genetic traits. However, unlike RFLP analysis, SSCP analysis can detect DNA polymorphisms and mutations at multiple places in DNA fragments. The SSCP gels have been used to increase throughput and reliability of scoring during mapping.

Fluorescence-based PCR-SSCP (F-SSCP) is an adapted version of SSCP analysis involving amplification of the target sequence using fluorescent primers (Makino et al. 1992). The major disadvantage of the technique is that the development of SSCP markers is labour intensive and costly and cannot be automated.

Transposable Elements (TE)-Based Molecular Markers

Transposons are mobile genetic elements capable of changing their location in the genome. They were first discovered in maize. There are two broad classes of transposable elements, each with characteristic properties. For all Class I elements or retroelements, such as retrotransposons, short interspersed nuclear elements and long interspersed nuclear elements, it is the element-encoded mRNA, and not the element itself, that forms the transposition intermediate. This means that each transposition event creates a new copy of the transposon, while the original copy remains intact at the donor site. In contrast, Class II consists of DNA transposons, which change their location in the genome by a cut and paste mechanism. In other words, they excise themselves

Fig. 3.9 Schematic representation of development of (a) IRAP and (b) REMAP primers: (a) outward-facing 5′ and 3′ long terminal repeat (LTR) primers for IRAP marker development; (b) use of one outward-facing LTR primer together with a primer corresponding to an SSR or microsatellite motif.
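The primer choices summarised in Fig. 3.9 can be written down as a small Python sketch. It follows the rules given in this chapter for IRAP (a single LTR primer for head-to-head or tail-to-tail elements, both 5′ and 3′ LTR primers for head-to-tail elements) and for REMAP (a microsatellite primer anchored by one selective base at its 3′ end). The function names, the primer labels and the repeat number of 9 are hypothetical and are used only for illustration.

```python
def irap_primers(orientation):
    """Primers needed to amplify the DNA between two nearby LTR elements."""
    if orientation in ("head-to-head", "tail-to-tail"):
        # A single outward-facing LTR primer anneals to both elements.
        return ["LTR primer"]
    if orientation == "head-to-tail":
        # Both LTR ends face the intervening DNA, so two primers are needed.
        return ["5' LTR primer", "3' LTR primer"]
    raise ValueError(f"unknown orientation: {orientation!r}")


def anchored_ssr_primer(motif, repeats, anchor_base):
    """REMAP microsatellite primer: repeated motif plus one selective 3' base."""
    if len(anchor_base) != 1 or anchor_base not in "ACGT":
        raise ValueError("anchor_base must be a single nucleotide")
    return motif * repeats + anchor_base


def score_band(amplified):
    """Dominant scoring: 1 if the PCR product is present, 0 if absent."""
    return 1 if amplified else 0


print(irap_primers("head-to-tail"))
print(anchored_ssr_primer("GA", 9, "C"))  # assumed repeat number, for illustration
```

Because both techniques are scored by the presence or absence of a band, `score_band` cannot distinguish a heterozygote from a homozygote carrying the insertion.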

from the donor site and reintegrate themselves at the acceptor site. Transposons can be further subdivided into subclasses, super families, families and subfamilies on the basis of structural characteristics: the type and orientation of open reading frames; the presence, orientation, length and sequence of their terminal repeats; and the length and sequence of target site duplications created upon insertion.

Retrotransposon-Based Molecular Markers

In plants with large genomes, retrotransposons are the major class of repetitive DNA, comprising 40–60% of the genome. Based on their structural organisation and amino acid similarities among their encoded reverse transcriptases, retrotransposons can be divided into three categories. Long terminal direct repeats (LTRs) flank two of these categories, and they encode proteins similar to those of the retroviruses. These LTR retrotransposons are referred to as the gypsy-like and copia-like retrotransposons. The third class of retrotransposons, the LINE1-like or non-LTR retrotransposons, lack terminal repeats and encode proteins with significantly less similarity to those of the retroviruses. Retrotransposons replicate by successive transcription, reverse transcription and insertion of the new cDNA copies back into the genome. Copia-like and gypsy-like retrotransposons are present throughout the plant kingdom. Retrotransposons provide an excellent opportunity to develop molecular marker systems (Kalendar et al. 1999) due to their long, defined, conserved sequences and the new insertional polymorphisms produced by replicationally active members. The new insertions help in organising insertion events temporally in a lineage and thus can be used to determine pedigrees and phylogenies. Retrotransposon-based molecular analysis relies on amplification using a primer corresponding to the retrotransposon and a primer matching a section of the neighbouring genome. Sequence-specific amplified polymorphism (S-SAP) relies on amplification of DNA between a retrotransposon integration site and a restriction site with a ligated adapter (Waugh et al. 1997). In inter-retrotransposon amplified polymorphism (IRAP), DNA between two nearby retrotransposons or LTRs is amplified (Fig. 3.9).

Retrotransposon-microsatellite amplified polymorphism (REMAP) involves amplification of fragments which lie between a retrotransposon insertion site and a microsatellite site (Fig. 3.9). Retrotransposon-based insertion polymorphism (RBIP) detects loci occupied by or empty of a retrotransposon.

Inter-retrotransposon Amplified Polymorphism (IRAP) and REtrotransposon-Microsatellite Amplified Polymorphism (REMAP)
IRAP and REMAP are two amplification-based marker methods which have been developed based on the position of given LTRs within the genome. These two markers were originally developed for the BARE-1 retrotransposon of the genus Hordeum, which is present in the barley genome in numerous copies. The IRAP markers are generated by the proximity of two LTRs using outward-facing primers annealing to LTR target sequences (Fig. 3.9). In REMAP, amplification between LTRs proximal to simple sequence repeats such as constitutive microsatellites produces markers (Fig. 3.9). Both IRAP and REMAP examine polymorphism in retrotransposon insertion sites, IRAP between retrotransposons and REMAP between retrotransposons and microsatellites (SSRs). Retrotransposons can integrate in either orientation into the genome. For head-to-head and tail-to-tail orientations, PCR products can be generated using a single primer from elements sufficiently close to one another. Intervening genomic DNA for elements in head-to-tail orientation is amplified using both 5′ and 3′ LTR primers. The REMAP method relies on one outward-facing LTR primer and a second primer from a microsatellite. Primers were designed to the (GA)n, (CT)n, (CA)n, (CAC)n, (GTG)n and (CAC)n microsatellites and were anchored (all but one) to the microsatellite 3′ terminus by the addition of a single selective base at the 3′ end. In both techniques, polymorphism is detected by the presence or absence of the PCR product. Lack of amplification indicates the absence of the retrotransposon at the particular locus. As these markers are extremely polymorphic, they can prove useful for evaluating intraspecific relationships. The Copia-SSR marker assay, a variant of REMAP, utilises a Ty1-copia-specific primer along with anchored SSR primers. The IRAP technique has been used in genome classification of plant cultivars and to detect similarity between cultivars.

Sequence-Specific Amplification Polymorphism (S-SAP)
The technique was first used to investigate the location of BARE-1 retrotransposons in the barley genome (Waugh et al. 1997). In principle, it is a simple modification of the standard AFLP protocol. The final amplification is performed with retrotransposon-specific and MseI-adaptor-specific primers. S-SAP has been extensively used to generate markers to study genetic diversity and to prepare linkage maps in several plants.

Retrotransposon-Based Insertion Polymorphism (RBIP)
The technique was first developed using the PDR1 retrotransposon in the pea (Flavell et al. 1998). It requires the sequence information of the 5′ and 3′ regions flanking the transposon. When a primer specific to the transposon is used together with a primer designed to anneal to the flanking region, they generate a product from template DNA containing the insertion. On the other hand, primers specific to both flanking regions amplify a product if the insertion is absent. Polymorphisms can be identified using standard agarose gel electrophoresis or by hybridisation with a reference PCR fragment. Hybridisation is more useful for automated, high-throughput analysis. RBIP is technically demanding and somewhat costlier than other methods for detecting transposon insertions.

Transposable Display (TD)
TD permits the simultaneous detection of many TEs from high copy number lines. The technique is a modification of the AFLP procedure in which PCR products are derived from primers anchored in a restriction site (e.g. BfaI or MseI) and a transposable element rather than in two restriction sites (van den Broeck et al. 1998). Individual transposons are identified by a ligation-mediated PCR that starts from within the

transposon and amplifies part of the flanking sequence up to a specific restriction site. The resulting PCR products can be analysed in a high-resolution polyacrylamide gel system. TD was first used to reveal the copy number of the dTph1 transposon (TIRs) family in petunia and related insertion events. It also allows detection of an insertion that can be correlated with a particular phenotype. It is also possible to exploit the unique properties of a group of TEs called miniature inverted repeat transposable elements (MITEs) using the TD technique to develop a new class of molecular marker for analysing the Hbr transposon family in maize.

Inter-MITE Polymorphism (IMP)
The technique is in principle very similar to IRAP, except that it uses MITE-like transposons rather than retrotransposons. MITEs are short, non-autonomous DNA elements (class II transposons) that are widespread and abundant in plant genomes and exhibit high copy number and intra-family homogeneity in size and sequence. Most of the hundreds of thousands of MITEs identified to date have been divided into two major groups on the basis of shared structural and sequence characteristics: Tourist-like and Stowaway-like. The IMP technique was first used to identify two groups of MITEs in barley, one belonging to the Stowaway family and the other to the Barfly family (Chang et al. 2001).

It is expected that more marker systems could be developed based on the features of transposable elements. However, it would be desirable to generate markers that are chromosome specific (which would be a herculean task because of the nature of transposable elements).

Diversity Array Technology (DArT)

DArT operates on the principle that the sample genomic DNA contains two types of fragments: (1) constant fragments (found in any representation prepared from a DNA sample) and (2) variable or polymorphic fragments (found in some but not all of the representations of the DNA samples). The variable fragments are informative because they reflect sequence variation that determines the fraction of the original DNA sample that is included in the representation. Thus, the variable fragments are called DArT markers. Their presence or absence in a genomic representation is assayed by hybridising the representation to a DArT array consisting of a library of that given sample. Thus, DArT consists of the following sequence of steps: complexity reduction of the sample DNA, library creation, microarraying of the libraries onto glass slides, hybridisation of fluorolabelled DNA onto the slides, scanning of the slides for hybridisation signal, and data extraction and analysis (http://www.diversityarrays.com/molecularprincip.html).

Intron-Targeted Intron–Exon Splice Conjunction (IT-ISJ) Marker

Weining and Langridge (1991) considered that gene promoter regions, intron–exon splice conjunction sites and 3′ poly-A addition sites in primary RNA are all closely linked with the targeted genes, so they can be used to design PCR primers. Based on the conserved sequences of intron–exon junctions, Weining and Langridge (1991) designed ISJ (intron–exon splice junction) primers for amplifying introns or exons and used the ISJ primer PCR products to analyse genomic DNA in wheat and barley. They found that ISJ primers alone produced smeared bands, whereas ISJ primers in conjunction with random primers and specific primers produced clear bands. The core part of the forward primers included the 5′ splice junction conserved sequence GAGGTAAGT, which was supplemented with the recognition sequence GCATGC of the restriction endonuclease SphI at the 5′ end and with three selective bases at the 3′ end. The core part of the reverse primers included the 3′ splice junction conserved sequence ACCTGCA, which was supplemented with the recognition sequence GAATTC of the restriction endonuclease EcoRI at the 5′ end and with three selective bases at the 3′ end. In order to determine the value of the IT-ISJ marker in genetic map construction, different IT-ISJ primer combinations were used to genotype the

Fig. 3.10 Schematic representation of RAD marker development: DNA samples A and B are digested with restriction enzymes (a mutation at a recognition site prevents cutting in one sample), linkers are ligated, the restriction products are physically sheared, the RAD tags are purified and released, and the tags are labelled and hybridised to identify or type RAD markers.
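The idea behind Fig. 3.10 can be illustrated with a highly simplified Python sketch. It treats a RAD tag as the stretch of DNA immediately flanking a restriction site, so a mutation in the recognition site of one sample removes the tags at that site. The sequences, the choice of GAATTC as the recognition site, the 10-base tag length and the function name `rad_tags` are all assumptions made for illustration; the sketch ignores linker ligation, shearing and array hybridisation.

```python
def rad_tags(dna, site, tag_len):
    """Collect the sequences flanking every occurrence of `site` in `dna`."""
    tags = set()
    pos = dna.find(site)
    while pos != -1:
        upstream = dna[max(0, pos - tag_len):pos]
        downstream = dna[pos + len(site):pos + len(site) + tag_len]
        tags.update(tag for tag in (upstream, downstream) if tag)
        pos = dna.find(site, pos + 1)
    return tags


site = "GAATTC"
# Toy samples (assumed): sample_b carries a C -> A change in the second site.
sample_a = "ACGTACGTAC" + site + "TTGACCAGTA" + site + "CCATGGTTAA"
sample_b = "ACGTACGTAC" + site + "TTGACCAGTA" + "GAATTA" + "CCATGGTTAA"

tags_a = rad_tags(sample_a, site, tag_len=10)
tags_b = rad_tags(sample_b, site, tag_len=10)

# Tags found in only one sample behave as polymorphic RAD markers.
polymorphic = tags_a ^ tags_b
print(sorted(polymorphic))  # ['CCATGGTTAA']
```

Tags shared by both samples correspond to monomorphic sites and carry no mapping information.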

recombinant inbred line population developed from upland cotton, and a genetic map was constructed.

Restriction Site Associated DNA (RAD) Markers

RAD can identify and type a large number of markers on a resource that is easy to produce for both model and non-model organisms (Baird et al. 2008). These markers were first employed to rapidly map a recombination breakpoint in the model organism Drosophila melanogaster, using a pre-existing genomic tiling path microarray. The procedure of RAD marker development is explained in Fig. 3.10.

RNA-Based Molecular Markers

As an alternative to DNA, other types of nucleic acids, such as RNA, have also been used as templates to develop special kinds of molecular markers. For example, PCR-based marker techniques, such as complementary DNA-AFLP (cDNA-AFLP), cDNA-SSCP and RNA fingerprinting by arbitrarily primed PCR (RAP-PCR), are used as markers.

cDNA-AFLP

The cDNA-AFLP is an RNA fingerprinting technique used to display differentially expressed genes (Bachem et al. 1996). The methodology includes digestion of cDNAs by two restriction enzymes followed by ligation of oligonucleotide adapters and PCR amplification using primers complementary to the adapter sequences with additional selective nucleotides at the 3′ end. The cDNA-AFLP technique is more stringent and reproducible than RAP-PCR. In contrast to hybridisation-based techniques, such as cDNA microarrays, cDNA-AFLP can distinguish between highly homologous genes from individual gene families. Further, cDNA-AFLP does not require any pre-existing sequence information; thus, it is valuable as a tool for the identification of novel process-related genes such as stress-regulated genes.

RNA Fingerprinting by Arbitrarily Primed PCR (RAP-PCR)

The RAP-PCR technique (Welsh et al. 1992) involves fingerprinting of RNA populations using an arbitrarily selected primer at low stringency for first and second strand cDNA synthesis, followed by PCR amplification of the cDNA population. The method requires nanograms of total RNA and is unaffected by low levels of genomic DNA contamination. Differential PCR fingerprints are detected for RNAs from the same tissue isolated from different individuals and for RNAs from different tissues from the same individual. The individual-specific differences revealed are due to sequence polymorphisms and are useful for genetic mapping of genes. The tissue-specific differences revealed are useful for studying differential gene expression.

cDNA-SSCP

The SSCP analysis of RT-PCR products can be used to evaluate the expression status (presence and relative quantity) of highly similar homologous gene pairs from a polyploid genome. Replicated tests show that cDNA-SSCP reliably separates duplicated transcripts with 99% sequence identity (Cronn and Adams 2003). This technique has been used to gain remarkable insight into the global frequency of silencing in synthetic and natural polyploids.

Role of Genomics

Genomics has brought new hope for the development of novel types of markers and for unravelling the secrets of complex traits. Genome and/or gene sequences themselves have the potential to provide a comprehensive list of the markers in an organism. Functional genomics approaches can then be used to generate information about gene function, as well as data on genetic interactions, not only among and between gene complexes but also in response to environmental stimuli. At present, microarray technology (see Box 3.4) is providing the most comprehensive assessment of gene function and variation. Our ability to view the transcription of the genome is improving rapidly, and as a result, the potential to dissect complex traits is also developing. Already, array technology has been instrumental in identifying groups of co-expressed genes in various physiological states, including stages of development and disease. Although array technology is valuable, these data are not conclusive or comprehensive as regards gene function and provide only one more piece (i.e. the transcriptional profile) of the puzzle. The translation of genes into proteins is another key step in gene action, and it will be essential to subject protein synthesis, as well as protein interaction, to the same genome-wide analysis to understand how genotype can influence a complex phenotype. In other words, how the growing collections of data at the DNA, RNA, protein and metabolite levels can be combined to dissect complex traits and diseases remains to be seen. It has been proposed that the power available through the merger of genetics and genomics (called genetical genomics or eQTL; discussed in Chapter 7) might lead to further unravelling of metabolic, regulatory and developmental

Box 3.4 Techniques Used to Find DNA Variations

Finding polymorphic markers is the key factor that decides the success of linkage mapping. Identifying polymorphism relies on the efficient discrimination of DNA markers generated from the individuals. Usually, the markers are classified as monomorphic or polymorphic using techniques such as gel and capillary electrophoresis, microarray and TILLING.

Gel Electrophoresis

The term electrophoresis describes the migration of charged particles under the influence of an electric field. Gel electrophoresis is the technique in which molecules are forced across a span of gel, driven by an electrical current. At either end of the gel, there are activated electrodes that provide the driving force. A molecule's properties (especially size, charge (the possession of ionisable groups) and conformation) therefore determine how rapidly an electric field can move the molecule through a gelatinous medium or matrix. For DNA, the important factors are the length and conformation of the molecule; smaller molecules travel farther.

Agarose and Polyacrylamide Gel Electrophoresis
The matrix is composed of either agarose or polyacrylamide, each of which has attributes suitable to particular tasks. Agarose is a polysaccharide extracted from seaweed. It is simple to prepare and is typically used at concentrations of 0.5–2% to resolve DNA fragments of 100 bp to 15 kb. The higher the agarose concentration, the stiffer the gel and the smaller the DNA fragments that can be resolved. Polyacrylamide gels are chemically cross-linked gels formed by the polymerisation of acrylamide with a cross-linking agent, N,N′-methylenebisacrylamide (Bis). The reaction is a free radical polymerisation, carried out with ammonium persulfate as the initiator and N,N,N′,N′-tetramethylethylenediamine (TEMED) as the catalyst. The length of the polymer chains is dictated by the concentration of acrylamide used, which is typically between 3.5 and 20%. Polyacrylamide gels are significantly more laborious to prepare than agarose gels. Because oxygen inhibits the polymerisation process, they must be poured between glass plates (or cylinders). Polyacrylamide gels have a rather small range of separation but very high resolving power. Polyacrylamide is used for separating DNA fragments of less than 500 bp; under appropriate conditions, fragments of DNA differing in length by a single base pair are easily resolved. Small DNAs or RNAs (smaller than 100 bp) are better separated by polyacrylamide gels; however, 2–3% agarose gels may be adequate to separate even 50 bp fragments from much larger nucleic acids.

DNA electrophoresis is arguably the most commonly performed molecular assay of the past 50 years. The technique was initially borrowed from protein and RNA techniques rather than being developed through the design of optimised methods. It generally employs suboptimal buffers having high ionic strength, conductance and electric field strength. Excessive Joule heating limits the tolerable applied voltage and the speed of electrophoretic separation.

A number of buffers are used for agarose electrophoresis, the most common being Tris/Acetate/EDTA (TAE), Tris/Borate/EDTA (TBE) and lithium borate (LB). TAE has the lowest buffering capacity but provides the best resolution for larger DNA; it requires a lower voltage and more time but can produce a better resolution. LB is relatively new and is ineffective in resolving fragments larger than 5 kb (Brody et al. 2004). However, with its low conductivity, a much higher voltage could



be used (up to 35 V/cm), which means a shorter analysis time for routine electrophoresis. A size difference as small as one base pair could be resolved in a 3% agarose gel with an extremely low conductivity medium such as 1 mM lithium borate.
Thus, recent modifications of DNA electrophoresis eliminated sodium EDTA and substituted alkali metal cations for Tris. Lithium was preferred over other alkali metal cations for its large shell of hydration and low electrokinetic mobility, which provided lower conductance, improved tolerance for applied voltage, lower heat generation and improved separation quality. Compared to Tris/Borate/EDTA (TBE), the alkali metal ion media decreased the conductivity, lowered the final running temperature and reduced the time for electrophoretic separation. In general, TAE buffer (Tris/Acetate/EDTA) is the most commonly used agarose gel electrophoresis buffer. TAE has the lowest buffering capacity and offers the best resolution for larger DNA. However, TAE requires a lower voltage and more time. Alternatively, TBE buffer (Tris/Borate/EDTA) is often used for smaller DNA fragments (i.e. less than 500 bp). Sodium borate (SB) buffer can also be used because its low conductivity allows higher voltages (up to 35 V/cm) during electrophoresis. This could allow a shorter analysis time for routine electrophoresis. However, it is ineffective for resolving fragments larger than 5 kb.

MetaPhor Agarose Gel Electrophoresis
MetaPhor Agarose is a high-resolution agarose which is considered an alternative to polyacrylamide. MetaPhor Agarose is an intermediate melting temperature (75 °C) agarose with twice the resolution capabilities of the finest-sieving agarose products. Using submarine gel electrophoresis, PCR products and small DNA fragments (20–800 bp) that differ in size by 2% can be resolved. Of late, this has been widely employed in SSR marker analysis, since polyacrylamide involves expensive and laborious protocols.

Temperature Gradient Gel Electrophoresis
Temperature Gradient Gel Electrophoresis (TGGE) is a powerful technique to separate DNA fragments of identical length. In contrast to conventional electrophoresis methods, molecules are separated by their melting behaviour. Thus, it becomes possible to separate DNA fragments according to their primary sequence. Two fundamental points are needed to understand TGGE. The first is how the structure of DNA changes with temperature; the second is how these changes in structure affect the movement of DNA through a gel. As temperature rises, the two strands of the DNA start to unwind. At some high temperature, the two strands will completely separate. However, at some intermediate temperature, the two strands will be partly separated, with part of the molecule still double stranded and part single stranded. What makes TGGE useful is that the mobility of the DNA molecule through the gel decreases drastically when these partially melted structures are formed, and, most important, the exact temperature at which this occurs depends on sequence; thus, TGGE offers a sequence-dependent, size-independent method for separating DNA molecules. A very simple but realistic analogy is to consider a person moving through a crowded room; when you extend your arms out, your movement through the room slows drastically, even though your mass has not changed.
Denaturing gradient gel electrophoresis (DGGE) works on the same principle. The difference is that a small sample of DNA is applied to an electrophoresis gel that contains a denaturing agent. It has been shown that certain denaturing gels are capable of inducing DNA to melt at various stages. As a result of



this melting, the DNA spreads through the gel and can be analysed for single components, even those as small as 200–700 bp.

Pulsed Field Gel Electrophoresis
In 1984, Schwartz and Cantor described pulsed field gel electrophoresis (PFGE), introducing a new way to separate DNA. In particular, PFGE resolved extremely large DNA in agarose, from 30–50 kb to 10 Mb. During continuous field electrophoresis, DNA above 30–50 kb migrates with the same mobility regardless of size. This is seen in a gel as a single large diffuse band. If, however, the DNA is forced to change direction during electrophoresis, different-sized fragments within this diffuse band begin to separate from each other. With each reorientation of the electric field relative to the gel, smaller sized DNA will begin moving in the new direction more quickly than the larger DNA. Thus, the larger DNA lags behind, providing a separation from the smaller DNA. Currently, there are three models that attempt to describe the behaviour of DNA during PFGE: the biased reptation model (BRM), the chain model and, most recently, the bag model.

Capillary Electrophoresis
Capillary electrophoresis has largely replaced the use of gel separation techniques due to significant gains in workflow, throughput and ease of use. Fluorescently labelled DNA fragments are separated according to molecular weight, and the process can be automated since it does not involve gel casting. During capillary electrophoresis, the PCR products or DNA enter the capillary as a result of electrokinetic injection. A high-voltage charge applied to the buffered sequencing reaction forces the negatively charged fragments into the capillaries. The extension products are separated by size based on their total charge. The electrophoretic mobility of the DNA sample can be affected by the run conditions: the buffer type, concentration and pH; the run temperature; the amount of voltage applied; and the type of polymer used. Shortly before reaching the positive electrode, the fluorescently labelled DNA fragments, separated by size, move across the path of a laser beam. The laser beam causes the dyes on the fragments to fluoresce. An optical detection device detects the fluorescence, and the signal is converted into data.

Microarray
Microarrays can be used to find polymorphic SNP or SFP markers. A microarray works by exploiting the ability of a given fluorescently labelled DNA fragment to bind (or hybridise) specifically to the markers (predefined DNA templates) arranged in a regular pattern on a small chip. Depending on the strength or degree of binding/hybridisation, the colour intensity varies, and it is used to generate the data. The major advantage of microarrays is that several DNA samples can be analysed in a single experiment and thousands of data points can be generated.

TILLING
TILLING (Targeting Induced Local Lesions IN Genomes) is a reverse genetics process, and it relies on the ability of a special enzyme to detect mismatches in normal and mutant (or polymorphic) DNA strands when they are annealed. By selectively pooling the DNA and amplifying with fluorescently labelled primers, mismatched heteroduplexes are generated between wild type and mutant DNA. Heteroduplexes are incubated with the plant endonuclease CEL-I (which cleaves heteroduplex mismatched sites), and the resultant products are visualised on a capillary



sequencer, and the fluorescently labelled traces are analysed. The differential end labelling of the amplification products permits the two cleavage fragments to be observed and the position of the mismatch or polymorphism to be identified. When a mutation/polymorphism is detected in the pooled DNA, the individual DNA samples are sequenced to identify the specific plant carrying the polymorphism (McCallum et al. 2000).

[Figure 3.11: timeline of marker systems, from morphological variants (pre-1950s), allozymes (1960s) and RFLPs (1980s) in the pre-PCR era, through PCR-based markers such as SSRs (1989), RAPD (1990) and AFLP (1995), to SNPs on chips (after 2000) in the genomics era]

Fig. 3.11 Evolutionary and historical perspectives of molecular markers

pathways, but rigorous investigations still need to be completed. What is clear, however, is that genomic technology is emerging in such a way that it will supply quantities of data that require detailed statistical and mathematical analyses.

Selection of Marker Technology

As science advances, several classes of marker technologies are identified. Figure 3.11 describes the evolutionary and historical perspectives of the marker systems. An obvious problem that usually arises is how to choose the most appropriate marker among the myriad of different marker technologies. In general, the choice of a molecular marker technique has to be a compromise between reliability and ease of analysis, statistical power and confidence of revealing polymorphisms. Thus, before selecting the marker technology, the following should be finalised.

Research Problem

This is the key question that needs to be solved before choosing the right marker technology. Thus, the first step is to finalise what is the biological question one wants to answer with

the research? For instance, for information on population history or phylogenetic relationships, sequence data or restriction site data should be used. In order to construct a saturated linkage map (i.e. approximately one marker in every 1 cM distance), a combination of SSRs and AFLPs needs to be selected.

The Number of Loci and/or Alleles

The next critical question in this context is: will information from a few loci be sufficient, or is greater genome coverage required? Isozymes are usually limited in number. AFLPs detect high numbers of loci. Where hyper-variability is required, the best techniques are those based on single-locus SSRs.

Discrimination Level

Further, it is also important to decide at what taxonomic level the genetic variation is being measured: within populations, between species or between genera? Is the selected method appropriate for detecting the desired level of variation? SSRs can provide sufficient variation between genera; however, to generate the same degree of variation between species, it is better to use SNPs.

Mode of Inheritance

Other questions related to the inheritance of markers in the segregating progenies also need to be addressed before selecting the marker system: should both homozygotes and heterozygotes be identified? Are co-dominant markers needed (single-locus RFLPs, isozymes, SSRs) or will dominant markers suffice (RAPD, AFLP)? If presence versus absence information is sufficient, then any molecular marker technology can be used; but if information about heterozygotes is needed (e.g. population and diversity structure, knowledge on type of inheritance), then co-dominant markers such as isozymes or microsatellites should be used.

Quality of DNA

RFLP analysis requires large amounts of pure quality DNA. Most PCR-based methods require only tiny quantities of DNA. In many cases, PCR is performed only to amplify the original amount of target DNA. Hence, the marker technology should also be selected in line with the available facilities and resources.

Expertise Required

Techniques involving hybridisation or manual sequencing are technically demanding, whereas RAPDs or SSRs (once the primers are available) are the least demanding techniques. Thus, the availability of expertise also decides the selection of marker technology. Further, access to laboratory facilities and equipment, and manpower with a good grasp of many basic laboratory skills, are also required to choose the appropriate marker technology.

Costs

In terms of costs, isozymes are the cheapest; RAPD, RFLP and even AFLP are intermediate; but sequencing or SNP is still more expensive. The costs of all types of experiments should be considered, because lack of reproducibility of some markers may, in the end, result in higher costs. For required skills, a visit to another laboratory where the relevant techniques are being used can provide invaluable information. Of late, costs for sequencing experiments have significantly decreased. Many ESTs are already available for several species. Microarrays, based on either anonymous genomic characterisation or gene expression, are becoming common. Microarray technology is still very demanding, technically and financially (in terms of equipment and consumables). Before deciding on it, get acquainted with the techniques, requirements and outputs. A better option might be to consider outsourcing of sample analysis. SNPs are being routinely used in human studies. They are still

too expensive for standard applications to genetic diversity studies in plants. Nevertheless, SNPs reveal the ultimate level of variation in the DNA sequence (the nucleotides), and they would be the future's best molecular marker option when their costs of discovery and application decrease.

Speed

Further, it is necessary to decide how quickly the data are needed and how much time the equipment will allow. PCR-based methods certainly give fast results when primers are available. Hybridisation-based methods are slower. Conventional DNA sequencing is slow, whereas automated sequencing is faster.

Reproducibility

Yet another critical question to be finalised is whether robust methods are required. For example, will the markers be exchanged? Is more than one laboratory involved? Isozymes, RFLPs, SSRs and sequencing are robust, whereas RAPD is not.

PCR Versus Non-PCR Techniques

PCR-based molecular marker techniques open up numerous possibilities and could be considered first, because of their simplicity. Hybridisation-based techniques are more labour intensive, hazardous and more technically demanding and require costly equipment. Thus, PCR-based techniques can be explored. To this end, the following points may be considered:
• RAPD is an excellent technique by which to become familiar with PCR. It allows rapid examination of polymorphisms in most, if not all, species of interest since primers are readily available.
• Other PCR-based markers such as SSR could be applied relatively easily, if primers are already available in the given species.
• Strategies for searching appropriate primers are also improving, and some approaches for searching putative microsatellites rely on sequence databases, circumventing the problem of having to make and screen libraries in the laboratory.
• AFLPs have become a very popular option, although their need for a double PCR and vertical gel electrophoresis makes them more expensive and technically more demanding. However, this is the only PCR-based technique that helps in constructing a saturated linkage map.
In summary, the three key factors that assess the utility of DNA markers in genetic mapping are:
1. The informativeness of a genetic marker: It is measured by the number of alleles and allele frequencies. There are two measures of informativeness: heterozygosity and polymorphic information content (discussed in Box 3.3).
2. The throughput of a genetic marker: It is the multiplex ratio, that is, the number of simultaneously assayed loci.
3. Genotyping error: It affects the reproducibility of the marker assay and clarity of the marker genotypes.

Marker Genotyping and Scoring

Once the appropriate marker technology (or technologies) is selected, it first needs to be employed in a parental polymorphism survey. It is vital to identify as many polymorphic markers as possible since only those polymorphic markers will be used to construct the linkage map. In order to construct a saturated linkage map, it is essential to find polymorphic markers that span the whole genome. As a general rule, a preliminary linkage map requires markers in every 10-cM interval, whereas a saturated linkage map requires markers in every 1-cM interval. The number of markers required to construct such a preliminary or saturated linkage map varies depending on the marker system and plant species. For example, in cotton (Gossypium spp.), SSRs provide 8–30% polymorphism between the interspecific parents. Tanksley and McCouch (1997) suggested that

Table 3.3 Expected segregation ratios for different marker systems in different population types

Population type                        Genetic segregation ratio
                                       Co-dominant markers       Dominant markers
                                       (e.g. RFLP, SSR, CAPS)    (e.g. RAPD, AFLP, ISSR)
F2 progenies                           1:2:1 (AA:Aa:aa)          3:1 (B_:bb)
Back cross progenies, BC1              1:1 (Cc:cc)               1:0 (D_)
Back cross progenies, BC2              1:1 (Ee:ee)               1:1 (Ff:ff)
Recombinant inbred lines or double     1:1 (GG:gg)               1:1 (HH:hh)
haploid lines or near isogenic lines

once a map of 5,125 cM reaches a density of about one marker per 5 cM, or a total of about 1,025 marker loci, the map should link up into 26 linkage groups corresponding to the 26 gametic chromosomes of tetraploid cotton. He et al. (2007) have published such a map with F2 and F2:3 populations (G. hirsutum × G. barbadense), which includes 1,029 genetic loci mapped to 26 linkage groups that covered 5472.3 cM with an average distance of 5.31 cM between loci. In some polyploid species such as sugarcane, identifying polymorphic markers is more complicated. In such cases, the mapping of diploid relatives of polyploid species can be of great benefit in developing maps for polyploid species. However, diploid relatives do not exist for all polyploid species. A general method for the mapping of polyploid species is based on the use of single-dose restriction fragments.
In all cases, it is thus essential to identify a sufficient number of polymorphic markers that span all the chromosomes of the given species. These polymorphic markers are to be surveyed across the progenies of the given mapping population (and, if possible, across F1 hybrids). This is known as marker genotyping of the population. Therefore, DNA must be extracted from each individual of the given mapping population when DNA markers are used.

The expected segregation ratios for co-dominant and dominant markers (Table 3.3) are compared with the actual ratios found in the experimental population.
Significant deviations from expected ratios can be analysed using chi-square tests (discussed below). Generally, markers will segregate in a Mendelian fashion, although distorted segregation ratios may be encountered in certain populations. The frequency of recombinant genotypes can be used to calculate recombination fractions, which may be used to infer the genetic distance between markers. By analysing the segregation of markers, the relative order and distances between markers can be determined: the lower the frequency of recombination between two markers, the closer they are situated on a chromosome (conversely, the higher the frequency of recombination between two markers, the further away they are situated on a chromosome). Markers that have a recombination frequency of 50% are described as unlinked and assumed to be located far apart on the same chromosome or on different chromosomes. Mapping functions are used to convert recombination fractions into map units called centimorgans (cM). For a more detailed explanation of linkage mapping, refer to Chapter 4.
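
As a small worked illustration of converting recombination fractions into map units, the sketch below uses the Haldane and Kosambi mapping functions. The text does not say which mapping function is intended, so the choice of these two commonly used functions, the function names and the example counts are assumptions made for illustration only.

```python
import math


def recombination_fraction(recombinants: int, total: int) -> float:
    """Fraction of recombinant genotypes among all scored progeny."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need total > 0 and 0 <= recombinants <= total")
    return recombinants / total


def haldane_cm(r: float) -> float:
    """Haldane mapping function (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5); r = 0.5 means unlinked")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi mapping function (allows for crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5); r = 0.5 means unlinked")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    # Hypothetical scoring result: 10 recombinants among 100 progeny.
    r = recombination_fraction(recombinants=10, total=100)
    print(f"r = {r:.2f}")
    print(f"Haldane distance: {haldane_cm(r):.2f} cM")
    print(f"Kosambi distance: {kosambi_cm(r):.2f} cM")
```

For small recombination fractions both functions give distances close to 100 × r cM; they diverge as r approaches 0.5, which is why unlinked markers cannot be assigned a finite distance.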
The segregation of these polymorphic markers in the progenies is then scored for parental or recombinant behaviour. Markers that are close together or tightly linked will be transmitted together from parent to progeny more frequently than markers that are located further apart. In a segregating population, there is a mixture of parental and recombinant genotypes.

Analysing the Genotype Score: Chi-Square Test

The genetic segregation ratio at a given marker locus is jointly determined by the nature of the marker (dominant/co-dominant; see Box 3.3)

and types of mapping populations (Table 3.3). Therefore, a thorough understanding of the nature of markers and mapping populations is crucial for any mapping project. Markers such as RFLPs, microsatellites and CAPS are co-dominant in nature, while AFLP, RAPD and ISSR are often scored as dominant markers. Mapping populations such as RILs and DHs equalise marker type because of fixation of parental alleles at the marker locus in homozygous condition. These populations result in a 1:1 segregation ratio at the marker locus irrespective of the genetic nature of the markers. In contrast, an F2 population segregates in a 1:2:1 ratio for a co-dominant marker and in a 3:1 ratio for a dominant marker (refer to Table 3.3 for other types of segregation). Depending upon the segregation pattern, statistical analysis of marker data will vary.
Significant deviation from the expected segregation ratio in a given marker-population combination is referred to as segregation distortion. There are several reasons for segregation distortion. It may be due to gamete/zygote lethality, meiotic drive/preferential segregation, sampling/selection during population development and differential responses of parental lines to tissue culture in the case of DHs. Segregation distortion can also be specific to some markers in an otherwise normal mapping population. It is therefore important that the goodness of fit of the segregation ratio be tested for each individual marker locus and, if necessary, that markers showing a high degree of segregation distortion be eliminated from the analysis.

Reject the hypothesis of goodness of fit to the given ratio if the computed χ² value exceeds the corresponding tabulated χ² value at the given level of significance (i.e. 1% or 5%). The chi-square test can be done using the program AntMap. This program is freely available at http://lbm.ab.a.u-tokyo.ac.jp/~iwata/antmap/ . The following simple steps are sufficient to perform chi-square analysis using AntMap (for further advanced analyses, refer to the tutorial given on the same website).

Step 1: Open an Input File

Open an input file in MapMaker format (*.raw) through the File-Open menu. Refer to Chapter 4 for how to prepare a *.raw file. After opening the file, the contents of the file will appear in the Data panel. When the Log tab is clicked, you can see a summary of the input data.

Step 2: Segregation Ratio Test

Select Segregation Test from the Analysis menu. The results of the segregation ratio tests then appear in the Result panel. If the P value is <0.01, the marker is flagged with ** (highly significant); for a P value of 0.01–0.05, it is flagged with * (significant). In other words, the P value indicates whether the data set deviates from the hypothesised frequency distribution at the 1% or 5% level of significance.
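
The segregation test can also be reproduced by hand. The following is a minimal sketch and not AntMap's own code: it computes the χ² statistic as the sum of (O − E)²/E over the genotype classes, derives the P value for 1 or 2 degrees of freedom (enough for 1:1, 3:1 and 1:2:1 ratios) and flags the result with the same */** convention. The function names and the example counts are hypothetical.

```python
import math


def chi_square(observed, ratio):
    """Return (chi2, df) for observed class counts against an expected ratio."""
    if len(observed) != len(ratio) or len(observed) < 2:
        raise ValueError("observed and ratio need the same length (>= 2)")
    if any(part <= 0 for part in ratio):
        raise ValueError("every ratio term must be positive")
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * part / ratio_sum for part in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1


def p_value(chi2, df):
    """Upper-tail P value, using the closed forms for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df == 2:
        return math.exp(-chi2 / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def flag(p):
    """'**' for P < 0.01, '*' for 0.01 <= P < 0.05, '' otherwise."""
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


if __name__ == "__main__":
    # Hypothetical F2 counts for a co-dominant marker (AA, Aa, aa), expected 1:2:1.
    chi2, df = chi_square([30, 50, 20], [1, 2, 1])
    p = p_value(chi2, df)
    print(f"chi2 = {chi2:.3f}, df = {df}, P = {p:.4f} {flag(p)}")
```

A marker without a flag shows no significant deviation from the expected ratio; a flagged marker is a candidate for segregation distortion.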

χ² Test to Analyse the Segregation Ratio Using the Program AntMap

The chi-square (χ²) test is the most commonly used statistical analysis to test a hypothesis concerning the frequency distribution or segregation pattern in genetics.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency. Compare the computed χ² value with the tabulated χ² value.

Bibliography

Literature Cited

Bachem CWB, van der Hoeve RS, de Bruijn SM, Vreugdenhil D, Zabeau M, Visser RGF (1996) Visualisation of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Plant J 9:745–753
Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL et al (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3(10):e3376. doi:10.1371/journal.pone.0003376

Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–333
Brody JR, Calhoun ES, Gallmeier E, Creavalle TD, Kern SE (2004) Ultra-fast high-resolution agarose electrophoresis of DNA and RNA using low-molarity conductive media. Biotechniques 37(4):598–602
Caetano-Anollés G, Bassam BJ (1993) DNA amplification fingerprinting using arbitrary oligonucleotide primers. Appl Biochem Biotechnol 42:189–200
Chang RY, O'Donoughue LS, Bureau TE (2001) Inter-MITE polymorphisms (IMP): a high throughput transposon-based genome mapping and fingerprinting approach. Theor Appl Genet 102:773–781
Cronn RC, Adams KL (2003) Quantitative analysis of transcript accumulation from genes duplicated by polyploidy using cDNA-SSCP. Biotechniques 34:726–734
Flavell AJ, Knox M, Pearce SR, Ellis THN (1998) Retrotransposon based insertion polymorphisms (RBIP) for high throughput marker analysis. Plant J 16:643–665
He DH, Lin ZX, Zhang XL, Nie YC, Guo XP, Zhang YX, Li W (2007) QTL mapping for economic traits based on a dense genetic map of cotton with PCR-based markers using the interspecific cross of Gossypium hirsutum × Gossypium barbadense. Euphytica 153(1):181–197
Hu J, Vick BA (2003) Target region amplification polymorphism: a novel marker technique for plant genotyping. Plant Mol Biol Rep 21:289–294
Huang J, Sun M (1999) A modified AFLP with fluorescence labelled primers and automated DNA sequencer detection for efficient fingerprinting analysis in plants. Biotechnol Tech 14:277–278
Jordan SA, Humphries P (1994) Single nucleotide polymorphism in exon 2 of the BCP gene on 7q31-q35. Hum Mol Genet 3:1915
Kalendar R, Grob T, Regina M, Suoniemi A, Schulman A (1999) IRAP and REMAP: two new retrotransposon-based DNA fingerprinting techniques. Theor Appl Genet 98:704–711
Komori T, Nitta N (2005) Utilization of CAPS/dCAPS method to convert rice SNPs into PCR-based markers. Breed Sci 55:93–98
Li G, Quiros CF (2001) Sequence-related amplified polymorphism (SRAP), a new marker system based on a simple PCR reaction: its application to mapping and gene tagging in Brassica. Theor Appl Genet 103:455–546
Makino R, Yazyu H, Kishimoto Y, Sekiya T, Hayashi K (1992) F-SSCP: fluorescence-based polymerase chain reaction single-strand conformation polymorphism (PCR-SSCP) analysis. PCR Methods Appl 2:10–13
McCallum CM, Comai L, Greene EA, Henikoff S (2000) Targeted screening for induced mutations. Nat Biotechnol 18:455–457
Michaels SD, Amasino RMA (1998) A robust method for detecting single nucleotide changes as polymorphic markers by PCR. Plant J 14:381–385
Mullis KB, Faloona F (1987) Specific synthesis of DNA in vitro via polymerase chain reaction. Methods Enzymol 155:350–355
Orita M, Iwahana H, Kanazawa H, Hayashi K, Sekiya T (1989) Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphism. Proc Natl Acad Sci USA 86:2766–2770
Paran I, Michelmore RW (1993) Development of reliable PCR-based markers linked to downy mildew resistance genes in lettuce. Theor Appl Genet 85:985–999
Schuelke M (2000) An economic method for the fluorescent labelling of PCR fragments. Nat Biotechnol 18:233–234
Schwartz DC, Cantor CR (1984) Separation of yeast chromosome-sized DNAs by pulsed field gradient electrophoresis. Cell 37:67–75
Tanksley SD, McCouch SR (1997) Seed banks and molecular maps: unlocking genetic potential from the wild. Science 277:1063–1066
Tautz D, Renz M (1984) Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res 12(10):4127–4138
van den Broeck D, Maes T, Sauer M, Zethof J, De Keukeleire P, D'Hauw M, Van Montagu M, Gerats T (1998) Transposon Display identifies individual transposable elements in high copy number lines. Plant J 13:121–129
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Frijters A, Pot J, Peleman J, Kuiper M, Zabeau M (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 23:4407–4414
Wang X, Zhiyuan F, Sanwen H, Peitian S, Yumei L, Limei Y, Mu Z, Dongyu Q (2000) An extended random primer amplified region (ERPAR) marker linked to a dominant male sterility gene in cabbage (Brassica oleracea var. capitata). Euphytica 112:267–273
Waugh R, McLean K, Flavell AJ, Pearce SR, Kumar A, Thomas WTB, Powell W (1997) Genetic distribution of Bare-1-like retrotransposable elements in the barley genome revealed by sequence-specific amplification polymorphisms (SSAP). Mol Gen Genet 253:687–694
Weining S, Langridge P (1991) Identification and mapping of polymorphisms in cereals based on the polymerase chain reaction. Theor Appl Genet 82:209–216
Weising K, Gardner RC (1999) A set of conserved PCR primers for the analysis of simple sequence repeat polymorphisms in chloroplast genomes of dicotyledonous angiosperms. Genome 42:9–11
Welsh J, McClelland M (1990) Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Res 18:7213–7218
Welsh J, Chada K, Dalal SS, Ralph D, Cheng R, McClelland M (1992) Arbitrarily primed PCR fingerprinting of RNA. Nucleic Acids Res 20:4965–4970
Williams JGK, Kubelik AR, Livak KJ, Rafalski JA, Tingey SV (1991) DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Res 18:6531–6535

Wu KS, Jones R, Danneberger L, Scolnik P (1994) Detection of microsatellite polymorphisms without cloning. Nucleic Acids Res 22:3257–3258

Further Readings

Agarwal M, Shrivastava N, Padh H (2008) Advances in molecular marker techniques and their applications in plant sciences. Plant Cell Rep 27:617–631
Eathington SR et al (2007) Molecular markers in a commercial breeding program. Crop Sci 47(S3):S154–S163
Jena KK, Mackill DJ (2008) Molecular markers and their use in marker assisted selection in rice. Crop Sci 48:1266–1277
Lorz H, Wenzel G (2005) Molecular marker systems in plant breeding and crop improvement, Biotechnology in agriculture and forestry 55. Springer, New York
Van Bueren L et al (2010) The role of molecular markers and marker assisted selection in breeding for organic agriculture. Euphytica 175:51–64

4 Linkage Map Construction

Genome mapping methods are generally divided into two categories: (1) genetic or linkage mapping and (2) physical mapping. Genetic mapping is based on the use of genetic techniques to construct maps showing the positions of genes and other sequence features on a genome, whereas physical maps are constructed by directly sequencing DNA molecules, and such a physical map shows the positions of sequence features, including genes. There is yet another map in genome analysis, which is called the cytogenetic map. It is a genetic term used to describe the visual appearance of a chromosome (known as karyotype) when chromosomes are stained and examined under a microscope. A physical map identifies the actual physical position of genes and other DNA elements on the chromosomes and facilitates positional cloning of agronomically important genes and the analysis of chromosomes and genome structure in detail (refer to Chapter 7 for a detailed description). This chapter focuses on a detailed description of genetic or linkage mapping, besides briefly portraying the other two types of mapping procedure.

Basics of Genetic/Linkage Mapping: Mendelian Ratios, Meiosis, Crossing Over and Partial Linkage

Like a geographic map, a genetic map must show the positions of distinctive features, since both of these maps share the same analogy. In a geographic map, these markers are recognisable components of the landscape, such as rivers, ponds, elevations, roads and buildings. Similarly, to describe the genetic landscape, morphological markers, isozymes and nucleic acid-based markers are used (discussed in Chap. 3).
The principle of genetic mapping was conceptualised more than a century ago. The discovery of genetic linkage, first reported in 1905 in the sweet pea by Bateson and colleagues (however, it was referred to as coupling during that period), and the observation by Morgan (that the amount of crossing over between genes indicates the distance between them on a chromosome) helped Sturtevant to develop the first genetic map in 1913.
Visual appearance or morphological markers were initially used to construct the first genetic maps in the early decades of the twentieth century for organisms such as the fruit fly. To be useful in genetic analysis, a morphological trait should have heritable characteristics, that is, it has to exist in at least two alternative forms or phenotypes (e.g. having tall or short stems in the pea plants originally studied by Mendel). Each phenotype is specified by a different allele of the corresponding gene, and those phenotypes should be distinguishable by visual examination. For example, the first fruit-fly maps showed the positions of genes for body colour, eye colour, wing shape etc., since all of these phenotypes are visible simply by looking at the flies with a low-power microscope or the naked eye. As discussed in Chap. 3, it was soon realised that there were only a limited number of visual phenotypes

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_4, Springer India 2013

whose inheritance could be studied, and in many cases their analysis was complicated because a single phenotype could be affected by more than one gene. For example, by 1922 over 50 genes had been mapped onto the four fruit-fly chromosomes, but nine of these were for eye colour. In other words, it was very difficult to distinguish between fly eyes that were coloured red, light red, vermilion, garnet, carnation, cinnabar, ruby, sepia, scarlet, pink, cardinal, claret, purple or brown. To make gene maps more comprehensive, it would be necessary to find characteristics that were more distinctive and less complex than visual ones. The answer was to use knowledge of biochemistry (isozymes) and molecular biology (DNA- or RNA-based markers) to distinguish phenotypes. Hence, in order to prepare a comprehensive genetic map (i.e. to have complete coverage of the genome), we need a large set of markers.

Once a set of distinguishable or polymorphic markers has been assembled, the next step is to construct the linkage map. The technique involved in linkage map construction is based on genetic linkage, a concept that builds on the principles of inheritance discovered in the nineteenth century by Gregor Mendel. Mendel studied seven pairs of contrasting characteristics in his pea plants, one of which was violet versus white flower colour. Two important points must be considered to understand the concept of genetic linkage: (1) Pure-breeding plants always give rise to flowers with the parental colour. These plants are homozygotes, each possessing a pair of identical alleles, denoted by VV for violet flowers and WW for white flowers. (2) When two pure-breeding plants are crossed, only one of the phenotypes is seen in the F1 generation.

Genetic mapping is based on the principles of inheritance as first described by Gregor Mendel in 1865. From the results of his breeding experiments with peas, Mendel concluded that each pea plant possesses two alleles for each gene, but displays only one phenotype. As discussed above, this is easy to understand if the plant is pure breeding, or homozygous, for a particular characteristic, since it possesses two identical alleles and displays the appropriate phenotype. However, Mendel showed that if two pure-breeding plants with different phenotypes are crossed, then all the progeny (the F1 generation) display the same phenotype. These F1 plants must be heterozygous, meaning that they possess two different alleles, one for each phenotype, one allele inherited from the mother and one from the father. Mendel postulated that in this heterozygous condition one allele overrides the effects of the other allele; he therefore described the phenotype expressed in the F1 plants as being dominant over the second, recessive phenotype. Thus, Mendel's first law (alleles segregate randomly) and second law (pairs of alleles segregate independently) help to predict the outcome of genetic crosses. This is the perfectly correct interpretation of the interaction between the pairs of alleles studied by Mendel.

This study helped to introduce the concept of recombination. When two characters are considered, a gamete is said to be parental or nonrecombinant if the genes governing the two characters were both inherited from the same parent. It is said to be recombinant if the genes it contains for the two characters were inherited from different parents. In the above example, an F1 individual may pass on to an offspring one of the four gametes, WV, Wv, wV or wv. In this example, Wv and wV are recombinant gametes because they represent a mixing of genetic material which had been inherited separately. Mendel's second law specifies that a given gamete has a chance of 1/2 of being recombinant (i.e. at most half of the progeny are recombinant).

However, it was later noticed that this simple dominant-recessive rule can be complicated by situations that Mendel did not encounter. One of these is incomplete dominance, where the heterozygous phenotype is intermediate between the two homozygous forms. An example is when red carnations are crossed with white ones, the F1 heterozygotes being pink. Another complication is co-dominance, when both alleles are detectable in the heterozygote. Co-dominance is the typical situation for DNA markers.

Further, his second law was also questioned since it was soon established that genes reside on chromosomes, and all organisms have many more genes than chromosomes. If the chromosomes are inherited as intact units, then the alleles of genes
should also be inherited together since they are on the same chromosome. Correns in 1913 described the phenomenon of complete linkage or complete gametic coupling, in which alleles of two or more different characters appeared to be always inherited together rather than independently (i.e. no recombination was observed between them). This is the principle of genetic linkage.

Although this seems to violate Mendel's second law, an obvious extension of his theory would be to assume that the genes for these characters are physically attached. Further, the chromosome theory of heredity formulated by Sutton in 1903 also provided a physical mechanism for Mendel's law if it was assumed that the independent Mendelian characters lay on different chromosomes and that those which were completely linked lay on the same chromosome.

Although the chromosome theory was shown to be correct, the results did not turn out exactly as expected. The complete linkage that had been anticipated between many pairs of genes failed to materialise. Pairs of genes were either inherited independently, as expected for genes on different chromosomes, or, if they showed linkage, then it was only partial linkage, that is, sometimes they were inherited together and sometimes they were not.

Partial linkage was discovered in the early twentieth century. When a cross was carried out by Bateson, Saunders and Punnett in 1905 with sweet peas, the parental cross gave the typical dihybrid result, with all the F1 plants displaying the same parental phenotype. However, the F1 cross gave unexpected results, as the progenies showed neither a 9:3:3:1 ratio (expected for genes on different chromosomes) nor a 3:1 ratio (expected if the genes are completely linked). An unusual ratio is typical of partial linkage. Partial linkage was explained later when the behaviour of chromosomes during meiosis was elucidated at the molecular level.

It was Thomas Hunt Morgan who made the conceptual link between partial linkage and the behaviour of chromosomes when the nucleus of a cell divides. Cytologists in the late nineteenth century had distinguished two types of nuclear division: mitosis and meiosis. Mitosis is more common, being the process by which the diploid nucleus of a somatic cell divides to produce two daughter nuclei, both of which are diploid. Before mitosis begins, each chromosome in the nucleus is replicated, but the resulting daughter chromosomes do not immediately break away from one another. To begin with they remain attached at their centromeres and by cohesin proteins which hold together the arms of the replicated chromosomes. The daughters do not separate until later in mitosis when the chromosomes are distributed between the two new nuclei. Obviously it is important that each of the new nuclei receives a complete set of chromosomes.

In contrast, at the start of meiosis the chromosomes condense, and each homologous pair lines up to form a bivalent (Fig. 4.1). Within the bivalent, crossing over might occur, involving breakage of chromosome arms and exchange of DNA. Meiosis then proceeds by a pair of mitotic nuclear divisions that result initially in two nuclei, each with two copies of each chromosome still attached at their centromeres, and finally in four nuclei, each with a single copy of each chromosome. These final products of meiosis, the gametes, are therefore haploid.

Mitosis illustrates the basic events occurring during nuclear division but is not directly relevant to genetic mapping. Instead, it is the distinctive features of meiosis that interest us. Meiosis occurs only in reproductive cells and results in a diploid cell giving rise to four haploid gametes, each of which can subsequently fuse with a gamete of the opposite sex during sexual reproduction. The fact that meiosis results in four haploid cells whereas mitosis gives rise to two diploid cells is easy to explain: meiosis involves two nuclear divisions, one after the other, whereas mitosis is just a single nuclear division. This is an important distinction, but the critical difference between mitosis and meiosis is more subtle. Recall that in a diploid cell there are two separate copies of each chromosome, referred to as pairs of homologous chromosomes. During mitosis, homologous chromosomes remain separate from one another, each member of the pair replicating and being passed to a daughter nucleus independently of its homologue. In meiosis, however, the
pairs of homologous chromosomes are by no means independent. During meiosis I, each chromosome lines up with its homologue to form a bivalent. This occurs after each chromosome has replicated, but before the replicated structures split, so the bivalent in fact contains four chromosome copies, each of which is destined to find its way into one of the four gametes that will be produced at the end of meiosis. Within the bivalent, the chromosome arms (the chromatids) can undergo physical breakage and exchange of segments of DNA (see Fig. 4.1). This is called crossing over or recombination and was discovered by the Belgian cytologist Janssens in 1909. This was just 2 years before Morgan started to think about partial linkage.

Fig. 4.1 Features of meiosis
[Figure: homologous chromosomes carrying the alleles P/p, Q/q and R/r are followed through the stages Interphase, Prophase I, Metaphase I, Anaphase I, Prophase II and Telophase II. Labels in the figure: Homologous chromosomes; Chiasma; Crossing over has occurred; Recombinant chromatids; Recombinant gametes.]

This discovery of crossing over helped Morgan to explain partial linkage. To understand this we need to think about the effect of crossing over on the inheritance of genes. Let us consider two genes, each of which has two alleles. We will call the first gene A and its alleles A and a, and the second gene B with alleles B and b. Imagine that the two genes are located on chromosome number 2 of Drosophila melanogaster (fruit fly), the organism used by Morgan. We are going to follow the meiosis of a diploid nucleus in which one copy of chromosome 2 has alleles A and B, and the second has a and b. In such a scenario there are two alternatives (as depicted in Fig. 4.2):
1. A crossover does not occur between genes A and B. If this happens, then two of the resulting gametes will contain chromosome copies with alleles A and B, and the other two will contain a and b. In other words, two of the gametes have the genotype AB, and two have the genotype ab.
2. A crossover does occur between genes A and B. This leads to segments of DNA containing gene B being exchanged between homologous chromosomes as shown in Fig. 4.2. The eventual result is that each gamete has a different genotype: 1 AB, 1 aB, 1 Ab, 1 ab.

Fig. 4.2 The effect of crossover on linked genes
[Figure: a pair of homologous chromosomes, A B and a b, is followed through Prophase II and Telophase II. If there is no crossover, the gamete genotypes are 2 AB : 2 ab. If a crossover occurs between A and B, the gamete genotypes are 1 AB : 1 aB : 1 Ab : 1 ab.]

Now think about what would happen if we looked at the results of meiosis in 100 identical cells. If crossovers never occur, then the resulting gametes will have the following genotypes: 200 AB and 200 ab. This is complete linkage: genes A and B behave as a single unit during meiosis. But if crossovers occur between A and B in some of the nuclei (as is more likely), then the allele pairs will not be inherited as single units. Let us say that crossovers occur during 40 of the 100 meioses.
The following gametes will result: 160 AB, 160 ab, 40 Ab, 40 aB. In this context, the linkage is not complete; it is only partial. The gametes are classed as parental genotypes (AB, ab) and recombinant genotypes (Ab, aB). In the example, the combinations aB and Ab did not appear in the parental cells. These new combinations are the result of recombination and are therefore called recombinant genotypes.

[Data shown in Fig. 4.3]
Recombination frequencies:
Between miniature wings (m) and vermilion eyes (v) = 3.0%
Between miniature wings (m) and yellow body (y) = 33.7%
Between vermilion eyes (v) and white eyes (w) = 29.4%
Between white eyes (w) and yellow body (y) = 1.3%
Deduced map positions: y = 0, w = 1.3, v = 30.7, m = 33.7
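
The 100-meiosis example above can be checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions: every meiosis that has a crossover between A and B is assumed to have exactly one, the parent is assumed to carry AB on one chromosome and ab on the other, and the helper names `gamete_counts` and `recombination_frequency` are hypothetical, not taken from any mapping package. Recombination frequency is taken here as the percentage of recombinant gametes among all gametes.

```python
def gamete_counts(n_meioses, n_with_crossover):
    """Tally the gametes produced by an AB/ab parent.

    Each meiosis yields four gametes. Without a crossover between A and B
    they are 2 AB : 2 ab; with a single crossover they are
    1 AB : 1 Ab : 1 aB : 1 ab (assumption: at most one crossover per meiosis).
    """
    if not 0 <= n_with_crossover <= n_meioses:
        raise ValueError("n_with_crossover must lie between 0 and n_meioses")
    n_without = n_meioses - n_with_crossover
    return {
        "AB": 2 * n_without + n_with_crossover,  # parental
        "ab": 2 * n_without + n_with_crossover,  # parental
        "Ab": n_with_crossover,                  # recombinant
        "aB": n_with_crossover,                  # recombinant
    }


def recombination_frequency(counts, recombinant_types=("Ab", "aB")):
    """Percentage of recombinant gametes among all gametes."""
    total = sum(counts.values())
    recombinants = sum(counts[g] for g in recombinant_types)
    return 100.0 * recombinants / total


counts = gamete_counts(n_meioses=100, n_with_crossover=40)
print(counts)                           # {'AB': 160, 'ab': 160, 'Ab': 40, 'aB': 40}
print(recombination_frequency(counts))  # 20.0
```

Note that crossovers in 40% of the meioses give a recombination frequency of only 20%, because each crossover affects two of the four chromatids. Under these assumptions, even a crossover in every meiosis would give 50% recombinant gametes, which matches the upper limit of one half mentioned earlier for independently segregating genes.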
Once Morgan had understood how partial linkage could be explained by crossing over during meiosis, he was able to devise an experiment that paved the way to map the relative positions of genes on a chromosome. In fact the most important work was done not by Morgan, but by an undergraduate in his laboratory, Arthur Sturtevant, in 1913. Sturtevant assumed that crossing over was a random event, there being an equal chance of it occurring at any position along a pair of lined-up chromatids. If this assumption is correct, then two genes that are close together will be separated by crossovers less frequently than two genes that are more distant from one another. Furthermore, the frequency with which the genes are unlinked by crossovers will be directly proportional to how far apart they are on their chromosome. The recombination frequency is therefore a measure of the distance between two genes. If you work out the recombination frequencies for different pairs of genes, you can construct a map of their relative positions on the chromosome.

The way in which the recombination frequency calculation has helped in the construction of a genetic map is explained below. Let us consider the original experiments carried out with fruit flies by A. Sturtevant (explained in Fig. 4.3). He took four genes (in his period a gene was not defined at the molecular level; instead genes were considered as entities responsible for the heritability of traits from parent to offspring). All four genes are on the X chromosome of the fruit fly. By making experimental crosses, he observed the number of parental and recombinant genotypes among the progenies. Recombination frequencies between the genes were calculated as

$$\text{Recombination frequency} = \frac{\text{Number of recombinants}}{\text{Total number of progenies}} \times 100\,\%.$$

The calculated recombination frequencies between these four genes were used to depict the distance between the investigated genes. These are shown along with their deduced map positions in Fig. 4.3.

Fig. 4.3 Construction of a genetic map using recombination frequencies

Thus, it is clear that the resolution of a genetic map depends on the number of crossovers that have been scored (the more crossover events sampled, the higher the resolution). This is not a major problem for microorganisms because these can be obtained in huge numbers, enabling many crossovers to be studied, resulting in a highly detailed genetic map in which the markers are just a few kb apart. For example, when the Escherichia coli genome sequencing project began in 1990 to construct the physical map, the latest genetic map for this organism comprised over 1,400 markers, an average of one per 3.3 kb (kilobase pairs). This was sufficiently detailed to direct the sequencing program without the need for extensive physical mapping. Similarly, the Saccharomyces cerevisiae project was supported by a fine-scale genetic map (approximately 1,150 genetic markers, on average one per 10 kb). The problem with humans and most other eukaryotes is that it is simply not possible to obtain large numbers of progeny, so relatively few crossover events can be studied, and the resolving power of linkage analysis is restricted. This means that genes that are several tens of kb apart may appear at the same position on the genetic map, and thus such genetic maps have limited accuracy.

Sturtevant's assumption was that crossovers occur at random along chromosomes. However, when molecular
data were generated, it was realised that this assumption is only partly correct because the presence of recombination hotspots means that crossovers are more likely to occur at some points than at others. The effect that this can have on the accuracy of a genetic map was illustrated in 1992 when the complete sequence of S. cerevisiae chromosome III was published, enabling the first direct comparison to be made between a genetic map and the actual positions of markers as shown by DNA sequencing. There were considerable discrepancies, even to the extent that one pair of genes had been ordered incorrectly by genetic analysis. Figure 4.4 compares part of the genetic map with the physical map (determined by DNA sequencing) and shows some of these discrepancies. Note that the order of the upper two markers (glk1 and cha1) is incorrect on the genetic map, and that there are also differences in the relative positioning of other pairs of markers.

Fig. 4.4 Comparison between part of the genetic and physical maps of Saccharomyces cerevisiae chromosome 3
[Figure: the physical map and the genetic map are drawn side by side, each showing the markers cha1, glk1, his4, SUP53, leu2, the centromere, pgk1, pet18, cry1, MAT, thr4, SUP61 and ABT1.]

It is worth mentioning here that S. cerevisiae is one of the two eukaryotes (the fruit fly is the second) whose genomes have been subjected to intensive genetic mapping. If the yeast genetic map is inaccurate, then how precise are the genetic maps of organisms subjected to less detailed analysis? These two limitations of genetic mapping (limited resolution and limited accuracy) clearly stress the point that for most eukaryotes a genetic map must be checked and supplemented by alternative mapping procedures (such as cytogenetic mapping or fluorescent in situ hybridization (FISH)) before large-scale DNA sequencing begins.

Thus, Sturtevant's assumption about the randomness of crossovers was not entirely justified. Comparisons between genetic maps and the actual positions of genes on DNA molecules, as revealed by physical mapping and DNA sequencing, have shown that some regions of chromosomes, called recombination hotspots, are more likely to be involved in crossovers than others. This means that a genetic map distance does not necessarily indicate the physical distance between two markers. Also, we now realise that a single chromatid can participate in more than one crossover at the same time but that there are limitations on how close together these crossovers can be, leading to more inaccuracies in the mapping procedure. Despite these limitations, linkage analysis usually makes correct deductions about gene/marker order, and distance estimates are sufficiently accurate to generate genetic maps that are of value as frameworks for genome sequencing projects. The following section describes the basic principles involved in the construction of linkage or genetic maps using different algorithms, since mapping cannot be done manually with a large number of markers.

Mapping Functions

From the above explanation, it is clear that two genes are said to be linked if they are located on the same chromosome, given that different chromosomes segregate independently during meiosis. Therefore, for two genes located on different chromosomes, we may assume that their alleles
also segregate independently. The chance that an allele at one locus coinherits with an allele at another locus of the same parental origin is then 0.5 (1/2), and such genes are unlinked. Thus, in the above example, the chance that A/B or a/b are coinherited by the offspring is 0.5 if the genes are unlinked. This chance increases if the genes are linked. We can observe a degree of linkage. The reason is that even if genes are located on the same chromosome, they have a chance of not being inherited as in the parental state due to recombination. The greater the distance between two genes, the more frequently there will be a crossover, and hence the higher the number of recombinations.

It should also be noted that the combinations aB and Ab are not always the recombinants. For example, if the F1 was made from a parental cross AAbb × aaBB, then the recombinant gametes would be AB and ab. Therefore, we have to determine how the alleles were joined in the parental generation. This is known as the phase. If AB and ab were joined in the parental gametes, the gene pairs are said to be in coupling phase (as in the cross AABB × aabb). Otherwise, as in the cross AAbb × aaBB, the gene pairs are in repulsion phase. These terms can be somewhat messy if there are no dominant or mutant alleles.

Another two genetic phenomena to be noticed at this point are linkage equilibrium and its opposite, linkage disequilibrium. These are terms used for the chance of coinheritance of alleles at different loci. Alleles that are in random association are said to be in linkage equilibrium. Linkage disequilibrium can be the result of physical linkage of genes, but it can also occur when the genes are on different chromosomes (refer to Chap. 6 for more details).

The main idea of linkage or genetic mapping is finding those genes/markers that are linked together and coinherited by the next generation. Modern linkage analysis uses not only genes that code for proteins that produce observable traits but also neutral markers (refer to Chap. 3 for more detail). Markers are mapped relative to one another on chromosomes and used as signposts against which to map genes of interest that are linked with a marker. This process of finding the linked markers/genes is referred to as grouping.

The distance between two genes is determined by their recombination fraction. Map distances are expressed in Morgans (M) or, more commonly, in centimorgans (cM; 1 M = 100 cM). One Morgan is the distance over which, on average, one crossover occurs per meiosis. Sturtevant established the genetic map unit, cM, by defining a portion of the chromosome of such length that, on average, one crossover will occur in it out of every 100 gametes formed (Sturtevant 1913).

When considering the mapping of more than two markers/genes on the genetic map, it would be very handy if the distances on the map were additive. However, genetic studies have shown that recombination fractions are not additive (the recombination fraction is not the best estimate of genetic distance since it has a certain variability). For example, consider the loci A, B and C. The recombination fraction between A and C is not equal to the sum of the recombination fractions A-B and B-C.

If the recombination fraction between A and B is $r_1$ and that between B and C is $r_2$, then the recombination fraction between A and C, $r_{12}$, depends on the existence of interference. If recombination between A and B (with probability $r_1$) is independent of recombination between B and C (with probability $r_2$), we say that there is no interference. In that case, A and C recombine only when exactly one of the two intervals recombines, so that $r_{12} = r_1(1 - r_2) + r_2(1 - r_1) = r_1 + r_2 - 2r_1r_2$.

Interference is the effect in which the occurrence of a crossover in a certain region reduces the probability of a crossover in the adjacent region. It is reflected in the frequency of double crossovers. If there is complete interference, a crossover in one region completely suppresses recombination in adjacent regions. In that case $r_{12} = r_1 + r_2$, that is, the recombination fractions are additive. Also within small distances, the term $2r_1r_2$ may be ignored, and recombination fractions are nearly additive. More generally, double recombinants cannot be ignored, and recombination fractions are not additive. If distances were not additive, it would be necessary to redo a genetic map each time new loci (markers/genes) are discovered. To avoid this problem, the distances on the genetic map are calculated using a mapping function. A mapping function translates recombination
frequencies between two loci into a map distance in cM.

A mapping function gives the relationship between the distance between two chromosomal locations on the genetic map (in cM) and their recombination frequency. Thus, the properties of a good mapping function are:
1. Distances are additive, that is, the distance AC should be equal to AB + BC if the order is ABC.
2. Large distances (well beyond 50 cM) should translate into a recombination fraction that approaches, but does not exceed, 50%.

In general, a mapping function depends on the interference assumed. With complete interference, and within small distances, a mapping function is simply:

$$d = r,$$

where $d$ is the map distance and $r$ is the recombination fraction.

There are two commonly used mapping functions: the Haldane mapping function and the Kosambi mapping function. With no interference (i.e. all crossovers occur independently of one another), the Haldane mapping function is appropriate:

$$d = -\frac{1}{2}\ln(1 - 2r),$$

whereas Kosambi's mapping function allows some positive interference (i.e. one chiasma deters the occurrence of a second in close proximity to the first), and hence the distance is calculated as

$$d = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r}.$$

In both formulas $d$ is expressed in Morgans; multiply by 100 to obtain cM. Based on several studies, it is established that there is little difference between the different mapping functions when the distance is below 15 cM.

Thus, a mapping function provides a better estimate of genetic distance than the recombination fraction used by Sturtevant. On the other hand, as stated earlier, there is no general relationship between genetic distance and physical distance (in base pairs). There is a large variability between species in the average number of kilobase pairs (kb) per cM. Even within chromosomes there is variation, with some regions having fewer crossovers, and therefore more kb per cM, than others. Further, it should be noted that the estimation of genetic map distances is highly influenced by chemical agents and radiation present during the experiments (which can increase the recombination frequency), plasmagenes, genotype, chromosomal aberrations, distance from the centromere etc. Thus, there are always certain differences between genetic distance and physical distance, since genetic distance estimation is affected by more factors.

Mapping of Genetic Markers: Practical Considerations

From the foregoing discussion, it is clear that markers can be genetically mapped relative to each other by:
1. Determining recombination fractions
2. Using a mapping function

Recombination fractions between genetic markers can be estimated from a mapping population (see Chap. 2 for the different types of mapping populations and their importance in genetic mapping). Since we can observe complete marker genotypes in every progeny of the mapping population, it is easy to calculate the recombination fraction. Recombination fractions are estimated from the proportion of recombinant gametes. This is relatively easy to determine if we know the linkage phase in the parents (the haplotype of the gamete that was transmitted from parent to offspring). If the linkage phase is known in the parents, we can tell which gametes are recombinant and which are nonrecombinant. However, in practice, linkage phases are not always known. This is especially the case in animals, as it is hard to create inbred lines. And markers are often in linkage equilibrium, even across breeds. If the linkage phase is not known, we can usually infer
the parental linkage phase, as the number of recombinants is expected to be smaller than the number of nonrecombinants.

However, by chance there may be more recombinants than nonrecombinants. To this end, maximum likelihood is used to determine the most likely phase, and therefore to determine the most likely recombination fraction. Information about the gamete that was received by an offspring depends on the genotypes of offspring and parents. If parents and offspring are all heterozygous (e.g. Aa), then we do not know which allele was paternal and which was maternal. If the marker genotypes of the parents are not heterozygous, we have no information about recombination events during their meiosis. For example, if the sire has genotype AB/Ab we cannot distinguish recombinant from nonrecombinant gametes. However, if one parent is homozygous, it increases the chance of having informative meioses in the other parent.

Testing for Linkage: LOD Scores

Besides estimating the most likely recombination fraction, it is important to test or validate those estimates statistically. In particular we want to test whether or not two loci are really linked. Therefore, the statistical test to perform compares the likelihood of a certain recombination fraction (r) with the likelihood of no linkage (r = 0.5). Different likelihoods are usually compared by taking their ratio. The base-10 logarithm of this likelihood ratio, known as the LOD score (an abbreviation of logarithm of the odds), is the most widely used measure:

$$\mathrm{LOD} = \log_{10}\frac{L(r)}{L(r = 0.5)}.$$

It was introduced by Haldane and Smith in 1947 and is considered a key concept in linkage analysis.

A LOD score above 3 is generally used as a critical value. A LOD score of >3 implies that the null hypothesis (r = 0.5) is rejected. This value implies a ratio of likelihoods of 1,000 to 1 (i.e. the observed data are 1,000 times more likely under linkage at the estimated r than under no linkage). This seems like a very stringent criterion. However, it accounts for the prior probability of linkage. Due to the finite number of chromosomes, there is a reasonable probability (e.g. 5% in humans with 23 chromosome pairs) that two random loci are linked. Nevertheless, different LOD thresholds should be used for different data sets. When a pair of markers is considered during the analysis, it is known as pairwise or two-point analysis, whereas analyses that consider many markers simultaneously are known as multipoint analysis.

Grouping, Ordering and Spacing

A genetic map provides an essential resource to understand the order and spacing of markers (relative order when compared to those of other similar species). Thus, the key step is the identification of a set of markers that are arranged together as a single group and finding the order and spacing of each marker in the given group. The mapping population consists of p plants that result from a crossing experiment with a given experimental design. The commonly used designs include backcrossing, F2, doubled haploid and recombinant inbred lines (refer to Chap. 2 for more details). Further, marker data can consist of different types: co-dominant or dominant. Thus, the primary data set consists of an m × p matrix, with p members of a mapping population each scored for m markers. Taken together, the experimental design and the marker type will define the way in which distances and other functions are calculated between distinct markers. The computational approaches to be used in this linkage analysis can be split into three parts: grouping, ordering and spacing.

Grouping divides the DNA marker set into distinct linkage groups. The number of linkage groups in a species, as a rule, should be equal to its gametic chromosome number (or haploid number of chromosomes). Ideally, there should be a one-to-one correspondence between linkage groups and haploid chromosomes (e.g. if there are five chromosomes in the gametes, there should be five linkage groups). However, this will depend on the density and proximity of the underlying markers, which is a consequence of the co-ancestry of the two parents in addition to the marker development strategies as well as regional recombination
rates. On the other hand, if a researcher knows that all the DNA markers are derived from a single chromosome, this analytical step is unnecessary. Several types of solution have been proposed for the marker grouping problem. One type recognises the underlying similarity to the well-studied area of agglomerative hierarchical clustering. In methods such as nearest neighbour locus, clusters of markers (i.e. linkage groups) are grown by sequentially adding the marker that shows the lowest recombination value to the current members of the cluster. For example, the strategy employed by MAPMAKER (Box 4.1) is of this type. It begins by calculating all two-point maximum likelihood distances and corresponding LOD scores, with linkage established between pairs of markers if the LOD score is >3 and the inter-marker distance is <80 Haldane cM (these are the default values used by MAPMAKER; they can be changed by the user). MAPMAKER considers linkage to be transitive such that if marker A is linked to marker B, and if B is linked to C, then A, B and C are candidates for belonging to the same linkage group (but may be excluded later if they show significant deviation from additivity of their map distances). Another type of grouping method adopts ideas from graph theory. For example, MadMapper and MSTMAP both use graph partitioning approaches, creating a complete graph of all markers connected to all other markers, with connecting graph edges weighted by some two-point function of the data. Then, all edges over a certain threshold value are removed, leaving a number of distinct subgraphs, each of which corresponds to a linkage group. It is notable that many grouping methods require input parameters to be specified by the user, thereby influencing their output. Consequently, linkage group content can be changed to some extent by a user's expertise, knowledge and opinion.

Ordering takes each of the linkage groups in turn and aims to find the relative order of the markers within the group. For a linkage group of m markers, there are m!/2 possible orders. Hence, if large data sets are used, this is not a simple task that can be undertaken exhaustively, due to the prohibitive computational time required to carry it out. Given a linkage group, we wish to find the order of its markers that maximises or minimises some scoring function. This scoring function is commonly known as an objective function. In simple terms, we want some way to (1) evaluate the quality of a given marker order and (2) describe how one marker order is better or more suitable than another. Furthermore, we require an objective function that is simple to calculate yet is also biologically and statistically meaningful. An example of a simple objective function, to be minimised, is the sum of adjacent recombination fractions (SARF) (refer to Further Readings for more on SARF and other computational approaches). Since adjacent marker loci tend to have the smallest recombination fractions, the marker order that minimises SARF was referred to by its developer as the minimum distance map. Examples of other popular objective functions are the maximum sum of adjacent LOD scores (SALOD), the minimum number of crossovers, the product of adjacent recombination fractions (PARF), the minimum entropy, the minimum weighted least-squares marker order, the maximum likelihood (ML) and the maximum number of fully informative meioses. It is worth mentioning that the number of orders grows factorially with the size of the linkage group (360 orders for six markers, but about 6.5 x 10^11 for 15 markers), so for larger groups the ordering process would take a very long time even on very powerful computers. Thus, optimising an objective function over all m!/2 possible marker orders is not feasible for most data sets. Finding an optimal marker order for a particular objective function is known in computer science terminology as a non-deterministic polynomial (NP)-hard combinatorial problem and necessitates the use of a search strategy that significantly reduces the space of marker orders to explore. Initially, search strategies such as seriation and branch-and-bound were used. In a seriation approach, a marker order is grown in a greedy fashion from an initial pair of tightly linked markers, adding at each step the single most informative marker in the position that optimises the objective function. In the branch-and-bound strategy, an initial good solution is found, perhaps based on a two-point method. Subsequently, the initial marker order is probed by incrementally constructing partial
orders, with those less good than the current full order eliminated, along with all full orders based on, or descended from, them. Once a full order better than the current one is discovered, it becomes the next current order to be investigated. In this way, the objective function never decreases from the initial solution to the time the method terminates. Subsequent to these approaches, a convenient relationship was discovered between the marker ordering problem and the symmetric wandering salesman problem, a variant of the travelling salesman problem (TSP), perhaps one of the best researched and understood problems in computer science. In this problem, a given set of m cities has to be traversed so that every city is visited exactly once, in such a way that the total distance travelled is minimised and that the choice of the first and last cities is free. Thus, algorithms for solution of the TSP can be used within genetic map estimation, with the m cities recoded as our m markers. The type of strategy that seems to cope best with the presence of missing data, and hence lends itself well to genetic mapping where missing data are common, is the local search procedure. AntMap employs TSP in ordering the markers in a given linkage group (explained in Box 4.2). To estimate order, one may consider several candidate orders and maximise the appropriate likelihood under each of them. The maximum likelihood estimate of order is the order whose maximised likelihood is highest. When one wants to map a new locus onto an existing map, one can follow this procedure. The JoinMap package, which uses this greedy algorithm, has several refinements to this general scheme. For example, the order in which markers are added to the sequence is not random, but depends on the amount of information a marker contains. In addition, after a marker has been added, a local reshuffling can be applied so that the earlier sequence is not frozen and the algorithm does not become trapped in a local optimum from which it cannot escape.

The spacing process involves finding the map distances for an ordered set of markers in a given linkage group. Usually, the distance is expressed in cM between each adjacent pair of marker loci, and hence the length of the linkage group is the sum of those distances. Remember that recombination fractions are not additive across three markers. This problem is solved by taking or refining the two-point analysis calculated in the grouping step. The total map distance between two genes of a linkage group may exceed 50 or even 100 cM, but this does not mean that they would show more than 50% recombination. The frequency of recombination between two linked genes cannot exceed 50%, which is the frequency in the case of independent segregation. There is an approximately 1:1 correspondence between map distance and the observed recombination frequency up to about 15 cM. However, there is a progressive decline in the observed recombination frequency gained for every additional 1 cM beyond 15 cM. Thus, a map distance around 90 cM is expected to show close to 50% recombination.

Sources of Error

It is necessary to be aware that genetic map estimation, like any estimation procedure, is prone to error. Error may arise due to many factors, including missing data, chiasma interference, genotyping error and segregation distortion. Missing data can lead to an incorrect marker order, particularly in dense regions of a map. Some scoring failures are likely to be the results of random processes. However, there is also an element of systematic bias, and we often see a particular marker for which several plants are not scored. In such a case, we may wish to delete the marker from our analysis. For less systematic cases, we may wish to infer missing values through some computational method. In the presence of chiasma interference, the Haldane map function is not valid, since it assumes no interference has taken place. However, many map functions account for chiasma interference in varying degrees. For example, the Rao map function is a versatile function that accounts for interference along a sliding scale. Although the Rao map function is not widely implemented in software tools (see Box 4.3 for a list of software that deals with genetic mapping), the Kosambi map function, which accounts for interference, is supported by many
such software. Genotyping errors can have a large impact on the accuracy of a map, inflating map lengths (particularly when applying multipoint maximum likelihood methods), reducing estimates of chiasma interference and supporting incorrect marker orders. In practice, many researchers will deal with genotyping errors by searching for double recombinants on an estimated genetic map (and sometimes recombinants over short distances), followed by checking of potentially erroneous scores. However, such an approach will not always be practical and is unlikely to uncover all cases of genotyping error. Consequently, two types of computational approach to this problem have been developed. The first type concerns the identification of potentially erroneous scores. For example, the JoinMap software implements a method that calculates a probability for each genotype, given the scores of the two flanking markers and the inter-marker distances. Genotypes with low probabilities can then be investigated further. The second type concerns modifying the map either during or following the estimation process. Both an error filter for pairwise methods, which corrects map length while considering the level of interference p, and error corrections for multipoint methods are described in the literature. Although both methods were shown to perform well for certain data sets, also highlighting the underestimation of interference in their absence, it was noted that the multipoint correction was potentially not as satisfactory as the error filter method, as it was performed on a marker order obtained under the assumption of no error.

Where segregation distortion is found to have occurred (calculated via chi-square tests, refer to Chap. 3), the mapping population deviates from the allele and genotype frequencies expected under the Hardy-Weinberg law (which states that population frequencies remain in equilibrium across generations unless disturbed by some phenomenon). For plant mapping populations, such deviations from the expected frequencies typically arise as the result of gametic or post-zygotic selection, resulting in a marker locus position which, though appropriate for the marker scores, does not correspond to the physical location of the marker. Although simulation analysis showed that the presence of segregation distortion had little effect on the accuracy of marker order or map length, this contradicts the results of other studies and may be data set specific. Consequently, methods which allow such markers to be identified prior to analysis are useful, as they give the researcher the opportunity to analyse the data set either with or without such markers (or potentially both). The interplay between these sources of error is complex because of the interaction between genotyping errors and chiasma interference. It has also been noted that missing values led to shorter map lengths for more widely spaced markers, particularly in the presence of segregation distortion, when using the weighted least-squares method, and further that missing values had a lesser effect on the accuracy of marker order than did genotyping errors. Another source of error is the mixing of marker types within a single scoring scheme, which can result in attraction of similar types of marker independent of their chromosomal locations. Consequently, diagnostic tests and methods that allow researchers to interact with their mapping data are desirable. Additional data may help to resolve errors. For example, physical mapping data, and in particular complete genome sequences, will also present a marker order. This is highly attractive, as estimating marker order is the most difficult part of genetic map estimation for large data sets. However, we should also be aware when comparing genetic and physical marker orders that the genome sequence is itself an estimate gained from a sequence assembly process and may not be highly accurate for up to several years following initial sequencing. Furthermore, when comparing genetic and physical marker orders from different organisms, we must not underestimate the effect that micro-rearrangements could have on inferences about the accuracy of the genetic map. In general, the accuracy of any genetic map estimation method relies on the distribution of recombination frequencies, the proportion of missing data, the quantity of noise due to genotyping errors and genetic interference. As marker data sets grow, it is a challenge to researchers to discover new
search methods that can facilitate fast and accurate use of objective functions.

The outcome of a mapping experiment depends on the composition of the sample population. The larger the mapping population, the more confidence we have in the estimates of recombination frequencies and map distances. For most purposes, populations of size in the range 80 to 400 are used. Remember that the population type also influences the standard errors of the estimates. It is good to realise that, for example, an experiment with 100 RILs will result in a (slightly) different map compared with one based on a sample from an F2 population, with a best map corresponding to each sample. Although the variation between these maps with respect to marker order may be nil, the resulting total map length and the inter-marker distances are quite variable. This demonstrates that the ultimate true linkage map does not exist.

Chromosomal Assignment

Once the linkage groups are identified and refined from the data sets, the next step is assigning a chromosome number to each linkage group. This is usually done with the help of cytogenetic stocks. Nullisomic/disomic/trisomic lines are used to identify which chromosome of the given species contains the markers that constitute the given linkage group. Assignment of markers to specific chromosomes can also be accomplished through PCR using template DNA from each of the nullisomic lines (or disomic, trisomic or tetrasomic lines, depending on availability) in the given species. It is also possible to assign the chromosome using microisolated translocation chromosomes as a template in the PCR with the primer of the given marker. Alternatively, deletion mapping using structural aberrations of specific chromosomes can also be employed in this context. In many species, the chromosomes are designated in sequential order based on their relative sizes. Recently, assignment of markers to individual chromosomes or chromosome arms has been extensively undertaken with the help of fluorescent in situ hybridization (FISH). Further, such FISH analysis helps in the comparison of physical and genetic maps and the identification of introduced chromosomal segments among related species. In polyploid species, it is still complicated since it involves a stepwise process that builds on previous genetic and cytogenetic information. Aneuploid stocks are employed to locate markers on the chromosomes and assign linkage groups to chromosomes. In cotton, monosomic and monotelodisomic stocks that are hemizygous for one arm provide a facile means to localise marker loci to one arm or another of the given chromosome. For example, TM-1/3-79 derived F1s have been evaluated for monosomic or monotelodisomic stocks (Kohel et al. 1970). In each F1, the donor genotype is euploid Gossypium barbadense accession 3-79, and the recipient genotype is hypoaneuploid G. hirsutum, usually a backcross derivative of accession TM-1. TM-1 is an inbred line derived from Deltapine 14 and is considered the genetic standard of upland cotton (G. hirsutum). The inbred 3-79 is a doubled haploid derived from G. barbadense. A monosomic F1 substitution stock has a single chromosome from the donor substituted for the corresponding chromosome pair of the recipient genotype. Similarly, monotelodisomic F1 stocks lack alleles from the recurrent parent in the hemizygous chromosome arm from the donor, but carry alleles of the recurrent parent in the opposing arm (either in homozygous or heterozygous condition, depending on the patterns of crossing over). In general, SSR markers in combination with cytogenetic stocks are used to construct the framework map, and other types of markers are subsequently added to this framework map.

Allopolyploidy and Autopolyploidy

Polyploidy has played an important role in higher plant evolution and applied plant breeding. Polyploids are commonly categorised as (1) allopolyploids, resulting from the increase of chromosome number through hybridization and subsequent chromosome doubling, and (2) autopolyploids, due to chromosome doubling of the same genome by fusion of unreduced gametes. Allopolyploids undergo bivalent pairing at meiosis because only homologous chromosomes pair. For autopolyploids, however, all homologous chromosomes can pair at the same time so that multivalents and, therefore, double reductions are formed. For some polyploids, these two types of pairing occur at the same time, leading to a mixed category. Alfalfa, banana,
canola, coffee, cotton, potato, soybean, strawberry, sugarcane, sweet potato and wheat represent excellent examples of polyploids of economic importance. In spite of the economic relevance of polyploid crops, genetic mapping of these species has been relatively overlooked. Statistical methods for genetic mapping have been well developed for diploid species but are lagging in the more complex polyploids. This is because of intrinsic difficulties such as the uncertainty of the chromosome behaviour at meiosis-I and the need for very large segregating populations. An important, yet underestimated, issue in mapping polyploids is the choice of the molecular marker system. An ideal molecular marker system for polyploid mapping should maximise the percentage of single-dose markers detected and the possibility of recognising allelic markers. The genetic mapping of polyploids, where genome number is higher than two, is further complicated by uncertainty about the genotype-phenotype correspondence, inconsistent meiotic mechanisms, heterozygous genome structures and increased allelic (action) and nonallelic (interaction) combinations. Readers are referred to Wu et al. (2001) for a review of several challenges due to the complexities of linkage analysis in polyploids and a description of statistical models and algorithms that have been developed for linkage mapping based on their distinct meiotic characteristics. That paper also describes several issues that should be addressed to better understand the genome structure and organisation of polyploids and the genetic architecture of complex traits for this unique group of plants.

Bridging Linkage Maps to Develop Unified Linkage Maps

Fig. 4.5 Bridging different linkage maps of the same species into a single comprehensive linkage map. Numbers in parentheses indicate the number of markers unified at each stage

It is often difficult to construct a linkage map that covers the entire genome due to unavailability of polymorphic markers, unavailability of recombinants for the markers and several other reasons. In such cases, maps developed with the help of different mapping populations can be integrated into a single map with the help of anchored markers, as shown in Fig. 4.5. This figure schematically represents the stepwise assemblage of a linkage map based on a number of different crosses using a reference set of anchored markers. Maps A, B and C are obtained from different mapping populations. Integration is possible with the anchor loci that are common to two or more data sets.
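
The role of anchor loci in bridging two maps can be illustrated with a minimal Python sketch. It is an illustration only and not the procedure used by any mapping package: dedicated software pools the underlying recombination data rather than rescaling positions. The function names, the dictionary layout (marker name mapped to position in cM) and the straight-line rescaling are all assumptions made for this example.

```python
def shared_anchors(map_a, map_b):
    """Markers present in both maps, ordered by their position on map_a."""
    return sorted(set(map_a) & set(map_b), key=lambda m: (map_a[m], m))


def anchor_order_agrees(map_a, map_b):
    """True when the anchor loci appear in the same order on both maps."""
    anchors = shared_anchors(map_a, map_b)
    return anchors == sorted(anchors, key=lambda m: (map_b[m], m))


def project_onto(map_a, map_b):
    """Place markers unique to map_b on map_a's scale.

    Assumption: a single least-squares straight line through the anchor
    positions is an adequate rescaling between the two maps.
    """
    anchors = shared_anchors(map_a, map_b)
    if len(anchors) < 2:
        raise ValueError("at least two anchor loci are needed")
    xs = [map_b[m] for m in anchors]
    ys = [map_a[m] for m in anchors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("anchor loci coincide on map_b")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    merged = dict(map_a)  # map_a positions are kept unchanged
    for marker, position in map_b.items():
        if marker not in merged:
            merged[marker] = slope * position + intercept
    return merged
```

Checking anchor_order_agrees before merging is a cheap diagnostic: if the anchors disagree in order, the two maps conflict and a simple rescaling would hide the problem.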

Box 4.1 Linkage Map Construction Using MAPMAKER/EXP

Data File Preparation

The following is an excerpt from the MAPMAKER/EXP tutorial.
The very first line of your raw data file should read like:
    data type xxxx
where xxxx is one of the allowed data types, either:
    f2 intercross
    f2 backcross
    f3 self
    ri self
    ri sib
The second line of the raw file should contain a list of three numbers separated by spaces, such as
    46 362 2
The first of these values indicates the number of progeny for which data are included in the



file (in this case, 46). The second indicates the number of genetic loci for which data are supplied (362). The third indicates the number of quantitative traits in the data set (here 2, although this may be zero, of course).
Additional information may be optionally supplied at the end of this line. In particular, you may specify the coding scheme you use for genotypes. By default, the codes used for F2 backcross (a.k.a. BC1) data are:
    A   Homozygote for the recurrent parent genotype
    H   Heterozygote at this locus
    -   Missing data for the individual at this locus
For F2 intercross data, the default codes are:
    A   Homozygote for the allele from parental strain a of this locus
    B   Homozygote for the allele from parental strain b of this locus
    H   Heterozygote carrying both alleles a and b
    C   Not a homozygote for allele a (either bb or ab genotype)
    D   Not a homozygote for allele b (either aa or ab genotype)
    -   Missing data for the individual at this locus
For RI data, the default codes are:
    A   Homozygote for parental genotype a
    B   Homozygote for parental genotype b
    -   Missing data for the individual (or line) at this locus
Also by default, MAPMAKER will match genotype characters in a case-insensitive manner (i.e. 'a' and 'A' indicate the same genotypes).
However, you can tell MAPMAKER to use whatever conventions you like, as long as you use the same conventions for the entire data file. First off, if you follow the numbers on the second line with the word case, then MAPMAKER will match genotype characters in a case-sensitive manner (i.e. 'a' and 'A' can be used to indicate different genotypes). For example,
    46 362 2 case
If you do not wish to use case-sensitive genotypes, do not include the word case.
To specify the coding scheme itself, include on the end of the above line the word symbols followed by the coding scheme you wish to use, defined in terms of the coding scheme above. For example, if you wish to use the following scheme with an RI data set,
    1   Homozygote for parental genotype a
    2   Homozygote for parental genotype b
    0   Missing data for the individual (or line)
then you would use a second line like
    46 362 2 symbols 1=A 2=B 0=-
Note that when interpreting this line, MAPMAKER is in fact quite finicky about spaces and case distinctions (in order to keep MAPMAKER from ever misunderstanding exactly what you mean). In particular, NO SPACES should surround the = signs.
To use with a backcross data set the scheme
    a   Homozygote for parental genotype a
    A   Heterozygote
    -   Missing data for the individual (or line) at this locus
you should use a line like
    46 362 2 case symbols a=A A=H
The main restriction on coding schemes is that the only allowed symbols are letters, numbers and the characters - and +.
After the first two header lines, the raw file should then present the genetic locus data in the following simple format: For each locus, you list (1) the name of the locus, preceded by an asterisk (*); (2) one or more spaces (or tabs etc.); and (3) the genotypic data for all individuals, in order. For example
    *locus1 BA-HHHAAABBB-HHAA
would provide data for a locus named locus1 with individual #1 having the B genotype, individual #2 having the A genotype and so forth. Data for each new locus should begin on a new line (with blank lines allowed), although the genetic data for any one locus may be broken by any number of spaces, tabs and



line breaks. This means that, among other things, tab-delimited-text files (such as those often exported by spreadsheet programs) will work well, for example:
    *L2 B A - H H H A A A B B B - H
There is a system-dependent maximum line length, although it is fairly large (at least 1,000 characters, where a tab counts as one character).
Locus names should be kept to at most 8 characters and must be limited to alphabetic and numeric characters, along with the underscore character (_) and periods (.). No other characters are allowed (although any dashes in locus names (-) will be converted to underscores). Locus names must start with an alphabetic character (so that they are not confused with locus numbers in MAPMAKER sequences).
Any quantitative trait data should come after the genetic locus data. These data follow a similar format, except that the trait values for each individual must be separated by at least one space, tab or line break. A dash (-) alone indicates missing data. For example
    *weight 6.3 7.7 8.0 6.2 8.6 - 7.5 9.0 5.5 - - 8.4 7.7 7.4 6.9 -
would correspond to a trait named weight, for which individual #1 has a value of 6.3, individual #2 has a value of 7.7 and so on. The sixth individual is missing data for this trait (and will be ignored for all analyses involving these trait data). As for the genotypes, a new trait should begin on a new line, and line breaks are allowed. Tab-delimited-text files work well here too.
Traits may also be specified as functions of other existing trait data. For example:
    *weight1 6.3 7.7 8.0 6.2 8.6 6.9 7.5 9.0
    *weight2 6.7 7.9 7.5 6.8 8.0 7.3 7.5 9.5
    *mean = (weight1 + weight2)/2
The format of these equations is described under the make trait command. Such traits must be included in the number of traits indicated on the file's second line.
Note that genetic maps (particularly for MAPMAKER/QTL) are no longer included in the raw file, as they were with MAPMAKER Version 2.0. Instead, use a .prep initialization file, described in the MAPMAKER manual. Finally, note that comments may be inserted on any line starting with a number sign character (#).
An example of a complete raw file is as follows:
    data type f2 intercross
    20 5 2
    # tiny data set for practical class demonstration
    *locus1 BBBHH-AAABBBHHH-AABA
    *locus2 AB-ABHABHAB-ABHABHBH
    *locus3 ABBAHHHBHABHABHBBHH-
    # Locus3 may be mis-scored in individual 12!
    *locus4 ABHABAAAHAB-ABHAB HHB
    *locus5 ABHABHAA-ABHABHAHHHB
    *trait1 6.3 7.7 8.0 6.2 8.8 6.2 4.1 6.5 5.4 7.3
    8.7 9.0 5.2 6.8 7.2 7.1 7.6 8.3 8.1 7.5
    *trait2 5.5 5.5 5.5 4.5 4.5 4.5 3.5 3.5 3.5 -
    5.5 5.5 4.5 4.5 4.5 3.5 5.2 6.8 7.2 7.1

The MAPMAKER Data: How to Prepare It and What Does It Look Like?

For example, if there are 500 recombinant inbred lines scored for 200 SSR markers that were polymorphic between the parents A and B used in recombinant inbred line development, the data file can be prepared in a Microsoft Office Excel sheet in the following format:
    data type ri self
    500 200 0
    *ssr1 A A B B A B A B (scoring up to the 500th RIL)
    *ssr2 B B B - A B A B (scoring up to the 500th RIL)
    .
    .
    .
    *ssr200 A A - A A B B B (scoring up to the 500th RIL)
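
As an alternative to typing scores into a spreadsheet, the raw-file layout described above can be generated by a short script. The following Python sketch is hypothetical and not part of MAPMAKER: the function name format_raw and its arguments are assumptions for illustration. It applies only the rules quoted above (two header lines, one *name line per locus, locus names of at most 8 characters starting with a letter, traits after loci, a dash for missing values).

```python
import re

# Locus-name rule quoted in the tutorial: starts with a letter, at most
# 8 characters, only letters, digits, underscore and period.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_.]{0,7}")


def format_raw(cross_type, genotypes, traits=None):
    """Return the text of a raw file.

    genotypes: dict of locus name -> string of one-character scores
    traits:    optional dict of trait name -> list of numbers (None = missing)
    """
    traits = traits or {}
    for name in list(genotypes) + list(traits):
        if not _NAME.fullmatch(name):
            raise ValueError(f"invalid name: {name!r}")
    sizes = {len(s) for s in genotypes.values()}
    sizes |= {len(v) for v in traits.values()}
    if len(sizes) != 1:
        raise ValueError("every locus and trait needs one score per individual")
    n_individuals = sizes.pop()

    lines = [f"data type {cross_type}",
             f"{n_individuals} {len(genotypes)} {len(traits)}"]
    for name, scores in genotypes.items():
        lines.append(f"*{name} {scores}")
    for name, values in traits.items():
        cells = ("-" if v is None else str(v) for v in values)
        lines.append(f"*{name} " + " ".join(cells))
    return "\n".join(lines) + "\n"
```

The returned text can be written straight to a file such as RIL.raw, for example with pathlib.Path("RIL.raw").write_text(...), which sidesteps the manual change of file extension.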



Once the data file is prepared in Office Excel by the above procedure, save it with the file type *.txt (text, tab delimited).
Open the folder containing this *.txt file and change the file extension to *.raw using folder options.
Important notes:
1. The * indicates a file name of your choice. For example, the file name for the above data is specified as RIL.
2. If you cannot see the file extension for the specified file name, open the folder options, click the View tab and untick the option Hide extensions for known file types. By doing so, you can see the file extension in the folder for the specified file name; just change the file extension alone (i.e. RIL.txt is to be changed to RIL.raw).

Running Mapmaker

Precisely how you should start MAPMAKER depends on your computer. It should be noted that MAPMAKER downloaded from http://www.broad.mit.edu/ftp/distribution/software/mapmaker3/ can be installed only on Windows XP or earlier operating systems. It is not supported by later operating systems such as Windows Vista and Windows 7. Just open the mapmaker folder and double-click the mapmaker icon to get the command prompt.
When MAPMAKER starts running, you will first see its start-up banner and a prompt 1> for the first command.
In the procedure below, commands to be typed into MAPMAKER follow the numbered prompt (e.g. 1>), while MAPMAKER output is shown on the lines without a prompt.
The first step in almost every MAPMAKER session is to load a data file for analysis. If you are starting out an analysis on a new data set, or if you have modified the raw data in an existing data set, you will do this using MAPMAKER's prepare data command. If instead you are resuming an analysis of a particular (unmodified) data set, you may use the load data command, which preserves many of the results from your previous session. If you are just starting out, use MAPMAKER's prepare data command to load the data file RIL.raw. From this file, MAPMAKER extracts:
    The type of cross, number of markers and number of scored progeny
    The genotype for each marker in each individual (if available)
Other information may be present in the data files, such as quantitative trait data and precomputed linkage results. These issues will be addressed later. Before performing any analyses of the data set, first instruct MAPMAKER to save a transcript of this session in a text file for later reference. Using the photo command, a transcript named RIL.out is started. Note that if the file already exists, MAPMAKER appends new output to this file. These two commands are shown below as they appear in the DOS window.
    ************************************
    *          MAPMAKER/EXP            *
    *         (version 3.0b)           *
    *                                  *
    ************************************
    Type help for help. Type about for general information.
    1> prepare RIL.raw
    preparing data from RIL.raw
    ri self data (500 individuals, 200 loci) ok
    saving genotype data in file RIL.data ok
    2> photo RIL.out
    photo is on: file is RIL.out
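
The summary that MAPMAKER prints after the prepare step (cross type, number of individuals, number of loci) can be cross-checked independently before the file is loaded. The Python sketch below is a hypothetical helper, not MAPMAKER code; the name summarise_raw and the wording of its messages are assumptions. It follows only the format rules given earlier in this box and, as a simplification, does not handle traits defined by equations or custom symbols schemes.

```python
def summarise_raw(text):
    """Return (cross, n_individuals, n_loci, n_traits, problems) for a raw file.

    Assumptions: comment lines start with '#', the first n_loci records are
    loci (one character per individual, whitespace ignored) and any remaining
    records are traits (one whitespace-separated value per individual).
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    words = lines[0].split()
    if [w.lower() for w in words[:2]] != ["data", "type"]:
        raise ValueError("first line must read 'data type xxxx'")
    cross = " ".join(words[2:])
    n_ind, n_loci, n_traits = (int(x) for x in lines[1].split()[:3])

    records = []  # (name, tokens) pairs in file order
    for ln in lines[2:]:
        tokens = ln.split()
        if tokens[0].startswith("*"):
            records.append((tokens[0][1:], tokens[1:]))
        elif records:
            records[-1][1].extend(tokens)  # continuation of the previous record
        else:
            raise ValueError("data found before the first *name line")

    problems = []
    if len(records) != n_loci + n_traits:
        problems.append(
            f"{len(records)} records found, header declares {n_loci + n_traits}")
    for index, (name, tokens) in enumerate(records):
        count = len("".join(tokens)) if index < n_loci else len(tokens)
        if count != n_ind:
            problems.append(f"{name}: {count} scores, expected {n_ind}")
    return cross, n_ind, n_loci, n_traits, problems
```

An empty problems list means the header counts agree with the data, so the figures reported by MAPMAKER after prepare should match those declared in the file.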



Finding Linkage Groups by Two-Point Linkage

Begin the linkage map construction by performing a classical two-point, or pairwise, linkage
analysis of the data set. First, we need to tell MAPMAKER which loci we wish to consider in
our two-point analysis. We do this using MAPMAKER's sequence command (seq will also work).
When you type something like:
3 > sequence 1 2 3
MAPMAKER is told which loci (and, in some cases, which orders of those loci) any following
analysis commands should consider (e.g. SSR1, SSR2, SSR3). Since almost all of MAPMAKER's
analysis functions use the current sequence to indicate which loci they should consider, you
will find that the sequence command must be entered before performing almost any analysis
function. The sequence of loci in use remains unchanged until you again type the sequence
command to change it. In this two-point analysis, we want to examine all the loci in our
sample data set. Thus, we now type into MAPMAKER:
3 > sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 (OR)
3 > sequence all
MAPMAKER gives each marker in the data file its own number; it does not work with the names
SSR1, SSR2, etc. If at any point you want to see the real name of a marker, use the translate
command after specifying the sequence of those markers (e.g. seq 1 2 3, then translate or
tra).
Note that for two-point analysis, the order in which the loci are listed is unimportant.
Alternatively, if you know the chromosomal location of each marker, you can specify only
those marker numbers belonging to the given chromosome in the sequence command, and hence
only those markers will be analysed for their fit into a single linkage group. For example,
if SSR1 to SSR5 belong to chromosome 1, then the command to be used is
3 > sequence 1 2 3 4 5
However, there are 200 markers in this data file, and suppose we do not know the chromosomal
position of each marker. In that case, the data set has too many markers to work with at
once, since evaluating all possible orders of all these markers would take a long time. The
next step is to instruct the program to divide the markers in the sequence into linkage
groups; for this, type MAPMAKER's group command. To determine whether any two markers are
linked, MAPMAKER calculates the maximum likelihood distance and corresponding LOD score
between the two markers: if the LOD score is greater than some threshold, and if the distance
is less than some other threshold, then the markers will be considered linked. By default,
the LOD threshold is 3.0, and the distance threshold is 80 Haldane cM. For the purpose of
finding linkage groups, MAPMAKER considers linkage transitive. That is, if marker A is linked
to marker B, and if B is linked to C, then A, B and C will be included in the same linkage
group. Working through the whole 200-marker data set would be too complicated here, so the
example below uses a simple data set that contains 13 markers. As you can see, MAPMAKER has
divided this 13-marker data set into two linkage groups, which it names group1 and group2,
and a list of unlinked markers (if there are no unlinked markers in the given data set, this
list will not appear).
4 > group
Linkage groups at min LOD 3.00, max distance 80.0
group1 = 1 2 3 5 7
group2 = 4 6 8 9 10 11 12
unlinked 13



Exploring Map Orders by Hand

To determine the most likely order of markers within a linkage group, we could imagine using
the following simple procedure: for each possible order of that group, we calculate the
maximum likelihood map (i.e. the distances between all markers given the data) and the
corresponding map's likelihood. We then compare these likelihoods and choose the most likely
order as the answer. This type of exhaustive analysis may be performed using MAPMAKER's
compare command. However, this sort of exhaustive analysis is not practical for even
medium-sized groups: a group of N markers has N!/2 possible orders, a number which becomes
unwieldy (for most computers) when N gets to be between 6 and 10. In practice, one needs to
order subsets of the linkage group and then overlap those subsets, mapping any remaining
markers relative to those already mapped, a process which is illustrated in the next section.
In the above example, since group1 consists of markers 1, 2, 3, 5 and 7, it is small enough
to perform the fully exhaustive analysis. To do this, we first change MAPMAKER's sequence to
{1 2 3 5 7}. Here, the {} indicate that the order of the markers contained within them is
unknown and thus that all possible orders need to be considered. We then type the compare
command, instructing MAPMAKER to compute the maximum likelihood map for each specified order
of markers and to report the orders sorted by the likelihoods of their maps. Please note the
bracket type, as other brackets have different meanings: [] mean that the markers within are
at the same locus (so order does not matter) and < > mean that the order within is known but
not the orientation of the group itself (it could be the inverse order).
5 > sequence {1 2 3 5 7}
sequence #2 = {1 2 3 5 7}
6 > compare
Best 20 orders:
 1: 1 3 2 5 7   Like: 0.00
 2: 3 1 2 5 7   Like: -6.00
 3: 5 7 2 3 1   Like: -20.20
 4: 5 7 2 1 3   Like: -26.26
 5: 2 5 7 3 1   Like: -27.25
 6: 2 5 7 1 3   Like: -28.39
 7: 2 3 1 5 7   Like: -28.85
 8: 5 2 3 1 7   Like: -32.33
 9: 2 1 3 5 7   Like: -34.12
10: 5 7 1 3 2   Like: -35.55
11: 5 2 1 3 7   Like: -37.61
12: 1 3 5 2 7   Like: -37.76
13: 3 1 5 2 7   Like: -39.09
14: 5 7 3 1 2   Like: -40.38
15: 1 3 5 7 2   Like: -40.87
16: 3 1 5 7 2   Like: -41.55
17: 5 2 7 3 1   Like: -43.67
18: 5 2 7 1 3   Like: -44.78
19: 5 1 3 2 7   Like: -47.63
20: 2 5 3 1 7   Like: -52.28
order1 is set
Note that while MAPMAKER examines all 5!/2 = 60 possible orders, by default only the 20 most
likely ones are reported. For each of these 20 orders, MAPMAKER displays the log-likelihood
of that order relative to the best likelihood found. Thus, the best order 1 3 2 5 7 is
indicated as having a relative log-likelihood of 0.0. The second best order 3 1 2 5 7 is
significantly less likely than the best, having a relative log-likelihood of -6.0. In other
words, the best order of this group is supported by an odds ratio of roughly 1,000,000:1 (10
to the 6th power to one) over any other order. We consider this good evidence that the first
order is the right order.

Displaying a Genetic Map

When we used the compare command previously, MAPMAKER calculated the map distances and
log-likelihood for each of the 60
orders we were considering. The compare command, however, only reports the relative
log-likelihoods and afterwards forgets the map distances. To actually display the genetic
distances, we must instead use the map command. Like compare, the map command instructs
MAPMAKER to calculate the maximum likelihood map of each order specified by the current
sequence. If the current sequence specifies more than one order (e.g. the sequence
{1 2 3 5 7} specifies 60 orders), then the maps for all specified orders will be calculated
and displayed. Because we found one order of this group to be much more likely than any
other, we probably only care to see the map distances for this single order. First, we set
MAPMAKER's sequence, putting the markers in their best order and doing away with the set
brackets. Next, we simply type map to display this order's maximum likelihood map. As you can
see, the distances between neighbouring markers are displayed. Note, however, that these
distances may be considerably different from the two-point distances between those markers.
This is because MAPMAKER's so-called multipoint analysis facility can take into account much
more information, such as flanking marker genotypes and some amount of missing data. This is
precisely the reason that we use multipoint analysis rather than two-point analysis to order
markers: because more data is taken into account, you have a smaller chance of making a
mistake.
7 > sequence 1 3 2 5 7
sequence #3 = 1 3 2 5 7
8 > map
==============================
Map:
Markers      Distance
1  SSR1       4.2 cM
3  SSR3      15.0 cM
2  SSR2      11.9 cM
5  SSR5      12.2 cM
7  SSR7    ----------
             43.2 cM   5 markers   log-likelihood = -424.94
==============================

Mapping a Slightly Larger Group

As we mentioned earlier, exhaustive analyses of large linkage groups are not practical.
Instead, to find a map order of a larger group, we need to find a subset of markers on which
we can perform an exhaustive compare analysis. Thus, to map group2 (in the above example), we
could pick a subset of its 7 markers at random, although we might do better if we pick
markers which are likely to be ordered with high likelihood. Generally, this is true for sets
of markers which (1) have as little missing data as possible and (2) do not include many
closely spaced markers.
To quickly see how much data is available for the markers in the given group, we set
MAPMAKER's sequence appropriately and use MAPMAKER's list loci command. MAPMAKER prints a
list of loci, showing each marker by both its MAPMAKER-assigned number and its name in the
data file. For each marker, MAPMAKER also prints the number of informative progeny (out of
the 500 in the data set) and the type of scoring. In this case all loci have been scored as
co-dominant markers (e.g. SSR genotypes in RILs), although clearly markers 4 and 6 are the
least informative. To also look for markers which may be too close, we use MAPMAKER's lod
table command. MAPMAKER prints both the distance and LOD score between all pairs of markers
in the current sequence. Fortunately, the closest pair is separated by over 6.0 cM, a
distance which should almost always be resolvable in a data set with so many informative
meioses. Given the results of these two analyses, a good subset to try might be:
8 9 10 11 12
Note that the above two tests could have been automatically performed using MAPMAKER's
suggest subset command.
9 > sequence 4 6 8 9 10 11 12
sequence #4 = 4 6 8 9 10 11 12
10 > list loci
Num  Name    Genotypes    Linkage Group
4    SSR4    273 codom    group2
6    SSR6    275 codom    group2
8    SSR8    306 codom    group2
9    SSR9    327 codom    group2
10   SSR10   297 codom    group2
11   SSR11   324 codom    group2
12   SSR12   319 codom    group2
11 > lod table
Bottom number is LOD score; top number is centimorgan distance:
           4       6       8       9      10      11
6       63.1
        3.33
8       16.8    56.0
       39.06    4.33
9       56.3    17.8    54.8
        6.77   36.70    7.68
10     106.3    27.7       -    43.3
        0.89   22.51           15.08
11      14.9    74.0     6.3    65.4       -
       43.78    2.20   80.87    5.76
12      28.2    43.1    18.4    24.1    89.1    30.1
       22.24    9.13   39.84   32.39    2.22   23.90
As we did with the small linkage group, we can change MAPMAKER's sequence to specify the
subset we wish to test and then type the compare command. This time, the results are even
more conclusive, with order1 far more likely than any other. The commands to be used here
are, in order: sequence {8 9 10 11 12}, compare, sequence order1 and map.
Note that this time we set the sequence using a special shortcut, order1, instead of typing
out the marker order that compare reported as order 1. This is to show that the markers to be
analysed can be given to the sequence command in either way. To determine the map position of
the remaining two markers in group2, we will use the following procedure: starting with the
known order of 5 markers, we will place the other two (one at a time) into every interval in
this order and then recalculate the maximum likelihood map of each resulting 6-marker order.
In this analysis, MAPMAKER recalculates all recombination fractions for all intervals in each
map (not just the ones involving the newly placed markers). This function is performed by
MAPMAKER's try command. In its output, MAPMAKER again displays the relative log-likelihood of
each position for the inserted markers. A relative log-likelihood of 0 indicates the best
position, while the negative log-likelihoods indicate the odds against placement in each
other interval.
12 > sequence {8 9 10 11 12}
sequence #5 = {8 9 10 11 12}
13 > compare
Best 20 orders:
 1: 11 8 12 9 10   Like: 0.00
 2: 10 11 8 12 9   Like: -14.57
 3: 8 11 12 9 10   Like: -15.23
 4: 10 9 11 8 12   Like: -27.20
 5: 11 8 12 10 9   Like: -29.97
 6: 10 8 11 12 9   Like: -30.14
 7: 9 10 11 8 12   Like: -32.23
 8: 8 11 10 9 12   Like: -39.80
 9: 10 9 8 11 12   Like: -39.91
10: 9 11 8 12 10   Like: -40.05
11: 11 8 10 9 12   Like: -40.25
12: 11 8 9 12 10   Like: -44.73
13: 8 11 12 10 9   Like: -45.21
14: 10 11 8 9 12   Like: -46.57
15: 8 11 9 12 10   Like: -47.46
16: 9 10 8 11 12   Like: -47.94
17: 10 8 11 9 12   Like: -49.61
18: 8 11 10 12 9   Like: -52.71
19: 9 8 11 12 10   Like: -52.74
20: 11 8 10 12 9   Like: -53.07
order1 is set
14 > sequence order1
sequence #6 = order1
15 > try 4 6
              4        6
      ---------------------
      |     0.00   -42.68 |
 11   |                   |
      |   -35.57   -118.6 |
 8    |                   |
      |   -19.65   -70.19 |
 12   |                   |
      |   -46.80   -28.09 |
 9    |                   |
      |   -51.35     0.00 |
 10   |                   |
      |   -43.40   -21.09 |
      |-------------------|
 INF  |   -44.66   -45.03 |
      ---------------------
 BEST    -619.33  -612.03
In this case, we see that marker 4 is best placed before marker 11. The INF row gives the
relative log-likelihood of a marker lying anywhere else, that is, not on this sequence. In
the above test, we see that a log-likelihood of 44.66 supports linkage between 4 and the rest
of the group. We also see that marker 6 strongly prefers to be in between markers 9 and 10.
Even the next most likely position for marker 6 is more than 10 to the 21.09th power times
less likely. The try command not only tries to place markers in each interval in the
framework but also tries to place each marker infinitely far away (i.e. forced 50%
recombination between it and the framework). The relative log-likelihoods for this position
are indicated following the INF entry in the MAPMAKER output. In the same way that a
two-point LOD score indicates the odds of linkage between two loci when they are separated by
their maximum likelihood distance, these relative log-likelihoods indicate the odds
supporting linkage between one locus and a framework of loci when the locus is placed in its
most likely position. As a last step, we now type the complete sequence for this group,
adding markers 4 and 6 into their most likely positions. Then we type map to see the complete
map of all markers in this group.
16 > sequence 4 11 8 12 9 6 10
sequence #7 = 4 11 8 12 9 6 10
17 > map
==============================
Map:
Markers      Distance
4   SSR4     14.8 cM
11  SSR11     6.4 cM
8   SSR8     18.9 cM
12  SSR12    24.0 cM
9   SSR9     18.1 cM
6   SSR6     28.6 cM
10  SSR10  ----------
            110.8 cM   7 markers   log-likelihood = -688.99
==============================
Likewise, we need to continue this process for all the linkage groups. Note that sometimes,
depending on the data file, a single chromosome may be represented by more than one linkage
group. However, when we add more markers from that chromosome to the data set, there is a
possibility of obtaining a single linkage group (i.e. the added markers merge the two or more
linkage groups into one). It is also important to
note that this program compares combinations of markers and gives the likelihoods of possible
marker orders. It does NOT tell you the right order; it only tells you the most likely one.
You must decide what LODs and cM distances you will accept; therefore, the analysis can be
highly subjective. Hence, most importantly, when you score the data, do not guess. When you
make a mistake in scoring, it will look like a recombination has taken place. Therefore, a
missing data point is better than a wrong one.
MAPMAKER running under DOS in Windows can show the map distances; however, it cannot produce
a graphical view of the genetic map in the Microsoft Windows operating system. MapChart is a
specially designed Windows program that can produce linkage maps and QTL maps very easily. It
is freely available at http://www.biometris.wur.nl/uk/Software/MapChart/. Alternatively,
MapDraw can also be used for linkage map drawing, and it is available free of cost at
http://www.nslij-genetics.org/soft/mapdraw.v2.2.xls.

Tips to Improve Your Analysis

1. While you are using the compare command, recall that an LOD of 2 means one event is 100
   times more likely, an LOD of 3 means 1,000 times more likely, etc. A general guideline is
   that an LOD of 2 or 3 is conventionally acceptable. Suppose the first two orders have
   exactly the same likelihood, meaning that either order is equally likely. If we look at
   the sequences, we may see that the only difference between the first two orders is that
   the order of two markers (say SSR56 and SSR58) cannot be differentiated. The order of the
   other markers seems clearly to be, for example, SSR55 (either SSR56 or SSR58), SSR57 and
   SSR59. An educated guess would be that SSR56 and SSR58 are either at the same locus or
   tightly linked (with not enough recombinations to create a statistically significant
   order). We can check this by asking for the recombination distance between the two
   markers, using the map command. We can double-check our order by using ripple. This
   command assumes the general order is known but checks other possible orders within each
   group of 3 markers, moving down the given sequence. (Note that you would not want to use
   ripple for a completely unknown order, as it only looks at 3 markers at a time. Further,
   when you specify the sequence command, omit {}, or it will check all triplets of all
   possible combinations.)
2. A map with 20 cM or more between markers might be questionable (remember, we do not know a
   sure order, just the most likely one).
3. To make a complete map, you would need to keep going with this process until you had a
   full set of good linkage groups. There are many other commands you can try too, depending
   on your preferences.
4. You can probably see that there is no single right way to use MAPMAKER. Instead of
   choosing some markers of group1 to compare, we could also have grouped again with more
   stringent LOD and cM levels, or we could have worked backwards by using the first order
   command to get an order and then pulled off markers that did not fit well. Likewise, we
   can try several options, since it is a very iterative and somewhat subjective process.
   Readers are strongly recommended to read the MAPMAKER manual, which is available at
   http://linkage.rockefeller.edu/soft/mapmaker/, before working with this program.
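
The order counts and odds ratios quoted in this box follow from two small formulas, sketched here in Python. The helper names are hypothetical, and the sketch assumes (as the text above does) that MAPMAKER's relative log-likelihoods and LOD scores are base-10 logarithms.

```python
from math import factorial

def count_orders(n_markers):
    """Number of distinct marker orders when an order and its reverse are the same map."""
    if n_markers < 2:
        return 1
    return factorial(n_markers) // 2  # N!/2

def odds_from_log10(delta):
    """Odds ratio implied by a log10-likelihood difference (or LOD score) of size delta."""
    return 10.0 ** abs(delta)

print(count_orders(5))        # 60 orders for the five markers of group1
print(count_orders(10))       # 1814400 orders: exhaustive comparison is already costly
print(odds_from_log10(-6.0))  # about 1,000,000 to 1 against the second-best order
print(odds_from_log10(2.0))   # an LOD of 2 corresponds to odds of about 100 to 1
```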

Box 4.2 Linkage Map Construction Using AntMap


Locus ordering is an essential procedure in genome mapping. When the number of loci is large,
it is quite difficult to determine the optimum order with an exhaustive search of all
possible orders. The problem of searching for the optimum order has been recognised as a
special case of the travelling salesman problem (TSP), that is, given a set of cities and
distances for each pair of them, find a round-trip of minimal total length visiting each city
exactly once. In recent years, Ant Colony Optimization (ACO), which is a set of algorithms
inspired by the behaviour of real ant colonies, has been successfully used to solve discrete
optimization problems, such as TSP. Iwata and Ninomiya (2004) developed a novel system based
on ACO for locus ordering in genome mapping. Loci and the absolute value of the
log-likelihood (or the recombination fraction) between loci were regarded as TSP cities and
distances between cities, respectively. They tested the system using a simulated segregating
population and found it highly efficient for linkage grouping as well as locus ordering in
genome mapping.
To make the newly developed system widely available, they developed a software package named
AntMap for constructing linkage maps with it. AntMap performs segregation tests, linkage
grouping and locus ordering and constructs a linkage map quite rapidly and nearly
automatically. The speed of the ACO-based algorithm makes it possible to conduct a bootstrap
test of the estimated order. With the aid of this software, researchers can save time and
labour and can obtain a linkage map whose reliability is indicated by bootstrap values.
Another advantage of AntMap is that it is open source
(http://lbm.ab.a.u-tokyo.ac.jp/~iwata/antmap/), that is, the source code and executable of
AntMap are available under the General Public License (GPL). The Java and C++ objects that
implement this newly developed system can be used in other applications as well as in AntMap.

Input File Format

The input file format of AntMap is identical to the *.raw files required by MAPMAKER (Lander
et al. 1987). AntMap can analyse data derived from progeny of several types of crosses,
including:
1. F2 intercross
2. F2 backcross (e.g. BC1)
3. Recombinant inbred lines by self-mating
4. Doubled haploid lines
However, the current version of AntMap does not support two types of cross, F3 intercross by
self-mating (f3 self) and recombinant inbred lines by sib-mating (ri sib), which are
supported by MAPMAKER/EXP.
The step-by-step procedure to be followed while using AntMap is clearly described in the
AntMap Tutorial. The following are excerpts from it.

Step 0: Start AntMap
Start AntMap in the Windows operating system by double-clicking the AntMap icon. AntMap can
also be executed by using the executable jar file AntMap.jar on any platform (Linux, Solaris
and Mac OS as well as Windows).

Step 1: Open an Input File
Open an input file in MAPMAKER format (*.raw) through the File-Open menu. After opening the
file, the contents of the file will appear in the Data panel. By clicking the Log tab, you
can see a summary of the input data.

Step 2: Segregation Ratio Test
Select Segregation Test from the Analysis menu. By doing so, you can see the results of the
segregation ratio tests in the Result panel.

Step 3: Linkage Grouping
Click the Options tab. Then you can see the Grouping option panel. You can choose one of the
two grouping methods: nearest
neighbouring locus and all combinations. The former builds a group by sequentially adding the
locus which shows the smallest recombination value against it. The latter will produce
results similar to those of the group command of MAPMAKER. You can also choose the grouping
criterion, the threshold value and the minimum number of markers for a single group.
Otherwise keep these options unchanged except for the threshold value.
Select Linkage Grouping from the Analysis menu. Then you can see the results of linkage
grouping in the Result panel. When you analyse your data, you may not be able to achieve a
good separation of markers into linkage groups from the start. In such a case, please find a
good combination of threshold value, criterion and method by trial and error. It is better to
organise your data according to chromosomes and then proceed separately for each chromosome.

Step 4: Locus Ordering and Genetic Map
Click the Options tab, and click the Ordering tab. Then you can see the Ordering option
panel. For locus ordering, you can choose one of two criteria: LL and SARF. LL is an
abbreviation for log-likelihood. SARF is an abbreviation for sum of adjacent recombination
fractions. AntMap will search for a locus order which maximises the log-likelihood or
minimises SARF. You can also choose the number of runs of locus ordering. You can find the
meaning of this option in the AntMap Options section of the AntMap user's manual. The map
function for calculating the map distance between adjacent markers can be selected from the
Haldane and Kosambi functions. Otherwise keep these options unchanged. Select Locus Ordering
from the Analysis menu. Then you can see the results of locus ordering in the Result panel.
You can also obtain a graphic of the linkage map in the Map panel.

Step 5: One-Step Mapping
Select Full Course from the Analysis menu. This runs the overall process from the segregation
ratio test (Step 2) to locus ordering (Step 4) at once.

Step 6: Redraw a Linkage Map
Click the Options tab, and click the Draw map tab. Then you can see the Draw map option
panel. By changing the Scale factor option, you can change the drawing size of the linkage
map. After changing the option value, select Redraw Map from the Analysis menu. You then
obtain a linkage map redrawn at the new size.

Step 7: Bootstrap Test for Locus Order
You can evaluate the reliability of the estimated locus order by using a bootstrap test. The
bootstrap test (or bootstrapping) is a method for estimating the sampling distribution of an
estimator by resampling with replacement from the original sample. In a bootstrap test, a
random sample of size n is drawn from the original sample of size n, and estimates are
obtained from the random sample. After repeating (iterating) this operation many times (e.g.
100 to 1,000 times), the stability of the estimates (e.g. the standard error or confidence
interval of the estimators) is evaluated. In the bootstrap test for locus order, we can
obtain the probability that a locus is located at its estimated position in the order. Click
the Options tab, and click the Ordering tab. Then you can see the Ordering option panel. You
can change the number of iterations (repeats) of bootstrapping. To get a good estimate of the
percentage of correct locus order, 100 iterations may be sufficient. You can also choose a
group which is targeted in the bootstrap test. Select Bootstrap Test from the Analysis menu.
Then you can see the results
of the bootstrap test for locus order in the Result panel. You can also obtain a graphic of
the linkage map with bootstrap values in the Map panel. The bootstrap test for all linkage
groups may take a long time even with a high-end PC. Thus, you had better set your computer
to perform this test during your lunch break or after going home.

Step 8: Save Results of Linkage Mapping
You can save the information in the Result, Log and Map panels through the Save submenu in
the File menu. The information in Result and Log is saved as a text file. The information in
Map (i.e. a graphic of the linkage map) is saved as a JPEG (*.jpg) file.
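
The SARF criterion of Step 4 and the bootstrap test of Step 7 can be illustrated with a small Python sketch. It is not AntMap's implementation: instead of ant colony optimization it simply tries every order, which is feasible only for a handful of markers, and it assumes two-class genotype data (coded A or B, as in doubled haploid lines) so that the recombination fraction between two markers can be estimated as the fraction of lines whose genotypes differ. The function names and the 100-line data set are hypothetical.

```python
import random
from itertools import permutations

def recomb_fraction(data, i, j):
    """Fraction of lines whose genotypes differ at markers i and j ('-' = missing)."""
    pairs = [(ind[i], ind[j]) for ind in data if ind[i] != "-" and ind[j] != "-"]
    if not pairs:
        return 0.5  # no information: treat the markers as unlinked
    return sum(a != b for a, b in pairs) / len(pairs)

def sarf(data, order):
    """Sum of adjacent recombination fractions for a given marker order."""
    return sum(recomb_fraction(data, a, b) for a, b in zip(order, order[1:]))

def best_order(data, n_markers):
    """Exhaustive search for the order with minimum SARF (small marker sets only)."""
    # An order and its reverse describe the same map, so keep one orientation.
    candidates = [p for p in permutations(range(n_markers)) if p[0] < p[-1]]
    return min(candidates, key=lambda p: sarf(data, p))

def bootstrap_support(data, n_markers, iterations=100, seed=1):
    """Share of bootstrap resamples that reproduce the order estimated from all lines."""
    rng = random.Random(seed)
    reference = best_order(data, n_markers)
    hits = 0
    for _ in range(iterations):
        resample = [rng.choice(data) for _ in data]  # n lines drawn with replacement
        if best_order(resample, n_markers) == reference:
            hits += 1
    return reference, hits / iterations

# Hypothetical data: 100 lines scored at 4 markers, at most one crossover per line.
data = (["AAAA"] * 35 + ["BBBB"] * 35 + ["ABBB"] * 5 + ["BAAA"] * 5 +
        ["AABB"] * 5 + ["BBAA"] * 5 + ["AAAB"] * 5 + ["BBBA"] * 5)

reference, support = bootstrap_support(data, 4, iterations=100)
print(reference, support)  # estimated order (marker indices) and its bootstrap support
```

With real data, markers that are tightly linked or poorly scored tend to swap places between resamples, which shows up as low support for the affected part of the order.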

Box 4.3 List of Software Available for Linkage Map Construction


A comprehensive list of computer software for genetic linkage analysis of human pedigree
data, QTL analysis of animal/plant breeding data, genetic marker ordering, genetic
association analysis, haplotype construction, pedigree drawing and population genetics is
given at http://linkage.rockefeller.edu/soft/list.html in alphabetical order. However, the
following software packages are very often used by plant molecular breeders in genetic or
linkage map construction.
1. MAPMAKER (http://www.broad.mit.edu/ftp/distribution/software/mapmaker3/)
2. JoinMap (http://www.kyazma.nl/)
3. AntMap (http://cse.naro.affrc.go.jp/iwatah/antmap/index.html)
4. Map Manager QTX (http://www.mapmanager.org/)
5. QGene (http://www.qgene.org/)
6. R/QTL (http://www.rqtl.org)
7. MSTMAP (http://www.138.23.191.145/mstmap/)
8. CarthaGene (http://www.inra.fr/mia/T/CarthaGene/)
9. MadMapper (http://cgpdb.ucdavis.edu/XLinkage/MadMapper/)
10. THREaD Mapper (http://cbr.jic.ac.uk/dicks/software/threadmapper/index.html)
11. QTL IciMapping (http://www.isbreeding.net/oldweb/download_software_ICIM.aspx)
In practice, it is almost certainly best to use a mixture of approaches in developing and
refining a map. This is not only because each one brings something unique to the analysis but
also because we do not know which approach will succeed best for a new data set and we do not
know enough about the behaviour of each tool to judge this in advance. Map estimation is best
treated as an iterative process, in which researchers first grasp the global pattern of their
data set and then re-evaluate and revise the grouping and ordering of markers, rather than
performing a rigid, linear three-stage methodology of grouping, ordering and spacing.

Bibliography

Literature Cited

Bateson W, Saunders ER, Punnett R (1905) Experimental studies in the physiology of heredity.
  Rep Evol Comm R Soc 2:1–55
Bovenhuis H, Meuwissen THE (1996) Detection and mapping of quantitative trait loci. Animal
  Genetics and Breeding Unit. UNE, Armidale. ISBN 186389 323 7
Bulmer MG (1971) The effect of selection on genetic variability. Am Nat 105:201
Correns C (1913) Selbststerilität und Individualstoffe. Biol Centralbl 33:389–423
Haldane JBS, Smith CAB (1947) A new estimate of the linkage between the genes for
  colour-blindness and haemophilia in man. Ann Eugen 14:10–31
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes
Iwata H, Ninomiya S (2006) AntMap: constructing genetic linkage maps using an ant colony
  optimization algorithm. Breed Sci 56:371–377
Janssens FA (1909) La théorie de la chiasmatypie. Nouvelle interprétation des cinèses de
  maturation. Cellule 22:387–411
Kohel RJ, Richmond TR, Lewis CF (1970) Texas Marker 1. Description of genetic standards for
  G. hirsutum L. Crop Sci 10:670–671
Lander ES, Green P, Abrahamson J, Barlow A, Daly MJ, Lincoln SE, Newburg L (1987) MAPMAKER:
  an interactive computer package for constructing primary genetic linkage maps of
  experimental and natural populations. Genomics 1:174–181
MAPMAKER v3.0 Tutorial. http://linkage.rockefeller.edu/soft/mapmaker/
Mendel G (1865) Available at
  http://www.dnalc.org/view/16172-Gallery-3-Gregor-Mendel-Manuscript-1865.html
Morgan TH (1911) Random segregation versus coupling in Mendelian inheritance. Science 34:384
Morton NE (1955) Sequential tests for the detection of linkage. Am J Human Genet 7:277–318
Sturtevant AH (1913) The linear arrangement of six sex-linked factors in Drosophila, as shown
  by their mode of association. J Exp Zool 14:43–59
Sutton WS (1903) The chromosomes in heredity. Biol Bull 4:231–251

Further Readings

Bailey NTJ (1961) Introduction to the mathematical theory of genetic linkage. Oxford
  University Press, London
Cheema J, Dicks J (2009) Computational approaches and software tools for genetic map
  estimation in plants. Brief Bioinfo 10(6):595–608
McPeek MS (1996) An introduction to recombination and linkage analysis.
  http://www.stat.wisc.edu/courses/st992-newton/smmb/files/broman/mcpeek96.pdf
Whitehouse HLK (1973) Towards an understanding of the mechanism of heredity. St. Martin's
  Press, New York
Wu R, Gallo-Meagher M, Littell RC, Zeng Z (2001) General polyploid model for analyzing gene
  segregation in outcrossing tetraploid species. Genetics 159:869–882

5 Phenotyping

Phenotyping Versus QTL Mapping

The ultimate goal of plant breeding is to develop cultivars that show consistently good
performance for the primary traits of interest. Primary traits are usually agronomically and
economically important traits and will vary among crop species. These traits are
quantitative, rather than qualitative, in nature. Quantitative traits vary continuously (e.g.
yield, quality and stress tolerance), whereas qualitative ones are usually (not always)
binary (yes vs. no; e.g. resistance to a fungus and colour of flower). Quantitative traits
are typically governed by a number of genes, while qualitative ones are often simply
inherited (decided by one or two genes; hence called simple or major traits). Although
progress had been made in cultivar development in most crop species since the rediscovery of
Mendelism, further genetic progress required more information on the inheritance of the
primary traits and their associations with other traits that are needed in improved
cultivars. Quantitative geneticists believed that they could enhance breeding methods if the
inheritance of quantitative traits was better understood. However, some of the assumptions
(random mating populations, linkage equilibrium, two alleles per locus, no epistasis, etc.)
used by the quantitative geneticists in developing the theory and methods of estimation did
not seem realistic to practising plant breeders. Initially, greater efforts were given to
studies related to types of gene action. Identifying the genes for primary traits will help
in answering several genetic questions: How many genes influence the given traits, and what
are their relative effect sizes? Do these genes show evidence of non-neutral evolution at the
sequence level? What environmental and evolutionary forces lead to the maintenance of
variation at these loci? Do ecologically similar environments favour the same genes, or is it
possible to achieve a similar phenotype with different genetic mechanisms?
Recent breakthroughs in molecular biology have helped to answer many of these questions via
quantitative trait loci (QTL) mapping. The loci involved in the inheritance of quantitative
traits are commonly called QTL, and identification of such QTL is referred to as QTL mapping.
The purpose of the phenotyping experiment (evaluating the given trait) is to assign a trait
value to each mapping population member. This value is then combined with the allele score at
the set of marker loci distributed throughout the genome (refer to Chapter 4). A data file is
then created which includes all the trait data and all the marker data for the entire
population. Various software applications can be applied to this data file to identify
statistical associations/correlations between the presence of alternative alleles and the
trait value. The greater this correlation is, the higher the probability that a certain gene
contributes directly to a specific trait. To calculate the strength of the association
between genotype and phenotype, the mapping population is split into two groups at each
marker in turn, according to the allele that each member carries at that marker. Then the
mean trait values of these two classes are compared. If the difference is

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice 109
and Benefits, DOI 10.1007/978-81-322-0958-4_5, Springer India 2013
110 5 Phenotyping

significant, then this provides initial evidence for the location of a QTL in the neighbourhood of the marker (refer chapter 6 for further details on QTL-mapping methods and principles).

Thus, the goal of QTL mapping is to determine the loci that are responsible for variation in quantitative traits. In some situations, determination of the number, location and interaction of these loci is the ultimate goal, besides identifying the actual genes and their functions. For example, breeding studies attempt to identify the loci that improve crop yield or quality and then to bring the favourable alleles together into elite lines via marker-assisted breeding. Understanding the response of QTL in different environments or genetic backgrounds improves the efficiency of marker-assisted breeding. If the genes underlying the QTL are known (i.e. the QTL have been cloned, a process called map-based cloning; discussed in chapter 7), then transgenic approaches can also be used to directly introduce beneficial alleles across wide species boundaries.

Identifying a gene or QTL within a plant genome is like finding the proverbial needle in a haystack. However, QTL analysis can be used to divide the haystack into manageable piles and systematically search them. The data collection on the given trait is often hampered by the significant influence that environmental factors have on the expression of a trait and by the variability of these environmental factors. This is especially true for traits related to crop yield. In addition to their sensitivity to environment and the phenomenon of genotype-by-environment interaction (i.e. the differential reaction of genotypes to environmental changes), such traits are often controlled by a large number of genes. These factors make it difficult to analyse their genetic basis and, therefore, complicate QTL analysis.

Need for Precise Phenotyping

The accuracy of phenotypic evaluation is of the utmost importance for the accuracy of QTL mapping. A reliable QTL map can only be produced from reliable phenotypic data. Replicated phenotypic measurements or the use of clones (via cuttings) can be used to improve the accuracy of QTL mapping by reducing experimental error or background noise. High-throughput phenotyping for QTL mapping under highly controlled plant development conditions provides the best basis for extracting a maximum of information from mapping populations. This way, reproducible and comprehensive datasets are generated. Some thorough studies may include conducting phenotypic evaluations both in field and glasshouse trials.

Moreover, QTL mapping assumes accurate phenotypic scoring methods, something that can be difficult to optimise and even more difficult to keep working for months or years. Just a few mis-scored individuals can totally confound QTL discovery and placement. Even when a well-performed mapping experiment indicates promising QTL, there is always much more that needs to be done to make the mapping data ready for QTL analysis. In such cases, repetition over several years and several locations, repetition in larger sibling populations, repetition in genetically unrelated populations and detailed analyses in marker-generated near-isogenic lines (NILs) that isolate the effects of individual QTL can be considered as additional steps to improve and validate the QTL analysis. It is also important to consider that any one of these efforts could be expensive, time consuming or impossible in practice. Hence, it is essential to understand the basic principles and a broad set of references that are useful for the optimal management of phenotyping practices for QTL discovery.

To be practical, the first step is to define the target environments (also identified as the target population of environments (TPE)). Differences in TPE are largely determined by genotype-by-environment interactions (GEI). The identification and characterisation of a TPE is facilitated by the use of crop simulation models based on historic records of weather data. Simulation can describe a TPE by the frequency of occurrence of specific biotic and abiotic stresses and be based on the soil profile (moisture, nutrient, microbial load, etc.) along with the crop cycle. Within each TPE, GEI are frequently observed relating to yearly fluctuations in environmental factors (e.g. rainfall and temperature), diseases (e.g. foliar disease) and/or parasites (e.g. insects). Ideally, phenotyping
should be carried out across a broad range of environments present within the TPE, and it has been shown on several occasions that this improves QTL analysis. Further, in combination with high-throughput phenotyping, multi-location trials help to standardise and improve the collection of phenotypic data and facilitate the creation of repository databases useful for QTL meta-analyses and other comprehensive approaches (explained in chapter 6). Thus, an essential necessity in QTL analysis is a great emphasis on the basic factors that are crucial for the management of experiments and the collection of meaningful and error-free phenotypic data.

Three basic principles of experimental design (replication, randomisation and blocking control) proposed by the early statistician, Fisher, should be strictly applied to a field or greenhouse test for QTL identification. In fact, for a QTL-mapping project, field experiments should be more stringent for experimental error control since minor QTLs with small effects are expected to be detected. In a trial with fewer than three replicates and small plot size per genotype, a coefficient of variation (CV) higher than 15% is usually considered less desirable. One may expect even higher CV and environmental variation when individual plants (such as individual progenies of a mapping population used for QTL mapping) are the units of measurement. Heritability estimates (see below) based on individual plots are usually much higher than those of individual plants, which is why breeders routinely test progenies in replicated plots.

Phenotyping under controlled conditions is relatively straightforward when scoring traits in a binary fashion, such as for photoperiod sensitivity, and when environmental conditions do not have much effect on the target trait or are easily defined (e.g. light vs. darkness). However, it becomes more complex when the target traits are quantitatively assessed, as in the case of growth, and when environmental conditions that vary during the day (e.g. temperature, light intensity and soil water status) influence the target trait (e.g. the rate of leaf elongation). In this case, the phenotype is rather dynamic and better defined by a series of response curves to environmental stimuli, an approach that is very time consuming and requires a tight control of environmental conditions. High-throughput phenotyping platforms allow for the automation of these procedures and streamline and standardise the collection of highly accurate phenotypic data. State-of-the-art technology including imaging, robotic and computing equipment allows for the continuous phenotypic measurement of tens of thousands of plants automatically and non-destructively. On the other hand, the installation and operating cost of these platforms is very high. Additionally, it is critical that the experimental conditions mimic as closely as possible the dynamics of the ecological environment prevailing in the fields of the TPE. At the same time, no matter how accurate and precise our phenotyping is, the vast majority of the QTLs determining the measured phenotype will remain undetected. The majority of the genetic factors controlling quantitative traits will escape detection because their effects are simply too small to be identified at a statistically significant level.

Phenotyping for Biotic Stress

Biotic stresses, such as diseases and insects (including fungi, bacteria, viruses, nematodes, phytoplasmas, herbivorous insects and sometimes weed species), account for significant annual yield losses in crop plants. Biotic stress usually affects all parts of the plants in all the crop-growing regions and seasons. Resistance to these diseases and insects is controlled either by dominant or recessive major genes or by QTL. Phenotyping of mapping populations for their resistance to the given biotic stress is the key step in QTL analysis. Upon identification of QTLs, more durable resistance could be achieved by pyramiding of resistance genes via marker-assisted selection (refer chapter 8 for further details). However, progress in this direction is hindered by the pathogenic variability of insects and pathogens and the evolution of new and more aggressive pathotypes or races. Though sources of resistance or tolerance to pests and diseases have been recently identified in several crops, in
most cases genetic studies are not available. Only for a few diseases (which have agronomic and economic significance, depending on the pest/pathogen isolate or race) have resistance or dominant genes been reported. At present, it is not clear whether the reported resistance genes represent the same or different loci because allelic tests were not performed. Involvement of other genes in expression of resistance further complicates this picture. Yet another drawback in this context is that, when the crop is screened in the field for biotic stress resistance, several pathotypes/genotypes of the pest and pathogen coexist in the same field or even in the same infected plant part or region. Since random mating may occur between different pathotypes or genotypes of the pest and pathogens carrying different mating type alleles, genetic recombination may contribute to genotypic diversity and provide the pests/pathogens with an additional means to adapt to resistant germplasm. Thus, while screening breeding materials for biotic stress resistance, a combination of several methods and strategies should be applied for assessment of such resistance. Numerous studies have indicated that testing under controlled glasshouse or growth chamber conditions combined with field screening would very much help to improve the reproducibility of the results (which is essential for accurate and consistent QTL identification) since severity and spread of pests and diseases are highly dependent on environmental conditions (especially on humidity, which may change from year to year).

It is also imperative to note that different loci may contribute to resistance at different points of the life cycle of the plant. Usually, biotic stress resistance screening is scored on a scale (e.g. score 1 denotes completely resistant and score 9 denotes completely susceptible). As the scale used for biotic stress resistance evaluation is subjective, particularly for intermediate values (in the above scoring, e.g. score 4 = moderately resistant; score 5 = moderately susceptible), a bias may be introduced by the researcher that may affect the phenotyping data and ultimately the QTL-mapping process. In such a dilemma, it is commonly suggested to follow different scoring systems for resistance to the given pest or disease in the same environment. While conducting bioassay tests, it is necessary to develop a pure pest population from a single colony grown on a single host under controlled conditions using an appropriate standard procedure. Replicated experiments should be carried out with the same instar larvae or nymphs on the same phenological stage of the plants, and data should be collected at different time points. Failure to do so may cause differential responses and hence serious errors in phenotyping data. Further, recent evidence shows that plants respond to multiple stresses differently from how they do to individual stresses, activating a specific programme relating to the exact environmental conditions encountered. Rather than being additive, the presence of an abiotic stress can have the effect of reducing or enhancing susceptibility to a biotic pest or pathogen and vice versa. This interaction between biotic and abiotic stresses is orchestrated by signalling pathways that may induce or antagonise one another and is further controlled by a complex regulatory network. Hence, such phenotypic data should be analysed very cautiously during QTL analysis and interpretation.

Phenotyping for Abiotic Stress

Crop production is limited by various abiotic stresses such as water deficit, submergence, salinity and deficiencies of P and Zn. In recent years, advances in physiology, molecular biology and genetics have greatly improved our understanding of how crops respond to these stresses and the basis of varietal differences in tolerance. Progress has relied on the application of rather specific phenotypic screens that allow the effects of stress to be distinguished from other general differences. QTLs have been identified that explain a considerable portion of observed variation, and in some cases, the genes underlying specific QTLs have been identified (e.g. submergence tolerance in rice). Which traits are suitable for QTL mapping of abiotic stress resistance/tolerance has long been debated as a key question. For example, the morpho-physiological
traits and the corresponding QTLs that affect drought tolerance can be categorised as constitutive (i.e. also expressed under well-watered conditions) or drought-responsive (i.e. expressed only under pronounced water shortage) (see chapter 11 for a more detailed description of drought tolerance in rice). While drought-responsive traits/QTLs usually affect yield only under rather severe drought conditions, constitutive traits/QTLs can affect yield at low and intermediate levels of drought stress as well. The response of QTLs for drought-adaptive traits (e.g. accumulation of osmolytes and relocation of water-soluble carbohydrates) to drought is probably due to regulation of the expression of the underlying structural genes in response to signalling cues such as abscisic acid (ABA) accumulation, which is in turn induced by cellular dehydration. Experimental evidence indicates that the progress achieved by breeders during the last century can mainly be accounted for by changes in constitutive traits that affect dehydration avoidance rather than drought-responsive traits. In this respect, emphasis is increasingly being placed on phenotyping traits that constitutively increase yield per se, rather than on characteristics that enhance plant survival under extreme drought, in view of a possible negative trade-off under less severe circumstances.

An excellent collection of methods, principles and protocols useful in abiotic stress resistance screening (more particularly for drought screening in crop plants) is comprehensively described in the book Drought Phenotyping in Crops: From Theory to Practice. Before starting a phenotyping experiment for abiotic stress resistance, readers are requested to refer to this book for a better understanding of phenotyping, the issues and challenges in planning and managing experiments specific to each crop or trait and its importance in QTL analysis for abiotic stress resistance traits.

Good phenotyping is pivotal for reducing the genotype-phenotype gap, especially for quantitative traits, which are the major determinants of abiotic stress resistance. Keeping a good record of meteorological parameters (rainfall, temperatures, wind, evapotranspiration, light intensity, etc.) allows for more meaningful interpretation of the results and identification of the environmental factors limiting yield. The basic attributes of good phenotyping carried out with appropriate genetic materials are accuracy and precision of measurements, coupled with relevant experimental conditions that are representative of the TPE. Accuracy involves the degree of closeness of a measured or calculated quantity to its actual (true) value. Accuracy is closely related to precision, also termed reproducibility or repeatability, the degree to which further measurements or calculations show the same or similar results. A further complexity of phenotyping a large number of genotypes (e.g. a mapping population) for stress-adaptive features is exemplified by those traits for which the value can vary considerably within a rather short timeframe due to changing environmental conditions. Good phenotyping means not only the collection of accurate data to minimise the experimental noise introduced by uncontrolled environmental and experimental variability but also the collection of data that are relevant and meaningful from a biological and agronomic standpoint, under the conditions prevailing in farmers' fields within the TPE. Although hundreds of accurate studies reporting thousands of stress-responsive genes and QTLs can be found in the literature, the relevance of these data to real field conditions is often questionable.

Heritability of Phenotypes

Collecting accurate phenotypic data that are relevant to the TPE has always been a major challenge for the improvement of quantitative traits. The success of this endeavour is intimately connected with the heritability of the trait, namely, the portion of the phenotypic variability accounted for by additive genetic effects that can be inherited through sexually propagated generations. Trait heritability varies according to: (1) the genetic make-up of the materials under investigation, (2) the conditions under which the materials are investigated and (3) the accuracy and precision of the phenotypic data. With only a few notable exceptions, most of the traits determining the performance of crops usually have low (~0.30-0.40) or, at best, intermediate (~0.40-0.60) heritability.
This impairs our capacity to dissect their genetic basis properly. Despite this, careful evaluation and appropriate management of the experimental factors that lower the heritability of traits, coupled with a wise choice of the genetic material (e.g. use of phenotypically dissimilar parents to obtain maximum phenotypic extremes for mapping population development), can provide effective ways to increase heritability. Once a sound association has been established between a marker and a locus affecting a target trait, the problems encountered in the conventional selection of quantitative traits, particularly the lowly heritable ones, can be partially overcome through the use of markers linked to QTLs for the target trait. This enables individuals to be scored based on their genetic make-up rather than their phenotypic features, and the process is referred to as marker-assisted selection (refer chapter 8 for more details). However, the probability of identifying the relevant chromosomal regions and accurately estimating their effects relies on good phenotyping of the genetic materials originally used to establish the phenotype-genotype associations. In other words, the effectiveness of marker-based approaches intimately depends on how well and how accurately the target trait has been assessed phenotypically in mapping populations. In fact, a low heritability impairs the probability of detecting the presence of QTLs, thereby increasing Type II errors (i.e. false negatives).

Heritability measures the proportion of the phenotypic variance that is due to genetic effects. This measure is important for QTL mapping because it tells us the maximum proportion of phenotypic variance that can be contributed by the given QTLs. Thus, if a trait has a heritability of 50% in a particular set of environments and if we detected all the QTL that affect the trait, the combined effects of all the QTL can explain 50% (but no more than 50%) of the phenotypic variation. In practice, it is possible to overfit a QTL model, so it seems to be explaining more than the limit set by heritability, but in such cases, the model is actually explaining noise, rather than genetic effects, and will have less predictive value than one thinks. Thus, by knowing the heritability of a trait for a particular data set, one can at least know where the limit of QTL modelling is, so one can know if overfitting is likely to be a problem.

Typically, for both selection applications and QTL mapping, the phenotypic variance refers to the variance of line-mean phenotypes. Thus, if we have data from multiple replications and multiple environments, we first compute the mean of each line across replications and environments, and then we calculate the variance of these means. This is the phenotypic variance. So even if environment and experimental errors have large effects on the phenotype observed in a single plot, one can reduce the effect of these nongenetic factors on the line mean by averaging across multiple replications and plots. This results in an increase in the heritability on a line-mean basis, even if the heritability is very low on a single-plot basis. Since selection or QTL mapping is conducted on the basis of line means, rather than individual plot values, one can experimentally increase the line-mean heritability by good experimental design and extensive environmental replication.

The heritability estimates (say x) tell us that the best possible QTL models (assuming we detect all the QTL affecting each trait) can explain at most x% of the phenotypic variance for a given trait. The remaining phenotypic variance ((100 - x)%) cannot be explained by genetics or QTLs, since it is due to GEI or to error variance. We should be able to detect QTLs that explain more variance within each environment because the within-environment heritabilities are higher, but since the GEI variance is large, we expect that some of the QTLs detected in one year will differ in location and/or effect from the QTLs detected in another year. Thus, this kind of GEI is mainly noise. Hence, it is not advisable to look for year-specific QTLs.

Assuming that both the type and the number of treatments (genotypes, stress type (including intensity, degree and duration), etc.) to be evaluated are adequate for the specific objectives of each experiment, the following general factors should be evaluated carefully to ensure the collection of meaningful phenotypic data in field experiments: experimental design, heterogeneity of experimental conditions between and within
experimental units, size of the experimental unit and number of replicates, number of sampled plants within each experimental unit and genotype-by-environment-by-management interaction. The relative impact of each factor on the quality of the phenotypic data to be collected will vary greatly according to each experiment. As an example, an excessive heterogeneity in soil characteristics (depth, moisture, pH, etc.) and/or compaction among field plots will inevitably increase the experimental error and will jeopardise an accurate evaluation of yield. Additional factors such as variation in phenology, interaction with other biotic and abiotic stresses and managing the dynamics and intensity of given stress episodes should also receive due attention when planning and conducting the experiments. Insufficient attention may lead to faulty conclusions, particularly in terms of interpreting cause and effect relationships between yield and other traits/variables.

Statistical Analysis of Phenotypic Data: Simple Statistics, Heritability Estimation and Correlation

The data collected from phenotyping experiments can be used for identifying mean, minimum and maximum values for the given traits. Correlation analysis should be done to understand the relationship among investigated traits (the Pearson correlation coefficient is widely preferred). A negative genetic correlation between two traits indicates that a large proportion of the QTL effects for the investigated traits are the same but in opposite directions. We expect to find some QTL for the given two traits in the same chromosomal locations if they have a strong positive correlation. In order to calculate heritability, it is essential to perform single-factor analysis of variance. This can be done by using any statistical software such as SAS, IRRISTAT and GENSTAT or simply by using Microsoft Excel. From the results of the ANOVA table, the genetic variance $\sigma^2_a$ can be obtained as

$$\sigma^2_a = \frac{\text{Genotype mean square} - \text{Error mean square}}{\text{Number of replications}}$$

The error mean square is also denoted by $\sigma^2_e$ and the number of replications by $r$. From these values, broad sense heritability ($h$; repeatability on a single plot level) is calculated as

$$h = \frac{\sigma^2_a}{\sigma^2_a + \dfrac{\sigma^2_e}{r}} \times 100\,\%$$

The higher the h value, the higher the repeatability of the given trait. In other words, the environmental effect on the trait becomes small as h approaches 1 (i.e. 100%). Conversely, if h is 0, there is no point in doing QTL analysis. The h can be interpreted as follows: if h is 0-30%: low heritability; 31-60%: moderate heritability; and 61-100%: the trait is highly heritable.

Bibliography

Literature Cited

Monneveux P, Ribaut JM (2012) Drought phenotyping in crops: from theory to practice. CIMMYT/Generation challenge programme, Mexico. Freely available at: https://www.integratedbreeding.net/drought-phenotyping-crops-theory-practice

Further Readings

Pask AJD, Pietragalla J, Mullan DM, Reynolds MP (2012) Physiological breeding II: a field guide to wheat phenotyping. CIMMYT, Mexico
Reynolds MP, Pask AJD, Mullan DM (2012) Physiological breeding I: interdisciplinary approaches to improve crop adaptation. CIMMYT, Mexico
Shashidhar HE, Henry A, Hardy B (2012) Methodologies for drought studies in rice. International Rice Research Institute, Los Baños
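The heritability calculation described in this chapter under the statistical analysis of phenotypic data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the example mean squares and the decision to floor a negative variance estimate at zero are hypothetical choices made here and are not prescribed by the text.

```python
def genetic_variance(genotype_ms, error_ms, n_reps):
    """sigma^2_a = (genotype MS - error MS) / r.

    Flooring at zero is an assumption of this sketch, used when the
    genotype mean square happens to fall below the error mean square.
    """
    return max(0.0, (genotype_ms - error_ms) / n_reps)


def heritability_percent(sigma2_a, sigma2_e, n_reps):
    """h = sigma^2_a / (sigma^2_a + sigma^2_e / r) * 100."""
    denominator = sigma2_a + sigma2_e / n_reps
    return 100.0 * sigma2_a / denominator if denominator > 0 else 0.0


def classify(h_percent):
    """Classes used in the text: 0-30 low, 31-60 moderate, 61-100 high."""
    if h_percent <= 30:
        return "low"
    if h_percent <= 60:
        return "moderate"
    return "high"


# Hypothetical ANOVA results: genotype MS = 50, error MS = 10, r = 3.
sigma2_e = 10.0
sigma2_a = genetic_variance(genotype_ms=50.0, error_ms=sigma2_e, n_reps=3)
h = heritability_percent(sigma2_a, sigma2_e, n_reps=3)
print(f"h = {h:.1f}% ({classify(h)})")

# Holding the variance components fixed, more replication shrinks the
# error term sigma^2_e / r and raises heritability on a line-mean basis.
for r in (1, 3, 6):
    print(r, round(heritability_percent(sigma2_a, sigma2_e, r), 1))
```

With these invented mean squares the estimate is 80%, which falls in the "highly heritable" class. The final loop illustrates the earlier point that averaging over more replications increases line-mean heritability even when single-plot heritability is modest.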
6 QTL Identification

QTL: A Prelude

Most of the important agronomic traits are quantitatively inherited and are controlled by several genes (i.e. polygenic). Thus, the nature of quantitative traits is that their expression is controlled by tens, hundreds or even thousands of quantitative trait loci (QTL), each of which generally has only a small effect on the trait. A QTL is a genomic region that comprises gene(s) which govern(s) the expression of the quantitative trait. Since the advent of molecular markers, researchers and breeders have aimed to identify functional markers (refer chapter 3 for different kinds of markers) associated with these QTL for implementation of marker-assisted selection. Historically, QTL detection started with linkage mapping in biparental populations (refer chapter 2 for population types (Sax 1923; Thoday 1961)). Identifying a gene or QTL within a plant genome is like finding the proverbial needle in a haystack. However, QTL analysis can be used to divide the haystack into manageable piles and systematically search them. In simple terms, QTL analysis is based on the principle of detecting an association between phenotype and the genotype of markers. Markers are used to partition the mapping population into different genotypic groups based on the presence or absence of a particular marker locus and to determine whether significant differences exist between groups with respect to the quantitative trait being measured. Thus, a statistically significant difference between phenotypic means of the marker groups (either 2 or 3, depending on the marker system and type of population) indicates that the marker locus being used to partition the mapping population is linked to a QTL controlling the trait.

The reason a significant P value for the difference between mean trait values indicates linkage between marker and QTL lies in recombination (refer chapter 4 for details on recombination). The closer a marker is to a QTL, the lower the chance of recombination occurring between marker and QTL. Therefore, the QTL and marker will usually be inherited together in the progeny, and the mean of the group with the tightly linked marker will be significantly different (P < 0.05) from the mean of the group without the marker. When a marker is loosely linked or unlinked to a QTL, there is independent segregation of the marker and QTL. In this situation, there will be no significant difference between means of the genotype groups based on the presence or absence of the loosely linked marker. Unlinked markers located far from the QTL or on different chromosomes are randomly inherited with the QTL; therefore, no significant differences between means of the genotype groups will be detected.

There are different methods used to detect the QTL and test the inheritance of QTL and markers. Those methods are discussed in detail hereunder; a comparison of the commonly used methods in QTL detection is given in Table 6.1, and a list of QTL-mapping software is given in Box 6.1.
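The principle described above, splitting the population by marker genotype and comparing group means, can be sketched with a small simulation. This is an illustration only and makes several assumptions not taken from the text: a backcross-like population with two marker classes, a single QTL, normally distributed noise, hypothetical function names and invented parameter values. In such a setting the expected difference between marker-class means is the QTL effect multiplied by (1 - 2r), where r is the recombination fraction, so it vanishes for an unlinked marker (r = 0.5).

```python
import random
import statistics
from math import sqrt


def simulate_backcross(n, recomb_fraction, qtl_effect, noise_sd, rng):
    """Return (marker genotypes, trait values) for n progeny.

    The marker allele (0/1) equals the QTL allele unless a recombination
    occurred between them, which happens with probability recomb_fraction.
    """
    markers, traits = [], []
    for _ in range(n):
        qtl = rng.randint(0, 1)
        recombinant = rng.random() < recomb_fraction
        markers.append(1 - qtl if recombinant else qtl)
        traits.append(qtl_effect * qtl + rng.gauss(0.0, noise_sd))
    return markers, traits


def welch_t(markers, traits):
    """Welch t statistic for the difference between marker-class means."""
    g0 = [t for m, t in zip(markers, traits) if m == 0]
    g1 = [t for m, t in zip(markers, traits) if m == 1]
    se = sqrt(statistics.variance(g0) / len(g0)
              + statistics.variance(g1) / len(g1))
    return (statistics.mean(g1) - statistics.mean(g0)) / se


rng = random.Random(1)  # fixed seed so the run is repeatable
for label, r in [("tightly linked (r = 0.05)", 0.05),
                 ("unlinked (r = 0.50)", 0.50)]:
    m, t = simulate_backcross(400, r, qtl_effect=2.0, noise_sd=2.0, rng=rng)
    # For groups of this size, |t| above about 2 corresponds to P < 0.05.
    print(f"{label}: t = {welch_t(m, t):.2f}")
```

A large absolute t value at the tightly linked marker and a value near zero at the unlinked marker is the pattern that single-marker analysis looks for. Dedicated QTL software reports P values or LOD scores rather than a raw t statistic.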

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice 117
and Benefits, DOI 10.1007/978-81-322-0958-4_6, Springer India 2013
Table 6.1 Comparison of different types of methods used in QTL analysis

Principle
  Single-marker analysis: One marker is involved at a time to find the QTL-marker association.
  Simple interval mapping: It is based on the joint frequencies of a pair of adjacent markers and a putative QTL flanked by the two markers.
  Composite interval mapping: Multiple regression methods are integrated with interval mapping to increase the probability of including all significant QTL in the model.
  Multiple QTL mapping: It uses multiple marker intervals simultaneously to fit multiple putative QTL directly in the QTL-mapping model.

Methods
  Single-marker analysis: Simple t-test, ANOVA, linear regression, likelihood ratio test, maximum likelihood estimation.
  Simple interval mapping: Likelihood approach, regression approach or a combination of the two approaches.
  Composite interval mapping: Combining simple interval mapping with multiple regression methods.
  Multiple QTL mapping: Cockerham's model for interpreting genetic parameters and the method of maximum likelihood for estimating genetic parameters.

Advantages
  Single-marker analysis: Simple in terms of data analysis. Performed using common statistical software. Gene order and complete linkage map are not required.
  Simple interval mapping: QTL location can be identified.
  Composite interval mapping: Multiple QTL in a single linkage group can be identified.
  Multiple QTL mapping: More powerful and precise than all the above three methods. Epistasis between QTL, genotypic values of individuals and heritabilities of quantitative traits can be readily estimated and analysed.

Limitations
  Single-marker analysis: The putative QTL genotypic means and QTL positions are confounded, and thus it causes biased estimation of QTL effects and low power in detection of such QTL. QTL positions cannot be precisely determined due to the non-independence among the hypothesis tests for linked markers that confound QTL effect and position. Doing a t-test/ANOVA at every marker results in many false positives.
  Simple interval mapping: Requires prior construction of a good quality linkage map. Considers one QTL at a time in the model for QTL mapping and hence it is biased in estimation of QTL when multiple QTL are located in the same linkage group.
  Composite interval mapping: Inclusion of too many cofactors reduces the power to identify QTL relative to interval mapping.
  Multiple QTL mapping: Sophisticated high-end systems are required with skilled manpower.

Reference
  Single-marker analysis: Edwards et al. (1987)
  Simple interval mapping: Lander and Botstein (1989)
  Composite interval mapping: Jansen (1993), Rodolphe and Lefort (1993), and Zeng (1993)
  Multiple QTL mapping: Kao et al. (1999)

Box 6.1 List of QTL-Mapping Software


In the past decades, many QTL-mapping procedures have been developed. A large number of computer programs are now available to implement these methods. These programs have significantly simplified the application of the methods in QTL analysis. A complete list of the programs is posted on the web sites http://linkage.rockefeller.edu/soft and http://www.stat.wisc.edu/~yandell/statgen/software/biosci/linkage.html. Most of the programs were developed as standalone software packages. These include MapMaker/QTL [1], MapManager [2], QTL Express [3], MapQTL [4], MCQTL [5], MULTIMAPPER [6], Meta-QTL [7], WinQTLCart [8] and QTL Network [9]. Other programs were developed as packages for R, for example, R/qtl [10] and R/qtlbim [11]. PROCBTL is a trial version of a SAS procedure for mapping binary trait loci (BTL) [12]. Another SAS-based software package, PROC QTL Version 1.0, is available at http://www.statgen.ucr.edu/software.html. For more details on specific software, please refer to the references given at the end.

MAPMAKER/QTL is a widely used program for UNIX or DOS operating systems and is the original QTL-mapping program intended for distribution. It can perform composite interval mapping, although the documentation does not use that term, but it cannot perform permutation tests. It requires the companion program MAPMAKER/EXP to format data and to calculate marker maps.

QTL Cartographer is a suite of programs for DOS, UNIX or Mac OS. They are designed to be used in sequence, each accepting input in the form of text files and storing its output in text files for the next program. This suite offers several variations of CIM with automatic selection of background loci. It also has provision for estimating confidence intervals by resampling. QTL Cartographer, MapQTL and PLABQTL are similar in many respects. QTL Cartographer is distinguished by its menu-driven interface, its more detailed documentation, its resampling methods and the lack of a licencing fee.

Map Manager QT is a program for Mac OS distinguished by its graphical user interface for data entry, editing, manipulation and display. It is designed to be used either as a mapping program itself or as a data-preparation program for other mapping programs.

QGene is a commercial program for Mac OS whose strength is a variety of graphics for displaying trait data and relationships among marker genotypes and between traits and marker genotypes. These functions make it uniquely useful for rapid exploration of data. However, it does not perform CIM.

MapQTL is a commercial program for several operating systems that is distinguished by its ability to map QTL in populations derived from non-inbred parents, in which both markers and QTL may have more than two alleles. It also offers a nonparametric form of single-locus association, the Kruskal-Wallis rank sum test, appropriate for data with distributions far from normal.

PLABQTL is a script-driven program for DOS or AIX that is designed to analyse a dataset automatically at increasing levels of complexity in successive runs. The final level is capable of evaluating the effect of different environments and the effect of interactions between QTL and environmental effects.

MQTL is a program for DOS or Sun OS that uses a simplified form of composite interval mapping (sCIM) for mapping QTL in large data sets derived from multiple environments. Like PLABQTL, it will estimate environmental effects and QTL × environment interactions.

Multimapper is a program for UNIX that implements a Bayesian method for building multi-QTL models automatically. Multimapper
is designed to map QTL within a single linkage group, and it produces a plot of QTL probability as a function of map distance. This type of plot seems intuitively more interpretable than the plot of the likelihood ratio statistic or LOD score produced by other programs. However, it seems to be most suited to the analysis of single chromosomes for which other programs have indicated the possibility of multiple QTL. Multimapper is designed to work with QTL Cartographer as a companion program.

The QTL Cafe is a program being developed in Java to make it available for multiple computer platforms. It is currently available as an applet that runs in a Java-enabled World Wide Web browser.

Epistat is a program for DOS designed primarily for the detection and analysis of interactions between QTL. It does not perform interval mapping and therefore does not require mapped markers. It is an interactive program, displaying graphic results in response to single-keystroke commands.

QTL IciMapping is integrated software for building genetic linkage maps and mapping QTL. Its modules are user-friendly, and the software is updated regularly.

Key References for QTL Mapping Software

1. Lander ES, Green P, Abrahamson J et al. (1987) MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics 1(2):174–181
2. Manly KF, Cudmore RH Jr, Meer JM (2001) MapManager QTX, cross-platform software for genetic mapping. Mammalian Genome 12(12):930–932
3. Seaton G, Haley CS, Knott SA, Kearsey M, Visscher PM (2002) QTL express: mapping quantitative trait loci in simple and complex pedigrees. Bioinformatics 18(2):339–340
4. Van Ooijen JW (2004) MapQTL® 5, software for the mapping of quantitative trait loci in experimental populations. Kyazma B.V., Wageningen
5. Jourjon M-F, Jasson S, Marcel J, Ngom B, Mangin B (2005) MCQTL: multi-allelic QTL mapping in multi-cross design. Bioinformatics 21(1):128–130
6. Martinez V, Thorgaard G, Robison B, Sillanpaa MJ (2005) An application of Bayesian QTL mapping to early development in double haploid lines of rainbow trout including environmental effects. Genet Res 86(3):209–221
7. Veyrieras J-B, Goffinet B, Charcosset A (2007) MetaQTL: a package of new computational methods for the meta-analysis of QTL mapping experiments. BMC Bioinformatics 8, article 49
8. Wang S, Basten CJ, Zeng ZB (2007) Windows QTL Cartographer 2.5, Department of Statistics, North Carolina State University, Raleigh, NC, USA. http://statgen.ncsu.edu/qtlcart/WQTLCart.htm
9. Yang J, Hu C, Hu H et al. (2008) QTL network: mapping and visualizing genetic architecture of complex traits in experimental populations. Bioinformatics 24(5):721–723
10. Broman KW, Wu H, Sen S, Churchill GA (2003) R/qtl: QTL mapping in experimental crosses. Bioinformatics 19(7):889–890
11. Yandell BS, Mehta T, Banerjee S et al. (2007) R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics 23(5):641–643
12. SAS Institute (2007) SAS OnlineDoc® 9.2. SAS Institute, Cary
[Figure 6.1: trait mean values (y-axis; the values 550.5, 471.5 and 361.0 are marked) plotted against the marker classes AA, Aa and aa (x-axis), with the fitted regression line y = b0 + b1x + e, where y is the phenotypic value of a line, b0 is the population mean, b1 is the additive effect of the locus on the trait, and e is a residual error term. x is directly related to the genotypic code at the locus being tested for the line considered; it is -1 (for the female parent) or 1 (for the donor or male parent).]

Fig. 6.1 Principle of single-marker analysis
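
The regression model shown with Fig. 6.1 can be sketched in a few lines of code. The example below is only an illustration, not the procedure of any particular QTL package: the function name single_marker_regression and the toy data (codes, heights) are invented here, genotypes are assumed to be coded -1 or 1 as in the figure, and the test statistic is left to be compared with a critical value from a t table with n - 2 degrees of freedom.

```python
import math

def single_marker_regression(x, y):
    """Least-squares fit of y = b0 + b1*x + e for one marker.

    x: genotype codes (-1 or 1, as in Fig. 6.1); y: phenotypic values.
    Returns (b0, b1, r2, t), where r2 is the proportion of phenotypic
    variation explained by the marker and t is the statistic for b1 = 0.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # additive effect of the locus
    b0 = mean_y - b1 * mean_x           # intercept
    rss1 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rss0 = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - rss1 / rss0              # coefficient of determination
    se_b1 = math.sqrt(rss1 / (n - 2) / sxx)
    return b0, b1, r2, b1 / se_b1

# Hypothetical toy data: eight lines, four per marker class.
codes = [-1, -1, -1, -1, 1, 1, 1, 1]
heights = [10.0, 12.0, 11.0, 13.0, 15.0, 17.0, 16.0, 18.0]
b0, b1, r2, t = single_marker_regression(codes, heights)
print(f"b0={b0:.2f} b1={b1:.2f} r2={r2:.3f} t={t:.2f}")
```

With two marker classes, the square of this t statistic equals the ANOVA F statistic, which is one way to see the equivalence of the t-test, ANOVA and regression approaches.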

Single-Marker Analysis (SMA)

Single-marker analysis (also called single-point analysis) is the simplest method for detecting QTL associated with single markers. The statistical methods used for single-marker analysis include t-tests, analysis of variance (ANOVA) and linear regression. Linear regression is most commonly used because the coefficient of determination (r²) from the marker explains the phenotypic variation contributed by the QTL linked to the marker. Typically, the null hypothesis tested is that the mean of the trait value is independent of the genotype at a particular marker. The null hypothesis is rejected when the test statistic is larger than a critical value, and it is declared that a QTL is linked to the marker under investigation. The t-test, ANOVA and simple linear regression approaches are all equivalent to each other when their hypotheses test for differences in the phenotypic means. In analysis of variance (ANOVA, sometimes called marker regression) at the marker loci, at each typed marker, one splits the progenies into two groups, according to their genotypes at the marker, and compares the phenotype distributions of the two groups. For example, in Fig. 6.1, we see that the individuals with genotype aa for a marker have significantly higher phenotype values than those with genotype Aa and AA at that marker, indicating that the marker is linked to a QTL. In contrast, when the phenotype distributions of the genotypic classes are approximately the same, it is decided that this marker does not appear to be linked to a QTL.

The results from single-marker analysis are usually presented in a table, which indicates the chromosome (if known) or linkage group containing the markers, probability values and the percentage of phenotypic variation explained by the QTL (noted as r²). Sometimes, the allele size of the marker is also reported. QTL Cartographer, QGene and MapManager QTX are commonly used computer programs to perform single-marker analysis. Other common statistical software such as SAS, IRRISTAT or even Microsoft Office Excel can be employed for single-marker analysis.

The chief advantage of analysis of variance at the marker loci is its simplicity: it can be performed with basic statistical software programs. In addition, a genetic map for the markers is not required, and the method may be easily extended to account for multiple loci. A further advantage is the easy inclusion of covariates, such as sex, treatment or an environment effect. However, the major disadvantage of this method is that the farther a QTL is from a marker, the less likely it is to be detected. This is because recombination may occur between the marker and the QTL. This also causes the magnitude of the effect of a QTL to be underestimated. The use of a large number of segregating DNA markers covering the entire genome (usually at intervals less than 15 cM)
may minimise both problems. Regression on marker genotypes gives a great deal of information about marker-trait associations, but there are some problems with this approach: (1) The approach considers only the marker positions and has less power to detect a QTL between the markers. (2) The QTL effect and the recombination frequency cannot be estimated separately. (3) There is a large amount of variation within each marker class, and some of this will be due to other QTL affecting the trait; this needs to be taken into account for a more accurate test for the presence of a QTL. Further, individuals whose genotypes are missing at the marker must be discarded, since including such lines may bias or inflate the estimated effect. Despite these problems, regression on marker genotypes is a good start in QTL analysis. It identifies associations without knowing the position of the marker on the map, and it may be adapted for use in any type of population.

Interval Mapping

Lander and Botstein (1989) developed simple interval mapping (SIM), which overcomes the disadvantages of analysis of variance at marker loci. SIM is currently the most popular approach for QTL mapping in experimental crosses. This method makes use of linkage maps and analyses intervals between adjacent pairs of linked markers along chromosomes simultaneously, instead of analysing single markers. The use of linked markers for analysis compensates for recombination between the markers and the QTL and is considered statistically more powerful than single-point analysis. The intervals that are defined by ordered pairs of markers are searched in increments (e.g. 2 cM), and statistical methods are used to test whether a QTL is likely to be present at each location within the interval. It is important to realise that interval mapping statistically tests for a single QTL at each increment across the ordered markers in the genome. Interval mapping searches through the ordered genetic markers in a systematic, linear (also referred to as one-dimensional) fashion, testing the same null hypothesis at each increment.

Interval mapping methods produce a profile of the likely sites for a QTL between adjacent linked markers. In other words, QTL are located with respect to a linkage map. Given the marker genotype data (and assuming that the recombination process in meiosis exhibits no interference), one may calculate the probability that an individual has genotype AA (or Aa or aa) at a putative QTL. In interval mapping, we obtain maximum likelihood estimates of the model parameters, defined as the values for which the likelihood of the observed data achieves its maximum. The results of the test statistic for SIM (as well as for composite interval mapping (CIM), which is discussed subsequently) are typically presented using a logarithm of odds (LOD) score or likelihood ratio statistic (LRS). There is a direct one-to-one transformation between LOD scores and LRS scores (the conversion is LRS = 4.6 × LOD). These LOD or LRS profiles are used to identify the most likely position for a QTL in relation to the linkage map, which is the position where the highest LOD value is obtained. A typical output from interval mapping is a graph with the markers comprising linkage groups on the x-axis and the test statistic (LOD) on the y-axis (Fig. 6.2). The peak or maximum must also exceed a specified significance level in order for the QTL to be declared as real (i.e. statistically significant).

Figure 6.2 displays the LOD (logarithm of the odds favouring linkage, a score that measures the strength of evidence for the presence of a QTL) curve for a chromosome or linkage group. The LOD curve achieves its maximum at position 32 cM (between markers G and H), indicating the presence of a QTL at this position. A question may arise when one is confronted with an LOD curve (or with 19 or 20 such curves, one for each chromosome): is an observed peak actually a QTL? The LOD score indicates the strength of evidence for the presence of a QTL, with larger LODs corresponding to greater evidence. The question is, how large is large? The standard approach to answering this question has been to formulate the problem as one of hypothesis testing. Consider the null hypothesis that there are no QTL segregating in the mapping population. We determine the distribution of the LOD score in this situation. The probabilities of obtaining an
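[Figure 6.2: LOD score (y-axis, marked at 2, 4 and 6) plotted against locus position (x-axis; markers F, G, H, I and J separated by 25, 15, 10 and 35 cM, respectively). The maximum likelihood position of the QTL lies between loci G and H. A horizontal line marks the LOD threshold, i.e. the LOD level at which a QTL effect occurs by chance (usually fixed at 3.0).]

Fig. 6.2 Principle of interval mapping by maximum likelihood method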

LOD score as large as or larger than the one observed, if there were no QTL, are called the P value. Large LOD scores give small P values; very small P values indicate that either the null hypothesis is false (there really is a QTL) or a very rare event occurred. When one performs a genome scan to identify QTL, one examines the LOD score at 100 or more marker loci (in fact, during interval mapping, at all locations between markers). Thus, the null distribution of the LOD score at a single location is not appropriate for forming an overall threshold. Some adjustment must be made for the examination of multiple putative QTL locations over the whole genome. Lander and Botstein (1989) performed extensive computer simulations to estimate the appropriate LOD threshold for various genome sizes and marker densities and gave analytical calculations for the case of a very dense marker map. These guidelines (e.g. fixing a minimum LOD threshold of 3.0) should suffice for most uses.

Alternatively, and most commonly, significance thresholds are determined using permutation tests (discussed below in detail). Briefly, the phenotypic values of the population are shuffled whilst the marker genotypic values are held constant (i.e. all marker-trait associations are broken), and QTL analysis is performed to assess the level of false positive marker-trait associations. This process is then repeated (e.g. 1,000 times), and significance levels can then be determined based on the level of false positive marker-trait associations. The observed LOD score (with the phenotypes in the correct order) is compared to the 1,000 LOD scores obtained from permuted versions of the data. The proportion of these 1,000 LOD scores that exceed the actual, observed LOD score is reported as an approximate P value. This provides a customised threshold tailor-made for the individual experiment. Before permutation tests were widely accepted as an appropriate method to determine significance thresholds, an LOD score of between 2.0 and 3.0 (most commonly 3.0) was usually chosen as the significance threshold, as stated above. An LOD score of 3 indicates that the observed data are 1,000 times more likely if there is a QTL at the specified position than if there is no QTL.

Many researchers have used MapMaker/QTL, QTL Cartographer and QGene to conduct SIM. The most common way of reporting QTL is by indicating the most closely linked markers in a table and/or as bars (or oval shapes or arrows) on linkage maps (shown as bars in Fig. 6.3). The chromosomal regions represented by rectangles are usually the regions that exceed the significance threshold (Fig. 6.2). Usually, a pair of markers (the most tightly linked markers on each side of a QTL) is also reported in a table; these markers are known as flanking markers. The reason for reporting flanking markers is that selection based on two markers should be more reliable than selection based on a single marker.
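
The permutation procedure and the LOD/LRS relationship described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any named program: the LOD is computed by simple regression at the marker positions only (rather than by interval mapping between markers), the genome-wide statistic is taken to be the maximum LOD over markers, and all names (marker_lod, max_lod, permutation_threshold, lod_to_lrs) and the simulated data are hypothetical.

```python
import math
import random

def lod_to_lrs(lod):
    """LRS = 2 * ln(10) * LOD, i.e. roughly 4.6 * LOD."""
    return 2.0 * math.log(10.0) * lod

def marker_lod(x, y):
    """LOD for one marker from regression of phenotype y on genotype code x:
    LOD = (n/2) * log10(RSS0 / RSS1)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    rss0 = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or rss0 == 0:           # no variation: no evidence either way
        return 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    rss1 = max(rss0 - sxy ** 2 / sxx, 1e-12)   # guard against a perfect fit
    return (n / 2.0) * math.log10(rss0 / rss1)

def max_lod(genotypes, y):
    """Largest LOD over all markers (one list of codes per marker)."""
    return max(marker_lod(col, y) for col in genotypes)

def permutation_threshold(genotypes, y, n_perm=1000, alpha=0.05, seed=0):
    """Shuffle phenotypes, keep genotypes fixed, and collect the maximum LOD
    of each permuted data set. Returns (threshold, observed, p_value)."""
    rng = random.Random(seed)
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)           # breaks all marker-trait associations
        null_max.append(max_lod(genotypes, shuffled))
    null_max.sort()
    idx = min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)
    observed = max_lod(genotypes, y)
    p_value = sum(v >= observed for v in null_max) / n_perm
    return null_max[idx], observed, p_value

# Simulated example: 60 lines, 5 unlinked markers, a QTL at the third marker.
rng = random.Random(1)
genotypes = [[rng.choice((-1, 1)) for _ in range(60)] for _ in range(5)]
phenotypes = [1.0 * genotypes[2][i] + rng.gauss(0.0, 1.0) for i in range(60)]
threshold, observed, p_value = permutation_threshold(genotypes, phenotypes)
print(f"threshold={threshold:.2f} observed={observed:.2f} p={p_value:.3f}")
```

Because the threshold is taken from the distribution of the genome-wide maximum, it already accounts for the many positions examined in a scan, which is the adjustment the text calls for.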
[Figure 6.3: linkage maps of five chromosomes (1 to 5) carrying markers A to Y, with the distances between adjacent markers given in cM; QTL for plant height and QTL for internode length are marked alongside the maps.]

Fig. 6.3 Presentation of hypothetical QTL for plant height and internode length in linkage map. Numbers above the vertical bar represent chromosome number. Numbers to the left of each vertical bar represent genetic distance between the markers in cM. Horizontal bars and letters denote markers on the linkage map
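
The gain in reliability from flanking markers can be made concrete with a short calculation. This is a sketch under assumptions that are not stated in the text: a single meiosis in a parent heterozygous at all three loci, no crossover interference, and recombination fractions r_1 (left marker to QTL) and r_2 (QTL to right marker), symbols introduced here for illustration only.

```latex
\begin{align*}
P(\text{QTL allele lost} \mid \text{one marker allele retained}) &= r_1, \\
P(\text{QTL allele lost} \mid \text{both flanking marker alleles retained})
  &= \frac{r_1 r_2}{(1 - r_1)(1 - r_2) + r_1 r_2}.
\end{align*}
```

With r_1 = r_2 = 0.10, for instance, the first probability is 0.10, whereas the second is 0.01/0.82, or about 0.012, because losing the QTL between two retained flanking markers requires a double crossover.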

Again, the reason for the increased reliability is that there is a much lower chance of recombination between two markers and a QTL than between a single marker and a QTL.

It should also be noted that QTL can only be detected for traits that segregate between the parents used to construct the mapping population. Therefore, in order to maximise the data obtained from a QTL-mapping study, several criteria may be used for phenotypic evaluation of a single trait (for instance, rice yield can be evaluated based on number of panicles, number of spikelets per panicle, 1,000-grain weight, etc.). QTL that are detected in common regions (based on different criteria for a single trait) are likely to be important QTL for controlling the trait. Mapping populations may also be constructed based on parents that segregate for multiple traits. This is advantageous because QTL controlling the different traits can be located on a single map. However, for many parental genotypes used to construct mapping populations, this is not always possible, because the parents may segregate for only one trait of interest. Furthermore, the same set of lines of the mapping population used for phenotypic evaluation must be available for marker genotyping and subsequent QTL analysis, which may be difficult with completely or semi-destructive bioassays (e.g. screening for resistance to necrotrophic fungal pathogens).

In general terms, the identified QTL may also be described as major or minor. This definition is based on the proportion of the phenotypic variation explained by a QTL (based on the r² value): major QTL account for a relatively large amount (e.g. >10%), and minor QTL usually account for <10%. Sometimes, major QTL may refer to QTL that are stable across environments, whereas minor QTL may refer to QTL that are environmentally sensitive, especially for QTL that are associated with disease resistance or drought tolerance. In more formal terms, QTL are classified as (1) suggestive, (2) significant and (3) highly significant. This classification was mainly proposed to avoid large numbers of false positive claims and also to ensure that real linkage was not missed. Significant and highly significant QTL were given significance levels of 5 and 0.1%, respectively, whereas a suggestive QTL is
one that would be expected to occur once at random in a QTL-mapping study (in other words, there is a warning regarding the reliability of suggestive QTL). The mapping program MapManager QTX reports QTL mapping results with this classification.

Although the most likely position of a QTL is the map position at which the highest LOD or LRS score is detected, the true position of a QTL can only be stated as lying within a confidence interval (see below). There are several ways in which confidence intervals can be calculated. The simplest is the one-LOD support interval, which is determined by finding the region on both sides of a QTL peak that corresponds to a decrease of one LOD score, as performed by Mapmaker/QTL. Bootstrapping, a statistical method for resampling, is another method to determine the confidence interval of QTL and can be easily applied within some mapping software programs such as MapManager QTX.

All linkage maps are unique and are a product of the mapping population (derived from two specific parents) and the types of markers used. Even if the same set of markers is used to construct linkage maps, there is no guarantee that all of the markers will be polymorphic between different populations. Therefore, in order to correlate information from one map to another, common markers are required. Common markers that are highly polymorphic in mapping populations are called anchor markers (also core markers). Anchor markers are typically SSRs or RFLPs (refer to chapter 3 for details). Specific groups of anchor markers that are located in close proximity to each other in specific genomic regions are generally referred to as bins. Bins are used to integrate maps and are defined as 10–20 cM regions along chromosomes; the boundaries of each are defined by a set of anchor markers. If common anchor markers have been incorporated into different maps, they can be aligned together to produce consensus maps. Consensus maps are produced by combining or merging different maps constructed from different genotypes (see chapter 4). Such consensus maps can be extremely useful for efficiently constructing new maps (with evenly spaced markers) or for targeted (or localised) mapping. For example, a consensus map can indicate which markers are located in a specific region containing a QTL and thus be used to identify more tightly linked markers. The study of similarities and differences of markers and genes within and between species, genera or higher taxonomic divisions is referred to as comparative mapping (refer to chapter 7). It involves analysing the extent to which the order in which markers occur is conserved between maps (i.e. collinear markers); conserved marker order is referred to as synteny. Comparative mapping may assist in the construction of new linkage maps (or localised maps of specific genomic regions) and in predicting the locations of QTL in different mapping populations.

Interval mapping has several advantages over analysis of variance at the marker loci. First, it provides a curve, which indicates the evidence for QTL location. Second, it allows for the inference of QTL to positions between markers. Third, it provides improved estimates of QTL effects (the apparent effect at a marker locus is decreased as a result of recombination between the marker and the QTL). Fourth, and perhaps most important, appropriately performed interval mapping makes proper allowance for incomplete marker genotype data. In the calculation of an individual's QTL genotype probabilities, conditional on its marker genotype data, one considers the closest flanking typed markers for that individual. If an individual is missing the marker genotype for a flanking marker, one moves to the next flanking marker for which genotype data are available. Allowance may even be made for the presence of genotyping errors.

On the other hand, although interval mapping is certainly more powerful than single-marker approaches for detecting QTL, it is limited both by the model, which defines it as a single-QTL method, and by the one-dimensional search, which does not allow interactions between multiple QTL to be considered. An additional disadvantage of interval mapping, in comparison to analysis of variance, is that it requires some increase in computation time and the use of specially designed software.

An important, yet often ignored, issue in QTL mapping concerns selection bias in the apparent (estimated) effects of QTL. Such estimated
effects are often too large. Consider a single QTL with an effect of moderate size, and imagine there is a marker very near the QTL. In a particular experiment, the estimated effect of the QTL will be somewhat different from its true effect (the observed difference between the phenotype averages for the two QTL genotype groups will not be the same as the true difference). Nevertheless, to produce an LOD score sufficiently large to declare the presence of a QTL, the estimated effect must be large. This introduces bias in the estimated effect (bias is also introduced in the maximisation over possible QTL locations; the inferred location for a QTL is the one that gives the largest estimated QTL effect). Because this bias is the result of the selection of only those loci for which there is sufficient evidence for the presence of a QTL, it is called selection bias. The power to detect QTL with a larger effect is higher, and the bias in their estimated effects will be lower but may still be substantial. QTL with very large effect are always detected, and so the bias in their estimated effects will be minimal.

Multiple QTL and Methods to Detect Multiple QTL

Interval mapping assumes the presence of a single QTL. One may use interval mapping to identify multiple QTL, especially when they are on separate chromosomes, but there are several advantages to using methods that model multiple QTL simultaneously. First, by controlling for the presence of a QTL, one may reduce the residual variation and obtain greater power to detect additional QTL. Second, one may better separate linked QTL. Third, the identification of interactions between QTL (called epistasis; see below) requires the joint modelling of multiple QTL. Thus, it is important to have a description of the major statistical approaches to QTL mapping that make use of multiple QTL models. The simplest such method is multiple regression. The aim is principally to frame the problem as one of model selection and to describe the key issues in model selection (the most important of which is the choice of criteria for comparing models). While this simple approach should be more widely used, it shares many of the disadvantages of analysis of variance at marker loci; most importantly, it requires complete marker genotype data. The simplest multiple QTL method that makes allowance for missing genotype data is the use of forward selection in interval mapping. An approach that has received much attention and has been widely applied in practice is composite interval mapping (CIM; discussed below). In this method, one performs interval mapping using a subset of marker loci as covariates. These markers serve as proxies for other QTL to increase the resolution of interval mapping by accounting for linked QTL and reducing the residual variation. The key problem with CIM concerns the choice of suitable marker loci to serve as covariates. Yet another interesting development is multiple interval mapping (MIM). MIM is the extension of interval mapping to multiple QTL, just as multiple regression extends analysis of variance. MIM allows one to infer the location of QTL to positions between markers, makes proper allowance for missing genotype data and can allow interactions between QTL. This is not the final solution to the QTL-mapping problem; one is still confronted with comparing models and searching through models. Statistical researchers have much work to do in this area. The above descriptions are the major approaches to QTL mapping in experimental crosses. Several other approaches are available, including Bayesian methods and the use of a genetic algorithm. These new methods may become important in the future but are beyond the scope of this elementary description of statistical methods for QTL mapping, and readers are requested to refer to the further readings for more details. However, the basic principles and methods used in CIM and MIM are discussed hereunder.

Composite Interval Mapping

Of late, composite interval mapping (CIM) has been becoming popular for mapping QTL. The main advantage of CIM is that it is more precise and effective at mapping QTL compared to single-point analysis and interval mapping, especially
when linked QTL are involved. CIM combines the approaches of interval mapping and single-marker analysis in a multiple regression framework. The motivation for CIM was that the error term (e) in the SIM model (y = b0 + b1x + e) is composed in part of true experimental error but also in part of variation due to QTL at other loci (or genetic background segregation). Some of the variability among lines that share a common QTL genotype at QTL locus 1 is due to the fact that they can have different genotypes at QTL locus 2 somewhere else in the genome. The CIM approach begins by first conducting single-marker analysis and then building multiple-marker models using typical regression model building methods (forward or stepwise regression). Forward regression operates by first selecting the marker with the highest statistical significance (highest LRT or LOD score). Next, the second most significant marker is added to make a 2-locus model. The two markers are re-evaluated for significance, and if they both remain significant in the model, the procedure continues by adding the third most individually significant marker and so on. At any step, if a marker is no longer significant, it can be dropped from the model. In this way, a model is built that includes the most important markers, all of which remain significant when fitted simultaneously. In CIM terminology, these markers are called cofactors. Once the model containing the cofactors is built, the entire genome is rescanned using interval mapping. Many researchers have used QTL Cartographer, MapManager QTX and PLABQTL to perform CIM.

Multiple Trait Mapping

Multiple traits that are correlated with each other can add further information to the investigated traits. It can also be noted that, to some extent, two measurements on correlated traits are rather like repeated measurements. Therefore, information from correlated traits can reduce the effect of error variance, making it easier (more powerful) to detect QTL. Not only is the power of QTL detection increased, but the precision of the QTL map position is also improved, and models regarding the genetic correlation between two traits can be tested. Jiang and Zeng (1995) proposed a multiple-trait version of composite interval mapping. Their method is based on maximum likelihood and requires special programs for analysis. It is postulated that a considerable increase in the power of QTL detection can be expected when using information from two correlated traits.

Testing for Linked QTL Versus Pleiotropic QTL

While doing single-trait analysis, when two QTL are found in the same region (i.e. a single genomic region is linked to two different traits), the question arises whether these are actually the same genes affecting both traits or two separate QTL. Unravelling this difference allows a better understanding of the nature of a genetic correlation between two traits. This would indicate whether it is possible to break an unfavourable genetic correlation between two characters (in the case of linkage) or whether this is impossible (in the case of pleiotropism, in which the same gene(s) are involved in the expression of several traits).

The test can be carried out with
H0: position 1 = position 2
H1: position 1 ≠ position 2

Other genetic models could also be compared and tested (depending on design), such as (1) existence of epistasis and (2) a QTL affecting one trait only versus both traits. Maximum likelihood might be a bit laborious for multiple-trait analyses, especially when comparing a range of genetic models. Moser (1998) proposed a multiple-trait regression approach and showed again that regression is very similar to maximum likelihood methods (at least in designed experiments).

Multiple Interval Mapping (MIM) or Multiple QTL Mapping

As stated earlier, both SIM and CIM were designed to detect a single QTL at a time, based on a statistical test of whether a candidate position for a QTL has a significant effect or not. The investigation was constructed to test each position in a genome and thus created a genome scan for QTL
analysis. Though intuitive and widely used, these methods are still insufficient to study the genetic architecture of complex quantitative traits that are affected by multiple QTL. When a trait is affected by multiple loci, it is statistically more efficient to search for those QTL together. Also, in order to study epistasis of QTL, multiple QTL need to be analysed together (Box 6.2). In this setting, QTL analysis is basically a model-selection problem.
Multiple interval mapping (MIM) is designed to analyse multiple QTL with epistasis together, using a model-selection procedure to search for the best genetic model for the quantitative trait. As shown by Kao et al. in 1999, given a genetic model (number, location and interaction of multiple QTL), this linear model suggests a likelihood function similar to that in SIM but with more complexity. An expectation/maximisation algorithm can be used to maximise the likelihood and obtain maximum likelihood estimates of parameters. The following model-selection method is used to traverse the genetic model space during MIM analysis in QTL Cartographer (refer Box 6.2):
1. Forward selection of QTL main effects, sequentially: In each cycle of selection, pick the best position of an additional QTL, and then perform a likelihood ratio test for its main effect. If the test statistic exceeds the critical value, this effect is retained in the model. Stop when no more QTL can be found.
2. Search for epistatic effects between QTL main effects included in the model, and perform likelihood ratio tests on them: If a test statistic exceeds the critical value, the epistatic effect is retained in the model. Repeat the process until no more significant epistatic effects can be found.
3. Re-evaluate the significance of each QTL's main effect in the model: If the test statistic for a QTL falls below the significance threshold conditional on other retained effects, this QTL is removed from the model. However, if a QTL is involved in a significant epistatic effect with other QTL, it is not subject to this backward elimination process. This process is performed stepwise until no effects can be dropped.
4. Optimise estimates of QTL positions based on the currently selected model: Instead of performing a multidimensional search around the regions of current estimates of QTL positions, estimates of QTL positions are updated in turn for each region. For the rth QTL in the model, the region between its two neighbour QTL is scanned to find the position that maximises the likelihood (conditional on the current estimates of positions of other QTL and QTL epistasis). This refinement process is repeated sequentially for each QTL position until there is no change in the estimates of QTL positions.
Thus, model selection entails four distinct steps: (1) select a class of models (e.g. additive models or models including pairwise interactions between QTL), (2) search through the space of models (there may be more possible models than can be inspected individually), (3) compare models and (4) assess the performance of the model-selection procedure. If one allows only three or fewer QTL, one may perform a simultaneous search to consider each such model. But if one wishes to consider the possibility of many more QTL, it will be impossible to inspect each possible model individually, so one must form some procedure for searching through this space of models to pick out the best ones. Finally, it is important to consider how one may assess the performance of a model-selection procedure.
Decisions should be guided by the aims of the study. In a study seeking to use marker-assisted selection to improve an agricultural product, one may be willing to allow a few extraneous loci in an effort to identify a reasonably large number of QTL. A scientist wishing to perform positional cloning of a genomic region (refer chapter 7) may be satisfied only with a small number of strongly supported QTL; this avoids wasting expensive and time-consuming efforts on extraneous loci. These sorts of aims should guide the researcher in framing the desired performance characteristics for a procedure, which may then be used in choosing an appropriate mapping method. One will need to rely on experience, educated guesses and large computer simulation studies because, unfortunately, the appropriate mapping method will vary with the context.
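
The forward-regression idea used to pick CIM cofactors, and echoed in the forward selection of main effects in MIM, can be illustrated with a small sketch. This is not the algorithm implemented in QTL Cartographer: it is a simplified toy version that adds, at each step, the marker giving the largest likelihood ratio statistic conditional on the markers already chosen, and it omits the backward re-evaluation step. The function names, the marker names m1 to m3, the data and the threshold of 3.84 (the 5% point of a chi-square distribution with one degree of freedom) are all assumptions made for illustration; normally distributed errors and non-collinear, noisy data are also assumed.

```python
import math

def rss(predictors, y):
    """Residual sum of squares of a least-squares fit of y on an intercept
    plus the given predictor columns (normal equations, Gauss-Jordan)."""
    n = len(y)
    X = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(X[0])
    # Augmented normal-equation system [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def forward_select(markers, y, threshold=3.84):
    """Add markers one at a time while the best candidate's LRT exceeds
    the (hypothetical) threshold. Under normal errors the LRT for adding
    a marker is n * ln(RSS_without / RSS_with)."""
    n = len(y)
    chosen = []
    while True:
        current = rss([markers[m] for m in chosen], y)
        best, best_lrt = None, threshold
        for name in markers:
            if name in chosen:
                continue
            full = rss([markers[m] for m in chosen] + [markers[name]], y)
            lrt = n * math.log(current / full)
            if lrt > best_lrt:
                best, best_lrt = name, lrt
        if best is None:
            return chosen
        chosen.append(best)

# Toy backcross-style data: genotypes coded -1/+1 for eight individuals.
markers = {
    "m1": [1, 1, 1, 1, -1, -1, -1, -1],
    "m2": [1, 1, -1, -1, 1, 1, -1, -1],
    "m3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [14.7, 14.4, 11.6, 11.3, 8.4, 8.7, 5.3, 5.6]

selected = forward_select(markers, y)
print(selected)
```

In this toy data set the phenotype was constructed from m1 and m2 only, so those are the markers one would expect to be retained as cofactors. With real data, the threshold would normally be chosen more carefully (for example by permutation), and eight individuals are far too few for the chi-square approximation to be trusted.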

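The LRT and LOD scores used above as criteria for retaining markers and QTL are two scalings of the same likelihood comparison: LRT = 2 ln(L1/L0) and LOD = log10(L1/L0), so LOD = LRT / (2 ln 10). The short sketch below converts between them and attaches an approximate P value. The function names are hypothetical, and the P value assumes that the LRT follows a chi-square distribution with one degree of freedom under the null hypothesis, which is an approximation rather than a property guaranteed for genome scans.

```python
import math

LN10 = math.log(10.0)

def lrt_to_lod(lrt):
    """LOD = log10(L1/L0) = LRT / (2 ln 10), because LRT = 2 ln(L1/L0)."""
    return lrt / (2.0 * LN10)

def lod_to_lrt(lod):
    """Inverse conversion: LRT = 2 ln(10) * LOD."""
    return 2.0 * LN10 * lod

def pvalue_chi2_1df(lrt):
    """Upper-tail probability of a chi-square variable with 1 df
    (assumed null distribution of the LRT)."""
    return math.erfc(math.sqrt(lrt / 2.0))

def bonferroni(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error near alpha."""
    return alpha / n_tests

print(round(lrt_to_lod(11.5), 2))
print(pvalue_chi2_1df(11.5))
print(bonferroni(0.05, 24))
```

Under these assumptions an LRT of 11.5 corresponds to an LOD of about 2.5, and a per-test threshold for 24 independent tests at an overall level of 0.05 is about 0.002.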
Box 6.2 How to Analyse QTL Using QTL Cartographer


Windows QTL Cartographer (available at no cost at http://statgen.ncsu.edu/qtlcart/WQTLCart.htm) maps QTL in cross populations from inbred lines. It includes a powerful graphic tool for presenting and summarising mapping results and can import and export data in a variety of formats. It provides single-marker analysis, interval mapping, composite interval mapping, Bayesian interval mapping and multiple interval mapping. There are two stages in making a QTL map for a particular trait (once you've scored hundreds of marker loci in hundreds of F2 or backcross progeny):
1. Construct a genetic map of your markers.
2. Feed the genetic map, marker data and phenotype data into QTL Cartographer and run the analysis.
Constructing the genetic map is dealt with in detail in chapter 4. This box focuses on the second step.

Preparation of Data Files

Four different data files are to be prepared for use in QTL Cartographer. They are (1) data file of chromosome label and marker number, (2) data file of marker label, (3) data file of marker position and (4) data file of genotype and phenotype. These data files should be prepared as explained below.

Preparation of Data File of Chromosome Label and Marker Number
This data file is prepared in Microsoft Excel in the following format.
C2    4
C3    3
C4a   2
C4b   3

Suppose there are four chromosomes and that map construction generated four linkage groups. On analysing those linkage groups, however, it is found that chromosome 1 has no linkage group and chromosome 4 has two linkage groups. Label the chromosomes or linkage groups as C1 to C4 (C denotes chromosome; the number 1–4 identifies the respective chromosome, and the letter suffix distinguishes different linkage groups belonging to the same chromosome). The chromosomes are entered in the first column as above. No linkage group is found for chromosome 1, and hence, it should not be mentioned in this column. Note that chromosome 4 has two linkage groups, and hence they are entered as separate linkage groups in the above data file. The number of markers present in each linkage group is entered in the second column against its chromosome label (remember, this is not the number of markers that were used in the linkage mapping analysis but the number of markers that remain linked at the end of the linkage mapping analysis). This data file is saved in Text (Tab delimited) type with a suitable name (e.g. file1.txt).

Preparation of Data File of Marker Label
The data file is prepared in Microsoft Excel as per the following format:
NAU1246   NAU3684   NAU3875   BNL3971   NAU3083   NAU3172   NAU3839

Marker labels are entered in the first row. The order of the markers starts from the first marker of chromosome 2 (position 0) and runs to the last marker of chromosome 2, then continues from the first marker of chromosome 3 (position 0 cM) to its last marker. The same format is followed for as many chromosomes as were entered in the first data file (in this example, up to the last marker of chromosome C4b). The data file is saved in Text


(Tab delimited) type with a suitable name (e.g. file2.txt).

Preparation of Data File of Marker Position
The data file is prepared again using Microsoft Excel as below.
0
23.3
67.5
83.3

0
9
51.9

0

The marker positions (in cM) are entered in the first column as above. Marker positions start from position 0 for the first marker of the first chromosome (in this case chromosome 2). After typing the position of the last marker of a given chromosome, the next row is left blank before continuing to the next chromosome (see above). The data file is saved in Text (Tab delimited) type with a suitable name (e.g. file3.txt).

Preparation of Data File of Genotype and Phenotype
This data file is prepared in Microsoft Excel as per the below-mentioned format.

Individual label  NAU1246  NAU3684  NAU3875  BNL3971  DFF  PH    YLD
BC1-1             2        0        .        2        55   68.7  15.2
BC2-2             0        .        .        0        45   84.2  20.5
BC1-3             2        2        2        2        61   65.7  17.5
...
BC1-n             0        0        2        0        58   71.5  22.8
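
As a rough illustration of the tab-delimited layout of the four files, the sketch below writes them with Python instead of Excel. It covers linkage group C2 only, using the marker labels, positions and the three example individuals printed in this box; the helper name write_tab_file and the temporary output directory are hypothetical, and whether QTL Cartographer accepts files produced this way has not been verified here.

```python
import csv
import os
import tempfile

def write_tab_file(path, rows):
    """Write each row (a list of values) as one tab-delimited line."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)

out_dir = tempfile.mkdtemp()  # hypothetical output location

# Linkage group C2 only: four markers and their positions in cM.
markers = ["NAU1246", "NAU3684", "NAU3875", "BNL3971"]
positions = [0, 23.3, 67.5, 83.3]

# file1: chromosome label and number of markers on it.
write_tab_file(os.path.join(out_dir, "file1.txt"), [["C2", len(markers)]])
# file2: marker labels in a single row, in map order.
write_tab_file(os.path.join(out_dir, "file2.txt"), [markers])
# file3: one marker position per row.
write_tab_file(os.path.join(out_dir, "file3.txt"), [[p] for p in positions])

# file4: header line, then one row per individual
# (genotype codes first, phenotypic values last; "." is a missing score).
header = ["Individual label"] + markers + ["DFF", "PH", "YLD"]
individuals = [
    ["BC1-1", 2, 0, ".", 2, 55, 68.7, 15.2],
    ["BC2-2", 0, ".", ".", 0, 45, 84.2, 20.5],
    ["BC1-3", 2, 2, 2, 2, 61, 65.7, 17.5],
]
# Every data row must have exactly as many fields as the header.
assert all(len(row) == len(header) for row in individuals)
write_tab_file(os.path.join(out_dir, "file4.txt"), [header] + individuals)

print(sorted(os.listdir(out_dir)))
```

For a map with several linkage groups, file1 would gain one row per group and file3 would need an empty row between groups, as described in this box.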

The first row is reserved for the header line, which is filled in the following order: individual label, then the genotypic score columns for each marker (marker labels are entered in the same order as in file3), followed by the phenotypic values (in the above example, phenotype1 is DFF (days to first flowering), phenotype2 is PH (plant height) and so on). In the subsequent rows, the data are entered with one row for each individual, following the header line in each column. The genotype codes used for scoring backcross progenies are as follows (for other types of mapping population, refer to the QTL Cartographer manual or the help button available in the QTL Cartographer main window): 2 = homozygous parent A, 0 = homozygous parent B, 1 = heterozygous and . = missing value. Likewise, mean phenotypic values scored for each progeny should be entered with uniform units (e.g. plant height in cm for all the individuals; entering one individual's plant height in cm and another individual's in mm or m is wrong). The data file is saved in Text (Tab delimited) type with a suitable name (e.g. file4.txt).

Importing Data File into QTL Cartographer

All the data files prepared as mentioned above are imported into QTL Cartographer for QTL analysis by essentially following the steps below:
• Run QTL Cartographer by double clicking the radio button and select the tab "New" from the "File" menu. A "Basic information" box will appear and the


following information is to be filled in that box.
• To save the data file, choose the destination directory by selecting the tab "File name" and the "Save as" menu.
• The number of chromosomes (linkage groups), number of traits, number of other traits (binary values such as sex), number of individuals and cross type (in this case B1, i.e. backcross to parent 1) should be entered in the "Basic information" box before running the program.
• The message "input chromosome label and marker number for each chromosome" will appear. Select the data file of chromosome label and marker number (in this case, file1.txt) from the corresponding directory and send the data to QTL Cartographer by selecting the tab "Send Data".
• The message "input marker labels and positions for each chromosome" will appear. Select the "Labels" tab, choose the data file of marker labels (in this case, file2.txt) from the corresponding directory and click "Send Data" to send the data to QTL Cartographer. Then, select the tab "Positions" and browse to the data file of marker positions (in this case, file3.txt) in the corresponding directory. The data are sent to QTL Cartographer by selecting the tab "Send Data".
• A message "cross information filename" will then appear. Select the data file of genotype and phenotype (in this case, file4.txt) from the corresponding directory, and send it to QTL Cartographer by selecting the "Send Data" tab.
• Finally, click the tab "Finish" to import all the required data into QTL Cartographer, which will result in the message "QTL Cart has created the source data file, the new source data file has been saved".
• A description of the imported data will appear in the main window, and it must be critically viewed to cross-check the imported data in all aspects. The data file will appear in the big lower window. The top left third of the screen shows some basic information about the data set, the top middle allows specific genotypic or phenotypic data points to be visualised or modified, and the top right of the screen has the options for QTL analysis.

Importing Mapmaker Files into WinQTL Cartographer

Alternatively, all the information contained in the above four files can be obtained from Mapmaker and easily imported into QTL Cartographer using the flowchart below, if the linkage and QTL-mapping analysis has been done using Mapmaker with the same data (refer chapter x):
1. File > Import > Source DATA import 1/1
   In this window, enable "MapMaker/QTL format" and click <Next>
2. In the "Source Data Import 2/2" window:
   Click <Map file> and provide the Mapmaker file with the .map extension
   Click <Cross Data> and provide the cross data (the input data used in Mapmaker) with the .raw or .txt extension
   The source data file for WinQTL will be created in the working directory with the same file name and the extension _mps_ln.
3. Click <Finish>
   A new window will appear with the message "The new source data file has been saved".

Single-Marker Analysis (SMA)

From the "Analysis" menu, select the "Single Marker Analysis" option to perform SMA. Select the option "Graphic", and mention the destination directory to save the output file,


with a suitable name. From the tab "Chrom", select the option "All Chroms", and the graphic of all the chromosomes will be displayed. SMA of each chromosome can also be done separately by selecting the option "First Chrom", "Second Chrom" and so on; the graphic of each chromosome will then be displayed for chromosome-wise analysis. Under the tab "Setting", select the options "Show Trait Names or Legend", "Show Marker Names" and "Show Chromosome Names" to display that information in the graphic. Use the option "Copy Graph to Clipboard" from the "File" menu to paste the graph into Microsoft Word or PowerPoint.
If you push the "View info" button in the "Single Marker Analysis" box, you'll get the results of linear regression analysis of the relationship between phenotype and marker genotype for each marker, individually. This analysis tells us whether there is any significant relationship between genotype and phenotype for the markers. If you push the "View info" button in the "Statistical Summary" box, you'll get summary statistics on the pattern of trait variation in the mapping population and on the pattern of segregation at the marker loci, that is, whether they follow Mendelian expectations. We can check whether the genotype proportions in our mapping population all appear to be consistent with Mendelian expectations.
In the results, sample size refers to the number of lines used in the analysis, and the variance is almost identical to the phenotypic variance of line means (these numbers should be nearly identical because they estimate the same thing, although in slightly different ways). Following the trait statistics and histograms is a long table showing the percentage of missing data at each marker locus. You should at least scan this table to see if there are any loci with large amounts of missing data because that will warn you to be more doubtful of the QTL tests at those loci. Following this is a table showing tests of segregation distortion at each locus.
"Chi2" is the χ2 test of the null hypothesis that the locus is segregating as expected for a Mendelian locus in the population. This test is based on the difference between expected and observed numbers of lines in each genotypic class. The larger the deviation from expectation under the null hypothesis, the larger this number is. The important question is whether there is a significant deviation from the expected segregation at this locus. It is actually fairly tricky to answer. For example, from the results, you may find that the P value of a test is 0.022. If you consider α = 0.05 the threshold for significance, then you would consider the data to demonstrate a significant deviation from expected segregation. However, keep in mind that setting a threshold of 0.05 means that one expects that, by chance, one will declare 5% of all tests to be significant, even if the null hypothesis of Mendelian segregation is always true. When we reject the null hypothesis even though it is true, we make a Type I error. Since we are testing many loci for segregation distortion, one should probably use a more stringent threshold to avoid making too many Type I errors. One possibility is to use an experiment-wise (or whole-genome-wise) threshold that adjusts the significance threshold to maintain the probability of making at least one Type I error at some constant level. This often leads to very stringent significance thresholds because it becomes very difficult to avoid making just one Type I error if you conduct many tests (remember, the number of tests here is equal to the number of marker loci). So as you do a better job of controlling the rate of Type I errors, you end up making more Type II errors (where you do not reject the null hypothesis in cases where it is not true). Worse, it is not even clear how to correctly set this threshold for data where the tests are not all independent of


each other. In the case of genetic data, tests at linked loci are not independent. If there is segregation distortion in a genomic region, then all loci in that linked region will exhibit distortion. In such cases, the following points may help. First, decide what the relative cost is of making a Type I versus a Type II error. In this example, what is the effect on QTL mapping if there really is segregation distortion? The biggest difficulty is that segregation distortion leads to biased recombination frequency estimates during linkage map construction (see chapter 4 for a detailed description). However, for single-marker QTL analyses, segregation distortion causes no bias at all. We just need to keep in mind, for the later methods of QTL analysis to be discussed, that the map distances are not really known and may be estimated with some bias. Second, set a significance threshold somewhere between 0.05 (the most liberal) and a Bonferroni-corrected threshold of 0.05/n, where n = number of tests (the most conservative), depending on how concerned one is about Type I versus Type II errors. An ad hoc, somewhat liberal threshold that is often used is created by dividing 0.05 by the number of chromosome arms in the linkage map. Since loci at the two different ends of a chromosome tend to be independent of each other, we guess that there are at least two independent groups of tests on each chromosome. For example, in rice, there are 24 chromosome arms, so the threshold is p = 0.05/24 = 0.002. The corresponding χ2 value with one degree of freedom is 9.47. Even with this adjusted threshold, we can find significant segregation distortion on every chromosome, and it may be very strong for some markers. Obviously, it can be assumed that there are problems with the linkage map in this region. You may notice one other interesting fact about such a region: the QTL regions overlap with regions undergoing segregation distortion, and the favourable QTL alleles are in excess frequencies in this region. By carefully examining the "Statistical Summary" output and checking the segregation distortion results in this region, we can identify this fact (refer chapter 3 for χ2 analysis using AntMap).
During the computation, single-marker analysis considers one locus at a time and fits the following regression model (refer Fig. 6.1):

y = b0 + b1x + e,

where y is the phenotypic value of a line, b0 is the population mean, b1 is the additive effect of the locus on the trait and e is a residual error term. x is directly related to the genotypic code at the locus being tested for the line considered; it is −1 (for female or recurrent parent) or 1 (for male or donor parent). The population mean estimate, b0, should change very little from marker to marker. The critical parameter in this equation is b1; it tells us half the effect of changing the genotype from female homozygote (x = −1) to male homozygote (x = 1) at this locus. If the marker locus is not linked to a QTL, then we expect that changing the genotype at the marker locus has no effect on the phenotype and b1 = 0. As the effect of changing the genotype becomes greater, the value of b1 increases, and the values of the error terms, e, must decrease. This leads to increased evidence against the null hypothesis of b1 = 0 (no QTL linked to the marker).
The test of significance of b1 can be done by regression, ANOVA or maximum likelihood. The results of these methods for single-marker analysis are essentially identical. QTL Cartographer actually does this test using maximum likelihood estimation. Maximum likelihood estimates the most likely value of b1 given the observed genotypic and phenotypic data and reports the likelihood of the model with the most likely value of b1 as L1. A significance test is based on the likelihood ratio test (LRT). The LRT is calculated as −2 times the natural log of the ratio of the likelihood of the model


where b1 is set equal to 0 (L0) to the most likely QTL model (L1). This can be converted to an F-test. Notice that the values of x (the genotypic values) change for each locus, so the model is recalculated for each marker locus, and the significance test is redone for each locus. Therefore, we will test as many QTL models as we have markers in the data set.
Scanning the output table, we can find significant results, which are flagged by * and **. The point to be noted here is that QTL Cartographer's single-marker analysis is essentially identical to a regression or ANOVA analysis conducted using the genotype data for one marker at a time. It is natural to test the effect of the marker locus on the trait in this fashion. But recall that we usually consider the markers to be neutral and we are really searching for QTL that are linked to the marker loci. Therefore, the phenotypic effect observed at a marker locus is affected both by the true QTL effect and by the recombination frequency between the marker and the QTL. This makes sense, since recombinations between the marker and the QTL result in progeny with the opposite QTL allele compared to the parental arrangement. Between the two extremes of the marker and QTL being unlinked and being tightly linked, the estimated effect of the QTL decreases linearly as recombination between the marker and QTL increases. This means that unless the marker is right at the QTL, you will underestimate the true effect of the QTL. The marker closest to the QTL should have the largest effect. An important question follows: if eight significant markers were identified on chromosome 1, does it mean that the analysis has found eight QTL on chromosome 1? In reality, we do not know whether there are multiple QTL or a single QTL whose effect extends to numerous linked loci, but the latter hypothesis is simpler, so it is usually accepted unless solid evidence to the contrary can be given.

Interval Mapping

To perform interval mapping, select the option "Interval Mapping" from the "Analysis" menu. Mention the destination directory to save the graphic of the interval mapping results. Since we are doing a lot of statistical tests when doing a QTL analysis, you have to take account of that fact in choosing a threshold value of the likelihood ratio statistic for declaring that you've found a QTL. You can accept the default value, use one of your own or select one through permutations (which will take the longest but produce the most reliable threshold value). The number of permutation tests can be set to 300–1,000 or more. QTL Cartographer will automatically calculate the threshold when you press the "Go" tab, and the resulting LOD score will be fixed as the threshold for interval mapping. As mentioned above, the threshold value can also be fixed manually in the appropriate tab in the same window. Note that the default significance threshold is an LRT value of 11.5, which equals an LOD score of 2.5 (refer text for details). Once this threshold value is set, the interval mapping can be performed. The other parameter you may want to change is the walk speed. That's the parameter that determines the interval along the map at which QTL calculations are done. If you have a very dense map, you can set the interval to be quite small, and you'll have a much more precise idea of where any QTL you locate may be, but it will take the program much longer to do the calculations. If you have no preference for the walk speed, leave it at the default of 2 cM.
The graphics of all the chromosomes can be obtained by selecting the "All Chromos" option from the tab "Chrom". Interval mapping for each chromosome can also be carried out separately by selecting the particular chromosome ("First Chrom", "Second Chrom" and


so on), and the graph of each chromosome can be saved separately (as shown in Fig. 6.4). Similarly, interval mapping can also be performed for each trait separately by selecting one trait at a time (1: DFF, 2: PH, etc.). The additive effect for the particular trait is also displayed separately as a graphic, just below the graph of the LOD score (Fig. 6.4).
Analyse the graph of each chromosome to identify the QTL linked to the particular trait as the peak of the LOD score that exceeds the threshold. These are the peaks of the likelihood profile where QTL are most likely to be located (if you accept a peak as being significant, the exact position of the peak can be seen in the results table).
Figure 6.4 suggests that a QTL is present at about 20 cM from the left end of the chromosome. There are two parts to the graph. The x-axis of both graphs is the marker position along the linkage map. The top graph plots the LOD score for each marker against its position on the map. You can see that this has some relationship to the LRT discussed previously. Why are LOD scores given instead of LRTs? It is for simplicity. Linkage mapping results (such as those from MAPMAKER) are often given in LOD scores, so it makes some sense to also report the QTL results in terms of LOD. Also, LOD scores are easier to interpret than LRTs. One can easily see from the definition of an LOD score that:
• LOD = 0 means that the best QTL model and the no-QTL model have identical likelihoods (thus, no evidence for a QTL).
• LOD = 1 means that the best QTL model is 10 times more likely than the no-QTL model (which is considered only limited evidence for a QTL, not significant).
• LOD = 2 means that the best QTL model is 100 times more likely than the no-QTL model (which is still considered only limited evidence for a QTL, not significant).
• LOD = 3 means that the best QTL model is 1,000 times more likely than the no-QTL model. A threshold of 2.5–3 is often used to declare significance of QTL to minimise the frequency of Type I errors.
Notice that a horizontal line is drawn across the graph at the common threshold value of 2.5. You can change the level of this threshold on the graph by choosing Setting > Set display parameters and entering the desired value in the box near the bottom right of the dialog box. This raises the question of what the appropriate threshold for significance should be for declaring a QTL to exist near a marker (and that is why we used a permutation test). An LOD of 2.5 corresponds to an LRT of 11.5, which corresponds to a P value of 0.0007. This is lower than the ad hoc threshold of 0.05/24 = 0.002 previously suggested for rice. Again, we are faced with the problem of balancing Type I and Type II errors.
The bottom graph plots the additive effect against the marker position. Notice that the additive effect can shift from positive to negative according to the QTL. For example, finding the corresponding line in the output (position 20.0601), we can see that the additive effect of the A allele at this locus is estimated to be 9.20 and that this QTL accounts for about 22% of the variance (r2) in the trait (these values can be obtained from the table in the results output). The key point to be noted here is that interval mapping should have higher power to detect QTL located between marker loci and should provide better (unbiased) estimates of the QTL effects. But this is all based on the assumption that our linkage map is accurate!
The r2 value for a QTL peak can be interpreted as the proportion of the phenotypic variance explained by that QTL. But this interpretation must be made with caution. If it were really true, then we could add up all of the r2 values for the QTL discovered and


Fig. 6.4 Interval mapping results for the sample data

obtain the proportion of phenotypic variance that all of our QTL combined explain. For example, suppose seven QTL were reported in the output and their r2 values gave a cumulative total of 94% of the phenotypic variation explained by all seven QTL. It is obvious that this must be an overestimate because the heritability of the trait is usually less than 94%. Therefore, realise that the total variance explained by the QTL will typically be less than the sum of the individual QTL r2 values (in some cases, you can get individual QTL r2 values to sum to more than 100%).
One obvious reason that the r2 values can sum to more than they really explain jointly is that some of the QTL peaks given in the SIM output are false positives (Type I errors). As mentioned previously, if one conducts many independent tests, the overall probability of making at least one Type I error is much higher than the threshold rate for an individual test. It was also discussed that it is difficult to determine an appropriate threshold level for declaring significance and that it depends on the relative costs of making Type I and Type II errors. For that reason, it is suggested to perform permutation tests as a way to accurately obtain the overall genome-wise QTL Type I error rate. Another possible reason is that by adding up individual QTL r2 values to obtain a combined effect estimate, you are assuming that the QTL effects are independent. This can be violated in at least three ways in typical mapping studies: (1) The QTL may be linked on the same chromosome. (2) The QTL may be on different chromosomes but are not completely independent, simply because the sample size (number of mapping lines) is finite. (3) The QTL genetic effects may interact epistatically. One way to address these problems of not knowing whether a QTL is real and of overestimating the QTL effects in single-marker analysis and SIM is to build multiple QTL models such as composite interval mapping (although this does not entirely solve them). This should help to eliminate some false positive QTL because it is more difficult for them to be



included in a multiple QTL model and remain significant. It will also improve our estimates of the QTL effects and give more realistic estimates of the total variation explained by the QTL jointly, because the r² value of the multiple QTL model takes into account their lack of independence. The other issue, the genome-wise error rate, is also not entirely solved by multiple QTL modelling, because it is still not clear what the probability of a Type I error is in multiple QTL models. For interval mapping and composite interval mapping, however, we can get good estimates of the genome-wide Type I error rate by using permutation tests. The permutation test will normally take some time to finish. Usually, 1,000 permutations are recommended for an accurate estimate of the threshold value. The value that occurs at the bottom of the highest 5% of values is used as the threshold level that indicates an LRT significant at the 5% level, and it is automatically set by QTL Cartographer during analysis, as stated above.
When analysing an F2 or any other mapping population design using interval mapping or composite interval mapping, QTL Cartographer reports 21 columns of information for each position in the walk along the chromosomes. Before enumerating those statistics, it's useful to point out that there are four hypotheses being examined at each position (refer to the manual for details):
1. H0: a = 0, d = 0. Both the additive allelic effect and the dominance deviation are zero.
2. H1: a ≠ 0, d = 0. The additive allelic effect is distinguishable from zero, but the dominance deviation is zero.
3. H2: a = 0, d ≠ 0. The additive allelic effect is zero, but the dominance deviation is distinguishable from zero.
4. H3: a ≠ 0, d ≠ 0. Both the additive allelic effect and the dominance deviation are distinguishable from zero.
Many of the 21 columns in the output correspond to comparisons among these hypotheses or to estimates of additive and dominance effects under a particular hypothesis; refer to the manual for the detailed features of each column.

Composite Interval Mapping

The options available and the procedure for composite interval mapping are very similar to those for interval mapping. That's because the underlying statistical model is very similar. In fact, the only difference is that CIM attempts to statistically control for the genotype at markers other than those immediately flanking the candidate QTL. Consequently, the graphic displays generated by interval mapping and composite interval mapping look pretty similar.
The idea is that including the cofactors in the model reduces the error term and should provide higher statistical power to detect the QTL using interval mapping. However, power of QTL detection can actually decrease if you try to fit linked marker loci. QTL Cartographer deals with this issue by using a window that slides along the chromosome as the interval mapping proceeds and drops out of the model any cofactors that are within a set distance from the markers defining the interval being tested. Thus, if you set the window size to 10 cM and you are testing a position within the interval defined by loci B and C, then any markers from 5 cM to the left of B to 5 cM to the right of C would be dropped from the model if they happened to be cofactors. What this means is that the model being tested at each position is actually subject to change as cofactors drop in and out of the model due to being blocked by the sliding window. This sometimes makes interpretation of CIM results difficult.
We implement the CIM analysis in QTL Cartographer by selecting 'Composite Interval Mapping' from the 'Analysis' drop-down



menu on the top right of the main window. Again, we have the option to accept the default threshold of LRT = 11.5 (LOD = 2.5), or we can do a permutation test using CIM (the threshold could differ between CIM and SIM for the same data set because the analysis methods are different). You can also see the various options for selecting cofactors and setting window size by clicking the 'Control' button at the top centre of the top panel. The default is Model 6, which selects only the most significant markers as cofactors using multiple regression. There are other model options for choosing the cofactors (and you can even define the cofactors yourself), but these other models are not generally recommended (there may be some special cases where they would be useful). Having selected Model 6, we can still choose the multiple regression method (forwards, backwards or forwards and backwards stepwise). The forwards and backwards method is generally recommended as the best model-selection algorithm, but it takes longer to select the cofactors; alternatively, keep the default of forward selection. If you do choose stepwise regression, you will need to decide on appropriate thresholds for permitting markers to enter the model and for deleting markers from the model. We can leave the window size at the default of 10 cM and accept the default number of control markers (cofactors) of 5. It is probably good to limit the number of cofactors to about 5 unless you have a very large population size, or you may end up with so many cofactors that there will be little power to detect QTL in the interval mapping scans.
The output from the CIM analysis may show fewer QTL peaks than SIM, but each of the CIM peaks may have a higher LOD score than in the SIM QTL tests. You can also notice that the additive effect estimates and the r² values of the QTL are usually higher with CIM than they are with SIM. This is because of the higher power of detection and higher estimation precision gained by controlling the genetic background variation with the cofactors. But these r² values are still not based on fitting all of the QTL in a final model. And we still have the problem of finding tightly linked QTL peaks. These problems can be addressed by making a model that fits each of the QTL positions as interval positions simultaneously, without additional cofactors. This would give us a valid estimate of the total variation explained by the model and would give us evidence of which peak among multiple linked QTL peaks is the most likely position of the QTL.
We can also estimate a 95% confidence interval on the position of the QTL using these CIM results. This is based on the 1-LOD support interval, meaning that the confidence interval includes the position of the QTL peak plus all positions to the right and left of it that have LOD scores within 1 of the peak. For example, you can get a rough guess at the 95% CI for the QTL at a particular position, say 215.6, by looking at the LOD profile graph. The LOD at the peak is about 3.6, so any positions flanking it that have LOD scores greater than 2.6 should be included in the confidence interval. You can also do this by looking at the results for each tested position in the output file. Suppose the LRT value for position 215.6 is given as 16.79 (~3.6 LOD); we then need to include any positions around it with LRT values greater than 11.97 (= 2.6 LOD). In fact, it is not really known how to obtain true confidence intervals for QTL located with CIM, and the 1-LOD support interval may be an underestimate, but even so, it illustrates the point that in typical QTL-mapping studies, a QTL position cannot be located with better precision than about 10 cM. This makes relating QTL to underlying genes (positioned on a physical map) extremely unreliable.
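The conversion between LRT and LOD values and the 1-LOD support interval described above can be sketched in a few lines of Python. This is only an illustrative sketch and not part of QTL Cartographer: the function names are hypothetical, and the conversion assumes the usual relation LRT = 2 ln(10) × LOD (about 4.6 × LOD), which agrees with the default threshold of LRT = 11.5 (LOD = 2.5) quoted in this box.

```python
import math

LN10 = math.log(10.0)


def lod_to_lrt(lod):
    """Convert a LOD score to a likelihood-ratio test statistic (assumes LRT = 2 ln(10) * LOD)."""
    return 2.0 * LN10 * lod


def lrt_to_lod(lrt):
    """Convert a likelihood-ratio test statistic back to a LOD score."""
    return lrt / (2.0 * LN10)


def likelihood_ratio(lod):
    """How many times more likely the QTL model is than the no-QTL model."""
    return 10.0 ** lod


def one_lod_support_interval(positions, lods, drop=1.0):
    """Return (left, peak, right) map positions of the support interval.

    Starting from the highest LOD score, the interval is extended to the
    left and to the right for as long as the neighbouring scanned position
    has a LOD score within `drop` of the peak.
    """
    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[peak], positions[right]
```

Given the scanned positions (in cM) and their LOD scores from the output file, `one_lod_support_interval` returns the outermost contiguous positions whose LOD stays within 1 of the peak. The resolution of the interval is limited by the walking step of the scan, and, as noted above, the result should be read as a rough guide rather than a true confidence interval.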



Multiple Interval Mapping

Multiple interval mapping is a still more sophisticated method of mapping. It allows you to identify more than one QTL and to refine your analyses as you go along. One nice feature is that it provides an easily understandable summary of the results. Choose 'Multiple Interval Mapping (MIM)' from the 'Analysis' drop-down menu on the top right of the main window. We are prompted to select the trait and choose trait 1, PH. A new top window opens and says 'No MIM Model Exist'. Create a new MIM model by selecting 'New Model'. The 'Create New MIM Model' window opens, and we can choose the method we want to use to create the MIM model. We can choose 'MIM Forward Selection on Markers', 'Forward & Backward Selection on Markers', 'Scan Through Composite Interval Mapping' or 'MIM Forward Search Method'. The first two options implement multiple regression model building by fitting marker loci (not interval positions) in the model, as CIM does. The 'Scan Through Composite Interval Mapping' approach inputs the information from CIM and fits a multiple QTL model by first selecting the position with the highest LOD score from CIM, then fitting the position with the second highest LOD score from CIM and so forth. Only positions that remain significant when fitted with the previously included QTL positions will be maintained in the MIM model. The 'MIM Forward Search Method' builds a multiple interval position model by first selecting the position with the highest LOD score from interval mapping. Then, the genome is rescanned with interval mapping, but including the first selected position in the model during the rescan. Then, the next most significant position found upon rescanning the genome is fitted into the model. Following this, the genome is rescanned again, but including the first two positions in the model. This process continues until no more significant positions can be added. These two approaches result in MIM models that can then be further refined by testing the effects of moving one QTL position just slightly, while keeping the other positions constant, to see if the model can be improved. This can be done iteratively until no further improvements can be made in the model. Then the final model can be tested, providing total r² values for all QTL jointly and additive effects of QTL estimated simultaneously. However, for preliminary analysis, it is advised to start MIM using the CIM and MIM default methods and to compare the models they select as best. Start the MIM search procedure to build the initial MIM model. A dialog box pops up, and we are asked to choose the model-selection criterion from among the Bayesian information criterion (BIC), the Akaike information criterion (AIC) and modified versions of the original BIC. These selection criteria are computations that weigh the increase in likelihood from adding a parameter (such as a new QTL) to the model against the possibility of over-fitting the model by adding too many parameters. Each additional parameter can only be added if it increases the likelihood by more than some threshold value. The different criteria vary in how stringent they make that threshold. AIC is the least stringent, and the original BIC is probably a good choice. The MIM analysis then estimates the positions and the additive effects of the QTL. We can also test for epistasis among pairs of QTL. Hit 'Refine Model', then 'Searching for new QTL' in the window that pops up; then in the new top panel, select the 'Search for Epistasis' button and then hit 'Start'.
Caution: Interpreting the results requires more advanced knowledge of the genetics of the traits and additional restraint in interpretation. Readers are requested to refer to the manual/tutorial and the latest papers that have used MIM.
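The way such criteria trade likelihood against model size can be illustrated with a small sketch. It assumes the textbook definitions, in which each extra parameter costs a penalty of 2 under AIC and ln(n) under the original BIC (n being the number of individuals). The modified BIC variants offered by the software are not reproduced here, and the function names are hypothetical.

```python
import math


def penalty_per_parameter(criterion, n_individuals):
    """Penalty charged for each additional parameter (textbook AIC and BIC only)."""
    if criterion == "AIC":
        return 2.0
    if criterion == "BIC":
        return math.log(n_individuals)
    raise ValueError("unknown criterion: " + criterion)


def accept_parameters(lrt_gain, criterion, n_individuals, extra_params=1):
    """Decide whether added parameters are worth keeping.

    lrt_gain is twice the increase in log-likelihood obtained by adding
    the parameters; they are kept only if this gain exceeds the penalty.
    """
    return lrt_gain > extra_params * penalty_per_parameter(criterion, n_individuals)
```

With 200 individuals, the BIC penalty is ln(200), roughly 5.3 per parameter, so a new QTL effect that raises the LRT by 4 would be kept under AIC but rejected under BIC. This is the sense in which AIC is the less stringent criterion for any realistic population size.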



It is difficult to draw a QTL map of publication quality manually (such as the one shown in Fig. 6.3). MapChart, which is freely available at http://www.biometris.wur.nl/uk/Software/MapChart/, can be used for this purpose. MapChart is a computer package for the MS-Windows platform that produces charts of genetic linkage maps and QTL data. These charts are composed of a sequence of vertical bars representing the linkage groups or chromosomes. On these bars, the positions of loci are indicated, and next to the bars, QTL intervals and QTL graphs can be shown. MapChart reads the linkage information (i.e. the locus and QTL names and their positions) from text files. This information has to be calculated before using MapChart, usually with genetic mapping software such as Mapmaker, QTL Cartographer, JoinMap and MapQTL.
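Box 6.2 points out that when many tests are conducted, the chance of at least one Type I error is far higher than the per-test rate. A minimal sketch of that arithmetic follows. It assumes independent tests, which linked markers are not, so the Bonferroni-style correction shown here tends to be conservative for real genome scans. The function names are hypothetical.

```python
def familywise_error(alpha_per_test, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests


def bonferroni_level(alpha_overall, n_tests):
    """Per-test significance level that keeps the overall error near alpha_overall."""
    return alpha_overall / n_tests
```

For example, 200 independent tests at a per-test level of 5% make at least one false positive almost certain, whereas testing each at 0.05/200 = 0.00025 brings the overall rate back to just under 5%.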

Statistical Significance

Regardless of the method used to estimate and locate single or multiple QTL, once the test statistics are calculated, the likelihood of the event is assessed. The statistical basis of these comparisons relies on model assumptions, the most common of which requires the quantitative trait values to be normally distributed. In reality, however, the distribution of the trait values is not normal and needs to be considered as a mixture of (normal) distributions. Violating the normality assumption has an impact on the distribution of the statistic used to test for a QTL, which makes standard statistical procedures potentially inaccurate.
One approach to obtaining the distribution (or behaviour, in the long term) of the test statistic is to use a computer simulation to produce the data. Thousands of data sets, taken from the same statistical model, are simulated and the test statistics calculated. Together, these test statistics show the behaviour of the test in the long run and, therefore, represent the statistical distribution of the particular test statistic. From this distribution, one chooses the level of statistical significance or threshold above which results are considered statistically significant (or valid). This approach is indeed useful if the model used to simulate the data is the true model. However, the model rarely describes the complicated relationships that occur in the genome. For example, epistasis is difficult to model unless the interacting QTL are known in advance. When a detailed model accurately describes complex relationships between multiple (interacting) QTL, it is often the case that simulation-based thresholds are the only practical way to assess statistical significance because alternative approaches are so computationally demanding. In QTL analysis, this statistic provides only an approximate test, as the null hypothesis involves a non-mixture distribution whereas the QTL model involves a mixture distribution. Regression analysis also provides only approximate test statistics, as it assumes normally distributed errors within each marker type, whereas the distribution is really a mixture of two (or three) distributions.
Nonparametric resampling methods have provided a useful alternative to simulation-based thresholds. Permutation resampling and bootstrap resampling have been applied as a means of randomising the phenotypic (trait) data for the purpose of evaluating any test statistic under a null hypothesis that tests for a QTL.

Permutation Testing

Churchill and Doerge in 1994 proposed permutation testing to obtain empirical distributions for test statistics. In a permutation test, the phenotypic data are randomly shuffled relative to the marker data. Analysis

of the permuted data provides a test statistic that is a result of the null hypothesis (marker not associated with QTL). The number of permutations required is about 10,000 for a reasonable approximation of threshold levels of 1%. The important property of this method is that it does not depend on the distribution of the data. A permutation test is typically used to determine a threshold value for significance testing of the existence of a QTL effect.

Bootstrapping

Bootstrapping, described by Visscher et al. in 1996, is an alternative resampling procedure. From the original dataset, N individual observations are drawn with replacement. An observation is a phenotype and its marker type; hence, unlike in permutation testing, the observed combinations remain together. Note that some observations may appear twice in the bootstrap sample, whereas others may not appear at all. The confidence interval is reported to be approximated very well with this method, even with only 200 bootstrap samples. A bootstrap method is typically used to determine an empirical confidence interval for the QTL location, assuming that the QTL effect exists. In QTL analysis, usually many markers are tested, often for multiple traits and in multiple families. The risk of false positives is very high with so many tests. If a 5% significance level were used, we would expect 5% of the tests to be false positives. Therefore, a more stringent significance level is usually applied for genome-wide QTL detection, for example, 0.1%. Hence, for 200 tests, we would need a significance level of 0.05/200 = 0.00025 to have a chance of false positives of about 5%. Usually, a significance level of around 0.1% is applied.

Permutation Versus Bootstrapping and Other Methods

In permutation, traits are randomly assigned to individuals in the data set with no single trait value being assigned to more than one individual. In contrast, a bootstrap randomisation of the data samples allows an individual to acquire a phenotype with replacement, such that after an individual receives a random trait assignment, some other individual might receive the same random trait assignment. The debate about permutation or bootstrap randomisation is continuing and is based on the argument that a permutation retains the summary information of the trait, whereas the bootstrap changes the mean and variance of the bootstrap sample. In both resampling approaches, the genotypic (marker) assignments remain as in the original data, and, therefore, the genetic map does not change. An additional implication of not changing the genetic map is that all genotypic and population information is retained (such as segregation distortion, missing data and recombination fractions). In general, empirical threshold values obtained by permutation testing are widely reported in publications. Permutation testing can also be used to obtain genome-wide significance levels by simply repeating the procedure across all markers.
However, both resampling methods have been noted as being computationally demanding techniques that require more than 1,000 resamples, and each potentially leads to different results. Additionally, when the models are very complex, the extension of resampling methods to these situations quickly becomes computationally too demanding, as one would have to provide up to 1,000 resamples for every model considered. Motivated by the computational intensity of the resampling-based methods, Piepho suggested a quick method for calculating approximate QTL thresholds. Because the Piepho thresholds are theoretically based and do not retain the previously mentioned genetic specifications, they remain constant across experiments, even though it is well known that the environment has a large role in the variation of a quantitative trait and, therefore, in the accuracy of QTL location. In situations in which the biological and statistical effects are minimised (e.g. segregation distortion, environmental variation, small sample size and incomplete data), the theoretical and resampling-based thresholds are generally the same.
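The difference between the two resampling schemes can be made concrete with a short Python sketch. It is deliberately simplified and rests on several assumptions that are not from the text: genotypes are coded 0/1 (as in a backcross), the test statistic is the absolute difference between the genotype class means at each marker rather than an LRT from interval mapping, and all function names are hypothetical. The permutation routine shuffles the phenotypes so that each value is used exactly once, while the bootstrap routine follows the description in the Bootstrapping section and draws whole individuals (phenotype plus marker type) with replacement.

```python
import random


def marker_stat(genos, phenos):
    """Absolute difference between the mean phenotypes of genotype classes 0 and 1."""
    group0 = [p for g, p in zip(genos, phenos) if g == 0]
    group1 = [p for g, p in zip(genos, phenos) if g == 1]
    if not group0 or not group1:
        return 0.0  # one class is empty, so no contrast can be formed
    return abs(sum(group1) / len(group1) - sum(group0) / len(group0))


def genome_scan(markers, phenos):
    """Return (largest statistic, index of the marker where it occurs)."""
    stats = [marker_stat(col, phenos) for col in markers]
    best = max(range(len(stats)), key=stats.__getitem__)
    return stats[best], best


def permutation_threshold(markers, phenos, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wide threshold: the value at the bottom of the highest alpha share of permuted maxima."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = rng.sample(phenos, len(phenos))  # every trait value used exactly once
        maxima.append(genome_scan(markers, shuffled)[0])
    maxima.sort()
    cut = n_perm - round(alpha * n_perm)
    return maxima[min(cut, n_perm - 1)]


def bootstrap_peaks(markers, phenos, n_boot=200, seed=0):
    """Peak marker index in each bootstrap sample of individuals drawn with replacement."""
    rng = random.Random(seed)
    n = len(phenos)
    peaks = []
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)  # phenotype and marker type stay together
        boot_markers = [[col[i] for i in idx] for col in markers]
        boot_phenos = [phenos[i] for i in idx]
        peaks.append(genome_scan(boot_markers, boot_phenos)[1])
    return peaks


if __name__ == "__main__":
    # Simulated illustration only: 6 unlinked markers, QTL effect placed at marker 3.
    rng = random.Random(42)
    n = 100
    markers = [[rng.randint(0, 1) for _ in range(n)] for _ in range(6)]
    phenos = [1.5 * markers[3][i] + rng.gauss(0.0, 1.0) for i in range(n)]
    observed, peak = genome_scan(markers, phenos)
    peaks = bootstrap_peaks(markers, phenos)
    print("observed maximum:", observed, "at marker", peak)
    print("5% permutation threshold:", permutation_threshold(markers, phenos))
    print("bootstrap peak counts:", {m: peaks.count(m) for m in range(6)})
```

Taking the maximum statistic over all markers within each permutation is what makes the resulting threshold genome-wide rather than marker-specific. An empirical confidence interval for the QTL location could then be read from the percentiles of the bootstrap peak positions once marker indices are replaced by map positions.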

QTL × QTL Interaction: Impact of Epistasis

Epistasis refers to interactions between alleles from two or more genetic loci of the genome. The consequence of epistasis is that the phenotype of an individual cannot be predicted simply by the sum of the single-locus effects but rather depends on the specific combinations of loci. In germplasm that has experienced selection, epistasis has been shown to contribute to the expression of complex traits. Hence, estimating the genetic architecture of the trait in terms of the contribution of main effects and epistatic interactions to the genotypic variance is important in plant breeding. Such an interaction may arise when two genes are part of a common biochemical pathway, with gene 1 upstream of gene 2, so that in individuals homozygous for a null mutation in gene 1, mutations in gene 2 have no effect. This is the origin of the term epistasis, which literally means 'to stop'. Statistical geneticists now apply the term more widely to indicate any deviation from additivity between QTL.
Among the approaches, multiple QTL models are more powerful than single-QTL approaches because they can potentially differentiate between linked and interacting QTL. Epistasis, that is, the interaction of the alleles of two or more QTL, has great potential to alter the quantitative trait in a manner that is difficult to predict. One of the most extreme (and simplest) cases is the complete loss of trait expression in the presence of a particular combination of alleles at multiple QTL. The crucial challenge in the search for multiple QTL is to consider every position in the genome simultaneously for the location of a potential QTL that might act independently, be linked to another QTL or interact epistatically with other QTL. Interacting QTL are of particular interest as they indicate regions of the genome that might not otherwise be associated with the quantitative trait using a one-dimensional search. Although the concept of locating multiple, interacting QTL is straightforward, implementation is quite difficult due to the tremendous number of potential QTL and their interactions, which lead to innumerable statistical models and heavy computational demand. One heuristic approach that has been taken is to first locate all single QTL, then to build a statistical model with these QTL and their interactions and, finally, to search in one dimension for significant interactions. Kao et al. (1999) made such a proposal (see above) through a direct extension of interval mapping to include a simultaneous search for multiple epistatic QTL. Owing to the computational intensity of a multidimensional search, a truly simultaneous investigation is not possible, and the search is referred to as a quasi-simultaneous investigation. Approaches like this have the potential to work in many situations, but are limited to the pool of QTL that resulted from the first-pass QTL analyses, and have little hope of establishing true epistatic effects for QTL that are not individually significant. Searching through all potential models is a problem known as model selection and remains an active area of research in statistical genetics.
It must be noted that the detection of epistatic QTL relies even more on large population sizes than the detection of main effects. The most promising approach to detect epistatic QTL appears to be a full two-dimensional scan for all possible pairwise interactions. Such scans are nowadays computationally feasible and have successfully been used to detect epistatic interactions. In contrast, some researchers have considered that epistasis appears to be of minor importance in breeding populations. For most crops and traits, epistasis could be detected, but the proportion of genotypic variance explained by these epistatic QTL was small compared to that of the main effect QTL. There are, however, exceptions where individual epistatic QTL have been identified which explain a proportion of genotypic variance comparable to that of the main effects. As the forces active in natural populations are not effective in breeding populations, epistatic interactions may be selected and maintained, thus contributing to the expression of the trait. In addition, some results suggest the presence of epistatic master regulators, that is, loci that appear to be involved in a large number of interactions. Though the contribution of epistasis to the genetic architecture of agronomic traits in breeding populations appears

to be small, an epistasis scan seems advisable as single epistatic QTL may have large effects and thus may improve knowledge-based breeding.

QTL × Environment Interaction

Not all genotypes respond similarly to environmental signals, and there is variation in response (mainly in terms of reaction or sensitivity to the environmental stimuli or signal). Differential genotypic expression across environments is often referred to as genotype × environment interaction (G × E or GEI), which is one of the unifying challenges facing plant breeders. G × E is an age-old, universal issue that relates to all living organisms. Genotypes and environments interact to produce an array of phenotypes. GEI can be defined as the difference between the phenotypic value and the value expected from the corresponding genotypic and environmental values. Thus, G × E is the variation caused by the joint effects of genotypes and environments. Many agriculturally important traits are end-point measurements, reflecting the aggregate effects of large numbers of genes acting independently and in concert throughout the life cycle. External factors at any time during the life cycle may change the developmental process in ways that may not be predictable. The extent to which G × E affects a trait is an important determinant of the degree of testing over years and locations that must be employed to satisfactorily quantify the performance of a crop genotype. Because testing is a major factor in the time and cost of developing new crop varieties, G × E interactions and their consequences have received much attention. For example, it was found that the genetic control of cotton fibre quality, as reflected by QTL detected by genome-wide mapping, is markedly affected both by general differences between growing seasons (years) and by specific differences in water regimes. There appears to exist a basal set of QTL that are relatively unaffected by environmental parameters and may account for progress from selection in a wide range of environments, such as the diverse sets of environments that are often employed in mainstream cotton breeding programs. On the other hand, differences between years were reflected in similar numbers of QTL that were specific to each of the years. With respect to water regimes, however, several QTL were detected only in the water-limited treatment, while only a few were specific to the well-watered treatment. This suggests that improvement of fibre quality under water stress may be even more complicated than improvement of this already-complex trait under well-watered conditions. As a component of the total phenotypic variance (the denominator in any heritability equation), G × E affects heritability negatively. The larger the G × E component, the smaller the heritability estimate; thus, progress from selection would be limited. A large G × E reflects the need for testing cultivars in numerous environments (locations and/or years) to obtain reliable results. If the weather patterns and/or management practices differ among target areas, testing must be done at several sites representative of the target areas. The disadvantages of discarding genotypes evaluated in only one environment in the early stages of a breeding program have been discussed on many occasions. The discarded genotypes might have the potential to do well at another location or in another year. Thus, some potentially useful genes could be lost due to limited testing. With the increasing omnipresence of marker technology in plant breeding, the classical problem of how to handle G × E is gradually being absorbed into more basic questions about the existence and description of differential gene expression, where the term gene is replaced by QTL. Because of this process, the need has arisen for statistical models that are applicable in the contexts of both G × E and QTL × environment interaction (Q × E). Though theory for QTL detection and estimation has developed strongly during the past decades, theory for Q × E is still scarce and applications of such theory are few. Noteworthy contributions are listed in the further readings, and readers are requested to go through those bibliographies for cutting-edge knowledge on Q × E.
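The statement that G × E enters the denominator of the heritability equation can be illustrated with a small sketch. It assumes the commonly used entry-mean (broad-sense) heritability for genotypes tested in several environments with replication, H² = Vg / (Vg + Vge/e + Verror/(e × r)). The formula choice, the function name and the variance components in the usage note are assumptions for illustration and are not taken from the text.

```python
def entry_mean_heritability(var_g, var_ge, var_error, n_env, n_rep):
    """Broad-sense heritability on an entry-mean basis.

    var_g     : genotypic variance
    var_ge    : genotype x environment interaction variance
    var_error : residual (plot error) variance
    n_env     : number of environments (locations and/or years)
    n_rep     : number of replications per environment
    """
    phenotypic = var_g + var_ge / n_env + var_error / (n_env * n_rep)
    return var_g / phenotypic
```

With hypothetical components Vg = 4, Vge = 2 and Verror = 6 in two environments with three replications, the estimate is 4/6, about 0.67. Raising Vge to 8 lowers it to 4/9, about 0.44, whereas testing in eight environments instead of two raises it again, which is why a large G × E calls for testing in more environments.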

Congruence of QTL: Across the Environments and Across the Genetic Backgrounds Is the Key in MAS

Relatively large numbers of QTL have been detected for agronomic traits, yet the detected QTL generally explained less than half of the total genetic variation. What causes the remaining genetic variation that is unexplained by QTL in large samples? One possibility is that there are many QTL with very small effects, as assumed in classical models of quantitative genetics, and these remain undetected even with very large sample sizes. Another possibility is the presence of higher-order epistatic interactions, which are recalcitrant to QTL mapping. Further, a recurring complication in the use of QTL data is that different parental combinations and/or experiments conducted in different environments often result in identification of partly or wholly nonoverlapping sets of QTL (as stated in the above cotton example). The majority of such differences in the QTL landscape are presumed to be due to environment sensitivity of genes. Hence, taking proper care to include Q × E analysis will improve the further progress of QTL mapping towards MAS. The use of stringent statistical thresholds to infer QTL while controlling experiment-wise error rates is another reason why only a small fraction of the QTL is identified in any one experiment, which contributes to the apparent incongruence of QTL. Small QTL with opposite phenotypic effects might occasionally be closely linked in coupling in early-generation populations and separated only in advanced-generation populations after additional recombination. Comparison of multiple QTL-mapping experiments by alignment to a common reference map offers a more complete picture of the genetic control of a trait than can be obtained in any one study. However, the lack of a common set of anchored markers in the published reports of many crop plants limits the comparison of QTL across genetic backgrounds.

Meta-QTL Analysis

Since the first publication of a QTL localisation in tomato using molecular data by Paterson et al. in 1988, more and more species and traits have been studied, and many of these results have been made available via public databases. One of the main purposes of these databases is to help researchers compare results from different QTL studies, that is, to study the congruency of QTL locations in order to find out whether the QTL identified for a given trait in one population is the same as the QTL detected in other populations.
In theory, one would expect that the variation of a quantitative trait within a species is explained by a finite number of genes. Thus, QTL congruency investigation is a relevant approach to improve knowledge on trait genetics. Nevertheless, combining results from linkage studies can be tedious because, even if several studies focus on the same trait within the same species, family structures, sample sizes, genetic maps or simply QTL detection methods may differ between studies. Some methods have recently been developed to tackle the issues raised by heterogeneity between QTL studies. Integration of genetic maps and QTL locations by iterative projections on a reference map is now widely used to position both markers and QTL on a single and homogeneous consensus map (referred to as comparative mapping; see Chapter 7). However, this process yields a consensus marker map for which both the statistical properties and the biological reality cannot be clearly assessed, even if a robust ordered marker map was used as reference. Alternatively, an approach using graph theory to integrate various types of maps (such as genetic and physical maps) has been proposed, but it mainly dealt with the dissection of marker order inconsistencies between maps. In order to study QTL congruency, Goffinet and Gerber in 2000 proposed a strategy called meta-analysis. Meta-analysis, which is mainly used in medical, social and behavioural sciences, aims to pool results across independent studies in order to combine them in a single result or estimate. The relevance of meta-analysis investigations in genetics and evolution has been discussed widely. Yet another meta-analysis-based approach was proposed by Etzel and Guerra in 2002 to overcome the between-study heterogeneity and to refine both QTL location and the magnitude of the genetic effects. Nevertheless, both the methods are limited to a small number of underlying
QTL positions (from one to four for the former and only one for the latter), which is a serious limitation for a whole-genome study of QTL congruency. Even if the average number of QTL per experiment is around four in plants, one would expect that more than four genes can be involved in the trait variation on a single chromosome. In order to incorporate this fact, a computational and statistical package, called Meta-QTL, was developed for carrying out whole-genome meta-analysis of QTL-mapping experiments. Contrary to other methods, Meta-QTL offers a complete statistical process to establish a consensus model for both the marker and the QTL positions on the whole genome. First, Meta-QTL implements a new statistical approach to merge multiple distinct genetic maps into a single consensus map which is optimal in terms of weighted least squares and can be used to investigate recombination rate heterogeneity between studies. Secondly, assuming that QTL can be projected on the consensus map, Meta-QTL offers a new clustering approach based on a Gaussian mixture model to decide how many QTL underlie the distribution of the observed QTL. Meta-QTL is freely available at http://bioinformatics.org/mqtl.

Concluding Remarks on QTL Methods

The simplest statistical method for QTL mapping is analysis of variance at marker loci. This approach suffers when there is appreciable missing marker genotype data and when the markers are widely spaced. Interval mapping, though more complicated and more computationally intensive, allows for missing genotype data. LOD scores are used to measure the strength of evidence for the presence of a QTL; the LOD curve for a chromosome indicates whether a QTL may be present and where it is likely to be located. The region where the LOD score is within 1.0 of its maximum may be taken as the plausible region for the location of the QTL. Alternatively, permutation tests are valuable for determining significance thresholds for the LOD score; although computationally intensive, permutation tests allow for the observed phenotype distribution, marker density, and pattern of missing genotype data. Interval mapping and analysis of variance make use of a single-QTL model. Methods that consider multiple QTL simultaneously have three advantages: greater power to detect QTL, greater ability to separate linked QTL, and the ability to estimate interactions between QTL. These more complex methods may facilitate the identification of additional QTL and assist in elucidating the complex genetic architecture underlying many quantitative traits. Model selection is the principal problem in multiple QTL methods; the chief concern is the formation of appropriate criteria for comparing models. The simplest multiple QTL method, multiple regression, should be used more widely, although, like analysis of variance, it suffers in the presence of appreciable missing marker genotype data. A forward selection procedure using interval mapping (i.e. the calculation of conditional LOD curves) is appropriate in cases of QTL that act additively and makes proper allowance for missing genotype data. MIM is an improved method that, although computationally intensive, can, in principle, map multiple QTL and identify interactions between QTL. The important aspects of the model-selection problem require much further study and will not have general solutions.
From results of QTL experiments gathered over a wide range of plant species, it has been shown that confidence intervals around most likely QTL positions are, on average, approximately 10 cM, which usually includes several hundreds of genes. Also, several researchers have pointed out that QTL detection is statistically biased both in the true number of QTL, which is underestimated since only QTL with large effects are detected, and in the QTL effects, which are overestimated as only significant effects are reported (a phenomenon commonly referred to as the Beavis effect). Much has happened in methodological development on multiple QTL mapping, threshold determination and Bayesian QTL-mapping methods. This area has been advanced greatly by the interaction between genotyping technologies and statistical methodologies in the last several years and will continue to be so in the future. However, it is equally important that these tools are applied with thorough
understanding of the genetic data and the tools themselves.

Alternatives in Classical QTL Mapping

There are several alternative procedures available for QTL mapping other than the methods described above. These include bulked segregant analysis, selective genotyping, association mapping and nested association mapping.

Bulked Segregant Analysis and Selective Genotyping

The construction of linkage maps and QTL analysis takes a considerable amount of time and effort and may be very expensive. Therefore, alternative methods that can save time and money would be extremely useful, especially if resources are limited. Two short-cut methods that are commonly used to identify markers linked to QTL are bulked segregant analysis (BSA) and selective genotyping. Both methods require mapping populations.
BSA is a method used to detect markers located in specific chromosomal regions (Michelmore et al. 1991). Briefly, two pools or bulks of DNA samples are each combined from 10 to 20 individual plants from a segregating population; these two bulks should differ for a trait of interest (e.g. resistant vs. susceptible to a particular disease). By making DNA bulks, all loci are randomised, except for the region containing the gene of interest. Markers are screened across the two bulks. Polymorphic markers may represent markers that are linked to a gene or QTL of interest. The entire population is then genotyped with these polymorphic markers, and a localised linkage map may be generated. This enables QTL analysis to be performed and the position of a QTL to be determined. BSA is generally used to tag genes controlling simple traits, but the method may also be used to identify markers linked to major QTL. High-throughput or high-volume marker techniques such as RAPD or AFLP (refer to chapter 3), which can generate multiple markers from a single DNA preparation, are generally preferred for BSA.
Selective genotyping (also known as distribution extreme analysis or trait-based marker analysis) involves selecting individuals from a population that represent the phenotypic extremes or tails of the trait being analysed (Lander and Botstein 1989). In other words, the segregating population is evaluated phenotypically as a first step. Then, genotypic evaluation is performed on only a subset of the population: those genotypes that occur in the tails of the distribution of the trait of interest. Linkage map construction and QTL analysis are performed using only the individuals with extreme phenotypes. By genotyping a subsample of the population, the costs of a mapping study can be significantly reduced. Selective genotyping is typically used when growing and phenotyping individuals in a mapping population are easier and/or cheaper than genotyping using DNA marker assays.
The disadvantages of these methods are that they are not efficient in determining the effects of QTL and that only one trait can be tested at a time, since the individuals selected for extreme phenotypic values will usually not represent extreme phenotypic values for other traits. Furthermore, single-point analysis cannot be used for QTL detection, because the phenotypic effects would be grossly overestimated, and hence interval mapping methods must be used (Lander and Botstein 1989).

Genomics-Assisted Breeding

In the last decade, some scientific milestones, including genome sequencing projects, EST databases and microarray technologies, have enhanced the understanding of plant genomes and allowed for the identification of genes responsible for a desired trait. Besides using random markers derived from anonymous polymorphic sites in the genome, it has become possible to generate functional markers; they are derived from polymorphisms within the transcribed regions of the genome. Such markers are completely linked to the desired trait allele and have also been termed perfect markers. The main limitation of applying random, non-perfect DNA markers such as RFLPs, AFLPs or microsatellite
markers is the limited number of detectable polymorphisms, low throughput and high costs of assaying each locus. The development of SNPs allows higher throughput, but marker development and PCR reactions are still required. Thus, it was suggested that marker-assisted breeding and selection will gradually evolve into genomics-assisted breeding (the term genomic selection is also used in some publications). Currently, array mapping, association mapping and EcoTILLING are often discussed as methodologies within the context of genomics-assisted breeding; refer to chapter 10 for more details.

Array Mapping

With the completion of the genomic sequence of several model crop plants (since Arabidopsis thaliana, the first plant genome, was deciphered), plant genomics moved on to the era of functional genomics. The mere sequence of a genome is of limited value in revealing the function of genes. Gene expression needs to be studied in the next step, and DNA microarrays have become the main technological approach to expression studies. Microarrays (also known as biochips, DNA chips and gene chips) were developed by Schena and co-workers in 1995. There are several ways in which genes can be arrayed, the two most common technologies being cDNA arrays and oligonucleotide arrays. To conduct an oligonucleotide array, oligonucleotides are synthesised in situ for setting up the array, requiring knowledge of sequence data. cDNA arrays are also applicable to non-model organisms, as they only require a large cDNA library and the development of ESTs. ESTs are end segments of sequences from cDNA clones that correspond to mRNA, that is, parts of expressed genes. To conduct a cDNA array, several thousand ESTs are needed. A unique set of these ESTs is amplified by PCR and used to conduct the array.
Irrespective of cDNA arrays or oligonucleotide arrays, the basic steps are the following: (1) mRNA from cells or tissues in a sample is extracted, (2) it is converted into cDNA and fluorescently labelled, (3) the labelled cDNA is hybridised with the array, which is prepared by robotically spotting the probe onto a planar surface (often a glass microscope slide or filter), so that labelled cDNA pieces bind to their complementary counterparts on the array, and (4) a laser scanner is used to measure the fluorescent signal of the hybridised probes. As the intensity of the signals from the samples correlates with the original concentration of mRNA in the cell/tissue, it can be estimated whether the expression of a gene is up- or downregulated, absent or unchanged. Besides RNA expression profiling, microarrays offer opportunities for DNA polymorphism analysis and have been found useful in linkage mapping, the dissection of QTL or assessment of population structure. Fragments matching the array feature sequence perfectly will hybridise with a higher affinity than a fragment mismatching the sequence, and thus every array oligonucleotide has the potential to measure a polymorphism. The sequence polymorphisms detected as a difference in hybridisation intensity between two samples function as molecular markers and are referred to as single-feature polymorphisms (SFPs; see chapters 3 and 10). Microarrays can detect high numbers of SFP markers, and as several hundred thousand loci can be measured in a single experiment, all markers can be scored simultaneously, thus allowing the mapping of quantitative or multigenic trait loci. No amplification steps, gels or enzymatic manipulation are required to carry out a microarray, which makes such high-density oligonucleotide arrays an effective platform for identifying allelic variation.
Wolyn et al. (2004) developed a method called eXtreme array mapping (XAM) that combines array hybridisation with BSA in order to map QTL, hoping for a way to reduce the time and effort needed to genotype and map QTL. Within each bulk, the individuals are identical for the trait/gene of interest but arbitrary for all other genes. Ideally, the two samples differ genetically only in the selected region and are expected to have equal mixtures of both parental genotypes at loci unlinked to the mutation. The chromosomal region linked to the gene causing the phenotype will be fixed for alternative alleles between the two pools. BSA has the advantage of identifying markers associated with a trait without needing the construction of a full genetic map. BSA is widely used in many marker development
programs. One possibility in BSA is to hybridise DNA from each pool to a microarray. In this way, SFPs can be identified, indicating a genomic region of interest containing alleles that can be tested before introgression into elite germplasm.
Another application of the microarray technology to the analysis of DNA variation is the Diversity Array Technology (DArT). Using DArT, the presence and amount of a specific DNA fragment can be assessed in the total genomic DNA of an organism or a population. DArT does not rely on DNA sequence information, and potential applications include germplasm characterisation, genetic mapping, gene tagging or MAS. In terms of cost and speed of marker discovery/analysis, DArT can be a good alternative to other marker techniques such as RFLP, AFLP, microsatellite markers or SNP (refer to chapter 3). The major advantage of microarrays is the fact that gene expression patterns for a large number of genes or even a whole genome can be obtained in one experiment. As the elements placed on the chip are only between 20 and 200 µm in diameter and only spaced 50 µm apart, a whole genome complement can be placed on one chip.

Association Mapping

In plants, most of the QTL analyses have been conducted in highly structured populations with known pedigrees (such as F2 or backcross populations). However, in general, such structured populations have two major limitations. First, the limited number of recombination events results in poor resolution for quantitative traits. Second, only two alleles at any given locus can be studied simultaneously. In order to increase the resolution of mapping populations, large populations that have undergone several rounds of random mating should be created. These rounds of mating increase the potential number of recombination events, and structured populations such as recombinant inbred lines are potential resources in this context. Despite these efforts, the resolution for many QTL is still several centimorgans (cM), corresponding to hundreds of genes. Additionally, the low number of alleles sampled per locus in each population makes it difficult to examine the full range of genetic diversity available in crop germplasm.
Alternatively, an increasingly common method of refining the identification of QTL is the production of near-isogenic lines (NILs) followed by positional cloning. Nevertheless, technical limitations, such as the lack of contiguous coverage and the large amounts of repetitive DNA in the genomes of many plant species, prevent the successful implementation of positional cloning by means of chromosome walking (refer to chapter 7). Aside from these technical issues, positional cloning may not be efficient at identifying genes responsible for complex traits. This is due in part both to the difficulty of developing NILs for loci that explain less than 20% of the variance and to constraints created by only using two alleles. Indeed, the majority of genes cloned via positional cloning explain large portions of the phenotypic variation, for example, fruit weight2.2 in tomato, teosinte branched1 (tb1) in maize, heading date1 in rice and FRIGIDA and CRYPTOCHROME2 in Arabidopsis. Further, the production of NILs is a time-consuming process, especially in long-generation species.
Similar kinds of limitations were documented in animal genetics too. Linkage analysis has not been successful in fine-scale mapping of disease loci in humans because construction of organised pedigrees from controlled breeding crosses is not possible. Even when studying families with high occurrence of a disease, it is often difficult to find direct evidence of genetic recombination between polymorphic sites. Therefore, the medical community turned to association analysis because there were too few meioses in most families to finely map diseases. Association analysis, also known as linkage disequilibrium (LD) mapping or association mapping, is a population-based survey used to identify trait-marker relationships based on LD. Unlike linkage analysis, where familial relationships are used to predict correlations between phenotype and genotype, association methods rely on previous, unrecorded sources of disequilibrium to create population-wide marker-phenotype associations. Genetic
diversity is evaluated across natural populations to identify polymorphisms that correlate with phenotypic variation. Association analysis is extremely powerful because the individuals that are sampled do not have to be closely related, which harnesses all of the meiotic and recombination events among those individuals to improve resolution. Because of these recombination events, only markers in LD with a disease or trait of interest will associate with the disease or trait. Association analysis was successfully used for the identification and cloning of the cystic fibrosis gene, the diastrophic dysplasia gene and one of the major Alzheimer's factors.
As in animals, association analysis recently emerged as a powerful tool to identify QTL in plants, thereby increasing mapping resolution substantially over the current capabilities of standard mapping populations. Association analysis has the potential to identify a single polymorphism within a gene that is responsible for the difference in phenotype. In addition, many plant species have high levels of diversity for which association approaches are well suited to evaluate the numerous alleles available. LD plays a central role in association analysis. The distance over which LD persists will determine the number and density of markers and experimental design needed to perform an association analysis.
LD is also known as gametic phase disequilibrium, gametic disequilibrium and allelic association. Simply stated, LD is the nonrandom association of alleles at different loci. It is the correlation between polymorphisms (e.g. single-nucleotide polymorphisms (SNPs); refer to chapter 3) that is caused by their shared history of mutation and recombination. In a large, randomly mated population with loci segregating independently, and in the absence of selection, mutation or migration, polymorphic loci will be in linkage equilibrium. In contrast, linkage, selection and admixture will increase levels of LD.
The terms linkage and LD are often confused. Although LD and linkage are related, they are distinctly different. Linkage refers to the correlated inheritance of loci through the physical connection on a chromosome, whereas LD refers to the correlation between alleles in a population. The confusion occurs because tight linkage may result in high levels of LD. For example, if two mutations occur within a few bases of one another, they undergo the same pressures of selection and drift through time. Because recombination between the two neighbouring bases is rare, the presence of these SNPs is highly correlated, and the tight linkage will result in high LD. In contrast, SNPs on separate chromosomes experience different selection pressures and independent segregation, so these SNPs have a much lower correlation or level of LD. A variety of statistics have been used to measure LD, and each method has its own relative advantages and disadvantages.
Because allele frequency and recombination between sites affect LD, most of the processes observed in population genetics are reflected in LD patterns. Population mating patterns and admixture can strongly influence LD. Generally, LD decays more rapidly in outcrossing species as compared to selfing species. This is because recombination is less effective in selfing species, where individuals are more likely to be homozygous, than in outcrossing species. Admixture is gene flow between individuals of genetically distinct populations followed by inter-mating. Admixture results in the introduction of chromosomes of different ancestry and allele frequencies. Often, the resulting LD extends to unlinked sites, even on different chromosomes, but breaks down rapidly with random mating.
LD can also be created in populations that have recently experienced a reduction in population size (bottleneck) with accompanying extreme genetic drift. During a bottleneck, only a few allelic combinations are passed on to future generations. This can generate substantial LD. Selection, which produces locus-specific bottlenecks, also causes LD between the selected allele at a locus and linked loci. Moreover, selection for or against a phenotype controlled by two unlinked loci may result in LD despite the fact that the loci are not physically linked. There are several explanations for why the LD patterns are so different between plant samples. First, most of the diversity in plants such as maize is descended from an extremely variable outcrossing wild relative with large effective population sizes. Most of the
observed recombinant haplotypes were probably generated before domestication of this wild relative. Hence, the different rates of LD decay reflect differing levels of population bottleneck, that is, the progression from diverse landraces to diverse inbreds to elite inbreds. Additionally, the LD reported between loci 100 kb apart likely includes recombinationally inactive repetitive regions of the genome, which are not present in the other studies.
The basic structure of LD is understood for only a few plant species. There are still many issues that need to be better studied and resolved before LD can be used routinely to dissect complex traits. The reluctance to use this technique in plant systems and the mixed results seen in animal systems are due in large part to the effects of population structure. The presence of population stratification and an unequal distribution of alleles within these groups can result in non-functional, spurious associations. Highly significant LD between polymorphisms on different chromosomes may produce associations between a marker and a phenotype, even though the marker is not physically linked to the locus responsible for the phenotypic variation. Effective recombination rate is related to the degree of selfing that a species exhibits. This is because recombination is less effective in selfing species, where individuals are more likely to be homozygous at a given locus, than in outcrossing species. Although physical recombination may occur more often in selfing species, recombination is rarely between distinct alleles; hence, the amount of effective recombination is fairly low. This relationship between recombination and selfing can extend to LD. Because effective recombination is reduced severely in highly selfing species, LD will be more extensive. As mentioned above, the rate of LD decay depends on the recombination fraction. One must be cautious, however, when predicting the structure of LD based on the present-day mating system because the mating system may have changed significantly, whether by natural evolutionary processes or by human intervention. Because selfing rates can change rapidly, it is necessary to empirically determine the LD structure before employing association-based methods.
A major unresolved question is how genome structure and the rate of recombination affect the structure of LD across the genome. It is generally accepted that different regions of genomes undergo different rates of recombination. For example, in maize, there is extensive evidence for tremendous heterogeneity in rates of recombination across the genome. There is also evidence that gene-rich stretches are likely to have more recombination than methylated, gene-poor regions. One reason for decreased recombination in various regions is that the retrotransposon composition can be entirely different between two alleles.
Unfortunately, the direct connection between the present locations of hot spots and the structure of LD produced through evolution has not been completely demonstrated in plants. However, it is likely that this connection does exist, as in humans. This suggests that predicting LD levels between two sets of polymorphisms based solely on physical distance will be problematic. For example, two sites at either end of a 5-kb gene might have very little LD if the gene is a hot spot, whereas two sites on either side of 100 kb of retrotransposons could have very high levels of LD. The design of LD mapping experiments and placement of SNPs will require a thorough understanding of how these hot spots are dispersed.
Association approaches have been the main application of LD, but the nature of LD in the population determines what type of association approach can be conducted. There are mainly two approaches: whole-genome scan and candidate-gene(s)-based analysis. The rate of LD decay determines which of these two approaches can be used in association mapping.
In whole-genome scans, markers distributed across the genome are employed to evaluate all genes simultaneously. For example, the human genome may require 70,000 markers, Arabidopsis may require 2,000 markers and diverse maize landraces may require 750,000 markers, but only 50,000 markers are required for elite maize lines. The first association study to attempt a genome scan in plants was conducted in sea beet (Beta vulgaris ssp. maritima), a wild relative of sugar beet (Beta vulgaris ssp. vulgaris) (Hansen et al. 2001). For species other than Arabidopsis, rice and crops
that have physical maps, this could be a hefty number of markers, although technological improvements in the future may enable the scoring of such a huge number of markers. Despite these advances in genotyping, the key problem in association mapping is the large number of resources needed for phenotyping and fixing of statistical issues. Statistical significance in a genome scan could only be obtained with large sample sizes of thousands of individuals for QTL that explain modest amounts of variation.
There are two ways to circumvent this problem: either a population with greater levels of LD can be chosen or the analysis can be restricted to candidate gene regions. By choosing a bottlenecked population, one can substantially increase genome-wide LD. The limitation of this approach is that the appropriate populations must be identified, and by their nature, these bottlenecked populations will only contain a subset of the total variation. Again, it is necessary to point out that novel alleles outside the elite germplasm will not be identified. The candidate-gene association approaches rely on combining multiple lines of evidence to restrict the numbers of genes that are evaluated. Genome sequencing, comparative genomics, transcript profiling, low-resolution QTL analysis and large-scale knockouts all provide opportunities to develop and refine candidate gene lists. These approaches are powerful at identifying candidate genes but not at evaluating allelic effects. The first association study of a quantitative trait based on a candidate gene was the analysis of flowering time and the dwarf8 (d8) gene in maize by Thornsberry et al. in 2001.
The candidate gene approach can substantially reduce the amount of genotyping required, but most importantly, it can reduce the multiple-testing issues created by testing thousands of sites across the genome. The statistical issues in combining these disparate types of evidence have not been resolved. In plants, another way to conduct a genomic scan is to use F1-derived mapping populations. These populations are efficient for doing a genome scan, as often only a few hundred markers are needed. Because only two alleles are being evaluated, these populations will have more statistical power to evaluate the effect of a chromosomal region in comparison to association mapping. Additionally, there is more statistical power to evaluate epistasis. The advantages of association mapping in terms of resolution, speed and allelic range are complementary to the strengths of F2-based QTL mapping, namely, marker efficiency and statistical power. There are two commonly used programs for association mapping: TASSEL (http://www.maizegenetics.net/tassel) and STRUCTURE (http://pritch.bsd.uchicago.edu/structure.html). Readers are requested to visit these websites and manuals for the detailed procedure for association mapping, which is self-explanatory and simple to follow. The free website http://www.extension.org/pages/62755/association-mapping-and-tassel-software-tutorial may also be visited for further technical tips.

Nested Association Mapping

From the above discussions, it is obvious that linkage analysis often identifies broad chromosome regions of interest with relatively low marker coverage, while association mapping offers high resolution with either prior information on candidate genes or a genome scan with very high marker coverage. An integrated mapping strategy would combine the advantages of the two approaches to improve mapping resolution without requiring excessively dense marker maps. Nested association mapping (NAM) has been proposed as a genome-wide complex trait dissection strategy that integrates the advantages of linkage analysis and association mapping in a single, unified mapping population. The proposed procedure in NAM involves the following steps: (1) selecting diverse founders and a single reference line for developing a large set of related mapping progenies, preferably recombinant inbred lines (RILs), for robust phenotypic trait collection, (2) either sequencing completely or densely genotyping the founders, (3) genotyping a smaller number of tagging markers on both the founders and the progenies to define the inheritance of chromosome segments and to project the high-density marker information from the founders to the progenies, (4) phenotyping progenies for various complex traits and (5) conducting genome-wide association analysis relating phenotypic

traits with projected high-density markers of the progenies. The aims of the experimental
design in NAM are to (1) capture crop genetic diversity, (2) exploit ancestral recombination,
(3) efficiently take advantage of next-generation sequencing technologies through genetic
design, (4) generate mapping materials that can be evaluated for agronomic traits at field
locations of temperate regions, (5) develop a mapping population that has sufficient power to
detect numerous QTL and resolve them to the level of individual genes and (6) provide a
community resource. Thus, NAM has several advantages, and Yu et al. (2008) have provided a
detailed comparison of the main characteristics of different mapping strategies.
In NAM, the advantages of designed mapping populations from linkage analysis and of high
resolution from association mapping are integrated through the development of a large number
of RILs from diverse founders. While the common-parent-specific markers allowed the prediction
of transmission of chromosome segments in RILs, the short range of LD within these segments
across the diverse founders enabled improved mapping resolution. The genetic background effect
of these parental founders on mapping individual QTL, which can be a hurdle for association
mapping, is systematically minimised by reshuffling the genomes of the two parents of each
cross during RIL development as well as by the combined analysis of all RILs across all the
crosses. In general, the strategy of projecting sequence information, nested within
informative markers, from the most connected individuals to the remaining individuals is
applicable to a wide range of crop species, though it was first shown in maize.

EcoTILLING

EcoTILLING is based on the methodology of TILLING (Targeting Induced Local Lesions IN
Genomes), which was developed as a strategy in reverse genetics (McCallum et al. 2000).
TILLING is a methodology that identifies DNA polymorphisms regardless of phenotypic
consequence, allows the identification of single-base-pair allelic variation in target genes
and can be applied to any organism that can be chemically mutagenised. It is, on the one hand,
an attractive strategy for functional genomics and, on the other hand, also attractive for
agricultural applications. TILLING requires relatively few individual plants and is therefore
appropriate for small- and large-scale screening. In TILLING, traditional chemical mutagenesis
is followed by PCR-based screening to identify point mutations in regions of interest. First,
the regions of interest are amplified by PCR. By denaturing and re-annealing the PCR products,
heteroduplex molecules between wild-type fragments and mutated fragments form, provided that
at least one plant in the pool includes a mutation in the amplified region. The resultant
double-stranded products are digested by CEL I, an endonuclease that specifically targets and
digests heteroduplexes at mismatch positions. The cleaved products are resolved on denaturing
polyacrylamide gels, individuals carrying a mutation in the gene of interest are identified
and the mutant PCR product is sequenced. The TILLING methodology has been adapted to the
discovery of polymorphisms in natural populations, termed EcoTILLING by Comai et al. (2004).
The cutting with CEL I allows the display of multiple mismatches in a DNA duplex. If an
unknown homologous DNA is heteroduplexed to a known sequence, the number and position of
polymorphisms can be revealed, and the approximate position of each SNP within a few
nucleotides is recorded. EcoTILLING is applicable to any species, including heterozygous and
polyploid ones. It often compares favourably to full sequencing because it reduces the number
of sequences that need to be determined in order to identify a point mutation in a gene of
interest. It is considered that TILLING/EcoTILLING remains at the moment the technique of
choice for medium- to high-throughput reverse genetics in many organisms. EcoTILLING is gel
based and thereby a low-cost method. As a marker system, it combines two advantages. Being
based on the gene of interest itself, it has the advantage of a functional marker, and it
produces a high number of

marker alleles because every SNP in the amplified sequence results in a change in the overall
fragment pattern. Currently, EcoTILLING and microarrays, as two methods for natural
polymorphism discovery, seem to be two complementary tools. While microarrays have their
strength in the detection of global natural polymorphisms among a few genotypes, EcoTILLING is
better suited for surveying diversity at specific loci among many genotypes. In general, it
can be expected that developments in marker technologies during the next few years will go
along with the development of sequencing technologies. The new generation of sequencing
technologies, called next-generation sequencing, that has become available during the last few
years permits the rapid production of sequence information, and it can be expected that
sequence information of many different crop plants will become available soon.

Challenges in QTL Mapping

Although a huge number of publications on QTL mapping of agronomically and economically
important traits in several crop plants have appeared, geneticists, statisticians and breeders
have repeatedly shown that the QTL-mapping strategies used in these publications have several
limitations. The different approaches that can be employed to overcome these challenges are
discussed below.

Confronts with Mapping Populations

There are several types of experimental design that are suitable for QTL analysis, depending
on the mating system of the crop species. Advantages and limitations of each system in QTL
analysis are discussed in chapter 2. Most QTL analysis in plants involves populations derived
from pure lines, and several approaches have been developed to associate QTL with molecular
markers in such populations. In autogamous species, QTL-mapping studies commonly make use of
F2 or backcross progenies because they are the easiest and earliest to obtain. An F2 is better
than a backcross since QTL with recessive alleles in the recurrent parent cannot be detected
in a backcross, and when dominance is present, backcrosses give biased estimates of the
effects because additive and dominance effects are completely confounded in this design. The
degree of dominance can be estimated in F2 progenies, but there are two important
inconveniences of F2 and backcross populations: the genotype cannot be replicated (and
therefore cannot be evaluated several times or in several environmental conditions, different
years, locations, etc.), as it can be with doubled haploids (DHs) or recombinant inbred lines
(RILs), and epistatic interactions can hardly be studied. When n pairs of genes segregate
independently, the number of different gametes is 2^n, while the number of possible genotypes
in an F2 is 3^n; that is, with doubled haploids or RILs, fewer individuals need to be screened
(and this is economically very important when using molecular markers) to cover a similarly
wide spectrum of recombinants. Using simulated populations, it was concluded that the DH
population (also valid for a RIL population) could be used with smaller sample sizes because
of its advantage over backcrosses. Moreover, more accurate estimates of the location of the
QTL were obtained with less variance. This result is to be expected because the interval
mapping approach, in the absence of overdominance, uses more widely separated genotypic values
than in a backcross.
For RILs or DHs, the power of detecting a given quantitative trait locus is clearly related to
its relative contribution to the heritability of the character. The power of the test was
about 90% for heritabilities of QTL as low as 5%. To obtain a similar power for backcrosses,
the heritability attributable to the individual quantitative trait locus should be around 14%.
For a given type of gene action, it seems that DHs have a similar power to an F2. However, if
dominance is present, DHs or RILs will only detect the additive component of a particular
quantitative trait locus. This could be very important for QTL showing overdominant (or
pseudo-overdominant) effects. The major technical advantage for DHs or RILs, independent of
any effect of replication on the

required number of offspring, lies in the fact that the lines can be reproduced independently
and continuously evaluated with respect to additional quantitative traits and markers, with
all the information being cumulative. If the effect of replication is taken into account,
replicated progenies can bring about a major reduction in the number of lines that need to be
scored. Reductions are greatest when heritability of the trait is low, under the assumption of
co-dominance at all QTL.
Current statistical methods for mapping QTL based on controlled crosses are well developed
(Table 6.1). These methods depend critically on well-defined mapping pedigrees, such as F2, F3
or backcrosses, initiated with two inbred lines. The development of such pedigrees is
extremely difficult in outcrossing species, particularly fruit and forest trees, owing to high
heterozygosity (probably maintained by recessive lethals) and long generation intervals.
Therefore, other strategies based on half- or full-sib families derived from controlled
crosses have been proposed for outcrossing species. Alternatively, another approach that takes
advantage of the haploid tissue known as the megagametophyte in gymnosperms has been proposed.
To be able to apply the MAPMAKER program (see chapter 4), a full-sib family is usually
analysed as a double pseudo-testcross, enabling the construction of a map for each parent and
the utilisation of dominant markers (e.g. RAPD). In the cross between two heterozygous
individuals, many single-dose RAPD markers will be heterozygous in one parent, null in the
other and therefore segregate 1:1 in their progeny following a testcross configuration. Two
separate data sets are then obtained, one for each parent. This is very convenient when
parents belong to different species or genera since they may differ in gene order because of
translocations, inversions or deletions during evolution. QTL-mapping studies that use a
pseudo-testcross format differ from those that use inbred populations in that up to four
different quantitative trait locus alleles (and marker alleles) may be segregating. Because
the two parents do not derive from the same F1 individuals, the marker alleles in each may
differ in state and in phase from the QTL alleles. If genotypes are introduced as obtained in
MAPMAKER, without giving the phase or considering both possibilities per locus, linked markers
that differ in phase will be placed in different linkage groups, although they are closely
linked. An important limitation of the pseudo-testcross design is that only the effect of an
allele substitution (substituted by the alleles of the other parent) can be tested, which is
much less powerful than the classical testing. In other words, in addition to the effect of
allele substitution, only genotypic values can be estimated. If dominant markers are used, the
phase and power limitations clearly increase, although many studies ignore this.
In considering how many progenies in a mapping population to obtain and how many markers to
type, one must consider both the chance of detecting QTL and the resolution of localisation of
QTL. The chance of detecting a QTL is called the power. Suppose that, under the null
hypothesis of no segregating QTL, one obtains a genome-wide maximum LOD score of at least 3
only 5% of the time, so that a threshold of 3.0 may be used to define significant evidence for
the presence of a QTL. In this case, the power to detect a QTL is the chance that one will
obtain an LOD score above 3 in the region of the QTL. Power depends on the type of cross, the
size of the effect of the QTL, the number of progenies obtained, the density of typed markers
in the region of the QTL and the stringency of the chosen LOD threshold (i.e. the significance
level). When a QTL has an effect of only moderate size, this power can be extremely low. It is
possibly more interesting to consider the power to detect at least one QTL. If there are 10
unlinked QTL segregating in a cross and for each of them the power is only 20%, one will still
have approximately 90% power (1 - 0.8^10 = 0.89) to detect at least one of them. This has
implications for the replication of experiments; if there are many moderate-sized QTL
segregating in a particular cross, the set of QTL for which one will obtain strong evidence
may be quite different from one experiment to the next. Of course, QTL with quite strong
effect will be detected with high power and so will be seen with each group of progenies.
However, with a mapping population size of 200 typed at 1 cM spacing, the precision of

localisation of the QTL is greatly improved. But these results are not necessarily typical. It
is recommended that initial genotyping in an experimental cross be performed with markers at a
10 to 15 cM spacing. It is also suggested that for markers spaced at 10 cM or closer, there is
really little point in increasing marker density when the goal is simple detection of a linked
QTL. Typing additional markers in the region of an inferred QTL may improve the resolution of
its localisation, but such improvement will likely only occur if one has typed many progenies
in that population or the QTL has a relatively large effect.

Markers and Its Implications

There is no absolute value for the number of DNA markers required for a genetic map, since the
number of markers varies with the number and length of chromosomes in the organism. For
detection of QTL, a relatively sparse framework (or skeletal or scaffold) map consisting of
evenly spaced markers is adequate, and preliminary genetic mapping studies generally contain
between 100 and 200 markers. However, this depends on the genome size of the species; more
markers are required for mapping in species with large genomes. It was repeatedly shown that
the power of detecting a QTL was virtually the same for a marker spacing of 10 cM as for an
infinite number of markers and only slightly decreased for marker spacing of 20 or even 50 cM.
Typically, when investigations focus on questions of genomic location, more sophisticated
methods of QTL analysis, which rely on the estimated order of markers, are used. The added
information that is gained from knowing the relationships between markers is essential to QTL
methodologies that aim to locate QTL. The accuracy of locating QTL is limited by the
information, in particular the number of recombinants, that is gained from observing the
genotypic states of the markers. These observed recombinants can be limited by both small
sample size and missing genotypic data. A question that researchers very often ask at this
stage is "Should I genotype more markers on fewer individuals or score more individuals (for
genotype and phenotype) on fewer markers?" Because observed recombinants provide the
information, scoring more individuals addresses the previously mentioned concerns.

Segregation Distortion

The first step in any QTL-mapping experiment is usually to construct populations that
originate from homozygous, inbred parental lines. The resulting F1 lines will tend to be
heterozygous at all markers and QTL. From the F1 population, crosses are made (e.g. backcross,
F2 intercross and crosses to generate recombinant inbred lines), and the segregation of
markers and QTL is statistically modelled. In general, experimenters assume that markers are
segregating randomly, but if markers are subject to segregation distortion, it is not possible
to anticipate how the resulting estimates of recombination, or any potential QTL locations,
will be affected. Two important issues should be considered when assessing these statistical
results. The first consideration is sample size. The number of individuals studied provides
information for the estimation of phenotypic means and variances. A large sample of
individuals provides the opportunity to observe recombinant events (and thus to gain knowledge
of segregation distortion) and to estimate parameters with greater accuracy and, therefore, a
greater ability to detect QTL.
Missing data and markers with distorted segregations may make the ordering of markers
difficult to decide. In particular, markers deviating significantly from expected Mendelian
segregation ratios and markers with fewer than 100 data points are excluded from the QTL
analysis. High marker density is usually seen as a guarantee of a high-quality QTL analysis
regardless of the proportion of dominant versus co-dominant markers or the reliability of the
order of markers. At the same time, the abundance of dominant markers (RAPDs, AFLPs) may cause
problems in the construction of maps and in the analysis of QTL by interval mapping procedures.
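
The screening of markers for departure from Mendelian segregation ratios, mentioned above, is
commonly done with a chi-square goodness-of-fit test. The Python sketch below illustrates the
idea under stated assumptions: the function name segregation_chi_square, the genotype counts
and the choice of this particular test are illustrative and are not taken from the text. Only
two or three genotype classes are handled, so that the p-value can be computed with the
standard library alone.

```python
import math

def segregation_chi_square(observed, expected_ratio):
    """Chi-square goodness-of-fit test of genotype counts against a Mendelian ratio.

    observed       -- counts per genotype class, e.g. [AA, Aa, aa]
    expected_ratio -- expected ratio per class, e.g. [1, 2, 1] for a co-dominant F2 marker
    Returns (chi-square statistic, p-value).
    """
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    if df == 1:
        # Upper tail of the chi-square distribution with 1 degree of freedom
        p_value = math.erfc(math.sqrt(chi2 / 2.0))
    elif df == 2:
        # Upper tail of the chi-square distribution with 2 degrees of freedom
        p_value = math.exp(-chi2 / 2.0)
    else:
        raise ValueError("this sketch handles only two or three genotype classes")
    return chi2, p_value

# Hypothetical co-dominant marker in an F2 of 200 plants (expected 1:2:1)
chi2_f2, p_f2 = segregation_chi_square([70, 100, 30], [1, 2, 1])
print(f"F2 marker: chi-square = {chi2_f2:.2f}, p = {p_f2:.5f}")

# Hypothetical marker in a backcross of 100 plants (expected 1:1)
chi2_bc, p_bc = segregation_chi_square([60, 40], [1, 1])
print(f"Backcross marker: chi-square = {chi2_bc:.2f}, p = {p_bc:.4f}")
```

In this made-up example the F2 marker departs strongly from 1:2:1 and would be flagged as
distorted, whereas the backcross marker is a borderline case at the 5% level. The threshold at
which a marker is actually dropped remains the experimenter's decision.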

In QTL analysis, the genotype at a chromosomal position is inferred from the genotype of the
marker at that position. If the marker cannot distinguish between the genotypes in the progeny
(e.g. a dominant marker in an F2), such reduction in information affects the power of QTL
detection. In cases where the markers are very tightly linked, analysis of hundreds of
segregating progeny may be required to determine the correct order of markers. Linkage maps
with a high density of markers therefore have to be obtained from huge segregating
populations. An alternative methodology for constructing dense genetic linkage maps has been
recently reported (Jansen et al. 2001). It is based on simulated annealing to obtain the best
map according to the number of recombination events. It uses the Gibbs sampler for missing
data imputation and, notably, establishes posterior intervals for the positions of markers, as
a measure of precision of the genetic linkage map obtained.

Phenotyping

The accuracy of phenotypic evaluation is of the utmost importance for the accuracy of QTL
mapping (see chapter 5). A reliable QTL map can only be produced from reliable phenotypic
data. Replicated phenotypic measurements or the use of clones (via cuttings) can be used to
improve the accuracy of QTL mapping by reducing background noise. Thorough studies should
include phenotypic evaluations that have been conducted in both field and glasshouse trials,
and QTL-mapping studies should be independently confirmed or verified. Such confirmation
studies (referred to as replication studies) may involve independent populations constructed
from the same parental genotypes as, or from genotypes closely related to, those used in the
primary QTL-mapping study. Sometimes, larger population sizes may be used. Furthermore, some
recent studies have proposed that QTL positions and effects should be evaluated in independent
populations because QTL mapping based on typical population sizes results in a low power of
QTL detection and a large bias of QTL effects. Unfortunately, due to constraints such as lack
of research funding and time, and possibly a lack of understanding of the need to confirm
results, QTL-mapping studies are rarely confirmed. An important issue for QTL detection in
breeding populations is that the phenotypic data from breeding programs are often generated by
combining multiple trials, thus resulting in unbalanced designs. Another important
consideration is that a statistically sound joint analysis of the phenotypic data requires
overlapping genotypes between different trials, locations and years (breeding cycles). Another
crucial factor that strongly determines the success of a QTL-mapping experiment is the
phenotyping intensity. High heritabilities are a prerequisite for reliable QTL results and a
high predictive power of the detected QTL, that is, a low bias in the estimation of the
proportion of genotypic variance explained by these QTL.
A major concern in trait evaluation is not only to reduce environmental variation relative to
genetic variation but also the distribution of trait values in the segregating populations.
Some deviations from normality are corrected by a variable transformation (log10, arcsin,
etc.). For others, nonparametric tests for QTL detection should be used. Again, many studies
ignore these features and their effect on QTL analysis and on the efficiency and profitability
of MAS. Also, the trade-off between the extent of replication and the environments over which
the progeny needs to be evaluated versus the number of progeny should be considered. The
cost-effectiveness of all of these depends, of course, upon the relative costs of genotypic
and phenotypic analyses. It is clear that a single approach to the QTL analysis of a
quantitative trait is never enough to fully understand its genetic control.
As with genes, QTL effects may be environmentally sensitive, and this sensitivity results in
phenotypic plasticity or the ability of the organisms to take on alternative developmental
fates, depending on environmental cues. Phenotypic plasticity is likely to be of particular
importance in plants since their sedentary nature dictates that they adjust to their local
environment. Species with great phenotypic plasticity have been seen as likely progenitors for
novel species which

express only one of the possible developmental fates of their ancestors. It has been shown
that selection during maize domestication for a QTL allele (teosinte branched1), which lacks
environmental plasticity, may have led to the fixation of a morphological form that can be
induced in teosinte (its ancestor species) by environmental conditions. Many authors deal with
G x E interaction at the level of QTL as a matter of lack of consistency of QTL effects across
environments, concluding that such QTL are of no interest for MAS purposes. However, if a QTL
shows G x E interaction, then selection of genotypes adapted to specific environments may well
be achieved. The proportion of this kind of QTL is especially impressive in fruit and forest
tree species. Selection pressure on phenotypic plasticity has to be stronger in perennials
than in annuals. Following this reasoning, plasticity (the ability to change gene expression
depending on environmental conditions) should be the rule in tree species rather than the
exception. In any case, the study of G x E interaction needs carefully designed experiments
with several replications of each genotype per environmental condition tested, which is not
usually achieved in QTL studies of woody species.
For traits with low heritability, extensive replication and evaluation across different
environments is critical to get good estimates of QTL effects. It is suggested that larger
population sizes and more phenotypic testing are higher priorities than making dense linkage
maps (e.g. increasing marker density beyond one marker per 15 to 20 cM). Other effects of
small sample size include underestimation of the number of QTL involved in a trait because the
power of the QTL significance tests is reduced. Simultaneously, the effects of QTL that are
detected with small progeny sizes are overestimated, sometimes greatly so. The r^2 values
based on studies with small population sizes may be impressively high, but they are probably
not realistic. In the few cases when the QTL models developed in small populations are tested
against independent validation data sets with larger populations, the real amount of variation
they explain is much less. An assessment of the predictive power of QTL mapping with
cross-validation techniques has also shown that QTL mapped in populations of typical size have
poor predictive power in independent samples from the same population. Thus, perhaps we should
be less concerned with Type I errors (finding false positive QTL) than with Type II errors
(missing real QTL).

Statistical Issues

As we discussed, a QTL is a region of any genome that is responsible for variation in the
quantitative trait of interest. The goal of identifying all such regions that are associated
with a specific complex phenotype might, at first, seem quite simple, especially with all the
genomic and computational tools available to help us. Unfortunately, the task is difficult
because of the sheer number of QTL, the possible epistasis or interactions between QTL and the
many additional sources of variation. To combat this, QTL experiments can be designed with the
aim of restricting the sources of variation to a limited number so that dissection of a
complex phenotype might be possible. In general, a large sample of individuals has to be
collected to represent the total population, to provide an observable number of recombinants
and to allow a thorough assessment of the trait under investigation. This is the first key
step in QTL analysis, and it is ignored in most of the studies.
Composite interval mapping and multiple QTL mapping achieve the same result by reducing the
number of potential models under consideration. Both methods extend the ideas of interval
mapping to include additional markers as cofactors, outside a defined window of analysis, for
the purpose of removing the variation that is associated with other (linked) QTL in the
genome. The limitations of both approaches are that they are restricted to one-dimensional
searches across the genetic map and are challenged at times by the multiplicity of epistatic
QTL effects. There is also a risk of putting too many markers in the model as cofactors, and
care should be taken to preserve the amount of information that is available for estimation of
the QTL effect.
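
The effect of a cofactor can be illustrated with a deliberately simplified regression. The
Python sketch below is not composite interval mapping itself: it simulates a hypothetical
backcross of 200 plants with two unlinked markers, each assumed to sit exactly on a QTL, and
compares the residual variation of a single-marker model with that of a model that also
includes the second marker as a cofactor. The population size, effect sizes, random seed and
the helper names solve and residual_ss are all assumptions made for illustration.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def residual_ss(columns, y):
    """Residual sum of squares from regressing y on an intercept plus the given columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 200
# Backcross genotypes coded 0/1 at the tested marker and at the cofactor marker
test_marker = [rng.randint(0, 1) for _ in range(n)]
cofactor = [rng.randint(0, 1) for _ in range(n)]
# Assumed QTL effects of 1.0 and 2.0 trait units, plus unit-variance noise
trait = [1.0 * g1 + 2.0 * g2 + rng.gauss(0, 1)
         for g1, g2 in zip(test_marker, cofactor)]

rss_without = residual_ss([test_marker], trait)
rss_with = residual_ss([test_marker, cofactor], trait)
print(f"Residual SS without cofactor: {rss_without:.1f}")
print(f"Residual SS with cofactor:    {rss_with:.1f}")
```

Adding the cofactor absorbs the variation caused by the second QTL, so the effect at the
tested marker is judged against a smaller residual. Each cofactor also uses up one degree of
freedom, which is one way to read the warning above about putting too many markers in the
model.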

The importance of developing models with multiple QTL is well understood for linked QTL and
has an even greater role in the estimation and location of epistatic QTL. The limiting feature
in successfully using multiple QTL models is not our inability to write an equation for a
model; it is our inability to identify the best model or subset of models (from potentially
millions). Enumeration of all possible QTL models that consider the appropriate genetic
architecture for the experiment, as well as linkage and epistasis, is a daunting task.
Accurate and fast simultaneous multidimensional searches through the most likely models, and
their comparisons, are required to determine the most feasible models that warrant further
investigation. As shown previously, one-dimensional searches (e.g. interval mapping and
composite interval mapping) have benefited the mapping community but are limited by their
inability to accommodate multiple linked QTL. Because a stepwise linear approach to model
building, by adding and deleting every combination of multiple (linked) QTL and their
interactions, is not computationally feasible, many investigators have proposed solutions by
addressing the computational issues rather than the QTL-mapping method itself. One approach is
to globally search for the optimum multiple QTL genotype using genetic algorithms. The
application of genetic algorithm(s) to multiple QTL problems is one of many beneficial
approaches because it allows a sampling of the QTL models across unequal QTL numbers to be
considered and because it can be used in conjunction with any QTL-mapping methodology that is
implemented for a multidimensional search of a genome.
An inclusive computational framework for addressing many of the previously mentioned
challenges, namely, covariates, non-normal trait distributions, epistatic QTL and the issues
of multiple simultaneous searches, has been put forward by Sen and Churchill. The approach
breaks the QTL problem into two distinct parts: the relationship between the QTL and the
quantitative trait and the location of the QTL. Disjoining these two independent relationships
allows the initial focus to be placed on estimation of the unknown QTL genotypes and then on
allowing the search for different models and their comparisons with the information gained
from completing the QTL genotype information. The power in breaking a problem into two
independent parts is not new, as it was dealt with by Jansen in 1993, and lies in the fact
that information is gained in the first part that can be used in the second part. Once the QTL
genotypes are estimated, Sen and Churchill explore all possible models using an approach that
allows distinct models of different QTL numbers to be considered. As the QTL genotypes are
calculated independently from the QTL effect and location, previous issues of epistasis and
linked QTL are eliminated because the state of the QTL genotype and the QTL number are known
before the estimation of their effects and interactions. Multi-trait QTL mapping can also
benefit from the computational framework of Sen and Churchill by simply extending from a
single phenotype to multiple correlated phenotypes and by dissecting the problem in a similar
manner. Although the Sen and Churchill view has been shown to benefit QTL mapping, it might
have an even larger potential for accommodating other types of problem and data structure (for
details, see Doerge 2002).
The most obvious applications of QTL analysis are MAS in crop breeding and QTL cloning for
transgenic technology. The success (or efficiency) in both endeavours primarily depends on the
reliability and accuracy of the QTL analysis from which the information has been obtained.
Chromosomal QTL regions are quite often large and can include many open reading frames or
favourable QTL alleles in repulsion. This situation can exacerbate linkage drag, that is, the
introgression into elite germplasm of undesirable characters that are linked to a desirable
QTL, when QTL analysis is applied to plant breeding. Thus, a principal objective of QTL
analysis is confining QTL to narrow chromosomal regions, which implies joint consideration of
the type of experimental design or segregating population and its size; the number,
informativeness and level of polymorphism of the DNA markers; and the statistical
methodologies both to build up the linkage map and to perform the QTL analysis. These are the
methodological features that should be considered

seriously. Other factors also have an important influence on this accuracy: the experimental design (including the type of segregating population), its size, the heritability of the trait, the number and contribution of each quantitative trait locus to the total genotypic variance, their interactions, their distribution over the genome, the number and distance between consecutive markers, the percentage of co-dominant markers, the reliability of the order of markers in the linkage map, the evaluation of the trait, etc. There are also situations that may reduce the efficiency of MAS, when the environment or the genetic background, or both together, affects the final contribution of the QTL (i.e. when G × E and epistatic interactions are involved in the phenotypic value). QTL analysis not only provides DNA markers for efficient selection, it is also of particular value in resolving these interacting environmental and genetic effects which are common in agronomically important traits such as days to flowering, stay-green or tolerance to abiotic stresses. These aspects are also considered because their study will not only help plant breeding and germplasm enhancement but also plant genomics connecting the proteins of known biochemical function to the agronomic traits where they are involved.

Another basic problem that concerns QTL analysis is the true number of QTL governing a quantitative trait. It has been shown that it is difficult to locate more than 12 QTL in any given population at any one time, and generally far fewer. Moreover, because only significant effects are reported, published QTL effects will be biased towards larger values; the more stringent the significance level, the greater the bias. It is not the estimation procedures that are biased, it is the fact that only the significant estimates are used; the poorer the power of the test (low progeny number), the greater the bias. This bias will be greater on estimates of dominance than on additive effects because dominance effects are more difficult to detect. All these biases are larger with QTL of small effects and together imply that one will tend to underestimate the true number of QTL but exaggerate their additive and dominance effects. Suggestions in the statistical literature to diminish these problems include model validation with an additional sample and resampling strategies such as bootstrapping.

It has also long been clear that the confidence intervals (CIs) associated with QTL locations in segregating populations are large since QTL are estimated with poor precision. The CI for a QTL using likelihood methods is generally a 1-LOD support interval, which means that any position around a likelihood peak whose LOD score is no more than 1 below the peak value is included in the CI. Generally, QTL have been located to intervals of 15–20 cM. This is probably sufficient for marker-assisted selection, but this level of precision is nowhere near satisfactory to contemplate map-based cloning of QTL. The reliability depends on the heritability of the individual quantitative trait locus. Given a typical trait with an overall broad heritability of 50% or less, the individual quantitative trait locus will have heritabilities of a fraction of this 50%. Thus, with five equally sized QTL, each can only have a heritability of 10%. Simulations have shown that the 95% CI of such a quantitative trait locus in an F2 population of 300 individuals is more than 30 cM, while it is very difficult to reduce the CI to much less than 10 cM, even for a very highly heritable quantitative trait locus. More markers beyond a density of one in every 15 cM do not help much. These distances should be viewed in the context that, on average, a chromosome is about 100 cM long. Several approaches have been explored to overcome this problem. Again, increasing the number of genotypes is the most efficient way of improving precision, which is easy to achieve with F2 or backcross populations of herbaceous plants. Another strategy is to enhance the heritability of individual QTL in one of two ways. First, the environmental variation can be minimised by having many replicates of each individual, as can easily be achieved with RIL and DH lines (or vegetatively propagated fruit trees). Second, the residual variation caused by other QTL can be identified and removed from the error, as in the multiple QTL-mapping approach or composite interval mapping. However, in such cases, CIs cannot be reduced to much less than 10 cM and then only for the QTL with the largest effects. Note that 10 cM equates to 300 kbp in Arabidopsis and
6,000 kbp in wheat. Because of the wideness of CI, it is difficult to demonstrate the existence of more than three QTL per chromosome. This limitation affecting the distribution of QTL along the chromosomes is largely due to the low chiasma frequency per chromosome (around two, on average), which limits recombination and hence quantitative trait locus resolution. To go below 10 cM resolution, it would be necessary to resort to fine QTL-mapping designs, such as advanced intercross lines or near-isogenic lines, or to greatly increase population sizes (refer chapter 2). Analysis of hundreds or thousands of segregating progeny might be required, which is a costly and time-consuming affair. Alternatively, a pooled-sample approach to the construction of high-resolution genetic maps has been proposed. Increasing resolution allows the discovery of new QTL since linked QTL with favourable alleles in repulsion would mask each other. Increasing resolution is also very important to reduce genetic drag during the marker-assisted introgression of wild genes because a good QTL allele for a trait might be linked in phase to a bad QTL allele for another important trait. There are two situations in plant genomics where the wideness of CI is important: distinguishing linked QTL governing different traits from a quantitative trait locus with pleiotropic effects over the traits and candidate gene analysis. QTL with pleiotropic effects seem to be crucial in coordinating (or regulating) the connected physiological pathways of traits. Genes with related functions usually cluster through the genome. Gene clustering seems to be the case, at least, for resistance genes or genes controlling floral traits, which is very convenient for comparative genomics. Correlated traits also usually have QTL in common genomic regions. Several statistical approaches for analysing several quantitative traits simultaneously are being explored, such as those based on multivariate methodologies using Markov chain Monte Carlo approaches (Guo and Thompson 1992) or on canonical transformation of the traits into canonical variates, to which univariate techniques are then applied (Mangin et al. 1998).

Taking a step forward, high-resolution mapping may deliver several candidate genes but no proof of the molecular basis of the quantitative trait locus. Progress in this direction will require association tests, gene expression profiling and complementation tests (functional and quantitative). It is clear that the experimental set-up in an expression quantitative trait loci (eQTL; see chapter 7) mapping study is similar in structure to a traditional QTL-mapping study, but with thousands of phenotypes. The simplicity with which this difference can be stated obscures the resulting challenges posed for the statistical analysis of eQTL data. The statistical methods available for multi-trait QTL mapping consider relatively few traits and are not easily extended to the eQTL setting as they require estimation of a phenotype covariance matrix, which is not feasible for hundreds or thousands of traits (for a review of eQTL methods, refer Kendziorski et al. 2006 and references therein).

Some of the studies simply show QTL at different map positions, or with different effects in different environments, which may result from statistical uncertainty. Those studies, in annual species, show that the expression of QTL can vary among environments, and, together, they suggest that most of the identified QTL show significant G × E interaction. The percentage of such interaction is expected to be larger as the difference among the target environments becomes larger, as in the case of control versus stress. Very often, G × E interaction is confounded with the effect of the research team. For example, when two traits were evaluated in two locations by two different teams, only three QTL out of 12 and three out of 16 were detected by both teams at both locations. This can be easily seen in the published reports. Therefore, the effect of the research team may be more important than the G × E interaction as such or it is at least as large. How the traits were evaluated might also be important because, in all cases, the evaluation was visual using a simple scoring scale from 1 to 5 or 9. Unless the population size is large enough, the lines or families are uniform and the evaluation is consistent across researchers, the study of QTL × E interaction is not relevant.

A considerable body of research in quantitative genetics suggests that epistatic interactions
among loci at two-locus, three-locus and higher-order levels often have major effects on adaptability and have a considerable influence on phenotype. If there is gene interaction, populations can differentiate not only for population means but also for local average effects. The consequence of this differentiation is that the local average effects of alleles change relative to each other so that an allele favoured by selection in one population may be removed by selection in other populations. Using a two-locus genetic model that includes measures of genetic population differentiation, the potential role of additive × dominant and dominant × dominant epistasis in reproductive isolation and inbreeding depression at the QTL level was shown theoretically. It was also concluded that the same forces that reduce the apparent contribution of genetic interactions to the variance within populations lead to populations differentiating from the local average effects of alleles. Epistasis between QTL assayed in populations segregating for an entire genome has been found at a frequency close to that expected by chance alone. Yet, when RILs, DHs and isogenic lines are used, epistasis is detected more frequently. Therefore, QTL mapping may underestimate the number of non-additive interactions for three reasons. First, advanced backcross progenies are not useful for detecting epistatic QTL since every backcross generation greatly reduces the number of genotypic combinations because the donor genotype is being recovered. For example, the frequency of individuals with phenotype AB derived from the two-locus double heterozygote AaBb by self-pollination will be 9/16, while by backcrossing it will be 1 or 1/4 (testcross). Second, even large F2 mapping populations will contain few individuals in the two-locus double homozygous classes, limiting the statistical power for detecting non-additive deviations for these genotypes. Finally, searching for epistatic interactions involves many statistical tests, so significance thresholds must be increased accordingly. Unless epistatic interactions contribute largely to the total variance, they will not show up in F2 populations. Kao et al. (1999) described a method for simultaneous mapping of multiple interacting QTL, but owing to computational constraints, this is only a quasi-simultaneous QTL-mapping method.

Practical Utility

From a practical point of view, the following common question is often raised: Is the information from a QTL analysis enough for being successful in MAS for QTL? Experimental results have been mixed. Schneider et al. (1997) have reported that MAS improved drought resistance performance by 11% under stress and 8% under non-stress in common beans. A MAS study for malting quality in barley, based on two QTL, gave contrasting results (Han et al. 1997). Whereas tandem genotypic and phenotypic selection proved useful for one quantitative trait locus, a second putative quantitative trait locus identified in the original mapping population vanished in the population used for selection. The proportion of genetic variance explained by the QTL, individually and together, in the QTL experiment is a first key point. The second key point is that G × E and epistatic interactions at any quantitative trait locus may be involved in the phenotypic value. Concerning the first point, it is often difficult to determine from the literature how much of the genetic variance is explained by the QTL, either individually or together, because only the total phenotypic variance is reported. It is therefore not possible to decide whether any variation left unexplained is caused by other QTL or the environment. Taking into account that for QTL alleles of small effect the magnitude of the bias will be larger than for QTL alleles of large effect, one should be especially cautious with QTL of small effect. Fortunately, in some cases, a small number of QTL have been reported as contributing to a large proportion of the trait variance. This would explain why MAS experiments have generally been successful when using the marker information for introgressing or accumulating QTL alleles of large effect. At the same time, the purpose of the QTL analysis is not only MAS but also the genetic dissection of the quantitative trait. Therefore, all QTL have to be identified regardless of whether their effect is large or small, or environmentally sensitive or not. This task requires information
from different progenies, in different environments, development and implementation of robust QTL-mapping methodologies and complementing experimental designs to confirm, at least, QTL positions.

Bibliography

Literature Cited

Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138(3):963–971
Comai L, Young K, Till BJ, Reynolds SH, Greene EA, Codomo CA, Enns LC, Johnson JE, Burtner C, Odden AR, Henikoff S (2004) Efficient discovery of DNA polymorphisms in natural populations by Ecotilling. Plant J 37:778–786
Edwards MD, Stuber CW, Wendel JF (1987) Molecular marker facilitated investigation of quantitative trait loci in maize. I. Numbers, genomic distribution and types of gene action. Genetics 116:113–125
Etzel C, Guerra R (2002) Meta-analysis of genetic-linkage of quantitative trait loci. Am J Hum Genet 71:56–65
Goffinet B, Gerber S (2000) Quantitative trait loci: a meta-analysis. Genetics 155:463–473
Guo SW, Thompson EA (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361–372
Han F, Ullrich SE, Kleinhofs A, Jones BL, Hayes PM, Wesenberg DM (1997) Fine structure mapping of the barley chromosome-1 centromere region containing malting-quality QTLs. Theor Appl Genet 95:903–910
Hansen M, Kraft T, Ganestam S, Säll T, Nilsson NO (2001) Linkage disequilibrium mapping of the bolting gene in sea beet using AFLP markers. Genet Res 77:61–66
Jansen RC (1993) Interval mapping of multiple quantitative trait loci. Genetics 135:205–211
Jansen J, De Jong AG, Van Ooijen JW (2001) Constructing dense genetic linkage maps. Theor Appl Genet 102:1113–1122
Jiang C, Zeng ZB (1995) Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics 140(3):1111–1127
Kao C-H et al (1999) Multiple interval mapping for quantitative trait loci. Genetics 152:1203–1216
Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185–199
Mangin B, Thoquet P, Grimsley N (1998) Pleiotropic QTL analysis. Biometrics 54:88–99
McCallum CM, Comai L, Greene EA, Henikoff S (2000) Targeting induced local lesions IN genomes (TILLING) for plant functional genomics. Plant Physiol 123:439–442
Michelmore RW, Paran I, Kesseli RV (1991) Identification of markers linked to disease-resistance genes by bulked segregant analysis: a rapid method to detect markers in specific genomic regions by using segregating populations. Proc Natl Acad Sci USA 88:9828–9832
Moser G, Muller E, Beeckmann P, Yue G, Geldermann H (1998) Mapping QTL in F2 generations of Wild Boar, Pietrain and Meishan pigs. In: Proceedings of the 6th world congress on genetics applied to livestock production, vol 26, Armidale, pp 478–481
Paterson AH, Lander ES, Hewitt JD, Peterson S, Lincoln SE, Tanksley SD (1988) Resolution of quantitative traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 335:521–529
Rodolphe F, Lefort M (1993) A multi-marker model for detecting chromosomal segments displaying QTL activity. Genetics 134:1277–1288
Sax K (1923) The association of size difference with seed-coat pattern and pigmentation in Phaseolus vulgaris. Genetics 8:552–560
Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470
Schneider AK, Mary EB, James DK (1997) Marker-assisted selection to improve drought resistance in common bean. Crop Sci 37:51–60
Thoday JM (1961) Location of polygenes. Nature 191:368–370
Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D et al (2001) Dwarf 8 polymorphisms associate with variation in flowering time. Nat Genet 28:286–289
Visscher PM, Thompson R, Haley CS (1996) Confidence intervals in QTL mapping by bootstrapping. Genetics 143:1013–1020
Wolyn DJ, Borevitz JO, Loudet O, Schwartz C, Maloof J, Ecker JR, Berry CC, Chory J (2004) Light-response quantitative trait loci identified with composite interval and eXtreme array mapping in Arabidopsis thaliana. Genetics 167:907–917
Yu J, Holland JB, McMullen MD, Buckler ES (2008) Genetic design and statistical power of nested association mapping in maize. Genetics 178:539–551
Zeng ZB (1993) Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proc Natl Acad Sci 90:10972–10976

Further Readings

Asíns MJ (2002) Present and future of quantitative trait locus analysis in plant breeding. Plant Breed 121:281–291
Broman KW (2001) Review of statistical methods for QTL mapping in experimental crosses. Lab Anim 30(7):44–52
Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29:311–322
Doerge RW (2002) Mapping and analysis of quantitative trait loci in experimental populations. Nat Rev 3:43–53
Hospital F (2009) Challenges for effective marker-assisted selection in plants. Genetica 136:303–310
http://www.knowledgebank.irri.org/ricebreedingcourse/bodydefault.htm#QTL_mapping.htm
Jorde LB (2000) Linkage disequilibrium and the search for complex disease genes. Genome Res 10:1435–1444
Kang MS (2002) Quantitative genetics, genomics, and plant breeding. In: Papers from the symposium on quantitative genetics and plant breeding in the 21st century, Louisiana State University, 26–28 Mar 2001, CAB International 2002
Kendziorski CM et al (2006) Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics 62:19–27
McMullen MD et al (2009) Genetic properties of the maize nested association mapping population. Science 325:737–740
Würschum T (2012) Mapping QTL for agronomic traits in breeding populations. Theor Appl Genet 125:201–210
Xu Y, Crouch JH (2008) Marker-assisted selection in plant breeding: from publications to practice. Crop Sci 48:391–407
Fine Mapping
7


Need for Fine Mapping or High-Resolution Mapping

The ultimate aim of molecular genetic studies of quantitative genetic variation is to find the genes that influence the trait. However, the use of MAS does not require the gene to be known, but can be effective with linked markers. So, the critical point is how closely a QTL is mapped with respect to the markers. Several simulation studies have shown that for MAS, informative markers that flank a QTL within 5 cM seem adequate. In contrast, virtually all QTL-mapping studies have been conducted with panels of 100–300 markers covering the entire genome, corresponding to an average distance between markers of between ~5 and 20 cM. Hence, it is imperative to fine map at least those QTL regions with a larger number of markers. Such a mapping process is also referred to as high-resolution mapping.

Fine mapping of QTL will also increase the efficiency of foreground selection in introgression programs through MAS because the genomic region that has to be controlled is smaller. This will reduce the number of individuals that is required and the genotyping cost. In addition, introgression of a smaller genomic region helps to eliminate unwanted genes that are located around the target QTL. This is particularly important when the donor is an exotic genetic resource. Similar considerations also hold true for recurrent MAS (refer chapter 8 for more details). For MAS to be effective, the target QTLs must be free from any undesirable linkage. The large size of the regions encompassing QTLs and the likely presence of undesirable linked genes make it essential to fine map such regions to facilitate their precise introgression and to identify candidate genes within these QTLs.

Further, fine mapping will help to clone the genes residing at the target QTLs (referred to as map-based cloning; see below). This provides more detailed knowledge of the functional genes underlying these QTL and allows a better understanding of the physiology of the quantitative trait. This might also allow better prediction of the effects of the QTL in different genetic backgrounds and environmental conditions and on different characteristics of performance. In addition, specific management strategies could be developed for specific genotypes to enhance their performance.

Thus, the initial QTL-mapping step typically needs to be followed by a fine-mapping step. To select the optimal fine-mapping strategy, one needs to have a good understanding of what factors limit the achievable mapping resolution. Among them, the primary four factors are:
1. Marker density: Mapping consists of placing a QTL in a given marker interval. The more markers one has, the smaller the average interval size and, thus, the higher the map resolution.
2. Crossover density: Actually, recombinant chromosomes are the only ones that provide mapping information.
N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_7, Springer India 2013
3. QTL detection methods: This corresponds to the accuracy with which one can infer the QTL genotype of a given individual or chromosome. Positioning a QTL with respect to a crossover requires knowledge of the QTL allele carried by the corresponding chromosome.
4. Molecular architecture of the QTL: Many QTL probably reflect the combined effect of not one, but several, linked QTLs. Approaching such a composite QTL using a model that assumes a single location may result in fuzzy positioning.

Types of Molecular Markers Suitable for Fine Mapping

Increasing the marker density in a chromosome segment of interest is conceptually the easiest limiting factor to resolve. However, developing markers that target specific regions is a laborious and time-consuming task. Fortunately, this has recently changed with the availability of the nearly complete genome sequences of the major crop species. Microsatellite markers can be directly identified from the genomic sequences and suitable primers can be designed and used in fine mapping since they are simple and easily exchanged among laboratories. However, the frequency of polymorphism detected at microsatellite loci is generally not sufficient: because it is very low (<10%), SSRs are not suitable for fine mapping, which prompts the use of other markers. In such cases, insertions or deletions (INDELs) and/or single nucleotide polymorphisms (SNPs) in both intergenic and coding regions (see chapters 3 and 10) might be more useful in fine mapping since they are far more efficient than microsatellites in detecting polymorphism. Although the validity of each SNP needs to be confirmed, the conversion rate is generally very high and the transportability across populations remarkably good. If the genome sequence is available for the target species, it is possible to identify the putative genes present in the QTL regions. SNPs can be developed for these regions to enhance the efficiency of identifying causal polymorphism.

In addition to selection of appropriate highly efficient polymorphic markers, a good number of recombinants is needed to establish the relative position of the locus of interest (for which thousands of progenies are needed). Further, since transformation is a routine activity in several plant species, functional complementation would be a more productive approach to analyse the function of identified candidate genes. Thus, it will help to identify more tightly linked markers besides deciphering the physiological and molecular mechanism of expression of such a quantitative trait.

Physical Mapping and Its Role in Fine Mapping

A physical map is an ordered set of DNA fragments, among which the distances are expressed in physical distance units, that is, in base pairs. The resolution or accuracy with which this can be done ranges from mapping loci to a particular chromosome (low resolution) to the determination of the precise nucleotide sequence (high resolution). Physical maps are an important resource for several kinds of molecular research, such as positional or map-based cloning of agronomically important genes, analysing chromosomes and genome structure in detail and establishing the relationship between genetic and physical distance (thereby increasing the efficiency of fine mapping).

It has been shown in Arabidopsis that, on average, 1 cM equals 280 kbp. On the other hand, in barley, 1-cM genetic distance covers more than 7,000 kbp. Thus, the relationship between genetic and physical distance can vary up to 100-fold (or even more) in different regions of the same genome. For example, in tomato, the average amount of DNA per cM was estimated as 750 kbp, but this varies in certain regions of the tomato genome and has been shown to range from as low as 50 kbp per cM to more than 4,000 kbp per cM. The large discrepancy is mainly due to the existence of recombination suppression (at centromeric regions) and recombination hot spots on the chromosomes.

A prerequisite for physical mapping is the availability of libraries containing large inserts of genomic DNA and the techniques and resources
such as pulsed field gel electrophoresis, rare-cutting restriction enzymes and Southern blotting facilities. Large genomic DNA inserts derived from the given crop genome are usually cloned into high-capacity vectors such as cosmids, yeast artificial chromosomes (YAC), bacterial artificial chromosomes (BAC), bacteriophage P1-derived artificial chromosomes (PAC) and mammalian artificial chromosomes (MAC). Using such vectors, insert DNA of 45 to 800 kbp can be cloned. Such large insert libraries facilitate the development of small insert libraries which will be sequenced to determine the order of nucleotides in those small inserts using state-of-the-art automated DNA sequencing technologies (such as pyrosequencing, massively parallel signature sequencing (MPSS), polony sequencing and sequencing with Illumina or SOLiD; see chapter 10). Then, the sequencing results are ordered or assembled as contigs, and from this assembly, the complete physical map of the genome is prepared. Such a physical map can be compared with the genetic map, and new markers (such as SNPs and/or INDELs) can be obtained from the physical map for fine or localised mapping of the target QTL region in the genetic map.

Comparative Mapping

Genetic or physical maps constructed in one species can be compared by means of common markers (or common single gene traits) with closely related species. Such common markers are referred to as anchored markers. These comparative maps can be used to study genome evolution (how the genome has been rearranged through time) and to make inferences about gene organisation, repeated sequences, etc. Further, map-based cloning (see below) may be easier in some species than others, for example, rice (with a small genome) versus wheat (with a massive genome). Conservation of the gene order within a chromosomal segment between different species is referred to as colinearity, whereas conservation of the order of genes in DNA fragments that are bigger than 50 kb is referred to as microlinearity. Deletion, inversions and duplications are detected at this level. In comparative mapping, loci in different species originating from the same ancestral locus are called orthologous loci. Paralogous loci are those loci in different (or the same) species that arose due to a duplication of an ancestral locus. Comparative mapping has been done in several crop species that usually belong to a single family. For example, in Solanaceae, comparative maps are available for tomato and pepper and tomato and potato. Similarly, in Gramineae, a comparative map between rice and maize is available. The details of this comparative map have shown that maize has six times more nuclear DNA than rice. However, sixfold more DNA did not increase recombination in the conserved region compared with rice. Single-copy genes of rice are always duplicated in maize, and 72% of the duplications still exist in the maize genome. Further, it is noticed that the loss of 28% of the duplicated copies of maize genes could have resulted from deletions or loss of entire chromosomes or chromosomal segments. Pairs of homologous chromosomes in maize are similar and collinear to rice chromosomes and have the same gene content but shuffling of gene orders. An example of a Gramineae comparative map can be found at CMap (the Comparative Map Viewer), which allows you to construct comparisons between different maps. CMap is available at http://www.gramene.org/cmap/. In this module, you can view genetic, physical, sequence and QTL maps for many species of cereal crops. All data (map sets, maps, features and correspondences) in the Maps Module are built from the Markers Module. Users are encouraged to consult the Markers Module for primary information about markers and their mappings. The Maps Module should be considered to be primarily a visualisation tool.

Thus, it is obvious that the gains of comparative mapping are several: (1) Maps constructed in one species can be compared by means of common (or anchor) markers with closely related species. (2) These comparative maps can be used to study genome evolution (how the genome has been rearranged through time) and to make inferences about gene organisation, repeated sequences, etc. (3) It facilitates easier map-based cloning.
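
Returning to the genetic-versus-physical distance figures quoted in the physical mapping section of this chapter, a short Python sketch shows what they imply for a marker interval of fixed genetic length. The kbp-per-cM values are the ones given in the text (the barley and upper tomato values are stated there as "more than", so they act as lower bounds); the dictionary labels, the function name and the choice of a 5 cM interval (the flanking distance mentioned as adequate for MAS) are assumptions made only for illustration.

```python
# kbp of DNA per cM as quoted in the text; the keys are illustrative labels.
KBP_PER_CM = {
    "arabidopsis_average": 280.0,
    "tomato_low_region": 50.0,
    "tomato_average": 750.0,
    "tomato_high_region": 4000.0,  # text: "more than 4,000 kbp per cM"
    "barley": 7000.0,              # text: "more than 7,000 kbp"
}


def physical_span_kbp(interval_cm: float, kbp_per_cm: float) -> float:
    """Convert a genetic interval (cM) to an approximate physical span (kbp).

    This assumes a constant kbp-per-cM ratio across the interval, which the
    text itself notes does not hold near centromeres or recombination hot spots.
    """
    if interval_cm < 0 or kbp_per_cm <= 0:
        raise ValueError("interval_cm must be >= 0 and kbp_per_cm must be > 0")
    return interval_cm * kbp_per_cm


for label, ratio in KBP_PER_CM.items():
    span = physical_span_kbp(5.0, ratio)
    print(f"{label}: a 5 cM interval spans roughly {span:,.0f} kbp")
```

The same 5 cM interval therefore corresponds to very different amounts of DNA depending on the species and region, which is one reason why the number of candidate genes left after fine mapping differs so much between genomes.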

Genetical Genomics/eQTL Mapping

Transcriptome analysis (the study of gene expression at the mRNA level, with its spatial and temporal patterns) with microarrays is opening exciting possibilities for the genetic dissection of complex traits (see chapter 10). In an approach called genetical genomics, the expression levels of many (not all) genes are measured in one or more tissues assumed to be relevant to the given phenotype. Jansen and Nap (2001) first introduced this concept as genetical genomics. The transcript levels of individual genes are treated as quantitative traits and subjected to QTL mapping to identify expression QTL (eQTL). In general, tissue samples are harvested and the mRNA is purified and then subjected to some means of measurement. In microarray hybridisation technology, the purified mRNA is converted to labelled cDNA, which hybridises with complementary DNA on the microarray slide. The relative amount of transcript present for each gene represented on the microarray is determined by measuring the amount of label bound following hybridisation. eQTLs are derived from polymorphisms in the genome that result in measurably different transcript levels. Of course, any method of expression profiling based on RNA, protein or metabolites can be used as a quantitative trait in genetical genomics. eQTLs are typically sorted into local eQTL, when the affected gene lies within the confidence interval of the eQTL, and distant eQTL, when it does not. Local and distant eQTL are also referred to as cis- and trans-acting eQTL, respectively. One possible explanation for a local eQTL is a cis-acting regulatory mutation that directly controls the transcript level of the corresponding gene, whereas a distant eQTL necessarily implies a trans-acting molecular mechanism. Thus, the genetical genomics approach provides a novel way of discovering, at a genetic level, regulators of gene expression acting either in cis or in trans relative to the target gene. The eQTL position may coincide with the gene itself, displaying cis regulation, or be different, thus revealing trans-acting factors controlling expression. A common feature of eQTL studies is the detection of hot spots of trans-acting eQTL, interpreted as regions rich in regulatory genes that co-regulate many downstream targets.

The majority of expression studies are being performed in mapping populations in order to aid in the identification of eQTLs of interest, as well as to take advantage of the simplified genetics due to the homozygosity of the selfed progeny derived from a biparental cross. As one might expect, differential gene expression can be explained simply by sequence differences in the gene itself, for example, in promoter regions that respond to transcription factors to varying degrees. In other words, a motif integral to transcription factor binding may contain a polymorphism or mutation that prevents effective binding and therefore decreases transcription of that gene. Additionally, polymorphisms in the intronic regions could affect splicing, or changes in untranslated regions (UTR) could affect mRNA stability, both potentially creating degradation-susceptible transcripts. When samples are derived from a genotyped mapping population and subjected to high-resolution mapping, these cases of polymorphism are identified as cis-eQTLs. In this case, the genomic marker allele that most closely associates with a phenotype is located in close proximity to the gene being measured. However, genomic markers most closely associated with an eQTL phenotype may physically lie far from the gene being measured. In these trans-eQTLs, the polymorphism that results in differential transcript levels may be located in the transcription factor itself, thereby creating a dysfunctional or hyperactive protein. It could be expected that cis-acting factors would have larger effects on measurable transcript levels.

In addition to understanding general patterns of gene expression, these genetical genomic studies are creating caches of information useful for a multitude of applications. As one gene regulates the level of expression of another (trans-acting eQTL), novel upstream or downstream components in gene regulation pathways can be identified. In addition to steady state analysis, the induction of stimuli such as drought can lead to a deeper understanding of gene networks that are activated under such conditions. Correlation of

measured transcript levels (eQTL phenotype) with classic QTL phenotypes may suggest functional roles for the allelic variation in gene expression and serve as a predictor of downstream effects on plant development, morphology and agronomic interest. Finally, the analysis of the activation of particular genes under steady state or external stimuli treatment provides insight into the functionality of endogenous promoters. While promoters used for transgenic expression have been thoroughly analysed in model systems and model inbred lines, the understanding of agronomically important phenotypes may benefit from the analysis of genetic polymorphisms of trans-acting regulators affecting transgene expression, which can allow for the optimization of expression in both current and future transgenic lines.

Local eQTL that co-localise with QTL affecting the phenotype of interest denote possible causal genes. Genetical genomics thus provides a highly parallelized shortcut bypassing, at least in some instances, tedious QTL fine mapping. It is important to realise that finding local eQTL overlapping a phenotypic QTL, a common occurrence given the abundance of local eQTL in most experiments, provides interesting candidates, but does not establish a causal connection. A correlation between the corresponding expression traits and phenotype in the studied population does not establish a causal connection either. One possible strategy to distinguish unexpected from causal correlation is to apply conditional correlation measures. If transcript levels of the candidate gene directly affect the phenotype, one will find a correlation between transcript levels and the phenotype both across and within genotypes; if transcript levels and the phenotype are not causally linked, the correlation will be observed across, but not within, genotypes. Alternatively, one can attempt to specifically disturb the candidate gene genetically (for instance, by knockout or knockdown strategies) and measure the effects on the phenotype.

The contribution of genetical genomics to the molecular dissection of complex traits is not limited to facilitating the discovery of causal genes. Whereas local eQTL coinciding with phenotypic QTL point towards primary events, co-localised distant eQTL may help unravel the networks that connect primary events and phenotypes. The phenotype may be controlled by the products of the genes regulated in trans. Just as for overlapping local eQTL, one has to be wary of fortuitous QTL overlap and resulting trait correlations. The same conditional correlation measures and gene perturbations may be applied to probe into the nature of the observed correlations. Despite its many attractive features, genetical genomics has its limitations. It can only detect effects that are mediated by alterations in transcript levels. Moreover, it can only detect effects that manifest themselves in the panel of examined tissues, which is usually limited. Further, it should be noted that QTL regions often appear quite complex and approximate and may contain hundreds of genes. Consequently, the actual involvement of the candidate gene in most cases remains to be confirmed by genetic and physical mapping, positional cloning, expression analysis or genetic transformation experiments. Cost-saving alternatives to large genome-wide and population-wide analyses with minimal loss of informativeness have been proposed: analysing pooled samples of phenotypically extreme members of the population or concentrating on genotypically selected individuals. Though anonymous DNA markers are useful for covering the entire genome and for efficient QTL analysis, deployment of gene-derived markers will be more desirable since they can validate the identified QTLs by elucidating the genes underlying those QTLs. Transcript-derived markers such as EST-SSR, CAPS, dCAPS and, more recently, SNPs have promising applications at this juncture. Further, advances in microarray technologies reveal global changes in gene expression, and mapping these changes in the same mapping population used for QTL analysis might lead to the identification of informative eQTLs. However, it should be noted that genetical genomics is only in its infancy. It is also vital to note that several works in proteomics (see chapter 10) have indicated that functionally important changes in the levels of transcripts are not necessarily reflected in changes in the levels of proteins, and hence

assessing the genetics of protein, transcripts and DNA markers is essential to infer causal networks to understand how the system works as an integrated whole. Another concern during eQTL analysis is how transcript variation relates to other genomic and/or physiological levels. As stated, preliminary evidence suggests that metabolites and transcripts have different levels of heritability as well as epistasis underlying their genetic architecture. The differences in heritability between these three trait levels (metabolite, enzyme activity and transcript) could be explained by transcripts being functionally linked to potential DNA polymorphisms in their genes or regulators, which would leave less potential for stochastic noise to be introduced between the genetic and transcript variations (see chapter 10). In comparison, the variations in metabolite and enzyme activities require a DNA polymorphism to be processed via transcription and translation, with the extra steps allowing more stochastic noise into the system. An alternative explanation is that fundamental differences exist in the physics of the three trait levels. Metabolites within a network are directly linked such that the atoms in one metabolite are transferable to a different metabolite via a few direct enzymatic steps. This interconnectedness could magnify small biological perturbations, allowing more noise in metabolic networks than in corresponding transcripts. Similarly, enzymatic networks are likely dominated by Michaelis–Menten kinetics, which introduces a nonlinear relationship between the levels of protein and enzyme activity. As such, the use of linear statistical approaches to define heritability may produce a bias in heritability within a nonlinear enzymatic network in comparison with transcript networks if transcription behaves in a more linear fashion.

Thus, keeping all these limitations in mind, it is suggested that integration of the advances in quantitative genetics, functional genomics and bioinformatics as system quantitative genetics can greatly facilitate systems level understanding of the biological cause and effect relationship. A number of experiments are underway in this direction and will hopefully yield exciting results in the near future.

Map-Based Cloning

Successful isolation of genes underlying the target QTL using the information on the QTL map and physical map is referred to as map-based cloning. There are at least three important steps in map-based cloning, although the details may vary depending on the crop and purpose:
1. Mapping of the target QTL and identification of more closely linked markers through fine mapping. For preliminary QTL mapping, a population size of 60–150 individuals with 100–200 markers that span the entire genome is sufficient. However, for fine mapping, it is essential to increase the population size to >1,000 with a larger number of informative/polymorphic markers.
2. Physical localization of the target QTL on the physical map using the markers' sequence information (referred to as chromosome landing). This identifies the genomic fragment which is flanked by the target markers. The identified genomic region is then scanned towards the putative candidate genes (referred to as chromosome walking). It is usually done by screening a large insert genomic library with the closely linked marker and isolating the clones that hybridise with the marker. This is followed by creating new markers (usually sequences at the end of the clone) and screening the segregating population (often a large population of 1,000–3,000 individuals) with the new markers. The goal is to find a set of markers that co-segregate with the gene under the QTL. Co-segregation means that whenever one allele of the gene is expressed, the markers associated with that allele are also present (i.e. recombination has not occurred between the gene and the marker). Genes identified in this way are called positional candidate genes: they lie in the region of the genome scan that is likely to host a QTL.
3. Gene identification, characterization and validation: Co-segregation confirms that the genes are within the two flanking markers. Step 2 usually finds a large number of putative candidate genes (which are identified by predicting open reading frames (ORFs) in the DNA sequence of the

selected clone through bioinformatics tools). It is now necessary to determine the actual candidate gene behind the QTL. This can be done by several approaches such as generation of transgenic plants with the identified putative candidate genes and generation of independently derived mutant alleles at the target gene (referred to as recombinational or mutant analysis).

Map-based cloning was first successfully employed in a mammalian system, for the cystic fibrosis gene. In plants, it has been demonstrated on several occasions. For example, map-based cloning has been applied for isolating the ABI3 gene and the omega-3 fatty acid desaturase gene in Arabidopsis. Similarly, fruit weight2.2 in tomato, teosinte branched1 (tb1) in maize, heading date1, Sub1 and SalT in rice and FRIGIDA and CRYPTOCHROME2 in Arabidopsis have been isolated using positional cloning approaches.

The map-based cloning of the sd-1 gene, as an example, is explained here briefly: Several studies had reported that sd-1 is closely linked to several molecular markers on chromosome 1; however, the resolution of these genetic analyses was not sufficient to identify the gene responsible for the trait, semi-dwarfism (sd). By employing advanced positional cloning strategies with high-throughput genetic mapping using CAPS, dCAPS or single nucleotide polymorphism (SNP) markers, Monna et al. (2002) successfully identified sd-1 as a single open reading frame (ORF) which encoded gibberellin oxidase, the key enzyme in the gibberellin biosynthesis pathway. Analysis of 3,477 segregants using several PCR-based marker technologies, including CAPS, derived-CAPS and SNPs, revealed one ORF in a 6-kb candidate interval. Normal-type rice cultivars have an identical sequence in this region, consisting of 3 exons (558, 318 and 291 bp) and 2 introns (105 and 1,471 bp). Dee-Geo-Woo-Gen-type sd-1 mutants have a 383-bp deletion from the genome (278-bp deletion from the expressed sequence), from the middle of exon 1 to upstream of exon 2, including a 105-bp intron, resulting in a frameshift that produces a termination codon after the deletion site. The radiation-induced sd-1 mutant Calrose 76 has a 1-bp substitution in exon 2, causing an amino acid substitution (Leu [CTC] to Phe [TTC]). Expression analysis suggests the existence of at least one more locus of gibberellin oxidase which may prevent severe dwarfism from developing in sd-1 mutants. Accordingly, they have successfully shown the potential of accelerated positional cloning and its applications in plants.

Validation of QTLs

The markers identified in preliminary genetic mapping studies are seldom suitable for marker-assisted selection without further testing, validation and additional development. Markers that are not adequately tested before use in MAS programs may not be reliable for predicting phenotype and will therefore be useless. Generally, the steps required for the development of markers for use in MAS include high-resolution mapping, validation of markers and possibly marker conversion, testing the markers in related germplasm accessions and testing the genes isolated from the map-based cloning using transgenic tests. The procedure of fine mapping and its importance have been discussed above, and the rest is discussed hereunder.

Testing the Markers in Related Germplasm Accessions

Generally, markers should be validated by testing their effectiveness in determining the target phenotype in independent populations and different genetic backgrounds, which is referred to as marker validation. In other words, marker validation involves testing the reliability of markers to predict phenotype. This indicates whether or not a marker could be used in routine screening for MAS. Markers should also be validated by testing for the presence of the marker on a range of cultivars and other important genotypes that possess the target trait. Even when a single gene controls a particular trait, there is no guarantee that DNA markers identified in one population will be useful in different populations, especially when

the populations originate from distantly related germplasm. For markers to be most useful in breeding programs, they should reveal polymorphism in different populations derived from a wide range of different parental genotypes.

There are two instances where markers may need to be converted into other types of markers: when there are problems of reproducibility (e.g. RAPDs) and when the marker technique is complicated, time consuming or expensive (e.g. RFLPs or AFLPs). The problem of reproducibility may be overcome by the development of SCARs or STSs derived by cloning and sequencing specific RAPD markers (see chapter 3 for more details). SCAR markers are robust and reliable. They detect a single locus and may be co-dominant. RFLP and AFLP markers may also be converted into SCAR or STS markers. The use of such PCR-based markers that are converted from RAPD, RFLP or AFLP markers is technically simpler, less time consuming and cheaper. In addition, STS markers may also be transferable to related species.

Bibliography

Literature Cited

Monna L, Kitazawa N et al (2002) Positional cloning of rice semi-dwarfing gene, sd1: rice Green Revolution Gene encodes a mutant enzyme involved in gibberellin synthesis. DNA Res 9:11–17
Jansen RC, Nap JP (2001) Genetical genomics: the added value from segregation. Trends Genet 17:388–391

Further Readings

Holloway B, Li B (2010) Expression QTLs: applications for crop improvement. Mol Breed 26:381–391
Kliebenstein D (2009) Quantitative genomics: analyzing intraspecific variation using global gene expression polymorphisms or eQTLs. Annu Rev Plant Biol 60:93–114
Paran I, Zamir D (2003) Quantitative traits in plants: beyond the QTL. Trends Genet 19(6):303–306
Marker-Assisted Selection
8

Conventional plant breeding is largely dependent on selection of desirable plants, which is strongly influenced by genotype × environment interaction. Selecting plants in a segregating progeny that contain appropriate combinations of genes is a critical component of plant breeding. Usually, breeders improve crops by crossing plants with desired traits, such as high yield or disease resistance, and selecting the best offspring over multiple generations of testing under multi-location trials. Thus, it may take 10–15 years to develop a new variety. Any technique that may speed up this process or make it more efficient is a boon to breeders.

Molecular marker technology offers such a possibility. Marker-assisted selection (MAS) involves selecting individuals based on their marker pattern (genotype) rather than their observable traits (phenotype). The term marker-assisted selection was first used by Beckmann and Soller in 1986. Since then, marker-assisted selection has attracted plant breeders and geneticists, and subsequently, the numbers of publications on both MAS and QTL mapping have increased dramatically. Sometimes, the term SMART breeding, an acronym for Selection with Markers and Advanced Reproductive Technologies, which was first used in animal breeding, is used to describe marker-supported breeding strategies. In some publications, genotype-assisted selection was also used instead of MAS. Once markers that are tightly linked to genes or QTLs of interest have been identified, breeders may use specific DNA marker alleles as a diagnostic tool to identify plants carrying the genes or QTLs, prior to field evaluation of large numbers of plants.

Major MAS methods include the following: (1) Marker-assisted introgression or marker-assisted backcross, where one gene from a donor line is introgressed into the genetic background of a recipient parent by repeated backcrossing to the recipient parent. Here, markers are used either to control the presence of the target gene or to accelerate the return of the background genome to the recipient type. (2) Population screening: the simple screening of populations (e.g. F2, F3, recombinant inbred lines, doubled haploids) for genotypes of interest based on markers. (3) Gene pyramiding schemes, where two (or more) parent line(s), each hosting one (or more) gene(s) of interest, are crossed, then the offspring population is screened for individuals carrying both (or all) genes of interest. The process can be iterated further to combine more genes. More complex methods are (4) marker-based recurrent selection (several generations of selection on markers with random mating) and (5) selection on an index combining molecular and phenotypic scores. These methods are discussed in detail in this chapter.

Advantages of MAS

MAS can theoretically enhance breeders' selection efficiency because:
1. It can be performed on seedling material, thus reducing the time required before a plant's genotype is known. In contrast, many

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice 173
and Benefits, DOI 10.1007/978-81-322-0958-4_8, Springer India 2013

important plant traits are observable only when the plant has reached flowering or harvest maturity. Knowing a plant's genotype before flowering can be particularly useful in order to plan the appropriate crosses between selected individuals.
2. MAS is not affected by environmental conditions. Some crop production constraints (such as disease, insect pests, temperature and water stress) occur sporadically or non-uniformly. Therefore, evaluating resistance to those constraints may not be possible in a given year or location. MAS offers the chance to determine a plant's resistance level independent of environment.
3. When recessive alleles determine the trait of interest, they cannot be detected through phenotypic evaluation of heterozygous backcross plants, because their presence is masked by the dominant allele. In a traditional backcross program, plants with recessive alleles are identified by progeny evaluation after self-pollination or testcrossing to a recessive tester. This time-consuming step can be eliminated in a MAS program, because recessive alleles are identified by appropriate linked markers.
4. Gene pyramiding or combining multiple genes simultaneously: When multiple resistance genes are pyramided (or combined) together in the same variety or breeding line, the presence of each individual gene is difficult to verify phenotypically. The presence of one resistance gene may conceal the effect of additional genes. This problem can be overcome if markers are available for each of the resistance genes.
5. Selecting for traits with low heritability: Environmental variation in the field reduces a trait's heritability, the proportion of phenotypic variation that is due to genetics. In a low heritability situation, progress from phenotypic selection will be slow, because so much of the variation for the trait is due to environmental variation, experimental error or genotype × environment interaction, and will not be passed on to the next generation. If a reliable marker for a trait is available, MAS can result in greater progress than phenotypic selection in such a situation.
6. Elimination of unreliable phenotypic evaluation associated with field trials due to environmental effects.
7. Testing for specific traits where phenotypic evaluation is not feasible (e.g. quarantine restrictions may prevent exotic pathogens from being used for screening).
8. MAS may be cheaper and faster than conventional phenotypic assays, depending on the trait. For example, evaluating nematode resistance is usually an expensive operation because it requires artificial inoculation of plants with nematode eggs, followed by a labour-intensive technique to count the number of nematodes present. Selecting on the basis of a reliable marker would probably be cost-effective in this case. On the other hand, plant height is cheap and easy to measure, so there may not be an economic advantage in using markers for that trait, and regular conventional selection is sufficient. Economic aspects of MAS in a maize breeding program are discussed in detail in several publications. Readers are requested to refer to Dreher et al. (2003) and Morris et al. (2003). Economics will be a major driver of the application of MAS. For certain traits that are expensive or logistically difficult to evaluate, MAS is an attractive alternative. Time savings obtained through MAS may be as important as cost savings where there are competitive markets for improved cultivars. Any cost change in DNA extraction or genotyping methods or, on the other hand, in phenotypic evaluation methods will affect the relative economic benefits of MAS.
9. A consideration that may affect cost-effectiveness of MAS is that multiple markers can be evaluated using the same DNA sample. Once DNA is extracted and purified, it may be used for multiple markers, for the same or different traits, thus reducing the time and cost per marker.
10. Markers can be applied in the choice of parents in crossing programs. Here, they can

either help to maximise diversity, and in this way support the exploitation of heterosis, or they can minimise diversity, if gene complexes built up in elite inbred germplasm are to be preserved.
11. Recessive genes can be maintained without the need for progeny tests in each generation, as homozygous and heterozygous plants can be distinguished with the aid of co-dominant markers. In backcrossing, DNA markers can help to minimise linkage drag around the target gene and reduce the generations required to recover a recurrent parent's genetic background.

Limitations in MAS

MAS is not universally advantageous and cannot be applied to all the traits in all the crops. Some limitations of the technique are briefly discussed hereunder:
1. MAS may be more expensive than conventional techniques, especially for start-up expenses and labour costs. In certain situations, conventional breeding methods may be well suited to meeting the breeding objective. An important consideration for MAS, often not reported, is that while markers may be cheaper to use, there is a large initial cost in their development.
2. Recombination between the marker and the gene of interest may occur, leading to false positives. For example, if the marker and the gene of interest are separated by 5 cM and selection is based on the marker pattern, there is an approximately 5% chance of selecting the wrong plant. This is based on the general guideline that across short distances, 1 cM of genetic distance is approximately equal to 1% recombination. The breeder will need to decide the error rate that is acceptable in the MAS program, keeping in mind that errors are also usually involved in phenotypic evaluation. To reduce errors caused by recombination, it may be necessary to use flanking markers on either side of the QTL of interest to increase the probability that the desired gene is selected.
3. Sometimes, markers that were used to detect a locus must be converted to breeder-friendly markers that are more reliable and easier to use. For example, RFLP markers need to be converted to STS markers, and RAPD markers are converted to SCAR markers for more reliability.
4. Imprecise estimates of QTL locations and effects may result in slower progress than expected. Many QTLs have large confidence intervals of 20 cM or more, or their relative importance in explaining trait inheritance has been overestimated.
5. Markers developed for MAS in one population may not be transferable to other populations, either due to lack of marker polymorphism or the absence of a marker–trait association.

Prerequisites for an Efficient Marker-Assisted Selection Program

Before practising MAS, the following most important requirements should be considered in detail for implementing it successfully.

High-Throughput DNA Extraction and Marker Technology: Most breeding programs would need to screen hundreds to thousands of plants for desired marker patterns. In many cases, the results will be needed quickly to allow the breeder to make selections in a timely manner. Both of these considerations demand a simple and efficient DNA extraction system that can handle a large number of samples in a streamlined operation and low-cost, high-throughput marker technology. Many labs conducting MAS should develop a strategy that extracts DNA from small tissue samples in 96- or even 384-well plates and assays the markers tightly linked to the desired QTL within a reasonable period of time. Although DNA markers have received the most attention, other types of markers (protein, morphological, cytological) can also be used in MAS programs. For efficient MAS, important attributes of markers include ease of use, small amount of DNA required, low cost, repeatability of results, high

rate of polymorphism, occurrence throughout the genome and co-dominance. As stated earlier, co-dominance is the ability to detect both parental forms of a marker in heterozygotes. It is an advantage when heterozygous individuals are screened, such as in backcross breeding programs or in an F2 population. SSRs combine the desirable features listed above and are the current marker of choice for many crop species. SNPs require more detailed knowledge of the specific, single nucleotide DNA changes responsible for genetic variation among individuals. Only a small number of SNPs are currently available for MAS in plants, but within a few years, many more are expected to be developed and may become an important marker type for MAS.

Genetic Maps: Linkage maps provide a framework for detecting marker–trait associations and for choosing markers to employ in MAS. Once a marker is found to be associated with a trait in a given population, a dense molecular marker (or high-resolution or fine) map in a standard reference population will help identify markers that are closer to, or that flank, the target QTL.

Selection of QTLs for MAS: It is important to decide the number of QTLs selected for MAS. Theoretically, all markers that are tightly linked to QTL could be used for MAS. However, due to the cost of utilising several QTL, only markers that are tightly linked to three QTLs are typically used, although there have been reports of up to 5 QTLs being introgressed into tomato via MAS. Even selecting for a single QTL via MAS can be beneficial in plant breeding; such a QTL should account for the largest proportion of phenotypic variance for the trait. Furthermore, all QTLs selected for MAS should be stable across environments.

Knowledge of Associations and Validation Between Molecular Markers and Trait of Interest: The most crucial ingredient for MAS is knowledge of markers that are associated with the given traits. This information on marker validation might collectively come from QTL studies, bulked segregant analysis, classical mutant analysis, fine mapping, comparative mapping, map-based cloning or some other means.

Efficient Data Management System: Large numbers of samples are handled in an MAS program, with each sample potentially evaluated for multiple markers. This situation requires an efficient system for labelling, storing, retrieving and analysing large data sets, and producing reports useful to the breeder.

Procedure for a Generalised MAS Program for Selection from Breeding Lines/Populations

The simplified basic procedure (Fig. 8.1) for conducting MAS with DNA markers is as follows:
1. Extract DNA from tissue of each individual or family in a population.
2. Screen DNA samples via PCR for the molecular markers linked to the QTL.
3. Analyse the PCR products, using an appropriate separation and detection technique such as agarose gel electrophoresis.
4. Identify individuals having the desired marker allele linked to the target QTL.
5. Combine the marker results with other selection criteria (e.g. phenotypic data or other marker results), select the progenies of the population that are positive for the given marker allele and advance those individuals in the breeding program.

Markers are used for selecting qualitative as well as quantitative traits. MAS can aid selection for all target alleles that are difficult to assay phenotypically. MAS is said to be effective, cost-saving and time-saving especially in early generations, where breeders usually restrict their selection activities to highly heritable traits because visual selection for complex traits like yield is not possible with only a few plants per plot being available. To improve early-generation selection, markers should decrease the number of plants retained due to their early-generation performance, and at the same time they should ensure a high probability of retaining superior lines. Important prerequisites for successful early-generation selection with MAS are large populations and low heritability of the selected traits, as under individual selection, the relative efficiency of MAS is greatest for characters with low heritability.

Identify molecular marker linked to the trait of interest.
P1 (S) x P2 (R) (for example, R = resistant and S = susceptible to disease; R and S lines have different banding patterns)
F1
Selfing
Generation of large F2 population
Extract DNA from tissue of each individual
Marker assay for DNA samples (e.g. using PCR)
Analysis (e.g. agarose gel electrophoresis of PCR products)
Identify individuals having the desired marker allele; lines having S banding and heterozygotes are removed.
Combine the marker results with other selection criteria and advance those individuals

Fig. 8.1 Basic procedure in MAS
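The last two boxes of Fig. 8.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Plant` record, the marker calls "R", "S" and "H" (heterozygote), the phenotype scores and the cut-off of 5.0 are all hypothetical and do not come from any real data set or software.

```python
from dataclasses import dataclass

@dataclass
class Plant:
    plant_id: str
    marker_call: str        # "R", "S" or "H" (heterozygote); hypothetical scoring
    phenotype_score: float  # hypothetical field score, higher is better

def select_by_marker(plants, desired_call="R"):
    """Keep plants with the desired banding pattern; S lines and heterozygotes are dropped."""
    return [p for p in plants if p.marker_call == desired_call]

def combine_with_phenotype(plants, min_score):
    """Apply a second, phenotypic criterion to the marker-positive plants."""
    return [p for p in plants if p.phenotype_score >= min_score]

# Made-up F2 individuals for illustration only
f2_plants = [
    Plant("F2-001", "R", 7.5),
    Plant("F2-002", "H", 8.0),
    Plant("F2-003", "S", 6.0),
    Plant("F2-004", "R", 4.0),
]

marker_positive = select_by_marker(f2_plants)
advanced = combine_with_phenotype(marker_positive, min_score=5.0)
print([p.plant_id for p in advanced])
```

Keeping the marker filter and the phenotypic filter as separate functions mirrors the figure, where the marker result is obtained first and then combined with the other selection criteria.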

Marker-Assisted Backcross Breeding

Using conventional breeding methods, it typically takes 6 to 8 backcrosses to fully recover the recurrent parent genome. The theoretical proportion of the recurrent parent genome after n generations of backcrossing is given by

$$\frac{2^{n+1} - 1}{2^{n+1}}$$

(where n = number of backcrosses, assuming an infinite population size). The percentages of recurrent parent recovery after each backcross generation are presented in Table 8.1. The percentages shown in Table 8.1 are only achieved with large populations; the percentages are usually lower in the smaller population sizes that are typically used in actual plant breeding programs. Although the average percentage of the recurrent parent genome is 75% (for the entire BC1 population), some individuals possess more of the recurrent parent genome than others.

Table 8.1 Percentage of recurrent parent genome after backcrossing

Backcross generation    Percentage of recurrent parent genome
BC1                     75.0
BC2                     87.5
BC3                     93.8
BC4                     96.9
BC5                     98.4
BC6                     99.2

Therefore, if tightly linked markers flanking QTL and evenly spaced markers from other chromosomes (i.e. unlinked to QTLs) of the recurrent parent are used for selection, the introgression of QTLs and recovery of the recurrent parent may be accelerated. This process is called marker-assisted backcrossing (MABC). MABC is generally successful, except when the effect of the target gene is unstable (e.g. a QTL of low effect on a complex trait). MABC is considered the simplest form of MAS, in which the goal is to incorporate a major gene from the

Selection of 2-4 polymorphic markers per chromosome (as background markers)
Selection of 2-3 flanking markers on each side of target QTL (as recombinant markers)
Selection of tightly linked markers (for foreground markers)
Recurrent parent x Donor parent
Recurrent parent x F1
BC1F1: get 100-300 seeds
Grow the plants and genotype for chosen markers (foreground, recombinant and background selection)
Select the BC1F1 progenies based on recovery of target QTL and background markers
Recurrent parent x Selected BC1F1: get 100-300 seeds
BC2F1
Continue the same process until BC3F1
Selected BC3F1: get 100-300 seeds
Selfing the selected BC3F1
Testing BC3F2 for homozygosity at target QTL
Seed multiplication of homozygous positive progenies

Fig. 8.2 Schematic representation of marker-assisted backcross program for single QTL. Two to three QTLs can be backcrossed with the same process but larger populations are required at each generation. For more loci, conduct parallel MABC and combine the loci at the end (i.e. by crossing final BC3F1s)

donor parent into an elite cultivar or a breeding line (the recurrent parent). The use of additional markers to accelerate cultivar development is sometimes referred to as full MAS or complete line conversion. Whatever the name, the desired outcome is a line containing only the major gene from the donor parent, with the recurrent parent genotype present everywhere else in the genome. The use of markers can reduce the number of generations required to achieve the desired proportion of the recurrent parent genome. For example, if a conventional backcrossing program takes six generations to achieve more than 99% recurrent parent (Table 8.1), it takes only three backcross generations in MABC (Fig. 8.2). Under this situation, two types of selection are recognised:

1. Foreground selection, in which the breeder selects plants having the marker (i.e. the tightly linked marker or the direct marker or perfect marker to the QTL) of the donor parent at the target locus. The objective is to maintain the target locus in a heterozygous state (one donor allele and one recurrent parent allele) until the final backcross is completed. Then, the selected plants are self-pollinated and progeny plants that are homozygous for the donor allele are identified. Foreground selection is the part of MABC that is the most similar to MAS. In this case, however, one of the goals besides the selection of the target trait at each generation is to minimise the amount of linked genomic region from the donor parent that ends up being transferred along with the trait. In traditional backcrossing, the linked regions from the donor parent can cover a very large span of the chromosome on either side of the introgressed

gene even after many generations of backcrossing. This can lead to linkage drag, where deleterious traits from the donor parent are inadvertently transferred to the recipient parent along with the target trait. Ensuring the cleanest transfer of the target trait includes the following steps: (a) Several closely linked markers must be available on each side of the target trait. This is easy for transgenic traits in crops where a dense set of mapped markers is available but could be harder to achieve if the marker-trait linkage is not strong, and especially in the case of quantitative traits where the region to introgress may be quite large. (b) Enough plants are screened for the linked markers at each generation to increase the chances of recombination close to the target region. This is done typically in two successive steps: (1) In the BC1 generation, the focus is on finding the closest possible recombinations on one side of the target trait (besides ensuring that the proper alleles on the other side are still present). Enough plants are selected at this stage to still allow for background selection (see below). (2) In the BC2 generation, the same takes place for the other side of the target trait. (c) Selfing will then be needed to fix the introgressed region. That will be done at the end of the background selection process, which may take an additional generation. This selection of a very clean introgression can thus be done quickly in two generations of backcrossing. One caution is that the size of the final donor region surrounding the introgressed gene will depend on the intensity of the effort, especially in terms of the number of BC1 and BC2 plants that are screened. Enough plants need to be screened not only to find a close recombination at each step (usually markers that flank the target QTLs are used as recombinant markers) but also to have enough plants remaining for a sufficient background selection.

2. Background selection, in which the breeder selects for recurrent parent marker alleles in all genomic regions except the target locus, and the target locus is also additionally selected based on phenotype. Background selection is important in order to eliminate potentially deleterious genes introduced from the donor through linkage drag, the inheritance of unwanted donor alleles in the same genomic region as the target locus. Linkage drag was considered a difficult problem to overcome with conventional backcrossing, but now it can be addressed efficiently with the use of markers. The background selection is focused on recovering as much as possible of the genome of the recurrent parent on the chromosomes not carrying the target trait (that particular chromosome is primarily handled as part of the foreground selection). The concept is to use a set of well-spaced markers that cover all those chromosomes. At each backcross generation, the plants preselected from the foreground selection step are genotyped for this array of markers and scored for their similarity to the genome of the recurrent parent. At each generation, the plants that have recovered the most of the recurrent parent are used for the next generation of backcrossing. Plants with more than 95% recovery of the recurrent parent's genome can be obtained by the BC2 or BC3 generation depending on the intensity of the work done.

In practice, both foreground and background selections are often conducted in the same backcross program, either simultaneously or sequentially. However, the efficiency of marker-assisted backcrossing depends on a number of factors, including the population size of each backcross generation, distance of markers from the target locus and number of background markers used. Experienced MAS researchers have shown that the recurrent parent genome is recovered faster with MAS than with conventional backcrossing when foreground and background selection are combined. The recurrent parent genome is recovered more slowly on the chromosome carrying the target locus than on other chromosomes because of the difficulty in breaking linkage with the target donor allele. Refer to the further readings (particularly Neeraja et al. 2007) for methods for optimising sample sizes and selection strategies in marker-assisted selection.
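As a check on Table 8.1 and on the background scoring described above, the following Python sketch computes the theoretical recurrent parent fraction (2^(n+1) - 1)/2^(n+1) and a simple observed score for one plant. The scoring rule is an assumption made for illustration: each background marker is called "A" (homozygous for the recurrent parent allele, counted as 1) or "H" (heterozygous, counted as 0.5), and any other call is treated as missing. The function names and the example calls are hypothetical.

```python
def expected_rp_fraction(n_backcrosses):
    """Theoretical recurrent parent fraction after n backcrosses: (2^(n+1) - 1) / 2^(n+1)."""
    return (2 ** (n_backcrosses + 1) - 1) / 2 ** (n_backcrosses + 1)

def observed_rp_fraction(calls):
    """Recurrent parent fraction estimated from background marker calls of one plant."""
    valid = [c for c in calls if c in ("A", "H")]   # anything else is treated as missing
    if not valid:
        raise ValueError("no usable background marker calls")
    return sum(1.0 if c == "A" else 0.5 for c in valid) / len(valid)

# Reproduce the percentages of Table 8.1 for BC1 to BC6
table_8_1 = [round(100 * expected_rp_fraction(n), 1) for n in range(1, 7)]
print(table_8_1)

# Made-up background calls for one BC1 plant (8 markers)
plant_calls = ["A", "A", "H", "A", "H", "A", "A", "A"]
print(observed_rp_fraction(plant_calls), expected_rp_fraction(1))
```

A plant whose observed score exceeds the expectation for its generation, as in this made-up example, is the kind of individual background selection aims to carry forward.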

The procedure below describes the MABC process for a single locus:

1. Selection of markers
Two to four well-spread polymorphic markers per chromosome should be selected for background (recurrent genome) selection. Similarly, two or three flanking markers on each side of the target QTL should be selected. If the QTL is 25 cM apart from the markers, it is better to find more markers in that interval, and those additional markers should also be used to introgress the target QTL.

2. Crossing program
Start the crossing program between the recurrent parent (elite line or cultivar) and the donor parent (which contains the target QTL) to get the F1 plants. The F1 plants are then backcrossed with the recurrent parent to get 100-300 BC1F1 seeds.

3. Genotyping of BC1F1
Grow all the BC1F1 seeds and genotype them with the chosen foreground and background markers. The BC1F1 plants are selected based on (1) close recombination on one side of the target QTL (between two flanking markers) and (2) best recovery of recurrent background at noncarrier chromosomes.

4. Repeat steps 2 and 3 until 100-300 BC3F1 seeds are produced.

5. Selfing and genotyping
Self all the selected BC3F1 progenies and genotype the selfed progenies for homozygosity at the introgressed QTL. Bulk all the homozygous positive progenies, increase the seeds through selfing and make a final genotyping test before proceeding to multi-location trials for evaluation of the phenotype governed by the target QTL.

The same procedure can be followed to backcross two to three QTLs at the same time, but larger populations will be needed at each generation (e.g. for three QTLs, we may need up to 1,000 progenies). Alternatively, conduct parallel MABC for each selected QTL and combine the loci at the end by crossing the final BC3F1s.

It should also be noted that the use of markers to select for multiple QTLs is more complex, and less proven, than selection for a single gene. Population sizes required to recover individuals with all the desired marker patterns increase exponentially with the number of QTLs involved. In a backcrossing scheme, there may be little opportunity to select for the recurrent parent genome, because few individuals will have the desired marker pattern at all the target loci. If some of the genes are QTLs, whose locations and effects are often imprecisely estimated, then there is uncertainty that the results of MAS will meet expectations. Finally, the more genes undergoing selection, the greater the chances of incorporating unfavourable alleles through linkage drag. Hence, the following suggestions are proposed for selecting multiple QTLs:
1. Limit the number of QTLs undergoing selection to three or four.
2. Target only verified QTLs that have medium to large effects and that are consistently detected in several environments.
3. Examine the QTL analysis results carefully to decide which markers to select (usually both the markers that flank the selected QTL).
4. If desired, an index can be constructed that weights some markers differently than others, depending on their relative importance in terms of effect sizes (and/or contribution to the expression of phenotype).
5. When more than two QTLs are involved, consider a stepwise backcrossing procedure. For example, if four target QTLs are to be introgressed into the same genetic background, one could first conduct two parallel backcross schemes, each incorporating two target QTLs. Then, the selected individuals from each scheme are crossed and plants with all four targets identified. This procedure gives greater opportunity to conduct background selection for the recurrent parent genome than selecting for all four targets simultaneously.
6. Alternatively, F2 enrichment, backcrossing and inbreeding can be employed (Bonnett et al. 2005) to reduce the population size needed to attain selection goals.

Another important point is that MAS never replaces phenotypic selection entirely. Especially for disease resistances, a final testing of breeding lines is always required, regardless of how tightly a marker is linked to a QTL. There is no doubt that the collection and use of very

high quality phenotypic data are critical for the application of MAS. It is also concluded that it is risky to carry out selection solely on the basis of marker effects, without confirming the estimated effects by phenotypic evaluation, and further that laboratory-based breeding should remain the servant of the field breeder and not its master. Further, it has been observed that backcrossing is a very conservative breeding strategy and should not become the prime focus of a breeding program, as it hardly ever broadens the genetic basis of plants in a substantial way. To overcome the limitation of only being able to improve existing elite genotypes, other approaches like marker-assisted recurrent selection (see below) have to be considered.

Gene Pyramiding or Stacking

In many cases, the breeder's goal will not be to introgress a single trait but potentially to introgress several traits at the same time, possibly from different sources. Instead of trying to handle all those traits together in the backcrossing process, the best approach usually is to perform all those conversions into the same background individually in parallel and then to intercross the final single conversions to combine the traits together (see above). In that case, only MAS is needed at the end since the narrowing of the introgressed regions through foreground selection and the recovery of the recurrent parent through background selection have already been done for each individual trait.

The most frequent strategy of pyramiding is combining multiple resistance genes. Different resistance genes can be combined in order to develop broad-spectrum resistance to diseases and insects. Either qualitative resistance genes can be combined or quantitative resistances controlled by QTLs. An example of the combination of resistance QTLs is the pyramiding of a major stripe rust resistance gene and two QTLs in the same genotype. In order to pyramid disease or pest resistance genes that have similar phenotypic effects, and for which the matching races are often not available, MAS might even be the only practical method, especially where one gene masks the presence of other genes. For example, the Barley Yellow Mosaic Virus (BaYMV) complex is a major threat to winter barley cultivation in Europe. As the disease is caused by various strains of BaYMV and Barley Mild Mosaic Virus (BaMMV), pyramiding resistance genes seems an intelligent strategy. However, phenotypic selection cannot be carried out due to the lack of differentiating virus strains. Thus, MAS offers promising opportunities. Suitable strategies have been developed for pyramiding genes against the BaYMV complex. At the same time, pyramiding has to be repeated after each crossing, because the pyramided resistance genes are segregating in the progeny.

Accelerated Methods of Gene Pyramiding

Gene pyramiding is considered one of the best MAS methods currently available (along with marker-assisted introgression, which is complementary since its aim is slightly different). But even this method can accumulate only a couple of major genes from two parents and requires a couple of generations. If large sources of major genes were really to be unlocked, then an efficient marker-assisted gene pyramiding scheme would need to tackle multiple, possibly linked, genes from multiple parents. Methodological developments in this area are only starting and still need more work (Hospital 2003).

Marker-Assisted Recurrent Selection (MARS)

In marker-assisted recurrent selection (MARS), the breeders take advantage of favourable alleles originating from both parents involved in the crossing program. QTL alleles impacting the major traits of interest to the breeders are identified within breeding populations and accumulated through successive intercrossing using only genotypic selection. Recombined lines are then subjected to a final phenotypic screen to select the best varieties to release. This allows the

Parent 1 x Parent 2
F1
F2 (generate 300 progenies using single seed descent method)
F3
F3:4 (GENOTYPING)
F3:5 (if required)
Evaluation at multi location (PHENOTYPING)
QTL ANALYSIS
MODELLING AND SELECTION OF QTLS FOR RECOMBINE
IDENTIFY F3 DERIVED PROGENIES FOR RECOMBINE
GENOTYPE 8-16 SEEDS PER PROGENY OF F3:6 AND SELECT BEST 8 PLANTS (e.g. A-H) TO CROSS
1st recombination cycle: A x B, C x D, E x F, G x H
2nd recombination cycle: F1 x F1, F1 x F1
3rd recombination cycle: F1 x F1
F1
F2
F3
F3:4 (multi location phenotyping)

Fig. 8.3 Flow chart explaining marker-assisted recurrent selection
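The three recombination cycles of Fig. 8.3 follow a simple pairing pattern, which the Python sketch below reproduces for the eight selected plants A to H. It shows the crossing layout only: the function name is hypothetical, the number of plants is assumed to be a power of two, and the genotyping and selection of the best F1s between cycles is not modelled.

```python
def recombination_cycles(lines):
    """Pair the entries cycle by cycle until a single cross remains."""
    cycles = []
    current = list(lines)
    while len(current) > 1:
        if len(current) % 2 != 0:
            raise ValueError("each cycle needs an even number of entries")
        # Cross neighbouring entries: (1st x 2nd), (3rd x 4th), ...
        crosses = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        cycles.append(crosses)
        # The F1 of each cross becomes an entry of the next cycle
        current = [f"({a} x {b})" for a, b in crosses]
    return cycles

cycles = recombination_cycles(list("ABCDEFGH"))
for number, crosses in enumerate(cycles, start=1):
    print(f"cycle {number}:", ", ".join(f"{a} x {b}" for a, b in crosses))
```

With eight starting plants this gives four crosses in the first cycle, two in the second and one in the third, matching the layout of the figure.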

generation of progenies with an optimum combination of key alleles from both parents that could never be obtained by chance recombination alone. Thus, MARS has a clear breeding objective, as opposed to QTL discovery conducted in good x bad crosses. The concept is to identify QTL effects for polygenic traits (usually minor) that are specific to that population and to recombine them via genotypic selection to generate superior progenies for variety development. To do this, de novo QTL detection is performed with each population of interest and the best lines are recombined to obtain a progeny that performs better than either of the two parents (Fig. 8.3). In contrast to MARS, which uses de novo QTL mapping as part of its process, the use of MAS or MABC implies prior knowledge of mapping information for the targeted traits.

If one of the two parents presents a large QTL such as for a quality trait or biotic stress resistance (identified through a published report, historical data or de novo identification), such a QTL can also be included in the selection and the favourable allele fixed at an early stage of recombination. MARS can be used to select for specific traits like yield under water stress conditions,

but it should also include many other traits of interest to the breeder (such as yield under optimal conditions, maturity, disease resistance) so that the final selection of alleles to recombine can take all those factors into account and negative correlations between traits at a given locus can be identified and/or eliminated. Thus, with the use of markers, recurrent selection can be accelerated considerably. In continuous nursery programs, pre-flowering genotypic information is used for marker-assisted selection and controlled pollination. Accordingly, several selection cycles are possible within 1 year, accumulating favourable QTL alleles in the breeding population. Additionally, it is possible today to define an ideal genotype as a pattern of QTLs, all QTLs carrying favourable alleles from various parents. If individuals are crossed based on their molecular marker genotypes as in MARS, it might be possible to get close to the ideal genotype after several successive generations of crossings. It is likely that through such a MARS breeding scheme, higher genetic gain will be achieved than through MABC.

Basic Steps Involved in MARS

1. Selection of parents
MARS works best with populations that are derived from good x good crosses, that is, using parental lines that are used in a regular breeding program. Excessive segregation for traits such as maturity or height should be avoided to allow a good quality yield evaluation. It is probably a good idea to start more crosses between various parents and then to focus the MARS project on the most suitable populations.

2. Population development
MARS does not need very advanced populations, and F3-derived populations are generally sufficient. Progenies are advanced to the F3 generation through single-seed descent (single F3 plants are selfed to generate F3:4 or F3:5 progenies, depending on the amount of seed necessary for multi-location yield testing). The population size will depend on the precision of QTL mapping desired by the breeder and can range from 200 to 500. Usually, the population size is made to fit a 96-well PCR plate format so it would be a multiple of a given number (92, 94 or other) fitting in that format (if we include the parents and maybe some checks).

3. Parental and progeny genotyping
MARS does not need a large density of markers since relatively little recombination has taken place during the F3 population development. Typically, having markers covering the genome with approximately a 10 cM average distance between markers should be adequate. SSRs or SNPs can be used but SNPs will greatly facilitate the expansion to multiple MARS projects. For large-scale MARS use, it would be best to genotype the parents with a relatively high density of SNP markers (1,000-2,000) so that specific sets of SNPs polymorphic for a given MARS population can be quickly chosen. DNA samples are obtained directly from the F3 plants or from bulked F4 progenies from each F3 if more leaf material is needed or if sampling could not be done at the F3. These samples are genotyped at the polymorphic loci identified from the parental screening.

4. Phenotyping
Multi-location field trials, using replicated experimental designs, are then conducted to obtain good evaluation of the target traits (refer to Chapter 5). Accurate plant phenotyping is critical to the success of MARS. Evaluation of nontarget traits segregating in the population can also generate new useful information, including potential negative correlations with target traits.

5. Identification of QTLs
Many QTL analysis procedures are available for QTL identification for the traits of interest. Using a selection index with different weights given to various key traits is often useful for final QTL selection. Ideally, the breeder will use different models to compare the results and decide on the QTLs to recombine (refer to Chapter 6).

6. Recombination cycles
Once a set of key QTLs has been identified, a few sets of F3-derived progenies are chosen based on their complementarity for the presence of favourable alleles and on their overall phenotypic performance. Several individual plants (F4 or F5 depending on what makes the most

sense for that crop) of each progeny are grown and genotyped (nearest marker to the QTL peak, or flanking markers) to identify the best individual plants to use in the recombination crosses. An example would be to cross four pairs of progenies (8 lines), then the two pairs of resulting F1s in the second cycle, and then the final two F1s in the final cycle. At each stage, the F1s are genotyped and the best ones are used again for the next cycle of recombination. At the end of the process, the resulting lines are selfed for a few generations for fixation.

In order to ensure variability at the unselected loci for the final phenotypic evaluation, a few different independent sets of parental progenies and several progenies from the final recombination cycles will be employed. Lines can also be developed from each intermediate recombination step. The specific strategies used for the recombination process will depend on the crop (ease of crossing, number of progenies obtained per cross, cycle length, etc.), on the number of loci to recombine, and on the breeder's preference (which is again based on availability of expertise/labour, resources, etc.).

Advanced Backcross (AB)-QTL Analysis

QTL studies using populations which carry alleles of both parents at relatively high frequency (e.g. F2, BC1) are well suited for QTL mapping, but have some drawbacks when it comes to detecting and transferring useful QTLs from unadapted germplasm into elite breeding lines. Undesirable QTL alleles from the unadapted parent occur in high frequency and epistatic interactions are likely to occur, because donor alleles are present at a high frequency. Tanksley and Nelson (1996) proposed a method for simultaneously discovering valuable QTLs from unadapted germplasm (e.g. land races, wild species) and transferring them into elite breeding lines. The method is named advanced backcross QTL analysis (AB-QTL) and delays QTL analysis until the BC2 or BC3 generation. In BC1, negative selection is conducted to reduce deleterious donor alleles, while BC2 and BC3 populations are evaluated for traits of interest and genotyped using molecular markers. In this way, the identification of QTLs happens while these QTLs are transferred into an adapted genetic background. The AB-QTL method can be employed to exploit unadapted germplasm for the quantitative trait improvement of crop plants and has been applied successfully in several crop species, for example, barley, maize, rice, tomato and wheat.

Mapping-As-You-Go (MAYG)

In 2004, Podlich et al. suggested the Mapping-As-You-Go (MAYG) approach to overcome the problem of inaccurate estimation of QTLs and their effects. MAYG is a mapping-MAS strategy that accounts for the presence of epistasis and genotype by environment (G x E) interactions. The effectiveness of the MAYG approach has been investigated through simulation. In the MAYG approach, estimates of QTL allele effects are continually revised by remapping new elite germplasm generated during cycles of MAS, thus ensuring that QTL estimates remain relevant to the current set of germplasm in the breeding program. It is considered a mapping-MAS strategy that explicitly recognises that alleles of QTL for complex traits can have different values as the current breeding material changes with time. The integration of genetic mapping and MAS offers two major advantages: (1) the ability to carry out marker-trait association analysis using breeding populations directly rather than having to follow time-consuming development of genetic populations and (2) the combination of marker-trait association development and validation. This saves time, both in the process itself and in the generation of the necessary genetic materials.

Application of Markers in Germplasm Storage, Evaluation and Use

Marker-assisted germplasm evaluation is another important tool in the acquisition, storage and use of plant genetic resources, and the evaluation of

germplasm can be considerably improved with the assistance of markers. Markers can be used prior to crossing to evaluate the breeding material. Also, mixing of seed samples can be discovered using markers instead of growing plants to maturity and assessing morphological characteristics. In order to broaden the genetic base of core breeding material, germplasm of diverse genetic background for crossings with elite cultivars can be identified with the assistance of markers, and markers are on the whole a valuable tool for characterising genetic resources, delivering detailed information usable in selecting parents. The genotypic evaluation of germplasm based on molecular markers (marker-assisted germplasm evaluation, MAGE) and/or QTL analysis can be used to identify and extract superior alleles from inferior germplasm. This complements phenotypic selection. The advancements in the field of genomics have considerably contributed to increasing the use of wild relative genes, as they allow for the isolation of beneficial genes, the selection for traits which are difficult to detect based on phenotype or the screening of whole collections of wild relatives. MAS has increasingly been applied for the maintenance of recessive alleles in backcrossing pedigrees and for pyramiding resistance genes. Molecular markers can also be used for (1) differentiating cultivars and creating, maintaining and improving heterotic groups; (2) assessing collections and identifying germplasm redundancy, underrepresented alleles and genetic gaps; (3) monitoring genetic shifts that can occur during medium- or long-term storage, regeneration, domestication and breeding; (4) identifying unique germplasm; and (5) constructing core collections.

Resources for MAS on the Web

A large collection of web resources for MAS is available on the World Wide Web, and some of them are listed below:
1. As an example of current opportunities for MAS in wheat, protocols for over 20 trait-associated markers (associated with disease resistance, insect resistance and grain quality) are posted on the website MAS Wheat: Bringing Genomics to the Wheat Fields (http://maswheat.ucdavis.edu/).
2. Grafgen: Design of Precision Graphical Genotypes (http://moulon.inra.fr/~fred/programs/programs.html), a computer program developed by Frederic Hospital's group at INRA, France. Using marker data for a population, the program displays each individual's allelic composition in a graphical format as an aid to selecting desirable genotypes.
3. Molecular Plant Breeding (http://www.molecularplantbreeding.com/), an Australian-based initiative to incorporate marker-assisted strategies into plant breeding programs.
4. PLABSIM, MAS simulation software available from Matthias Frisch's website at the University of Hohenheim, Germany (http://www.uni-hohenheim.de/~frisch/).
5. Popmin (http://moulon.inra.fr/~fred/programs/programs.html), another computer program from Frederic Hospital's group at INRA, France. This program calculates optimum population sizes for marker-assisted backcrossing programs.
6. Molecular marker assisted selection as a potential tool for genetic improvement of crops, forest trees, livestock and fish in developing countries (http://www.fao.org/biotech/Conf10.htm). This site reports results of a conference sponsored by FAO's Electronic Forum on Biotechnology in Food and Agriculture.
7. Molecular marker maps that have been constructed for a wide range of crops are available at www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html.

Bibliography

Literature Cited

Beckmann JS, Soller M (1986) Restriction fragment length polymorphisms in plant genetic improvement. Oxford Surv Plant Mol Cell Biol 3:197-246
Bonnett DG, Rebetzke GJ, Spielmeyer W (2005) Strategies for efficient implementation of molecular markers in wheat breeding. Mol Breed 15:75-85
Dreher K, Khairallah M, Ribaut JM, Morris M (2003) Money matters (I): costs of field and laboratory procedures associated with conventional and marker-assisted maize breeding at CIMMYT. Mol Breed 11:221-234
186 8 Marker-Assisted Selection

Morris M, Dreher K, Ribaut JM, Khairallah M (2003) Knapp S (1998) Marker-assisted selection as a strategy
Money matters (II): costs of maize inbred line conver- for increasing the probability of selecting superior
sion schemes at CIMMYT using conventional and genotypes. Crop Sci 38:11641174
marker-assisted selection. Mol Breed 11:235247 Knight J (2003) Crop improvement: a dying breed. Nature
Tanksley SD, Nelson JC (1996) Advanced backcross 421:568570
QTL analysis: a method for the simultaneous discov- Morgante M, Salamini F (2003) From plant genomics to
ery and transfer of valuable QTLs from unadapted breeding practice. Curr Opin Biotechnol 14:214219
germplasm into elite breeding lines. Theor Appl Genet Neeraja C, Maghirang-Rodriguez R, Pamplona A, Heuer S,
92:191203 Collard B, Septiningsih E et al (2007) A marker-assisted
backcross approach for developing submergence-toler-
ant rice cultivars. Theor Appl Genet 115:767776
Peleman JD, van der Voort JR (2003) Breeding by design.
Further Readings Trends Plant Sci 8:330334
Podlich DW, Winkler CR, Cooper M (2004) Mapping as
Beavis WD (1998) QTL analysis: power, precision, and you go: an effective approach for marker-assisted
accuracy. In: Paterson AH (ed) Molecular dissection of selection of complex traits. Crop Sci 44:15601571
complex traits. CRC Press, Boca Raton, pp 145161 Ribaut JM, Hoisington D (1998) Marker-assisted selection:
Frisch M, Melchinger AE (2001) Marker-assisted back- new tools and strategies. Trends Plant Sci 3:236238
crossing for introgression of a recessive gene. Crop Smith S, Beavis W (1996) Molecular marker assisted
Sci 41:14851494 breeding in a company environment. In: Sobral BWS
Frisch M, Bohn M, Melchinger AE (1999a) Minimum (ed) The impact of plant molecular genetics.
sample size and optimal positioning of flanking mark- Birkhauser, Boston, pp 259272
ers in marker-assisted backcrossing for transfer of a Thomas WTB (2003) Prospects for molecular breeding of
target gene. Crop Sci 39:967975 barley. Ann Appl Biol 142:112
Frisch M, Bohn M, Melchinger AE (1999b) Comparison Xu Y (2003) Developing marker-assisted selection strategies
of selection strategies for marker-assisted backcross- for breeding hybrid rice. Plant Breed Rev 23:73174
ing of a gene. Crop Sci 39:12951301 Xu Y, Crouch JH (2008) Marker-assisted selection in plant
Frisch M et al (2000) PLABSIM: software for simulation breeding: from publications to practice. Crop Sci
of marker-assisted backcrossing. J Hered 91:8687 48:391407
Hospital F (2003) Marker-assisted breeding. In: Newbury Young N (1999) A cautiously optimistic vision for marker-
HJ (ed) Plant molecular breeding. Blackwell assisted breeding. Mol Breed 5:505510
Publishing/CRC Press, Oxford/Boca Raton, pp 3059 Yousef GG, Juvik JA (2001) Comparison of phenotypic
Kearsey MJ, Farquhar AGL (1998) QTL analysis in and marker-assisted selection for quantitative traits in
plants; where are we now? Heredity 80:137142 sweet corn. Crop Sci 41:645655
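As a rough illustration of the population-size question that tools such as Popmin (listed among the web resources above) address, the sketch below computes the smallest number of progeny needed so that, with a chosen probability, at least one plant carries a desired marker genotype. It is not Popmin's algorithm: it relies only on the textbook relation n >= ln(1 - p) / ln(1 - f), and the function name and the example frequencies are assumptions made for illustration.

```python
import math


def min_population_size(genotype_freq: float, success_prob: float = 0.99) -> int:
    """Smallest n with P(at least one desired plant among n) >= success_prob.

    Assumes independent progeny that each carry the desired genotype with
    frequency genotype_freq, so that P(none among n) = (1 - genotype_freq) ** n.
    """
    if not 0.0 < genotype_freq < 1.0:
        raise ValueError("genotype_freq must lie strictly between 0 and 1")
    if not 0.0 < success_prob < 1.0:
        raise ValueError("success_prob must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - genotype_freq))


# Hypothetical case 1: a backcross progeny is heterozygous at the target
# locus with frequency 1/2.
n_carrier = min_population_size(0.5)

# Hypothetical case 2: only 5% of progeny combine the target allele with the
# desired recurrent-parent alleles at the flanking markers.
n_rare = min_population_size(0.05)

print(n_carrier, n_rare)
```

The rarer the desired genotype, the faster the required population grows, which is why the choice of marker positions and the number of backcross generations both affect the cost of a marker-assisted backcrossing program.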
Success Stories in MAS
9

A tremendous number of publications have reported the identification of new QTLs in crop plants since QTL mapping was first described in tomato in 1988. However, reports on the successful application of MAS in plant breeding programs are still limited. Several papers have discussed this fact and reviewed the current status and applications of molecular markers in public and private sector breeding programs (see Further Readings). Most critical reviewers have concluded that the rate, scale and scope of uptake of genomics and MAS in crop breeding programs continually lag behind expectations. Thus, it has been repeatedly stated that the vast majority of the favourable alleles at the identified QTLs reside in publications rather than in cultivars that have been improved through the introgression or selection of such QTLs. The aim of this book, however, is to show how QTLs can be detected successfully by circumventing the challenges that limit the transfer of knowledge from QTL mapping to routine MAS in plant breeding programs. The previous chapters have addressed those approaches, and this chapter describes how they have been successfully applied in the development of new crop cultivars. Critical analysis of published reports gives the impression that MAS has great potential for the genetic improvement of crop plants, provided that its limitations are properly taken into account. Among the different MAS-based breeding strategies (refer to Chapter 8), MABC/introgression is the strategy used in most of the publications. Regarding the breeding objective, breeding for disease/pest resistance clearly dominates the publications, since such resistances are mainly controlled by major genes and detection of the underlying QTLs is more or less accurate. By contrast, only a few studies have reported the successful application of MAS for improved yield, quality traits, abiotic stress tolerance, variety detection or growth characters (see below). Another notable feature of MAS studies is that microsatellites are the predominant marker technology. Although almost all the publications come from public breeding programs, it would be incorrect to conclude that MAS is mainly conducted in public breeding programs: publishing is of little or no importance for private plant breeders, whereas it is one of the main aims of public research institutes and universities. The following sections present success stories from different crops that employed MAS; the list is not exhaustive. Due to space constraints, only a few examples are given for each crop, merely to show that MAS has been widely employed for the genetic improvement of crop plants. Please refer to the Further Readings for more examples.

Tomato

Tomato was the first crop in which both QTL mapping and MAS were demonstrated. Tanksley et al. (1981) first demonstrated marker-based selection on metric characters using isozyme markers in early generations

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_9, © Springer India 2013

of tomato lines. Lecomte et al. (2004) introgressed five QTLs controlling fruit quality in tomato from a parental line into three improved lines through a marker-assisted backcross program.

Maize

Maize was the second crop in which isozyme markers were successfully shown to be useful for the genetic improvement of yield (Stuber 1982). In another study, Yousef and Juvik (2002) showed that QTLs identified in a mapping population can exert the same effects in different genetic backgrounds and across two environments. By introgressing three marker-QTL alleles associated with enhanced seedling emergence into elite lines through marker-assisted backcrossing, they successfully enhanced this trait in sweet corn. The AB-QTL method, which can be used for the simultaneous identification and transfer of favourable QTL alleles, has been used successfully to improve yield in elite maize lines (Ho et al. 2002), and Bouchez et al. (2002) successfully introgressed favourable QTLs for grain yield into maize elite lines. As abiotic stress resistance is a complex trait, only a few successful MAS applications in breeding for such traits have been published. One example is a marker-assisted backcross experiment conducted at CIMMYT to improve grain yield in tropical maize under water-limited conditions (Ribaut and Ragot 2006). Other important examples of the successful application of MAS in maize are the use of microsatellite markers for the conversion of normal maize lines into Quality Protein Maize (QPM), which contains more lysine and tryptophan than the native lines (Babu et al. 2004), and the introgression of favourable QTLs for earliness and grain yield between maize elite lines (Bouchez et al. 2002).

Wheat

Examples of commercially released genetic material include Patwin (hard white spring wheat), the first variety developed by MAS and released by the University of California at Davis (http://www.plantsciences.ucdavis.edu/plantbreeding/main/history.htm), which contains the introgressed stripe rust resistance gene Yr17 and leaf rust resistance gene Lr37 (Helguera et al. 2003). Similarly, several other related genes (Lr1, Lr9, Lr24 and Lr47) were introgressed into common wheat cultivars by MAS (Nocente et al. 2007). Marker-assisted pyramiding of two cereal cyst nematode resistance genes from Aegilops variabilis in wheat has also been reported (Barloy et al. 2007). In wheat, DNA markers are used extensively for cereal cyst nematode (Heterodera avenae Woll.) resistance (Eagles et al. 2001). The extensive use of MAS in CIMMYT wheat breeding programs is reported elsewhere. Large wheat MAS programs have also been developed in Australia for around 20 genes or chromosome regions used in cultivar development. During the last few years, remarkable progress in implementing MAS strategies for cultivar development has been achieved by the MAS Wheat Consortium in the United States, including the completion of 80 MAS projects (visit the consortium website for more detail).

Rice

Ashikari et al. (2005) provide a good example of a successful gene pyramiding experiment. First, the introgression of one QTL for grain number and one QTL for plant height separately into the same genetic background improved both traits. Second, the lines generated by pyramiding both QTLs in the same genetic background exhibited trait values slightly lower than expected from the single introgression lines, but overall, the addition of genetic loci was still beneficial and permitted improvement of the yield of a rice strain. There are many other successful examples in numerous species, including pyramiding of Xa7 and Xa21 for the improvement of resistance to bacterial blight in hybrid rice (Zhang et al. 2006). Up to now, MAS in rice breeding has mainly been used for pyramiding disease resistances, namely, to bacterial blight and blast (Narayanan et al. 2002). In 2002, two cultivars resistant to bacterial leaf blight, both selected using MAS, were released in Indonesia.

The variety Angke carries the resistance gene xa5, and Conde carries Xa7 (Bustamam et al. 2002). Several publications report introgression from wild relatives (e.g. O. glumaepatula, O. rufipogon) in order to improve yield (Liang et al. 2004). In 2006, two lines showing strong submergence tolerance were developed by introgressing a locus conferring submergence tolerance from cultivar FR13A into the variety Swarna (Xu et al. 2006). Jantaboon et al. (2011) successfully introgressed four QTLs conferring submergence tolerance and cooking quality traits to develop an ideotype using MAS. A marker-assisted backcross breeding approach was employed to incorporate the blast resistance genes Piz-5 and Pi54 from the donor lines C101A51 and Tetep into the genetic background of PRR78 to develop Pusa1602 (PRR78 + Piz5) and Pusa1603 (PRR78 + Pi54), respectively (Singh et al. 2012).

Barley

In Australia, a marker linked (0.7 cM) to the Yd2 gene for resistance to barley yellow dwarf virus was successfully used to select for resistance in a barley backcross breeding scheme (Jefferies et al. 2003). Field test data showed that BC2F2-derived lines containing the linked marker had fewer leaf symptoms and higher grain yield when infected by the virus than lines lacking the marker. Castro et al. (2003) provided an example of gene pyramiding in barley by combining a qualitative gene with QTL alleles for resistance to barley stripe rust. Preliminary results indicated that combining qualitative and quantitative resistance genes improved resistance levels in the presence of a virulent race of the pathogen.

Soybean

Soybean yields were increased by using marker-assisted backcrossing to introgress a yield QTL from a wild accession into commercial genetic backgrounds (Concibido et al. 2003). Although the yield enhancement was observed in only two of six genetic backgrounds, the study demonstrates the potential of incorporating wild alleles with the assistance of markers. In soybean, the most prominent example of MAS application in breeding is resistance to soybean cyst nematode (Heterodera glycines). Mudge et al. (1997) showed that MAS using SSR markers that flank rhg1 was 98% accurate in identifying resistant lines from a cross between Evans and PI 209332. Refer to Concibido et al. (2004) for an excellent review on MAS for cyst nematode resistance in soybean.

Varieties Released Through MAS

MAS-breeding programs have been used to produce two low-amylose rice varieties, Cadet and Jacinto (Hardin 2000), and two Indonesian rice varieties, Angke and Conde, with resistance to bacterial leaf blight (Bustamam et al. 2002). A white bean variety resistant to BGYMV and common bacterial blight, Verano (Beaver et al. 2008), a leaf rust resistant wheat variety from Argentina, Biointa 2004 (Bainotti et al. 2009), and an Australian barley variety, SloopSA, resistant to cereal cyst nematode (Barr et al. 2000) have also been released. The soybean cultivar Sheyenne, tolerant to iron deficiency-induced chlorosis and resistant to lodging, was derived from a Pioneer variety. Sheyenne was confirmed to be different from that variety with the help of markers (Helms et al. 2008). Other important examples of success in MAS are a maize variety named Sunrise, with high resistance against the western corn rootworm (Diabrotica virgifera), and a potato producing pure amylopectin, which is the first product in Germany developed by TILLING to achieve market readiness. The maize variety was developed by the German Saaten-Union; the potato was developed by German Fraunhofer researchers and is processed by the Emsland group, the largest German potato processor. As both examples originate from private breeding programs, they will most probably never appear in scientific journals (Brumlop and Finckh 2010). Nevertheless, press reports announcing MAS-breeding projects or releases of varieties that were bred with the assistance of

markers are mentioned here. In the USA, the variety Tango, carrying two QTLs for adult resistance to stripe rust, was released in 2000 (Hayes et al. 2003) and was claimed to be the first commercially released barley variety developed using MAS. However, Tango yields less than its recurrent parent and is therefore primarily seen as a genetically characterised source of resistance to barley stripe rust rather than as a variety in its own right. As a result of the South Australian Barley Improvement Program, the malting variety Sloop was improved with cereal cyst nematode resistance introgressed from the variety Chebec and released in 2002 as SloopSA (Brumlop and Finckh 2010).

Hybrids Developed Through MAS

A common application of marker-assisted backcrossing has been the introgression of transgenes into an adapted variety or line (e.g. introgression of the Bt insect resistance transgene into different genetic backgrounds in maize and cotton). It has been shown in previous chapters that the easy scenario is when the marker allele M and the QTL allele Q are always together. This is only the case if the marker actually measures the relevant polymorphism within the gene that causes the effect. Such a direct marker is very convenient, because the marker genotype directly informs us about the QTL genotype. In contrast, if indirect or linked markers are used in MAS, there is a chance of recombination between the marker and QTL alleles. These are typically markers for genes that were known to exist before they were mapped and had a large effect. Direct markers are generally much preferred to linked markers, if they are truly markers for major gene effects. Their biggest benefit is that they can even be used without trait measurement or pedigree recording. Often, the target gene can also be detected phenotypically (e.g. pest resistance conferred by the Bt gene), and markers are used to select for the recurrent parent genome. The technique has reportedly accelerated the recovery of the recipient genome by about two backcross generations, and almost all the Bt hybrids released in India were developed using this strategy. Similarly, in pearl millet (Pennisetum glaucum), the parental lines of the original hybrid (HHB 67) were improved for resistance to downy mildew (caused by Sclerospora graminicola (Sacc.) Schroet.) through MAS combined with conventional backcross breeding, leading to the release in India of a new hybrid, HHB 67-2 (Navarro et al. 2006).

MAS in Multinational Companies

Although there is very limited specific information on the successes of molecular breeding, the first commercial products of MAS are expected to be released to the market by all the major multinational breeding companies in the very near future. The first cultivar developed through MAS by Monsanto was released to the US market in 2006. Examples of patent applications related to MAS technologies are available at the Free Patents Online database (www.freepatentsonline.com). Searching a patent database with marker-assisted selection as the search term returns a list of patents related to MAS. Check for the latest updates.

Contrasting Stories

In some cases, MAS is not as efficient as expected. Most of the time, this depends on how stable the QTL effects are, which may be altered in different ways. In some cases, the QTL effect vanishes after MAS or introgression (Shen et al. 2001). One can then wonder whether the QTL was a false positive (ghost QTL) or a true positive for which the effect (expression) depended on one or several of the interactions listed below. There is also a tendency for supposedly additive QTL effects not to really sum up. Refer to Hospital (2009) for more details on the reasons for failures of MAS in crop plants.

Conclusions and Future Prospects

Marker-assisted selection has been successful for introgressing and pyramiding major-effect genes; however, many challenges remain to be resolved

before MAS can routinely provide added value for breeding very complex traits. The genetic basis of complex traits and the interaction between all related traits will become much better understood because of the rapid developments in omics studies. This will allow accurate modelling of gene networks and the development of robust simulation tools for designing target genomic ideotypes. Integration of all the state-of-the-art branches of biotechnology, physiology, biochemistry, soil science, plant breeding and genetics is the need of the hour. With the availability of such knowledge and tools, the early stages of plant breeding programs will become much more efficient through the design of knowledge-based breeding programs. However, there will be no substitute for multi-locational replicated evaluation trials for screening elite breeding lines and for the selection and validation of finished products of MAS before distribution to local breeding companies and farmers' fields.

Bibliography

Literature Cited

Babu ER, Mani VP, Gupta HS (2004) Combining high protein quality and hard endosperm traits through phenotypic and marker assisted selection in maize. In: Proceedings of the 4th international crop science congress, Brisbane
Bainotti C, Fraschina J, Salines JH, Nisi JE, Dubcovsky J, Lewis SM, Bullrich L, Vanzetti L, Cuniberti M, Campos P, Formica MB, Masiero B, Alberione E, Helguera M (2009) Registration of BIOINTA 2004 wheat. J Plant Regist 3:165–169
Barloy D, Lemoine J, Abelard P, Tanguy AM, Rivoal R, Jahier J (2007) Marker assisted pyramiding of two cereal cyst nematode resistance genes from Aegilops variabilis in wheat. Mol Breed 20:31–40
Barr AR, Jefferies SP, Warner P, Moody DB, Chalmers KJ, Langridge P (2000) Marker-assisted selection in theory and practice. In: Proceedings of the 8th international barley genetics symposium, vol I. Adelaide, Australia, pp 167–178
Beaver JS, Porch TG, Zapata M (2008) Registration of Verano white bean. J Plant Regist 2:187–189
Bouchez A, Hospital F, Causse M, Gallais A, Charcosset A (2002) Marker-assisted introgression of favorable alleles at quantitative trait loci between maize elite lines. Genetics 162:1945–1959
Bustamam M, Tabien RE, Suwarno A, Abalos MC, Kadir TS, Ona I, Bernardo M, Veracruz CM, Leung H (2002) Asian rice biotechnology network: improving popular cultivars through marker-assisted backcrossing by the NARES. Poster presented at the international rice congress, 16–20 Sept 2002, Beijing
Castro AJ et al (2003) Mapping and pyramiding of qualitative and quantitative resistance to stripe rust in barley. Theor Appl Genet 107:922–930
Concibido VC, Diers BW, Arelli PR (2004) A decade of QTL mapping for cyst nematode resistance in soybean. Crop Sci 44:1121–1131
Concibido VC et al (2003) Introgression of a quantitative trait locus for yield from Glycine soja into commercial soybean cultivars. Theor Appl Genet 106:575–582
Eagles HA, Bariana HS, Ogbonnaya FC, Rebetzke GJ, Hollamby GJ, Henry RJ, Henschke PH, Carter M (2001) Implementation of markers in Australian wheat breeding. Aust J Agric Res 52:1349–1356
Fraley R (2006) Presentation at Monsanto European investor day, 10 Nov 2006. Available at www.monsanto.com/investors/presentations.asp
Hardin B (2000) Rice breeding gets marker assists. Available at www.ars.usda.gov/is/AR/archive/dec00/rice1200.pdf. Verified 19 Nov 2012
Hayes PM, Corey AE, Mundt C, Toojinda T, Vivar H (2003) Registration of Tango barley. Crop Sci 43:729–731
Helguera M, Khan IA, Kolmer J, Lijavetzky D, Zhong-Qi L, Dubcovsky J (2003) PCR assays for the Lr37-Yr17-Sr38 cluster of rust resistance genes and their use to develop isogenic hard red spring wheat lines. Crop Sci 43:1839–1847
Helms TC, Nelson BD, Goos RJ (2008) Registration of Sheyenne soybean. J Plant Regist 2:20–20
Ho C, McCouch R, Smith E (2002) Improvement of hybrid yield by advanced backcross QTL analysis in elite maize. Theor Appl Genet 105:440–448
Jantaboon J, Siangliw M, Im-mark S, Jamboonsri W, Vanavichit A, Toojinda T (2011) Ideotypes breeding for submergence tolerance and cooking quality by MAS in rice. Field Crops Res 123(3):206–213
Jefferies SP, King BJ, Barr AR, Warner P, Logue SJ, Langridge P (2003) Marker-assisted backcross introgression of the Yd2 gene conferring resistance to barley yellow dwarf virus in barley. Plant Breed 122:52–56
Lecomte L, Duff P, Buret M, Servin B, Hospital F, Causse M (2004) Marker-assisted introgression of five QTLs controlling fruit quality traits into three tomato lines revealed interactions between QTLs and genetic backgrounds. Theor Appl Genet 109:658–668
Liang F, Deng Q, Wang Y, Xiong Y, Jin D, Li J, Wang B (2004) Molecular marker-assisted selection for yield-enhancing genes in the progeny of 9311 × O. rufipogon using SSR. Euphytica 139:159–165
Mudge J, Cregan PB, Kenworthy JP, Kenworthy WJ, Orf JH, Young ND (1997) Two microsatellite markers that flank the major soybean cyst nematode resistance locus. Crop Sci 37:1611–1615

Narayanan NN, Baisakh N, Vera Cruz CM, Gnanamanickam SS, Datta K, Datta SK (2002) Molecular breeding for the development of blast and bacterial blight resistance in rice cv. IR50. Crop Sci 42:2072–2079
Navarro RL, Warrier GS, Maslog CC (2006) Genes are gems: reporting agri-biotechnology: a sourcebook for journalists. In: International crops and research institute for the semi-arid tropics, Patancheru, Andhra Pradesh, India
Nocente F, Gazza L, Pasquini M (2007) Evaluation of leaf rust resistance genes Lr1, Lr9, Lr24, Lr47 and their introgression into common wheat cultivars by marker-assisted selection. Euphytica 155(3):329–336
Ribaut JM, Ragot M (2006) Marker-assisted selection to improve drought adaptation in maize: the backcross approach, perspectives, limitations, and alternatives. J Exp Bot 58:351–360
Shen L, Courtois B, McNally KL, Robin S, Li Z (2001) Evaluation of near-isogenic lines of rice introgressed with QTLs for root depth through marker-aided selection. Theor Appl Genet 103:75–83
Singh VK et al (2012) Incorporation of blast resistance into PRR78, an elite Basmati rice restorer line, through marker assisted backcross breeding. Field Crops Res 128:8–16
Stuber CW (1982) Improvement of yield and ear number resulting from selection at allozyme loci in a maize population. Crop Sci 22:737
Tanksley SD, Medino-Filho DH, Rick CM (1981) The effect of isozyme selection on metric characters in an interspecific backcross of tomato: basis of an early screening procedure. Theor Appl Genet 60:291–296
Xu K, Xu X, Fukao T, Canlas P, Maghirang-Rodriguez R, Heuer S, Ismail AM, Bailey-Serres J, Ronald PC, Mackill DJ (2006) Sub1A is an ethylene-response-factor-like gene that confers submergence tolerance to rice. Nature 442:705–708
Yousef GG, Juvik JA (2002) Enhancement of seedling emergence in sweet corn by marker-assisted backcrossing of beneficial QTL. Crop Sci 42:96–104
Zhang J, Li X, Jiang G, Xu Y, He Y (2006) Pyramiding of Xa7 and Xa21 for the improvement of disease resistance to bacterial blight in hybrid rice. Plant Breed 125(6):600–605

Further Readings

Anthony VM, Ferroni M (2012) Agricultural biotechnology and smallholder farmers in developing countries. Curr Opin Biotechnol 23:278–285
Ashikari M, Sakakibara H, Lin S, Yamamoto T, Takashi T, Nishimura A et al (2005) Cytokinin oxidase regulates rice grain production. Science 309:741–745
Brumlop S, Finckh MR (2010) Applications and potentials of marker assisted selection (MAS) in plant breeding. Final report of the F+E project Applications and Potentials of Smart Breeding (FKZ 350 889 0020). On behalf of the Federal Agency for Nature Conservation, December 2010. http://www.bfn.de/0502_skripten.html
Hospital F (2009) Challenges for effective marker-assisted selection in plants. Genetica 136:303–310
Ribaut JM, Hoisington D (1998) Marker assisted selection: new tools and strategies. Trends Plant Sci 3(6):236–239
Zong G, Ahong W, Lu W, Guohua L, Minghong G, Tao S, Bin H (2012) A pyramid breeding of eight grain-yield related quantitative trait loci based on marker-assistant and phenotype selection in rice (Oryza sativa L.). J Genet Genomics 39(7):335–350
Curtain Raiser to Novel MAS Platforms
10

Current Techniques in Molecular, Biochemical and Physiological Studies and Their Integration into MAS

The key goal of plant breeding programmes is the generation of elite crop plants that combine superior genes/alleles. However, the critical limitation is the lack of understanding of what most genes do in terms of the expression of the desired phenotype (e.g. pest resistance, salt tolerance and yield increase) in plants. We do know that all the agronomically important traits are quite complex. For example, in halophytes, we know that salt tolerance depends on the ability to compartmentalise ions, which in turn depends on regulation of transpiration, the tight control of leakage of ions through the root apoplast, the nature of the membranes in the leaf vacuoles, synthesis of compatible solutes such as glycine betaine and the ability to tolerate low K:Na ratios in the cytoplasm of mature cells or the ability of protein synthesis to operate at low K:Na ratios in the cells, etc. Under such conditions, how might QTL mapping be useful in increasing yield in those unfavourable environments? In order to have efficient knowledge-based MAS, it is necessary to understand the techniques that are being used to unravel the function of genes, and such knowledge should be incorporated into the QTL mapping procedure. This chapter presents the state-of-the-art techniques in molecular, biochemical and physiological studies and their potential role in MAS.

Molecular Techniques

To realise the importance of rapidly accumulating data as well as to understand the functioning of the cell at the organism level, there is a need for high-throughput molecular techniques. The studies that use such techniques are collectively called functional genomics. The term functional genomics is defined as the development and application of global or genome-wide experimental approaches to assess gene function by using the information and components provided by structural genomics. Several approaches have been used to explore the probable function of genes, as well as to monitor their expression in relation to various other genes, and they are explained below.

Expression Profiling

A major part of functional genomics is the analysis of gene expression. Knowing when and where a gene product, that is, RNA and/or protein, is expressed can give vital information about the particular gene in question. The very first step in generating a genome-wide expression profile is the preparation of expressed sequence tag (EST) profiles. ESTs are DNA

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_10, © Springer India 2013

Fig. 10.1 cDNA library construction and EST database development

sequences read from either end of complementary DNA (cDNA) molecules. Since cDNAs are prepared from mRNA, they provide information about the expressed part of the genome. Thus, EST data sets have been generated on a large scale for almost all the crop species, and they have been deposited in the NCBI (National Center for Biotechnology Information) database for ESTs (dbEST; Fig. 10.1). The large number of EST sequences, however, may not represent the number of expressed genes because many of them are redundant. For example, a total of 252,364 sequences (221,715 ESTs and 30,649 mRNA sequences) have been clustered into only 31,080 genes in rice (as of 10 September 2012). A minimally redundant set of ESTs provides a suitable substrate for a variety of high-throughput techniques used for expression analyses, such as microarrays. Such a collection of ESTs gains additional value if the ESTs come from differential screening in relation to a particular state, for example, drought or salt stress. Likewise, in the rice example above, the 28,000 full-length cDNA sequences reported could help to annotate genes accurately and provide resources for gene discovery and manipulation. Other techniques used in expression genomics include traps and the serial analysis of gene expression (see below). The technique used to obtain ESTs is referred to as cDNA library construction, and it is described in detail below.
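The redundancy in the rice example above can be checked with a few lines of arithmetic. The sketch below, offered only as an illustration, computes the average number of sequences per gene cluster from the figures quoted in the text and then shows one simple way to derive a minimally redundant set, by keeping the longest sequence in each cluster. The function names, the toy sequence identifiers and the longest-sequence rule are assumptions made for this example, not a description of how dbEST or any clustering pipeline works.

```python
def redundancy(n_sequences: int, n_clusters: int) -> float:
    """Average number of sequences that collapse into one gene cluster."""
    if n_clusters <= 0:
        raise ValueError("n_clusters must be positive")
    return n_sequences / n_clusters


def minimally_redundant_set(records):
    """Keep one representative (the longest sequence) per cluster.

    records: iterable of (sequence_id, length_bp, cluster_id) tuples.
    Returns a dict mapping cluster_id -> sequence_id of the representative.
    """
    best = {}
    for seq_id, length, cluster in records:
        if cluster not in best or length > best[cluster][1]:
            best[cluster] = (seq_id, length)
    return {cluster: seq_id for cluster, (seq_id, _) in best.items()}


# Figures quoted in the text for rice (as of 10 September 2012).
ests = 221_715
mrnas = 30_649
genes = 31_080

total = ests + mrnas
fold = redundancy(total, genes)

# Hypothetical toy records: (sequence_id, length in bp, cluster_id).
toy = [("EST_a", 410, "gene1"), ("EST_b", 655, "gene1"), ("EST_c", 380, "gene2")]
representatives = minimally_redundant_set(toy)

print(total, round(fold, 2), representatives)
```

On these figures each rice gene cluster is represented by roughly eight sequences on average, which is why a non-redundant set is preferred as the starting material for microarrays and similar expression platforms.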

cDNA Library Construction

The generation of full-length cDNA libraries is indispensable for characterising the structure and function of newly discovered genes. Several procedures for the construction of cDNA libraries are available, depending on their applications. The synthesis of cDNA libraries is a chain of enzymatic reactions, each requiring specific buffers, substances and enzymes. For most cDNA libraries, the first step is the isolation of total RNA, followed by removal of the highly abundant rRNA and tRNA components to isolate mRNA. However, in the PCR approach to cDNA synthesis, total RNA is the starting material. Usually the first strand of cDNA is synthesised by a reverse transcriptase, and this is followed by second-strand synthesis by DNA polymerase. Subsequently, the cDNAs are ligated to adaptors, which facilitate their easy integration into the vector (Fig. 10.1). Such recombinant vectors are later sequenced (see below) to characterise the nucleotide sequence of each EST. Advantage of cDNA

DNA, mitochondrial DNA, ribosomal RNA and adaptor dimers). Additionally, current library construction methods for directional cloning suffer from their reliance on methylation, a process that is often incomplete in protecting internal restriction sites and is also inefficient for cloning. To overcome these limitations, several protocols for cDNA library construction have been described that exploit the mRNA cap structure to enrich for full-length sequences. Leading technologies in this field include the oligo-capping method, CAPture, the SMART™ approach and CAP-trapper. As an example, the oligo-capping method is described in detail.

Usually, cDNA libraries constructed by many types of conventional methods have a high content of non-full-length cDNA clones. One of the reasons for this high content is that reverse transcriptase tends to stop during first-strand synthesis and fall off, leaving non-full-length cDNA. Thus, non-full-length cDNA is an unavoidable result of the use of reverse transcriptase for cDNA synthesis. In order to make a full-length cDNA library, some types of selection procedure need to be designed such as
libraries is that if the gene of interest is highly selection of cDNA that contain both ends of the
expressed in a particular tissue, there will be mRNA. For that purpose, the features which are
abundance of that mRNA, and it will be easy to characteristic to the 3-end and the 5-end of
isolate because it will be enriched in a cDNA mRNA should be used as tags.
library made from that particular tissue. A cDNA The polyA stretch is a characteristic feature
library will represent individual genes, although of the 3-end of mRNA. Conventional methods
not all the genes are represented. Further, there have been using the polyA as a sequence tag to
were no promoters or introns will be present. select the 3-end of mRNA. According to the
Thus, conventional cDNA library construction conventional methods, the first-strand cDNA is
methods suffer from several major shortcomings. usually synthesised from the oligo(dT) primer.
First, the majority of cDNA clones are not full- Because dT primer mostly hybridises at the
length, especially for mRNAs longer than 2 kb. polyA, most of the cDNA is selectively syn-
This loss of sequence is typically due to prema- thesised from the 3-end of the mRNA. Thus, the
ture termination of reverse transcription or 5- conventional methods include the selection
terminal sequence loss caused by cDNA blunt-end step for the 3-end tag of the mRNA. On the
polishing before cloning. As a result, cDNA 5 contrary, they include no step to select the 5-end
ends are significantly underrepresented in cDNA of mRNA. As a result, the largest part of the
libraries. Second, an adaptor-mediated cloning cDNA library is occupied by the cDNA which
process is still a common approach for cDNA lack the 5-end of the mRNA.
library construction (Fig. 10.1). Thus, the result- The 5-end of mRNA also has a characteristic
ing cDNA libraries can be comprised of up to structure, called the cap structure, but unfortu-
20% undesirable ligation by-products (chimeras) nately it is not a sequence tag. Unlike the polyA
and inserts of non-mRNA origin (e.g. genomic at the 3-end, it cannot be used for the hybridisation.
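
The effect of selecting only the 3′-end tag can be illustrated with a toy model. The Python
sketch below is a simplification under stated assumptions, not a laboratory protocol: the
function names, the min_tail threshold of 10 adenines and the example sequence are all
hypothetical. It assumes that oligo(dT) priming succeeds only when a polyA tail is present and
that reverse transcriptase copies from the tail towards the 5′-end until it stops at position
rt_stop (counted from the 5′-end of the mRNA).

```python
def poly_a_tail_length(mrna: str) -> int:
    """Number of consecutive A residues at the 3'-end of an mRNA sequence."""
    seq = mrna.upper()
    return len(seq) - len(seq.rstrip("A"))


def copied_region(mrna: str, rt_stop: int = 0, min_tail: int = 10) -> str:
    """Part of the mRNA represented in the first-strand cDNA (shown in mRNA sense).

    rt_stop is the 0-based position, counted from the 5'-end, at which reverse
    transcriptase stops; 0 means it reached the 5'-end. min_tail is an assumed
    minimum polyA length for oligo(dT) priming.
    """
    if poly_a_tail_length(mrna) < min_tail:
        return ""  # no 3'-end tag, so the oligo(dT) primer has nowhere to anneal
    return mrna[rt_stop:]


def is_full_length(mrna: str, rt_stop: int = 0, min_tail: int = 10) -> bool:
    """True only if the clone would contain both the 5'-end and the 3'-end."""
    return copied_region(mrna, rt_stop, min_tail) == mrna


# Hypothetical transcript: a short body followed by a 15-nucleotide polyA tail
mrna = "GCCATGGCTTAG" + "A" * 15
print(poly_a_tail_length(mrna))
print(is_full_length(mrna))             # reverse transcriptase reached the 5'-end
print(is_full_length(mrna, rt_stop=4))  # premature stop: the 5'-end is missing
```

In this model every primed clone carries the 3′-end tag, but nothing distinguishes a clone that
reached the 5′-end from one that stopped early, which is the gap that a 5′-end tag is meant to
close.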
If the 5′-end tag of the mRNA were also a sequence tag, it would be easy to use it to select
the 5′-end of mRNA. In order to overcome this difficulty, a method was introduced that places a
sequence tag at the 5′-end; it is called the oligo-capping method. This method allows us to
replace the cap structure of mRNA with a synthetic oligonucleotide enzymatically. Each mRNA
product of oligo-capping contains sequence tags at both ends: the polyA at the 3′-end and the
cap-replaced oligo at the 5′-end. Thus, with oligo-capped mRNA as a starting material, a system
can be developed to selectively clone the cDNA which contains both of the sequence tags at the
respective ends.

Differential Display and Representational Difference Analysis

A large number of PCR-based methods have been developed for analysing gene expression. The
sensitivity of PCR makes it especially useful in analysing rare transcripts that cannot be
analysed by Northern blotting techniques. For known sequences, quantitative PCR is used to
analyse relative levels of gene expression in different tissues or after different treatments.
Various PCR-based methods have been developed to identify and isolate differentially expressed
genes. Two of the most commonly used procedures are representational difference analysis (RDA)
and differential display. RDA is used to select for genes expressed in only one mRNA population
(the tester mRNA) compared to a second mRNA population (the driver). After cDNA synthesis and
amplification of both populations, adapters are ligated only to the tester cDNA population
(T-adapters). The tester and driver are mixed, denatured and hybridised so that common
sequences between the populations form tester-driver hybrids. Because of the excess of driver
in the hybridisation mix, only tester-specific sequences form tester-tester molecules. These
are amplified using T-adapter-specific primers and used for further studies. RDA thus
identifies a set of tissue- or treatment-specific cDNAs.

Differential display uses an arbitrary primer to amplify cDNAs obtained from different mRNA
samples randomly. One primer (5′-T11NN, where NN are any two specific nucleotides) selects only
cDNAs that have the nucleotides NN immediately adjacent to the polyA tail. When PCR is carried
out using this primer in conjunction with a random 10-mer primer, the same subset of cDNAs is
selectively amplified in each sample analysed. PCR reactions from the different samples are run
side by side on sequencing gels, so that gene expression differences can be visualised as bands
present in one lane and absent in another. The bands of interest are cut out of the gel, and
the DNA is eluted, cloned, sequenced and used for further analysis. This method is useful for
analysing many different tissues or treatments at once, but a large number of different primers
are needed to survey for differences in all of the cDNAs in a sample.

Subtractive Hybridisation

Subtractive hybridisation is a popular technique for gene discovery from non-model organisms
without an annotated genome sequence. Subtractive hybridisation techniques are valuable tools
for identifying differentially regulated genes important for cellular growth and
differentiation. Over the last decade, numerous subtractive hybridisation techniques have been
developed and used to isolate significant genes in many systems. The simple suppression
subtractive hybridisation (SSH; see below) is a widely used method for separating DNA molecules
that distinguish two closely related DNA samples. Two of the main SSH applications are cDNA
subtraction and genomic DNA subtraction. It is based primarily on a suppression polymerase
chain reaction (PCR) technique and combines normalisation and subtraction in a single
procedure. The normalisation step equalises the abundance of DNA fragments within the target
population, and the subtraction step excludes sequences that are common to the populations
being compared. This dramatically increases the probability of obtaining low-abundance
differentially expressed cDNAs or genomic DNA
fragments and simplifies analysis of the subtracted library. The SSH technique is applicable to
many comparative and functional genetic studies for the identification of disease,
developmental, tissue-specific or other differentially expressed genes (e.g. diseased vs.
normal tissues, drought-stressed vs. irrigated plant cells). As shown in many examples, the SSH
technique may result in over 1,000-fold enrichment for rare sequences in a single round of
subtractive hybridisation.

SSH has been shown to be an efficient technique for identifying and characterising differences
between two populations of nucleic acids. For example, it detects differences between the RNA
in different cells, tissues, organisms or sexes under normal conditions, or during different
growth phases, after various treatments (e.g. hormone application, heat shock) or in diseased
(or mutant) versus healthy (or wild-type) cells. Subtractive hybridisation also detects DNA
differences between different genomes or between cell types where deletions or certain types of
genomic rearrangements have occurred.

Subtractive hybridisation requires two populations of nucleic acids: the tester (or tracer)
contains the target nucleic acid (the DNA or RNA differences that one wants to identify), and
the driver lacks the target sequences. The two populations are hybridised with a driver to
tester ratio of at least 10:1. Because of the large excess of driver molecules, tester
sequences are more likely to form driver-tester hybrids than double-stranded tester. Only the
sequences in common between the tester and the driver hybridise, however, leaving the remaining
tester sequences either single-stranded or forming tester-tester pairs. The driver-tester
hybrids, double-stranded driver and any single-stranded driver molecules are subsequently
removed (the subtractive step), leaving only tester molecules enriched for sequences not found
in the driver. Usually multiple rounds of subtractive hybridisation are necessary to identify
truly tester-specific nucleic acid sequences.

There are five basic steps to subtractive hybridisation: (1) choosing material for isolating
tester and driver nucleic acids, (2) producing tester and driver, (3) hybridising, (4) removing
driver-tester hybrids and excess driver (subtraction) and (5) isolating the complete sequence
of the remaining target nucleic acid. Variations are possible at each step, and the materials
used and methods chosen depend on the desired results. When choosing appropriate sources for
driver and tester, it must be kept in mind that the less complex the source of tester and
driver and the more sequences they have in common, the easier it is to isolate specific target
sequence differences. For example, it is easier to identify RNA differences between cell types
than it is to identify differences between tissues because fewer genes are expressed in single
cells.

1. Preparation of Driver and Tester
   In principle, both tester and driver samples can be either DNA or RNA, but it is often most
   practical for the tester to be DNA (because the tester is present in a low concentration,
   and DNA is more stable than RNA) and for the driver to be RNA (after hybridisation, excess
   driver RNA can be eliminated enzymatically or by alkali degradation). In the basic
   subtractive hybridisation protocol, RNA from the tester source is reverse transcribed into
   complementary DNA (cDNA) and hybridised to poly(A)+ driver RNA. The tester-driver hybrids
   are removed, excess fresh driver is added, and the hybridisation is repeated once. The
   remaining target cDNA is either cloned or used to make a probe. This basic procedure is
   useful if the starting material is not very complex and is easy to isolate. If little
   starting tissue is available or if the starting material is complex, multiple rounds of
   hybridisation-subtraction are needed, and it is necessary to use a library- or a PCR-based
   technique. Tester and driver are prepared from cDNA libraries as phagemids or as library
   inserts amplified by PCR or in vitro transcription. Alternatively, cDNA from tester and
   driver sources is ligated to different primers, amplified by PCR and hybridised. The steps
   are repeated as needed.
2. Hybridisation
   When single-stranded nucleic acids are hybridised to each other, more abundant sequences
   anneal more rapidly because they encounter each other more frequently. During
   subtractive hybridisation, the hybridisation step is driven by the excess driver sequences,
   so tester sequences that have complementary sequences in the driver population rapidly form
   driver-tester hybrids, whereas sequences unique to the tester population remain
   single-stranded or form tester-tester pairs more slowly. Rare sequences from either
   population take longer to pair up than abundant sequences. The ratio of driver to tester,
   the overall concentration of driver, the temperature and the length of hybridisation should
   be chosen based on the complexity of the driver and tester, the abundance class of the
   target nucleic acids and the length of the driver and tester sequences used.
   2.1 Subtraction
      The purpose of the subtraction step is to remove driver-tester hybrids formed during the
      hybridisation step, leaving behind tester enriched for the target sequences. Many
      different methods are used for subtraction, depending on the nature of the driver and
      the tester. A few possibilities are mentioned. Hydroxyapatite chromatography is used to
      bind double-stranded driver and driver-tester hybrids, leaving single-stranded nucleic
      acids behind. This is a good choice if the driver is RNA because single-stranded RNA can
      be removed chemically or enzymatically, leaving only single-stranded cDNA tester after
      the subtraction.
      If the tester is a single-stranded phagemid library and the driver is first-strand cDNA,
      after hybridisation, the double-stranded driver-tester hybrids can be digested with a
      frequent-cutting restriction enzyme and the hybridisation mixture used to infect
      bacteria. Only the single-stranded tester phagemids infect, and they can thus be
      isolated. A common procedure is to use biotin-streptavidin binding to separate nucleic
      acids. Streptavidin binds to biotinylated driver sequences, and phenol extraction is
      used to remove the streptavidin protein and the bound driver and driver-tester hybrids.
      Streptavidin can also be attached to beads or to a column and used to remove excess
      driver and driver-tester hybrids.
      The effectiveness of the subtraction is monitored by using radiolabelled tester and
      determining whether the levels of single-stranded tester decrease after subtraction.
      Alternatively, enrichment for target sequences is monitored. If there are known genes
      common to the driver and tester and one or more specific to the tester, it can be
      determined, after each round of hybridisation and subtraction, whether the
      tester-specific gene is becoming more abundant compared with the common genes.
   2.2 Isolation of Target Sequences
      After one or more hybridisation and subtraction steps, the resulting tester nucleic
      acids should be greatly enriched for target sequences. However, it is still possible
      that rare sequences common to both the driver and the tester remain, and in many cases,
      the sequences isolated are only partial gene sequences. The remaining tester sequences
      are isolated and analysed in a variety of ways. Tester can be made into an enriched
      library and probed with driver and tester sequences to look for tester-specific clones,
      or the tester is labelled and used to probe tester and driver libraries and to isolate
      full-length clones. It is necessary to further analyse isolated tester sequences by
      Northern blotting, in situ hybridisation or PCR methods to determine whether the
      sequences are truly tester-specific.
      Alternatives to standard subtractive hybridisation techniques may include positive
      selection (hybridisation of tester and driver is still carried out but, rather than
      removing unwanted driver-tester and driver sequences by subtraction during step 4,
      double-stranded tester sequences are positively selected for selective cloning or
      selective amplification.
      Again, various methods are employed to carry out positive selection. A simple method is
      to digest tester with a restriction enzyme producing cohesive ends while using
      sonication to shear the driver DNA randomly. After hybridisation, DNA ligase and vector
      DNA are added. Only double-stranded tester is cloned into the vector, and then it can be
      used to transform bacteria), suppression subtractive hybridisation (in this positive
      selection technique, both driver and tester are digested with a frequent-cutting
      restriction enzyme to give blunt ends. Tester is divided into two samples, which are
      ligated to different adapters, P1 and P2, and then hybridised to excess driver. Then the
      two tester populations are mixed, and additional driver is added. Hybrids formed between
      members of the two subtracted tester populations are selectively amplified by PCR using
      primers specific to P1 and P2. Molecules that have either P1 or P2 adapters at both ends
      form panhandles as the adapters hybridise to each other, and these molecules are not
      amplified by PCR; this results in the suppression).

Microarray

Microarrays are also called DNA chips or biochips. DNA chips are made of silicon, nylon or
glass on which DNA fragments are fabricated. The DNA fragments may be obtained from cDNA
clones, EST clones, genomic clones or DNA amplified from open reading frames. The size of a
single DNA chip varies from 1 to 3.24 cm². But within this small size, we can display nearly
all the genes of a crop plant.

DNA chip technologies utilise microscopic arrays (microarrays) of molecules immobilised on
solid surfaces for hybridisation analysis. Advanced arraying technologies such as
photolithography, micro-spotting and ink-jetting, coupled with sophisticated fluorescence
detection systems and bioinformatics, permit molecular data gathering at an unprecedented rate.
Mixtures of DNA or RNA isolated from biological sources are labelled enzymatically by
incorporating nucleotides bearing fluorescent reporter groups and hybridised to microarrays.
Hybridisation reactions yield heteroduplexes between individual components of the fluorescent
sample (probe) and complementary sequences (target) on the chip surface. Since each target
element or feature is chemically homogeneous and occupies a known location, the identity and
quantity of each component in the fluorescent mixture can be ascertained by measuring the
fluorescence intensity at each position on the microarray. Though the basic principles behind
DNA chips (e.g. the hybridisation of samples to immobilised DNA molecules) are conceptually
similar to those used in earlier filter-based assays (such as Southern blotting), the
precision, speed and scale afforded by DNA chip assays are unmatched and represent a major
technological advance in molecular biology.

The characteristic features of microarrays that make them highly useful in functional genomics
are:
1. Parallelism: Microarray analysis allows parallel acquisition and analysis of massive data.
   This greatly increases the speed of experimental work. It allows meaningful comparison
   between genes or gene products represented in microarrays and may eventually allow the
   analysis of the entire genome of any organism in a single reaction. Recent gene expression
   experiments in yeast are important examples of achieving this goal.
2. Miniaturisation: Microarray analysis involves miniaturisation of DNA assays, thus reducing
   time and reagent consumption.
3. Speed: Microarray analysis is highly sensitive and allows rapid data acquisition with either
   confocal scanners or cameras equipped with charge-coupled devices (CCD).
4. Multiplexing: This is a process by which multiple samples are analysed in a single assay.
   The labelling and detection methods help to analyse multiple samples on a single DNA chip.
   Multiplexing also increases the accuracy of comparative analysis by eliminating
   complicating factors such as chip to chip variation,
   discrepancies in reaction conditions and other shortcomings inherent in comparing separate
   experiments. It has already been used in expression analysis, genotyping and DNA
   resequencing.
5. Automation: Advanced manufacturing technologies permit the mass production of DNA chips,
   and automation has led to the proliferation of microarray assays by ensuring their quality,
   availability and affordability. As a result, DNA chips may eventually become like commodity
   items in the computer industry.
6. Combinatorial synthesis: Using the combinatorial synthesis strategy, a set of all 4^k
   oligonucleotides of length k nucleotides (k-mers) can be generated in 4 × k synthesis
   cycles. For example, the set of all 4-mers (4^4 = 256) can be synthesised in 4 rounds, each
   round having 4 cycles, thus making a total of 16 cycles.

Types of DNA Chips and Their Production

Two major types of DNA chips are available for DNA analysis.

Oligonucleotide-Based Chips
This type of DNA chip contains a high density of short oligonucleotide microarrays, which are
prepared by photolithography. Such arrays contain 100,000-400,000 oligonucleotides immobilised
within an area of 1.6 cm². This allows the use of targeted regions of genomic DNA for
sequencing or for a large-scale analysis of single nucleotide polymorphisms (SNPs).

DNA-Based Chips or cDNA Arrays
This type of DNA chip contains a high density of DNA microarrays, most often derived from cDNA
(they are currently made by robotically spotting a large number of PCR-amplified DNA fragments
onto glass or nylon surfaces). The hybridisation is carried out with fluorescently labelled
mRNA or its corresponding cDNA, and the hybridised duplexes are identified by colour
fluorescence detection methods. These DNA chips, thus, can be used for studying gene expression
patterns in time and space.

The above two types of microarrays can be produced by using two different approaches: synthesis
and deposition. In the synthesis approach, microarrays are prepared in a stepwise fashion by
in situ synthesis of nucleic acids from biochemical building blocks, the nucleotides. With each
round of synthesis, individual nucleotides are added to growing chains until the desired length
is achieved. In the deposition or delivery approach, on the other hand, separately prepared
samples of nucleic acids are deposited exogenously for chip fabrication. Molecules, such as
cDNA fragments, are amplified by PCR and purified; small quantities of these fragments are then
deposited onto known locations using a variety of delivery technologies. The key parameters for
evaluating both the techniques include microarray density and design, biochemical composition,
quality, cost and ease of prototyping.

Hybridisation and Detection Methods

Hybridisation of the target DNA to a microarray yields sequence information. The target DNA is
labelled and incubated with the array. If the target DNA has regions complementary to the
probes on the array, then the target DNA will hybridise with these probes. Under a fixed set of
hybridisation conditions, for example, target concentration, temperature and buffer and salt
concentration, the fraction of probes bound to targets will vary with the base composition of
the probe and the extent of the target-probe match. In general, for a given length, probes with
high GC content will hybridise more strongly than those with high AT content. Similarly, probes
matching the target will hybridise more strongly than probes with mismatches, insertions and
deletions. Various detection methods are currently available for the analysis of hybridisation
patterns on microarrays of immobilised probes. Some rely on the use of enzymes to enable
detection, while others detect hybridisation directly.

For the detection of hybridisation patterns on DNA chips, the technique of reverse dot-blot,
used earlier on the membranes, is utilised. The technique is so described because, as opposed
to dot-blots, where the target DNA is dot-blotted on the membrane and the probes are labelled,
on DNA chips the probes are anchored in the form of microarrays and the target DNA is labelled.
Once hybridisation is completed, the detection of hybridisation is achieved either with the
help of an enzyme system (enzyme-assisted detection) or directly due to radiolabelling and/or
fluorescence.

The target DNA is either nonradioactively labelled (biotin or digoxigenin labelling) or
radioactively labelled, the former requiring enzymatic detection and the latter requiring
direct detection through autoradiography, gas phase ionisation and phosphorimagers. However,
there are drawbacks with the detection methods involving radioactivity (such as low
resolution). In order to circumvent these problems, fluorochromes may be used, which also allow
direct detection due to fluorescence. This also allows multiplexing, where more than one target
DNA labelled with different fluorochromes can be used for hybridisation to the microarray on
the DNA chip. The hybridisation patterns can be scanned in this case using an automatic
scanner. These detection systems are based either on lens-based systems (epifluorescent and
confocal microscopes) or on CCD-based systems. The lens-based systems, including confocal
microscopy, allow selective detection of the surface-bound molecules, as opposed to those in
the surrounding fluid medium. However, these are not well suited to the level of
miniaturisation already achieved in DNA chip technology. Therefore, more recently CCD detection
systems have been developed to detect small quantities of array-bound molecules. In this
method, labelled target DNA is hybridised to an immobilised probe on a silicon wafer. The wafer
is then placed on the CCD surface, and a signal is generated. A fluorescence microscope fitted
with a CCD camera and a computer is used for data capturing.

Once the microarray scanners have captured the image of the microarray biochip, that image must
be rigorously analysed to determine which elements correspond to artefacts or contamination and
which correspond to actual signal. Due to the huge number of spots on the array, automatic
determinations must be made concerning issues such as background intensity, the presence of
brightly glowing dust or lint artefacts, the occurrence of donut-shaped signals rather than
solid spots and warping or irregularities in the array itself. Image analysis software (e.g.
Array Vision, Clone Tracker, ImaGene and Gene Vision) has been steadily improving to meet these
challenges.

Microarrays have a large number of applications, which will expand in future. Some of them
include:

1. DNA Sequencing by Hybridisation

The two popular methods of sequencing are Sanger's dideoxy synthetic method and Maxam and
Gilbert's degradation method (see below). Sanger's method is still used as a routine method for
DNA sequencing. However, the efficiency, cost and reliability of the above two methods were not
able to cope with the requirements of large-scale genome sequencing. Therefore, in the late
1980s, a new approach towards DNA sequencing was suggested simultaneously by four groups. The
approach was described as sequencing by hybridisation or SBH. The method involves manufacturing
sequencing DNA chips that contain a complete set of immobilised oligonucleotides of a
particular size (e.g. 8-mers) and hybridisation of the target DNA of unknown sequence (whose
sequence is to be determined) onto these DNA chips. The hybridisation patterns are then
recorded using one of the several suitable devices discussed earlier. Identification and
analysis of the overlapping oligomers that form perfect duplexes with the DNA of interest
permits reconstruction of the target DNA sequence. During the 1980s, it was believed that SBH
using microarrays carrying all the possible 65,536 octamer oligonucleotides could possibly be
used as an alternative to Sanger's dideoxy and Maxam and Gilbert's methods of sequencing.
However, this objective has not been successfully achieved, since uniform hybridisation signals
are not available for a large
number of oligonucleotides in parallel due to sequence-dependent variability in heteroduplex
formation. This leads to false positives and false negatives so that unambiguous determination
of an unknown sequence is not always possible. Further complications arise due to repeated
sequences. Consequently, the technical barriers of SBH are now obvious, and microarrays, which
were initially considered to be useful only for SBH, are now used for a variety of other
purposes.

2. Single Nucleotide Polymorphisms and Point Mutations

Restriction fragment length polymorphisms (RFLPs) and simple sequence repeats (SSRs) were the
markers of choice in the past, but these markers had some drawbacks. For instance, they need
gel-based assays and are, therefore, time consuming and expensive. Recently, single nucleotide
polymorphisms (SNPs) as biallelic genetic markers have been extensively used as the markers of
choice (refer to chapter 3). Although they have the disadvantage of being biallelic as against
SSRs, which are polyallelic, their abundance (more than 1 per 1,000 bp) makes them attractive.
Genotyping individuals using SNPs through microarrays needs only a plus/minus assay, and hence,
it permits easier automation. Further, high-density oligonucleotide arrays allow genotyping at
a large number of these biallelic loci in parallel. The approach used for this purpose relies
on the capacity to distinguish a perfect match from a single-base mismatch. A set of four
groups of oligonucleotides of known and related sequences is used, such that corresponding
oligomers in the four groups differ only at the central base. For this purpose, a tiling
strategy proposed by Affymetrix makes use of a microarray of 40,000 oligomers for resequencing
a 10 kb gene. Use of SNPs offered great promise for rapid and highly automated genotyping,
leading to rapid progress in developing high-resolution genetic maps (refer to chapter 7).
However, it was emphasised that there are also some problems with this technology, since
association of SNPs with individual traits can break due to recombination, thus making it
necessary to have many SNPs associated with a trait.

3. Functional Genomics

Microarrays for gene expression analysis provide an integrated platform for functional
genomics. Samples of mRNA from a variety of cells and tissues are used for microarray analysis
and yield information about specific changes in gene expression patterns. The mRNA samples of
interest are labelled and used for hybridisation-based microarray analysis, yielding
quantitative data on the expression of thousands of cellular genes. Parallel measurement of
transcript levels for thousands of genes is one of the most widespread uses of DNA chip
technology. Both oligonucleotide and cDNA microarrays are very useful for estimating levels of
transcripts.

4. Reverse Genetics

DNA chips can also be used for characterisation of mutant populations exposed to various
selection pressures, to collect information about the fitness value of a variety of alleles for
each of the large number of genes in a species. This is done particularly in organisms where
the complete sequence of the genome is already available, by studying the impact of
deletions/insertions followed by analysis of their fitness (such an approach, where we start a
study with the DNA sequence and conclude it with the analysis of phenotype, is described as
reverse genetics). This can be achieved if the mutants are first subjected to a selection
pressure and then characterised. This can be illustrated using the example of yeast, where the
genome has been completely sequenced and was shown to carry 6,000 open reading frames (ORFs).
Unique molecular sequences or bar codes can be introduced in each of the above 6,000 ORFs in
the yeast genome. A mixture of yeast strains containing individual bar codes for all 6,000
genes is then subjected to a selection pressure. Samples of cells are taken, and bar code
sequences are
labelled using multiplex PCR with fluorescent primers. A pool of fluorescent amplicons is then
hybridised to an oligonucleotide microarray containing sequences complementary to each of the
amplified bar codes, and after detection of fluorescent signals, an estimate of fitness of each
strain under a given selection pressure can be worked out. In species where the genome sequence is
not yet fully determined, ESTs can be used to identify mutants. Hybridisation of PCR amplicons
(derived from lines carrying insertion elements) to a microarray of ESTs can be used to identify
mutant lines.

5. Diagnostics and Genetic Mapping

DNA chips are also being used for diagnostics. Since some information about the alleles belonging
to genes responsible for a number of diseases is available, the search can be focused on a
restricted number of polymorphisms, thus reducing the required number of features on a DNA chip.
For instance, human diagnostic chips have been prepared to detect mutant alleles in CFTR (cystic
fibrosis), BRCA1 (cancer susceptibility gene) and beta globin genes. For CFTR, one microarray
containing 428 features was designed to detect mutations in exon 11 of CFTR, and another
microarray containing 1,480 features was designed for detection of known deletions, insertions or
base substitutions. Hybridisation of genomic DNA samples from CFTR patients with already
characterised mutations to diagnostic chips for CFTR gave expected results. Similarly, genotyping
of patients with uncharacterised mutations by microarrays could be confirmed by techniques of RFLP
and PCR. These results confirmed the utility of microarrays in diagnostics. DNA chip technology
was also successfully applied to the genotyping of hepatitis virus in blood samples.

6. Genomic Mismatch Scanning

Genomic mismatch scanning (GMS) is a hybridisation-based method for linkage analysis. Homologous
segments are identified by the formation of heteroduplexes that are free of any mismatches.
Fragments of chromosomal DNA representing inherited regions are hybridised to a microarray of
ordered genomic clones, and positive hybridisation signals pinpoint regions of identity by descent
at high resolution. The mapped PCR products could be used to prepare a microarray of physical
fragments and can also be used for detecting meiotic recombination breakpoints. GMS is only one
example of the use of gene microarrays to characterise the composition of a nucleic acid mixture
subjected to in vitro selection. Restriction endonuclease protection, selection and amplification
(REPSA) is another example of a selection method that could be adapted to DNA microarray-based
detection. REPSA makes use of a combination of restriction enzyme cleavage, PCR amplification and
filter binding to selectively identify DNA sequences bound by DNA-binding proteins.

7. DNA Chips and Agriculture

DNA chips with ESTs can also be used to collect data on expression in an agricultural crop under
different conditions. This information can prove to be of practical utility in agricultural
biotechnology. For instance, if the response of gene expression to a hormone is known, the action
of that hormone can be monitored. Transgenic plants can also be rapidly analysed using
microarrays, and expression patterns under environmental conditions can be predicted at the gene
level. The action of a herbicide can be similarly determined and a decision taken on its
application. DNA microarrays are also being extensively used for the study of DNA polymorphism
(e.g. SNPs) to develop molecular markers tagged to specific economic traits (see above). The
molecular markers thus developed can be used in diagnostics and for actual molecular marker-aided
selection in breeding programmes. The main advantage of DNA chips for developing molecular markers
is the simultaneous analysis of thousands of polymorphisms in a single experiment. This will of
course require a cost-effective microarray technology.
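The tiling idea behind SNP genotyping on chips (four oligomers per interrogated position that differ only at the central base, with the perfect match giving the strongest signal) can be sketched in a few lines. This is an illustration only: the 25-base probe length, the function names and the intensity values are assumptions made for the example and are not taken from the text or from any vendor's design.

```python
BASES = "ACGT"

def tiling_probes(reference, probe_len=25):
    """Return (position, central_base, probe) for each interrogable position.

    For every position that has a full window around it, four probes are
    built that are identical except for the central base. probe_len is a
    hypothetical choice and must be odd so that a central base exists.
    """
    half = probe_len // 2
    probes = []
    for pos in range(half, len(reference) - half):
        window = reference[pos - half: pos + half + 1]
        for base in BASES:
            probe = window[:half] + base + window[half + 1:]
            probes.append((pos, base, probe))
    return probes

def call_base(signals):
    """Pick the base whose probe hybridised most strongly.

    signals maps each central base to a (hypothetical) intensity; the
    perfect-match probe is expected to be the brightest of the four.
    """
    return max(signals, key=signals.get)

if __name__ == "__main__":
    gene = "ACGT" * 2500            # a made-up 10 kb reference
    print(len(tiling_probes(gene)))  # number of probes needed to tile it
    print(call_base({"A": 0.08, "C": 0.91, "G": 0.12, "T": 0.10}))
```

With these assumptions a 10 kb reference needs 4 x (10,000 - 24) = 39,904 probes, which is in line with the figure of about 40,000 oligomers quoted earlier for resequencing a 10 kb gene.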
The current excitement and activity in this technique suggests that the complete microarray system
will soon be available at an affordable price.

Functional analysis, through parallel expression monitoring, should help researchers better
understand the fundamental mechanisms that underlie plant growth and development. By accumulating
databases of expression information as a function of tissue type, developmental stage, hormone and
herbicide treatment, genetic background and environmental condition, it should be possible to
identify the genes involved in many aspects of plant biology. Microarray analysis provides a way
to link genomic sequence information and functional analysis. Several specific research areas will
be of significant commercial interest. Because of the central role of plant hormones in plant
growth and development, microarray-based gene expression analysis of plant hormone action will be
an important commercial project. The interplay of genes and the environment is also of particular
importance in plants and will constitute another area of research interest. Microarrays will
assist plant biotechnology companies by allowing rapid analysis of transgenic plants. These data
will permit genome-wide correlations between expression patterns and a host of desirable traits
such as fertility, seed set, yield and resistance to environmental stress and insects. It may
ultimately be possible to reduce the need for costly field trials by chip-based analysis of
transgenic lines. The use of microarray technology to understand the effect of small molecules on
gene expression might serve to speed the discovery of herbicides and elucidate their mechanism of
action.

8. Proteomics

Proteomics, the protein-level counterpart of genomics, includes the study of protein-protein
interactions. DNA chips can also be used for this area of study. Protein linkage maps can also be
created using genomic sequence information. Protein-protein interactions can be studied using the
yeast two-hybrid system. In this system, two fusion proteins are used for the activation of
transcription of a reporter gene in yeast. The first fusion protein contains a DNA-binding domain
fused to one protein of interest, and the second contains a transcriptional activation domain
fused to a second protein of interest. Specific interaction between the two chimeric proteins
leads to transcriptional activation of the reporter genes, which can be easily scored with
colour-based assays. The identity of the two proteins of interest is confirmed by sequence
analysis of each clone thus identified. Therefore, major sequencing work is involved in the above
two-hybrid system. As an alternative to the DNA sequencing needed in two-hybrid analysis, DNA chip
arrays can be used to identify the genes involved in protein-protein interactions. In cases where
entire genome sequences are available, DNA chips can be used in parallel resequencing so that
clones involved in the two-hybrid system can be identified through a single hybridisation to
genomic chips. Phage display libraries can also be used for DNA chip-based detection systems. This
involves the use of fusion proteins encoded by chimeric sequences of a phage coat protein gene and
the gene of interest.

9. Nucleic Acid Sequencing

The term DNA sequencing refers to biochemical methods for determining the order of the nucleotide
bases, adenine, guanine, cytosine and thymine, in a DNA molecule. The sequence of DNA constitutes
the heritable genetic information in nuclei, plasmids, mitochondria and chloroplasts that forms
the basis for the developmental programmes of all living organisms. Determining the DNA sequence
is therefore useful in basic research studying fundamental biological processes, as well as in
applied fields such as diagnostic or forensic research, genetic mapping and MAS. The advent of DNA
sequencing has significantly accelerated biological research and discovery. The rapid speed of
sequencing attained with modern DNA sequencing technology has been instrumental in the large-scale
sequencing of plant genomes. The field of DNA sequencing technology development has a rich and
diverse history. However, the overwhelming majority of DNA sequence production to date has relied
on some version of the Sanger biochemistry.
In the late 1970s, two DNA sequencing techniques for longer DNA molecules were invented. These
were the Sanger (or dideoxy) method and the Maxam-Gilbert (chemical cleavage) method. The
Maxam-Gilbert method is based on nucleotide-specific cleavage by chemicals and is best used to
sequence oligonucleotides (short nucleotide polymers, usually smaller than 50 base pairs in
length). The Sanger method is more commonly used because it has proven technically easier to apply
and, with the advent of PCR and automation of the technique, is easily applied to long strands of
DNA including some entire genes. This technique is based on chain termination by dideoxy
nucleotides during PCR elongation reactions.

In the Sanger method, the DNA strand to be analysed is used as a template, and DNA polymerase is
used, in a PCR reaction, to generate complementary strands using primers. Four different PCR
reaction mixtures are prepared, each containing a certain percentage of dideoxynucleoside
triphosphate (ddNTP) analogues to one of the four nucleotides (dATP, dCTP, dGTP or dTTP).
Synthesis of the new DNA strand continues until one of these analogues is incorporated, at which
time the strand is prematurely truncated. Each PCR reaction will end up containing a mixture of
different lengths of DNA strands, all ending with the nucleotide that was dideoxy labelled for
that reaction. Gel electrophoresis is then used to separate the strands of the four reactions, in
four separate lanes, and determine the sequence of the original template based on what lengths of
strands end with what nucleotide.

In the automated Sanger reaction, primers are used that are labelled with four different coloured
fluorescent tags. PCR reactions, in the presence of the different dideoxy nucleotides, are
performed as described above. Next, however, the four reaction mixtures are combined and applied
to a single lane of a gel. The colour of each fragment is detected using a laser beam, and the
information is collected by a computer which generates chromatograms showing peaks for each
colour, from which the template DNA sequence can be determined. Typically, the automated
sequencing method is only accurate for sequences up to a maximum of about 700-800 bp in length.
However, it is possible to obtain full sequences of larger genes and, in fact, whole genomes,
using stepwise methods such as primer walking and shotgun sequencing.

In primer walking, a workable portion of a larger gene is sequenced using the Sanger method. New
primers are generated from a reliable segment of the sequence and used to continue sequencing the
portion of the gene that was out of range of the original reactions. Shotgun sequencing entails
randomly cutting the DNA segment of interest into more appropriate (manageable) sized fragments,
sequencing each fragment and arranging the pieces based on overlapping sequences. This technique
has been made easier by the application of computer software for arranging the overlapping pieces.

Second-Generation DNA Sequencing

Alternative strategies for DNA sequencing can be grouped into several categories. These include
(1) micro-electrophoretic methods, (2) sequencing by hybridisation, (3) real-time observation of
single molecules and (4) cyclic-array sequencing. Here, we use 'second generation' in reference to
the various implementations of cyclic-array sequencing that have recently been realised in a
commercial product (e.g. 454 sequencing (used in the 454 Genome Sequencers, Roche Applied Science;
Basel), Solexa technology (used in the Illumina (San Diego) Genome Analyser), the SOLiD platform
(Applied Biosystems; Foster City, CA, USA), the Polonator (Dover/Harvard) and the HeliScope Single
Molecule Sequencer technology (Helicos; Cambridge, MA, USA)). The concept of cyclic-array
sequencing can be summarised as the sequencing of a dense array of DNA features by iterative
cycles of enzymatic manipulation and imaging-based data collection. Although these platforms are
quite diverse in sequencing biochemistry as well as in how the array is generated, their workflows
are conceptually similar. Library preparation is accomplished by random fragmentation of DNA,
followed by in vitro ligation of common adaptor sequences.
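The step in shotgun sequencing where software arranges the pieces based on overlapping sequences can be illustrated with a toy greedy merge. This is a minimal sketch under strong assumptions: reads are error-free, all come from the same strand, and the target has no long repeats. The greedy strategy, the function names and the example fragments are illustrative choices and do not describe any particular assembly program.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are wholly contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; return the separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags)
                 if idx not in (best_i, best_j)] + [merged]
    return frags

if __name__ == "__main__":
    # Made-up fragments, as if cut at random from one short sequence.
    reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
    print(greedy_assemble(reads))
```

Real assemblers must additionally cope with sequencing errors, reads from both strands and repeated regions, which is why dedicated software is needed for genome-scale projects.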
Thus, what is common to these methods is that PCR amplicons derived from any given single library
molecule end up spatially clustered, either to a single location on a planar substrate (in situ
polonies, bridge PCR) or to the surface of micron-scale beads, which can be recovered and arrayed
(emulsion PCR). The sequencing process itself consists of alternating cycles of enzyme-driven
biochemistry and imaging-based data acquisition.

454 Pyrosequencing

The 454 system was the first next-generation sequencing platform available as a commercial
product. In this approach, libraries may be constructed by any method that gives rise to a mixture
of short, adaptor-flanked fragments. Clonal sequencing features are generated by emulsion PCR,
with amplicons captured to the surface of 28-μm beads. After breaking the emulsion, beads are
treated with denaturant to remove untethered strands and then subjected to a hybridisation-based
enrichment for amplicon-bearing beads (i.e. those that were present in an emulsion compartment
supporting a productive PCR reaction). A sequencing primer is hybridised to the universal adaptor
at the appropriate position and orientation, that is, immediately adjacent to the start of
unknown sequence.

Sequencing is performed by the pyrosequencing method. In brief, the amplicon-bearing beads are
pre-incubated with Bacillus stearothermophilus (Bst) polymerase and single-stranded binding
protein and then deposited on to a micro-fabricated array of picoliter-scale wells (with
dimensions such that only one bead will fit per well) to render this biochemistry compatible with
array-based sequencing. Smaller beads are also added, bearing immobilised enzymes which are also
required for pyrosequencing (e.g. ATP sulfurylase and luciferase). During the sequencing, one side
of the semi-ordered array functions as a flow cell for introducing and removing sequencing
reagents, whereas the other side is bonded to a fibre-optic bundle for CCD (charge coupled
device)-based signal detection. At each of several hundred cycles, a single species of unlabelled
nucleotide is introduced. On templates where this results in an incorporation event, pyrophosphate
is released. Via ATP sulfurylase and luciferase, incorporation events immediately drive the
generation of a burst of light, which is detected by the CCD as corresponding to the array
coordinates of specific wells.

In contrast with other platforms, therefore, the sequencing by synthesis must be monitored live
(i.e. the camera does not move relative to the array). Across multiple cycles (e.g.
A-G-C-T-A-G-C-T), the pattern of detected incorporation events reveals the sequence of templates
represented by individual beads. Like the HeliScope (discussed below), the sequencing is
asynchronous in that some features may get ahead or behind other features depending on their
sequence relative to the order of base addition. A major limitation of the 454 technology relates
to homopolymers (i.e. consecutive instances of the same base, such as AAA or GGG). Because there
is no terminating moiety preventing multiple consecutive incorporations at a given cycle, the
length of all homopolymers must be inferred from the signal intensity. This is prone to a greater
error rate than the discrimination of incorporation versus non-incorporation. As a consequence,
the dominant error type for the 454 platform is insertion-deletion, rather than substitution.
Relative to other next-generation platforms, the key advantage of the 454 platform is read-length.
For example, the 454 FLX instrument generates ~400,000 reads per instrument run at lengths of
200-300 bp. Currently, the per-base cost of sequencing with the 454 platform is much greater than
that of other platforms (e.g. SOLiD and Solexa), but it may be the method of choice for certain
applications where long read-lengths are critical (e.g. de novo assembly and metagenomics).

Illumina Genome Analyser

Commonly referred to as the Solexa, this platform has its origins in work by Turcatti and
colleagues and the merger of four companies: Solexa (Essex, UK), Lynx Therapeutics (Hayward, CA,
USA), Manteia Predictive Medicine (Coinsins,
Switzerland) and Illumina. Libraries can be constructed by any method that gives rise to a mixture
of adaptor-flanked fragments up to several hundred bp in length. Amplified sequencing features are
generated by bridge PCR. In this approach, both forward and reverse PCR primers are tethered to a
solid substrate by a flexible linker, such that all amplicons arising from any single template
molecule during the amplification remain immobilised and clustered to a single physical location
on an array. On the Illumina platform, the bridge PCR is somewhat unconventional in relying on
alternating cycles of extension with Bst polymerase and denaturation with formamide. The resulting
clusters each consist of ~1,000 clonal amplicons. Several million clusters can be amplified to
distinguishable locations within each of eight independent lanes that are on a single flow cell
(such that eight independent libraries can be sequenced in parallel during the same instrument
run). After cluster generation, the amplicons are rendered single stranded (linearisation) and a
sequencing primer is hybridised to a universal sequence flanking the region of interest. Each
cycle of sequence interrogation consists of single-base extension with a modified DNA polymerase
and a mixture of four nucleotides. These nucleotides are modified in two ways. They are reversible
terminators, in that a chemically cleavable moiety at the 3′ hydroxyl position allows only a
single-base incorporation to occur in each cycle, and one of four fluorescent labels, also
chemically cleavable, corresponds to the identity of each nucleotide. After single-base extension
and acquisition of images in four channels, chemical cleavage of both groups sets up for the next
cycle. Read-lengths up to 36 bp are currently routine; longer reads are possible but may incur a
higher error rate.

Read-lengths are limited by multiple factors that cause signal decay and dephasing, such as
incomplete cleavage of fluorescent labels or terminating moieties. The dominant error type is
substitution, rather than insertions or deletions (and homopolymers are certainly less of an issue
than with other platforms such as 454). Average raw error rates are on the order of 1-1.5%, but
higher accuracy bases with error rates of 0.1% or less can be identified through quality metrics
associated with each base-call. As with other systems, modifications have recently enabled
mate-paired reads, for example, each sequencing feature yielding 2 × 36 bp independent reads
derived from each end of a given library molecule several hundred bases in length.

AB SOLiD

This platform has its origins in the system described by J. Shendure and colleagues in 2005 and in
work by McKernan and colleagues at Agencourt Personal Genomics (Beverly, MA, USA), which was
acquired by Applied Biosystems (Foster City, CA, USA) in 2006. Libraries may be constructed by any
method that gives rise to a mixture of short, adaptor-flanked fragments, though much effort with
this system has been put into protocols for mate-paired tag libraries with controllable and highly
flexible distance distributions. Clonal sequencing features are generated by emulsion PCR, with
amplicons captured to the surface of 1-μm paramagnetic beads. After breaking the emulsion, beads
bearing amplification products are selectively recovered and then immobilised to a solid planar
substrate to generate a dense, disordered array. Sequencing by synthesis is driven by a DNA
ligase, rather than a polymerase. A universal primer complementary to the adaptor sequence is
hybridised to the array of amplicon-bearing beads. Each cycle of sequencing involves the ligation
of a degenerate population of fluorescently labelled octamers. The octamer mixture is structured,
in that the identity of specific position(s) within the octamer (e.g. base 5) correlates with the
identity of the fluorescent label. After ligation, images are acquired in four channels,
effectively collecting data for the same base positions across all template-bearing beads. Then,
the octamer is chemically cleaved between positions 5 and 6, removing the fluorescent label.
Progressive rounds of octamer ligation enable sequencing of every 5th base (e.g. bases 5, 10, 15,
20). Upon completing several such cycles, the extended primer is denatured to reset the system.
Subsequent iterations of this process can be
directed at a different set of positions (e.g. bases 4, 9, 14, 19) either by using a primer that
is set back one or more bases from the adaptor-insert junction or by using different mixtures of
octamers where a different position (e.g. base 2) is correlated with the label. An additional
feature of this platform involves the use of two-base encoding, which is an error-correction
scheme in which two adjacent bases, rather than a single base, are correlated with the label. Each
base position is then queried twice (once as the first base and once as the second base, in a set
of 2 bp interrogated on a given cycle) such that miscalls can be more readily identified.

A related system to the SOLiD is the Polonator, also based in part on the system developed by J.
Shendure and the Church group at Harvard. This platform also uses sequencing features generated by
emulsion PCR and sequencing by ligation. The cost of the instrument, however, is substantially
lower than that of other second-generation sequencing instruments. Additionally, the instrument is
open source and programmable, potentially enabling user innovation (e.g. the use of alternative
biochemistries). The current read-lengths, however, may be significantly limiting. An additional
disadvantage, common to 454, SOLiD and the Polonator, is that emulsion PCR can be cumbersome and
technically challenging. On the other hand, it is possible that sequencing on a high-density array
of very small (1 μm) beads (with sequencing by ligation, polymerase extension or another
biochemistry) may represent the most straightforward opportunity to achieve extremely high data
densities, simply because 1-μm beads physically exclude one another at a spacing that is on the
order of the diffraction limit. Furthermore, high-resolution ordering of 1-μm bead arrays may
enable the limit of one pixel per sequencing feature to be closely approached.

HeliScope

The Helicos sequencer, based on work by Quake's group, also relies on cyclic interrogation of a
dense array of sequencing features. However, a unique aspect of this platform is that no clonal
amplification is required. Instead, a highly sensitive fluorescence detection system is used to
directly interrogate single DNA molecules via sequencing by synthesis. Template libraries,
prepared by random fragmentation and poly-A tailing (i.e. no PCR amplification), are captured by
hybridisation to surface-tethered poly-T oligomers to yield a disordered array of primed
single-molecule sequencing templates. At each cycle, DNA polymerase and a single species of
fluorescently labelled nucleotide are added, resulting in template-dependent extension of the
surface-immobilised primer-template duplexes. After acquisition of images tiling the full array,
chemical cleavage and release of the fluorescent label permits the subsequent cycle of extension
and imaging. As described in some reports, several hundred cycles of single-base extension (i.e.
A, G, C, T, A, G, C, T) yield average read-lengths of 25 bp or greater. Notable aspects of this
system include the following. First, like the 454 platform, the sequencing is asynchronous, as
some strands will fall ahead or behind others in a sequence-dependent manner. Chance also plays a
role, as some templates may simply fail to incorporate on a given cycle despite having the
appropriate base at the next position. However, because these are single molecules, dephasing is
not an issue, and such events do not in and of themselves lead to errors. Second, no terminating
moiety is present on the labelled nucleotides. As with the 454 system, therefore, homopolymer runs
are an important issue. However, because single molecules are being sequenced, the problem can be
mitigated by limiting the rate of incorporation events. Additionally, it was noted that
consecutive incorporations of labelled nucleotide at homopolymers produced a quenching interaction
that enabled the researchers to infer the discrete number of incorporations (e.g. A vs. AA vs.
AAA). Third, the raw sequencing accuracy can be substantially improved by a two-pass strategy in
which the array of single-molecule templates (here with adaptors at both ends) is sequenced as
described above and then fully copied. As the newly synthesised strand is surface-tethered, the
original template can be removed by denaturing. Sequencing primed from the distal adaptor then
yields a second sequence for the same template,
obtained in the opposite orientation. Positions that are concordant between the two reads have
Phred-like quality scores. And finally, largely secondary to the incorporation of contaminating,
unlabelled or non-emitting bases, the dominant error type is deletion (2-7% error rate with one
pass, 0.2-1% with two passes). However, substitution error rates are substantially lower (0.01-1%
with one pass). With two passes, the per-base raw substitution error rate (approaching 0.001%) may
currently be the lowest of all the second-generation platforms.

Regarding the advantages and disadvantages of the different approaches: in terms of costs,
limitations and practical aspects of implementation, clear differences between conventional
sequencing and the second-generation platforms determine which general strategy represents the
best option for any given project. The applications of conventional sequencing (i.e. Sanger) have
grown diverse, and for small-scale projects in the kilobase-to-megabase range, this will likely
remain the technology of choice for the immediate future. This is a consequence of its greater
granularity (i.e. the ability to efficiently operate at either small or large production scales)
relative to the new technologies. Even so, it is clear that despite limitations relative to Sanger
sequencing (e.g. in terms of read-length and accuracy), large-scale projects will quickly come to
depend entirely on next-generation sequencing. As an example of the advantages of the new
platforms, consider that large-scale resequencing studies for identifying germline variation or
somatic mutations have relied on Sanger-based resequencing approaches that in turn are reliant on
one-at-a-time PCR amplification of each targeted region. In this context, the requirements of a
Sanger sequencing approach include major costs beyond just reagents. These include robotic support
of reagents, processing of multiple samples in 96- or 384-well formats, maintenance of
capillary-based sequencers, extensive bioinformatics infrastructure to handle the flow of data and
dedicated support staff to maintain complicated equipment. It is estimated that the overall cost
to conventionally sequence 100 genes from 100 samples, assuming each gene has an average of 10
exons, quoted estimates from non-commercial genome centres and commercial sequence service
providers ranged from $300,000 to over $1,000,000 (as of August 2012). Clearly, this cost is
beyond the range of most individual laboratories. In addition to reducing the per-base cost of
sequencing by several orders of magnitude, second-generation instruments have fewer infrastructure
requirements; instead, the principal challenge is downstream data management.

Microchip-Based Electrophoretic Sequencing

Significant progress has been made toward developing methods whereby conventional electrophoretic
sequencing can be carried out on a micro-fabricated device. The primary advantages of this
approach include faster processing times and substantial reductions in reagent consumption. An
ideal device for this purpose would integrate all aspects of sample processing, with microfluidic
transport of the reaction volume between steps, for example, clonal amplification by
nanoliter-scale PCR from a single cell or a single template molecule; template purification; cycle
sequencing reaction; isolation and concentration of extension fragments; and injection into a
microchannel for electrophoretic separation (potentially parallelised, e.g. with 384 or more
channels concentrically arranged around a rotating fluorescence scanner). Many of the key
challenges have already been overcome in proof-of-concept experiments. Although it is unclear at
the moment whether these efforts will be able to keep pace with cyclic-array sequencing and other
strategies, it is worth bearing in mind that the Sanger biochemistry coupled to electrophoretic
separation remains by far the best option for DNA sequencing in terms of read-length and accuracy;
we simply lack methods to parallelise it to the extent possible with cyclic-array strategies. One
could imagine that lab-on-a-chip nucleic acid analysis could supplant conventional DNA sequencing
for low-scale applications and may also prove useful in the context of point-of-care diagnostics.
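The per-base error rates quoted for the platforms in this section are easier to compare when placed on the Phred scale referred to above, on which a quality score Q corresponds to an error probability P through Q = -10 log10(P). The short sketch below performs that conversion in both directions; the function names are illustrative, and the percentages passed in are simply the error rates mentioned in the text.

```python
import math

def phred_quality(p_error):
    """Convert a per-base error probability (0 < p <= 1) to a Phred score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_probability(q):
    """Convert a Phred quality score back to an error probability."""
    return 10.0 ** (-q / 10.0)

if __name__ == "__main__":
    # Error rates mentioned in the text, expressed as percentages.
    for percent in (7.0, 1.5, 1.0, 0.1, 0.001):
        q = phred_quality(percent / 100.0)
        print(f"{percent}% error rate -> Q{q:.1f}")
```

On this scale an error rate of 1% corresponds to Q20 and 0.1% to Q30, so each tenfold drop in error rate adds ten points to the score.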
Sequencing by Hybridisation parallel interrogation with comprehensive sets of


short oligonucleotides (e.g. 4,096 6-mers or
The basic concept of sequencing by hybridisation 8,192 7-mers) followed by sequence recon-
is that the differential hybridisation of labelled struction. Recently, this basic strategy was used
nucleic acid fragments to an array of oligonucle- in the context of an array of rolling circle-
otide probes can be used to precisely identify amplified sequencing features to perform rese-
variant positions. Usually, the oligos tethered to quencing of an E. coli genome. This successful
the array are designed as a tiling representation of proof of concept is perhaps better classified as a
the reference sequence corresponding to the cyclic-array method, where serial hybridisation
genome of interest. As in the approach taken rather than polymerase-driven synthesis was used
by Affymetrix (Santa Clara, CA, USA) and for the actual sequencing.
Perlegen (Mountain View, CA, USA) (in per-
forming extensive SNP discovery in, e.g. human,
mouse and yeast), each possible single-base sub- Sequencing in Real Time
stitution is represented on the array by an inde-
pendent feature. Roche NimbleGen (Madison, Several academic groups and companies are
WI, USA), in performing sequencing by hybridi- working on technologies for ultrafast DNA
sation of microbial genomes, takes a two-tier sequencing that are substantially different from
approach, with an initial array directed at per- the current next-generation platforms. One
forming approximate localisation, and a second approach is nanopore sequencing, in which
custom array directed at pinpointing and nucleic acids are driven through a nanopore
confirmation of variant positions. Although (either a biological membrane protein such as
microarrays are clearly useful and cost effective alpha-hemolysin or a synthetic pore). Fluctuations
for genomic resequencing as well as a range of in DNA conductance through the pore, or, poten-
other genome-scale applications (see above), it is tially, the detection of interactions of individual
unclear what will happen as next-generation bases with the pore, are used to infer the nucle-
sequencing technologies begin to compete for otide sequence. Although progress has been made
many of the same applications (e.g. resequenc- in achieving early proof-of-concept demonstra-
ing, but also expression analysis, structural varia- tions with such methods, major technical chal-
tion analysis, DNA-protein binding). lenges remain along the path to a truly practical
In terms of sequencing, limitations of nanopore-based sequencing platform. Another
microarrays include the following: (1) Sequences approach involves the real-time monitoring of
that are repetitive or subject to cross hybridisa- DNA polymerase activity. Nucleotide incorpora-
tion cannot easily be interrogated; (2) it remains tions can potentially be detected through
unclear how de novo sequencing can be achieved fluorescence resonance energy transfer (FRET)
with hybridisation-based strategies; and (3) interactions between a fluorophore-bearing poly-
without very careful data analysis, false posi- merase and gamma phosphate-labelled nucle-
tives pose an important problem, and it is not otides (Visigen; Houston), or with zero-mode
clear how to obtain the equivalent of redundant waveguides (Pacific Biosciences; Menlo Park,
coverage that is possible with conventional and CA, USA), with which illumination can be
cyclic-array sequencing. Thus far, sequencing restricted to a zeptoliter-scale volume around a
by hybridisation has likely had its greatest impact surface-tethered polymerase such that incorpora-
in the context of genome-wide association tion of nucleotides (with fluorescent labels on
studies, which rely on array-based interrogation phosphate groups) can be observed with low
(i.e. genotyping by hybridisation) of a highly background. Pacific Biosciences demonstrated
defined set of discontinuous genomic coordi- substantial progress toward a working technol-
nates. A different (and earlier) take on the idea of ogy, including the potential for longer reads than
sequencing by hybridisation involves serial or Sanger sequencing, in several presentations and

publications. Although technical hurdles remain and the bar has been raised by cyclic-array methods, we are also unlikely to run out of nucleotides to sequence anytime soon.

Targeted Capture of Genomic Subsets

For genomic resequencing (i.e. sequencing for somatic or germline variation discovery in individual(s) of a species for which a reference genome is available), it is frequently the case that investigators would prefer to use finite resources to sequence a specific subset of the genome across more individuals, rather than the whole genome of fewer individuals. Examples of genomic subsets that may be highly relevant include (1) a specific megabase-scale region of the genome to which a disease phenotype has been mapped, (2) exons of specific candidate genes belonging to a disease-related pathway and (3) the full complement of protein-coding DNA sequences. These subsets generally total megabases, raising the question of how they can be efficiently isolated without resorting to hundreds or thousands of individual PCR reactions. In other words, analogous to how PCR served as an effective front-end for resequencing of kilobase-sized targets with capillary electrophoresis, there is a strong need for flexible targeting methods that are matched to the megabase-scale granularity at which the next-generation sequencing platforms operate. Fortunately, a variety of such methods have shown convincing proof-of-concept demonstrations in the past several years. These include methods that, like PCR, rely on a combination of oligonucleotide hybridisation and enzymatic activity (e.g. polymerase or ligase) to confer specificity but, unlike PCR, are more compatible with high degrees of multiplexing. For example, Ji and colleagues in 2007 described the multiplex capture of 177 exons by selective circularisation of restriction fragments. Another approach is capture by hybridisation. It has been demonstrated that 10,000-fold hybridisation-based enrichment can be achieved for sequences derived from BAC (bacterial artificial chromosome)-sized genomic regions.

Global advantages of second-generation or cyclic-array strategies, relative to Sanger sequencing, include the following: (1) In vitro construction of a sequencing library, followed by in vitro clonal amplification to generate sequencing features, circumvents several bottlenecks that restrict the parallelism of conventional sequencing (i.e. transformation of E. coli and colony picking). (2) Array-based sequencing enables a much higher degree of parallelism than conventional capillary-based sequencing. As the effective size of sequencing features can be on the order of 1 µm, hundreds of millions of sequencing reads can potentially be obtained in parallel by raster imaging of a reasonably sized surface area. (3) Because array features are immobilised to a planar surface, they can be enzymatically manipulated by a single reagent volume. Although microliter-scale reagent volumes are used in practice, these are essentially amortised over the full set of sequencing features on the array, dropping the effective reagent volume per feature to the scale of picoliters or femtoliters. Collectively, these differences translate into dramatically lower costs for DNA sequence production.

On the other hand, the advantages of second-generation DNA sequencing are currently offset by several disadvantages. The most prominent of these include read-length (for all of the new platforms, read-lengths are currently much shorter than with conventional sequencing) and raw accuracy (on average, base-calls generated by the new platforms are at least tenfold less accurate than base-calls generated by Sanger sequencing). Although these limitations create important algorithmic challenges for the immediate future, we should bear in mind that these technologies will continue to improve with respect to these parameters, much as conventional sequencing progressed gradually over three decades to reach its current level of technical performance.

There are important differences among the second-generation platforms themselves that may result in advantages with respect to specific applications. Some applications (e.g. resequencing) may be more tolerant of short read-lengths than others (e.g. de novo assembly). For applications relying on tag counting (e.g. quantification of

protein-DNA interactions), one would actually prefer a given amount of sequencing to be split into as many reads as possible (above some minimum length that allows placement to a reference). The overall accuracy as well as the specific error distributions of individual technologies (e.g. the rate of insertion-deletion vs. substitution errors, the propensity for systematic consensus errors) may also be highly relevant. Mate-paired reads, useful in de novo assembly and for mapping structural variants, for example, are now available with all of the second-generation platforms, but the extent to which the distance distribution with which the read pairs are separated can be controlled or varied may be an important factor. Finally, of course, the cost of sequencing varies greatly between the second-generation platforms, and as consumers, we hope for more competition between vendors than was the case with conventional sequencing in the past decade. Comparisons of per-base costs can be helpful but occasionally misleading, as, for example, more accurate bases may be worth more than less accurate bases.

The DNA sequence of the entire genome constitutes the ultimate objective of physical mapping (see chapter 7). It provides the most detailed description of an organism's genome and can act as a bridge between the structural and the functional phases of genomics. With the advances in sequencing strategies, including automation and the vast input of computational biology, there has been an accelerated accumulation of sequence data of many plant species (visit the NCBI website for a list of plant species that have been completely sequenced). These are significant milestones in the sequence-based era of genomic research.

Handling and Storage of Sequence Information

To date, many millions of base pairs of DNA from many species have been sequenced and deposited. For example, the chromosomes of at least 100 bacterial species, several yeasts and almost the entire human, rice and other crop chromosomes have been determined. These sequences contain an incredible amount of information. So much, in fact, that special computer programs had to be designed to help interpret just a fraction of the data. When a DNA sequence is published in a scientific journal, it is also deposited in a computer database known as GenBank. When a sequence is placed in GenBank, the known and predicted features of the sequence are also indicated. These include promoters, open reading frames and transcription factor binding sites. Just a listing of As, Cs, Gs and Ts is known as a raw sequence, and the sequence with all of the features indicated is known as an annotated sequence.

What can be learned from sequence searches? First, DNA sequence searches are more stringent than protein sequence searches. Two DNA sequences either have an adenine in the same position or they do not. Protein sequences can have the same amino acid in the same place and are, thus, identical at that position. Proteins can also have similar amino acids in one position, such as valine in one protein and alanine in the other. Because both amino acids are hydrophobic, they can frequently carry out the same functions. In this case, the proteins are said to be similar in a given position. If two proteins have similarity over a large segment of their sequences, they may have similar functions. This kind of analysis is especially useful if the function of one of the proteins has been identified. Knowing the function of one of the proteins suggests that the other protein should also be checked for this function.

More limited regions of sequence similarity or identity can indicate the presence of a cofactor binding site. An example of this is the Walker box, which is an ATP binding site. Sequence similarities can provide very valuable information about an unknown sequence and dramatically influence the direction of experiments on the novel gene or protein.

Before being sequenced, most genomes contain few genes whose locations have already been determined, which, coupled with the enormous amount of DNA in a genome and the complexities of gene structure, makes finding genes a difficult task. Computer programs have been developed to look for specific sequences in DNA that are associated with certain genes.
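As a deliberately simplified illustration of what such gene-finding programs scan for, the sketch below reports open reading frames, that is, stretches that begin with a start codon and end with a stop codon in the same reading frame. It is only a sketch under stated assumptions: the function name `find_orfs`, the `min_codons` threshold and the toy sequence are hypothetical, only the forward strand and the standard start (ATG) and stop (TAA, TAG, TGA) codons are considered, and introns, promoters and splice sites are ignored.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is the 0-based index of the A of ATG, end is the index just
    past the stop codon, and min_codons counts the start codon but not
    the stop codon. Hypothetical helper for illustration only.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):  # step codon by codon
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # close it and keep scanning
    return orfs


toy = "CCATGAAAGGGTAACC"      # made-up sequence, not real data
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])
```

On the toy sequence the only hit is the stretch ATGAAAGGGTAA in the third frame. A real gene finder would combine many such signals and would still, as noted in this section, give estimates rather than certainties.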

For example, protein-encoding genes are characterised by an open reading frame, which includes a start codon and a stop codon in the same reading frame.

Specific sequences mark the splice sites at the beginning and end of introns; other specific sequences are present in promoters immediately upstream of start codons. Still other sequences are associated with particular functions in certain classes of proteins. Computer programs have been developed that scan the DNA for these sequences and identify genes on the basis of their presence and position. Some of these programs are capable of examining databases of EST and protein sequences to see if there is evidence that a potential gene is expressed.

It is important to recognise that the programs that have been developed to identify genes on the basis of DNA sequence are not perfect. Therefore, the numbers of genes reported in most genome projects are estimates. The presence of multiple introns, alternative splicing, multiple copies of some genes and much non-coding DNA between genes makes accurate identification and counting of genes difficult.

Predicting Function from Sequence

The nucleotide sequence of a gene can be used to predict the amino acid sequence of the protein that it encodes. The protein can then be synthesised or isolated and its properties studied to determine its function. However, this biochemical approach to understanding gene function is both time consuming and expensive. A major goal of functional genomics has been to develop computational methods that allow gene function to be identified from DNA sequence alone, bypassing the laborious process of isolating and characterising individual proteins.

Homology Searches

One computational method (often the first employed) for determining gene function is to conduct a homology search, which relies on comparing DNA and protein sequences from the same and different organisms. Genes that are evolutionarily related are said to be homologous. Homologous genes found in different species that evolved from the same gene in a common ancestor are called orthologs. For example, both mouse and human genomes contain a gene that encodes the alpha subunit of haemoglobin; the mouse and human alpha-haemoglobin genes are said to be orthologs, because both genes evolved from an alpha-haemoglobin gene in a mammalian ancestor common to mice and humans. Homologous genes in the same organism (arising by duplication of a single gene in the evolutionary past) are called paralogs. Within the human genome is a gene that encodes the alpha subunit of haemoglobin and another homologous gene that encodes the beta subunit of haemoglobin.

These two genes arose because an ancestral gene underwent duplication and the resulting two genes diverged through evolutionary time, giving rise to the alpha- and beta-subunit genes; these two genes are paralogs. Homologous genes (both orthologs and paralogs) often have the same or related functions; so, after a function has been assigned to a particular gene, it can provide a clue to the function of a homologous gene.

Databases containing genes and proteins found in a wide array of organisms are available for homology searches. Powerful computer programs have been developed for scanning these databases to look for particular sequences. A commonly used homology search program is BLAST (Basic Local Alignment Search Tool). Suppose a geneticist sequences a genome and locates a gene that encodes a protein of unknown function. A homology search conducted on databases containing the DNA or protein sequences of other organisms may identify one or more orthologous sequences. If a function is known for one of these sequences, that function may provide information about the function of the newly discovered protein.

In a similar way, computer programs can search a single genome for paralogs. Eukaryotic organisms often contain families of genes that have arisen by duplication of a single gene. If a paralog is found and its function has been previously assigned, this function can provide

information about a possible function of the unknown gene. However, paralogs often evolve new functions; so information about their functions must be used cautiously. Of the genes newly identified through genomic-sequencing projects, 50% are significantly similar to orthologs and paralogs whose function has already been described. The 50% of newly identified genes that cannot be assigned a function on the basis of homology searches will undoubtedly decrease in number as functions are assigned to more and more genes and as more genomes are sequenced.

Other Sequence Comparisons Strategies

Complex proteins often contain regions that have specific shapes or functions called protein domains. For example, certain DNA-binding proteins attach to DNA in the same way; these proteins have in common a domain that provides the DNA-binding function. Each protein domain has an arrangement of amino acids common to that domain. There are probably a limited, though large, number of protein domains, which have mixed and matched through evolutionary time to yield the protein diversity seen in present-day organisms.

Many protein domains have been characterised, and their molecular functions have been determined. The sequence from a newly identified gene can be scanned against a database of known domains. If the gene sequence encodes one or more domains whose functions have been previously determined, the function of the domain can provide important information about a possible function of the new gene.

Another computational method for predicting protein function is a phylogenetic profile. In this method, the presence-and-absence pattern of a particular protein is examined across a set of organisms whose genomes have been sequenced. If two proteins are either both present or both absent in all genomes surveyed, the two proteins may be functionally related. For example, the two proteins might function as consecutive steps in a biochemical pathway. The idea is that the two proteins depend on each other and will evolve together. One protein cannot function without the other, and they will either both be present or both be absent.

To understand this concept, consider the following proteins in four bacterial species:
E. coli: protein 1, protein 2, protein 3, protein 4, protein 5, protein 6
Species A: protein 1, protein 2, protein 3, protein 6
Species B: protein 1, protein 3, protein 4, protein 6
Species C: protein 2, protein 4, protein 5

We can create a phylogenetic profile by constructing a table comparing the presence (+) or absence (−) of the proteins in the four bacterial species.

Protein      E. coli   Species A   Species B   Species C
Protein 1    +         +           +           −
Protein 2    +         +           −           +
Protein 3    +         +           +           −
Protein 4    +         −           +           +
Protein 5    +         −           −           +
Protein 6    +         +           +           −

The phylogenetic profile reveals that proteins 1, 3 and 6 are either all present or all absent in all species, so these proteins might be functionally related. Examining fusion patterns among proteins is another method for predicting functional relations; this technique is sometimes called the Rosetta Stone method. Functionally related, separate proteins in one organism sometimes exist as a single, fused protein in another organism. Thus, the presence of a fused AB protein in one species suggests that separate proteins A and B in another organism may be functionally related.

Yet another method for determining the function of an unknown gene is gene neighbour analysis. Genes that encode functionally related proteins are often closely linked in the genome (called linked genes; see chapter 4). For example, if two genes are consistently linked in the genomes of several bacteria, they might be functionally related. Functionally related genes are sometimes also linked in eukaryotes; examples are the hox genes, which play an important role in embryonic development. It is important to recognise that functions suggested by computational methods such as homology searches, phylogenetic profiling, fusion proteins and neighbour analysis do not define a protein's function; rather, these computational methods provide hints about possible functions that can be pursued through detailed analyses of the biochemistry

and cellular location of the protein. Nevertheless, these computational methods and others like them have proved to be invaluable in determining the functions of genes revealed in genomic studies.

Serial Analysis of Gene Expression (SAGE)

The genomic sequences of a wide variety of organisms were revealed during the last decade. The genomes of eukaryotic organisms are large and contain an enormous number of genes. By precisely regulating the activities of these genes, each organism can supply the required amounts of gene products at the appropriate time to carry out their functions in that organism. It is thus believed that the majority of biological phenomena found in a variety of organisms can be explained by the quantity of gene products. Although gene function is ultimately carried out by the final product, the protein, a large number of observations indicate that the amount of protein produced depends directly on the amount of mRNA that encodes it. This means that cellular functions under given conditions at a given time can be broadly understood by measuring the species and respective copy numbers of mRNAs at that point in time. However, each cell contains more than 10,000 mRNA species, with the copy number of each species ranging from less than one to more than 10,000 and, in total, up to half a million mRNA transcript copies. It was therefore practically impossible to determine them all. The only feasible tactic was to identify genes whose expression was influenced by a variety of internal or external factors. The methods used were classical differential colony (plaque) hybridisation of cDNA clones, subtractive hybridisation and the differential display method (see above). Large-scale random cDNA sequencing in EST projects was very useful for the identification of unknown genes expressed in given cells or tissues. However, this approach was not designed to quantify expressed genes, since the cDNA library to be sequenced was usually normalised to eliminate recurring transcripts derived from abundant-class mRNA sequences for the purpose of expanding the size of the gene collection. The body mapping project was a unique and direct attempt to construct gene expression profiles of a number of cells and tissues by random sequencing of a 3′-directed cDNA library. Fragments of about 300 bp from these 3′-regions were called gene signatures, and each represented a particular mRNA species. By sequencing 1,000 or so cDNA clones, investigators could obtain a rough pattern of gene expression and identify mRNAs of the highly abundant class. However, the EST and body mapping projects share an unavoidable weakness: both include an inefficient sequencing step, in which one sequencing run yields only one cDNA sequence. Mainly because of this low throughput, the profiles obtained by the body mapping project fell a long way short of what was expected and demanded. Although the more recent hybridisation-based methods (DNA microarrays) using immobilised cDNAs or oligonucleotides (see above) can potentially examine the expression patterns of a relatively large number of genes, they can only examine expressed sequences that have already been identified.

In contrast, the SAGE method allows a quantitative and simultaneous analysis of a large number of transcripts in any particular cells or tissues, without prior knowledge of the genes (Velculescu et al. 1995). Like the body mapping procedure, this method uses the 3′-portion of the mRNA as the gene tag, but in a much shorter form (9–10 bp). These tags can be serially connected before cloning into a plasmid vector. Since the resulting plasmid clones contain multiple tags, the sequences of several dozen mRNAs can be obtained in a single sequencing reaction. The rapid and cost-saving sequencing made possible by this device allows quantification and identification of a large number of cellular transcripts.

SAGE is based mainly on two principles: representation of mRNAs (cDNAs) by short sequence tags, and concatenation of these tags for cloning to allow efficient sequencing. If one wants to elucidate the gene expression profile of a particular, hypothetical cell, one would have to conduct several cDNA sequencing reactions. However, if

each mRNA species can be represented by a short unique sequence stretch (such as a 9-bp tag), the purpose would be attained by sequencing these tags, because a sequence stretch as short as 9 bp can distinguish 4^9 (262,144) transcripts, assuming a random nucleotide distribution throughout the genome. This appears sufficient for the discrimination of all human transcripts, because the human genome is estimated to encode between 28,642 and 153,478 genes. However, since the current sequencing procedure handles one clone at a time, one still has to conduct at least seven sequencing reactions for the profiling of this hypothetical cell. There is no particular merit in merely replacing each mRNA with a short sequence stretch, and this is the reason why the body mapping project suffered a setback despite its conceptual importance. However, if we could connect these tags into one long DNA molecule, the sequencing reaction would be needed only once. Since a currently used automated DNA sequencer reliably gives 500–600 nucleotides for any given clone, one would be able to obtain 50–60 mRNA sequences, each represented by a 9-bp tag, in a single reaction and run. This is more than enough to elucidate the gene expression profile of this hypothetical cell.

The SAGE procedure can be explained briefly as follows. Double-stranded cDNA is synthesised from mRNA by means of a biotinylated oligo(dT) primer. The cDNA is then cleaved with a restriction enzyme (called the anchoring enzyme). Any enzyme with a four-base recognition site may be used, because such enzymes cleave every 256 bp (4^4) on average, while the majority of mRNAs are considered to be much longer. In practice, NlaIII is the most frequently used enzyme. The 3′-most portion of the cleaved cDNA, with a common NlaIII cohesive end at its 5′-terminus, is then recovered by binding to streptavidin-coated beads. After dividing the reaction mixture into two portions, a different linker is ligated to each portion via the NlaIII cohesive termini. These linkers are designed to contain a site for a type IIS enzyme (usually FokI or BsmFI, designated the tagging enzyme) near (or partially overlapping) the 3′-NlaIII sequence. After the reaction mixtures are digested with the type IIS enzyme, the released portions are recovered. The resulting staggered ends of the products are then blunt-ended with T4 DNA polymerase. The two portions are mixed again and ligated. Since the 5′-ends of the linkers are blocked by an amino group, only the mRNA-derived termini can be ligated, in a tail-to-tail orientation. The products are PCR-amplified, cleaved by NlaIII (the anchoring enzyme) and then separated by polyacrylamide gel electrophoresis (PAGE). Ditag fragments flanked at both ends by NlaIII cohesive termini are isolated and ligated to obtain concatemers. Highly concatenated products are recovered by PAGE and cloned into a plasmid vector for sequencing. Thus, SAGE analysis is designed to provide a readout, via sequencing, of the spectrum of genes being expressed in a cell.

Thus, in simple terms, the ideas that underlie the SAGE methodology include the following: (1) a short sequence tag (10–15 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript; (2) sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and (3) quantification of the number of times a particular tag is observed provides the expression level of the corresponding transcript.

An extra stringency criterion that facilitates gene identification is that the tag must include the 3′-most anchoring site in a predicted transcript. A fraction of genes will have multiple tags due to alternative splicing near the 3′ end, or use of alternative polyadenylation sites, but for the most part, these can be identified. The number of times a specific tag is found in the SAGE sequences reflects its abundance in the mRNA population. Therefore, SAGE is described as a method that is used to obtain comprehensive, unbiased and quantitative gene expression profiles. Its major advantages over arrays are that it does not require a priori knowledge of the genes to be analysed and that it reflects absolute mRNA levels. Since the original SAGE protocol was developed in a short-tag (10-bp) format, several modifications have been made to produce longer SAGE tags for more precise gene identification and to decrease the amount of starting material necessary. Several SAGE-like methods have also been developed for the

genome-wide analysis of DNA copy number changes and methylation patterns, chromatin structure and transcription factor targets.

Unlike array and chip methods, SAGE does not require cDNAs and ESTs to be prepared beforehand. The expression information derives from SAGE tags, which are produced as part of the analysis. Sequence information is required to assign the tags to individual ORFs. However, unassigned SAGE tags are also useful (in species for which the complete genomes have not been sequenced, unassigned tags will be encountered frequently). They can be used to pull out promoters from genomic clones, to provide information about coordinated gene regulation, and to identify previously unknown genes. Quantitative comparison of SAGE samples is not always easy to interpret. A tag present in four copies in one sample of 50,000 tags and two copies in another may actually be twofold induced, or the difference may simply be due to random sampling.

cDNA-AFLP

For many years the isolation of genes for which products and mutants were not known was only possible by differential screening of cDNA libraries. The first in vitro technique for the determination of transcript patterns was differential display reverse transcription PCR (DDRT-PCR). For the first time it was possible to determine simultaneously a large part of the transcripts present in a eukaryotic cell within a single experiment with high sensitivity. The technique was applied widely, and for several years no other method was available by which comprehensive transcript patterns of eukaryotic cells could be obtained. Later, Fischer and his group combined DDRT-PCR and amplified fragment length polymorphism (AFLP), a method developed by Vos et al. in 1995 for the characterisation of genomic DNA. The new technique, termed restriction fragment length polymorphism-coupled domain-directed differential display (RC4D), provided a useful tool to detect differentially expressed members of individual gene families. The cDNA-AFLP technique is based on the selective PCR amplification of adapter-ligated restriction fragments derived from cDNA. The principle of this technique is described briefly hereunder.

cDNA is synthesised from total RNA or poly(A) RNA and is digested with TaqI and AseI, which recognise 4 and 6 bp, respectively. A complete digest of plant cDNA with these enzymes produces five different types of molecules: Ase/Ase fragments, Ase/Taq fragments, Taq/Taq fragments and two terminal fragments with only one cohesive end. TaqI, which cuts DNA frequently, generates small cDNA fragments (around 256 bp on average), which amplify well and lie in the optimal size range for separation on sequencing gels. AseI, which cuts only rarely due to its longer recognition sequence, reduces the number of fragments to a manageable level. Following digestion, double-stranded adapters are ligated to the restriction fragments to generate templates for amplification. PCR amplification is carried out in two steps. In the first step, around 15 cycles of non-specific amplification are carried out using primers without extensions. The products of this reaction are then subjected to a second round of PCR amplification using primers bearing at their 3′ end two additional nucleotides which extend into the sequence of the restriction fragments, allowing only a subpopulation to be amplified. All 256 possible primer combinations are necessary to amplify the whole cDNA population. The amplicons are separated on a polyacrylamide gel and visualised by autoradiography. Most of the bands represent Ase/Taq fragments because Ase/Ase fragments are rare and Taq/Taq fragments are not visible on the gel. RNA probes from different sources (A, B) will produce different cDNA-AFLP banding patterns, which allow differentially expressed cDNAs to be identified. However, there are variations on the above protocol, and three of them are described hereunder.

1. cDNA-AFLP with Two Restriction Enzymes
cDNA-AFLP is an RNA fingerprinting technique that evolved from AFLP (amplified fragment length polymorphism), a method described by Vos and his co-workers during 1995 for the fingerprinting of genomic DNA (see chapter 3). The classical cDNA-AFLP procedure uses the standard AFLP protocol on a cDNA template. The technique involves
three steps: (1) restriction of cDNA and ligation of oligonucleotide adapters, (2) selective amplification of sets of restriction fragments using PCR primers bearing selective nucleotides at the 3′ end and (3) gel analysis of the amplified fragments. Restriction of plant cDNA with a combination of two restriction enzymes, a tetra-cutter and a hexa-cutter, allows a significant fraction of the cDNA population to be cleaved and to be represented as a discrete banding pattern on a sequencing gel. In genomic AFLP with plant DNA, three selective bases on the end of each primer are required to give a useful banding pattern. The lower complexity of cDNA allows the use of two selective bases for each primer, giving a total of 256 possible primer combinations (4² = 16 variants of each primer, and 16 × 16 = 256 pairs). The largest cDNA-AFLP products visible on a polyacrylamide sequencing gel are around 1,000 bp in size, the lower end of the gel representing approx. 100 bp. In this size window, an average of 40 bands can be observed for each primer combination, corresponding to a total of approx. 10,000 bands.
2. cDNA-AFLP with One Restriction Enzyme
A systematic comparison of known potato cDNA sequences showed that approx. 45% are cleaved by the AseI/TaqI restriction enzyme combination. Thus, in so far as only one pair of enzymes is applied, about half of the transcripts present in a cell will not be detected by the standard cDNA-AFLP technique. To obtain more comprehensive patterns, the cDNA-AFLP protocol has been modified, and it was shown that the rarely cutting enzyme can be omitted and that meaningful banding patterns can be produced using TaqI alone. Samples derived from buds of red and white flowers of the common morning glory (Ipomoea purpurea) were compared using 96 different primer combinations, each of which gave approximately 50 bands, corresponding to a total of approximately 5,000 bands.
3. iAFLP
iAFLP (introduced AFLP) is a quantitative high-throughput expression profiling method specifically designed to measure the concentrations of known transcripts in numerous different probes. cDNA from each probe is restricted with MboI and ligated to one of up to six adapters having short insertions of various lengths into a common sequence (polymorphic adapters). Following ligation, the differentially adapted cDNAs are pooled and 3′-end fragments are selectively amplified with a gene-specific primer and a fluorescently labelled adapter primer. The amplicon is then separated on an automatic sequencer. Due to length heterogeneity introduced by the polymorphic adapters, iAFLP fragments from different probes will produce distinct peaks on the electropherogram. Transcript abundance is determined by evaluating peak areas relative to an internal standard.

Applications

cDNA-AFLP and its application to plants were first described by Bachem et al. in 1996, who analysed differential gene expression in a synchronised potato in vitro tuberisation system. During screening with different primer combinations, two lipoxygenase cDNA fragments were isolated on the basis of their differential expression during potato tuber formation. Both transcripts are highly tuber specific and are expressed strongly in 15-d-old tubers, but not in stolons, leaves or petioles and only at very low levels in stems. The dramatic induction of a lipoxygenase gene just after the start of tuberisation led the authors to speculate that the expression of at least one of these enzymes might be directly linked to the tuber development process. Following this initial report, a small number of papers have described the use of cDNA-AFLP fingerprinting in plant and animal systems. Habu et al. in 1997 compared mRNA samples obtained from the flower buds of two lines of Ipomoea purpurea. Fourteen cDNA fragments (approximately 0.3% of the roughly 5,000 bands) amplified differently in the two samples. Two of these were shown to have been derived from a gene that was actively expressed in the buds of red flowers but not in those of white flowers. Sequence analysis showed that this cDNA carries a sequence highly homologous to the chalcone
synthase gene, which encodes a key enzyme in the flavonoid biosynthetic pathway. cDNA-AFLP was also applied to identify differentially expressed genes in cold-tolerant and cold-sensitive alfalfa genotypes and rice.

RFLP-Coupled Domain-Directed Differential Display (RC4D)

Many genes and their protein products have a modular structure where the presence of certain domains (family-specific domains, FSDs) defines membership in different gene families. This is well characterised for the chlorophyll a/b binding proteins and for many transcription factors. Restriction fragment length polymorphism-coupled domain-directed differential display (RC4D, which was first described by Fischer and his team in 1995) is a method specifically designed to analyse expression of multi-gene families at different developmental stages, in diverse tissues or in different organisms. RC4D combines cDNA-AFLP technology with a gene family-specific version of DDRT-PCR. In RC4D, instead of arbitrary decameric primers, longer primers directed against an FSD are used, allowing cDNAs belonging to the same gene family to be selectively amplified. As the amplification products are relatively uniform in length, restriction fragment length polymorphism (RFLP) is introduced by digestion with a frequently cutting restriction enzyme. This reduces the amplicon size from approximately 1 kbp to several hundred base pairs, which is optimal for separation on acrylamide gels. Family members can thus easily be distinguished by size. The RC4D protocol, in brief, is as follows. cDNA is synthesised from mRNA with an oligo(dT) primer bearing a PCR downstream primer binding sequence at its 5′ end. PCR is performed with the downstream primer and an upstream primer specific for a family-specific domain (FSD). This results in a mixture of truncated family member cDNAs. The amplicon is digested with a frequently cutting restriction enzyme, and double-stranded linkers are ligated to the cohesive ends. PCR with a linker primer and an FSD primer results in a population of family member cDNA fragments of different lengths. To get rid of the unligated fragments, a further round of PCR is performed using the FSD primer and a primer directed against the linker. Amplification products are then used as a template to extend a radiolabelled FSD primer, and extension products are separated on acrylamide gels. Different probes will produce different RC4D banding patterns, which allow identification of differentially expressed cDNAs.

RC4D was first used to analyse differential expression of MADS box genes in male and female inflorescences of maize. The name MADS was constructed from the initials of the first four members of the gene family, which were MCM1 (yeast), AGAMOUS (plants), DEFICIENS (plants) and SRF (human). A small collection of MADS box primers was designed, directed against sequences encoding derivatives of a highly conserved amino acid motif which covered all its variations known from plants. RC4D yielded many fragments significantly different in size. Most of them were equally present in both sexes. Four already known and two new MADS box genes were identified, being either specifically expressed in the female sex or preferentially expressed in male or female inflorescences, respectively. The two new MADS box genes belong to a subfamily showing sequence similarity to floral homoeotic and transcription factor genes. Another example of the use of RC4D was the identification of several cDNAs coding for calcium-dependent protein kinases involved in calcium signalling during cold induction of the kin genes of Arabidopsis thaliana.

Gene Tagging by Insertional Mutagenesis

Identification of genes by insertional mutagenesis is quite advantageous due to the ease of isolating the tagged gene in comparison with functional analysis based on mutations derived from chemical or physical treatments. The process of insertional mutagenesis involves the insertion of a known segment of DNA into a gene of interest. This inserted sequence often creates a knockout mutation by blocking or disrupting the expression
of the gene and might result in a mutant phenotype that can be screened. In addition, the insertion sequence also tags the affected gene, which can be isolated by using hybridisation probes based on the sequence of the gene tag. Once the mutated gene is known, the initial wild-type gene can also be identified. Such a method has the major advantage of not requiring any prior knowledge of the gene product or its expression. Also, this approach provides a direct route to determine the function of a gene product in situ, unlike other methods which are correlative and do not necessarily prove a relationship between a gene sequence and its function. Two types of insertion sequences are commonly used for mutagenesis in plants: transposable elements and Agrobacterium tumefaciens-mediated T-DNA (transfer DNA) insertions.

T-DNA Tag

The process of gene tagging using T-DNA as the insert has been used effectively to isolate genes, especially in Arabidopsis. T-DNA insertional mutagenesis has also been used to produce 22,090 primary transgenic rice plants having approximately 25,700 tags. Another efficient T-DNA tagging system for japonica rice has also been described in which over 1,000 T-DNA tags in the rice genome have been characterised. It clearly revealed that preferential insertion has occurred in gene-rich regions.

Transposon Tags

Transposons, first recognised by Barbara McClintock in maize, have become a powerful tool for gene isolation. The mutagenic potential of mobile elements, their ability to tag the mutated sequences and their widespread distribution have been exploited for gene isolation, as these properties help in the cloning of genes. The application of transposon tagging was initially restricted to plants, such as maize (Zea mays) and snapdragon (Antirrhinum), with active and well-characterised endogenous transposons. Now, however, maize transposon systems have been used for mutagenesis in heterologous transgenic plant species which otherwise lack an active endogenous transposon family. For example, the Ac element was introduced into rice, and checking for hygromycin resistance identified the transposed plants, since the autonomous Ac element had been cloned between the promoter and the hph-coding region. A strategy using the maize Ac-Ds system has also been effectively used for gene tagging in rice. Retrotransposons, transposable elements that transpose via an RNA intermediate and are structurally similar to integrated copies of retroviruses, have also been shown to be efficient gene tags, as demonstrated by the introduction of the tobacco retrotransposon Tto1 into rice and its autonomous transposition through reverse transcription.

Classical genetic approaches to identify genes, as mentioned earlier, are generally based on the creation of mutations leading to a recognisable phenotype reflecting the gene function, such as in gene tagging. However, this is not always possible, since many genes show functional redundancy, and thus mutation in one gene or locus could be compensated for by the functioning of one or more other family members. Moreover, certain genes function at different stages of development. Mutations in such genes could cause early lethality or could be highly pleiotropic. This can thus prevent the identification of the role of the gene. Trapping techniques have been developed keeping these limitations in mind. Entrapment strategies rely on the use of inserts, such as transposons or T-DNA, containing reporter gene constructs, whose expression is dependent on cis-acting regulatory sequences at the site of insertion. The inserts then allow for the identification of genes, based on their expression pattern, even though they might not display an obvious mutant phenotype. Three basic types of gene traps are constructed using reporter genes such as those encoding β-glucuronidase (GUS) and green fluorescent protein (GFP): enhancer trap, promoter trap and gene trap.

Another approach used to assess gene function is activation tagging. This technique is based on
the use of an insertion element carrying a strong enhancer. Thus, on integration into the genome, it causes activation of an adjacent gene or enhances its expression, resulting in gain-of-function mutants.

Post-transcriptional Gene Silencing

Epigenetic regulation of gene expression is a heritable change in gene expression that cannot be explained by changes in gene sequence. It can result in the repression or activation of gene expression and is therefore referred to as gene silencing or gene activation, respectively. Until the end of the 1980s, only modifications of DNA or protein that lead to transcriptional repression or activation, or to the formation of prions, were classified as epigenetic. During the 1990s, however, a number of gene-silencing phenomena that occur at the post-transcriptional level were discovered in plants, fungi, animals and ciliates, introducing the concept of post-transcriptional gene silencing (PTGS) or RNA silencing. PTGS results in the specific degradation of a population of homologous RNAs. It was first observed after introduction of an extra copy of an endogenous gene (or of the corresponding cDNA under the control of an exogenous promoter) into plants. Because RNAs encoded by both transgenes and homologous endogenous gene(s) were degraded, the phenomenon was originally called co-suppression. A similar phenomenon in the fungus Neurospora crassa was named quelling. Later, several groups showed that PTGS can also affect transgenes that are not homologous to endogenous genes, suggesting that this phenomenon is not a simple regulatory mechanism that controls the expression of endogenous genes. Fire et al. in 1998 identified a related mechanism, RNA interference (RNAi), in animals. RNAi results in the specific degradation of endogenous RNA in the presence of homologous dsRNA either locally injected or transcribed from an inverted repeat transgene. Injected dsRNA, as well as transgenes expressing dsRNA, also triggers silencing of homologous (trans)genes in plants. This strongly suggests that a mechanistic link between PTGS, quelling and RNAi exists. Thus, understanding such gene regulation mechanisms also has a strong influence on characterising QTLs at the molecular level.

MicroRNAs

MicroRNAs are a class of post-transcriptional regulators. They are short (~22 nucleotide) RNA sequences that bind to complementary sequences in the 3′ untranslated region (UTR) of multiple target mRNAs, usually resulting in their silencing. MicroRNAs target ~60% of all genes, are abundantly present in cells and are able to repress hundreds of targets each. These features, coupled with their conservation in organisms ranging from the unicellular alga Chlamydomonas reinhardtii to multicellular plants and animals, suggest they are a vital part of genetic regulation with ancient origins.

MicroRNAs were first discovered in 1993 by Victor Ambros, Rosalind Lee and Rhonda Feinbaum during a study of development in the nematode Caenorhabditis elegans involving the gene lin-14. This screen led to the discovery that lin-14 is regulated by a short RNA product from lin-4, a gene that is transcribed into a 61 nucleotide precursor, which matures to a 22 nucleotide RNA containing sequences partially complementary to multiple sequences in the 3′ UTR of the lin-14 mRNA. This complementarity was necessary and sufficient to inhibit the translation of lin-14 mRNA. Retrospectively, this was the first microRNA to be identified, though at the time Ambros et al. speculated it to be a nematode idiosyncrasy. Since then, several thousand miRNAs and their targets have been discovered in a wide range of eukaryotes including mammals, fungi and plants.

In plants, the successful targeting reaction requires complementarity of the miRNA at most of the residues. The consequence of the targeting reaction depends on the nature of the targeted RNA and the extent of complementarity with the miRNA. The target RNA is cleaved, and the level of the protein product is reduced if there is near complete complementarity, including positions 9 and 10 of the miRNA. Translational suppression
without turnover of the target RNA is mediated by miRNAs with incomplete complementarity to their target. In addition, there may be miRNA-mediated targeting of chromatin-associated RNAs that leads directly or indirectly to targeted epigenetic modification. In some instances, miRNA-mediated gene silencing is a simple negative switch: whenever the miRNA gene is active, the target mRNA is silent. However, these versatile RNA regulators may also participate in feedback loops and carry out more subtle roles in genetic regulation. They might dampen fluctuations in target gene expression, for example, or influence temporal changes. In some instances, the miRNAs or their precursors may move through plasmodesmata, and different stages in the feedback system occur in adjacent cells or in separate roots and shoots. miRNAs may also initiate regulatory cascades with multiple mRNA targets. These cascades involve secondary small interfering RNAs (siRNAs) that associate with argonaute (AGO) proteins, similarly to miRNAs. The first step in these cascades requires an RNA-dependent RNA polymerase (RDR, RDR6 in Arabidopsis thaliana), and it takes place if the initiator miRNA duplex structure is asymmetrical, if the initiator miRNA is 22 nucleotides rather than 21 nucleotides long, or if there are two target sites for 21-nucleotide RNAs. The initiator miRNA stimulates the RDR to convert the targeted RNA into long, double-stranded RNA that is then processed by Dicer into secondary siRNAs. A high proportion of the secondary siRNAs are in a 21-nucleotide phased register in which the first position is the cleavage site targeted by the initiator miRNA.

Comparing miRNAs between species can even be used to delineate molecular evolutionary history on the basis that the complexity of an organism's phenotype may reflect that of the microRNA found in the genotype. Unfortunately, validating microRNA targets is substantially more time-consuming than predicting sequences and targets. Due to their abundant presence and far-reaching potential, miRNAs have all sorts of functions in physiology, from cell differentiation, proliferation and apoptosis to the endocrine system, haematopoiesis, fat metabolism and limb morphogenesis. They display different expression profiles from tissue to tissue, reflecting the diversity in cellular phenotypes, and as such suggest a role in tissue differentiation and maintenance. Hence, integration of such information in QTL mapping studies can open up new avenues in MAS.

Biochemical Techniques

Biochemistry involves the study of chemical processes that occur in living organisms with the ultimate aim of understanding the nature of life in molecular terms. There are several biochemical techniques that have a role in unravelling the molecular basis of life. One- and two-dimensional electrophoresis are the most widely used techniques in protein identification and characterisation. Mass spectrometry is mainly used to study protein structure and function (proteomics) and to profile small metabolites (metabolomics). A large number of biochemical techniques have potential application in MAS, and only a few major techniques are discussed hereunder.

Plant Proteomics

Proteins are the workhorses of the cell and have important functions in both normal and abnormal states. In order to understand how proteins interact and regulate various cellular processes, it is important to understand their expression behaviour under a wide range of experimental conditions. Unlike the genome, which contains a fixed number of genes, the levels of protein within cells are highly dynamic. Proteins are constantly processed within the cell in response to external stimuli and undergo a wide range of posttranslational modifications. As a result, it is hard to accurately determine the exact number or quantities of proteins which are present within biological systems. In addition, protein families are extremely diverse and have considerable differences in their physical sizes, chemical and structural properties, affinity constants and relative abundance within the cells. As a result, accurately
characterising such interactions is extremely challenging.

The term proteomics was first coined in 1995 and was defined as the large-scale characterisation of the entire protein complement of a cell line, tissue or organism. Today, two definitions of proteomics are encountered. The first is the more classical definition, restricting the large-scale analysis of gene products to studies involving only proteins. The second and more inclusive definition combines protein studies with analyses that have a genetic readout such as mRNA analysis, genomics and the yeast two-hybrid analysis. However, the goal of proteomics remains the same, that is, to obtain a more global and integrated view of biology by studying all the proteins of a cell rather than each one individually. Using the more inclusive definition of proteomics, many different areas of study are now grouped under the heading proteomics. These include protein–protein interaction studies, protein modifications, protein function and protein localisation studies, to name a few. The aim of proteomics is not only to identify all the proteins in a cell but also to create a complete three-dimensional (3-D) map of the cell indicating where proteins are located. These ambitious goals will certainly require the involvement of a large number of different disciplines such as molecular biology, biochemistry and bioinformatics. It is likely that in bioinformatics alone, more powerful computers will have to be devised to organise the immense amount of information generated from these endeavours.

In the quest to characterise the proteome of a given cell or organism, it should be remembered that the proteome (the complete set of proteins at a given time) is dynamic. The proteome of a cell will reflect the immediate environment in which it is studied. In response to internal or external cues, proteins can be modified by posttranslational modifications, undergo translocations within the cell or be synthesised or degraded. Thus, examination of the proteome of a cell is like taking a snapshot of the protein environment at any given time. Considering all the possibilities, it is likely that any given genome can potentially give rise to an infinite number of proteomes.

The first protein studies that can be called proteomics began in 1975 with the introduction of the two-dimensional gel by O'Farrell, Klose and Scheele, who began mapping proteins from Escherichia coli, mouse and guinea pig, respectively. Although many proteins could be separated and visualised, they could not be identified. Despite these limitations, shortly thereafter, a large-scale analysis of all human proteins was proposed. The goal of this project, termed the human protein index, was to use two-dimensional protein electrophoresis (2-DE) and other methods to catalogue all human proteins. However, lack of funding and technical limitations prevented this project from progressing. Although the development of 2-DE was a major step forward, the science of proteomics would have to wait until the proteins displayed by 2-DE could be identified. One problem that had to be overcome was the lack of sensitive protein sequencing technology. Improving sensitivity was critical for success because biological samples are often limiting and both one-dimensional (1-D) and two-dimensional (2-D) gels have limits in protein loading capacity. The first major technology to emerge for the identification of proteins was the sequencing of proteins by Edman degradation. A major breakthrough was the development of microsequencing techniques for electroblotted proteins. This technique was used for the identification of proteins from 2-D gels to create the first 2-D databases. Improvements in microsequencing technology resulted in increased sensitivity of Edman sequencing in the 1990s to high-picomole amounts.

One of the most important developments in protein identification has been the development of mass spectrometry (MS). In the last decade, the sensitivity of analysis and accuracy of results for protein identification by MS have increased by several orders of magnitude. It is now estimated that proteins in the femtomolar range can be identified in gels. Because MS is more sensitive, can tolerate protein mixtures and is amenable to high-throughput operations, it has essentially replaced Edman sequencing as the protein identification tool of choice.

Why Proteomics?

Many types of information cannot be obtained from the study of QTLs or genes alone. For example, proteins (and, in turn, metabolites), not genes, are responsible for the phenotypes of cells. It is impossible to elucidate mechanisms of growth and development, disease, aging and effects of the environment solely by studying the genome. Only through the study of proteins can protein modifications be characterised and the targets of drugs identified.
1. Annotation of the Genome
One of the first applications of proteomics will be to identify the total number of genes in a given genome. This functional annotation of a genome is necessary because it is still difficult to predict genes accurately from genomic data. One problem is that the exon–intron structure of most genes cannot be accurately predicted by bioinformatics. To achieve this goal, genomic information will have to be integrated with data obtained from protein studies to confirm the existence of a particular gene.
2. Protein Expression Studies
In recent years, the analysis of mRNA expression by various methods has become increasingly popular. These methods include SAGE and DNA microarray technology (see above). However, the analysis of mRNA is not a direct reflection of the protein content in the cell. Indeed, many studies have now shown a poor correlation between mRNA and protein expression levels. The formation of mRNA is only the first step in a long sequence of events resulting in the synthesis of a protein. First, mRNA is subject to post-transcriptional control in the form of alternative splicing, polyadenylation and mRNA editing. Many different protein isoforms can be generated from a single gene at this step. Second, mRNA can then be subject to regulation at the level of protein translation. Proteins, having been formed, are subject to posttranslational modification. It is estimated that up to 200 different types of posttranslational protein modification exist. Proteins can also be regulated by proteolysis and compartmentalisation. The average number of protein forms per gene was predicted to be one or two in bacteria, three in yeast and three or more in humans. Therefore, it is clear that the theory of one gene, one protein is an oversimplification. In addition, some bodily fluids such as serum or urine have no mRNA source and therefore cannot be studied by mRNA analysis.
3. Protein Function
According to one study, no function can be assigned to about one-third of the sequences in organisms for which the genomes have been sequenced. The complete identification of all proteins in a genome will aid the field of structural genomics, in which the ultimate goal is to obtain 3-D structures for all proteins in a proteome. This is necessary because the functions of many proteins can only be inferred by examination of their 3-D structure.
4. Protein Modifications
One of the most important applications of proteomics will be the characterisation of posttranslational protein modifications. Proteins are known to be modified posttranslationally in response to a variety of intracellular and extracellular signals. For example, protein phosphorylation is an important signalling mechanism, and dysregulation of protein kinases or phosphatases can result in undesirable effects such as oncogenesis. By using a proteomics approach, changes in the modifications of many proteins expressed by a cell can be analysed simultaneously.
5. Protein Localisation and Compartmentalisation
One of the most important regulatory mechanisms known is protein localisation. The mislocalisation of proteins is known to have profound effects on cellular function (e.g. cystic fibrosis). Proteomics aims to identify the subcellular location of each protein. This information can be used to create a 3-D protein map of the cell, providing novel information about protein regulation.
6. Protein–Protein Interactions
Of fundamental importance in biology is the understanding of protein–protein interactions.
The process of cell growth, programmed cell death and the decision to proceed through the cell cycle are all regulated by signal transduction through protein complexes. Proteomics aims to develop a complete 3-D map of all protein interactions in the cell. One step toward this goal was completed for the microorganism Helicobacter pylori. Using the yeast two-hybrid method to detect protein interactions, 1,200 connections were identified between H. pylori proteins covering 46.6% of the genome. A comprehensive two-hybrid analysis has also been performed on all the proteins obtained from the yeast S. cerevisiae.

Types of Proteomics

Protein Expression Proteomics

The quantitative study of protein expression between samples that differ by some variable is known as expression proteomics. In this approach, protein expression of the entire proteome or of subproteomes between samples can be compared. Information from this approach can identify novel proteins in signal transduction or identify disease-specific proteins.

Structural Proteomics

Proteomics studies whose goal is to map out the structure of protein complexes or the proteins present in a specific cellular organelle are known as cell map or structural proteomics. Structural proteomics attempts to identify all the proteins within a protein complex or organelle, determine where they are located and characterise all protein-protein interactions. An example of structural proteomics is the analysis of the nuclear pore complex. Isolation of specific subcellular organelles or protein complexes by purification can greatly simplify the proteomic analysis. This information will help join together the overall architecture of cells and explain how expression of certain proteins gives a cell its unique characteristics.

Functional Proteomics

Functional proteomics is a broad term for many specific, directed proteomics approaches. In some cases, specific subproteomes are isolated by affinity chromatography for further analysis. This could include the isolation of protein complexes or the use of protein ligands to isolate specific types of proteins. This approach allows a selected group of proteins to be studied and characterised and can provide important information about protein signalling, disease mechanisms or protein-drug interactions.

Protein Analysis

By the very definition of proteomics, it is expected that complex protein mixtures will be encountered. Therefore, methods must exist to resolve these protein mixtures into their individual components so that the proteins can be visualised, identified and characterised. The predominant technology for protein separation and isolation is polyacrylamide gel electrophoresis. Unlike the breakthroughs in molecular biology that eventually enabled the sequencing of the human genome, some aspects of protein science have shown little progress over the years. Protein separation technology is one of them. Since its inception several decades ago, protein electrophoresis still remains the most effective way to resolve a complex mixture of proteins. In many applications, it is at this stage where the bottleneck occurs. This is because 1- or 2-DE is a slow, tedious procedure that is not easily automated. However, until something replaces this methodology, it will remain an essential component of proteomics.

One- and Two-Dimensional Gel Electrophoresis

For many proteomics applications, 1-DE is the method of choice to resolve protein mixtures. In 1-DE, proteins are separated on the basis of molecular mass. Because proteins are solubilised in sodium dodecyl sulphate (SDS), protein solubility is rarely a problem. Moreover, 1-DE is simple to perform, is reproducible and can be used to resolve proteins with molecular masses of 10 to 300 kDa. The most common application of 1-DE is the characterisation of proteins after some form of protein purification. This is because of the limited resolving power of a 1-D gel. If a more complex protein mixture such as a crude cell lysate is encountered, then 2-DE can be used. In 2-DE, proteins are separated by two distinct properties. They are resolved according to their net charge in the first dimension and according to their molecular mass in the second dimension. The combination of these two techniques produces resolution far exceeding that obtained in 1-DE. One of the greatest strengths of 2-DE is the ability to resolve proteins that have undergone some form of posttranslational modification. This resolution is possible in 2-DE because many types of protein modifications confer a difference in charge as well as a change in mass on the protein. One such example is protein phosphorylation. Frequently, the phosphorylated form of a protein can be resolved from the nonphosphorylated form by 2-DE. In this case, a single phosphoprotein will appear as multiple spots on a 2-D gel. In addition, 2-DE can detect different forms of proteins that arise from alternative mRNA splicing or proteolytic processing.

The primary application of 2-DE continues to be protein expression profiling. In this approach, the protein expression of any two samples can be qualitatively and quantitatively compared. The appearance or disappearance of spots can provide information about differential protein expression, while the intensity of those spots provides quantitative information about protein expression levels. Such information can be treated as quantitative traits and mapped on the linkage map (which is referred to as protein QTL (pQTL) mapping). Protein expression profiling can be used for samples from whole organisms, cell lines, tissues or bodily fluids. Examples of this technique include the comparison of normal and diseased tissues or of cells treated with various chemicals (pesticide/herbicide) or stimuli (water or salinity or nutrient stress). Another application of 2-DE is in cell map proteomics. 2-DE is used to map proteins from microorganisms, cellular organelles and protein complexes. It can also be used to resolve and characterise proteins in subproteomes that have been created by some form of purification of a proteome. Because a single 2-DE gel can resolve thousands of proteins, it remains a powerful tool for the cataloguing of proteins. Many 2-DE databases have been constructed and are available on the World Wide Web.

A number of improvements have been made in 2-DE over the years. One of the biggest improvements was the introduction of immobilised pH gradients, which greatly improved the reproducibility of 2-DE. The use of fluorescent dyes has improved the sensitivity of protein detection, and specialised pH gradients are able to resolve more proteins. The speed of running 2-DE has been improved, and 2-D gels can now be run in the mini-gel format. In addition, there have been efforts to automate 2-DE. Hochstrasser's group has automated the process of 2-DE from gel running to image analysis and spot picking. The use of computers has aided the analysis of complex 2-D gel images. This is a critical aspect of 2-DE because a high degree of accuracy is required in spot detection and annotation if artefacts are to be avoided. A molecular scanner is available to record 2-DE images. Software programs, such as Melanie, compare computer images of 2-D gels and facilitate both the identification and quantitation of protein spots between samples. An exciting advance in 2-DE was developed by Minden and co-workers. This technology is called difference gel electrophoresis (DIGE) and utilises fluorescent tagging of two protein samples with two different dyes. The tagged proteins are run on the same 2-D gel, and post-run fluorescence imaging of the gel is used to create two images, which are superimposed to identify pattern differences. The dyes are amine reactive and are designed to ensure that proteins common to both samples have the same relative mobility regardless of the dye used to tag them. This technique circumvents the need to compare several 2-D gels. In their original paper, DIGE was used to detect differences between exogenous proteins in two Drosophila melanogaster embryo extracts at nanogram levels.
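As a rough illustration of the two-image comparison that DIGE relies on, the sketch below takes matched spot intensities from the two dye channels of one gel and reports the spots whose log2 ratio exceeds a cut-off. The spot labels, the intensity values and the two-fold threshold are hypothetical and chosen only for illustration; commercial image-analysis software also performs spot matching and normalisation, which are not shown here.

```python
import math

def differential_spots(channel_a, channel_b, min_fold=2.0):
    """Return {spot: log2(B/A)} for spots whose fold change is at least min_fold."""
    cutoff = math.log2(min_fold)
    flagged = {}
    # Only spots detected in both dye channels can be compared.
    for spot in channel_a.keys() & channel_b.keys():
        a, b = channel_a[spot], channel_b[spot]
        if a <= 0 or b <= 0:
            continue  # a log ratio is undefined for an empty spot
        log_ratio = math.log2(b / a)
        if abs(log_ratio) >= cutoff:
            flagged[spot] = log_ratio
    return flagged

# Hypothetical intensities for three spots imaged in two dye channels.
control = {"spot1": 1000.0, "spot2": 500.0, "spot3": 200.0}
treated = {"spot1": 1100.0, "spot2": 2000.0, "spot3": 50.0}
result = differential_spots(control, treated)
```

Working on the log2 scale treats a four-fold increase and a four-fold decrease symmetrically, which is why the cut-off is applied to the absolute value of the ratio.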
Moreover, an inducible protein from Escherichia coli was detected after 15 min of induction. This technology is now commercially available from Amersham Pharmacia.

However, a number of problems with 2-DE still remain. Despite efforts to automate protein analysis by 2-DE, it is still a labour-intensive and time-consuming process. A typical 2-DE experiment can take 2 days, and only a single sample can be analysed per gel. In addition, 2-DE is limited by both the number and type of proteins that can be resolved. For example, the protein mixture obtained from a eukaryotic cell lysate is too complex to be completely resolved on a single 2-D gel. Many large or hydrophobic proteins will not enter the gel during the first dimension, and proteins of extreme acidity or basicity (proteins with pIs below pH 3 and above pH 10, respectively) are not well represented. Some of these problems can be overcome with different solubilisation conditions and pH gradients. Another limitation of 2-DE is the inability to detect low-copy proteins when a total-cell lysate is analysed. In a crude cell extract, the most abundant proteins can dominate the gel, making the detection of low-copy proteins difficult. It was determined in the analysis of yeast proteins by 2-DE that no proteins defined as low-copy proteins were visible by 2-DE. Yet it is estimated that over half of the 6,000 genes in yeast may encode low-copy proteins. In mammalian cells, the dynamic range of protein expression is estimated to be between 7 and 9 orders of magnitude. This problem cannot be overcome by simply loading more protein on the gel, because the resolution will decrease and the co-migration of proteins will increase. Because of these limitations, the largest application of 2-DE in the future will probably involve the analysis of protein complexes or subproteomes as opposed to whole proteomes.

Alternatives to Electrophoresis in Proteomics

The limitations of 2-DE have inspired a number of approaches to bypass protein gel electrophoresis. One approach is to convert an entire protein mixture to peptides (usually by digestion with trypsin) and then purify the peptides before subjecting them to analysis by mass spectrometry (MS). Various methods for peptide purification have been devised, including liquid chromatography, capillary electrophoresis and a combination of techniques such as multidimensional protein identification or cation-exchange chromatography and reverse-phase (RP) chromatography. The advantage of these methods is that because a 2-D gel is avoided, a greater number of proteins in the mixture can be represented. The disadvantage is that it can require an immense amount of time and computing power to disclose the data obtained. In addition, considerable time and effort may be expended in the analysis of uninteresting proteins. One of the most exciting techniques to emerge as an alternative to protein electrophoresis is that of isotope-coded affinity tags (ICAT). This method allows the quantitative protein profiling between different samples without the use of electrophoresis.

Acquisition of Protein Structure Information

Edman Sequencing

One of the earliest methods used for protein identification was microsequencing by Edman chemistry to obtain N-terminal amino acid sequences. Little has changed in Edman chemistry since its introduction, but improvements in sequencing technology have increased the sensitivity and ease of Edman sequencing. Although the use of Edman sequencing is decreasing in the field of proteomics, it is still a very useful tool for several reasons. First, because Edman sequencing existed before MS as a sequencing tool, a considerable number of investigators continue to use Edman sequencing. Second, Edman sequencing of relatively abundant proteins is a viable alternative to MS if a mass spectrometer is in high demand for the identification of low-copy proteins or is not available. Finally, Edman sequencing is used to obtain the N-terminal sequence of a protein (if possible) to determine its true start.
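The stepwise logic of N-terminal sequencing can be sketched in a few lines. The function below is a toy model and not a description of any sequencer's software: it treats a protein as a string of one-letter residue codes, releases one residue per cycle starting from the N-terminus, and returns nothing when the N-terminus is flagged as chemically blocked. The function name, the `blocked` flag and the example sequence are assumptions made purely for illustration.

```python
def edman_cycles(sequence, n_cycles, blocked=False):
    """Toy model of Edman degradation: list the residue released in each cycle."""
    if blocked:
        # A modified N-terminus does not react, so no residue is released.
        return []
    released = []
    remaining = sequence
    for _ in range(n_cycles):
        if not remaining:
            break  # the whole chain has been consumed
        released.append(remaining[0])  # cleave the current N-terminal residue
        remaining = remaining[1:]      # the next residue becomes the new N-terminus
    return released

# Hypothetical ten-residue peptide, six sequencing cycles.
calls = edman_cycles("MKTAYIAKQR", 6)
```

In this model the number of cycles directly sets how many N-terminal residues are read, and a blocked protein yields no sequence at all regardless of how many cycles are run.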

The N-terminal sequencing of proteins was introduced by Edman in 1949. Today, Edman sequencing is most often used to identify proteins after they are transferred to membranes. The development of membranes compatible with sequencing chemicals allowed Edman sequencing to become a more applicable sequencing method for the identification of proteins separated by SDS-polyacrylamide gel electrophoresis. One of the biggest problems that has limited the success of Edman sequencing in the past is N-terminal modification of proteins. Since it is difficult to tell if a protein is N-terminally blocked before it is sequenced, precious samples were often lost in failed sequencing attempts. To overcome this problem, a novel approach called mixed-peptide sequencing has been developed. In mixed-peptide sequencing, a protein is converted into peptides by cleavage with cyanogen bromide (CNBr) or skatole, and the peptides are sequenced in an Edman sequencer simultaneously. Briefly, the process of mixed-peptide sequencing involves separation of a complex protein mixture by polyacrylamide gel electrophoresis (1-D or 2-D) and then transfer of the proteins to an inert membrane by electroblotting. The proteins of interest are visualised on the membrane surface, excised and fragmented chemically at methionine (by CNBr) or tryptophan (by skatole) into several large peptide fragments. On average, three to five peptide fragments are generated, consistent with the frequency of occurrence of methionine and tryptophan in most proteins. The membrane piece is placed directly into an automated Edman sequencer without further manipulation. Between 6 and 12 automated Edman cycles are carried out (4 to 8 h), and the mixed sequence data are fed into the FASTF or TFASTF algorithms, which sort and match the data against protein (FASTF) and DNA (TFASTF) databases to unambiguously identify the protein. The FASTF and TFASTF programs were written in collaboration with William Pearson (Department of Biochemistry, University of Virginia) and are available at several databases including NCBI. Because minimal sample handling is involved, mixed-peptide sequencing can be a sensitive approach for identifying proteins in polyacrylamide gels at the 0.1- to 1-pmol level. The mixed sequence approach has the advantage of enabling subsequent searches to be carried out against unannotated or non-species-specific DNA databases as well as annotated protein databases. This is because the T/FASTF algorithms utilise actual amino acid sequence and are therefore able to tolerate errors in the database as well as polymorphisms or conservative substitutions. A variation of T/FASTF has been devised for MS. The T/FASTF/S programs are available at http://fasta.bioch.virginia.edu/.

Mass Spectrometry

MS enables protein structural information, such as peptide masses or amino acid sequences, to be obtained. This information can be used to identify the protein by searching nucleotide and protein databases. It also can be used to determine the type and location of protein modifications. The harvesting of protein information by MS can be divided into three stages: (1) sample preparation, (2) sample ionisation and (3) mass analysis.

Sample Preparation
In most of proteomics, a protein is resolved from a mixture by using a 1- or 2-D polyacrylamide gel. The challenge is to extract the protein or its constituent peptides from the gel, purify the sample and analyse it by MS. The extraction of whole proteins from gels is inefficient; however, if a protein is in-gel digested with a protease, many of the peptides can be extracted from the gel. A method for in-gel protein digestion was developed and is now commonly applied to both 1- and 2-D gels. In-gel digestion is more efficient at sample recovery than other common methods such as electroblotting. In addition, the conversion of a protein into its constituent peptides provides more information than can be obtained from the whole protein itself. For many applications, the peptides recovered following in-gel digestion need to be purified to remove gel contaminants. Common impurities from electrophoresis such as salts, buffers and detergents can interfere with MS. In addition, peptide samples
often require concentration before being analysed by MS. One method of peptide purification commonly employed for this purpose is reverse-phase chromatography, which is available in a variety of formats. Peptides can be purified with ZipTips (Millipore) or Poros R2 perfusion material (PerSeptive Biosystems, Framingham, Mass.) or by high-pressure liquid chromatography (HPLC).

Sample Ionisation
For biological samples to be analysed by MS, the molecules must be charged and dry. This is accomplished by converting them to desolvated ions. The two most common methods for this are electrospray ionisation (ESI) and matrix-assisted laser desorption/ionisation (MALDI). In both methods, peptides are converted to ions by the addition or loss of one or more protons. ESI and MALDI are soft ionisation methods that allow the formation of ions without significant loss of sample integrity. This is important because it enables accurate mass information to be obtained about proteins and peptides in their native states.

(a) Electrospray Ionisation: In ESI, a liquid sample flows from a microcapillary tube into the orifice of the mass spectrometer, where a potential difference between the capillary and the inlet to the mass spectrometer results in the generation of a fine mist of charged droplets. As the solvent evaporates, the sizes of the droplets decrease, resulting in the formation of desolvated ions. A significant improvement in ESI technology occurred with the development of nanospray ionisation. In nanospray ionisation, the microcapillary tube has a spraying orifice of 1 to 2 µm and flow rates as low as 5 to 10 nl/min. The low flow rates possible with nanospray ionisation reduce the amount of sample consumed and increase the time available for analysis. For ESI, there are several ways to deliver the sample to the mass spectrometer. The simplest method is to load individual microcapillary tubes with sample. Because a new microcapillary tube is used for each sample, cross-contamination is avoided. In ESI, peptides require some form of purification after in-gel digestion, and this can be accomplished directly in the microcapillary tubes. The drawback to both the purification and manual loading of microcapillary tubes is that they are tedious and slow. As an alternative, electrospray sources have been connected in line with liquid chromatography (LC) systems that automatically purify and deliver the sample to the mass spectrometer. Examples of this method are LC, reverse-phase LC (RP-LC) and reverse-phase microcapillary LC (RP-µLC).

(b) Matrix-Assisted Laser Desorption/Ionisation (MALDI): In MALDI, the sample is incorporated into matrix molecules and then subjected to irradiation by a laser. The laser promotes the formation of molecular ions. The matrix is typically a small energy-absorbing molecule such as 2,5-dihydroxybenzoic acid or α-cyano-4-hydroxycinnamic acid. The analyte is spotted, along with the matrix, on a metal plate and allowed to evaporate, resulting in the formation of crystals. The plate, which can be in 96-well format, is then placed in the mass spectrometer, and the laser is automatically targeted to specific places on the plate. Since sample application can be performed by a robot, the entire process including data collection and analysis can be automated. This is the single biggest advantage of MALDI. Another advantage of MALDI over ESI is that samples can often be used directly without any purification after in-gel digestion.

Mass Analysis
Mass analysis follows the conversion of proteins or peptides to molecular ions. This is accomplished by the mass analysers in a mass spectrometer, which resolve the molecular ions on the basis of their mass and charge in a vacuum.

(a) Quadrupole Mass Analysers: One of the most common mass analysers is the quadrupole mass analyser. Here, ions are transmitted through an electric field created by an array of four parallel metal rods, the quadrupole. A quadrupole can act to transmit all ions or as a mass filter to allow the transmission of ions of a certain mass-to-charge (m/z) ratio. If multiple quadrupoles are combined, they can be used to obtain information about the amino acid sequence of a peptide. For a more
detailed review of the operating principles of a quadrupole mass analyser, the reader is directed to several excellent reviews.

(b) Time of Flight: A time-of-flight (TOF) instrument is one of the simplest mass analysers. It measures the m/z ratio of an ion by determining the time required for it to traverse the length of a flight tube. Some TOF mass analysers include an ion mirror at the end of the flight tube, which reflects ions back through the flight tube to a detector. In this way, the ion mirror serves to increase the length of the flight tube. The ion mirror also corrects for small energy differences among ions. Both of these factors contribute to an increase in mass resolution.

(c) Ion Trap: Ion trap mass analysers function to trap molecular ions in a 3-D electric field. In contrast to a quadrupole mass analyser, in which ions are discarded before the analysis begins, the main advantage of an ion trap mass analyser is the ability to allow ions to be stored and then selectively ejected from the ion trap, increasing sensitivity.

Types of Mass Spectrometers

Most mass spectrometers consist of four basic elements: (1) an ionisation source, (2) one or more mass analysers, (3) an ion mirror and (4) a detector. The names of the various instruments are derived from the name of their ionisation source and the mass analyser. Some of the most common mass spectrometers are discussed hereunder. The analysis of proteins or peptides by MS can be divided into two general categories: (1) peptide mass analysis and (2) amino acid sequencing. In peptide mass analysis or peptide mass fingerprinting, the masses of individual peptides in a mixture are measured and used to create a mass spectrum. In amino acid sequencing, a procedure known as tandem mass spectrometry, or MS/MS, is used to fragment a specific peptide into smaller peptides, which can then be used to deduce the amino acid sequence.

(a) Triple Quadrupole: Triple-quadrupole mass spectrometers are most commonly used to obtain amino acid sequences. In the first stage of analysis, the machine is operated in MS scan mode, and all ions above a certain m/z ratio are transmitted to the third quadrupole for mass analysis. In the second stage, the mass spectrometer is operated in MS/MS mode, and a particular peptide ion is selectively passed into the collision chamber. Inside the collision chamber, peptide ions are fragmented by interactions with an inert gas by a process known as collision-induced dissociation or collisionally activated dissociation. The peptide ion fragments are then resolved on the basis of their m/z ratio by the third quadrupole. Since two different mass spectra are obtained in this analysis, it is referred to as tandem mass spectrometry (MS/MS). MS/MS is used to obtain the amino acid sequence of peptides by generating a series of peptides that differ in mass by a single amino acid.

(b) Quadrupole-TOF: Several hybrid mass spectrometers have emerged from the combination of different ionisation sources with mass analysers. One example is the quadrupole-TOF mass spectrometer. In this machine, the first quadrupole (Q) and the quadrupole collision cell (q) of a triple-quadrupole machine have been combined with a time-of-flight analyser (TOF). The main applications of a QqTOF mass spectrometer are protein identification by amino acid sequencing and characterisation of protein modifications. However, because it is coupled to electrospray, it is not typically utilised for large-scale proteomics.

(c) MALDI-TOF: The principal application of a MALDI-TOF mass spectrometer is peptide mass fingerprinting because it can be completely automated, making it the method of choice for large-scale proteomics work. Because of its speed, MALDI-TOF is frequently used as a first-pass instrument for protein identification. If proteins cannot be identified by fingerprinting, they can then be analysed by electrospray and MS/MS. A MALDI-TOF machine can also be used to obtain the amino acid sequence of peptides by a method known as post-source decay. However, peptide sequencing by post-source
decay is not as reliable as sequencing with competing electrospray methods because the peptide fragmentation patterns are much less predictable.

(d) MALDI-QqTOF: The MALDI-QqTOF mass spectrometer was developed to permit both peptide mass fingerprinting and amino acid sequencing. It was formed by the combination of a MALDI ion source with a QqTOF mass analyser. Thus, if a sample is not identified by peptide mass fingerprinting in the first step, the amino acid sequence can then be obtained without having to use a different mass spectrometer. However, the amino acid sequence information obtained using this instrument was more difficult to interpret than that obtained from a nanospray-QqTOF mass spectrometer.

(e) FT-ICR: A Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometer is an ion-trapping instrument that can achieve higher mass resolution and mass accuracy than any other type of mass spectrometer. Recently, FT-ICR has been employed in the analysis of biomolecules ionised by both ESI and MALDI. The unique abilities of FT-ICR provide certain advantages compared to other mass spectrometers. For example, because of its high resolution, FT-ICR can be used for the analysis of complex mixtures. FT-ICR, coupled to ESI, is also being employed in the study of protein interactions and protein conformations. A high-throughput, large-scale proteomics approach involving FT-ICR has recently been developed.

Peptide Fragmentation

As peptide ions are introduced into the collision chamber, they interact with the collision gas (usually nitrogen or argon) and undergo fragmentation primarily along the peptide backbone. Since peptides can undergo multiple types of fragmentation, nomenclature has been created to indicate what type of ions has been generated. If, after peptide bond cleavage, the charge is maintained on the N-terminus of the ion, it is designated a b-ion, whereas if the charge is maintained on the C terminus, it is a y-ion. The difference in mass between adjacent y- or b-ions corresponds to that of an amino acid. This can be used to identify the amino acid and hence the peptide sequence, with the exception of isoleucine and leucine, which are identical in mass and therefore indistinguishable. In addition to fragmentation along the peptide backbone, cleavage can occur along amino acid side chains, and this information can be used to distinguish isoleucine and leucine.

De Novo Peptide Sequence Information

Another approach to protein identification is to obtain de novo sequence data from peptides by MS/MS and then use all the peptide sequences to search appropriate databases. Multiple peptide sequences can be used for protein identification by searching databases with the FASTS program. The single biggest advantage of this method is the capability of searching peptide sequence information across both DNA and protein databases. This is because the search engine utilised exhibits a certain amount of flexibility in the assignment of protein scores. This search method is useful for organisms that do not have well-annotated databases. However, because this method requires several peptide amino acid sequences of three or four amino acids, it is not the first choice for peptide identification. Rather, the much faster methods of peptide mass fingerprinting or peptide mass tag searching can be used first. If these search methods fail, de novo sequence information can be obtained and used to identify the protein.

Uninterpreted MS/MS Data Searching

A large number of programs are now available for the identification of proteins by using uninterpreted MS/MS data. Examples include programs such as Mascot, SONAR and SEQUEST. However, searches against unannotated or untranslated DNA databases with uninterpreted
MS/MS data are likely to suffer from the same pitfalls associated with mass fingerprinting. In particular, polymorphisms, sequencing errors and conservative substitutions will probably contribute to failure to accurately identify a protein. The development of uninterpreted MS/MS search algorithms that are error tolerant may overcome some of these shortcomings, provided that they assign some form of statistical scoring to the identified proteins.

Proteomics Approach to Protein Phosphorylation

Posttranslational modification of proteins is a fundamental regulatory mechanism, and characterisation of protein modifications is paramount for understanding protein function. MS is one of the most powerful tools for the analysis of protein modifications because virtually any type of protein modification can be identified. Although we focus here on protein phosphorylation, the analysis of other types of protein modification by MS can also be done.

Protein phosphorylation is one of the most common of all protein modifications and has been found in nearly all cellular processes. MS can be used to identify novel phosphoproteins, measure changes in the phosphorylation state of proteins in response to an effector and determine phosphorylation sites in proteins. Identification of phosphorylation sites can provide information about the mechanism of enzyme regulation and the protein kinases and phosphatases involved. A proteomics approach to protein phosphorylation has the advantage that instead of studying changes in the phosphorylation of a single protein in response to some perturbation, one can study all the phosphoproteins in a cell (the phosphoproteome) at the same time. A common approach to studying protein phosphorylation events is the use of in vivo labelling of phosphoproteins with inorganic ³²P. The phosphoproteomes of cells that differ in some way (e.g. normal vs. water stressed) can be analysed by growing cells in inorganic ³²P and creating cell lysates. Changes in the phosphorylation state of proteins can then be examined by 2-DE and autoradiography. Proteins of interest are excised from the gel and microsequenced by MS. A major limitation of this approach is that while many phosphorylated proteins can be visualised by autoradiography, they cannot be identified because of their low abundance. One solution to this problem is enrichment of the phosphoproteome.

Phosphoprotein Enrichment

Enrichment of the phosphoproteome of a cell can allow the identification of low-copy phosphoproteins that would otherwise go undetected. In one approach, phosphoproteins were enriched by conversion of phosphoserine residues to biotinylated residues. This method is an extension of techniques originally developed by Hielmeyer and colleagues. Following derivatisation, proteins that were formerly phosphorylated can be isolated by avidin affinity chromatography. Proteins immobilised on avidin beads can then be eluted with biotin, theoretically resulting in the isolation of the entire phosphoserine proteome. By increasing the amount of cell lysate used for avidin affinity chromatography, low-abundance phosphoproteins can be enriched. However, this technique does not work for phosphotyrosine, and the reactivity of phosphothreonine by this method is very poor. Tyrosine-phosphorylated proteins can be isolated by the use of antiphosphotyrosine antibodies. As an alternative, another method for phosphopeptide enrichment was devised to allow the recovery of proteins phosphorylated on serine, threonine and tyrosine. In this method, a protein or mixture of proteins is digested to peptides with a protease and then subjected to a multistep procedure for the conversion of phosphoamino acids into free sulfhydryl groups. To capture the derivatised peptides, the free sulfhydryl groups in the peptides are then reacted with iodoacetyl groups immobilised on glass beads. Enrichment of the phosphoproteome can also be combined with protein profiling by 1- or 2-DE. In this way, changes in protein amount observed on electrophoresis will reflect the level of protein phosphorylation. Thus, the principle of
protein quantitation by ICAT can be combined with phosphoprotein enrichment.

Phosphorylation Site Determination by Edman Degradation

Edman sequencing is still a widely used method for determining phosphorylation sites in
proteins labelled with ³²P, either in vitro or in vivo. This is because sites can be
determined at the sub-femtomolar level if enough radioactivity can be incorporated into the
phosphoprotein of interest. This can be as little as 1,000 cpm (which is not ideal). Briefly,
a ³²P-labelled protein is digested with a protease, and the resulting phosphopeptides are
separated and purified by reverse-phase HPLC or thin-layer chromatography (TLC). The isolated
peptides are then cross-linked via their C termini to an inert membrane (e.g. Immobilon P,
PerSeptive Biosystems). The radioactive membrane is subjected to several rounds of Edman
cycles, and radioactivity is collected after each cleavage step. The released ³²P is counted
in a scintillation counter. This method positionally places the phosphoamino acid within the
sequenced phosphopeptide. Of course, this is meaningful only if the sequence of the
phosphopeptide is already known. In addition, the analysis ceases to be quantitative beyond
30 Edman cycles (even with efficient, modern Edman machines) due to well-understood issues
with repetitive yield associated with Edman chemistry.

Phosphorylation Site Determination by Mass Spectrometry

Because of its sensitivity, MS can allow the direct sequencing of phosphopeptides, resulting
in unambiguous phosphorylation site identification. Below, a brief overview of some common
methods for phosphorylation site determination by MS is given. Identification of
phosphorylation sites in proteins presents several unique challenges for the mass
spectrometrist. For example, unlike in protein identification, where analysis of any peptide
within the protein can be informative, phosphorylation site analysis requires that the
phosphorylated peptide be analysed. This means that considerably more protein is required for
analysis. In addition, phosphorylation can alter the cleavage pattern of a protein, and the
resulting phosphopeptides may require different purification methods. To isolate and purify
the phosphopeptides of interest, it may be necessary to alter the way in which the
phosphoprotein is digested and to alter the pH or the chromatographic material used for
peptide purification.

1. Phosphopeptide Sequencing by MS/MS
   A combination of HPLC, Edman degradation and phosphopeptide sequencing by MS/MS provides
   the best results for phosphorylation site determination. Following excision and digestion
   of a ³²P-labelled protein, the peptides are resolved by HPLC. By monitoring HPLC fractions
   for radioactivity, the phosphopeptides can be selected for analysis. This reduces the
   complexity of the peptide mixture before MS is performed and facilitates phosphopeptide
   identification. Phosphopeptides can be identified from a mixture of peptides by a method
   known as precursor ion scanning. Peptides are sprayed under neutral or basic conditions,
   and phosphopeptides are identified in the precursor ion scan. Once a phosphopeptide is
   identified, the peptide mixture is sprayed under acidic conditions, and the phosphopeptide
   is sequenced by conventional tandem MS/MS. On fragmentation of the phosphopeptide,
   phosphoserine and phosphothreonine can be identified by the formation of elimination
   products.
2. Analysis of Phosphopeptides by MALDI-TOF
   MALDI-TOF mass spectrometry can also be used to identify phosphopeptides. When
   phosphorylated peptides are subjected to ionisation by MALDI, phosphate groups are
   frequently liberated from the peptides. This is the case for phosphoserine- and
   phosphothreonine-containing peptides, which can liberate HPO₃ or H₃PO₄, resulting in a
   neutral loss of 80 and 98 Da, respectively. Careful examination of the TOF spectrum for
   differences in peptide masses of 80 Da that are not found in the
234 10 Curtain Raiser to Novel MAS Platforms

unphosphorylated peptide control can identify phosphopeptides. Phosphopeptides can also be
identified by treating one of two identical samples with protein phosphatase to liberate
phosphate groups. Once a phosphopeptide is identified, it can be sequenced by MS/MS for
identification of the phosphorylation site.

Metabolite Profiling Technologies

Two techniques dominate metabolite profiling strategies: (1) mass spectrometry (MS) and (2)
nuclear magnetic resonance (NMR). Metabolomics, or the more modestly termed metabolite
profiling, has been carried out since the mid-1970s but only became a standard laboratory
technique after 2000. The following paragraphs give short definitions of the techniques and
their relative advantages and disadvantages.
Gas-chromatography-mass-spectrometry (GC-MS), gas-chromatography-time-of-flight-mass-
spectrometry (GC-TOF-MS) and liquid-chromatography-mass-spectrometry (LC-MS) are currently
the standard mass-spectrometry methods for metabolite analyses. GC-MS technologies enable the
identification and robust quantification of a few hundred primary metabolites within a single
extract. The main advantage of this instrument stems from the fact that it has long been used
for metabolite profiling, and, therefore, there are stable protocols for machine set-up,
maintenance and usage. GC-TOF-MS offers several advantages, most notably fast scan times,
which give rise to either improved peak deconvolution (the ability to resolve partially
co-eluting peaks) or higher sample throughput. Compared with GC-MS technologies, LC-MS offers
several distinct advantages, chiefly its adaptability to measure a far broader range of
metabolites encompassing both primary and secondary metabolites. However, LC-MS usually uses
electrospray ionisation, which is prone to ion suppression (i.e. the competition of
co-eluting entities for ionisation energy), making it important to validate novel
applications of this type of instrumentation. In addition to these machines, use of capillary
electrophoresis-mass spectrometry (CE-MS) and Fourier transform ion cyclotron resonance mass
spectrometry (FT-ICR-MS) systems has been demonstrated for metabolite profiling. The first of
these, CE-MS, is a highly sensitive methodology that can detect low-abundance metabolites and
that provides good analyte separation, whereas the second, FT-ICR-MS, relies solely on very
high-resolution mass analysis, which potentially enables the measurement of the empirical
formula for thousands of metabolites; however, it is somewhat limited by the lack of
chromatographic separation. NMR approaches, which rely on the detection of magnetic nuclei of
atoms after application of a constant magnetic field, are the main alternative to MS-based
approaches for metabolite profiling. These are well-developed and well-validated methods, and
the computer software associated with NMR instrumentation is, consequently, also advanced.
Furthermore, despite limitations in its sensitivity and, therefore, in metabolite coverage,
NMR retains an advantage over MS-based approaches for certain biological questions. For
example, it can be used non-invasively (i.e. on living cells), and because the pH of the
vacuole is different from that found elsewhere in the cell, NMR can provide subcellular
information. It is also easier to derive atomic information for flux modelling from NMR than
from MS-based approaches.

Physiological Techniques

A number of physiological criteria (including physiological traits determining yield under
normal and unfavourable environments and the genetic basis of such physiological traits) need
to be evaluated before starting a molecular breeding programme. The use of physiological
traits as an indirect selection index for yield (such as tillering, xylem vessel diameter,
leaf dimensions, stomatal or cuticular water loss, harvest index) in breeding programmes has
been discussed elsewhere. As in the previous sections, only a few physiological techniques
are explained below, though a large array of techniques is available to increase the
efficiency of QTL mapping and MAS.
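
The idea of using a physiological trait as an indirect selection index for yield can be made concrete with a standard result from quantitative genetics, given here as an illustration and not taken from this chapter. All symbols are introduced for this sketch: i is the selection intensity (assumed equal for both traits), h_X and h_Y are the square roots of the narrow-sense heritabilities of the physiological trait X and of yield Y, r_A is their additive genetic correlation and σ_PY is the phenotypic standard deviation of yield.

```latex
\begin{align}
R_Y  &= i\, h_Y^{2}\, \sigma_{P_Y}
      && \text{(response to direct selection on yield)} \\
CR_Y &= i\, h_X\, h_Y\, r_A\, \sigma_{P_Y}
      && \text{(correlated response in yield when selecting on } X\text{)} \\
\frac{CR_Y}{R_Y} &= \frac{r_A\, h_X}{h_Y}
\end{align}
```

Under these assumptions, indirect selection is expected to outperform direct selection on yield only when r_A h_X > h_Y, that is, when the physiological trait is both more heritable than yield and strongly genetically correlated with it.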
The global water shortage caused by an increasing world population and worldwide climate
change is considered one of the major challenges in agriculture. The combined and continued
impact of drought, salinity and high temperature impairs photosynthesis during the daytime
and increases surface temperatures at night, which in turn increases photorespiratory losses
and thus reduces productivity. Elevated greenhouse gas concentrations may lead to a general
drying of the subtropics. Thus, the convergence of population growth and variable climate is
expected to threaten global food security. This forces scientists to develop drought-suited
varieties through molecular breeding and genetic modification approaches. However, it is
clear that the demand to produce sufficient major food crops (wheat, rice and maize) for the
growing population has always been increasing. Hence, optimising yield stability for these
major crops and locally important crops is essential. Therefore, maintaining food security
in this scenario will require systematic approaches including advances in physiological
approaches.
The physiological dissection of complex traits like drought, salinity or nutrient stress
tolerance is a first step to understand the genetic control of tolerance and will ultimately
enhance the efficiency of MAS strategies. Developing and integrating a gene-to-phenotype
concept in crop improvement requires particular attention to phenotyping and ecophysiological
modelling, as well as the identification of stable candidate genomic regions through novel
concepts of genetical genomics (see chapter 7). Knowledge of both the plant physiological
response and integrative modelling is needed to tackle the confounding effects associated
with environment and gene interaction. To maximise the impact of using specific physiological
traits, breeding strategies require a detailed knowledge of the environment where the crop is
grown, genotype × environment interactions and fine tuning of the genotypes suited for local
environments. A physiological approach has an advantage over empirical breeding for yield per
se because it increases the probability of crosses resulting in additive gene action for
stress adaptation, provided that the germplasm is characterised more thoroughly for
physiological traits than for yield alone.
The use of physiological traits in a breeding programme, either by direct selection or
through a surrogate such as molecular markers, depends on their relative genetic correlation
with yield, extent of genetic variation, heritability and genotype × environment
interactions. For instance, in drought environments, osmotic adjustment, accumulation and
remobilisation of stem reserves, superior photosynthesis, heat- and desiccation-tolerant
enzymes, etc. are important physiological traits. However, it is important to establish their
heritability and genetic correlation with yield in target environments. Identification of
physiological traits and mechanisms is time consuming and costly; however, if successful, the
benefits are likely to be substantial. Information on important physiological traits can be
collected on potential parental lines by screening an entire crossing block, or a set of
commonly used parents, thus producing a catalogue of useful physiological traits. This
information can be used strategically in designing crosses, thereby increasing the likelihood
of transgressive segregation events, which bring together desirable traits. However, if
enough resources are available, screening for physiological traits could be applied to
segregating generations in yield trials, or any intermediate stage, depending on when genetic
gains from selection are optimal. It is important to note that breeding strategies using
specific traits are effective only when these traits are properly defined in terms of the
stage of crop development at which they are relevant, the specific attributes of the target
environment for which they are adaptive and their potential contribution to yield. For
example, the early escape from progressively intensifying moisture stress, through the
manipulation of plant phenology, is the most commonly exploited genetic strategy used to
ensure relatively stable yields under terminal drought conditions. When significant genetic
diversity for a physiological trait in a germplasm collection for the given species is
established, it is imperative that the relevance of the trait as a selection criterion be
determined. The precise
phenotyping of physiological traits often requires the utilisation of sophisticated and
expensive techniques, and the techniques used to characterise drought tolerance-specific
physiological traits are explained here.

Near-Infrared (NIR) Spectroscopy

This method provides spectral information corresponding to the field plot in a single
near-infrared spectrum, where physical and chemical characteristics of the harvested seed
material are captured. By using calibration models (i.e. mathematical and computational
operations that relate the spectral information with phenotypic values), several traits can
be determined on the basis of a single spectrum (dry matter, protein, nitrogen, starch and
oil content, grain texture and grain weight, etc.). The use of NIR spectroscopy on
agricultural harvesters provides indexing of grain characteristics. In contrast to
conventional sample-based methods, NIR spectroscopy on agricultural harvesters secures a good
distribution of measurements within plots and covers substantially larger amounts of plot
material, thus reducing sampling error and providing more representative measurements of the
plot material in terms of homogeneity.

Canopy Spectral Reflectance (SR) and Infrared Thermography (IRT)

Spectral reflectance of plant canopy is a non-invasive phenotyping technique that enables
several dynamic complex traits, such as biomass accumulation, to be monitored with high
temporal resolution. It has many advantages, including easy and quick measurements and
integration at the canopy level; additional parameters such as photosynthetic capacity, leaf
area index, intercepted radiation and chlorophyll content can also be measured simultaneously
via a series of diverse spectral indices. Plant water status as determined by plant water
content or water potential integrates the effects of several drought-adaptive traits. Several
methods are used to determine crop water content, including leaf water potential, leaf
stomatal conductance and canopy temperature, which is the relative measure of water flow
associated with water absorption from the soil under water deficit. In addition to the above,
one of the most commonly used indirect techniques for measurement of these variables is
thermal infrared imaging, or infrared thermography, which involves the measurement of leaf or
canopy temperature. Plant canopy temperature is a widely measured variable that is closely
related to canopy conductance at the vegetative stage and therefore provides insight into
plant water status.
One of the high-throughput integrated phenotyping platforms, which includes a pipeline of
imaging, automated image processing and data handling modules, was developed by LemnaTec, a
German company (http://www.lemnatec.com). The platform has the capacity to measure almost
unlimited sets of parameters easily, allows comprehensive screening and provides statistics
on various plant traits in a dynamic way. Depending on the degree of automatisation, plants
are manually placed in the Scanalyzer 3-D or transported on conveyor belts directly from the
greenhouses to the imaging chambers. Such chambers provide top and side imaging of both shoot
and root systems to quantify plant height/width, biomass and plant architecture. Application
of different camera and acquisition modes, from visual light to near-infrared (NIR/SWIR),
infrared (IR) and fluorescence imaging, opens new perspectives for visualisation using
non-destructive quantification. The key application is in the fast developing domain of plant
functional genomics. These automated systems will increase our understanding of plant growth
kinetics and help improve plant models for systems biology or breeding programmes.

Estimation of Compatible Solutes

Under osmotic stress, an important consideration is to accumulate osmotically active
compounds called osmolytes in order to lower the osmotic potential. These are referred to as
compatible metabolites because they do not apparently interfere with the
normal cellular metabolism. Molecules like glycerol and sucrose were discovered by empirical
methods to protect biological macromolecules against the damaging effects of salinity. Later,
a systematic examination of the molecules that accumulate in halophytes and halotolerant
organisms led to the identification of a variety of molecules also able to provide
protection. Characteristically, these molecules are not highly charged, but are polar, highly
soluble and have a large hydration shell. Such molecules will be preferentially solubilised
in the bulk water of the cell where they could interact directly with the macromolecules. The
biochemical pathways producing them are now better known, and there are several sophisticated
methods to estimate such compounds. Genes for the rate-limiting steps of these pathways have
been cloned and transferred into crop plants to raise the level of osmolytes. Osmolytes for
which some progress has been made are indicated in Table 10.1.

Table 10.1 Important osmolytes that accumulate in plants during drought and salinity

Carbohydrate     Nitrogenous compound   Organic acid
Sucrose          Proteins               Oxalate
Sorbitol         Betaine                Malate
Mannitol         Glutamate
Glycerol         Aspartate
Arabinitol       Glycine
Pinitol          Choline
Other polyols    Putrescine

To sum up, the techniques and platforms mentioned above will greatly improve phenotyping
accuracy and throughput, thus contributing to a better elucidation of the genetic control of
complex physiological traits in plants. However, many of the techniques discussed above are
applied to plants grown under controlled conditions that may not reflect the field
environment or can only be used to assess a limited number of genotypes due to high costs
and/or practicality. Therefore, to overcome this problem, multitiered selection screens have
been proposed, in which a simple but less accurate screen allows a large number of genotypes
to be evaluated (first screen), followed by tiers of more sophisticated screens of decreasing
numbers of genotypes. A three-tiered sequence of physiological screens has already been used
to identify candidate parental genotypes for breeding programmes for some key traits like
nitrogen fixation activity during soil water deficit in soybean. Furthermore, bringing
integrative phenotyping technology, such as that developed by LemnaTec, from controlled
environments to the field will improve the assessment of plant responses to environmental
stimuli while enabling high-throughput screening and generating comprehensive and accurate
phenotypic data.

Genomics-Assisted Breeding

A number of resources for major crop species, including detailed, high-density genetic maps,
cytogenetic stocks, contig-based physical maps and deep-coverage, large-insert libraries, are
now available to the public. These tools have facilitated the isolation of genes via
map-based cloning, the localisation of quantitative trait loci (QTLs) and the sequencing and
annotation of large genomic DNA fragments in several plant species. Complete genome sequences
of plants such as Arabidopsis and rice have become available through public databases.
Further, whole-genome or gene space sequencing projects for several plant species such as
maize (http://www.maizegenome.org/), sorghum, wheat (http://www.wheatgenome.org/), tomato
(http://sgn.cornell.edu/help/about/tomato_sequencing.html), tobacco
(http://www.intl-pag.org/13/abstracts/PAG13_P027.html), poplar
(http://genome.jgi-psf.org/Poptr1/), Medicago (http://www.medicago.org/genome/) and lotus
(http://www.kazusa.or.jp/lotus/) are now ready to use. The widespread use of transcriptome
sampling strategies is a complementary approach to genome sequencing and has resulted in a
large collection of expressed sequence tags (ESTs) for almost all the important plant species
(http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). Comparative sequence analysis can be
used in some cases to facilitate isolation of genes in species lacking ESTs. However, EST
resources have some limitations, such as unidentified contaminants, chimeric sequences,
multiple forms in polyploids (homoeoalleles) and putatively non-functional transcripts.
Moreover, they lack untranscribed regulatory factors, and some genes are underrepresented.
One of the hallmarks of genomics research has been the discovery of new mechanisms
contributing to genome evolution. Bioinformatics facilitates both the analysis of genomic and
post-genomic data and the integration of data from the related fields of transcriptomics,
proteomics, metabolomics and phenomics. Several bioinformatic tools and databases have been
developed for DNA sequence analysis, marker discovery and querying and analysing information.
Enhanced bioinformatic tools, genome databases and integration of information from different
fields enable the identification of genes and gene products and can elucidate the functional
relationships between genotype and observed phenotype. Probably the most important future
prospect is the enhancement of visualisation tools that extend beyond simple relationships
and help us to interpret more clearly the complex multidimensional biological networks of
genes and their relationships to phenotypes. Metabolomics approaches enable the parallel
assessment of the levels of a broad range of metabolites and have been documented to have
great value in both phenotyping and diagnostic analyses in plants. These tools have recently
been turned to evaluation of the natural variance apparent in metabolite composition.
Such advances in genomics can contribute to crop improvement in two general ways. First, a
better understanding of the biological mechanisms can lead to new or improved screening
methods for selecting superior genotypes more efficiently. Second, new knowledge can improve
the decision-making process for more efficient breeding strategies, which is broadly termed
genomics-assisted breeding.

Functional Markers

During the past decades, molecular mapping has identified chromosome regions carrying
important genes in crop plants using SSR, RFLP, AFLP, RAPD, DArT and other markers. However,
these usually neutral genetic markers can be some distance from the targeted genes and thus
are often population specific or parent related, and their predictive value depends on the
degree of linkage between markers and target locus alleles in specific populations. As a
result, relatively few linked markers are used in breeding. In contrast, functional or
gene-specific markers are developed from functional gene sequences: they are derived from
polymorphic sites within candidate genes that are directly associated with phenotypic
variation, accurately discriminate alleles at one locus and represent ideal markers for MAS
in breeding. A candidate gene is defined as a gene that has been identified as related to a
particular trait (phenotype, disease or condition). Candidate genes in general can be divided
into two categories: positional and functional. A positional candidate gene is one that might
be associated with a trait, based on the location of the gene on a chromosome. A functional
candidate gene is one whose function has something in common biologically with the trait
under investigation. Positional candidate genes are identified through QTL- and map-based
cloning approaches, whereas functional genomics approaches such as transcriptomics and
expression genetics provide the set of functional candidate genes.
Functional markers have advantages over random DNA markers, because they are diagnostic of
the desired trait allele. Many new crop-specific genes have been cloned during the past
years, and the corresponding functional markers have been developed and used in MAS. For
example, more than 30 loci (genes) have been cloned in common wheat and its relatives, and 97
functional markers for wheat processing quality, agronomic traits and disease resistance
genes have been developed and used to identify those alleles (Liu et al. 2012). Knowledge of
marker-trait association is a prerequisite for marker-assisted selection. SNPs and InDels are
the most abundant forms of DNA sequence variation in crop plants, and this was confirmed with
cloned genes and amplicons. Large-scale genome sequencing and associated bioinformatics are
becoming widely accepted research tools for accelerating the analysis of crop genome
structure and function. Second-generation DNA sequences from several crops provide an
opportunity to use genomic information to clone
genes and develop SNP markers. Rapid progress is now being achieved in assembling the DNA
sequences from individual chromosome arms of crop plants, and this progress provides a
template for defining the functional markers (FMs) for future use. High-quality genome
sequences integrated with molecular genetic maps provide the basis for identifying duplicated
genes, analysing promoter regions in detail, defining SNPs/InDels and aligning the
transcriptome with the genome. These advances will allow gene networks to be clearly defined
and thus allow meaningful functional markers to be developed for complex traits. Extensive
proteomic studies have allowed identification of many allelic variants, and genomic analyses
identified several markers for discriminating alleles at one locus. These successes have
indicated that it is now essential to establish rapid, convenient and economical PCR-based
assays in crop breeding. In order to detect genes simultaneously in a single PCR, multiplex
PCR can be developed, in which several markers in the same reaction mix are co-amplified
under identical conditions. However, a clear challenge in multiplexing markers is for the
different primers to have similar annealing temperatures and for the expected PCR products to
be easily separated on agarose gels. If alleles conferring specific resistance are being
sought, it is important to know which alleles are effective and potentially useful to local
breeding programmes. However, more functional markers are needed for important traits such as
disease and stress resistance in order to strengthen the application of molecular markers in
breeding programmes. SNPs are the most applicable markers for high-throughput screening once
the genotype-phenotype associations are determined. The use of these markers will expand as
high-throughput techniques for MAS based on functional SNP markers are developed and DNA
chips are produced for efficient analysis.

Comparative Genomics

The number of sequenced plant genomes and associated genomic resources is growing rapidly
with the advent of both an increased focus on plant genomics from funding agencies and the
application of inexpensive next-generation sequencing. It seems certain that with the
sequencing of major crop plants, followed by the assigning of function to these sequences
(drafts), there is a lot of information for applications of genomics in other orphan species
as well. This assignment is based on the fact that a significant degree of synteny exists
between plant species, as revealed by several comparative genetic mapping experiments.
Comparative genomics is the study of the relationship of genome structure and function across
different biological species or strains. It is an attempt to take advantage of the
information provided by the signatures of selection to understand the function and
evolutionary processes that act on genomes. While it is still a young field, it holds great
promise to yield insights into many aspects of the evolution of modern crop species. For
example, conservation of gene order and content has been detected between Arabidopsis and
other dicot species, such as the cultivated Brassica species, tomato and soybean. Within the
monocots also, especially the cereals, extensive colinearity has been observed by comparative
mapping of the genomes using genetic markers. This phenomenon of macrocolinearity was first
established between seven grass species, with rice as the reference genome, and was
represented in the form of a graphical consensus map that is popularly known as the circle
diagram. This map has been refined to embrace more grass species whose genomes are described
using several rice linkage blocks (visit www.gramene.org for more information). Altogether
these studies give the general impression that all the grasses examined have similar gene
order despite the large differences in DNA content or chromosome number. Microcolinearity, or
the conservation of gene order at the sub-megabase level, is also observed to be extensive
but has frequent deviations which can be attributed to small-scale rearrangements, deletions
or even local gene amplification and translocation. This has been examined not only between
sorghum and maize but also between rice and other crop plants as well as between rice
subspecies. The absence of microcolinearity as compared to the recombinational map level has
also been
confirmed by comparison of small segments of the rice genome sequence with some cereals. In
particular, use of wheat chromosome bin-mapped ESTs with the rice genome sequence has
predicted that the order of rice genes in relation to the wheat genome could emerge as a
complex pattern, and its utility for synteny-based analysis/application remains to be
assessed. Nevertheless, the rice genome has come forth as a relatively stable genome compared
to other cereals, which have faced most of the rearrangements during evolution. Various
investigations have also revolved around the idea of colinearity between monocot and dicot
plants. However, because the rice genome is about four times larger and contains more than
twice as many genes as that of Arabidopsis, the two may show limited synteny. The low level
of synteny between Arabidopsis and rice might not be adequate for applications in map-based
cloning strategies as well as for integration of functional and structural genomic data
across the monocot or dicot divide, but a detailed study of the genomic data of both plants
could provide answers to questions related to the structure and evolution of genomes. On the
other hand, the high level of genome colinearity between plant species belonging to the same
family can be exploited to carry out fine mapping and map-based cloning experiments,
especially in the case of crop plants having large genomes. As in the cereals, the genetic
mapping of an agronomically important locus is carried out with the large genome followed by
cloning using information from the closely related model organism such as rice.
The major benefits of comparative genomics are twofold: (1) Using computer-based analysis to
zero in on the genomic features that have been preserved in multiple organisms over millions
of years, researchers will be able to pinpoint the signals that control gene function, which
in turn should translate into innovative approaches for treating human disease and improving
human health, and (2) in addition to its implications for human health and well-being,
comparative genomics may benefit the plant world as well. As sequencing technology grows
easier and less expensive, it will likely find wide applications in agricultural
biotechnology as a tool to tease apart the often-subtle differences among animal species.
Such efforts might also possibly lead to the rearrangement of our understanding of some
branches on the evolutionary tree, as well as point to new strategies for conserving rare and
endangered species.

Identification of Novel Molecular Networks and Construction of New Metabolic Pathway

Despite extensive knowledge of fundamental metabolic processes, the mechanisms of
physiological modulation over short and extended time intervals in response to changing
environmental conditions remain difficult to understand. What is more, the mere existence of
some plant metabolites such as trehalose still puzzles us. Correspondingly, investigation of
metabolic network regulation upon genetic or environmental perturbations may be viewed as a
necessity for pathway discovery and functional genomics. There is a long tradition of, and
extensive knowledge about, metabolite analysis. In fact, metabolite analysis can be better
understood by distinguishing among levels on the basis of its objectives. Four levels can be
identified. First, there is metabolite target analysis, which utilises specialised protocols
for the analysis of difficult analytes such as phytohormones. Second, metabolite profiling
aims at quantitation of several predefined targets (e.g. of all metabolites of a specific
pathway or a set of metabolites typical for different pathways). Third, metabolomics has the
ultimate goal of unbiased identification and quantitation of all the metabolites present in a
certain biological sample from an organism grown under defined conditions. Fourth, there is
metabolic fingerprinting, which, instead of separating individual metabolites by physical
parameters, focuses on collecting and analysing data from crude metabolite mixtures to
rapidly classify samples. Among these four approaches, metabolomics seems to be best suited
for investigation of metabolic networks, because it focuses on quantifying individual
metabolites without the bias in the choice of targets that is inherent in metabolite
profiling.
Ideally, metabolomic data should accurately describe physiological processes as responses to developmental, genetic or environmental changes. However, some theoretical considerations limit direct interpretation of metabolic networks generated from metabolic snapshots. First, any subcellular compartmentalisation is lost in the process of sample preparation. Although mRNA or protein expression levels can sometimes be ascribed to plant compartments on the basis of their target sequences, there is a high degree of uncertainty about the actual location of metabolites, many of which may occur simultaneously (and for potentially different purposes) in different locations and in varying amounts. Therefore, metabolomic information can be interpreted on the multicellular, tissue or organ level. If metabolite analysis of subcellular compartments is the goal, large amounts of tissue must be used for the parallel determination of enzyme activities for ascribing cellular compartments to density fractions. Because plant metabolomes are so complex, many of the detected metabolites will remain structurally unidentified until being elucidated by de novo identification, which is much more difficult than the identification of transcripts or proteins. Finally, the question arises of how to correlate metabolite levels under different situations if they only relate to multiple steady states without any kinetic experimental design that could guide interpretation. Most often, average metabolite levels are used for deducing novel insights into plant physiology. This strategy again results in a loss of information, however, as metabolomic data from individual snapshots can be regarded to be as reliable as proven by the initial method validation tests. Any variation found in a homozygous plant population therefore indicates responses to subtle differences in plant development or physiology for each individual plant. This variation must have biological causes reflecting the flexibility of metabolic networks in the studied populations. It can, therefore, be used to calculate pathways by comprehensive pairwise metabolite correlation plots. In this way, stoichiometrically feasible metabolic networks could be computed for a variety of organisms. Such networks would enable researchers to predict the effect of knockout mutations and novel metabolic pathways. Besides allowing comparison with experimentally established metabolic networks, the inherent characteristics of topological metabolic networks could be investigated to compare structural differences in network organisation and thus improve our understanding of key metabolites and the effects of random mutations in biological systems.

An understanding of metabolic networks might be further improved by an integration of static enzyme stoichiometry networks and inherent network characteristics. Eventually, the combination of metabolomic analysis with other profiling technologies, especially proteomics and integrative techniques like metabolic control analysis, could enable novel pathway discovery and aid the evaluation of changes in plant networks produced by genetic or environmental changes.

Bioinformatics for MAS

Bioinformatics refers to the study of biological information using concepts and methods in computer science, statistics and engineering. It can be divided into two categories: biological information management and computational biology. Bioinformatics plays an essential role in today's plant science. As the amount of data grows exponentially, there is a parallel growth in the demand for tools and methods in data management, visualisation, integration, analysis, modelling and prediction. At the same time, many researchers in biology are unfamiliar with available bioinformatics methods, tools and databases, which could lead to missed opportunities or misinterpretation of the information. Here, an attempt has been made to list only a few commonly used bioinformatics tools that may have a potential role in MAS. Of course, this list is not exhaustive; no one can prepare such a complete list because of the rapid developments in bioinformatics.

Biological sequences, such as DNA, RNA and protein sequences, are the most fundamental objects for a biological system at the molecular level. Advances in sequencing technologies provide
opportunities in bioinformatics for managing, processing and analysing the sequences. Shotgun sequencing (see above) is currently the most common method in genome sequencing: pieces of DNA are sheared randomly, cloned and sequenced in parallel. Software has been developed to piece together the random, overlapping segments that are sequenced separately into a coherent and accurate contiguous sequence. Numerous software packages exist for sequence assembly, including Phred/Phrap/Consed (http://www.phrap.org), Arachne (http://www.broad.mit.edu/wga/) and GAP4 (http://staden.sourceforge.net/overview.html). The Institute for Genomic Research (TIGR) developed a modular, open-source package called AMOS (http://www.tigr.org/software/AMOS/), which can be used for comparative genome assembly.

Gene finding refers to prediction of introns and exons in a segment of DNA sequence. Dozens of computer programs for identifying protein-coding genes are available. Some of the well-known ones include Genscan (http://genes.mit.edu/Genscan.html), GeneMarkHMM (http://opal.biology.gatech.edu/GeneMark/), GRAIL (http://compbio.ornl.gov/Grail-1.3/), Genie (http://www.fruitfly.org/seq_tools/genie.html) and Glimmer (http://www.tigr.org/softlab/glimmer). In addition, one can use genome comparison tools such as SynBrowse (http://www.synbrowser.org/) and VISTA (http://genome.lbl.gov/vista/index.shtml) to enhance the accuracy of gene identification.

An important aspect of genome annotation is the analysis of repetitive DNAs, which are copies of identical or nearly identical sequences present in the genome. Repetitive sequences exist in almost any genome and are abundant in most plant genomes. The identification and characterisation of repeats is crucial to shed light on the evolution, function and organisation of genomes and to enable filtering for many types of homology searches. A small library of plant-specific repeats can be found at ftp://tigr.org/pub/data/TIGR_Plant_Repeats/; this is likely to grow substantially as more genomes are sequenced. One can use RepeatMasker (http://www.repeatmasker.org/) to search repetitive sequences in a genome. Working from a library of known repeats, RepeatMasker is built upon BLAST and can screen DNA sequences for interspersed repeats and low complexity regions. Repeats with poorly conserved patterns or short sequences are hard to identify using RepeatMasker due to the limitations of BLAST. To identify novel repeats, various algorithms were developed. Some widely used tools include RepeatFinder (http://ser-loopp.tc.cornell.edu/cbsu/repeatfinder.htm) and RECON (http://www.genetics.wustl.edu/eddy/recon/). Simple sequence repeats can be identified in a given sequence using SSRIT, available at www.gramene.org.

Comparing sequences provides a foundation for many bioinformatics tools and may allow inference of the function, structure and evolution of genes and genomes. Methods in sequence comparison can be largely grouped into pairwise, sequence-profile and profile-profile comparison. For pairwise sequence comparison, FASTA (http://fasta.bioch.virginia.edu/) and BLAST (http://www.ncbi.nlm.nih.gov/blast/) are popular. To assess the confidence level for an alignment to represent homologous relationship, a statistical measure (expectation value, e-value) is integrated into pairwise sequence alignments. A sequence profile is calculated using the probability of occurrence for each amino acid at each alignment position. PSI-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) is a popular example of a sequence-profile alignment tool. Some other sequence-profile comparison methods are slower but even more accurate than PSI-BLAST, including HMMER (http://hmmer.wustl.edu/), SAM (http://www.cse.ucsc.edu/research/compbio/sam.html) and META-MEME (http://metameme.sdsc.edu/).

Proteins can be generally classified based on sequence, structure or function. Several sequence-based methods were developed based on sizable protein sequence (typically longer than 100 amino acids), including Pfam (http://pfam.wustl.edu/), ProDom (http://protein.toulouse.inra.fr/prodom/current/html/home.php) and Clusters of Orthologous Group (COG) (http://www.ncbi.
nlm.nih.gov/COG/new/). Other methods are based on fingerprints of small conserved motifs in sequences, as with PROSITE (http://au.expasy.org/prosite/), PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/) and BLOCKS (http://www.psc.edu/general/software/packages/blocks/blocks.html). Several bioinformatics tools have been developed for two-dimensional (2-D) electrophoresis analysis. SWISS-2DPAGE can locate the proteins on the 2-D PAGE maps from Swiss-Prot (http://au.expasy.org/ch2d/). Melanie (http://au.expasy.org/melanie/) can analyse, annotate and query complex 2-D gel samples. Flicker (http://open2dprot.sourceforge.net/Flicker/) is an open-source stand-alone program for visually comparing 2-D gel images. PDQuest (http://www.proteomeworks.bio-rad.com) is a popular commercial software package for comparing 2-D gel images. Some software platforms handle related data storage and management, including PEDRo (http://pedro.man.ac.uk/), a software package for modelling, capturing and disseminating 2-D gel data and other proteomics experimental data.

A protein family can be represented in a phylogenetic tree that shows the evolutionary relationships among proteins. Phylogenetic analysis can be used in comparative genomics, gene function prediction and inference of lateral gene transfer, among other things. The analysis typically starts from aligning the related proteins using tools like ClustalW (http://bips.u-strasbg.fr/fr/Documentation/ClustalX/). Among the popular methods to build phylogenetic trees are minimum distance (also called neighbour joining), maximum parsimony and maximum likelihood trees. Some programs provide options to use any of the three methods, for example, the two widely used packages PAUP (http://paup.csit.fsu.edu) and PHYLIP (http://evolution.genetics.washington.edu/phylip.html).

As more reliable data are collected, one can use ordinary differential equations for dynamic simulations of metabolic networks and combine information about connectivity, concentration balances, flux balances, metabolic control and pathway optimisation. Ultimately, one may integrate all of the information and perform analysis and simulation in a cellular modelling environment like E-Cell (http://www.e-cell.org/) or CellDesigner (http://www.systems-biology.org).

The data that are generated and analysed as described in the previous sections need to be compared with the existing knowledge in the field in order to place the data in a biologically meaningful context and derive hypotheses. To do this efficiently, data and knowledge need to be described in explicit and unambiguous ways that must be comprehensible to both humans and computer programs. An ontology is a set of vocabulary terms whose meanings and relations with other terms are explicitly stated and which are used to annotate data. A list of open-source ontologies used in biology can be found on the Open Biological Ontologies website (http://obo.sourceforge.net/). Many ontologies on this site are under development and are subject to frequent change. Gene Ontology (GO) (www.geneontology.org) is an example of a bio-ontology that has garnered community acceptance. Other examples of ontologies currently in development are the Sequence Ontology (SO) and the Plant Ontology (PO) project (www.plantontology.org). In addition, large collections of biological databases are available on the web for several crops. Nucleic Acids Research (http://nar.oxfordjournals.org/) publishes a database issue in January of every year.

Bibliography

Literature Cited

Bachem CWB, van der Hoeven RS, de Bruijn SM, Vreugdenhil D, Zabeau M, Visser RGF (1996) Visualization of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Plant J 9:745–753
Edman P (1949) A method for the determination of amino acid sequence in peptides. Arch Biochem 22(3):475
Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE, Mello CC (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391:806–811
Fischer A, Saedler H, Theissen G (1995) Restriction fragment length polymorphism-coupled domain-directed differential display: a highly efficient technique for expression analysis of multigene families. Proc Natl Acad Sci USA 92:5331–5335
Habu Y, Fukuda-Tanaka S, Hisatomi Y, Iida S (1997) Amplified restriction fragment length polymorphism-based mRNA fingerprinting using a single restriction enzyme that recognizes a 4-bp sequence. Biochem Biophys Res Commun 234:516–521
Ji H, Hodges E et al (2007) Genome-wide in situ exon capture for selective resequencing. Nat Genet 39:1522–1527
Liu Y, He Z, Appels R, Xia X (2012) Functional markers in wheat: current status and future prospects. Theor Appl Genet 125:1–10
Shendure J et al (2005) Accurate multiplex polony sequencing of an evolved bacterial genome. Science 309:1728–1732
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Freijters A, Pot J, Peleman J, Kuiper M, Zabeau M (1995) AFLP: a new concept for DNA fingerprinting. Nucleic Acids Res 21:4407–4414
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270:484–487

Further Readings

Buzdin A, Lukyanov S (eds) (2007) Nucleic acids hybridization. Springer, New York
Rhee S, Dickerson J, Xu D (2007) Bioinformatics and its applications in plant biology. Annu Rev Plant Biol 57:335–360
Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26(10):1135–1145
Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur A, Kapur A, Gupta V, Kumar D, Ravi V, Vij S, Khurana P, Sharma S (2004) Structural and functional analysis of rice genome. J Genet 83:79–99
Varshney RK, Graner A, Sorrells ME (2005) Genomics-assisted breeding for crop improvement. Trends Plant Sci 10(12):621–630
Yamamoto M et al (2001) Use of serial analysis of gene expression (SAGE) technology. J Immun Method 250:45–66
Ye SQ et al (2000) MiniSAGE: gene expression profiling using serial analysis of gene expression from 1 µg total RNA. Anal Biochem 287:144–152
11 Recent Advances in MAS in Major Crops

The amount of land available for crop production is decreasing steadily due to urban growth and land degradation, and the trend is expected to be much more dramatic in the developing than in the developed countries. These decreases in the amount of land available for crop production and the increase in human population will have major implications for food security over the next two or three decades. Food insecurity and malnutrition result in serious public health problems. Much of the early rise in grain production resulted from an increase in area under cultivation, irrigation, better agronomic practices and, most importantly, improved cultivars through conventional breeding strategies. However, yields of several crops have already reached a plateau in developed countries, and therefore, most of the productivity gains in the future will have to be achieved in developing countries through better natural resources management and crop improvement. It is in this context that marker-assisted selection (MAS) will play an important role in food production in the near future. MAS offers plant breeders access to a wide array of novel genes and traits, which can be inserted into high-yielding and locally adapted cultivars. This approach offers rapid introgression of novel genes and traits into elite agronomic backgrounds. Though MAS has been successfully applied to several crops (see chapter 9), only four crops are discussed in detail in the sections below.

Rice

Rice (Oryza sativa L.) is an intimate part of the culture, food habits and economy of many societies and is one of the most important crops for mankind. It is the basic food of more than three billion people, and it accounts for 50–80% of their daily calorie intake. To meet the growing demand for food and to sustain food security for people in low-income countries, rice production has to be raised by another 70% over the next three decades. This means raising the rice yield from the current level if these countries can maintain their rice-growing area at current levels.

For the irrigated ecosystem, it will be difficult to raise the rice yield from the current levels of 5–6 t/ha. The potential for increasing yield in the rainfed ecosystem is vast, as the current yield is only about 2.0 t/ha (compared to attainable yields of 5.0 t/ha). Nearly 40% of the total rice area is grown under rainfed conditions, and future increases in rice production will rely on rainfed ecosystems. Hence, this section describes the importance of MAS in genetic improvement of rice under water-limited environments. As with this complex drought-tolerance trait, MAS can also be applied to genetically improve other complex characteristics such as pest and disease resistance, nutrient improvement and other quality and agronomic traits.

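To put the figures quoted above for rice in perspective, the short sketch below works through the arithmetic. It uses only numbers stated in the text (a 70% production increase over three decades; rainfed yields of about 2.0 t/ha against 5.0 t/ha attainable). It assumes, purely for illustration, that growth compounds at a constant annual rate and that "three decades" means exactly 30 years; the function and variable names are hypothetical.

```python
from typing import Tuple

# Illustrative arithmetic only; the inputs are the figures quoted in the text.


def required_annual_growth(total_increase: float, years: int) -> float:
    """Constant compound annual rate that gives `total_increase` after `years`.

    Assumes steady compounding, which is a simplification.
    """
    return (1.0 + total_increase) ** (1.0 / years) - 1.0


def yield_gap(current_t_ha: float, attainable_t_ha: float) -> Tuple[float, float]:
    """Return (absolute gap in t/ha, current yield as a fraction of attainable)."""
    return attainable_t_ha - current_t_ha, current_t_ha / attainable_t_ha


# 70% more rice production over an assumed 30 years
annual_rate = required_annual_growth(0.70, 30)

# Rainfed ecosystem: about 2.0 t/ha now versus 5.0 t/ha attainable
gap_t_ha, fraction_attained = yield_gap(2.0, 5.0)

print(f"Required growth: {annual_rate:.2%} per year")
print(f"Rainfed yield gap: {gap_t_ha:.1f} t/ha "
      f"({fraction_attained:.0%} of attainable yield reached)")
```

Under these assumptions the required growth works out to a little under 2% per year, and rainfed rice currently reaches well under half of its attainable yield, which is consistent with the emphasis placed on rainfed ecosystems above.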
N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice 245
and Benefits, DOI 10.1007/978-81-322-0958-4_11, Springer India 2013
Rice and Drought

Rice is a heavy consumer of water, requiring around 5,000 liters of water to produce 1 kg of rice, and is less efficient in the way it uses water than either wheat or maize. In Asia, where 90% of all rice is grown and the vast majority of it is consumed, 72% of freshwater resources are used for irrigating rice crops. However, water availability has been shrinking as domestic and industrial demand has increased. In the tropics of South and Southeast Asia, only 41% of the rice area is irrigated. Yield loss due to drought is 227 kg/ha (20% of average yield) for the upland ecosystem. In a typical year, abiotic stresses decrease rice yields by about 15% in Asia, more than twice the damage caused by biotic stresses. Almost half of the land planted to rice in Asia and almost all of the rice in Africa is rainfed, and the yields are seriously limited by water stress. Thus, drought is the most important abiotic constraint in the upland ecosystem.

Rice is the main food of 65% of the population in India. It constitutes about 52% of the total food grain production and 55% of total cereal production. Rice environments in India are extremely diverse. Since the major portion of the area under rice in India is rainfed, production is strongly tied to the distribution of rainfall. In some of the states, erratic rainfall leads to drought during the vegetative period, but later on the crop may be damaged by submergence due to high rainfall.

Improving the yield of rainfed rice can be achieved by selecting directly for yield under stress in a breeding program. However, the ability to select for yield is severely hampered by year-to-year variability in rainfall pattern and low heritability of yield under water stress. Consequently, it has been suggested that improvements in yield could be achieved more efficiently by identifying secondary traits that allow a plant to escape, avoid or tolerate water stress and selecting for those traits in a breeding program.

Mechanisms of Drought Resistance in Rice

In general, the rice plant uses less than 5% of the water absorbed through roots from the soil. The rest is lost through transpiration, which helps to maintain leaf energy balance of the crop. The effect of water stress may vary with variety, growth stage of the rice crop and degree and duration of water stress. There may be two kinds of traits, namely, constitutive and adaptive traits, which confer drought resistance in rice. Constitutive traits are expressed under anaerobic, non-water stressed conditions, do not require water stress for their expression and may demonstrate variation that is subsequently modified by adaptive traits. Adaptive traits can be defined as those, such as osmotic adjustment (OA), which are expressed in response to water deficit. Identifying traits of importance in drought resistance is difficult due to the complexity of climatic variation in precipitation and evapotranspiration, the diversity of the rice hydrological environments, the relationship between soil moisture status and nutrient availability and the differential plant interactions with this environment. Traits that contribute to drought resistance in rice have been reviewed by several researchers (see chapter 5). All the traits have either positive or negative influence on yield, depending on the existing drought situation (timing, severity and duration) and depending on whether a survival or production mechanism is necessary. The best combination of traits depends, therefore, on the nature of the drought stress. This emphasises the need for a good characterisation of drought occurrence in the target area for breeding programs. The problem of adaptation to drought conditions in rice is complex and unique as compared with most other crops. The following traits have been shown to be important for drought resistance in rice.

Phenology

If a pattern of drought occurrence can be identified, the plant can escape drought by having the most sensitive phenological stages coinciding with the periods of lower risks of drought stress either through manipulation of the plant duration or through manipulation of the cropping calendars. For example, in a terminal stress situation, a common phenomenon in South Asia, breeding
for short-duration varieties is a simple strategy with proven efficacy. The duration of upland varieties of Bangladesh and eastern India is generally below 95 days, which matches the short monsoon season. The role of plant developmental and phenological factors in affecting crop response to drought stress, such as moderated water use through reduced leaf area and shorter growth duration, has already been discussed elsewhere.

Root System

The possession of a deep and thick root system which allows access to water deep in the soil profile is considered crucially important in determining drought resistance. The trait may be less important in rainfed lowland rice, where hardpans may severely restrict root growth. Here, the ability to penetrate a hard layer is considered important. This trait may also be useful in upland rice where high penetration resistance may limit rooting depth and where soils will harden as they dry. The penetration of roots through uniform hard layers is probably achieved through the possession of large root diameter which resists buckling, but when the impedance is due to a coarse textured sandy or stony horizon, thin roots would penetrate more easily. The investment of carbon in a deep root system may have a yield implication because of loss of carbon allocation to the shoot. The rapid development of deep or thick root systems may, therefore, be of limited value if terminal drought occurs early in the crop cycle, but it is certainly important for intermittent and later terminal drought situations. It is also important to note that root growth is influenced by the environment. Chemical or physical adverse conditions such as low water potential or high/low soil temperature directly inhibit root growth. Biological factors in the rooting environment such as root-feeding nematodes, termites, mites and aphids can severely reduce root proliferation or rooting depth and thereby affect drought resistance. The shoot environment can also indirectly influence root growth either via carbon supply or signalling process (e.g. light interception, water status, nutrient status). At the genetic level, the response of roots to the environment is poorly understood because roots are intrinsically difficult to study, particularly in the natural environment.

Irrespective of root axial resistance, a few long roots can theoretically sustain reasonable evapotranspirational demand at adequately high leaf water potential. The ability of rice to reach deep soil moisture or to penetrate compacted soil is linked with the capacity to develop a few thick (fibrous) and long root axes. Thick roots persist longer and produce more and larger branch roots, thereby increasing root length density and water-uptake capacity. When drought stress develops, the root/shoot dry matter ratio increases, as shown in some of the studies. Sometimes, even the absolute size of the root increases. Most certainly, root morphology and distribution change. Drought-resistance improvement through breeding programs using root traits is limited by the requirement for labour-intensive, destructive and expensive phenotyping protocols. Whatever the desirable root ideotype may be, it would be extremely difficult to perform selection based on measuring the root phenotype.

Osmotic Adjustment

Osmotic adjustment (OA) is increasingly recognised in several crop plants as an effective component of drought resistance, which has a positive direct or indirect effect on plant productivity under drought stress. Generally, when cells are subjected to slow dehydration, compatible solutes are accumulated in the cytosol, resulting in the maintenance of cell water content against the reduction in apoplastic water potential. The compatible solutes (various sugars, organic acids, amino acids, sugar alcohols or ions, most commonly K+) differ with plant species and genera. The main solutes that are responsible for OA in rice under water-deficit conditions have not been elucidated. Rice does not accumulate glycine betaine because of a deficiency in choline monooxygenase and betaine aldehyde dehydrogenase, the key enzymes involved in glycine betaine synthesis. Rice accumulates proline, but
the extent of proline accumulation and its contribution to OA has not been evaluated. The support of leaf turgor by OA in rice was well reflected in delayed leaf rolling when water deficit developed. Results indicate that leaf rolling and leaf death can be delayed by OA in rice. However, more data are needed on the contribution of OA to rice performance under different drought stress conditions. Traditional upland cultivars generally tend to excel in root growth and soil moisture extraction capacity while lacking in OA. These cultivars usually develop severe leaf dehydration and leaf rolling as soon as soil moisture is depleted. It can be speculated that under upland situations with deep soil moisture, there may have been a selective advantage to deep and thick root systems, which served to maintain high leaf water status and dehydration avoidance. Under such conditions, deep roots have evolved in adapted materials. OA did not evolve under such conditions because plants were usually avoiding severe water deficit. The capacity for OA may have evolved where leaf tissue water status was often reduced by water deficit, such as in lowland rice where deep rooting is often deterred by the subsoil compaction. These different modes of response to drought stress require validation and further research to suggest clues to desirable breeding strategies with respect to the different rice environments.

Dehydration Tolerance

Dehydration tolerance (the ability of leaves to tolerate desiccation level water stress) assists the plant organs to survive short-term water deficits. The lowest leaf water potential that leaves reach just prior to death (lethal leaf water potential) has been used to determine dehydration tolerance. During terminal stress, dehydration tolerance may allow plants to maintain metabolic activity for a longer time and to translocate more stored assimilates to the grain. Plants with the ability to adjust osmotically or tolerate dehydration may delay leaf rolling, delay stomatal closure and maintain leaf expansion with little cost, which should promote resistance particularly in the terminal drought situation. So if dehydration tolerance of rice is increased by breeding approaches, then it could be possible to increase or at least stabilise the yield of rainfed rice. As reported in some studies, genotypic variation for dehydration tolerance capacity of rice is large. However, incorporation of this trait in breeding programs is hampered by complex experimental protocols requiring heavy investment in creating controlled environment facilities.

Shoot-Related Drought-Resistance Traits

Leaf Rolling

Several mechanisms of drought resistance are associated with the shoots of rice. Leaf rolling (drought avoidance) reduces the water loss in addition to reducing the leaf area exposed to heat and light radiation. Varieties differ in their ability to roll leaves under similar water deficit. There is some evidence that enhanced ability to roll leaves confers a yield advantage under drought conditions. However, most breeders consider the triggering of leaf rolling as an indication of a plant suffering and select against its early manifestation.

Green Leaf Area

It has been suggested that plants which are able to retain green leaf area are better able to recover after drought and give good yield. Leaf drying, often used in field scoring, is the reverse side of the stay-green ability and has been shown to be correlated with leaf relative water content. However, it has proved difficult to separate the green leaf retention from the possible underlying mechanisms of drought resistance since the process of drought recovery in terms of mechanisms, importance or genetic variation is poorly understood.

Stomatal Closure and Canopy Temperature

Another mechanism of drought avoidance in the rice shoot is fast stomatal closure which acts to reduce water losses. Varietal differences in the sensitivity of stomatal conductance to leaf water
status do exist. The contribution of stomatal conductance to drought performance in the field
is yet to be identified. However, a plant with sensitive stomata would only be adapted to a
situation of relatively severe drought. But during mild drought, rapid stomatal closure would
reduce photosynthesis when there is no need to do so. Canopy temperature can also be used since
low canopy temperature may indicate more favourable soil moisture conditions. This
characteristic could be valuable in selection, but measuring it requires extremely uniform
soils to eliminate any subsoil spatial variation.

Cell Membrane Stability
The cell membrane is one of the main cellular targets common to different stresses. The extent
of its damage is commonly used as a measure of tolerance to various stresses in plants such as
freezing, heat, drought and salt. Cell membrane stability (CMS), or the reciprocal of cell
membrane injury, is a physiological index widely used for the evaluation of drought and
temperature tolerance. This method was developed for a drought and heat tolerance assay in
sorghum and measures the amount of electrolyte leakage from leaf segments. Its reliability as an
index of heat stress tolerance is supported in several plant species by good correlation between
CMS and plant performance in the field under high temperature and water stress. The genetic
variation in heat tolerance in various crops has been studied using CMS as one of the component
traits. Phenotypic selection for CMS may not always lead to accurate results for breeding
purposes because of its complex nature and its strong interaction with the environment. Thus,
the evaluation of this trait should be done in a controlled environmental situation.

Water Use Efficiency
Connected to stomata and leaf rolling is water use efficiency (WUE, the ratio of carbon gained
to water used). Analysis of WUE generally relies on measuring carbon isotope discrimination.
This has been shown to vary between rice varieties, suggesting that upland varieties need less
water for every molecule of carbon fixed. A plant that is more water use efficient should be
more successful in a drought environment, particularly late in the growing season when
transpiration accounts for the majority of total evaporation. WUE can be either positively or
negatively related to production under stress, which is largely dependent on the genotype's
capacity to sustain transpiration, and WUE alone might be questionable as a selection criterion.
Therefore, WUE can even be a misleading parameter if selection for high WUE is performed under
drought stress where genotypic variation in deep soil moisture extraction is possible. It is
realised that results from selection for WUE (by carbon isotope discrimination) depend very much
on the environmental conditions in which such selection is performed. It also seems that the
results from selection for high WUE may be unpredictable. In several crops, the correlations
between WUE and dry matter production were inconsistent in experiments conducted over different
water regimes and years.

Epicuticular Wax
It has been repeatedly shown that total crop dry matter production is linearly and positively
related to crop transpiration. This relationship is partly derived from the fact that the
control of both transpiration and CO2 exchange is dependent on stomatal activity. However, loss
of water can also occur through non-stomatal pathways for which no return in CO2 fixation is
expected. Non-stomatal resistance to water loss from leaves can also be considered a
drought-avoidance mechanism. An important non-stomatal pathway is the leaf cuticle. Research
suggests that rice has a low cuticular resistance to water loss compared with other grasses, but
variation between varieties exists, and this may have potential in breeding for improvement in
drought resistance. The fact that traditional upland rice cultivars have relatively higher
epicuticular wax supports the hypothesis that high epicuticular wax is an important
drought-resistance attribute in rice. The specific effects of the amount, the composition and
the form of cuticular wax in rice were explored, but the quantification of these factors with
respect to rice performance under drought stress is still
needed. Further, physiological and biochemical work is required to logically link cuticular
resistance and epicuticular wax with drought resistance and for efficient manipulation in
breeding programs.

Other Traits
The value of improving the use of absorbed light, resistance to photoinhibition and capacity for
non-photochemical quenching to improve drought resistance of rice has been described. In
addition, a genetic basis for difference in resistance to photoinhibition in rice has been
demonstrated. These traits are physiologically, biochemically and genetically complex in
themselves and interact with each other. Since abscisic acid (ABA) has been shown to be involved
in regulating stomatal conductance, OA and root conductivity, interest has been shown in
measuring ABA contents in order to establish relationships with drought resistance. Varietal
differences in leaf ABA content and sensitivity to applied ABA also exist in rice.

In summary, a utilisable secondary trait in breeding for drought resistance in rice should be
(1) genetically associated with grain yield under drought, (2) highly heritable, (3) stable and
feasible to measure and (4) not associated with yield loss under ideal growing conditions.
However, although several such traits have been described above, they are rarely selected for in
traditional rice improvement programs because phenotypic selection for these traits involves
complex, difficult and labour-intensive protocols; the tremendous diversity of environments and
water availability; and the large genotype × environment interactions which complicate
selection. Knowledge from physiological studies indicates that the ability of the root system
to exploit deep soil moisture and the capacity for OA during water stress are considered major
drought-resistance traits in rice. They can also be negatively correlated due to tight genetic
linkage of some of the controlling genes, as was shown for OA and root morphology. Therefore,
the impact of one trait in isolation may be difficult to establish. One promising approach is
to map genetic loci (quantitative trait loci, QTL) influencing drought-resistance traits and
crop productivity in stressful environments. Once tightly linked markers have been identified,
they can be used to develop a marker-assisted selection (MAS) strategy for breeding
applications. Molecular markers allow breeders to track the genetic loci controlling drought
resistance without measuring the phenotype, thus reducing the need for extensive field testing
over space and time. High-resolution mapping and physical mapping can then be followed by
isolation of the drought-resistance genes using map-based cloning techniques. The genes of
interest can be used in functional studies and crop improvement through genetic transformation.

Genetic Linkage Map in Rice

Construction of a linkage map is essentially the first step in QTL mapping. Such maps allow
genetic dissection of QTL, facilitate high-resolution genetic mapping and positional cloning of
important genes, assist in local comparisons of synteny within and across species and provide an
ordered scaffold on which complete physical maps can be assembled. Recent progress in DNA
markers and their linkage maps has provided efficient tools and methods for mapping individual
loci conferring not only monogenic but also polygenic traits. For rice, the first molecular
marker-based genetic map was constructed by McCouch et al. in 1988, and since then several
linkage maps have been constructed in rice using different mapping populations, including
high-density restriction fragment length polymorphism (RFLP) maps and expressed sequence tag
(EST) maps. These maps provide the foundation for molecular genetic analysis of almost any trait
of interest and thus have a number of advantages over classical genetic maps for genetic
research and breeding.

QTL Mapping of Drought-Resistance Traits in Rice

The availability of high-density linkage maps is valuable as a resource for studies that genetically
dissect out the complex traits such as drought resistance. QTL mapping provides a potential
tool for conducting physiological and genetic research to understand and improve drought
resistance. It eases screening for traits that are difficult to quantify and are influenced by
environmental stimuli.

Good progress has been made in identifying molecular markers linked to various
drought-resistance traits in rice. Two review papers based on the available literature have
been published by the author of this book and his colleagues, and they are freely available on
the web (or refer to https://sites.google.com/site/drnmboopathi/). Table 11.1 summarises, as an
example, the details of QTL identified in selected publications for different drought-resistance
traits and their flanking markers in different mapping populations. The first QTL associated
with various root morphological characters were reported in a CO39/Moroberekan recombinant
inbred (RI) line population under greenhouse conditions by Champoux et al. in 1995. They also
identified QTL linked to drought avoidance in the field under water-deficit stress at three
different growth stages using the same mapping population. It is encouraging to note that over
50% of the putative QTL associated with root characters in the greenhouse study mapped to the
same chromosomal locations as QTL influencing drought avoidance in the field experiments. Using
the same RI lines, Ray et al. in 1996 mapped QTL for root penetration ability using a wax
petrolatum layer. Clustering of QTL associated with root traits was observed, as in the previous
study. This suggests that genes that determine root morphology may be clustered in certain
chromosomal regions of the rice genome. These regions may contain clusters of genes or genes
with pleiotropic effect. Most of the QTLs linked to tiller number mapped close to chromosomal
regions identified as associated with total root number. These results suggested that molecular
markers could play a significant role in studying the relationship of shoot- and root-related
drought-resistance traits. This issue can be investigated further in a rice population developed
specifically for the purpose of studying these traits. An analysis was also conducted by Lilley
and her team in 1996 using a subset of this population to identify and map QTL associated with
dehydration tolerance and OA, and the identified QTLs were compared to root traits and leaf
rolling scores measured in the same lines. It is interesting to note that the putative OA locus
and two of the dehydration tolerance QTL on chromosome 8 were close to the regions associated
with root morphology. From their results, it was suggested that OA and dehydration tolerance
are negatively correlated with root morphological characters associated with drought avoidance.
High OA and dehydration tolerance were associated with Co39 (indica) alleles, and extensive root
systems were associated with Moroberekan (japonica) alleles. It was suggested that to combine
high OA with extensive root systems, the linkage between these traits needs to be broken.

It is obvious that QTL detection depends on the cross combination used in the analysis because
detection of QTL is based on allelic differences at the QTL between parental lines. Thus, an
important question is whether QTLs detected in one population are shared with QTL detected in
other populations. QTL analysis of the same traits using different cross combinations will be
necessary to answer this question. In this context, several publications studied a doubled
haploid (DH) population derived from an IR64/Azucena cross and mapped the genes controlling root
morphology and distribution. The main QTLs were common between traits, which indicates that
there is a possibility to modify several aspects of root morphology simultaneously. The sd-1
locus on chromosome 1, which has a massive effect on plant height and tillering, was found to
co-locate with QTL governing the root system in this study. However, the QTL on chromosome 7
that was associated with effects on maximum root depth did not seem to be linked with a QTL for
plant height. This suggests that it may be possible to decrease the height of traditional tall
upland rice varieties without diminishing the quality of their root system. Besides, those
reports identified several common QTL depending on the traits. Development of isogenic lines
would help to clarify the proper value of the common

Table 11.1 Details of mapping population, linkage map characteristics and QTL identified for drought-resistant traits in rice from selected publications

Parents | Population(a) | Number and type of markers used | Linkage map coverage (cM) | Trait | QTL identified (No. of QTL; across trials/experiments; across population) | Maximum phenotypic variance (%) | Reference
Co39/Moroberekan | 281 F7 RILs (203) | 127 (RFLP) | | Root thickness | 18 | 56 | Champoux et al. (1995)
 | | | | Root-shoot ratio | 16 | 38 |
 | | | | Root dry weight per tiller | 14 | 35 |
 | | | | Deep root weight | 8 | 18.5 |
 | | | | Maximum root depth | 4 | |
 | | | | Drought avoidance (leaf rolling) | 18 5 | 35 |
Co39/Moroberekan | 281 F7 RILs (202) | 127 (RFLP) | | Number of penetrating roots | 4 | 8 | Ray et al. (1996)
 | | | | Total number of roots | 19 | 19 |
 | | | | Root penetration index | 6 | 13 |
 | | | | Tiller number | 10 | 14 |
Co39/Moroberekan | 281 F7 RILs (52) | 127 (RFLP) | | Dehydration tolerance | 5 | 36 | Lilley et al. (1996)
 | | | | Osmotic adjustment | 1 | 32 |
 | | | | Relative water content | 2 | 35 |
IR1552/Azucena | 150 RIL (96) | 249 (RFLP, SSR, cDNA-AFLP) | | Seminal root length | 4 4 | 13.4 | Zheng et al. (2003)
 | | | | Relative seminal root length | 2 | 13.9 |
 | | | | Adventitious root number | 7 3 | 18.2 |
 | | | | Relative adventitious root number | 1 | 15.0 |
 | | | | Lateral root length | 4 2 | 14.4 |
 | | | | Relative lateral root length | 1 | 11.9 |
 | | | | Lateral root number | 2 | 11.7 |
 | | | | Relative lateral root number | 1 | 12.3 |
IR62266/IR60080 | 150 BC3F3 (142) | 167 (RFLP, SSR, candidate genes) | 1,370 | Osmotic adjustment | 19 12 3 | 25.0 | Robin et al. (2003)
IR64/Azucena | 135 DH (90, 84, 56 & 109) | 260 (RFLP, SSR, RAPD, isozymes) | 2,457 | Days to flowering | 2 | 24.6 | Venuprasad et al. (2002)
 | | | | Plant height | 2 1 1 | 20.0 |
 | | | | Grain yield | 1 1 | 15.7 |
 | | | | Harvest index | 1 1 | 19.7 |
 | | | | Days to maturity | 1 | 20.4 |
 | | | | Root thickness | 1 | 26.9 |
 | | | | Root volume | 1 | 29.1 |
 | | | | Root dry weight | 1 | 30.7 |
 | | | | Maximum root length | 1 | 12.9 |

DH doubled haploids, RIL recombinant inbred lines, BC backcross progenies, RFLP restriction fragment length polymorphism, RAPD random amplified polymorphic DNA,
SSR simple sequence repeats, cDNA complementary DNA, AFLP amplified fragment length polymorphism
(a) Subset of population used for phenotyping is indicated in parentheses
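
The "maximum phenotypic variance (%)" figures in Table 11.1 express how much of the variation in a trait is accounted for by a QTL. One common way to obtain such a percentage is the R² of a single-marker analysis, in which lines are grouped by their genotype at one marker. The sketch below is a minimal illustration under that assumption; the function name, genotype codes and phenotype values are hypothetical and are not taken from any of the cited studies, which may have used other estimators (for example, interval mapping).

```python
def percent_variance_explained(genotypes, phenotypes):
    """Return R^2 (in %) of phenotype on marker genotype class.

    Hypothetical helper: R^2 = SS_between / SS_total from a one-way
    classification of lines by their genotype at a single marker.
    """
    if len(genotypes) != len(phenotypes) or not phenotypes:
        raise ValueError("need one genotype per phenotype value")
    grand_mean = sum(phenotypes) / len(phenotypes)
    ss_total = sum((y - grand_mean) ** 2 for y in phenotypes)
    if ss_total == 0:
        raise ValueError("phenotype shows no variation")

    # Group phenotype values by marker genotype class (e.g. parental allele A or B).
    by_class = {}
    for g, y in zip(genotypes, phenotypes):
        by_class.setdefault(g, []).append(y)

    # Between-class sum of squares: variation attributable to the marker.
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in by_class.values()
    )
    return 100.0 * ss_between / ss_total


# Made-up example: eight inbred lines scored at one marker.
marker = ["A", "A", "A", "A", "B", "B", "B", "B"]
root_thickness_mm = [1.0, 1.2, 1.1, 0.9, 1.4, 1.6, 1.5, 1.3]
pve = percent_variance_explained(marker, root_thickness_mm)
print(f"Variance explained by marker: {pve:.1f}%")
```

With inbred populations such as RILs or DH lines, each marker has essentially two homozygous classes, which is why two genotype codes suffice in this toy example.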

QTL by eliminating the confounding effects of other genomic regions and to fine-tune their
location.

QTLs controlling drought-avoidance mechanisms (such as leaf rolling, leaf drying, relative water
content of leaves and relative growth rate under stress) were analysed in this DH population in
three field trials with different drought stress intensities at two sites in some publications.
Some of the QTLs were common across the trials and traits. QTLs detected for leaf rolling, leaf
drying and relative water content were mapped to the same locations as QTL controlling root
morphology in the previous study using the same population. QTL identified for leaf rolling in
this population were located in similar positions to QTL for leaf rolling in another
population. However, in contrast to these studies, when a randomly chosen subset of 56 DH lines
derived from this cross were grown in polyvinyl chloride cylinders to study the root morphology
and associated traits under well-watered conditions and low-moisture stress at two growth stages
during the vegetative phase, several QTLs were found. In total, 15 QTLs were detected from both
growth stages, and only three were common between the stages. This reveals that different sets
of QTL show up at different developmental stages within the vegetative stage itself. Further,
the absence of common QTL for root traits between two developmental stages and two moisture
regimes in this study suggests the existence of parallel genetic pathways operating at different
growth stages and moisture regimes. Using a wax petrolatum layer system simulating compacted
soil layers, root traits were evaluated with a subset of these DH lines. QTLs for root
penetration index, penetrated root thickness, penetrated root number and total root number have
been located. Common QTLs linked to root penetration index and basal root thickness were noted
across experimental systems and genetic backgrounds. This suggests that both root penetration
ability and root thickness may be controlled by genes which are closely linked or have
pleiotropic effect. No QTLs for maximum penetrated root length were detected by interval
mapping, although five RFLP markers were found significantly associated with this trait using
single-marker analysis. Root length is known to be highly sensitive to environmental variation
and therefore is more difficult to improve than other root traits such as root thickness.

Another extensively analysed population for QTL linked to drought resistance is Bala/Azucena,
developed by Price and his team. They reported the construction of a linkage map and its use in
mapping the QTL controlling maximum root length at various stages of root development,
adventitious root thickness and root volume in an F2 population. QTL for different days/stages
showed different types of genetic effect. Some QTLs observed in the Bala/Azucena population are
evident in the CO39/Moroberekan population, while some are not. The same population was used for
mapping two shoot-related mechanisms, namely, stomatal conductance and leaf rolling, along with
heading date. This F2 population was advanced to F6, and a more detailed linkage map was
constructed to analyse the QTL for root penetration ability with a modified wax petrolatum
layer. It is interesting to note that some of the QTLs for root penetration ability reported
here are close to QTL for root morphology reported in the F2. However, the differences in the
reported locations of QTL between this study and a similar study are probably due to the
different populations studied and to the different methods used for assessing the root
penetration phenotype. Comparison of the QTL identified in this study with previous reports of
QTL for root morphology suggests that alleles which improve root penetration ability may also
make the roots either longer or thicker. In another study, QTLs for drought avoidance based on
field trials in the Philippines and West Africa have been localised. QTLs for leaf rolling and
drying and relative water content were mapped for each site and across the sites. However, there
was relatively poor correlation between traits measured in the two sites, suggesting there may
be some different genetic components contributing to drought resistance in the different
environments. The same experimental materials were used to map QTL for root morphology and
distribution using soil-filled chambers exposed to contrasting water-deficit regimes. QTLs for
the deep root
weight, maximum root length, root-shoot ratio, number of deep roots and root thickness were
identified. Some were revealed only in individual experiments and/or for individual traits,
while others were common to different traits or experiments.

A comprehensive analysis that dissects the physiological and morphological traits related to
drought resistance, partitions drought resistance into components and compares QTL would
contribute to a better understanding of the genetic basis of drought resistance in plants. The
parents, CT9993 and IR62266, were studied at the morphological and physiological level and shown
to differ in root system and OA. In order to better understand the mechanisms of drought
tolerance via OA and drought avoidance via a deep root system in rice, a molecular dissection of
QTL for both OA and root traits in one genetic background is important. Hence, genomic regions
responsible for CMS were studied in the greenhouse in a slowly developed drought-stress
environment by using rice DH lines derived from CT9993/IR62266. No significant correlation was
found between CMS and relative water content, indicating that the variation in CMS was genotypic
in nature. Nine putative QTLs for CMS were located, and one of them, on chromosome 8, mapped to
the same locus as the QTL for OA. Moreover, several QTLs involved in root morphology and
drought avoidance in rice have been identified in this region. The mapping of a CMS QTL in this
region suggests that it might contain genes for different traits responsible for conferring
drought resistance in rice. The same DH lines were used to map the QTL associated with root
traits and OA. Consistent QTL for drought responses across genetic backgrounds were detected.
Comparative mapping identified three conserved regions associated with various physiological
responses to drought in several grass species. This result suggests that these regions
conferring drought adaptation have been conserved across grass species during genome evolution
and might be directly applied across species for the improvement of drought resistance in
cereal crops.

Rice develops roots under anaerobic soil conditions with ponded water prior to exposure to
aerobic soil conditions and water stress in rainfed lowlands. Constitutive root system
development in anaerobic soil conditions has been reported to have a positive effect on
subsequent expression of adaptive root traits and water extraction during water stress
(Kamoshita et al. 2008). The effect of phenotyping environment on identification of QTL for
constitutive root morphology traits was studied using greenhouse experiments, and the results
emphasised the need for careful selection of a phenotyping environment that relates closely to
the target environment where the traits are to be expressed, and for careful interpretation of
results, since otherwise QTL may be misplaced. In spite of large environmental effects, even in
well-watered anaerobic conditions, stable QTL were identified across the experiments in
CT9993/IR62266 DH lines. Physical mapping of the putative QTL for deep root morphology traits
would help to elucidate how rooting depth and deep root mass are genetically controlled at the
molecular level. QTLs linked to plant height, number of tillers, total root number, root dry
weight, total plant length and root to shoot length ratio were identified in this population
under well-watered conditions. Some of the alleles governing the root-related traits were from
IR62266, which indicates that the inferior parent can also contribute favourable alleles for
root traits.

Drought-resistance component traits, described above, can interact with each other in modifying
the plant water status. The real test for drought resistance is continuous growth and
production under stress. Three traits, which perhaps encapsulate all the drought-resistance
components, are leaf expansion (as an indication of plant turgor), biomass production and
ultimately grain production under stress. Although previous analysis indicated the map positions
of QTL associated with drought-resistance traits and their co-location, the effects of those
traits on plant production under drought have to be properly established. Thus, there is a need
to determine whether the QTLs linked to drought-resistance traits also affect yield under
stress. By comparing the coincidence of QTL for specific traits and QTL for plant production
under drought, it is possible to test whether a particular constitutive
or adaptive response to drought stress is of significance in improving field-level drought
resistance. Such associations would also improve the efficacy of MAS in breeding for drought
tolerance in rice. QTLs associated with grain yield and root morphological traits were mapped in
the IR64/Azucena DH population under contrasting moisture regimes. CT9993/IR62266 DH lines were
used to identify the QTL linked to rice performance under drought and to genetically dissect the
nature of the association between drought-resistance traits and yield under drought in the
field.

Rice Subspecies and Habitat

Rice is cultivated in four continents, and very large germplasm collections are available,
offering many possibilities of identifying adaptive traits and tolerance characters towards
abiotic stresses. Cultivated rice belongs to the Oryza sativa complex, which contains the two
cultivated species, O. sativa and O. glaberrima, and several wild species, which are considered
direct ancestors of the cultivated ones. O. sativa is cultivated all over the world, whereas
O. glaberrima is cultivated only in Africa. Within the O. sativa species, two major groups of
ecogeographic races are distinguished, the indica and japonica types. They roughly correspond to
rice grown in tropical regions of Southeast Asia and in more temperate regions of Japan and
northern China, respectively. Indica and japonica varieties cross-hybridise, but usually many
plants in the progeny are sterile or partially sterile. Large and well-known genetic diversity
exists at the subspecies level and is a valuable resource for both classical and
biotechnology-assisted breeding.

Most of the populations used in QTL analysis of drought-resistance traits were derived from an
indica/japonica cross because of the high frequency of polymorphism based on wide variation.
Development of a deep and extensive root system is one adaptive strategy of plants for drought
avoidance. Upland japonica cultivars appear to rely on their deep and extensive root system to
achieve their demonstrated capacity for drought avoidance, whereas indica cultivars have
different adaptive strategies including shortening of growth duration and tissue-level
tolerance. Whether a drought-avoidance strategy based almost entirely on a well-developed root
system in a japonica background can be combined with tissue-level tolerance and/or short growth
duration to improve plant performance under water stress in specific environments is a question
which is central to drought-resistance breeding in cereals. The phenomenon of return to parental
type after repeated generations of selfing following indica/japonica hybridisation is familiar
to rice breeders and makes it difficult to obtain favourable recombinants through traditional
means. Differential adaptation to edaphic factors, such as soil, water and temperature regimes,
and genetically controlled sterility barriers separate these two major subspecies. Evaluation of
upland japonica/lowland indica populations under anaerobic lowland conditions may be confounded
by the difference in adaptation to lowland conditions. Cross combinations used in breeding
programs are mainly same-ecotype crosses, such as japonica/japonica and indica/indica.

Therefore, more QTL analysis based on crosses between closely related varieties, especially
indica/indica crosses, will be necessary for identification of QTL alleles which will be useful
in rice breeding. Ali et al. in 2000 analysed RILs developed from two indica parents,
IR58821/IR52561, to map QTL for root traits in two different seasons. They identified not only
common QTL between two seasons but also consistent QTL across genetic backgrounds. The effect
of phenotyping environment and genetic background on QTL identification was examined by using
this population. QTLs for shoot biomass, deep root morphology and root thickness were mapped.
Consistent QTLs across the experiments and genetic backgrounds were detected. Results from these
studies suggest that some amount of similarity exists between japonica/indica crosses and
indica/indica crosses in the genetic control of root traits. Since then, several studies have
been conducted using such cross combinations (e.g. see Gomez et al. 2010).
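
Statements such as "consistent QTL across genetic backgrounds" rest on comparing the map positions of QTL reported in different populations. A minimal sketch of such a comparison follows, assuming each QTL has already been summarised as a chromosome plus a start-end interval on a shared reference map. The `QTL` record, the helper names and all intervals are hypothetical illustrations rather than data from the cited studies; in practice the positions from different crosses first have to be aligned through common markers.

```python
from typing import List, NamedTuple, Tuple


class QTL(NamedTuple):
    """Hypothetical summary of one QTL on a shared reference map."""
    trait: str
    chromosome: int
    start_cm: float
    end_cm: float


def colocated(a: QTL, b: QTL) -> bool:
    """True if two QTL lie on the same chromosome and their intervals overlap."""
    return (
        a.chromosome == b.chromosome
        and a.start_cm <= b.end_cm
        and b.start_cm <= a.end_cm
    )


def shared_regions(pop_a: List[QTL], pop_b: List[QTL]) -> List[Tuple[QTL, QTL]]:
    """Pair every QTL in population A with each overlapping QTL in population B."""
    return [(a, b) for a in pop_a for b in pop_b if colocated(a, b)]


# Made-up intervals for two mapping populations.
population_1 = [
    QTL("root thickness", 1, 40.0, 55.0),
    QTL("deep root weight", 7, 10.0, 25.0),
    QTL("root thickness", 9, 60.0, 70.0),
]
population_2 = [
    QTL("root thickness", 1, 50.0, 65.0),
    QTL("maximum root length", 7, 30.0, 45.0),
    QTL("root penetration index", 9, 68.0, 80.0),
]

pairs = shared_regions(population_1, population_2)
for a, b in pairs:
    print(f"chr {a.chromosome}: {a.trait} overlaps {b.trait}")
```

Overlap of intervals is only a first screen: co-located QTL may reflect pleiotropy or merely linked genes, which is the distinction that near-isogenic lines and fine mapping are meant to resolve.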

Marker-Aided Selection and Near-Isogenic Lines for Drought-Resistance Improvement

The QTL presented above, associated with different drought-resistance mechanisms and assessed
with different sites, methodologies and seasons, confirm the complexity of the genetics of
drought resistance in rice. They also illustrate the degree of QTL by year and QTL by site
interaction and demonstrate the value of calculating averages for identification of the more
stable but small-effect QTL. A significant proportion of the phenotypic variability of several
of these putative drought-resistance traits is explained by the segregation of relatively few
genetic loci, thus leading to the possibility of indirect selection of these complex traits
using a MAS strategy. This information is potentially valuable to breeders and enables
researchers to target specific regions in order to produce near-isogenic lines (NILs) at some
QTL. These NILs will allow more accurate determination and understanding of environmentally
stable QTL and further allow assessment of the impact of QTL on yield under drought. They could
also aid in the identification of the genes responsible for the QTL through candidate gene
and/or positional cloning approaches. Shen et al. in 2001 reported improvement of the rice root
system by MAS of several root QTL. They also studied the possible effects of these introgressed
segments on other agronomic traits through pleiotropy or linkage drag. Work has also been done
to transfer the QTL for root morphological traits from Azucena into a popular Indian variety,
Kalinga III, by MAS. NILs were developed for OA with a japonica background. NILs will serve as
valuable material to test the utility of the introgressed QTL. This will also help in
understanding the physiological and molecular nature of the QTL and in evaluating the
contribution of the QTL to yield in the target environment.

Target Population of Environment and Molecular Breeding

To improve the drought resistance of rainfed lowland rice, mapping populations from crosses
between parental lines that are equally well adapted to target environments should be evaluated
(see also chapter 5). Focusing on the variation within a single ecotype might hasten progress
towards drought resistance, and the locally well-adapted germplasm will increase the efficiency
of breeding. Traditional rice varieties are still being grown in rainfed uplands even though
they give a low but definite yield. There is a need to develop rice varieties with higher yield
that retain the drought-tolerance capacity of traditional accessions. The necessity of QTL
identification based on the variation from crosses between two related varieties belonging to
the same subspecies and adapted to the target population of environment (TPE) has been
emphasised by various authors. Further, upland rice environments vary widely in terms of climate
and edaphic factors, making it difficult to use genetic material developed for one location in
other locations.

Most of the QTLs linked to drought-resistance traits were flanked mostly by RFLP markers and a
few amplified fragment length polymorphism (AFLP) markers. Though RFLP markers are reliable,
they involve tedious, time-consuming protocols as well as the handling of hazardous radioactive
chemicals. Hence, they are not suitable for routine MAS. The RFLP and AFLP markers need to be
converted to simple, rapid and inexpensive polymerase chain reaction (PCR)-based markers, like
STS, to enhance and economise the breeding programs. This involves extra effort in converting
the markers, besides establishing that the converted markers show the same polymorphism between
the parents as the original RFLP or AFLP markers. Identification of simple PCR-based
nonradioactive markers linked to putative drought-resistance component traits will hasten MAS
for drought-resistance improvement. SSRs, inter-simple sequence repeats (ISSRs) and random
amplified polymorphic DNAs (RAPDs) are well-established PCR-based markers used in the mapping
process (see chapter 3).

The candidate gene approach has been applied in plant genetics in the past decade for the
characterisation and cloning of QTL (see chapter 10). Candidate genes are genes involved in the
expression of a given trait. They can be identified either from previously sequenced genes of
known function or from cDNA libraries constructed specific
258 11 Recent Advances in MAS in Major Crops

to different organs, developmental stages or stress responses. Expressed sequence tags (ESTs) are partial, single-pass sequences of more or less randomly chosen cDNA clones from libraries at all stages of plant growth and development. They allow fast and affordable gene identification. Development of EST-based markers is dependent on extensive sequence data of regions of the genome that are expressed. They are highly reproducible and can be directly associated with functional genes. A number of ESTs specific to drought response are now available in the EST database (dbEST). It will be important to resolve to what extent the allelic variation in these genes affects drought tolerance in rice. Hybridisation-based RFLP markers have been developed from ESTs and used extensively for the construction of high-density genetic linkage maps in rice. The genetic factors underlying constitutive and adaptive morphological traits of roots under different water-supply conditions were investigated using RI lines derived from IR1552/Azucena by exploiting the genetic map constructed with EST clones and cDNA-AFLP clones. Two genes for cell expansion, OsEXP2 and endo-1,4-β-D-glucanase EGase, and four cDNA-AFLP clones from root tissues of Azucena were mapped on the intervals carrying the QTL for seminal as well as lateral root length. Robin et al. in 2003 found a candidate gene that was closely linked to QTL for OA. The tight linkage between these candidate genes and the QTL for root traits and OA may indicate a causal relationship. However, further investigation of the role of these genes in stimulated root elongation under water-limited stress in rice is needed before drawing conclusions on which gene underlies the QTL. The candidate genes in these studies were used as radioactive probes, as with RFLP. Development of PCR-based EST markers could be useful in QTL mapping and efficient MAS for drought-resistance improvement in rice. Further, ESTs allow a computational approach to the development of SSRs, for which previous development strategies have been expensive. Pattern-finding programs can be employed to identify SSRs in the ESTs. Readily available EST sequence information allows the design of primer pairs, which can be used to identify length polymorphism among the parental lines.

Concluding Remarks on MAS in Rice for Water-Limited Environments

Managed drought environments in the field, such as dry season trials, delayed planting in the wet season, use of high toposequence locations, drainage, raised beds and large-scale rainout shelters, have been developed to simulate the target environments for breeding. Selection for higher grain yield under managed stress, partly assisted by selection for secondary or integrative traits such as low leaf rolling score, low spikelet sterility and high drought-resistance index, with their moderate to high degrees of heritability, shows promise. Understanding of genotypic responses to drought is increasing. Resistance traits differ under different types of drought (e.g. terminal drought, vegetative stage drought and intermittent drought), but genotypic responses that contribute to drought avoidance (e.g. deep and thick roots and conservative water use by moderate plant size) and maintain higher plant water status are often found to be more important for higher yield under stress than are tolerance mechanisms. Transgenic rice, engineered for enhanced expression of primary induced traits for drought tolerance, has been studied under laboratory conditions, but the usefulness of these lines under field drought conditions remains to be tested. QTLs for constitutive primary traits such as deep roots and plant-type traits such as plant height made a higher contribution to phenotypic expression than QTL for induced traits and were identified across different populations under both well-watered and stress conditions. The QTLs for root traits and plant-type traits, together with QTL for plant water status, were more often co-located with integrated traits such as grain yield under stress. Although it is unlikely that a single primary or secondary trait will improve rice resistance to different types of drought, selection of some of the QTL clusters containing multiple drought-resistance traits is promising.
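
The use of pattern-finding programs to identify SSRs in EST sequences, mentioned in the discussion of EST-based markers above, can be illustrated with a minimal sketch. It is not any specific published tool: the function name find_ssrs and the minimum repeat counts per motif length are assumptions chosen for illustration, and only perfect (uninterrupted) repeats of two to six nucleotide motifs are reported.

```python
import re

# Assumed minimum number of tandem copies for each motif length (illustrative only).
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in a DNA sequence."""
    if min_repeats is None:
        min_repeats = DEFAULT_MIN_REPEATS
    seq = seq.upper()
    hits = []
    for size, min_rep in min_repeats.items():
        # A motif of `size` bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AGAG" or "AA"),
            # so that each repeat is reported once with its shortest motif.
            is_compound = any(
                motif == motif[:k] * (size // k)
                for k in range(1, size)
                if size % k == 0
            )
            if is_compound:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical EST fragment containing an (AG)8 and a (CTT)5 repeat.
    est = "GGC" + "AG" * 8 + "TTCA" + "CTT" * 5 + "G"
    for start, motif, copies in find_ssrs(est):
        print(start, motif, copies)
```

The start positions returned are 0-based. In practice, the flanking sequence on either side of each hit would then be passed to a primer design step, which is outside the scope of this sketch.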
Cotton 259

In spite of the large amount of information on QTL linked to various drought-resistance traits, routine use of these QTLs in MAS is not widely practised. The accuracy of phenotyping in these QTL mapping studies is one concern. Further, use of molecular approaches may be limited because of the need to consider a large number of QTL with individually small effects. The effects that MAS for such QTL will have on plant breeding progress can be estimated by the use of simulation models. Development of near-isogenic lines for these QTL will allow testing of their true agronomic value. Several labs are currently working on MAS introgression of these QTL into locally adapted elite rice lines.

Cotton

Cotton (Gossypium spp.) is an important commercial and natural fibre crop of global importance and generates high employment at various stages. Though synthetic/man-made fibres have made inroads, cotton retains the prime position in cultivation in India. It has been in cultivation in India for more than 5,000 years. Globally, India ranks first in cotton area but occupies second position in production, next to China. Cotton contributes significantly to the Indian economy by earning more than 30% of foreign exchange.

India has the distinction of growing all four cultivated cotton species, namely, Gossypium arboreum, G. herbaceum, G. barbadense and G. hirsutum. Among the four species, the tetraploid (or allopolyploid) species G. hirsutum L. and G. barbadense L. account for 90 and 8% of the world cotton production, respectively. Though India is the major cultivating and consuming country, commercial cotton lint produced in India falls in a narrow fibre quality spectrum, and hence several thousand bales of cotton lint that suit modern textile industries are being imported. Thus, it is imperative to improve the fibre quality of the cotton cultivars.

Conventional breeding methods have contributed much to the development of high-yielding cotton cultivars. But the efficiency of fibre genetic improvement still remains to be resolved due to the negative association between lint yield and fibre quality. The long-term challenge faced by cotton breeders is the simultaneous improvement of yield and fibre quality traits to meet the demands of the cotton cultivators as well as the modernised textile industry. The textile industry is based on measurable quality factors, and often this is the area where technological changes are being rapidly implemented. All the changes in spinning technology require unique and often greater cotton fibre quality, especially strength, for processing. Strong fibres survive the rigours of ginning, cleaning, carding, combing and drafting. Besides fibre strength, fibre length and fibre fineness are the other key qualities that influence textile processing. Usually, G. hirsutum accessions possess high yield, and G. barbadense accessions have superior fibre quality traits. Though considerable progress has been made in the past, the current genetic information and conventional plant breeding methods involving interspecific hybridisation between G. hirsutum × G. barbadense cannot lead to quick improvement of fibre quality. This may be due to the long duration and low selection efficiency involved in such cross combinations. These attempts have also resulted in poor agronomic qualities of the progeny, distorted segregation, sterility, mote formation and limited recombination due to incompatibility between the genomes.

On the other hand, quantitative trait loci (QTL) mapping and marker-assisted selection (MAS) offer new avenues to overcome the above-said limitations. Molecular markers are employed to construct a genetic linkage map, which can be employed to understand the genetic basis of, and to improve, complex polygenic traits such as fibre quality. The identification of markers tightly linked to stable QTL affecting fibre traits across generations would be useful in MAS and thus increase the efficiency of the breeding program. Thus, the identification of DNA markers linked to the fibre quality QTL would allow cotton breeders to trace this very important trait in early plant growing stages or in early segregating generations.

Status of Cotton Molecular Marker Technology

DNA marker technology has enormous potential to improve the efficiency and precision of conventional plant breeding via MAS. The advantage of MAS over conventional breeding is that the selection is simpler than phenotypic selection and can be done at the seedling stage itself (a single plant or even a small leaf sample is enough to predict the gene or QTL of the particular trait). Thus, DNA marker technology provides a valuable tool to plant breeders to select desirable plants directly on the basis of genotype rather than phenotype. Advances in the use of DNA markers to identify QTL and in MAS have shown promise for streamlining plant breeding programs. For example, genetic maps constructed using crosses of upland cotton (Gossypium hirsutum L.) and Egyptian cotton (Gossypium barbadense L.) have led to the identification of several QTLs for fibre strength, fineness and length (e.g. refer Table 11.2).

Molecular Markers and Polymorphism in Cotton

Though modern G. hirsutum and G. barbadense cultivars show significant variation for important traits including fibre production, pest resistance and tolerance to environmental adversities such as heat, cold and drought, these cultivars exhibit narrow genetic diversity. Decrease in genetic diversity is harmful to future breeding programs. Molecular markers are playing a critical and increasing role in the analysis of genetic diversity in cotton cultivars. Wild Gossypium germplasm harbours many valuable traits including disease and insect resistance, stress tolerance and fibre quality attributes. DNA markers used in the construction of genetic maps would be useful in introgression of alien genes into cultivated cotton species. Molecular linkage map construction has been recognised as an essential tool for plant breeding because molecular markers are neutral, lack epistasis and are simply inherited Mendelian characters. Efficient construction of a genetic map requires well-spaced polymorphic markers for the given parents. Hence, selection of a marker system that serves the above purpose is the key step in MAS.

To overcome the paucity of a particular type of DNA marker, genetic maps were developed by incorporating different classes of markers. For example, Lacape and his group have constructed a combined restriction fragment length polymorphism (RFLP)–simple sequence repeat (SSR)–amplified fragment length polymorphism (AFLP) map based on an interspecific G. hirsutum × G. barbadense backcross population of 75 BC1 plants. The map consists of 888 loci ordered into 37 linkage groups and spanning 4,400 cM. This map was updated, mostly with new SSR markers, to contain 1,160 loci that spanned 5,519 cM with an average distance between loci of 4.8 cM. Similarly, SSRs, SRAP, RAPD and retrotransposon–microsatellite amplified polymorphisms (REMAPs) were also employed to construct cotton linkage maps. Due to conservation of genomic regions in cotton, a combination of different types of molecular markers is required to obtain a sufficiently saturated linkage map in cotton. However, use of simple, cost-effective marker types may have promising applications in the Indian scenario. Among the different types of molecular marker system used to study the extent of diversity in cultivated cotton, SSR markers are the best to predict the genetic variation within cultivated diploid and tetraploid cotton.

Simple Sequence Repeats (SSRs) in Cotton

Though several types of DNA markers are available, simple sequence repeats (SSRs) are considered the markers of choice in many crop-breeding activities. SSRs or microsatellites are short, tandemly repeated DNA sequence motifs that consist of two to six nucleotide core units. They are highly abundant in eukaryotic genomes but also occur in prokaryotes at lower frequencies. The regions flanking the microsatellites are generally conserved, and PCR primers specific to the flanking regions are used to amplify SSR-
Table 11.2 Selected examples in QTL mapping for agronomic, yield and fibre quality traits in cotton

Species involved | Population type | QTL reported for | Chromosome number/linkage group | Maximum phenotypic variance observed (%) | References
--- | --- | --- | --- | --- | ---
G. hirsutum × G. barbadense | F2 | Fibre strength | LGD02 | 13.3 | Jiang et al. (1998)
 | | | Chr.20 | 9.7 |
 | | | Chr.22 | 12.0 |
 | | Fibre length | LGD03 | 14.7 |
 | | Fibre thickness | Chr.10 | 12.6 |
 | | Fibre elongation | LGA02 | 14.0 |
 | | | LGD03 | 12.3 |
 | | Earliness | LGD04 | 8.1 |
G. hirsutum × G. hirsutum | RILs | Micronaire | Chr.3, Chr.5, Chr.13 | 13.3 | Wu et al. (2009)
 | | 2.5 % span length | Chr.12, Chr.13, Chr.14, Chr.20 | 38.6 |
 | | Elongation percentage | Chr.14, Chr.20, Chr.26 | 9.7 |
 | | Bundle strength | Chr.5, Chr.9, Chr.12, Chr.16, Chr.20, Chr.26 | 13.7 |
G. hirsutum × G. hirsutum | F2 | Lint percentage | Chr.26 | 87.1 | Jenkins et al. (2010)
G. hirsutum × G. hirsutum | RILs | Boll size | D08 | 35 | Chen et al. (2010)
 | | Lint percentage | D08 | 19 |
G. hirsutum × G. barbadense | BC, F2 | Reniform nematode resistance | Chr.21 | 15 | Gutierrez et al. (2011)
G. hirsutum × G. tomentosum | BC3F2 | Fibre fineness | Chr.14 | 11.9 | Zhang et al. (2011)
G. hirsutum × G. hirsutum | RIL | Fibre strength | Chr.7, Chr.13, Chr.18, Chr.24, Chr.25 | 27.8 | Sun et al. (2012)
 | | Fibre length | Chr.4, Chr.7, Chr.14, Chr.18, Chr.23, Chr.25 | 20.6 |
 | | Micronaire | Chr.3, Chr.4, Chr.5, Chr.7, Chr.14, Chr.16, Chr.19, Chr.25 | 19.1 |
 | | Uniformity ratio | Chr.4, Chr.7, Chr.13, Chr.14, Chr.25 | 13.4 |
 | | Fibre elongation | Chr.4, Chr.7, Chr.13, Chr.14, Chr.15, Chr.18, Chr.25 | 11.5 |

containing DNA fragments. Several methods have been pursued to develop SSR markers in cotton, including analysis of SSR-enriched small-insert genomic DNA libraries, SSR mining from expressed sequence tags (ESTs) and derivation from large-insert BACs by end-sequence analysis or subcloning of SSR-containing BACs. More than 16,000 SSRs have been developed in cotton and made available to the public as of September 2012 (http://www.cottonmarker.org). It is considered that the total pool of SSRs present in the cotton genome is sufficient to satisfy the requirements of extensive genome mapping and MAS. Several SSRs have been assigned to cotton chromosomes by making use of aneuploid stocks. SSRs have been employed to study the extent of genetic diversity among cotton germplasm. Even though a few studies revealed a low level of polymorphism within G. hirsutum genotypes, other studies clearly discriminated the evaluated germplasm and resolved the phylogenetic evolution of Gossypium species.

Cotton Linkage Maps

As in most plant species, the early application of DNA markers in cotton genomic research has been in the form of RFLPs. It is, therefore, not surprising that the first molecular linkage map of the Gossypium species was constructed from an interspecific G. hirsutum × G. barbadense F2 population based on RFLPs by Reinisch et al. in 1994, who assembled 705 RFLP loci into 41 linkage groups with an average spacing between markers of about 7 cM. This map was later extended to span 4,447.9 cM of the cotton genome, comprising 2,584 loci at 1.74 cM intervals and covering all 26 chromosomes of the allotetraploid cottons, representing the most complete genetic map of Gossypium to date. Many of the DNA probes of the map were also mapped in crosses of the D-genome diploid species G. trilobum × G. raimondii and the A-genome diploid species G. arboreum × G. herbaceum. Detailed comparative analysis of the relationship of gene orders between the tetraploid AD subgenomes and the maps of the A and D diploid genomes has revealed intriguing insights on the organisation, transmission and evolution of the Gossypium genomes. Later, an F2 population was derived from a cross between homozygous lines G. hirsutum cv. TM-1 and G. barbadense cv. 3-79 at the USDA-ARS in Texas, and segregation data of 171 F2 individuals of this cross were obtained for 868 genetic markers. These markers have been mapped into 50 linkage groups spanning nearly 5,000 cM of the cotton genome.

A trispecific F2 population was also developed from three different cultivars to study inheritance patterns of segregating loci and to establish linkage groups among three genome species. Besides interspecific linkage maps, intraspecific maps have also been constructed by several researchers to investigate the cotton genome and identify molecular markers linked to agriculturally important genes/QTL. The linkage maps so far constructed in cotton have helped in determining the chromosomal location of many agronomically important characters such as yield, fibre quality, bacterial blight resistance and pubescence, stomatal conductance, verticillium wilt resistance gene and leaf morphology.

QTL Mapping for Yield and Fibre Quality Traits in Cotton

Since most measures of cotton quality and productivity are polygenic, QTL mapping is a high priority of many research programs. Selected noteworthy findings from QTL mapping for yield and fibre quality in cotton are summarised in Table 11.2. Comparison of QTL from these studies revealed poor consistency among populations. Although some QTLs were found to be located on the same chromosomes in different populations, no common markers could prove that they were the same QTL. Only a few stable and common QTLs have been reported up to now due to non-replicated experiments and difficulty in assignment of linkage groups. To identify stable QTL for routine molecular breeding programs, we need to integrate different maps of intraspecific and interspecific populations, and for this it is important to work with a fixed population and a common set of molecular markers.
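
The average marker spacings quoted for the cotton linkage maps in this chapter can be checked with simple arithmetic. The sketch below is only an illustration: the function name average_spacing is hypothetical, and the source does not state which convention was used to compute each reported average. Two common conventions are shown, map length per locus and map length per marker interval (number of loci minus number of linkage groups); for the 2,584-locus map, the interval convention is applied under the assumption of one linkage group per chromosome (26).

```python
def average_spacing(map_length_cm, n_loci, n_linkage_groups=0):
    """Average distance (cM) between adjacent loci on a linkage map.

    With n_linkage_groups=0 the map length is divided by the number of loci.
    Otherwise it is divided by the number of intervals between adjacent loci,
    taken as n_loci - n_linkage_groups.
    """
    intervals = n_loci - n_linkage_groups
    if intervals <= 0:
        raise ValueError("need more loci than linkage groups")
    return map_length_cm / intervals


if __name__ == "__main__":
    # Updated RFLP-SSR-AFLP map: 1,160 loci spanning 5,519 cM (reported average 4.8 cM).
    print(round(average_spacing(5519, 1160), 1))
    # Extended RFLP map: 2,584 loci over 4,447.9 cM on 26 chromosomes (reported 1.74 cM).
    print(round(average_spacing(4447.9, 2584, 26), 2))
```

Under these assumptions the per-locus convention reproduces the reported 4.8 cM and the per-interval convention reproduces the reported 1.74 cM. Such averages hide the uneven marker distribution along linkage groups, so they are only a rough guide to map saturation.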

Specific Challenges in Cotton MAS

Despite the enormous achievements described above, genetic improvement of cotton faces some specific challenges because of its polyploid genome structure, the large genome size and so forth, and they are described hereunder.

Challenges with Mapping Populations

Detection of QTL is often limited by several factors such as genetic properties of QTL, environmental effects, population size and experimental error. Hence, it is desirable to independently confirm QTL mapping studies. Such confirmation studies may involve independent populations constructed from the same parental genotypes or from genotypes closely related to those used in the primary QTL mapping study. Sometimes, larger population sizes may also be used. Furthermore, some recent studies have proposed that QTL positions and effects should be evaluated in independent populations because QTL mapping based on typical population sizes results in a low power of QTL detection and a large bias of QTL effects.

Unfortunately, due to constraints such as lack of research funding and time and possibly a lack of understanding of the need to confirm results, QTL mapping studies are rarely confirmed. Validation of conserved fibre quality QTL across populations has not been conclusive because the majority of these QTL studies were derived from small and mortal populations (F2 or backcross (BC)). As compared to F2 or BC populations, homozygous immortalised recombinant inbred lines (RILs) constitute the preferred material for QTL mapping in many crops. RILs have not been widely utilised in cotton except in some cases, mainly due to long development timelines and difficulties in production of sufficient seeds. Though there is no clear rule for the precise population size that is required for QTL analysis, it is increasingly believed that sampling limited numbers of progeny in mapping studies tends to cause a skewed distribution of QTL effects and identification of a limited number of QTL, even if many genes with equal and small effects actually control the trait. Further, in several published reports, the number of linkage groups exceeds the gametic chromosome number (n = 26), and numerous linkage groups are yet to be associated with specific chromosomes, mainly due to lack of informative markers and use of small sample sizes. Moreover, common identities and common nomenclature have yet to be established among many linkage groups in the laboratory-specific maps. Physical coverage of the cotton genome by these linkage maps also remains unknown. In most of the published maps, the markers were not uniformly spaced over many linkage groups. It is suggested that such regions may be heterochromatin or gene rich. Clusters of markers with very limited recombination were frequently present, which may be indicative of QTL-rich (gene-rich) regions of cotton.

QTL × Environment Analysis

Relatively large numbers of QTL were detected for fibre quality traits, and most of the detected QTLs explained less than half of the total genetic variation. What causes the remaining genetic variation that is unexplained by QTL in large samples? One possibility is that there are many QTLs with very small effects, as assumed in classical models of quantitative genetics, and these remain undetected even with very large sample sizes. Another possibility is higher-order epistatic interactions, which are refractory to QTL mapping. Further, a recurring complication in the use of QTL data is that different parental combinations and/or experiments conducted in different environments often result in identification of partly or wholly nonoverlapping sets of QTL. The majority of such differences in the QTL landscape are presumed to be due to environment sensitivity of genes. Hence, including QTL × environment interaction analysis, which was found to be limited in the published literature, will improve the further progress of QTL mapping towards MAS.
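
The statement above that typical population sizes give low detection power and a large bias of QTL effects can be illustrated with a small simulation. This is a hedged sketch, not a method taken from the studies discussed: the function name simulate_effect_bias, the two-genotype population (as in RILs), the unit residual variance, the one-sided t-like threshold of 3.0 and all numerical settings are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate_effect_bias(n_progeny, true_diff, t_threshold=3.0, n_reps=2000, seed=1):
    """Simulate single-marker QTL detection in a two-genotype population.

    Returns (power, mean_estimate), where power is the fraction of replicates in
    which the marker is declared significant and mean_estimate is the average
    estimated genotype difference among those significant replicates only.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_reps):
        groups = ([], [])
        for _ in range(n_progeny):
            genotype = rng.randrange(2)  # marker class 0 or 1, equal probability
            # Phenotype: genotype effect plus residual noise with unit variance.
            groups[genotype].append(rng.gauss(genotype * true_diff, 1.0))
        low, high = groups
        if len(low) < 2 or len(high) < 2:
            continue
        diff = statistics.fmean(high) - statistics.fmean(low)
        std_err = (statistics.variance(low) / len(low)
                   + statistics.variance(high) / len(high)) ** 0.5
        if diff / std_err > t_threshold:  # one-sided test, assumed threshold
            significant.append(diff)
    power = len(significant) / n_reps
    mean_estimate = statistics.fmean(significant) if significant else float("nan")
    return power, mean_estimate


if __name__ == "__main__":
    for n in (50, 500):
        power, estimate = simulate_effect_bias(n, true_diff=0.4)
        print(n, round(power, 3), round(estimate, 3))
```

With a small population, only replicates that happen to overestimate the effect pass the threshold, so the average reported effect exceeds the simulated true difference, while a larger population detects the QTL more often and with less inflation. This is one reason why confirmation in independent populations is recommended.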

Incongruence Among QTL Studies

The use of stringent statistical thresholds to infer QTL while controlling experiment-wise error rates is another reason for identification of only a small fraction of these nonoverlapping QTL. Small QTL with opposite phenotypic effects might occasionally be closely linked in coupling in early-generation populations and separated only in advanced-generation populations after additional recombination. Comparison of multiple QTL mapping experiments by alignment to a common reference map offers a more complete picture of the genetic control of a trait than can be obtained in any one study. However, the lack of a common set of anchored markers in the published reports limits the comparison of QTL across genetic backgrounds.

Complexities in Integration of Functional Genomics with QTL

Fibre gene function is highly conserved in the genomes of wild and cultivated species, as well as diploid and tetraploid species, despite millions of years of evolutionary history. The phenotypic variation in fibre properties is therefore more likely one of quantitative differences in gene expression as opposed to differences in the genotype at the DNA level. Hence, further studies are required to understand the number of copies of the genes, their regulation and specific function in fibre development. Though systematic transcriptomic approaches can be combined with QTL analyses (discussed below), these studies do not address the occurrence of alternative splicing or the posttranslational modifications of the proteins. In addition, proteins can move in and out of other macromolecular complexes, thus modifying their functionality. This level of complexity cannot be tackled using transcriptomics alone, and hence it is vital to include proteomics in MAS. On the other hand, biochemical functions of only a small proportion of the identified proteins have been demonstrated and/or determined based on the assumption that proteins sharing conserved domains have the same activity. Hence, the remaining proteins (domains of unknown function) remain a challenge for elucidation of their biological function. In addition, quantitative data on the proteome and metabolome are still at an early stage, and protein–protein interactions and interactions of proteins with other macromolecules remain to be revealed. Therefore, complete knowledge of fibre growth and development at the molecular level and its integration with QTL mapping is essential to design next-generation breeding strategies.

Alternatives and Future Perspectives

The value of MAS in routine cotton breeding programs for fibre productivity and quality has been realised in only a few reports. This calls for new insights and improvements in the current methodologies and tools, and the following strategies are proposed for successful MAS in cotton.

Meta-analysis of QTL: Synergy Through Networks

Though QTLs for several common traits were mapped, direct comparisons cannot be conducted since no common markers existed among these studies. Detected QTLs are confined within a family, the sizes of QTL effects that can be detected are limited, and inferences are restricted to a single population and set of conditions. Thus, one direction for QTL analysis is to combine information from several or many studies by meta-analysis. Integration of QTL from different populations into a common map facilitates exploration of their allelic and homoeologous relationships, though the level of resolution is limited by comparative marker densities, variation in recombination rates in different crosses, variation in gene densities across the genome and other factors. Using a high-density reference genetic map consisting of 3,475 loci in total, Rong and his team reported alignment of 432 QTL mapped in one diploid and ten tetraploid interspecific cotton populations, depicted in a CMap resource. Similarly, Lacape group conducted meta-analysis

of more than 1,000 QTLs obtained from the RIL and BC populations derived from the same parents and reported consistent meta-clusters for fibre colour, fineness and length. As per their discussion, although their result on cotton fibre can hardly support the optimistic assumption that QTLs are accurate, they have shown that the reliability of QTL calls and the estimated trait impact can be improved by integrating more replicates in the analysis. Hence, it is imperative to verify the regions of convergence with new maps which share common markers with the consensus map.

Map-Based Cloning

As QTL mapping results accumulate over the next years, attention will turn to cloning QTL and then to using them. This requires higher resolution of QTL mapping, combined with a dense marker map. A centimorgan (cM), corresponding to a recombination frequency of 1%, can span 10–1,000 kbp and can vary across species or even within a chromosome of a given species. This region may contain both desirable and undesirable genes, and hence to avoid the linkage drag of undesirable traits, it is important to establish the causal relationship between the QTL and phenotype using positional or map-based cloning. The physical size of a cM in cotton is not prohibitive to map-based cloning, but the lengthy genetic map will require a large number of markers in order to be sufficiently close to most genes for chromosome walking. A new high-throughput marker type, SNPs, is gaining importance in this context, but the huge initial investment needed for its generation necessitates simple, innovative and economical marker techniques. It is also important to note that instead of using anonymous DNA markers, development and use of gene-specific functional markers such as SRAP, TRAP and PAAP (see chapter 3) may increase the efficiency of map-based cloning.

Further, map-based cloning in polyploids such as cotton introduces a new technical challenge not encountered in diploid (or highly diploidised) organisms, namely, that virtually all single-copy DNA probes occur at two or more unlinked loci. This makes it difficult to assign megabase DNA clones to their site of origin. One possible approach to this problem is the utilisation of diploids in physical mapping and map-based cloning.

Cotton Genome Sequencing

Decoding cotton genomes will be a foundation for improving understanding of the functional and agronomic significance of polyploidy and genome size variation within the Gossypium genus. The whole-genome shotgun sequence of the smallest Gossypium genome, G. raimondii, provided fundamental information about gene content and organisation. This sequence will be used to query homologous and orthologous genomes and to investigate the gene and allele basis of phenotypic and evolutionary diversity for cotton improvement. A good parallel approach may be to search for candidates in species that have naturally superior fibre qualities. Sequencing of the G. raimondii genome established the critical initial template for characterising the spectrum of diversity among the eight Gossypium genome types and three polyploid clades and provided a reference for sequencing many genomes in Gossypium species, which is essential for further improvement of cotton.

Advances in Functional Genomics

Several studies performed to compare the structural differences in the genomes have shown that the difference is in the expression pattern rather than in the presence or absence of particular genes. The comparison of gene expression profiles between contrasting genotypes with respect to fibre quality can be extended to transcription profiling at the QTL level, and the genes identified at such QTL may potentially be better candidates for superior fibre quality. In addition to cDNA and oligonucleotide microarrays, tiling path arrays can also be used to study gene expression in plants. The advantage of tiling path arrays over conventional microarrays is that they are not

tied to the gene structure and hence provide operates within the cell. A complete elucidation
unbiased and more accurate information about of the genotype-phenotype map does not seem to
the transcriptome. In addition, they provide knowl- be feasible unless we can include all possible
edge on transcriptional control at the chromo- causal variables in the network-inference meth-
somal level. The use of tiling path arrays could odology. One has to take a global perspective
help to provide better understanding on the fibre on life processes instead of individual compo-
transcriptome at the genome-wide level, and it nents of the system. The network approach con-
is yet to be tried in cotton. This will result into necting all these subdisciplines indicates the
a paradigm shift from MAS to genomics-assisted emergence of a system quantitative genetics.
selection.

Association Mapping and Alternatives


System Quantitative Genetics:
Bridging Subdisciplines Association mapping provides another route to
identifying QTLs that have effects across a
The ultimate objective of QTL mapping is to broader spectrum of germplasm, if false positives
identify the causal genes or even the causal that are caused by population structure can be
sequence changes, the quantitative trait nucleotides minimised. In addition, QTL mapping in biparen-
(QTNs). While this remains a major challenge, tal populations reveals only a slice of the genetic
it has been achieved in a few instances in other architecture for a trait because only alleles that
crops. Identification of candidate genes and differ between the two parental lines will segre-
enrichment of functional markers within small gate. Therefore, more comprehensive analyses
targeted genomic regions are driven by the increas- of genetic architecture require consideration of
ing availability of sequence resources, genomic multiple populations that represent a larger
databases and by technological developments. sample of the standing genetic variation in the
If functional candidate genes for a trait are not species. An important genetic resource developed
known, co-location of candidate gene polymor- in recent years is the construction of nested asso-
phisms with map positions, linkage to QTL, ciation mapping (NAM) population. The NAM
association of alleles with specific traits or the population is a novel approach for mapping genes
identification of syntenic regions among genomes underlying complex traits, in which the statistical
can help to select positional candidate genes for power of QTL mapping is combined with the
the trait. In another approach called genetical high (potentially gene-level) chromosomal reso-
genomics, gene expression profiles are quantita- lution of association mapping, and it has been
tively assessed within a segregating population, adapted in maize (see chapter 6). Although
and expression quantitative trait loci (eQTL) sufficient diversity must be present in each asso-
can be mapped like classical QTL (see chapters 7 ciation mapping panel, too much phenotypic
and 10). Though global eQTL mapping studies, diversity (or poor adaptation to any specific grow-
using whole-genome microarrays, have been ing environment) may make it difficult to pheno-
published in yeast, Arabidopsis, maize and type a panel in an association study. Thus, more
eucalyptus, it is in preliminary stage in cotton. In region-specific association mapping panels may
addition, a comparative picture of transcript ver- need to be created that contain germplasm more
sus protein abundance indicates that functionally suited to specific growing regions.
important changes in the levels of the former are
not necessarily reflected in changes in the levels of
the later. It also holds good for metabolomes too. Improved Databases
Hence, genes, proteins, metabolites and pheno-
types should be considered simultaneously to There is a great need to expand bioinformatic infra-
unravel the complex molecular circuitry that structure for managing, curating and annotating
the cotton genomic sequences that will be generated in the near future. The cotton genome sequence and functional genomics database of the future should be able to host and manage cotton information resources using community-accepted genome annotation, nomenclature and gene ontology. Some existing databases may be upgraded to effectively handle a large amount of data flow and community requests, but additional resources will be sought to support key bioinformatic needs.

Concluding Remarks for MAS in Cotton

Significant strides have been made, particularly in characterising phenotypic and molecular diversity in the cotton germplasm and in identifying QTL linked to fibre productivity and quality. Yet the application of molecular marker-assisted breeding tools to accelerate gains in cotton productivity has barely begun, and there is vast potential and need to expand the scope and impact of such innovative breeding programs. Progress in this direction will be further enhanced by bringing in the information generated through omics studies. Further, as discussed above, innovative strategies, resource pooling and capacity building to deploy marker-assisted breeding in cotton will eventually lead to the development of cotton cultivars with improved productivity and quality.

Mungbean

Pulses are important protein resources that help meet the nutritional requirements of poor people living in developing countries. Among them, mungbean (Vigna radiata (L.) Wilczek) is one of the most widely cultivated species throughout the southern half of Asia, particularly in rainfed areas. It is adapted to short growth duration, low water requirements and nutrient-deficient soils or poor soil fertility. It is popularly grown as a component in various cropping systems because of its ability to fix nitrogen in association with soil bacteria, early maturity (approximately 60 days) and relative drought tolerance. It is a self-pollinating diploid plant with 2n = 2x = 22 chromosomes and a genome size of 515 Mb/1C.
Despite its importance in the poor man's food basket, mungbean genomic research has lagged behind that of other crop species due to a lack of polymorphic DNA markers. Only a limited number of polymorphic SSR markers, the marker of choice, have been published for mungbean. Therefore, identifying and developing polymorphic SSR motifs of mungbean is an important requirement for mungbean development. Similarly, single-nucleotide polymorphisms are the most frequently found variation in DNA and are valuable markers for high-throughput genetic mapping, analysis of genetic variation and association mapping studies in crop plants. Several methods have been described for SNP detection, such as high-throughput sequencing technologies and EcoTILLING. However, the discovery of SNP markers based on transcribed regions has become a common application in plants because of the large number of ESTs available in databases, and EST-SNPs have been successfully mined from EST databases in non-model organisms.
A transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA and non-coding RNA, produced in one cell or a population of cells. Although the analysis of relative mRNA expression levels might be complicated by the fact that relatively small changes in mRNA expression can produce large changes in the total amount of corresponding protein present in the cell, a number of organism-specific transcriptome databases have been constructed and annotated to aid in identifying genes that are differentially expressed in distinct cell populations or subtypes. Unlike genome analysis, transcriptome analysis offers a full profile of gene function information under various conditions, and it differs with dissimilar environments, cell types, developmental stages and cell states. It has repeatedly been shown that transcriptome or EST sequencing is an efficient way to generate functional genomic level data for non-model organisms.
Interestingly, some of the studies have focused on the analysis of transcriptomic functions and
investigation of SSR and SNP markers in mungbean. Such studies can support a clear understanding of the transcriptomic functions in mungbean and can provide resource data for crop improvement programs. Next-generation transcriptome sequencing will serve as a superior resource for developing polymorphic DNA markers, not only because of the enormous quantities of sequence data in which markers can be discovered but also because the discovered markers are gene-based. Such markers are advantageous because they facilitate the detection of functional variation and selection in genomic scans or genetic association studies in mungbean. A large number of SSRs and SNPs are now available, and they are potentially useful for multiple applications ranging from population genetics, linkage mapping and comparative genomics to gene-based association studies.

Genetic Diversity and Linkage Mapping in Mungbean

A large collection of mungbean germplasm encompassing 415 cultivated (V. radiata var. radiata), 189 wild (V. radiata var. sublobata) and 11 intermediate accessions from diverse geographic regions has been characterised using 19 azuki bean SSRs. The results revealed that mungbean has the highest diversity in South Asia, supporting the view of its domestication in the Indian subcontinent, and showed that Australia and Papua New Guinea are centres of diversity for wild mungbean. A core collection of 106 accessions representing the most genetically diverse of this germplasm has been assembled. Despite the work carried out on the Fabaceae, research into mungbean genetics and evolution is not as advanced as in many other species.
Several linkage maps of mungbean have been constructed (e.g. Menancio-Hautea et al. 1992; Lambrides et al. 2000; Humphry et al. 2002) upon which most marker research into this crop has been based, but they do not provide the same level of genome saturation seen in many other species, mainly due to the reason mentioned above. These maps were constructed from the data of F2 or RIL populations from inter-subspecific crosses of VC3980 (cultivated) × TC1966 (wild, from Madagascar) or Berken (cultivated) × ACC41 (wild, from Australia) using mainly RFLP and/or random amplified polymorphic DNA (RAPD) markers. The population size ranged from 58 to 80 plants. The maps differ in length (737.9 to 1,570 cM), number of markers (102 to 255), number of linkage groups (LGs) (12 to 14) and level (12 to 30.8%) and regions of marker distortion. The most comprehensive map consists of 255 loci with an average distance between adjacent markers of 3 cM. However, most of the maps do not resolve 11 LGs, which is the haploid chromosome number of mungbean. To resolve 11 LGs and saturate the map, many more markers are needed. In addition, the genome coverage of the markers has yet to be determined.

QTL Mapping in Mungbean

QTLs for several traits encompassing azuki bean weevil resistance, seed colour, seed weight, hard-seededness, powdery mildew resistance and Cercospora leaf spot resistance were mapped with molecular markers in mungbean. Among them, QTL linked to bruchid, Cercospora leaf spot and yellow mosaic virus resistance are of importance for genetic improvement of this crop, and they are highlighted here. The bruchid-resistance gene (Br) has already been mapped using an F2 population from a cross between a resistant line, TC1966, and a susceptible cultivar. Br is located on linkage group 9 of the current mungbean linkage map. Mungbean has a relatively small genome size, ranging from 470 to 560 Mb. The current estimated genetic size of the mungbean genome is about 1,570 cM. The small genome size of mungbean may allow us to apply a map-based cloning strategy to isolate the resistance gene. Cloning of the Br gene would aid not only the elucidation of the synthetic pathway of the resistance factor(s) but also the development of transgenic plants harbouring resistance against a wide spectrum
of insect pests. In another study, molecular markers tightly linked to the resistance locus were reported following the construction of a high-resolution linkage map.
Cercospora leaf spot (CLS), caused by the fungus Cercospora canescens Ellis and Martin, is a serious disease in mungbean that can reduce seed yield by up to 50%. The QTL analysis was conducted using F2 (KPS1 × V4718) and BC1F1 [(KPS1 × V4718) × KPS1] populations developed from crosses between the CLS-resistant mungbean V4718 and the CLS-susceptible cultivar Kamphaeng Saen 1 (KPS1). The results of segregation analysis indicated that resistance to CLS is controlled by a single dominant gene, while composite interval mapping consistently identified one major QTL (qCLS) for CLS resistance on linkage group 3 in both F2 and BC1F1 populations. qCLS was located between markers CEDG117 and VR393 and accounted for 65.5 to 80.53% of the disease score variation, depending on seasons and populations. An allele from V4718 increased the resistance. The SSR markers flanking qCLS will facilitate transfer of the CLS resistance allele from V4718 into elite mungbean cultivars.
At present, mungbean yellow mosaic virus (MYMV) is the most important disease of mungbean all over the world. The disease is characterised by yellow mosaic on leaves of infected plants that results in considerable yield losses. It is caused by a bipartite begomovirus, which is transmitted by whiteflies (Bemisia tabaci). Lambrides and his group tagged the resistance gene from NM92 in two RIL populations using a BSA strategy. A marker generated from RAPD primer OPAJ20 was found to be distantly linked with the resistance gene. Inter-simple sequence repeat (ISSR) and SCAR markers linked to the resistance in blackgram have shown potential for locating the gene in mungbean. Lambrides and Godwin suggested that the mungbean probe Mng247, associated with soybean mosaic virus resistance, might be useful in identifying the MYMV resistance gene. In addition, the Mng247-derived SSR marker, M3Satt41, may also be useful in this regard.

Legume Comparative Genomics and Its Importance in Mungbean MAS

Economically, legumes represent the second most important family of crop plants after Poaceae (grass family), accounting for approximately 27% of the world's crop production. On a worldwide basis, legumes contribute about one-third of humankind's protein intake, while also serving as an important source of fodder and forage for animals and of edible and industrial oils. One of the most important attributes of legumes is their unique capacity for symbiotic nitrogen fixation, underlying their importance as a source of nitrogen in both natural and agricultural ecosystems. Legumes also accumulate natural products (secondary metabolites) such as isoflavonoids that are beneficial to human health through anticancer and other health-promoting activities.
The legumes are highly diverse and include several economically important crops such as soybean (Glycine max), peanut (Arachis hypogaea), mungbean (Vigna radiata), chickpea (Cicer arietinum), lentil (Lens culinaris), common bean (Phaseolus vulgaris), pea (Pisum sativum) and alfalfa (Medicago sativa). Despite their close phylogenetic relationships, crop legumes differ greatly in their genome size, base chromosome number, ploidy level and self-compatibility. Nevertheless, earlier studies indicated that members of the legumes exhibited extensive genome conservation based on comparative genetic mapping. Unlike many of the major crop legumes, M. truncatula and Lotus japonicus (selected as model systems for studying legume genomics and biology) are of small genome size, amenable to forward and reverse genetic analyses, and well suited for studying biological issues important to the related crop legume species.
An immediate goal of legume genomics is to transfer knowledge between model and crop legumes. Accordingly, an in-depth understanding of conservation of genome structure among legume species is a prerequisite to achieving this goal. The idea that conserved genome structure can facilitate transfer of knowledge among related plant species is best addressed in grasses, in which
genome macrosynteny and microsynteny have been extensively maintained.
It has been demonstrated that mungbean and cowpea (Vigna unguiculata) exhibit a high degree of linkage conservation, although chromosomal rearrangements have occurred since the divergence of the two species. Comparative mapping among mungbean, common bean and soybean in the Phaseoleae tribe indicated that mungbean and common bean linkage groups were highly conserved, but synteny with soybean was limited to short linkage blocks. Use of a bridging species (soybean) revealed that homoeologous segments of soybean chromosomes showed a higher degree of synteny with chromosomes of common bean and mungbean than previously thought.
Comparative mapping in mungbean and a distantly related legume crop, lablab, gave surprising results in that the two species share several large conserved genome blocks, as indicated by similar marker orders and LGs. However, the results also showed genome rearrangements and many deletions/duplications after divergence.
By contrast, macrosyntenic relationships between M. truncatula and Phaseoloid legumes were more complicated and less informative. Twenty-nine of the 38 (approximately 76%) markers mapped between M. truncatula and mungbean revealed evidence of conserved gene order, whereas the remaining markers mapped to nonsyntenic positions. Despite these limitations, it is proposed that a comprehensive analysis of legume comparative genomics may in future help to genetically improve mungbean via MAS.

Concluding Remarks for MAS in Mungbean

Although some progress in genome research has been made in mungbean, it is still far behind the other major legume crops such as soybean, cowpea and common bean or even their less important relative, azuki bean. The current genetic linkage maps of mungbean are not yet at a detailed level, and hence dense or saturated maps with 11 LGs resolved for this crop are needed. A major obstacle to achieving such maps is the lack of high-throughput SSR and SNP markers (however, some progress has been made to this end; see above). As indicated above, genome study in mungbean has been made possible by using genetic markers from other related legumes, and this trend will continue since only limited genetic resources are available for further study in mungbean. For example, SSRs from azuki bean, common bean and cowpea will be useful in the development of a mungbean linkage map with 11 LGs resolved, as in the case of blackgram. Moreover, the information obtained from sequencing of the soybean genome, common bean ESTs and the gene space of cowpea, M. truncatula and Lotus japonicus can be used to create high-throughput genetic markers for mungbean. In addition, a database of thousands of cowpea gene space sequences containing SSRs is now publicly available. In silico development of cowpea SSRs and application of those markers in mungbean are also of interest. With many genomic tools and resources for legumes becoming increasingly available, a more detailed and in-depth genome mapping of mungbean will be possible in the near future. One such study has already been reported (Isemura et al. 2012). The genetic differences between mungbean and its presumed wild ancestor were analysed for domestication-related traits by QTL mapping. A genetic linkage map of mungbean was constructed using 430 SSR and EST-SSR markers from mungbean and its related species, and all these markers were mapped onto 11 linkage groups spanning a total of 727.6 cM. This mungbean map was the first map in which the number of linkage groups coincided with the haploid chromosome number of mungbean. In total, 105 QTLs and genes for 38 domestication-related traits were identified using this map.
Another challenge for mungbean genome researchers is the development and establishment of a more efficient protocol of genetic transformation to support breeding work, as the use of transgenic technology is inevitable for mungbean in the future. The technology will be helpful in the development of cultivars with resistance to serious insect pests and tolerance to adverse environments for which no effective gene source exists in their gene pool
such as legume pod borers and drought and other abiotic stresses.

Tomato

Tomato (Solanum lycopersicum L., syn. Lycopersicon esculentum Mill.) is considered to be one of the most economically important crops in the world. Tomatoes are juicy berry fruits of the nightshade family (Solanaceae) and came originally from Central and South America. They are nutritious vegetables that provide good quantities of vitamins A and C as well as essential minerals and other nutrients. Furthermore, fresh and processed tomatoes are the richest sources of the dietary antioxidant lycopene, which arguably protects cells from oxidants that have been linked to cancer. Tomato is also a source of other compounds with antioxidant activities, including chlorogenic acid, plastoquinones, rutin, tocopherol and xanthophylls.
Economically speaking, tomatoes are of great value because of their high yields. Tomatoes are also one of the main ingredients in hundreds of dishes and products that are sold in supermarkets throughout the developing and developed world, which means that the demand for tomatoes is extremely high. The production of tomatoes is ranked first in India, where tomato production dominates the activity of small business owners and farmers. They favour tomato production because of its high monetary value, as it makes up a very large part of their income.
Tomatoes are also a popular choice for people who wish to grow fruits and vegetables in their own gardens. Not only can they be used raw in salads, but they are also an essential part of many recipes as well as many products such as tomato ketchup and chutney. They can be grown both indoors in greenhouses and outdoors, although tomatoes that are grown outside tend to have higher nutrient contents than those grown in greenhouses. Tomatoes have many advantages over other types of vegetable crops: (1) their high yield results in high economic value; (2) they have very high nutritional value, with high levels of pro-vitamin A and vitamin C, and rank first in their nutritional contribution to the human diet; (3) they are a short-duration crop; and (4) they are very well suited to different cropping systems involving grains, pulses, cereals and oilseeds.
There are over 200 documented diseases of cultivated tomato, and they seriously affect fruit yield. Growers usually employ an integrated pest/disease management strategy including both cultural practices and pesticide use to combat the damage caused by these pathogens. An example of a cultural practice is the use of netting over tomato plants, which provides a physical barrier that can be effective in excluding disease-bearing insects from the crop.

Conventional Breeding and Tomato Improvement

Conventional breeding efforts in tomato date back to the 1930s, when breeding for improvement of the overall horticultural characteristics of tomato started. As market demand developed for more specific traits desired by the fresh-market or processing tomato industry, breeding objectives became more specialised, and by the 1950s, improved varieties were developed for either processing or fresh-market uses through selection of the best phenotypes. Despite a significant contribution to genetic improvement, conventional breeding has several potential inherent difficulties, including limitations in the availability of screening environments, reduced response to selection for traits with low heritability or recessive expression, the length of growing time before trait evaluation can be conducted, genetic linkage drag, the need to use large populations and thus large space, and concerns regarding genotype by environment (G × E) interactions. Furthermore, in some cases, breeders are unable to fully characterise or utilise the genetic information available in wild germplasm or breeding populations via phenotypic screening.
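
The reduced response to selection for traits with low heritability, listed among the difficulties of conventional breeding, can be made concrete with the standard breeder's equation of quantitative genetics, R = h2 × S, where R is the expected response per cycle of selection, h2 the narrow-sense heritability and S the selection differential. The equation is not given in this chapter; the following is a minimal Python sketch in which the function name and all numbers are hypothetical values chosen only for illustration.

```python
def response_to_selection(h2: float, selection_differential: float) -> float:
    """Breeder's equation R = h2 * S (illustrative helper, not from the source)."""
    if not 0.0 <= h2 <= 1.0:
        raise ValueError("narrow-sense heritability must lie in [0, 1]")
    return h2 * selection_differential


# Hypothetical example: the selected parents exceed the population mean
# by 10 trait units (S = 10). Only the heritability changes between rows.
S = 10.0
for h2 in (0.1, 0.5, 0.8):
    gain = response_to_selection(h2, S)
    print(f"h2 = {h2:.1f} -> expected gain per cycle = {gain:.1f} units")
```

Under these assumed values, a trait with h2 = 0.1 is expected to gain only one-eighth as much per cycle as a trait with h2 = 0.8 for the same selection differential, which is one way to see why phenotypic selection alone is slow for low-heritability traits.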

Biotechnology and Tomato Breeding

Advances in DNA technology since the 1950s have revolutionised tomato breeding. There are two areas in biotechnology that have an immediate effect on tomato breeding: (1) transgenic technology and (2) marker-assisted selection (MAS). Despite numerous research studies on transgenic approaches against plant diseases, there are currently no, or in some countries very few, transgenic tomato varieties available to the grower that are resistant to any pathogens. Further, there remains an issue of public resistance, which, combined with the high cost of obtaining regulatory approval, has effectively prohibited this promising technology from being used in commercial tomato cultivation.
Thus, MAS has proven potential in tomato breeding for genetic improvement of several important economic traits such as pest and disease resistance, quality improvement and nutrient enhancement. With the advent of molecular markers and genetic maps, there has been increased interest in using marker technology to facilitate tomato crop improvement. Tomato was among the first crop species for which genetic markers and maps were developed and utilised for breeding purposes (Tanksley et al. 1992). Molecular markers and MAS can potentially overcome at least some of the limitations associated with conventional breeding involving phenotypic selection. A major advantage of DNA markers is that they are phenotypically neutral, that is, they do not have any pleiotropic effect on the phenotype, nor are they influenced in their segregation and inheritance by the growing conditions of the plant. Furthermore, molecular markers can be detected at any growth stage, offering the possibility of selecting plants at the convenience of the breeder, in contrast to the season-bound nature of conventional selection. With the availability of molecular markers distributed throughout the tomato genome, many tomato genetic maps have been developed, including the high-density linkage map of tomato based on a S. lycopersicum × S. pennellii cross (refer to Foolad et al. 2008 for a list of tomato genetic maps).
As discussed in chapter 8, successful application of MAS depends on several factors. A major concern in the use of molecular markers for breeding purposes in tomato is the low frequency of marker polymorphism within breeding populations, as shown in several reports. Most genetic maps of tomato are based on interspecific crosses between the cultivated and related wild species of tomato, where marker polymorphism is abundant. This is of particular concern when the wild species is only distantly related to the cultivated tomato, such as S. pennellii, which has been used for the construction of the high-density molecular linkage map of tomato. However, as shown in the rice case study, most tomato-breeding populations are based on intraspecific crosses within the cultigen or crosses between the cultivated and closely related wild species such as S. pimpinellifolium. In such populations, there is much less marker polymorphism than in wide crosses. Thus, efforts must be made to identify markers with a higher rate of polymorphism in breeding populations. Further, markers must be high throughput and economically affordable to justify their use in large populations. Finally, linkage between the gene or QTL of interest and the genetic marker must be tight enough to avoid unwanted crossing over, which may result in false positive selection. In this regard, the best genetic markers are those that are within the gene of interest. Due to the low genetic diversity within the tomato cultigen, new marker technologies that can detect minor genetic variation are being leveraged for marker discovery and tomato variety development. Among the marker classes, SNPs have become the marker of choice for numerous reasons. First, SNPs are more plentiful than other marker types. Second, high-throughput TaqMan-based SNP assays can be developed for large-scale genotyping and relatively easy data analysis. Third, TaqMan-based SNP genotyping is cheaper than other protocols when larger numbers of samples are involved. Furthermore, a newer technology that is emerging and is being employed by some public and private tomato researchers is genotyping by sequencing (GBS). This technology is becoming more feasible due to the reduced cost and the fact
that large numbers of polymorphic SNPs are normally discovered between genotypes (often on the order of hundreds of thousands). With the completion of the tomato reference genome sequence, localising SNPs identified by GBS to specific physical locations is becoming an easy task.
Tomato was one of the first crops for which molecular markers were suggested as indirect selection criteria for breeding purposes (reported as early as 1974; refer to Foolad and Panthee 2012 for an excellent review of tomato breeding using MAS). The actual use of MAS in tomato breeding began approximately three decades ago with the use of the isozyme marker acid phosphatase (Aps-11 locus) as an indirect selection criterion in breeding for nematode resistance. This isozyme marker is still being used in many private and public tomato-breeding programs for selecting for nematode resistance. However, more recently, with the development of new molecular markers and maps in tomato, MAS has become a routine practice in many tomato-breeding programs, in particular in the private sector, for several purposes including the following three.
First, MAS is often used to assess hybrid purity from overseas production by screening seed lots with a panel of molecular markers. The technologies used for this purpose vary widely; SNPs are leveraged regularly, PCR-based markers are employed routinely, and in some cases, even well-known isozyme markers are recruited.
Second, when reliable markers closely linked to resistance genes (or specific fruit quality loci) are known, MAS is used effectively for quick germplasm screening for disease resistance or fruit quality. Often, a panel of linked markers is used on individual selections or pools of seed or tissue from early-generation populations to index breeding populations. This aids breeding efforts by informing the breeder about which disease resistances or fruit quality traits are segregating or fixed in a given population. However, organism screening may often still be required to verify the results of MAS and to validate linkage (or lack thereof) between markers and the trait(s) of interest.
Third, MAS is employed for marker-assisted backcrossing (MAB; refer to chapter 8) after reliable linkages between markers and simple traits of interest are discovered. Such traits include, but are not limited to, disease resistance, fruit colour and carotenoid content (e.g. lycopene and β-carotene), fruit ripening-related traits (various genes including Rin and Nr), jointless pedicel (j2) and extended field storage (EFS; using various genes including Alcobaca and Long Keeper). It appears that for many simple disease-resistance traits in tomato, MAS is not only faster than conventional selection but also cheaper and more effective. In tomato, genes for resistance to over 35 pathogens have been identified and mapped. It is assumed that MAS is currently employed routinely in the tomato seed industry for selecting for several qualitative disease-resistance traits, including fusarium wilt races 1, 2 (with some difficulty) and 3, late blight (Ph-3 and possibly Ph-2), verticillium wilt race 1, bacterial spot (Rx3 and Rx4), tomato spotted wilt virus (Sw5), tomato yellow leaf curl virus (Ty1, Ty2, Ty3 and Ty4) and root-knot nematode. As an example, the detailed MAS work for genetic improvement of tomato for bacterial spot and TYLC virus resistance is discussed below (see Foolad and Panthee 2012 for references and other details).

MAS for Bacterial Spot Resistance

Bacterial spot, a common disease of tomato throughout the world and particularly in tropical and subtropical regions, is caused by four species and five races of Xanthomonas, including X. euvesicatoria (race T1), X. vesicatoria (race T2), X. perforans (races T3, T4 and T5) and X. gardneri (race T2). Among these, X. perforans is the predominant species. Bacterial spot affects leaves, stems and fruit and causes defoliation, fruit lesions and reduced yield. Chemical control of this disease has not been very effective due to the presence of multiple sources of inoculum and the development of chemical resistance in the pathogen. Sources of host genetic resistance to bacterial spot have been identified in S. lycopersicum (e.g. Hawaii 7998 and Hawaii 7981), S. lycopersicum var. cerasiforme (PI 114490) and the related wild species S. pimpinellifolium (PI 126932 and PI
128216) and S. pennellii (LA 716). However, the presence of multiple species and races of the
pathogen, as well as the complex nature of host genetic resistance, has made bacterial spot
resistance breeding in tomato very challenging. While most resistance sources seem to be
race-specific, some resistant genotypes interact with multiple races of the pathogen and exhibit
a quantitative response. For example, the breeding line Hawaii 7998, the most reliable source of
resistance to race T1, exhibits reduced disease symptoms in the field and a hypersensitive
response (HR) to T1 in the greenhouse. Three QTLs/genes, Rx-1 (chromosome 1), Rx-2 (chromosome 1)
and Rx-3 (chromosome 5), were reported to be independently associated with HR in the greenhouse
using a population derived from crosses between Hawaii 7998 and S. pennellii accession LA 716.
The RFLP markers associated with these genes, however, are based on S. pennellii LA 716 and thus
are not polymorphic in most breeding populations, limiting their utility for MAS breeding. The
Rx-3 locus was subsequently confirmed to provide HR as well as field resistance in advanced
backcross populations derived from a cross between Hawaii 7998 and processing breeding line
OH 88119 (susceptible), and markers linked to Rx-3 were also reported, including a CAPS marker
that has been used for MAS breeding. Breeding line Hawaii 7981 provides an HR-based resistance to
race T3 of the pathogen and is considered the strongest source of resistance to this race under
both greenhouse and field conditions. This resistance is controlled by a single gene, Xv-3, which
is mapped to tomato chromosome 11. In another study, using a population derived from OH 88119 and
PI 128216 (a resistant accession of S. pimpinellifolium), markers associated with race T3
resistance were identified in the same location as Xv-3 on chromosome 11, and the resistance gene
was designated as Rx-4. SSR and SNP markers associated with Rx-4 have been identified.
S. pennellii accession LA 716 exhibits HR to race T4, conferred by the resistance gene Xv-4, which
originally was mapped to tomato chromosome 3. Another bacterial spot resistance gene, Bs-4, was
discovered in cv. Moneymaker and mapped to the short arm of chromosome 5. Furthermore, the
S. lycopersicum var. cerasiforme accession PI 114490 (yellow cherry tomato) has shown field
resistance to multiple races of the pathogen. This resistance seems complex as it may be
conferred by different genes in response to different races of the pathogen. However, in a
mapping study using this accession, a major QTL was identified on chromosome 11, which may confer
resistance to races T1, T2, T3 and T4. In addition, QTLs associated with race T4 of bacterial spot
were identified on chromosome 3 (PVE = 4.8%) and 11 (PVE = 29.4%) in inbred backcross populations
developed from PI 114490, OH 9242 and Fla 7600. In a different study, two RAPD markers associated
with bacterial spot resistance were reported, where the markers were originally derived based on
a resistance gene (Bs-2) in pepper. In this study, an F2 population of pepper from a cross
between Early Calwonder (bs1/bs1 bs2/bs2 bs3/bs3) and Early Calwonder 20R (bs1/bs1 Bs2/Bs2
bs3/bs3) was employed to identify recombinants, which subsequently were used to identify the gene
sequence and design primers for screening for the Bs-2 gene in tomato.

In summary, the available molecular markers associated with different bacterial spot resistance
genes or QTLs are expected to be useful for pyramiding resistance from different sources via MAS,
providing a strong and durable resistance to tomato bacterial spot. However, because of the
complexities of the pathogen and host resistance, it may be necessary to combine MAS with field
disease screening to confirm the presence of strong resistance.

MAS for Tomato Yellow Leaf Curl Virus Resistance

Tomato yellow leaf curl virus (TYLCV), a monopartite geminivirus transmitted by whitefly, causes
a serious disease of tomatoes in tropical and subtropical regions of the world. Genetic sources
of resistance have been identified in the tomato wild species S. pimpinellifolium, S. peruvianum,
S. cheesmanii, S. habrochaites and S. chilense and used to study the genetic control of
resistance. Due to the very destructive nature of this disease
in certain tomato growing regions, intensive breeding efforts have been devoted to developing
TYLCV-resistant cultivars, mostly in private seed companies. Traditional breeding has resulted in
the development of cultivars with reduced susceptibility, but no cultivar with complete
resistance to TYLCV is available. In addition, the disease response of the resistant cultivars
often varies from location to location, and it has been difficult to develop resistant cultivars
with horticultural characteristics similar to those of susceptible ones. Thus far, four
resistance loci, Ty-1, Ty-2, Ty-3 and Ty-4, have been identified and mapped to tomato chromosomes
6, 11, 6 and 3, respectively. Several QTLs conferring resistance to TYLCV have also been
identified. At least six PCR-based molecular markers associated with the major resistance genes
have been developed and reported. However, the lack of consistent genetic markers associated with
TYLCV resistance has hindered the utility of MAS for this trait. In addition, since TYLCV is
considered a dangerous pathogen, screening germplasm for resistance as well as validation of any
genetic marker has been challenging.

MAS for Other Economic Traits

As for quantitative traits, in addition to the limited use of MAS for manipulating QTLs for
traits such as fruit flavour and soluble solids content (Brix), MAS is being attempted for
improving quantitative resistance to diseases such as powdery mildew, bacterial canker and
bacterial wilt. Furthermore, despite considerable efforts devoted to the identification and
mapping of QTLs for various abiotic stress tolerance traits in tomato, including salt tolerance,
drought tolerance and cold tolerance, it does not seem that MAS has been employed for improving
any of these traits. As is the case in other crop species, many QTLs reported for complex traits
in tomato are either unreliable, population-specific or not strong enough in terms of linkage to
warrant their use for marker-assisted breeding. In fact, in many cases where MAS has been
employed to transfer QTLs from wild species, there have been problems associated with linkage
drag and recovery of desirable horticultural characteristics. Such undesirable associations could
be due to genetic linkage and/or pleiotropic effects; the distinction between the two is often
not very straightforward. Thus, before MAS can become a routine practice for improving complex
traits in tomato, issues surrounding this utility must be addressed.

MAS for Genetic Improvement of Fruit Quality Traits

Antioxidants in tomato fruits have been a public health focus for many years. The lycopene
content (LYC) in tomato fruit is an important source of lipid-soluble antioxidants in the human
diet and can prevent the initiation or propagation of oxidising chain reactions. Total soluble
solid content (SSC) is one of the main components of tomato flavour, and it is the property in
tomato most likely to match the consumer perception of internal quality. LYC and SSC are the main
quality traits of tomato fruit. A range of genetic and environmental factors that result in
quantitative variation across varieties governs tomato fruit quality; however, the inheritance
is complex. Therefore, overcoming the genetic linkage between fruit quality traits presents a
challenge for conventional breeding methods. The use of QTL mapping to find major genes and
functional markers and improve the ability to control quantitative traits is an effective way to
solve these problems.

Conventional breeding methods provide little information on the chromosomal regions controlling
these complex quality traits or the simultaneous effects of each chromosomal region on other
traits such as epistasis, pleiotropy and linkage. If based only on phenotype analysis, selection
by conventional breeding methods is extremely difficult when genotype × environment interactions
are substantial. No reliable field screening technique exists that can be used year after year
and generation after generation. One approach to facilitate the selection and breeding of complex
quality traits is to identify genetic markers linked to the traits of interest. During the
past decades, QTL studies conducted for tomato have addressed more than 50 traits, most of which
are fruit-related. Studies on LYC or SSC have suggested the existence of at least 17 QTLs for LYC
on all of the tomato chromosomes except chromosome 9 and at least 109 QTLs for SSC on all
chromosomes. With the exception of 2 QTLs for LYC, none of these QTLs have been used for
marker-assisted selection (MAS) in breeding; this suggests that constructing a static model of
genetic roles at only one developmental point is inadequate and that more effort should be
directed towards examining the stability and effectiveness of the target trait QTLs, with a view
to using a dynamic model of the genetic variation.

Fine Mapping and Characterisation of Fruit-Size QTL

Fruit size is one of the most important agricultural traits controlled by quantitative trait loci
(QTL). Therefore, identification of the genes underlying the major fruit-size loci may benefit
the breeding industry, as well as help us better understand the molecular mechanism underlying
fruit development. In one study, one of the major fruit-size loci in tomato, fw3.2, was fine
mapped by linkage analysis to a 51.4 kb interval corresponding to a BAC clone of the tomato
genome. The gene action suggested that a gain-of-function mutation occurred in the cultivar
allele, producing larger fruit, during domestication. The phenotypic characterisation of
near-isogenic lines (NILs) showed that this locus also controls other traits such as branch
number, leaf size and seed size. Yield per plant was similar, and the larger fruited lines
carried fewer fruit that ripened later than the smaller fruited lines. The changes in fruit
weight were not due to an alteration in the sink–source relationship. Expression level analysis
of the seven candidate genes in the NILs did not identify which gene may underlie fw3.2, and
numerous SNPs and InDels were found between the parents of the population. Based on the function
of the putative orthologs, one candidate gene is proposed to be FW3.2. Association mapping around
this candidate gene yielded one quantitative trait nucleotide (QTN) in the promoter of the gene.
Further genetic analysis of this QTN supported the finding that this SNP is the causative
mutation at the fw3.2 locus.

Concluding Remarks for MAS in Tomato

Molecular markers associated with genes or QTLs have been reported for numerous economically
important traits in tomato. Theoretically, such marker information should be useful for improving
qualitative or quantitative traits in tomato via marker-assisted breeding. In practice, however,
while markers have been used rather extensively for improving certain simply inherited traits in
tomato, they have rarely been utilised for improving complex traits. This has been due to various
reasons, including population-specific markers (e.g. lack of correspondence between QTLs
identified in interspecific populations and those existing in breeding populations), lack of
marker validation by repeating experiments, lack of marker polymorphism in breeding populations
and linkage drag. For simply inherited characteristics, in particular some disease-resistance
traits, however, markers have been used for tomato breeding to a great extent in both public- and
private-sector breeding programs. It is estimated that, at least for some disease-resistance
traits, MAS is not only faster than phenotypic selection but also cheaper and more efficient.
However, not all markers publicly reported in the literature are readily applicable in
tomato-breeding programs. Often additional efforts are necessary to refine the markers or to
identify and develop new markers with greater utility and reproducibility in specific breeding
populations. In particular, extra efforts are often required to identify/develop markers that
detect polymorphism within tomato-breeding populations. In fact, as most commercial-scale
tomato-breeding material is developed by the private sector, such programs often develop their
own resource of proprietary markers and associations tailored to their germplasm pool. Often
publicly available marker information is a good start but not always adequate. The utility of
available markers for several major disease-resistance
traits in tomato was tested in a number of breeding lines and commercial cultivars with known
resistance/susceptibility responses. While several markers were validated, others needed PCR
optimisation for successful amplification or were not informative in the genotypes used.
Specifically, of the 37 markers examined, 19 (approximately 51%) were informative, including
markers for resistance to Fusarium wilt, late blight, bacterial wilt, tomato mosaic virus, tomato
spotted wilt virus and root-knot nematodes (Panthee and Foolad 2012). It appears that many of the
available markers may need to be further refined or examined for trait association and presence
of polymorphism in breeding lines and populations. However, with recent advances in tomato
sequencing, it is becoming increasingly possible to develop more informative markers to
accelerate the use of MAS in tomato breeding. Thus, it is imperative that additional efforts be
devoted to identifying allele-specific and population-specific markers in order to expand the
utility of MAS in tomato breeding.

Hot Pepper

Hot pepper (Capsicum annuum) is an important horticultural crop, not only because of its economic
importance but also due to the nutritional and medicinal value of its fruit. The fruits are an
excellent source of natural colours and antioxidants. A wide spectrum of antioxidant vitamins,
carotenoids, capsaicinoids and phenolic compounds is present in hot pepper fruits. The intake of
these compounds in food is an important health-protecting factor preventing widespread human
diseases. Acreage under hot peppers is increasing due to a shift in production trend from other
crop-based farming to nontraditional crop production, which in turn is due to a decline in income
from regular cropping programs. During the last decade, the area under protected cultivation
(poly/plastic tunnels) of vegetables like hot pepper, tomato and cucumber has been increasing
steadily. Hot pepper is one of the potential crops to be grown in poly/plastic tunnels.

Progress in MAS in Hot Pepper

The characteristics of male sterility (MS) are used in breeding programs to achieve economical
seed production. Male sterility is divided into genic male sterility (GMS) and cytoplasmic male
sterility (CMS), which are used to breed commercial pepper varieties. The CMS system, however, is
not feasible in some pepper varieties, including C. annuum, because of the absence of a restorer
source. GMS is thus important for seed production in bell peppers. A GMS-linked marker from bell
peppers was developed using bulked segregant analysis and the amplified fragment length
polymorphism (AFLP) method with F2 and sibling individuals. Use of 1024 AFLP primer sets found a
polymorphism from EcoRI ACG/MseI GTT among the siblings. An internal sequence-based primer was
designed from the 395 bp sequence for high-resolution melting (HRM) analysis, and the marker
score of 87 of 92 F2 individuals corresponded to their phenotypes. The marker was mapped on
chromosome 5 on the AC99 map.

Phytophthora root rot, caused by Phytophthora capsici, is a major disease that limits pepper
production in the world. It is a soil-borne pathogen that can survive on host residues in soil
for months. Various methods to control phytophthora root rot have been reported; however, most
treatments increase production costs as well as environmental and health risks. The use of
resistant cultivars is a simple and effective strategy. Several resistance sources to
phytophthora root rot have been reported, but commercial cultivars with good stable resistance in
different environments against diverse isolates of the pathogen across regions are still lacking.
Quantitative trait loci (QTL) for resistance to phytophthora root rot were investigated using two
Korean P. capsici isolates and 126 F8 recombinant inbred lines derived from a cross of Capsicum
annuum line YCM334 (resistant parent) and local cv. Tean (susceptible parent). Seven QTLs on
chromosome 5 common to resistance against the two isolates were identified, in addition to QTLs
that were isolate-specific. The common QTLs with major effects on resistance to the two isolates
explained
20.0–48.2% of phenotypic variation. The isolate-specific QTLs explained 6.0–17.4% of phenotypic
variation. The result confirms a gene-for-gene relationship between C. annuum and P. capsici for
root rot resistance (Truong et al. 2012). QTLs for phytophthora root rot resistance were
previously identified on chromosome 11 in other studies. Thus, the results indicate that at least
a few specific gene functions are important components of root rot resistance to different
P. capsici races/isolates in the YCM334 × Tean population. Identification of isolate-specific
resistance QTLs in P. capsici–C. annuum interactions will help breeders in selecting appropriate
resistant lines for future hybridisation. Breeders may need to breed for resistance against a
specific isolate from different regions and then pyramid a number of specific genes to confer
resistance into a cultivar. The approach for further studies could be to develop near-isogenic
lines carrying different combinations of QTLs and to challenge the isogenic lines with different
pathogen isolates.

Pungency in peppers is due to capsaicinoids, the molecules that cause a pungent, burning
sensation when hot peppers are consumed; they are produced exclusively in the genus Capsicum.
This organoleptic quality is due to the activation of the TRPV1 (VR1) receptor. The primary
capsaicinoids are capsaicin, dihydrocapsaicin and nordihydrocapsaicin. The presence of
capsaicinoids makes pungent peppers valuable as a spice. In contrast, the absence of
capsaicinoids is important when non-pungent peppers are grown as a vegetable crop. The major gene
Pun1 is required for the production of capsaicinoids. Three distinct mutant alleles of Pun1 have
been found in three cultivated Capsicum species, one of which has been widely utilised by
breeders. A robust collection of molecular markers for this set of alleles was identified that
can differentiate four Pun1 alleles. Those markers were tested on a diverse panel of pepper lines
and in an F2 population segregating for pungency (Wyatt et al. 2012). These markers will be
useful for pepper breeding, germplasm characterisation and seed purity testing. Those markers are
unique in their ability to detect the functional nucleotide polymorphisms of the three Pun1
alleles. This set of Pun1 markers will aid diversity studies through the easy identification of
the three known Pun1 mutants in a wide range of germplasm. Additionally, the markers are useful
for seed lot testing in seed purity programs. With a trait such as pungency in fruit, which can
cause a painful sensation upon contact, it is critical to maintain the purity of non-pungent seed
stocks. Finally, these markers will be highly useful in breeding programs because they provide an
easy method to genotype populations and quickly identify plants with the desired pungency state.

Concluding Remarks on MAS in Hot Pepper

Molecular markers have contributed to the genetic improvement of hot pepper in several ways,
including efficient screening of large amounts of germplasm for genetic diversity analysis,
screening for seed purity, fingerprinting and QTL mapping. Though genes for major dominant traits
have been mapped, QTLs for complex polygenic traits such as pest and disease resistance and
abiotic stress resistance remain to be analysed. It is envisaged that future developments in
molecular biology may reduce the cost involved in marker development, which in turn would have a
huge impact on hot pepper breeding via MAS.

Bibliography

Literature Cited

Ali ML, Pathan MS, Zhang J, Bai G, Sarkarung S, Nguyen HT (2000) Mapping QTLs for root traits in a recombinant inbred population from two indica ecotypes in rice. Theor Appl Genet 101:756–766
Boopathi NM, Senthil A, Chandrikala R, Singh A, Shanmugasundaram P, Sadasivam S, Babu RC (2002) Mapping quantitative trait loci and marker assisted
selection for the improvement of drought tolerance in rice. Madras Agric J 89(10–12):553–562
Champoux MC, Wang G, Sarkarang S, Mackill DJ, O'Toole JC, Huang N, McCouch SR (1995) Locating genes associated with root morphology and drought avoidance in rice via linkage to molecular markers. Theor Appl Genet 90:961–981
Chen H, Qian N, Guo W, Song Q, Li B, Deng F, Dong C, Zhang T (2010) Using three selected overlapping RILs to fine-map the yield component QTL on Chro.D8 in Upland cotton. Euphytica 176:321–329
Foolad MR, Panthee DR (2012) Marker-assisted selection in tomato breeding. Crit Rev Plant Sci 31(2):93–123
Foolad MR, Merk HL, Ashrafi H (2008) Genetics, genomics and breeding of late blight and early blight resistance in tomato. Crit Rev Plant Sci 27:75–107
Gomez S, Boopathi NM, Kumar SS, Ramasubramanian T, Chengsong Z, Jeyaprakash P, Senthil A, Babu RC (2010) Molecular mapping and location of QTL for drought resistance traits in indica rice (Oryza sativa L.) lines adapted to target environments. Acta Physiol Plant 32(2):355–364
Gutierrez OA, Robinson AF, Jenkins JN, McCarty JC, Wubben MJ, Callahan FE, Nichols RL (2011) Identification of QTL regions and SSR markers associated with resistance to reniform nematode in Gossypium barbadense L. accession GB713. Theor Appl Genet 122:271–280
Humphry ME, Konduri V, Lambridges CJ, Magner T, McIntyre CL, Aitken EAB, Liu CJ (2002) Development of a mungbean (Vigna radiata) RFLP linkage map and its comparison with lablab (Lablab purpureus) reveals a high level of synteny between the two genomes. Theor Appl Genet 105:160–166
Isemura T, Kaga A, Tabata S, Somta P, Srinives P et al (2012) Construction of a genetic linkage map and genetic analysis of domestication related traits in mungbean (Vigna radiata). PLoS One 7(8):e41304. doi:10.1371/journal.pone.0041304
Jenkins JN, Wu J, Guo Y, McCarty JC (2010) Use of fiber and fuzz mutants to detect QTL for yield components, seed, and fiber traits of upland cotton. Euphytica 172:21–34
Jiang CX, Wright RJ, El-Zik KM, Paterson AH (1998) Polyploid formation created unique avenues for response to selection in Gossypium (cotton). Proc Natl Acad Sci USA 95(8):4419–4424
Kamoshita A, Babu RC, Boopathi NM, Fukai S (2008) Phenotypic and genotypic analysis of drought resistance traits for development of rice cultivars adapted to rainfed environments. Field Crops Res 109(1–3):1–23
Lambrides CJ, Lawn RJ, Godwin ID, Manners J, Imrie BC (2000) Two genetic linkage maps of mungbean using RFLP and RAPD markers. Aust J Agric Res 51:415–425
Lilley JM, Ludlow MM, McCouch SR, O'Toole JC (1996) Locating QTL for osmotic adjustment and dehydration tolerance in rice. J Exp Bot 47:1427–1436
McCouch SR, Kochert G, Yu ZH, Wang ZY, Khush GS, Coffman WR, Tanksley SD (1988) Molecular mapping of rice chromosomes. Theor Appl Genet 76:815–829
Menancio-Hautea D, Kumar L, Danesh D, Young ND (1993) A genome map for mungbean [Vigna radiata (L.) Wilczek] based on DNA genetic markers (2n = 2x = 22) In: O'Brien JS (ed) Genetic maps 1992. A compilation of linkage and restriction maps of genetically studied organisms. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, pp 6.259–6.261
Panthee DR, Foolad MR (2012) A reexamination of molecular markers for use in marker-assisted breeding in tomato. Euphytica 184:165–179
Ray JD, Yu LX, McCouch SR, Champoux MC, Wang G, Nguyen HT (1996) Mapping quantitative trait loci associated with root penetration ability in rice (Oryza sativa L.). Theor Appl Genet 92:627–636
Reinisch AJ, Dong J, Brubaker CL, Stelly DM, Wendelt JF, Paterson AH (1994) A detailed RFLP map of cotton, Gossypium hirsutum × Gossypium barbadense: chromosome organization and evolution in a disomic polyploid genome. Genetics 138:829–847
Robin S, Pathan MS, Courtois B, Lafitte R, Carandang S, Lanceras S, Amante M, Nguyen HT, Li Z (2003) Mapping osmotic adjustment in an advanced backcross inbred population of rice. Theor Appl Genet 107:1288–1296
Shen L, Courtois B, McNally KL, Robin S, Li Z (2001) Evaluation of near-isogenic lines of rice introgressed with QTLs for root depth through marker-aided selection. Theor Appl Genet 103:75–83
Sun FD, Zhang JH, Wang SF, Gong WK, Shi YZ, Liu AY, Li JW, Gong JW, Shang HH, Yuan YL (2012) QTL mapping for fiber quality traits across multiple generations and environments in upland cotton. Mol Breed 30:569–582
Tanksley SD, Ganal MW, Prince JP, Devicente MC, Bonierbale MW, Broun P, Fulton TM, Giovannoni JJ, Grandillo S, Martin GB et al (1992) High-density molecular linkage maps of the tomato and potato genomes. Genetics 132:1141–1160
Truong HTH et al (2012) Identification of isolate-specific resistance QTLs to phytophthora root rot using an intraspecific recombinant inbred line population of pepper (Capsicum annuum). Plant Pathol 61(1):48–56
Venuprasad R, Shashidhar HE, Hittalmani S, Hemamalini GS (2002) Tagging quantitative trait loci associated with grain yield and root morphological traits in rice under contrasting moisture regimes. Euphytica 128:293–300
Wu J, Gutierrez OA, Jenkins JN, McCarty JC, Zhu J (2009) Quantitative analysis and QTL mapping for agronomic and fibre traits in an RI population of upland cotton. Euphytica 165:231–245
Wyatt LE et al (2012) Development and application of a suite of non-pungency markers for the Pun1 gene in pepper (Capsicum spp.). Mol Breed. doi:10.1007/s11032-012-9716-9
Zhang Z, Rong J, Waghmare VN, Chee PW, May OL, Wright RJ, Gannaway JR, Paterson AH (2011) QTL alleles for improved fiber quality from a wild Hawaiian cotton, Gossypium tomentosum. Theor Appl Genet 123:1075–1088
Zheng BS, Yang L, Zhang WP, Mao CZ, Wu YR, Yi KK, Liu FY, Wu P (2003) Mapping QTLs and candidate genes for rice root traits under different water-supply conditions and comparative analysis across three populations. Theor Appl Genet 107:1505–1515

Further Reading

Boopathi NM, Thiyagu K, Urbi B, Santhoshkumar M, Gopikrishnan A, Aravind S, Swapnashri G, Ravikesavan R (2011) Marker-assisted breeding as next-generation strategy for genetic improvement of productivity and quality: can it be realized in cotton? Int J Plant Genom 2011. doi:10.1155/2011/670104
Future Perspectives in MAS
12

MAS can be simply defined as selection for a trait based on the genotype of an associated marker
rather than the trait itself. In essence, the associated marker is used as an indirect selection
criterion. The potential of MAS as a tool for crop improvement has been extensively explored in
different plant species. Major applications of MAS include (1) tracing favourable alleles and
pyramiding them in desirable genetic backgrounds (foreground MAS), (2) eliminating unwanted
genetic backgrounds (background MAS) or undesirable plant material in early breeding generations
and identifying the most desirable gene combinations or individuals in segregating populations
and (3) breaking the undesirable linkages between favourable and unfavourable alleles (reducing
linkage drag). The success of MAS in plant breeding is often assessed on the basis of these three
components. In theory, MAS can reduce the cost and increase the precision and efficiency of
selection and breeding. However, MAS is not a silver bullet, and it can be more effective than
conventional phenotype-based selection only under certain situations, including when (1)
trait-based selection is not feasible (e.g. lack of selection environment or pathogen), (2) such
selection is costly or ineffective, (3) trait expression is developmentally regulated or
phenotypically not obvious until late in the season, (4) the trait is governed by recessive or
incompletely dominant gene(s), (5) trait heritability is low, rendering conventional phenotypic
selection ineffective, (6) there are strong G × E interactions, (7) multiple trait selection is
desired, (8) conducting gene introduction/pyramiding from different sources and (9) transferring
genes/QTLs from wild genetic backgrounds. Furthermore, in a backcross-breeding programme, MAS
allows reduction of linkage drag by selecting against the undesirable donor genome and for the
desirable recurrent parent genome (background selection) while also selecting for desirable donor
alleles (foreground selection). Moreover, with MAS, it is possible to conduct multiple rounds of
selection in a year, allowing approximately two generations of selection per year, compared to
one in phenotypic selection methods.

The success of MAS also depends on many other factors, including the underlying genetic control
of the trait(s) of interest. MAS has been possible, if not always practical, for a wide range of
qualitative/simple traits since the early twentieth century. The utility of MAS for manipulating
single-gene traits is straightforward and has been well documented. MAS for the improvement of
polygenic traits, however, is more complicated, though its usefulness has been recognised. In
general, for quantitative traits, MAS seems to be most effective for traits with low (0.1–0.3)
heritability and those which are controlled by rather small numbers of QTLs with large effects.
However, with the recent development of next-generation molecular tools and genetic maps, MAS has
become more attractive and practical for many simple and complex traits in applied breeding
programmes on several occasions.
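
The remark on heritability can be made concrete with a small calculation. The sketch below rests on an assumption that is not taken from this chapter: selection is made on a marker score alone, and the markers capture a proportion p of the additive genetic variance of a trait with narrow-sense heritability h2. Under the usual quantitative-genetic approximation (equal selection intensity and generation interval in both schemes), the expected response relative to phenotypic selection is the square root of p/h2. The function name and the example values are hypothetical and serve only as illustration.

```python
import math


def relative_efficiency(p: float, h2: float) -> float:
    """Expected response to marker-only selection relative to phenotypic selection.

    p  : proportion of additive genetic variance captured by the markers (0..1)
    h2 : narrow-sense heritability of the trait (0 < h2 <= 1)
    Assumes equal selection intensity and generation interval for both schemes.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if not 0.0 < h2 <= 1.0:
        raise ValueError("h2 must be greater than 0 and at most 1")
    return math.sqrt(p / h2)


if __name__ == "__main__":
    # Hypothetical case: the linked markers capture 40% of the additive variance.
    for h2 in (0.1, 0.2, 0.3, 0.5, 0.8):
        ratio = relative_efficiency(0.4, h2)
        print(f"h2 = {h2:.1f}  relative efficiency = {ratio:.2f}")
```

In this hypothetical case the ratio exceeds 1 only while h2 is below 0.4, which is in line with the observation that MAS tends to pay off for low-heritability traits governed by a few QTLs with large effects.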

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_12, © Springer India 2013

One of the yet unrealised promises of molecu- chromosome arms of plant sequences, and this
lar markers is their utility for improvement of progress provides a template for defining the
complex quantitative traits, which are often novel functional markers for future use. High-
controlled by more than one gene and exhibit low quality crop genome sequences integrated with
heritability and often strong G × E interactions. molecular genetic maps provide the basis for
The failure in using molecular markers for com- identifying duplicated genes, analysing promoter
plex traits is due to various reasons, including regions in detail, defining SNPs/InDels and
QTLs being unreliable or population or environ- aligning the transcriptome with the genome.
ment specific, QTLs not strong enough in terms These advances will allow gene networks to be
of linkage to warrant their use for marker-assisted clearly defined and thus allow meaningful causal
breeding, lack of marker validation or marker or functional markers to be developed for complex
polymorphism in breeding populations and prob- traits.
lems associated with linkage drag. However, it Extensive proteomic studies have allowed
should be possible to use markers for improving identification of many allelic variants at the novel
complex traits assuming that additional neces- genes, and genomic analyses identified several
sary efforts are made to develop reliable markers, markers for discriminating alleles at one locus.
including minimising the environmental effects These successes have indicated that it is now
and maximising the relationship between geno- essential to establish rapid, convenient and
type and phenotype (e.g. by repeating experi- economical PCR-based assays in crop breeding.
ments in multiple environments), breaking up In order to detect genes simultaneously in a single
complex traits into their individual components PCR, multiplex PCR can be developed, in which
and identifying QTL-linked markers for such several markers in the same reaction mix are
components, and identifying QTLs using actual co-amplified under identical conditions. For
breeding populations. Obviously, these are not example, two multiplex PCR assays, developed
easy challenges, but they are doable. for the identification of genes/loci w-secalin,
Thus, future progress in MAS will greatly Glu-B1-2a, Glu-D1-1d, Glu-A3d, Glu-B3,
depend on improved genetics. However, the agro- Pin-D1b, Ppo-A1, Ppo-D1 and Wx-B1b, provide
nomical context, as well as socio-economic fac- the proof of concept for the efficient screening of
tors and policy, must be taken into account; they genotypes in wheat. A clear challenge is for
influence to a large extent whether farmers adopt multiplexing markers to have similar annealing
improved varieties and whether they can mini- temperatures for the different primers and for the
mise the gap between yield potential and on-farm expected PCR products to be easily separated on
yield. This integration of quantitative knowledge agarose gels. Although several genes conferring
arising from diverse but complementary disci- pest/disease resistance have been cloned in plants,
plines will allow researchers to more fully under- the gene-specific markers are available for only
stand genes associated with complex traits in few genes. If alleles conferring specific resistance
crop plants and more precisely forecast the pen- are being sought, it is important to know which
alty of modulating expression levels of those alleles are effective and potentially useful to local
genes. breeding programmes. A good example is for the
Large-scale genome sequencing and associ- leaf rust resistance genes Lr10 and Lr21, which
ated bioinformatics are becoming widely accepted confer resistance to a broad spectrum of Puccinia
research tools for accelerating the analysis of triticina races, but gene-specific markers are
plant genome structure and function. Second- not available for these two genes because the
generation DNA sequences from crop plants can reactions of alleles to various Puccinia triticina
provide an opportunity to use genomic informa- races have not been well characterised. Currently,
tion to clone genes and develop SNP markers in functional markers are being increasingly adopted
plants. Rapid progress is now being achieved in in crop breeding including wheat (e.g. many
assembling the DNA sequences from individual functional markers associated with wheat quality
genes, in particular, are available; however, more functional markers are needed for important traits such as disease and stress resistance in order to strengthen the application of molecular markers in breeding programmes). SNPs are the most applicable markers for high-throughput screening once the genotype-phenotype associations are determined. The expanded use of these markers will develop as high-throughput techniques for MAS based on functional SNP markers and chips are established. The meaningful interpretation of whole-genome studies to associate SNPs with variation in phenotype is expected to provide the next generation of functional markers for use in MAS.

MAS in Orphan Crops

The development of genetic markers is complex and costly in species with little pre-existing genomic information (such as orphan, neglected or underutilised crops that nevertheless have potential for human welfare). Such orphan crops possess some of the largest and least studied genomes among cultivated crop plants, and only a few gene-based genetic maps have been reported in such crops. The development of new markers in orphan crops will be an essential step for MAS to be adopted as a routine procedure in breeding programmes for such crops. Many regional working groups are now engaged in developing molecular markers in those crops. This includes the utilisation of SCAR, SRAP, ISSR, AFLP, SSR and SNP markers (see chapter 3). New SSRs based on SSR-enriched libraries from locally adapted genotypes, EST-based SSRs or cross-species SSRs may be deployed. The development of SSRs together with increasingly larger sets of transferable markers such as ESTs in orphan crops should provide direct bridges among genetic maps, allowing not only the streamlining of high-resolution mapping and positional cloning of major QTLs or genes of interest but also the development of many types of DNA markers such as STSs, SCARs or SNPs that will greatly help in establishing MAS systems in orphan crops. Evaluation of the extent of linkage disequilibrium in exotic and domesticated germplasm is yet another requirement. Phenotypic evaluation of multiple populations per species should be conducted so that the locations of quantitative trait loci for important agronomic traits can be identified by genetic and association mapping. The accumulation of mapping information will facilitate the exploration of syntenic regions across orphan crops. These genetic tools will also help in construction of physical maps of chromosomes in orphan crops. Construction of physical maps will allow better understanding of such complex genomes and facilitate cloning and manipulation of traits of economic interest. This will also help to better understand the secondary metabolism involved in interactions between neglected crops and pathogens, symbiotic organisms, predators and pollinators and will lead to varieties with enhanced yield potential, nutritional benefits, resistance to pests and diseases and tolerance of adverse environmental conditions.

Using molecular marker technology, it is now feasible to analyse quantitative traits such as salt tolerance and identify the chromosomal regions (QTLs) associated with such characters. Identifying such regions will significantly help to increase the selection efficiency in the breeding programmes. Molecular marker-assisted selection is considered to be faster, more efficient and probably more cost effective than conventional screening, particularly for abiotic stresses where expression of the trait is subject to significant environmental effects. It will also help narrow down the possible candidate genes and ultimately will lead to map-based cloning of the major genes controlling the trait of interest, opening a new avenue for genetic manipulations using the real candidate genes, since it has been shown that several such underutilised crops are well adapted to unfavourable environmental conditions. With the recent advances in DNA sequencing and single nucleotide polymorphism (SNP) genotyping, new approaches to QTL mapping and quantitative trait nucleotide (QTN) identification are now available, and these could be applied to orphan crops for identification of phenotype-related SNPs.
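
As a rough illustration of what identifying phenotype-related SNPs can involve at its simplest, the sketch below regresses a trait value on allele dosage, one marker at a time, and ranks markers by the share of phenotypic variance they explain. It is a generic single-marker regression written for this passage, not the method of any study cited in this chapter. The function name, the SNP names and all data values are hypothetical, and biallelic SNPs coded as 0, 1 or 2 copies of one allele are assumed.

```python
from statistics import mean


def single_marker_scan(genotypes, phenotypes):
    """Regress a trait on allele dosage, one SNP at a time.

    genotypes: dict of SNP name -> list of dosages (0, 1 or 2), one per plant
    phenotypes: list of trait values, in the same plant order
    Returns (snp, additive_effect, r_squared) tuples, best-explaining SNP first.
    """
    y_bar = mean(phenotypes)
    ss_y = sum((y - y_bar) ** 2 for y in phenotypes)
    results = []
    for snp, dosages in genotypes.items():
        if len(dosages) != len(phenotypes):
            raise ValueError(f"{snp}: genotype and phenotype counts differ")
        x_bar = mean(dosages)
        ss_x = sum((x - x_bar) ** 2 for x in dosages)
        if ss_x == 0 or ss_y == 0:
            continue  # monomorphic marker or no trait variation: nothing to test
        s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, phenotypes))
        slope = s_xy / ss_x                  # trait change per extra allele copy
        r_squared = s_xy ** 2 / (ss_x * ss_y)  # variance explained by this marker
        results.append((snp, slope, r_squared))
    return sorted(results, key=lambda item: item[2], reverse=True)


# Hypothetical data: six plants, three SNPs, one trait score per plant.
trait = [10, 12, 14, 10, 12, 14]
markers = {
    "snp_A": [0, 1, 2, 0, 1, 2],
    "snp_B": [0, 0, 0, 0, 0, 0],  # monomorphic, so it is skipped
    "snp_C": [2, 0, 1, 1, 2, 0],
}
for name, effect, r2 in single_marker_scan(markers, trait):
    print(f"{name}: effect per allele = {effect:.2f}, R^2 = {r2:.2f}")
```

A real analysis would add significance testing, correction for multiple comparisons and control for population structure; the sketch only shows the basic link between marker genotype and phenotype that MAS relies on.
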
Once genes responsible for quantitative variation are identified, information can be passed on to those crop breeding programmes to enable implementation of MAS. This will greatly help in accelerating the breeding programme. In addition, traditional breeding efforts will be greatly enhanced through integrated approaches using functional, comparative and structural genomics. It should be kept in mind, however, that optimisation of marker genotyping methods in terms of cost-effectiveness and a greater level of integration between molecular and conventional breeding represent the main challenges for the future adoption and impact of MAS on orphan crop breeding.

Orphan crops are widely distributed across the Mediterranean region and have shallow soil requirements, and their cultivated accessions have variable seed yields in Mediterranean environments. In addition, in some of them, for example yellow lupin, the seeds have the highest protein content and twice the cysteine and methionine content of most lupins. However, despite its high nutritional quality, there is a lack of genetic and molecular tools to aid the breeding of this species. Nevertheless, some progress has been made in certain orphan crops. EST sequencing has accelerated gene discovery when genome sequences are not available, facilitating gene family identification and development of molecular markers. Next-generation sequencing has generated an enormous amount of expressed sequence data for a wide range of plant species, especially minor or orphan crops. For example, EST and genome sequencing of lentil and chickpea would not have been feasible without next-generation sequencing. The lower cost and greater sequence yield have allowed the identification of candidate genes, even when they are expressed at low levels.

Research on plants, animals and fungi has shown that sequences of expressed genes are often widely transferable among species, and even genera, allowing wide comparative genome mapping studies (see chapter 7). For instance, the combination of orphan crop EST sequences with genetic and genomic resources of model plants, such as Lotus japonicus (Japanese trefoil) and Medicago truncatula (barrel medic), has identified macro- and microscale synteny, discovered new genes and alleles and provided insights into genome evolution and duplication. Comparisons between ESTs and gene sequences among several legume species have allowed comparative genome studies between L. albus and M. truncatula, and L. angustifolius and Lotus japonicus.

The use of molecular markers and the development of suitable mapping populations will allow significant progress in mapping to enhance breeding strategies in orphan crops. For example, the local faba bean variety Hassawi 2, with drought tolerance and excellent cooking quality, was used with an introduced small, black-seeded Pakistani variety for developing a mapping population in an attempt to map QTLs for drought tolerance in Vicia faba. Those studies showed that some physiological parameters such as stomatal conductance, leaf rolling and leaf temperature as well as grain yield under stress are well associated with drought tolerance. These parameters along with water use efficiency and proline content could be utilised in plant phenotyping. Breeding programmes for drought tolerance in faba bean should consider the genetic diversity in the tested genotypes for physiological, morphological and agronomical traits and the important correlations among these traits. Significant correlations allow the utilisation of relatively simple traits as indirect selection criteria for drought tolerance in faba bean breeding. Other drought tolerance traits investigated in a number of field legumes include dry matter accumulation under stressed and unstressed environments, relative water content (RWC), stomata frequency, stomata size, transpiration efficiency, carbon isotope discrimination, leaf temperature and osmotic potential. These traits have been found to have a significant association with drought tolerance and could be utilised in selection for drought tolerance. There is an urgent need to identify chromosomal regions associated with economically important traits in faba bean. Identification of expression QTLs (eQTLs) will help in narrowing down candidate genes for traits of interest and lead to an increased number of QTLs for agronomically important traits for faba bean improvement.

One of the functional genomic approaches to identify candidate genes responsible for a trait of
interest is through differential expression strategies. DNA chips and subtractive hybridisation are among the tools of choice to identify abiotic stress responsive genes. Many genes are expected to be drought responsive, of which only a few are the real candidate genes. Combining the QTL approach with a differential display strategy will allow narrowing down the possible candidate genes by focusing only on those responsive genes in the major QTL regions in faba bean. In summary, using bioinformatics tools and analysis of gene motifs, real candidate genes could be identified in faba bean. Further PCR-based validation using primers designed from such candidate genes will demonstrate the efficiency of the genes identified. This will allow trait manipulation and eventually will lead to the development of stress tolerant faba bean genotypes. The availability of second-generation sequencing and high-throughput technology in parallel with other genomic approaches will facilitate the analysis of transcripts, proteins and insertional and chemically induced mutants and will allow understanding of the relationship between gene function and phenotype.

Furthermore, developing efficient regeneration protocols will allow successful in vitro culture and genetic transformation in orphan crops. This will facilitate the development of transgenic plants in such underutilised crops with excellent biotic and abiotic stress tolerance and open a new avenue for functional genomics and crop manipulation. Ultimately this will help in developing better genotypes in underutilised crops that are suitable for local and regional ecosystems and in enhancing the role of orphan crops for conservation agriculture in arid and semiarid regions.

MAS in Developing Countries

Though there have been successful examples of MAS in developed countries, the transfer and application of new plant biotechnologies to developing countries are recognised as a big challenge, and solutions can be found only through innovative partnerships and collaborations with advanced laboratories. Molecular breeding for polygenic traits has been successfully deployed in the multinational private sector, and several experts in the art see molecular plant breeding as the foundation for twenty-first century crop improvement.

Although the number of success stories is increasing, it is fair to say that, in today's reality, MAS application for complex traits in breeding programmes remains primarily limited to the private sector and is barely used in developing countries. Reasons for this situation in developing countries are a shortage of well-trained personnel, inadequate access to high-throughput genotyping, inappropriate phenotyping infrastructure, unaffordable information systems and analysis tools and the logistical difficulty of integrating new approaches with traditional breeding methodologies, including problems when scaling up from small to large breeding programmes. Therefore, except for leading emerging economies, the capacity to conduct intensive research in plant biology and to support plant breeding remains rather limited in developing countries, and in some cases it has even decreased over the last decade. For example, although there has been a strong focus on agricultural development in Africa in recent years, many of the African breeding institutes, especially those in sub-Saharan Africa, remain dependent on international support for agricultural research. These needier institutes tend to be in countries whose population has a high proportion of resource-poor people; thus, building the capacities of breeding programmes and seed systems in those countries is vital to achieving any improvement in the ability of poor farmers to grow improved varieties. In order to realise the full potential of marker technologies and bioinformatics in plant breeding, tools for molecular characterisation, accurate phenotyping, efficient information systems and effective data analysis must be integrated with breeding workflows managing pedigree, phenotypic, genotypic and adaptation data into efficient information systems. With all the progress achieved in marker technology, software development, analytical pipelines and data management systems, it is time to provide an information
system, available through a public platform, that will offer breeding programmes in developed and developing countries access to modern breeding technologies, in an integrated and configurable way, to boost crop quality and productivity.

There are several constraints in developing countries that hamper the application of MAS. Some relate to access to information and publications. Others relate to data collection, management and storage, such as availability of systems for reliable sample and data tracking. Very important are the scientific and technical concerns involved in adequate experimental design, precise and reliable trait phenotyping (i.e. dissection of complex traits), dependable marker validation and advanced analytical methodologies and tools for accurate decision making, among others. Thus, the main challenges hampering the potential of molecular breeding in developing countries encompass (1) human resources, (2) infrastructure capacity, (3) access to marker technologies and (4) availability of an efficient data management system. Human capacity for molecular breeding technologies in developing countries is an ongoing challenge, and limitations include substandard agriculture programmes at universities; difficulties in keeping up to date with relevant developments, including failures by others; poor technical skills in core disciplines; isolation as a result of insufficient peer critical mass in the workplace; and poor incentives to attract and retain scientists, resulting in brain drain and staff turnover. Fortunately, with the establishment of marker service laboratories there has been a clear change in mentality: breeders need to be trained on how to analyse the data and not how to run marker genotyping; there is general acceptance that large-scale genotyping activities are best outsourced, while nobody questions the need for basic local laboratories. For breeders to efficiently access relevant information generated by themselves and by other researchers, reliable data management (including sample tracking, data collection and storage and modern analytical methodologies and tools for accurate decision making, among others) is critical both within a given molecular breeding programme and across programmes. In view of this, it is essential that breeders manage pedigree, phenotypic and genotypic information through common or mutually compatible crop information systems.

However, amidst the challenges there are also actual and potential opportunities. Several of the constraints listed above, in particular access to marker technologies and limited data management systems, can be overcome through the establishment of crosscutting technology and service platforms, and several international initiatives are supporting the development of such platforms in tight collaboration with partners from developing countries. To partially offset the undesirable trend of losing the champions, novel international initiatives such as the Alliance for a Green Revolution in Africa (AGRA) support high-quality education in the South, and although there is still a long way to go, governmental and institutional commitment is increasing for the adoption of biotechnologies in developing countries (Delannay et al. 2012).

Community Efforts in Developing Countries and Their Implications in MAS

The recent emergence of affordable large-scale marker technologies (e.g. Diversity Arrays Technology (DArT), SNPs), the sharp decline of sequencing costs boosting marker development based on sequence information and the explicit efforts of national agricultural research programmes (e.g. in India) and international initiatives such as the Generation Challenge Programme (GCP) have all resulted in a large increase in the number of genomic resources available for less-studied crops. As a result, most key crops in developing countries now have adequate genomic resources for meaningful genetic studies and most MAS applications. In more recent times, the capacity of the national breeding institutes, in terms of their financial resources, infrastructure and expertise, has evolved in a somewhat country-specific manner, reflecting the health of their domestic economies. Thus, capacity has degraded in some countries, while in others there have been major improvements, as evidenced by a change from requiring training and support from large
international programmes to becoming mutual partners in agricultural research. This is reflected in the sharp differences in capacity to conduct and apply biotechnological research in developing countries.

Interestingly, newly industrialised countries such as Brazil, China, India, Mexico, South Africa and Thailand substantially invest in technology and research and development (R&D) and are self-reliant in most aspects of marker technologies. These countries have the concomitant potential to effectively adopt, adapt and apply information and communication technologies to enhance research efficiency and outputs. They are therefore naturally at the frontline in adopting molecular breeding technologies. Their institutes are beginning to communicate with one another, as illustrated by the 2006 agreement between Brazil, China and India to collaborate in the area of agriculture, including the exchange of genetic resources and joint efforts in plant biology and breeding.

On the other hand, mid-level developing world economies such as Colombia, Indonesia, Kenya, Morocco, Uruguay and Vietnam are well aware of MAS's importance, and some effectively apply marker technologies for germplasm characterisation and selection of major genes. These countries have a matching potential for a limited utilisation of molecular breeding platforms, a potential that can be enhanced fairly rapidly in the medium to long term. In contrast, low-level developing world economies are struggling to sustain even basic conventional breeding. They have very limited or no application of molecular breeding and are unlikely to adopt molecular breeding platforms except in the long term. Due to its ability to quickly and cost-effectively generate precise trait linkage information for specific regions of the genome, MAS is expected to improve the efficiency of crop breeding and to progressively increase genetic gains by selecting and stacking, with markers, favourable alleles at target loci. Comparing the cost-effectiveness of MAS with phenotypic selection is not straightforward. Firstly, interlinked factors other than cost, such as trade-offs between time and money, are likely to play an important role in determining the choice of screening method. Secondly, the choice between MAS and conventional selection may be complicated by the fact that the two are rarely direct substitutes for one another or mutually exclusive, and in fact they are quite complementary under most breeding schemes. Where operating capital is not a limitation, MAS maximises the net present value, and with the decrease in marker data point cost and increased access to marker service laboratories, marker-assisted breeding operating costs are shrinking, making this approach increasingly attractive from an economic perspective.

Few economic analyses have been undertaken to assess the potential impacts of MAS. A famous example is the impact of the submergence gene for rice in Asia. Among the few analyses available is an evaluation of the economic benefits of using MAS to develop rice varieties with tolerance to salinity and P deficiency in Bangladesh, India, Indonesia and the Philippines, since DNA molecular markers for these traits are available (see chapter 11). Encompassing a broad set of economic parameters, the study concluded that MAS is estimated to save at least 2–3 years, resulting in significant incremental benefits in the range of USD 300–800 million, depending on the country, abiotic stress and lag for conventional breeding. Another study estimates the benefits of using marker-assisted breeding, as compared with conventional breeding alone, in developing cassava varieties resistant to cassava mosaic disease, green mite, whitefly and postharvest physiological deterioration in Nigeria, Ghana and Uganda. Marker-assisted breeding is estimated to save at least 4 years in the breeding cycle for varieties resistant to the pests and to result in incremental net benefits over 25 years in the range of USD 34–800 million depending on the country, the particular constraint and various assumptions.

The key technical constraint to the efficient management of crop information across the layers of implementation is standardisation and consistency. At the crop level, the most important key to data integration is a community-accepted trait dictionary, an ontology of traits of interest for each crop together with a set of effective protocols for their evaluation, including scales or units
of measurements and data quality standards. Developing, maintaining and supporting integrated breeding informatics applications are also critical. This would include the design of databases to manage crop information from any crop and the development of user applications to facilitate breeding processes. These would need to be configured to the best practices for each crop to provide common functionality under different community efforts.

Field and Laboratory Infrastructure Improvement

Reliable phenotypic data are a must for high-quality genetic studies, and most developing countries lack suitable field infrastructure for proper trials and collection of accurate phenotypic data. Guidelines on best practice must be provided on how to design and run a trial and conduct precise phenotyping for genetic studies under different target environments. Improving access to homogeneous field areas and paying attention to good soil preparation and homogeneous sowing are critical. Until a few years ago, the major investment required to establish large-scale marker technology was considered a large impediment to the application of molecular breeding in developing countries. One of the challenges in conducting agronomic research in developing countries is that research stations are often underfunded and understaffed and do not have the resources necessary to establish and maintain the field environments appropriate for quality phenotyping. Even with the availability of the best genotyping resources, integrated molecular breeding programmes will be doomed to failure in the absence of quality phenotypic data to support the proper identification of the main QTLs affecting key target traits.

The ability to generate genotyping data has been one of the main stumbling blocks preventing wide utilisation of markers in developing countries. Molecular markers rely on the availability of high-quality laboratories able to perform the necessary molecular biology operations. For simple sequence repeat (SSR) markers, these operations include at a minimum high-quality DNA extraction, polymerase chain reaction (PCR) amplification, gel electrophoresis and gel scoring. Performing those operations requires well-trained technicians and the availability of well-equipped laboratories with stable electricity supply, reliable supply of clean water, room temperature and humidity control and the scientific equipment necessary to perform those tasks. Refrigerators and freezers (regular freezers and −80 °C freezers) also need to be in operation on an uninterrupted basis to store temperature-sensitive reagents, primers and DNA samples. Automatically triggered power generators need to be installed when a reliable electrical supply cannot be guaranteed. A first attempt to resolve this issue has been for donor organisations to fund the construction of genotyping laboratories in various places of the Third World. However, except for large, well-funded centres, this was often not successful because sustained resourcing was not available to hire qualified personnel and to purchase and maintain the necessary equipment and reagents. The logistics of reliably shipping perishable reagents to remote areas of the Third World is also often an obstacle. As a result, there are unfortunately a number of poorly equipped laboratories lying idle in some remote parts of Africa. In spite of that, a few local centres, such as the National Root Crop Research Institute (NRCRI) in Umudike, Nigeria, have been successful in establishing low-throughput laboratories that can serve the basic genotyping needs of their breeders. An intermediate solution is to rely on regional hubs. Those hubs should be relatively well-funded and well-equipped laboratories that can handle primarily SSR genotyping for interested parties. Part of the strategy is to rely on four hubs covering the needs of the Americas (Centro Internacional de Agricultura Tropical, CIAT, www.ciat.cgiar.org), Africa (BioSciences eastern and central Africa, BecA, http://hub.africabiosciences.org), South Asia (International Crops Research Institute for the Semi-Arid Tropics, ICRISAT, www.icrisat.org) and Southeast Asia (International Rice Research Institute, IRRI, www.irri.org). Those hubs will be able to provide basic genotyping needs and at the same time help
train local scientists in the fundamentals of molecular breeding.

Full integration of molecular markers into breeding programmes will require the availability of high-throughput and low-cost genotyping platforms primarily based on SNPs. SNPs are the only marker type that can meet the long-term needs of integrated molecular breeding so that it can be widely applied in a cost-effective manner. However, high-throughput SNP genotyping requires the use of highly automated laboratories using an array of sophisticated equipment (pipetting robots, high-density PCR, high-throughput SNP detection machines, high-level informatics). Although large private seed companies have had the need and the resources to put in place large-scale genotyping laboratories for their own uses, smaller programmes, especially in the public sector, have typically not had the resources or the justification to establish and maintain such large operations to meet their increasing needs for SNP genotyping data. In response to this need, a few private marker service laboratories have sprung up over the past few years. Those laboratories can provide complete genotyping services for their customers, from DNA extraction to generation of large numbers of SNP or other data points. Due to their broad customer base (from medical research laboratories to animal and plant breeding operations, both public and private), such laboratories can reach the large volume of data point production that leads to low costs to the customer and high throughput. They are able to invest in the most advanced equipment to keep up with the constant evolution of genotyping technologies and are able to pass on the resulting benefits to their customers. Processes have now been put in place for rapid shipment of dried leaf samples from any location (field or laboratory) around the world without the phytosanitary and similar restrictions that can affect the shipment of seed or other viable tissues.

Contract genotyping is also generally exempt from material transfer agreements (MTAs) and other intellectual property requirements because the material being sent is not viable and will not be used for any other purpose than the generation of genotyping data for the exclusive benefit of the customer. Examples of such companies that can service breeding programmes from around the world are DNA Land-Marks, Inc. of Saint-Jean-sur-Richelieu, Quebec, Canada (http://www.dna-landmarks.ca/english), and KBioscience Ltd. of Hoddesdon, UK (http://www.kbioscience.co.uk). This approach represents a very attractive solution for large-scale integration of markers into Third World country breeding programmes, as it does not necessitate any heavy capital investment and it completely removes the maintenance and equipment upgrade issues.

Lessons Learnt and Concluding Remarks

Marker-assisted selection that complements regular conventional breeding programmes increases genetic gain per crop cycle, stacks favourable alleles at target loci and reduces the number of selection cycles. In the last decade, the multinational private sector has benefitted immensely from MAS, which demonstrates its efficacy. In contrast, its adoption is still limited in the public sector, and it is hardly used in developing countries. Major bottlenecks in these countries include shortage of well-trained personnel, inadequate high-throughput capacity, poor phenotyping infrastructure, lack of information systems or adapted analysis tools or simply resource-limited breeding programmes. The emerging virtual platforms aided by the information and communication technology revolution will help to overcome some of these limitations by providing breeders with better access to genomic resources, advanced laboratory services and robust analytical and data management tools. Apart from some advanced national agricultural research systems, the implementation of large-scale molecular breeding programmes in developing countries will take time. However, the exponential development of genomic resources, including for less-studied crops, the ever-decreasing cost of marker technologies and the emergence of platforms for accessing MAS tools and support services, plus the increasing public-private partnerships and
needs-driven demand for improved varieties to counter the global food crisis, are all grounds to predict that MAS will have a significant impact on crop breeding in developing countries. These predictions are supported by some preliminary successful examples presented in Chapters 9 and 11. Advances in genomics research are generating new tools, such as functional molecular markers and informatics, as well as new knowledge about statistics and inheritance phenomena that could increase the efficiency and precision of crop improvement. In particular, the elucidation of the fundamental mechanisms of heterosis and epigenetics, and their manipulation, has great potential. Eventually, knowledge of the relative values of alleles at all loci segregating in a population could allow the breeder to design a genotype in silico and to practise whole-genome selection for minor crops in developing countries.

Considerable progress has been made in building infrastructure for applying genomics approaches. This includes one-dimensional genetic information (genome sequences), many ESTs and gene knockout populations in several plant species of biological and agronomic importance. New knowledge and new tools are changing the strategies used in crop plant research and will thus reduce the costs and increase the throughput of the assays. There is a continuing need to integrate disciplines such as structural genomics, transcriptomics, proteomics and metabolomics with plant physiology and plant breeding. Bioinformatics is providing the means for integration and structured interrogation of datasets that will facilitate the cross-fertilisation of disciplines. Genomics research has successfully unravelled various metabolic pathways and provided molecular markers for agronomic traits. However, the mechanisms of epigenetic phenomena are only beginning to be understood, and their potential role in crop improvement is unknown. Similarly, tantalising bits of information concerning the possible basis of heterosis are gradually emerging. Eventual elucidation of the mechanism of heterosis might be one of the most important contributions of molecular genetics research to crop improvement. Ultimately, the goal of the breeder will be to assay the genetic make-up of individual plants rapidly and to select desirable genotypes in breeding populations. The construction of graphical genotypes of each plant or progeny row would allow the breeder to determine which chromosome sections are inherited from each parent to facilitate the selection process and perhaps to reduce the need for extensive field tests. A logical extension of whole-genome selection for the breeder would be to design the superior genotypes in silico, an approach described as breeding by design.

Thus, in the post-genomics era, high-throughput approaches combined with automation, increasing amounts of sequence data in the public domain and enhanced bioinformatics techniques will contribute to genomics research for crop improvement. However, the costs of applying genomics strategies and tools are often more than is available in commercial or public breeding programmes, particularly for crops that are only of regional importance. Newly developed genetic and genomics tools will enhance, but not replace, the conventional breeding and evaluation process. The ultimate test of the value of a genotype is its performance in the target environment and acceptance by farmers and consumers.
Bibliography 291

Isemura T, Kaga A, Tabata S, Somta P, Srinives P et al Panthee DR, Foolad MR (2012) A re-examination of
(2012) Construction of a genetic linkage map and molecular markers for usein marker-assisted breeding
genetic analysis of domestication related traits in in tomato. Euphytica 184:165179
Mungbean (Vignaradiata). PLoS One 7(8):e41304. Sharma HC et al (2002) Applications of biotechnology for
doi:10.1371/journal.pone.0041304 crop improvement: prospects and constraints. Plant
Khan M (2012) Current status of genomic based approaches Sci 163:381395
to enhance drought tolerance in rice (Oryza sativa L.): Varshney RK, Graner A, Sorrells ME (2005) Genomics-
an over view. Mol Plant Breed 3(1):110. doi:10.5376/ assisted breeding for crop improvement. Trends Plant
mpb.2012.03.00 Sci 10(12):621630
Liu Y, He Z, Appels R, Xia X (2012) Functional markers Xu Y et al (2012a) Whole-genome strategies for
in wheat: current status and future prospects. Theor marker-assisted plant breeding. Mol Breed
Appl Genet 125:110 29:833854
Nakaya A, Isobe SN (2012) Will genomic selection Xu Y, Li Z-K, Thomson MJ (2012b) Molecular breeding
be a practical method for plant breeding? Ann Bot in plants: moving into the mainstream. Mol Breed
110(6):13031316. doi:10.1093/aob/mcs109 29:831832
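The idea of a graphical genotype mentioned in the concluding remarks (working out which chromosome sections of a plant come from which parent) can be illustrated with a minimal sketch. It assumes a homozygous progeny line, one allele code per marker, and markers already ordered along a single chromosome. The marker names, allele calls and function names are hypothetical and are not taken from the book.

```python
from itertools import groupby


def parental_origin(parent_a, parent_b, progeny):
    """Label each marker of a progeny line as 'A', 'B' or '-' (uninformative)."""
    labels = []
    for a, b, p in zip(parent_a, parent_b, progeny):
        if a == b:
            labels.append("-")  # parents share the allele: marker is monomorphic
        elif p == a:
            labels.append("A")  # progeny carries the allele of parent A
        elif p == b:
            labels.append("B")  # progeny carries the allele of parent B
        else:
            labels.append("-")  # missing, heterozygous or unexpected call
    return labels


def segments(marker_names, labels):
    """Collapse informative markers into runs: (parent, first_marker, last_marker)."""
    informative = [(m, l) for m, l in zip(marker_names, labels) if l != "-"]
    out = []
    for label, run in groupby(informative, key=lambda pair: pair[1]):
        run = list(run)
        out.append((label, run[0][0], run[-1][0]))
    return out


# Hypothetical data: six ordered markers on one chromosome.
markers = ["m1", "m2", "m3", "m4", "m5", "m6"]
parent_a = ["A", "G", "T", "C", "A", "G"]
parent_b = ["T", "G", "C", "C", "G", "A"]
line_1 = ["A", "G", "T", "C", "G", "A"]

labels = parental_origin(parent_a, parent_b, line_1)
print("".join(labels))            # one character per marker
for seg in segments(markers, labels):
    print(seg)                    # one tuple per parental segment
```

In this toy example the line carries a parent A segment spanning m1 to m3 and a parent B segment spanning m5 to m6, which implies a recombination event somewhere between m3 and m5. A real graphical genotype would also use map positions and handle heterozygous calls, which this sketch deliberately ignores.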
About the Author

N. Manikanda Boopathi is presently working as an Assistant Professor (Biotechnology) at the Department of Plant Molecular Biology and Bioinformatics, CPMB&B, Tamil Nadu Agricultural University, Coimbatore, India. He graduated in agricultural sciences, did his master's and doctoral studies in Plant Biotechnology and trained at the International Rice Research Institute, the Philippines. He has handled more than 20 courses for undergraduate and postgraduate students in his university and is frequently invited to deliver lectures in several institutions, both in India and abroad. His scientific work has been recognised on several occasions and has brought him laurels and awards. He has vast experience in QTL mapping and marker-assisted selection in rice and cotton. He has completed several national and international research projects and is currently working in two countrywide and one worldwide network projects that address the problems of biotic and abiotic stresses in cotton, mungbean, hot pepper and tomato using system quantitative genetics. His publications can be found at http://sites.google.com/site/drnmboopathi and/or http://tnaucottondatabase.wordpress.com/.

