NGS-based Identification of Induced Mutations in a Doubly Mutagenized Tomato (Solanum lycopersicum) Population

Article in The Plant Journal · August 2017

DOI: 10.1111/tpj.13654

The Plant Journal (2017) 92, 495–508 doi: 10.1111/tpj.13654
TECHNICAL ADVANCE
Next-generation sequencing (NGS)-based identification of induced mutations in a doubly mutagenized tomato (Solanum lycopersicum) population
Prateek Gupta1, Bodanapu Reddaiah1,†,††, Hymavathi Salava1,††, Pallawi Upadhyaya1,‡‡, Kamal Tyagi1,‡,‡‡, Supriya Sarma1,§,‡‡, Sneha Datta2,‡‡, Bharti Malhotra1,¶,§§, Sherinmol Thomas1,‖,§§, Anusha Sunkum1,§§, Sameera Devulapalli1, Bradley John Till2, Yellamaraju Sreelakshmi1,* and Rameshwar Sharma1,*
1 Repository of Tomato Genomics Resources, Department of Plant Sciences, University of Hyderabad, Hyderabad, India
2 Plant Breeding and Genetics Laboratory, IAEA Seibersdorf Laboratories, Reaktorstrasse 1, Seibersdorf, Austria
Received 23 May 2017; revised 25 July 2017; accepted 26 July 2017; published online 28 August 2017.
*For correspondence (emails rameshwar.sharma@gmail.com; syellamaraju@gmail.com).
† Present address: Bodanapu Reddaiah: AgriGenome Labs Pvt. Ltd., Hyderabad, India.
‡ Present address: Kamal Tyagi: Institute of Plant Sciences, Agricultural Research Organization, The Gilat Research Center, Beersheba, Israel.
§ Present address: Supriya Sarma: Centre for Cellular and Molecular Biology, Hyderabad, India.
¶ Present address: Bharti Malhotra: Mordor Intelligence, Hyderabad, India.
‖ Present address: Sherinmol Thomas: Department of Biosciences and Bioengineering, Indian Institute of Technology, Mumbai, India.
†† Equal second authors.
‡‡ Equal third authors.
§§ Equal fourth authors.
SUMMARY
The identification of mutations in targeted genes has been significantly simplified by the advent of TILLING
(Targeting Induced Local Lesions In Genomes), speeding up the functional genomic analysis of animals and
plants. Next-generation sequencing (NGS) is gradually replacing classical TILLING for mutation detection,
as it allows the analysis of a large number of amplicons in short durations. The NGS approach was used to
identify mutations in a population of Solanum lycopersicum (tomato) that was doubly mutagenized by
ethylmethane sulphonate (EMS). Twenty-five genes belonging to carotenoids and folate metabolism were
PCR-amplified and screened to identify potentially beneficial alleles. To augment efficiency, the 600-bp
amplicons were directly sequenced in a non-overlapping manner in Illumina MiSeq, obviating the need for a
fragmentation step before library preparation. A comparison of the different pooling depths revealed that
heterozygous mutations could be identified up to 128-fold pooling. An evaluation of six different software
programs (CAMBA, CRISP, GATK UNIFIED GENOTYPER, LOFREQ, SNVER and VIPR) revealed that no software program was
robust enough to predict mutations with high fidelity. Among these, CRISP and CAMBA predicted mutations
with lower false discovery rates. The false positives were largely eliminated by considering only mutations
commonly predicted by two different software programs. The screening of 23.47 Mb of tomato genome
yielded 75 predicted mutations, 64 of which were confirmed by Sanger sequencing with an average mutation
density of 1/367 Kb. Our results indicate that NGS combined with multiple variant detection tools can
reduce false positives and significantly speed up the mutation discovery rate.
Keywords: tomato, TILLING, NGS, mutation, EMS, reverse genetics, Solanum lycopersicum, technical advance.
INTRODUCTION
Since the dawn of civilization, beneficial traits in domesticated plants have been selected by humans for enhanced food production. Intrinsic to this was the heritability of the chosen trait to the next generation. It is now recognized that the selected traits were mostly spontaneous mutations that arose during domestication (Doebley et al., 2006). The realization that mutations can enhance variability significantly accelerated crop improvement by the artificial induction of mutations and the selection of desired traits by breeding (Ahloowalia and Maluszynski, 2001).
© 2017 The Authors 495

The Plant Journal © 2017 John Wiley & Sons Ltd
Induced mutagenesis has been a mainstay for crop improvement during the last century. Most often it involves phenotype-based selection and introgression into the desired cultivar by backcrossing without prior knowledge about the gene(s) encoding the trait. The green revolution that considerably enhanced the yields of Oryza sativa (rice) and Triticum sp. (wheat) was enabled by the introgression of mutations for dwarfing. It was later identified that these traits were associated with the genes involved in biosynthesis or in the perception of the plant hormone gibberellin (Hedden, 2003).

Prior knowledge about the mutagenized gene(s) encoding a particular trait can greatly accelerate the process of mutation breeding. The acquisition of plant gene sequences enabled the development of reverse-genetic approaches to identify lesions disrupting gene function (reviewed in Jankowicz-Cieslak and Till, 2015). The availability of complete and annotated genome sequences in plants starting in the early 2000s enabled larger scale reverse genetics, with the ultimate aim of assigning a function to the many newly discovered genes. One such approach was TILLING (targeting induced local lesions in genomes), which relies on random mutagenesis followed by mutation discovery assays to recover the newly induced novel sequence variants. First described for Arabidopsis and Drosophila (Bentley et al., 2000; McCallum et al., 2000), the approach has since been applied to a large number of crop plants and even to several animal species (Kurowska et al., 2011).

The approach became widespread in part because the chemical mutagenesis of plants and animals had been well established and mutation discovery assays were both scalable and easily adapted for different species. One of the strengths of the TILLING approach is that multiple mutations can be recovered from any target region with a relatively small population size of a few thousand individuals (Jankowicz-Cieslak and Till, 2015). Although mutation densities vary between species, and sometimes between genotypes of the same species, growing data sets from TILLING projects provide useful benchmarks for the frequency of induced mutations expected for seed and vegetatively propagated diploids and polyploids (Slade et al., 2005; Jankowicz-Cieslak et al., 2012; Chen et al., 2014).

In addition to applications across species, one of the major focuses of advancing TILLING has been the improvement of mutation discovery. The majority of induced mutations created in TILLING populations are single-nucleotide changes. Although different discovery methods such as denaturing HPLC (dHPLC) and high-resolution melt (HRM) have been described, the use of single strand-specific nucleases such as CEL I to cleave DNA heteroduplexes followed by denaturing PAGE or capillary electrophoresis predominated in the first decade of TILLING (Oleykowski et al., 1998; Colbert et al., 2001; Raghavan et al., 2007; Dong et al., 2009). A variant of this method called high-resolution melting (Gady et al., 2009) was also used to detect the mismatches in PCR amplicons. Typical assays involved the PCR amplification of ~1500-bp target regions using fluorescently labeled primers, followed by denaturing and annealing products to create heteroduplexes that are the substrate of nucleases. Cleaved products of lower molecular weight than the original PCR product were indicative of the presence of a mutation. Higher throughputs could be achieved by pooling genomic DNA samples prior to PCR, increasing the number of cyclers and electrophoresis units, and automating gel loading (Till et al., 2006). Although efficient and accurate, major bottlenecks in what can be called 'traditional' TILLING were the fact that only single amplicons could be assayed per electrophoresis run and that sample pooling was limited to approximately eightfold, owing to the high noise of the assays. These two factors limited the throughput of TILLING, requiring the analysis of a large number of samples to find mutations in a set of target genes.

These inherent limitations are overcome with the introduction of 'TILLING by sequencing', wherein the genomic DNA is pooled in a multidimensional fashion to a much higher pooling depth. PCR is then performed on the pools to amplify the target region of interest. Multiple targets are amplified in separate PCR reactions. PCR products are next pooled prior to sequencing using a next-generation sequencer (NGS), and data are analyzed to reveal rare induced mutations. The use of the NGS method provided accurate and rapid detection of rare mutations in target genes through the sequencing of specific amplified regions (Marroni et al., 2011; Zhu et al., 2012; Guo et al., 2015). The application of NGS by Rigola et al. (2009) to 3000 Solanum lycopersicum (tomato) M2 lines using a 3D pooling strategy yielded two mutations in the eIF4E gene from 28 DNA pools. The use of 768 individual rice M2 lines in 44 DNA pools with 3D pooling yielded 79 mutations in 32 genes (Tsai et al., 2011), indicating a better throughput of mutation detection than can be achieved with the traditional TILLING method.

When comparing different TILLING methods, NGS with highly pooled samples provides a large improvement in screening throughput. Bench protocols for PCR amplification of target sequences, quantifying and pooling PCR products, and library preparation are largely standardized. Improvements can be made to screening throughput by altering the sample pooling size so that unique individuals can be screened in a single run. Another area where bench methods can be streamlined is in the preparation of PCR amplicons for sequencing. When PCR products contain more base pairs than the maximal read length of the sequencing technology employed, amplicons must be fragmented prior to sequencing if all bases in the product are to be interrogated. With increasing read lengths
available in some sequencing technologies, it is now possible to consider the direct sequencing of PCR products without the extra steps of fragmentation, size selection and end repair (Pan et al., 2015; Duitama et al., 2017).

Adjusting pooling depth and fragment size to match improvements in sequencing technologies can be achieved with adaptations to workflows using standard molecular biology techniques. Data analysis, on the other hand, has yet to be standardized and can be considered a major bottleneck, and necessitates a focus for improvements. The 'TILLING by sequencing' method requires efficient bioinformatics tools that can report rare sequence variants. Accurate mutant identification requires software that is robust enough to discriminate sequencing errors from true rare variants during a comparison with the reference genome sequence. Ideally, a balance is struck such that both false-discovery and false-negative errors are minimized. Most bioinformatics tools available for the detection of single-nucleotide polymorphisms (SNPs) by NGS are designed for non-overlapping pools, such as CRISP (Bansal, 2010), SNVER (Wei et al., 2011), LOFREQ (Wilm et al., 2012), VIPR (Altmann et al., 2011) and the GATK UNIFIED GENOTYPER (McKenna et al., 2010). In comparison, a limited number of bioinformatics tools such as CAMBA (Missirian et al., 2011), COMSEQ (Shental et al., 2010) and KEYPOINT (Rigola et al., 2009) are designed for mutation calling using overlapping multidimensional DNA pools.

Each of the aforementioned programs uses different algorithms and provides only approximations about the presence of rare variants in the sequenced pools. The predicted variants, therefore, must be validated by another method to ensure that the mutation is found in the expected plant line. Considering that the software programs predict only putative mutants, it is apt to compare the robustness of the programs with a single set of data. Moreover, the throughput of mutation detection can be increased by balancing genomic DNA pooling, the number of amplicons sequenced and the total number of reads, such that real induced mutations will be detected above background noise. We examined these aspects in a population of tomato mutagenized twice using ethyl methanesulfonate (EMS). Double mutagenesis was used to increase the frequency of mutations in a singly mutagenized tomato population. We streamlined library preparation and sequencing through the production and direct sequencing of amplicons of approximately 600 bp in length. We next tested different software programs for mutation discovery to streamline and standardize the bioinformatic analysis of data from highly pooled samples. We report the identification of 64 confirmed mutations out of 75 putative overlapping mutations predicted by different software programs. The relative efficiency of different software programs and that of pooling depth is also presented in this study.

RESULTS

Tomato TILLING population

In a single crop species the efficiency of mutagenesis is highly variable, and depends on factors such as the genetic background of the cultivar, and the nature and dosage of the mutagen (Mba, 2013; Chen et al., 2014). Optimally a trade-off has to be established between the density of mutation on the genome and the loss of fertility of the plants. In this study, we mutagenized a fresh market tomato cultivar, Arka Vikas, using 120 mM EMS, which induces mutations by the ethylation of guanine bases in DNA (Prakash and Sherman, 1973). The M2 seeds were remutagenized using an identical concentration of EMS, and M2M2 seeds were collected (Figure 1a). In the first round of mutagenesis, nearly 53% of the seeds lost viability, and of the remaining seeds, only 31.58% of M1 plants were fertile, yielding M2 seeds (Table S1). Remutagenesis of M2 seeds marginally increased the loss of seed viability, but severely affected fertility, with only 25% of M2M1 plants being fertile. These results were consistent with the mutagenesis of tomato Micro-Tom, where the viability of seeds and fertility of plants progressively declined with higher EMS dosage (Saito et al., 2011). The M2M2 plants of the remutagenized population were used for genomic DNA isolation and pooling, similarly to that reported by Tsai et al. (2011) (Figure 1b).

Optimal pooling depth

In classical TILLING the throughput of mutation detection is increased by DNA pooling, usually with eightfold bidirectional pooling. Massively parallel sequencing allows much higher fold DNA pooling because the number of times each nucleotide position is assayed for the presence of a variation can be controlled experimentally. The optimal pooling depth depends on the ability to discriminate a single mutation from background noise as well as on the concentration of the template DNA used. The latter is important to ensure that all samples in a pool of genomic DNAs are represented in the resulting PCR products and subsequent sequencing reads. In this study, we used 1:64 fold pooling for mutation detection, similar to Tsai et al. (2011). To find out whether pooling of higher than 1:64 fold can be used, we selected a tomato mutant bearing two different heterozygous mutations (G2984A and C3139T) in the γ-glutamyl hydrolase 1 (GGH1) gene.

The DNA of the GGH1 mutant was diluted in ratios of 1:128 and 1:256 with respect to the wild-type allele. The mean coverage of the target region for 128- and 256-fold pools was 3880 (30.31× coverage) and 3486 (13.61× coverage), respectively, which was sufficient to identify a mutation. For the depth-of-pooling analysis, we used CAMBA as it

[Figure 1 schematic. (a) 1st round EMS treatment: M0 seeds, M1 seeds, M1 plants, M2 seeds; 2nd round EMS treatment: M2M0 seeds, M2M1 seeds, M2M1 plants, M2M2 seeds, M2M2 plants (4 plants/line); DNA isolation from individual lines. (b) 3-D pooling (pools C1 to C16, R1 to R16, D1 to D12); PCR amplification with target regions from all the pools (500-600 bp amplicons); amplicons pooled to their respective pool numbers; library preparation and paired-end sequencing; mutant identification.]
Figure 1. Re-mutagenesis and the structure of the TILLING population. (a) The EMS-mutagenized Solanum lycopersicum (tomato) seeds were used to raise the M1 plants. The M2 seeds harvested from individual M1 plants were remutagenized with EMS. The remutagenized M2M1 seeds were used to raise M2M1 plants, and M2M2 seeds were harvested. Four seeds from each M2M2 line (labeled A, B, C and D) were grown and M2M3 seeds were collected. The leaf tissue from 2-week-old M2M2 seedlings was harvested for genomic DNA isolation. (b) The genomic DNA was quantified, equalized and three-dimensionally pooled following the method described by Tsai et al. (2011), generating 44 DNA pools. The pools were PCR-amplified using gene-specific primers. The equal quantities of PCR products were combined in their respective pools, and combined amplicon pools were used for library preparation and sequencing. After data analysis, mutation calling and de-multiplexing, individual mutant lines were identified using different software and confirmed by Sanger sequencing.
is the only available software that allows the detection of mutations in multidimensionally pooled DNA samples. To detect the heterozygous mutations in 1:128 and 1:256 fold pooled samples, the frequency changes obtained using CAMBA were plotted versus the base position of the screened region. To visibly discriminate the mutations from the background noise, the frequency change should be higher than 0.0039 [one mutant allele in 256, 1/(128 × 2) = 0.0039] and 0.0019 [one mutant allele in 512, 1/(256 × 2) = 0.0019] in 128- and 256-fold pools, respectively. The observed frequency change of 0.0138 for G2984A and 0.0147 for C3139T in the 128-fold pool was 3.5-fold higher than the background noise, and appeared acceptable for the identification of a mutation. In contrast, the observed frequency change 0.0021 for G2984A and 0.0032 for C3139T, being close to the background noise, suggested that the use of the 256-fold pool for the detection of mutation in our current set-up would be unacceptably error prone (Figure 2).

Library preparation and sequencing

Considering that the shearing and fragmentation of PCR-amplified target sequences requires additional liquid handling, and can lead to uneven sequencing coverage, this step was avoided. We amplified target gene sequences with amplicon lengths of ~500–600 bp and directly sequenced the product using MiSeq long sequencing run kits (2 × 300 bp, paired end). For all pools, sufficient sequence reads were obtained, excepting R11 and R12 pools in the second run. Essentially no reads were observed for any of the PCR amplicons in these two pools (Figure 3a). Owing to the fact that PCR products were inspected before sequencing, the read failure for R11 and R12 pools is likely to have resulted from an error during the library preparation of these pools. These pools were eliminated from further data analysis. Although mutations present in R11 and R12 should be found in at least two other pools because of the three-dimensional pooling
[Figure 2 plots: six panels (a) to (f); y-axis: Frequency (0 to 0.08); x-axis: base position (0 to 600); peaks labeled G86A in panels (a) and (c), and C241T in panels (b) and (d).]
Figure 2. The exploratory experiment for checking the efficiency of genomic DNA pooling depth. (a, b) Position and type of base substitution at the 86th and 241st position in GGH1 gene in 3D pooled samples. Both the mutations were heterozygous. (c, d) Detection of G86A and C241T mutations in heterozygous GGH1 allele in 128-fold diluted samples. (e, f) The heterozygous GGH1 mutant allele cannot be distinguished from the background noise in the 256-fold diluted sample.
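The detection thresholds quoted in the pooling-depth analysis follow from simple arithmetic: one heterozygous diploid individual contributes one mutant allele out of 2 × N alleles in an N-fold pool. The following is a minimal Python sketch of that arithmetic, using only the numbers reported in the text; the function names are illustrative and are not part of CAMBA or any other tool.

```python
def expected_het_frequency(pool_size: int, ploidy: int = 2) -> float:
    """Allele frequency contributed by one heterozygous mutant in a pool."""
    return 1.0 / (pool_size * ploidy)


def per_individual_coverage(mean_pool_coverage: float, pool_size: int) -> float:
    """Average number of reads attributable to each individual in the pool."""
    return mean_pool_coverage / pool_size


def fold_over_threshold(observed_change: float, pool_size: int) -> float:
    """Observed frequency change relative to the one-mutant-allele threshold."""
    return observed_change / expected_het_frequency(pool_size)


# Frequency changes reported in the text for the two GGH1 mutations.
OBSERVED = {
    128: {"G2984A": 0.0138, "C3139T": 0.0147},
    256: {"G2984A": 0.0021, "C3139T": 0.0032},
}

for pool_size, calls in OBSERVED.items():
    threshold = expected_het_frequency(pool_size)
    for mutation, change in calls.items():
        ratio = fold_over_threshold(change, pool_size)
        print(f"{pool_size}-fold {mutation}: threshold={threshold:.4f}, "
              f"observed={change:.4f}, ratio={ratio:.2f}")
```

With these inputs the 128-fold ratios are about 3.5 and 3.8, whereas the 256-fold ratios stay below 2, which matches the qualitative conclusion drawn from Figure 2.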

used, it is plausible that the failure of these two pools may reduce the total number of mutations recovered in this project, and result in a lower mutation density estimation.

Notwithstanding the precise quantification and normalization of input DNA and the PCR product, the coverage varied between the amplicons and among the pools (Figure 3b). Such variation in the coverage among the genes was also observed by Tsai et al. (2011) and Pan et al. (2015) using NGS for TILLING. We selected a minimum coverage threshold of 640, corresponding to five reads per allele. The amplicons that showed coverage of less than 640 in a pool were discarded from the analysis. Within a pool, the coverage of each amplicon was even, with the read depth varying within 10% across the length of fragments; however, the read depth for different fragments varied to a maximum of fivefold in the same pool. On the other hand, the read depth for the same fragment varied between different pools, with the maxima being 10-fold for a few fragments.

Data analysis and mutation calling

The identification of mutant individuals by sequencing from a pooled population is a challenging task, considering that mutations are rare in the populations. The identification of the rare variants by sequencing necessitates additional parameters to distinguish the mutations from
[Figure 3 plots: (a) coverage per library for each primer set, logarithmic y-axis (10 to 10 000); (b) average sequencing depth per primer set, y-axis 1000 to 7000.]
Figure 3. Sequencing coverage of each primer set in 3D pooled gene fragments among 44 libraries. (a) Each dot represents a single library and its coverage for that particular gene; the respective primer sets are indicated on the x-axis. (b) The average sequencing depth in the 44 pools of each primer is displayed.
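The minimum coverage threshold of 640 mentioned in the text is consistent with a 64-fold pool of diploid individuals sequenced to five reads per allele (64 × 2 × 5 = 640). A minimal Python sketch of that calculation and of the resulting amplicon filter is given below; the amplicon names and coverage values in the example are hypothetical and serve only to illustrate the filter.

```python
def min_pool_coverage(individuals_per_pool: int,
                      reads_per_allele: int = 5,
                      ploidy: int = 2) -> int:
    """Coverage needed so that every allele in the pool is read N times on average."""
    return individuals_per_pool * ploidy * reads_per_allele


def passing_amplicons(coverage_by_amplicon: dict, threshold: int) -> dict:
    """Keep only amplicons whose coverage in a pool meets the threshold."""
    return {name: cov for name, cov in coverage_by_amplicon.items()
            if cov >= threshold}


threshold = min_pool_coverage(64)  # 64 individuals x 2 alleles x 5 reads

# Hypothetical coverage values for one pool (not taken from the study).
example_pool = {"amplicon_1": 3880, "amplicon_2": 512, "amplicon_3": 640}
kept = passing_amplicons(example_pool, threshold)
print(threshold, sorted(kept))
```

The same helper gives 1280 for a 128-fold pool at five reads per allele, which is one way to judge whether a planned sequencing run is deep enough for a given pooling depth.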

sequencing errors. Considerable efforts have been devoted to developing the bioinformatics tools to detect rare, as well as common, variants affecting any given trait. Based on the end-user application, the available software programs for detecting variants use different algorithms for statistical models, read quality score analysis, sequencing error rates and number of samples pooled to detect rare variants. Considering that no single software program is robust enough to identify all of the rare variants present in the pooled population, we analyzed Illumina reads using six different software programs, namely CAMBA, CRISP, SNVER, LOFREQ, VIPR and GATK UNIFIED GENOTYPER, for variant calling (Figure 4).

The comparison of SNP prediction offered by different software programs shows that each program predicted a different number of total SNPs. The three-dimensional pooling strategy used requires that true mutations be found in a minimum of one pool of each pooling dimension (named C, D and R). Only the CAMBA pipeline takes three-dimensional pooling into consideration (Table 1). When applying this rule to data from the other software, the total number of SNPs detected was substantially reduced. The number of SNPs predicted was much higher for LOFREQ, VIPR and GATK, however. Considering that each program predicted different SNPs, two additional parameters were applied to screen out expected false variants. First, only variants that overlapped in three different pools (C, D, R) were considered, with a likelihood of a change at a particular position in a maximum of two different individuals. The same mutation can appear in two different individuals owing to the application of two mutagenesis steps to create the population. The above criterion eliminated a large number of singletons that arise through PCR and sequencing errors. Second, from the remaining variants, we considered only SNPs that were predicted by at least two independent programs (Table S2).

After eliminating all of the SNPs with these two parameters, 75 putative SNPs were identified that were predicted by at least two programs out of the six programs tested in this study. The amplicons for these 75 predicted SNPs were amplified by PCR from genomic DNA of the mutant lines predicted to be harboring the mutation. The presence of mutations in amplified PCR product was confirmed by Sanger sequencing. The Sanger sequencing revealed that 64 out of 75 putative SNPs indeed harbored mutations. The remaining 11 SNPs were not confirmed by Sanger sequencing, as the respective PCR products did not reveal any variation in the nucleotide sequence compared with the wild-type sequence. Evidently, these 11 SNPs represented the false positives that were erroneously predicted by two programs. In conformity with mutagenesis being a random process, out of 64 SNPs, only four SNPs were homozygous, and the majority of SNPs were heterozygous in nature. A low frequency for homozygous mutation was also reported for the ProDH gene in tomato, where only three out of 19 mutations recovered were homozygous (Gady et al., 2009).

The comparison of SNP predictions offered by different software with Sanger sequencing revealed that none of these programs were robust enough for accurate prediction for standalone usage (Table 1). CAMBA predicted 113 SNPs, of which only 58 SNPs overlapped with the predictions from other software. Likewise, only 54 SNPs overlapped with the 76 SNPs predicted by CRISP. The frequency
Figure 4. An overview of the bioinformatics pipeline. The layout shows the tools and filters used to process the raw reads, and the analysis through six different variant calling tools.
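The two filtering criteria described in the text (a variant must be seen in at least one pool of each of the C, D and R dimensions, and must be reported by at least two programs) can be sketched as a small set-based filter. The Python below is an illustrative sketch, not the authors' pipeline: the input format, the tool names and the cap of two pools per dimension (used here as a stand-in for "a maximum of two different individuals") are all assumptions made for the example.

```python
from collections import defaultdict

DIMENSIONS = ("C", "D", "R")  # pool name prefixes, e.g. "C2", "D4", "R2"


def passes_3d_rule(pools, max_per_dim=2):
    """True if the variant is seen in 1..max_per_dim pools of every dimension.

    max_per_dim=2 is an assumed proxy for 'at most two individuals'.
    """
    for dim in DIMENSIONS:
        n = len({p for p in pools if p.startswith(dim)})
        if not 1 <= n <= max_per_dim:
            return False
    return True


def consensus_variants(calls_by_tool, min_tools=2):
    """Return variants that pass the 3D rule in at least `min_tools` tools.

    calls_by_tool maps a tool name to an iterable of (variant, pool) pairs.
    """
    support = defaultdict(set)
    for tool, calls in calls_by_tool.items():
        pools_by_variant = defaultdict(set)
        for variant, pool in calls:
            pools_by_variant[variant].add(pool)
        for variant, pools in pools_by_variant.items():
            if passes_3d_rule(pools):
                support[variant].add(tool)
    return {v: sorted(t) for v, t in support.items() if len(t) >= min_tools}


# Toy input with hypothetical tool names and calls.
v1, v2, v3 = ("GGH1", 86, "G>A"), ("GGH1", 300, "C>T"), ("GGH1", 410, "G>A")
toy_calls = {
    "tool_a": [(v1, "C2"), (v1, "R2"), (v1, "D4"), (v2, "C5")],
    "tool_b": [(v1, "C2"), (v1, "R2"), (v1, "D4"),
               (v3, "C1"), (v3, "R1"), (v3, "D1")],
    "tool_c": [(v3, "C1"), (v3, "R1")],
}
result = consensus_variants(toy_calls)
print(result)
```

In this toy input only the first variant survives: the second is a singleton in one pool, and the third satisfies the three-dimension rule in only one tool.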

Table 1 Comparison of mutation calling tools used for the identification of mutations

Pipeline                             CAMBA    CRISP    SNVer    LoFreq   VipR     GATK UG
Total SNPs predicted                 –        17397    12078    74752    2130     64279
Overlapping SNPs (1C × 1D × 1R)      108      69       90       518      282      313
Overlapping SNPs (two pipelines)     58       54       35       42       22       13
True positive (TP)                   57       53       32       33       15       12
True negative (TN)                   10       10       8        2        4        10
False positive (FP)                  1        1        3        9        7        1
False negative (FN)                  7        11       32       31       49       52
Accuracy (%)                         89.33    84       53.33    46.66    25.33    29.33
Sensitivity (%)                      89.06    82.81    50       51.56    23.43    18.75
Specificity (%)                      90.90    90.90    72.72    18.18    36.36    90.90
Precision (%)                        98.27    98.14    91.42    78.57    68.18    92.30
FDR (%)                              1.72     1.85     8.57     21.42    31.81    7.69

The parameters like plausible single-nucleotide polymorphisms (SNPs), output by other tool/s and number of true positives were compared among the six tools used. Final confirmation by Sanger sequencing was carried out for mutations that were predicted by at least two software programs.
Accuracy = (TP + TN)/(TP + TN + FP + FN); sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); precision = TP/(TP + FP); false discovery rate (FDR) = FP/(TP + FP).
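The performance figures in Table 1 can be recomputed directly from the TP, TN, FP and FN counts using the formulas given in the table footnote. The short Python sketch below does this for all six tools; the counts are copied from Table 1, and the printed percentages agree with the tabulated ones to within 0.01 (the table values appear to be truncated rather than rounded).

```python
# (TP, TN, FP, FN) counts copied from Table 1.
COUNTS = {
    "CAMBA":   (57, 10, 1, 7),
    "CRISP":   (53, 10, 1, 11),
    "SNVer":   (32, 8, 3, 32),
    "LoFreq":  (33, 2, 9, 31),
    "VipR":    (15, 4, 7, 49),
    "GATK UG": (12, 10, 1, 52),
}


def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Percent metrics as defined in the Table 1 footnote."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "precision": 100.0 * tp / (tp + fp),
        "fdr": 100.0 * fp / (tp + fp),
    }


for tool, counts in COUNTS.items():
    m = metrics(*counts)
    print(tool, {name: round(value, 2) for name, value in m.items()})
```

Every tool's counts sum to 75 (the Sanger-tested SNPs), with TP + FN = 64 confirmed mutations and TN + FP = 11 unconfirmed predictions.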
of the overlapping SNPs was much lower for the remaining four programs. The exclusion of SNPs that were predicted by only one software program considerably improved the accuracy of the software. With the addition of this parameter, CAMBA and CRISP showed 89.3 and 84.0% accuracy for 75 putative SNPs predicted on Sanger sequencing. Even the accuracy of poorly performing software, such as LoFreq, VipR and GATK, was significantly improved by considering the shared SNPs. Among the software programs tested, the performance of only two, CAMBA and CRISP, was most satisfactory, as these predicted mutations with higher accuracy and with low false-discovery rates. In contrast, VIPR and GATK UNIFIED GENOTYPER showed the least accuracy.

Mutation density of the population

The mutation density in a TILLING population depends upon various factors such as species, genotype, type of mutagen, mutagen concentration and treatment time (Talame et al., 2008). In the present study, the total amount of DNA screened for the 55 amplicons was approximately 23.47 Mb, with an average GC content of 38.35%. Combining the frequency of mutations for all amplicons analyzed, the M2M2 population showed an average mutation density of approximately one mutation per 367 Kb. Notwithstanding the average mutation density, the mutation frequency varied from 1/165 Kb to 1/860 Kb between genes.

Of the 64 confirmed mutations, the majority showed nucleotide transition mutations: 46 were G/C → A/T and four were A/T → G/C. The number of transversion mutations was relatively small: two were G/C → T/A, 10 were A/T → T/A and six were G/C → C/G (Table S3). Out of the 25 screened genes, the mutations were detected only in 19 genes. It is plausible that the absence of the mutation in six genes may result from the randomness of the mutagenesis process (Table 2).

For the identified mutations, the ensuing amino acid changes in protein sequence were determined. The likely effect of the amino acid change on the protein function was predicted in silico by SIFT4G software (Table 3). Twenty-three mutations caused nonsynonymous changes that replaced an amino acid in the protein. Fifteen mutations caused synonymous changes that may not have any effect on protein function. Only a single mutant line was identified with a premature stop codon. Twenty-two mutations were identified in the intronic or 3ʹ untranslated regions, and three mutations were located in the promoter region. The overall distribution of synonymous, nonsynonymous and nonsense mutations in our study is 38.5, 59.0 and 2.5%, respectively. Out of the 23 nonsynonymous mutations, 43.5% were predicted as deleterious by SIFT4G software (Vaser et al., 2016), which may affect the biological activity of the protein.

DISCUSSION

Compared with traditional TILLING, wherein genomic DNA is mostly eightfold pooled in two-dimensional arrays, the NGS-based TILLING allows much higher fold pooling in three-dimensional arrays. The pooling depth is determined by the ease of detection of mutations from the background noise. In this study, we emulated the pooling protocol of Tsai et al. (2011) using 64- and 96-fold pooled DNA samples for the bulk of our mutation screening. We also evaluated higher pooling depths. In rice, higher pooling depths such as 128- or 192-fold precluded the detection of mutations above the background noise (Tsai et al., 2011); however, in tomato, heterozygous mutant individuals could be unambiguously identified even from 128-fold pooled DNA.
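The mutation densities reported in the Results and in Table 2 can be reproduced with the usual TILLING calculation: density = (amplicon length × number of lines screened) / number of mutations. The Python sketch below works under one assumption that is inferred rather than stated in this section: dividing the reported 23.47 Mb by the 30 563 bp total amplicon length in Table 2 gives about 768, so 768 screened lines are assumed.

```python
TOTAL_AMPLICON_BP = 30_563   # total size screened, from Table 2
TOTAL_SCREENED_BP = 23.47e6  # total DNA screened, as reported in the text
CONFIRMED_MUTATIONS = 64

# Inferred, not stated in the text: number of lines implied by the totals.
implied_lines = TOTAL_SCREENED_BP / TOTAL_AMPLICON_BP
N_LINES = round(implied_lines)


def kb_per_mutation(amplicon_bp: int, n_mutations: int,
                    n_lines: int = N_LINES) -> float:
    """Kilobases screened per mutation found (one mutation per X kb)."""
    return amplicon_bp * n_lines / n_mutations / 1000.0


overall = kb_per_mutation(TOTAL_AMPLICON_BP, CONFIRMED_MUTATIONS)
print(f"implied lines: {implied_lines:.1f}, overall: 1/{overall:.0f} kb")

# A few genes from Table 2: (size screened in bp, number of mutations).
for gene, (size_bp, n_mut) in {"ADCL1": (595, 1), "ADCS": (1121, 1),
                               "CCD4A": (1728, 8)}.items():
    print(f"{gene}: 1/{kb_per_mutation(size_bp, n_mut):.1f} kb")
```

Under this assumption the sketch gives about 1/367 kb overall and per-gene values that match the Table 2 entries for ADCL1, ADCS and CCD4A to within rounding, which supports the inferred line count without confirming it.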
In tomato, with 128-fold pooled DNA, the mutation frequency was 3.5-fold higher than the background noise, whereas in rice it was 2.6-fold higher in 96-fold pooled DNA.

Table 2 The list of genes screened for mutations and the number of mutations identified for a gene

Gene name   No. of amplicons screened   Size screened (bp)   No. of mutations   Mutation frequency (Kb)
ADCL1       1                           595                  1                  1/456.9
ADCS        2                           1121                 1                  1/860.9
CCD4A       3                           1728                 8                  1/165.8
CCD4B       2                           1124                 2                  1/431.6
CHRC        3                           1543                 5                  1/237.0
COP1        6                           3368                 10                 1/258.6
CRTISO      1                           593                  0                  –
CYCB        4                           2212                 5                  1/339.7
DHFS        2                           1167                 3                  1/298.7
FPGSM       1                           566                  0                  –
FPGSP       2                           1173                 2                  1/450.4
GCH1        2                           1046                 2                  1/401.6
GGH1        2                           1192                 2                  1/457.7
GGH2        1                           599                  0                  –
GGH3        1                           580                  2                  1/222.7
NCED1       4                           1825                 4                  1/350.4
OR          2                           1158                 2                  1/444.6
PAP3        1                           592                  0                  –
PHYF        2                           1018                 1                  1/781.8
PSY1        5                           2882                 4                  1/553.3
SPA1        2                           1273                 0                  –
SPA3        2                           1041                 3                  1/266.4
SPA3LIKE    2                           1150                 5                  1/176.6
TF          1                           417                  0                  –
ZEP         1                           600                  2                  1/230.4
Total       55                          30 563               64                 1/367

For most genes, the CODDLE predicted regions were chosen for PCR to increase the likelihood of obtaining deleterious mutations.

It is noteworthy that the tomato genome is over two times larger than the rice genome, suggesting that genome size may not influence mutation recovery. Furthermore, in contrast to the 300-bp amplicon fragments used by Tsai et al. (2011), we used 600-bp amplicons, thus effectively using a two-fold higher number of bases for mutation detection. Another distinction is that in the present study we avoided potential uneven sequence coverage, which can result from the shearing of PCR products to generate short fragments appropriate for Illumina sequencing. Given the observed differences, we postulate that the higher sensitivity for mutation discovery in tomato may be the result of a variety of factors, including wet-bench parameters and improvements to sequencing technologies.

One limiting factor in NGS approaches is the sequencing error rate (Craig et al., 2008). Improvements in sequencing accuracy, read lengths and throughput promise to allow further gains in TILLING by increasing pooling. In Danio rerio (zebrafish), using paired-end sequencing of 250-bp amplicons and by eliminating improperly aligned reads, ENU-induced mutations could be detected in 288 heterozygous fish pools (Pan et al., 2015). In that study a one-dimensional pooling strategy was employed, requiring additional genotyping by HRM to identify individuals within a pool that harbor the desired mutation. Although this approach requires extra steps, it reduces the up-front sequencing costs. A similar one-dimensional pooling strategy was recently used to evaluate natural variation in a large collection of Manihot esculenta (cassava) accessions. Here, pools of up to 281 individuals were sequenced using amplicons of up to 600 bp (Duitama et al., 2017). Our work in tomato shows the robust recovery of induced mutations in three-dimensionally pooled samples of up to 128 individuals. As EMS mutations can be discovered in large three-dimensional pools in tomato, it provides the opportunity for increasing throughput and reducing costs. For example, a 128-fold pooling allows the three-dimensional arraying of 2048 mutant lines in 48 DNA pools (Figure S1).

It is well established that the achievable frequency of induced mutations in diploid species is such that a relatively large population is required to ensure the recovery of deleterious alleles in most genes. For most TILLING studies, a population of 3000–5000 M2 lines has been used for mutation detection. The use of a pooling depth of 128 improves the chances of the detection of rare mutations, as it allows for the analysis of a larger number of M2 individuals. The design of higher fold pooling experiments, however, requires that all genomic DNAs in a pool are properly represented in the pool of amplified PCR products. With other factors held constant, it can be assumed that errors in the misrepresentation of individual samples in a pool are increased with increased pooling. Furthermore, our data show variations in read coverage between amplicons. Ideally, all PCR amplicons in a ‘TILLING by sequencing’ experiment would be sequenced to a similar depth to ensure consistent mutation discovery. Variations in depth can be controlled to some extent through the quantification of PCR products and dilutions; however, accurate quantification and dilutions can be time consuming and expensive. We chose in this study to rapidly determine the relative concentrations of each amplicon and to bin amplicons of similar concentrations together to reduce liquid handling steps. We are currently working on improvements to this approach.

The accurate identification of mutations in deeply pooled populations presents a major challenge. Considering that data contain millions of reads, mutation identification is only possible by computational tools. These tools align millions of short reads with a reference sequence. In this study, we used six different software programs for data analysis. The different programs varied considerably in their output for the accuracy of mutation detections. In
all likelihood, these variations resulted from the different algorithms used by the individual software. CRISP uses contingency tables to compute P values to identify rare variants (Bansal, 2010; Bansal et al., 2010), whereas SNVER uses a binomial–binomial model to calculate a pooled Simes’ P value to distinguish true variants from sequencing errors (Wei et al., 2011). LOFREQ assigns a P value using a Poisson–binomial distribution for the variant base (Wilm et al., 2012). Similar to CRISP, VIPR is based on the assumption that the sequence-dependent error rate is conserved across pools, except that it derives the P value using the Skellam distribution (Altmann et al., 2011). GATK UNIFIED GENOTYPER uses a Bayesian genotype likelihood model to estimate allele frequencies and uses the ploidy argument for pooled samples (McKenna et al., 2010). CAMBA employs Bayes’ theorem to compute the posterior probabilities with an Ft score (Missirian et al., 2011), and also considers the overlapping pooling scheme, whereas five other programs do not encompass a pooling scheme.

Table 3 Summary of mutations identified and analyzed for their possible effects on gene function

Gene            Nucleotide change   AA change    GC%     SIFT score   Pipeline detected
ADCL Set1       C492 -> T           L302 -> F    40.84   0            ABD
ADCS Set1       C475 -> T           Intron       38.34                ABCDE
CCD4A Set1      G128 -> A           S52 -> =     38.62   1            BF
                G176 -> C           S68 -> =             1            BF
                T500 -> A           N176 -> K            0.621        ABCD
CCD4A Set2      C41 -> T            P214 -> =    45.53   1            AB
                G62 -> A            A221 -> =            1            ABCDE
                C104 -> G           L235 -> =            1            ABD
                G129 -> C           V244 -> L            0.426        ABD
                C537 -> T           P380 -> S            0.005        AB
CCD4B Set2      G95 -> A            Intron       34.8                 ABCD
                A199 -> T           A502 -> =                         AB
CHRC Set1       C100 -> T           P24 -> S     42.04   0.514        ADE
                C101 -> T           P24 -> L             0.795        AE
                C398 -> T           T123 -> I            0.191        ACD
CHRC Set2       G100 -> A           Intron       41.16                ABC
                C120 -> T           S160 -> F            0.001        ABCDF
COP1 Set1       G38 -> A            Intron       36.51                AB
                G496 -> A           Intron                            ABCD
COP1 Set3       C218 -> T           T454 -> I    38.8    0.004        AB
                C280 -> T           Intron                            ABE
                A363 -> T           S474 -> =            1            ABD
                G370 -> A           E477 -> K            0.007        AB
COP1 Set4       G504 -> A           Intron       35.09                ABCD
                G512 -> A           Intron                            ABCD
COP1 Set5       G31 -> A            Intron       37.18                ABCD
                G69 -> A            Intron                            AB
CYCB PRO        G124 -> A           Promoter     34.17                ABCDE
                C229 -> T           Promoter                          AB
                A426 -> T           Promoter                          ABCDEF
CYCB Set1       C140 -> T           S14 -> =     37.56   1            ABCDF
CYCB Set2       T132 -> A           S200 -> T    37.54   0.476        BCDF
DHFS Set2       G38 -> A            Intron       37.93                ACF
                C99 -> G            Intron                            AC
                G229 -> A           V164 -> M            0.188        ABCF
FPGSp Set1      C119 -> G           Intron       38.07                ACD
                G249 -> A           L164 -> =            1            AE
GCH1 Set1       C248 -> T           C248 -> =    42.83   1            AB
                A496 -> T           E331 -> V            0.28         AB
GGH1 Set2       G86 -> A            Intron       34.59                ABCDE
                C241 -> T           Intron                            ABC
GGH3            T436 -> A           Intron       31.37                BF
                T495 -> C           Intron                            BD
NCED Set1       G32 -> A            G162 -> R    37.6    0            ACDE
                C402 -> A           T285 -> K            0.001        ABDEF
                C509 -> T           Q321 -> *                         AB
NCED Set2       G490 -> A           G130 -> R    46.62   0.02         ABCD
OR Set1         G251 -> A           G67 -> R     36.75   0.011        AB
OR Set2         G451 -> A           G167 -> D    36.81   0.022        ABC
PHYF Set2       T133 -> A           Intron       41.04                AB
PSY1 Set2       T115 -> A           S98 -> =     38.62   1            ABE
                A440 -> G           Intron                            ABCDEF
PSY1 Set3       C259 -> T           Intron       35.91                ABCF
PSY1 Set5       T436 -> A           Intron       39.27                ABD
SPA3LIKE Set1   A54 -> G            K13 -> E     44.12   0.853        CD
                G479 -> C           S154 -> =            1            ABD
SPA3LIKE Set2   C40 -> A            T188 -> =    42.2    1            ABCDE
                G45 -> A            G190 -> E            0.907        ABCE
                C70 -> T            V198 -> =            1            ABCD
SPA3 Set1       G57 -> A            E333 -> =    41.83   1            BCD
                G506 -> A           R483 -> K            0.662        AB
SPA3 Set2       G234 -> A           V572 -> I    40.15   0.139        AB
ZEP             G520 -> A           G500 -> R    38.66   0.098        AC
                T576 -> C           Intron                            AD

SIFT was used to assess the effect of the mutation on protein function: a SIFT score of less than 0.05 indicates the mutation is deleterious in nature. A, CAMBA; B, CRISP; C, SNVER; D, LOFREQ; E, VIPR; F, GATK UNIFIED GENOTYPER 3.6.

A comparison of these programs revealed that SNVER, LOFREQ, VIPR and GATK UNIFIED GENOTYPER are not robust enough to accurately identify the mutations. Using the simulated pool data set, it was reported that LOFREQ, CRISP and GATK UNIFIED GENOTYPER gave optimal accuracy, compared with SNVER (Huang et al., 2015). This study suffers from a drawback, however, as it did not use a true pooled data set. As standalone software, CRISP and CAMBA analysis predicted mutations more accurately. In contrast, in wheat and rice, CAMBA yielded a higher mutation discovery rate than CRISP and the other software (Tsai et al., 2011).

Our comparative analysis highlights that the algorithms used by different programs are not yet robust enough to reduce false-discovery rates; however, the use of overlapping mutations predicted by at least two programs considerably eliminated false positives, as was evident by the higher Sanger sequencing confirmation. Furthermore, assuming minimal experimental variations in the assay,
multidimensional pooling should be more accurate than one-dimensional pooling, as true mutations must repeat in each of the respective pool dimensions. It is imperative that better algorithms are developed for improving the detection of mutation by sequencing. Likewise, Tsai et al. (2011) also opined that mutation discovery by software could be made more robust by optimizing algorithms and determining the optimal probability threshold.

Considering that EMS predominantly induces G/C → A/T transitions (Vidal et al., 1995), several analyses took into account only A/T changes for mutation detection. In tomato (Garcia et al., 2016) and rice (Abe et al., 2012), only the EMS-induced A/T transition was considered to identify the mutant gene using the MutMap approach. Limiting mutation detection to the A/T transition alone can miss other mutations, however, as in this study nearly 35% of EMS-induced transitions were non-A/T transitions. GC → AT transitions represented 65% of the transitions in tomato, which is very similar to rice where 70% of transitions were GC → AT (Till et al., 2007). In earlier studies using the classical TILLING method, the frequency of EMS-induced transitions was calculated by Sanger sequencing of the putative mutants that were detected by mismatch cleavage of CEL I. Using this approach, in Micro-Tom and M82 cultivars of tomato, around 90% of mutations were G/C → A/T transitions (Piron et al., 2010; Okabe et al., 2011), whereas in Arabidopsis, Zea mays (corn) and wheat, the frequency of G/C → A/T transitions is 99% (Greene et al., 2003; Till et al., 2004; Slade et al., 2005).

When reviewing these data, it is important to consider the possibility of ascertainment bias. Although no nucleotide cleavage bias was observed with the use of crude celery juice extract and fluorescent detection (Till et al., 2004, 2010), it was reported that purified CEL I preferentially cleaves C/C ≥ C/A ≥ C/T ≥ G/G mismatches, and has lower recognition for other mismatches (Oleykowski et al., 1998; Yang et al., 2000). Current NGS-based mutation detection platforms do not suffer from such ascertainment biases. A more accurate view of mutation spectra should emerge with growing data sets, such as with the millions of TILLING mutations discovered through exome capture sequencing in wheat (Krasileva et al., 2017).

In tomato, the reported mutation density in a singly mutagenized population ranges from 1/322 to 1/1710 Kb (Gady et al., 2009; Minoia et al., 2010; Piron et al., 2010; Okabe et al., 2011, 2013; Okabe and Ariizumi, 2016), and may reflect the susceptibility of different cultivars to the mutagenesis. In this study a much higher mutation density of 1/367 Kb was observed for tomato, compared with other reports, excluding the Red Setter cultivar that had a mutation frequency of 1/322 Kb (Minoia et al., 2010). This suggests that ‘TILLING by sequencing’ can be efficiently applied for tomato. It is plausible that re-mutagenesis of the tomato may have contributed to the higher mutation density observed. Nonetheless, the above mutation density is similar to other diploid organisms like Arabidopsis (Greene et al., 2003; Till et al., 2003).

To summarize, we have adapted the ‘TILLING by sequencing’ approach for a doubly mutagenized tomato population. The approach is streamlined through the use of longer amplicons, allowing direct sequencing, and the use of multiple SNP calling algorithms to improve accuracy. The observation of a mutation density similar to that previously reported in tomato suggests this approach is robust and suitable for species where multiple mutagenesis treatments are required owing to recalcitrance in mutation accumulation. Although this study used 768 mutant lines and 55 amplicons, the approach is highly scalable and limited only by the read output of the sequencer used. We expect the approach to become more efficient as sequencing technologies evolve.

EXPERIMENTAL PROCEDURES

Development of a re-mutagenized tomato population

A population of tomato cultivar Arka Vikas, mutagenized with 120 mM EMS, was selected for re-mutagenesis. The details of population development and mutant lines collection are outlined in Figure 1(a). The seeds from 1000 M2 lines were remutagenized with 120 mM EMS, and M2M1 plants were grown in an open field. The M2M2 seeds were harvested and were sown to raise the M2M2 plants. The juvenile leaves from the individual M2M2 plants were collected for genomic DNA isolation. The M2M3 seeds were harvested and stored at 20°C.

DNA isolation and 3D pooling

Genomic DNA was isolated from four individual plants (A, B, C and D) of each M2M2 family following the protocol described by Sreelakshmi et al. (2010) with minor variations. The variations involved the use of Eppendorf tubes in place of the deep well plates, a higher centrifugation speed (13 000 g) and a proteinase K incubation step. Approximately 100 mg of leaf tissue was homogenized in 750 µL of preheated (at 65°C) DNA extraction buffer [0.1 M Tris-HCl, pH 7.5; 0.05 M EDTA, pH 8.0; 1.25% (w/v) SDS] containing 0.2 M β-mercaptoethanol and 20 mg of polyvinylpolypyrrolidone (PVPP) in 1.5-ml Eppendorf tubes using three steel balls in a bead beater. Approximately 40 µg of proteinase K was added to the homogenate, followed by incubation at 37°C for 30 min.

Post-proteinase K, the other steps were essentially similar to those described in Sreelakshmi et al. (2010), except for the usage of a higher centrifugation speed (13 000 g). The DNA quality and quantity were checked using three methods: agarose gel electrophoresis, Nanodrop estimation and Picogreen dye method (Invitrogen, now ThermoFisher Scientific, https://www.thermofisher.com). The genomic DNA was equalized to 10 ng µl⁻¹ based on Picogreen estimation. The DNA from 768 individuals of the M2M2 population was arrayed into 12 plates in the 8 × 8 grid format and stored at −80°C until further use. DNA pooling and multiplexing was carried out following the protocol described by Tsai et al., 2011 (Figure 1b). From the 12 64-well plates, equal quantities of DNA from each well were pooled in three dimensions to give 44 pools, consisting of 12 D-pools, 16 column (C) pools, and 16 row (R) pools.
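To make the three-dimensional pooling logic concrete, here is a minimal sketch. It assumes a simplified layout in which every individual sits in exactly one plate (D) pool, one row pool and one column pool, with rows and columns pooled across all plates; this is an illustration only and is not the exact 44-pool layout of Figure 1(b). All function names are hypothetical. The `pools_needed` helper reproduces the arithmetic behind the 2048 lines in 48 pools quoted in the Discussion for 128-fold pooling.

```python
from itertools import product


def pools_needed(n_lines: int, fold: int, dimensions: int = 3) -> int:
    """Pools required when each line appears once per dimension in `fold`-sized pools."""
    total_slots = n_lines * dimensions
    if total_slots % fold:
        raise ValueError("lines do not divide evenly into pools of this size")
    return total_slots // fold


def build_pools(n_plates: int, n_rows: int, n_cols: int) -> dict:
    """Map each pool name to the set of (plate, row, col) individuals it contains.

    Assumption: row and column pools are combined across all plates.
    """
    pools = {}
    for plate, row, col in product(range(n_plates), range(n_rows), range(n_cols)):
        individual = (plate, row, col)
        for name in (f"D{plate}", f"R{row}", f"C{col}"):
            pools.setdefault(name, set()).add(individual)
    return pools


def decode(pools: dict, positive_pools: list) -> set:
    """Individuals present in every pool in which the variant was called."""
    members = [pools[name] for name in positive_pools]
    return set.intersection(*members) if members else set()


if __name__ == "__main__":
    print(pools_needed(2048, 128))
    pools = build_pools(n_plates=12, n_rows=8, n_cols=8)
    print(len(pools), len(pools["D0"]), len(pools["R0"]))
    print(decode(pools, ["D3", "R5", "C2"]))
```

With 12 plates of 8 × 8 wells this toy layout gives 64-fold plate pools and 96-fold row and column pools, 28 pools in total, so it differs from the 44 pools described above. The property that matters is the same, though: a variant called in one D, one R and one C pool resolves to a single individual, whereas a call in only two dimensions leaves several candidates.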
Primer design and PCR amplification

All primers were designed to amplify a target fragment of 500–600 bp of the corresponding genes. The web-based CODDLE software (Codon Optimized to Discover Deleterious Lesions, http://blocks.fhcrc.org/proweb/coddle/) was used to predict the region in which mutagenesis is most likely to produce deleterious changes. The primers for the CODDLE-predicted region were designed using OLIGO CALC (http://biotools.nubic.northwestern.edu/OligoCalc.html). For a few genes, two or more primers were designed to encompass the predicted exon region of these genes.

The PCR was carried out in a 20-µL volume using 10 ng of pooled DNA, 1X PCR buffer [10 mM Tris, 50 mM KCl, 1.5 mM MgCl2, 0.1% (w/v) gelatin, 0.005% (v/v) Tween-20, 0.005% (v/v) NP-40, pH 8.8], 2.5 mM of each dNTP, 0.18 µl Taq polymerase (in-house isolated) and 3 pmoles of both forward and reverse primers. The cycling conditions for amplification were 94°C for 4 min; 35 cycles of 94°C for 20 s, 60°C for 45 s and 72°C for 40 s; 72°C for 10 min; and a hold at 12°C. The quality of PCR products was checked by agarose gel electrophoresis. The list of the amplicons used and their primer details are given in Table S4.

Pilot experiment

To check the efficiency of pooling depth, the genomic DNA from one heterozygous individual carrying two different known mutations (G2984A and C3139T) in the gamma-glutamyl hydrolase gene was mixed with wild-type DNA in ratios of 1:128 and 1:256 (or 1:256 and 1:512, mutant allele to wild-type allele). The pools were amplified with the gamma-glutamyl hydrolase gene and with other target genes also. The libraries were prepared, barcoded and mixed with other samples, and then sequenced.

Library preparation and sequencing

Mutation detection by NGS, including PCR amplification, library preparation and sequencing, was carried out as two independent sets, with the first being an exploratory subset and the second being the main subset. In the exploratory subset, 12 target regions of 10 different genes were amplified by PCR: ADCS, aminodeoxychorismate synthase; CHRC, chromoplast-specific carotenoid-associated protein; CRTISO, carotenoid isomerase; CYCB, chromoplast-specific lycopene β-cyclase; FPGSp, plastidial folylpolyglutamate synthase; GCHI, GTP cyclohydrolase I; GGH1, γ-glutamyl hydrolase 1; PSY1, phytoene synthase 1; NCED1, 9-cis epoxy carotenoid dioxygenase 1; and ZEP, zeaxanthin epoxidase. PCR products from all of the amplicons were pooled directly to their respective pools.

In the main subset, 45 target regions of 22 different genes were PCR amplified: ADCL, aminodeoxychorismate lyase; ADCS, aminodeoxychorismate synthase; CCD4A and CCD4B, carotenoid cleavage dioxygenase 4A and 4B; CHRC, chromoplast-specific carotenoid-associated protein; COP1, WD-40 repeat protein; CYCB, chromoplast-specific lycopene β-cyclase; DHFS, dihydrofolate synthase; FPGSm, folylpolyglutamate synthase; GCHI, mitochondrial GTP cyclohydrolase I; GGH1, GGH2 and GGH3, γ-glutamyl hydrolase 1, 2 and 3; NCED1, 9-cis epoxycarotenoid dioxygenase 1; Or, chaperone protein dnaJ-like; PAP3, plastid-lipid-associated protein 3; PHYF, phytochrome F; PSY1, phytoene synthase 1; SPA1 and SPA3, WD-40 repeat protein; SPA3LIKE, WD-40 repeat protein; and TF, trifoliate. In this subset, GGH1 was used as the positive internal control, as the exploratory subset identified the presence of two mutations in the GGH1 gene in the population. Taking into account the common genes present in both exploratory and main subsets, a total of 25 genes were analyzed for the presence of mutations. The PCR products were relatively quantified using the Advanced Analytical® Fragment Analyzer™ with the low sensitivity 1-kb separation matrix with 30-cm capillaries (#DNF935). Data analysis was performed using the included PROSIZE® 2.0 software. The relative concentration of each amplicon was estimated, and amplicons of similar concentration (within 10%) were put into the same concentration bin. All amplicons were then diluted to match the lowest-concentration bin. Following this, all amplicons produced from the same genomic DNA pool were pooled together. These pools containing PCR amplicons were used directly for the library preparation.

Indexed DNA library for NGS was prepared using the TruSeq® Nano DNA HT Library Preparation Kit with 200 ng of starting PCR products. The library was diluted to 18 pM concentration. Sequencing was performed on an Illumina MiSeq® using 2 × 300 PE chemistry according to the manufacturer’s protocol. The reads were de-multiplexed with the MISEQ REPORTER software and were stored as FASTQ files for downstream analysis.

Data analysis and mutation calling

The de-multiplexed reads were aligned to reference sequences by BWA 0.7.15 (Burrows-Wheeler Aligner) using –q 20 –k 1, followed by SAM to BAM file conversion using SAMTOOLS 1.3.1. BAM files were sorted using PICARD 1.119 tools. Sorted BAM files were then processed for variant calling by CRISP (https://github.com/vibansal/crisp/), LOFREQ (http://csb5.github.io/lofreq/), SNVER (http://snver.sourceforge.net/) and GATK UNIFIED GENOTYPER 3.6 (https://software.broadinstitute.org/gatk/download/). For VIPR (https://sourceforge.net/projects/htsvipr/) and CAMBA (http://comailab.genomecenter.ucdavis.edu/index.php/TILLING_by_Sequencing#Bioinformatics_tools) analysis, sorted BAM files were converted to MPILEUP using SAMTOOLS for further processing. CRISP requires multiple pooled BAM files to run simultaneously. Therefore, for CRISP, BAM files from all column pools were run in parallel. Similarly, for row pools and D pools too, the BAM files were also run in parallel. For all BAM file runs, a –qvoffset filter of 33 was used.

For VIPR analysis, RKWARD (https://rkward.kde.org/), a GUI for R, was used, as VIPR requires the R environment to run. Similar to CRISP, VIPR too requires multiple pooled MPILEUP files to run together. For VIPR, the MPILEUP files for all column, row and D pools were independently run with default parameters. The other three programs, LOFREQ, SNVER and GATK UNIFIED GENOTYPER, were independently run on individual BAM files. LOFREQ was run with default parameters. In the case of SNVER, –t 0, –a 0, –s 0, –f 0 and –p 0.1 were used to achieve more specificity. For GATK UNIFIED GENOTYPER, the ploidy argument was used for variant calling. The CAMBA pipeline is designed for overlapping paired-end reads, and uses custom PYTHON scripts for variant calling. As the obtained Illumina reads were non-overlapping paired ends, the CAMBA script was slightly custom-modified to make it suitable for our data set (Appendix S1).

All five programs, CRISP, LOFREQ, SNVER, VIPR and GATK UNIFIED GENOTYPER, gave the variants in the vcf format; however, CRISP and VIPR gave the pooled vcf files for C pools, R pools and D pools separately, whereas LOFREQ and SNVER gave the vcf files for the individual pool. The pooled vcf files were converted into individual vcf files in the case of CRISP and VIPR. As every M2M2 plant line harboring a particular mutation is represented in three different kinds of pools, we considered only those mutations that were present in all three pools (R pool, C pool and D pool) at an identical position.

To identify the mutant individuals from the pools, we used a custom shell script using VCFTOOLS –ISEC command on the individual pool vcf files. As CAMBA is designed for an overlapping pooling strategy, it directly identifies the individuals carrying the mutation
in the output file with a Bayesian method threshold of positive five. All the predicted mutations from these programs were cataloged. Among these, the mutations that were predicted by at least two different programs were selected for further analysis. The predicted mutations in the filtered variants were confirmed by the Sanger sequencing of genomic DNA of mutant individuals. The effect of the mutation on protein function was analyzed by SIFT4G (Sorting Intolerant From Tolerant, http://sift.bii.a-star.edu.sg/sift4g/AboutSIFT4G.html). Mutations with the SIFT scores of <0.05 were predicted to affect protein function.

ACKNOWLEDGEMENTS

This work was supported by the Department of Biotechnology, India (grant nos BT/PR/5275/AGR/16/465/2004, BT/PR/7002/PBD/16/1009/2012 and BT/COE/34/SP15209/2015) to RS and YS, International Atomic Energy Agency (IAEA) (grant no. 15166/R0-4 to YS), Council of Scientific and Industrial Research, India (research fellowship to PG, KT, SH) and University Grants Commission, India (PU, ST, AS, SS). Funding for PCR quantification, library preparation and sequencing was provided to BT and SD by the Food and Agriculture Organization of the United Nations and the International Atomic Energy Agency through their Joint FAO/IAEA Programme of Nuclear Techniques in Food and Agriculture.

CONFLICT OF INTEREST

The authors declare no conflicts of interests.

SUPPORTING INFORMATION

Additional Supporting Information may be found in the online version of this article.
Figure S1. Tridimensional pooling using 128-fold pooled genome DNA for mutation discovery.
Table S1. The effect of 120 mM EMS mutagenesis on the seed germination, survival and fertility rate of the plants.
Table S2. List of mutations identified using multiple software.
Table S3. Comparison of mutagenesis frequency obtained in this study with other studies on tomato cultivars using classical TILLING.
Table S4. List of genes used in this study and primer pairs used for PCR amplifications.
Appendix S1. Modified pileup_processor.py script (CAMBA pipeline) for carrying out the analysis of non-overlapping Illumina reads.

REFERENCES

Abe, A., Kosugi, S., Yoshida, K. et al. (2012) Genome sequencing reveals agronomically important loci in rice using MutMap. Nature Biotech. 30, 174–178.
Ahloowalia, B. and Maluszynski, M. (2001) Induced mutations–A new paradigm in plant breeding. Euphytica, 118, 167–173.
Altmann, A., Weber, P., Quast, C., Rex-Haffner, M., Binder, E.B. and Müller-Myhsok, B. (2011) vipR: variant identification in pooled DNA using R. Bioinformatics, 27, i77–i84.
Bansal, V. (2010) A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics, 26, i318–i324.
Bansal, V., Harismendy, O., Tewhey, R., Murray, S.S., Schork, N.J., Topol, E.J. and Frazer, K.A. (2010) Accurate detection and genotyping of SNPs utilizing population sequencing data. Genome Res. 20, 537–545.
Bentley, A., MacLennan, B., Calvo, J. and Dearolf, C.R. (2000) Targeted recovery of mutations in Drosophila. Genetics, 156, 1169–1173.
Chen, L., Hao, L., Parry, M.A., Phillips, A.L. and Hu, Y.G. (2014) Progress in TILLING as a tool for functional genomics and improvement of crops. J. Integr. Plant Biol. 56, 425–443.
Colbert, T., Till, B.J., Tompa, R., Reynolds, S., Steine, M.N., Yeung, A.T., McCallum, C.M., Comai, L. and Henikoff, S. (2001) High-throughput screening for induced point mutations. Plant Physiol. 126, 480–484.
Craig, D.W., Pearson, J.V., Szelinger, S. et al. (2008) Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods, 5, 887–893.
Doebley, J.F., Gaut, B.S. and Smith, B.D. (2006) The molecular genetics of crop domestication. Cell, 127, 1309–1321.
Dong, C., Vincent, K. and Sharp, P. (2009) Simultaneous mutation detection of three homoeologous genes in wheat by High Resolution Melting analysis and Mutation Surveyor®. BMC Plant Biol. 9, 143.
Duitama, J., Kafuri, L., Tello, D., Leiva, A.M., Hofinger, B., Datta, S., Lentini, Z., Aranzales, E., Till, B. and Ceballos, H. (2017) Deep assessment of genomic diversity in cassava for herbicide tolerance and starch biosynthesis. Comput. Struct. Biotechnol. J. 15, 185–194.
Gady, A.L., Hermans, F.W., Van de Wal, M.H., van Loo, E.N., Visser, R.G. and Bachem, C.W. (2009) Implementation of two high through-put techniques in a novel application: detecting point mutations in large EMS mutated plant populations. Plant Methods, 5, 13.
Garcia, V., Bres, C., Just, D. et al. (2016) Rapid identification of causal mutations in tomato EMS populations via mapping-by-sequencing. Nat. Protoc. 11, 2401–2418.
Greene, E.A., Codomo, C.A., Taylor, N.E. et al. (2003) Spectrum of chemically induced mutations from a large-scale reverse-genetic screen in Arabidopsis. Genetics, 164, 731–740.
Guo, Y., Abernathy, B., Zeng, Y. and Ozias-Akins, P. (2015) TILLING by sequencing to identify induced mutations in stress resistance genes of peanut (Arachis hypogaea). BMC Genom. 16, 157.
Hedden, P. (2003) The genes of the Green Revolution. Trends Genet. 19, 5–9.
Huang, H.W., Mullikin, J.C. and Hansen, N.F. (2015) Evaluation of variant detection software for pooled next-generation sequence data. BMC Bioinformatics, 16, 235.
Jankowicz-Cieslak, J. and Till, B.J. (2015) Forward and reverse genetics in crop breeding. In Advances in Plant Breeding Strategies: Breeding, Biotechnology and Molecular Tools (Al-Khayri, J.M., Jain, S.M. and Johnson, D.V. eds). Springer International Publishing, pp. 215–240.
Jankowicz-Cieslak, J., Huynh, O.A., Brozynska, M., Nakitandwe, J. and Till, B.J. (2012) Induction, rapid fixation and retention of mutations in vegetatively propagated banana. Plant Biotech J. 10, 1056–1066.
Krasileva, K.V., Vasquez-Gross, H.A., Howell, T. et al. (2017) Uncovering hidden variation in polyploid wheat. Proc. Natl Acad. Sci. USA 114, 913–921.
Kurowska, M., Daszkowska-Golec, A., Gruszka, D., Marzec, M., Szurman, M., Szarejko, I. and Maluszynski, M. (2011) TILLING—a shortcut in functional genomics. J. Appl. Genet. 52, 371–390.
Marroni, F., Pinosio, S., Di Centa, E., Jurman, I., Boerjan, W., Felice, N., Cattonaro, F. and Morgante, M. (2011) Large-scale detection of rare variants via pooled multiplexed next-generation sequencing: towards next-generation EcoTILLING. Plant J. 67, 736–745.
Mba, C. (2013) Induced mutations unleash the potentials of plant genetic resources for food and agriculture. Agronomy, 3, 200–231.
McCallum, C.M., Comai, L., Greene, E.A. and Henikoff, S. (2000) Targeting induced local lesions in genomes (TILLING) for plant functional genomics. Plant Physiol. 123, 439–442.
McKenna, A., Hanna, M., Banks, E. et al. (2010) The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303.
Minoia, S., Petrozza, A., D’Onofrio, O., Piron, F., Mosca, G., Sozio, G., Cellini, F., Bendahmane, A. and Carriero, F. (2010) A new mutant genetic resource for tomato crop improvement by TILLING technology. BMC Res. Notes, 3, 69.
Missirian, V., Comai, L. and Filkov, V. (2011) Statistical mutation calling from sequenced overlapping DNA pools in TILLING experiments. BMC Bioinformatics, 12, 287.
Okabe, Y. and Ariizumi, T. (2016) Mutant resources and TILLING platforms in tomato research. In Functional Genomics and Biotechnology in Solanaceae and Cucurbitaceae Crops (Ezura, H., Ariizumi, T., Garcia-Mas, J. and Rose, J. eds). Springer, pp. 75–91.
Okabe, Y., Asamizu, E., Saito, T., Matsukura, C., Ariizumi, T., Bres, C., Rothan, C., Mizoguchi, T. and Ezura, H. (2011) Tomato TILLING technology: development of a reverse genetics tool for the efficient isolation of
mutants from Micro-Tom mutant libraries. Plant Cell Physiol. 52, 1994– Till, B.J., Reynolds, S.H., Greene, E.A., Codomo, C.A., Enns, L.C., Johnson,
2005. J.E., Burtner, C., Odden, A.R., Young, K. and Taylor, N.E. (2003) Large-
Okabe, Y., Ariizumi, T. and Ezura, H. (2013) Updating the micro-tom TIL- scale discovery of induced point mutations with high-throughput TIL-
LING platform. Breed. Sci. 63, 42–48. LING. Genome Res. 13, 524–530.
Oleykowski, C.A., Mullins, C.R.B., Godwin, A.K. and Yeung, A.T. (1998) Till, B.J., Reynolds, S.H., Weil, C., Springer, N., Burtner, C., Young, K., Bow-
Mutation detection using a novel plant endonuclease. Nucleic Acids Res. ers, E., Codomo, C.A., Enns, L.C. and Odden, A.R. (2004) Discovery of
26, 4597–4602. induced point mutations in maize genes by TILLING. BMC Plant Biol. 4,
Pan, L., Shah, A.N., Phelps, I.G., Doherty, D., Johnson, E.A. and Moens, C.B. 12.
(2015) Rapid identification and recovery of ENU-induced mutations with Till, B.J., Zerr, T., Comai, L. and Henikoff, S. (2006) A protocol for TILLING
next-generation sequencing and Paired-End Low-Error analysis. BMC and Ecotilling in plants and animals. Nat. Protoc. 1, 2465–2477.
Genom. 16, 83. Till, B.J., Cooper, J., Tai, T.H., Colowit, P., Greene, E.A., Henikoff, S. and
Piron, F., Nicola€ı, M., Mino€ıa, S., Piednoir, E., Moretti, A., Salgues, A., Zamir, Comai, L. (2007) Discovery of chemically induced mutations in rice by
D., Caranta, C. and Bendahmane, A. (2010) An induced mutation in TILLING. BMC Plant Biol. 7, 19.
tomato eIF4E leads to immunity to two potyviruses. PLoS ONE, 5, e11313. Till, B.J., Jankowicz-Cieslak, J., Sagi, L., Huynh, O.A., Utsushi, H., Sweenen,
Prakash, L. and Sherman, F. (1973) Mutagenic specificity: reversion of iso-1- R., Terauchi, R. and Mba, C. (2010) Discovery of nucleotide polymor-
cytochrome c mutants of yeast. J. Mol. Biol. 79, 65–82. phisms in the Musa gene pool by Ecotilling. Theor. Appl. Genet. 121,
Raghavan, C., Naredo, M.E.B., Wang, H., Atienza, G., Liu, B., Qiu, F., 1381–1389.
McNally, K.L. and Leung, H. (2007) Rapid method for detecting SNPs on Tsai, H., Howell, T., Nitcher, R. et al. (2011) Discovery of rare mutations in
agarose gels and its application in candidate gene mapping. Mol. Breed. populations: TILLING by sequencing. Plant Physiol. 156, 1257–1268.
19, 87–101. Vaser, R., Adusumalli, S., Leng, S.N., Sikic, M. and Ng, P.C. (2016) SIFT mis-
Rigola, D., van Oeveren, J., Janssen, A., Bonne, A., Schneiders, H., van der sense predictions for genomes. Nat. Protoc. 11, 1–9.
Poel, H.J., van Orsouw, N.J., Hogers, R.C., de Both, M.T. and van Eijk, Vidal, A., Abril, N. and Pueyo, C. (1995) DNA repair by Ogt alkyltransferase
M.J. (2009) High-throughput detection of induced mutations and natural influences EMS mutational specificity. Carcinogenesis, 16, 817–821.
variation using KeyPoint technology. PLoS ONE, 4, e4761.
TM
Wei, Z., Wang, W., Hu, P., Lyon, G.J. and Hakonarson, H. (2011) SNVer: a
Saito, T., Ariizumi, T., Okabe, Y., Asamizu, E., Hiwasa-Tanase, K., Fukuda, statistical tool for variant calling in analysis of pooled or individual next-
N., Mizoguchi, T., Yamazaki, Y., Aoki, K. and Ezura, H. (2011) TOMA- generation sequencing data. Nucleic Acids Res. 39, e132–e132.
TOMA: a novel tomato mutant database distributing Micro-Tom mutant Wilm, A., Aw, P.P.K., Bertrand, D., Yeo, G.H.T., Ong, S.H., Wong, C.H., Khor,
collections. Plant Cell Physiol. 52, 283–296. C.C., Petric, R., Hibberd, M.L. and Nagarajan, N. (2012) LoFreq: a
Shental, N., Amir, A. and Zuk, O. (2010) Identification of rare alleles and sequence-quality aware, ultra-sensitive variant caller for uncovering cell-
their carriers using compressed sequencing. Nucleic Acids Res. 38, e179. population heterogeneity from high-throughput sequencing datasets.
Slade, A.J., Fuerstenberg, S.I., Loeffler, D., Steine, M.N. and Facciotti, D. Nucleic Acids Res. 40, 11189–11201.
(2005) A reverse genetic, nontransgenic approach to wheat crop Yang, B., Wen, X., Kodali, N.S., Oleykowski, C.A., Miller, C.G., Kulinski, J.,
improvement by TILLING. Nature Biotechnol. 23, 75–81. Besack, D., Yeung, J.A., Kowalski, D. and Yeung, A.T. (2000) Purification,
Sreelakshmi, Y., Gupta, S., Bodanapu, R., Chauhan, V.S., Hanjabam, M., cloning, and characterization of the CEL I nuclease. Biochemistry, 39,
Thomas, S., Mohan, V., Sharma, S., Srinivasan, R. and Sharma, R. (2010) 3533–3541.
NEATTILL: a simplified procedure for nucleic acid extraction from Zhu, Q., Smith, S.M., Ayele, M., Yang, L., Jogi, A., Chaluvadi, S.R. and Ben-
arrayed tissue for TILLING and other high-throughput reverse genetic netzen, J.L. (2012) High-throughput discovery of mutations in tef semi-
applications. Plant Methods, 6, 3. dwarfing genes by next-generation sequencing analysis. Genetics, 192,
Talame, V., Bovina, R., Sanguineti, M.C., Tuberosa, R., Lundqvist, U. and 819–829.
Salvi, S. (2008) TILLMore, a resource for the discovery of chemically
induced mutants in barley. Plant Biotechnol. J. 6, 477–485.