
Virology 500 (2017) 130–138
Contents lists available at ScienceDirect
Virology
journal homepage: www.elsevier.com/locate/yviro
VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs
Yi Zheng(a,1), Shan Gao(a,b,1), Chellappan Padmanabhan(c), Rugang Li(c), Marco Galvez(d), Dina Gutierrez(d), Segundo Fuentes(d), Kai-Shu Ling(c), Jan Kreuze(d), Zhangjun Fei(a,e,⁎)
(a) Boyce Thompson Institute, Ithaca, NY 14853, USA
(b) College of Life Sciences, Nankai University, Tianjin 300071, PR China
(c) U.S. Vegetable Laboratory, U.S. Department of Agriculture-Agricultural Research Service, Charleston, SC 29414, USA
(d) Virology laboratory, International Potato Center (CIP), Lima, Peru
(e) Robert W. Holley Center for Agriculture and Health, U.S. Department of Agriculture-Agricultural Research Service, Ithaca, NY 14853, USA
ARTICLE INFO

Keywords: Virus discovery; Small RNA; Next-generation sequencing; VirusDetect

ABSTRACT

Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived small interfering RNAs has proven to be a highly efficient approach for virus discovery. Here we present VirusDetect, a bioinformatics pipeline that can efficiently analyze large-scale small RNA (sRNA) datasets for both known and novel virus identification. VirusDetect performs both reference-guided assemblies through aligning sRNA sequences to a curated virus reference database and de novo assemblies of sRNA sequences with automated parameter optimization and the option of host sRNA subtraction. The assembled contigs are compared to a curated and classified reference virus database for known and novel virus identification, and evaluated for their sRNA size profiles to identify novel viruses. Extensive evaluations using plant and insect sRNA datasets suggest that VirusDetect is highly sensitive and efficient in identifying known and novel viruses. VirusDetect is freely available at http://bioinfo.bti.cornell.edu/tool/VirusDetect/.
1. Background

Virus infections are universal and recognized as a significant threat to agriculture production and human health. Efficient and accurate detection of viruses in plants and animals is essential for the development of effective strategies to manage the spread and impact of viral diseases. Conventional virus detection methods such as enzyme-linked immunosorbent assay (ELISA), polymerase chain reaction (PCR), nucleic acid hybridization or microarray are useful, but they require prior knowledge or sequence information of the potential pathogens; thus they are not highly efficient in detecting novel viruses or virus variants. Traditional methods for the detection of unknown viruses, including electron microscopy and biological indicator hosts, are limited in their scope and only permit partial characterization of novel agents. Recently, a new approach, virus discovery by high throughput sequencing and assembly of total small RNAs (small RNA sequencing and assembly; sRSA), has proven to be highly efficient in plant and animal virus detection (Kreuze et al., 2009; Wu et al., 2010, 2015). This approach exploits a natural and fundamental antiviral defense mechanism called RNA silencing or RNA interference (RNAi). In eukaryotes, upon viral infection an innate immunity system employs DICER and DICER-like (DCL) enzymes to cleave viral RNAs into small interfering RNAs (siRNAs) with sizes from 21 to 24 nucleotides (nt), which are further amplified by RNA dependent RNA polymerases (RdRPs). These virus-derived siRNAs direct antiviral immunity through the RNAi mechanism that guides ARGONAUTE proteins to silence the infected viral RNAs through cleavage or translational inhibition (Ding, 2010). During this process, the virus-derived siRNAs are automatically enriched in the hosts, which can be readily detected by deep sequencing of host small RNAs (sRNAs) (Kreuze et al., 2009; Wu et al., 2010). In contrast to most traditional methods in virus detection, the sRSA approach does not require any prior virus sequence information or the ability to cultivate and purify viruses. In addition, compared to virus detection using genome or RNA sequencing, the sRSA approach is very sensitive in identifying viruses and sub-viral agents of different nucleic acid types and genome structures, including single-stranded RNA (ssRNA) viruses of positive or negative polarity, double-stranded RNA (dsRNA) viruses, DNA viruses, and
⁎ Corresponding author at: Boyce Thompson Institute, Ithaca, NY 14853, USA. E-mail address: zf25@cornell.edu (Z. Fei).
1 Equal contributor.
http://dx.doi.org/10.1016/j.virol.2016.10.017
Received 11 August 2016; Received in revised form 17 October 2016; Accepted 20 October 2016
0042-6822/ © 2016 Elsevier Inc. All rights reserved.
viroids, even when they occur at low titers that are not readily detected by other methods such as RNA-Seq (Kreuze, 2014; Wu et al., 2015). Moreover, sRNA profile (e.g., size distribution) can help to identify novel viruses in a sequence-independent manner and provide information about virus biology (Aguiar et al., 2015; Webster et al., 2015). Furthermore, sRNAs are short, so less expensive short-read runs can be used, and because of the natural enrichment of viral siRNAs, only relatively shallow sequencing is required, thus reducing the sequencing cost significantly.

Thanks to the rapid advance of next generation sequencing (NGS) technologies, the sRSA approach has been widely used for the identification of known and novel DNA and RNA viruses as well as viroids in both plants and animals (reviewed in Wu et al., 2015). Despite the expanding popularity of the sRSA approach in virus discovery, currently there is no software package available that is specifically designed to detect viruses using large-scale sRNA sequence data generated with NGS technologies. Several bioinformatics tools have been developed for virus discovery using NGS data such as PathSeq (Kostic et al., 2011), VirusFinder (Wang et al., 2013), VirusSeq (Chen et al., 2013), VirusHunter (Zhao et al., 2013), VirFind (Ho and Tzanetakis, 2014), PFOR2 (Zhang et al., 2014) and IVA (Hunt et al., 2015). However, these tools are mainly designed for virus discovery using RNA-Seq or genome sequencing data and some are specifically designed only for human pathogen discovery. Tools designed specifically for sRNA include Paparazzi (Vodovar et al., 2011), which reconstitutes viral genomes through alignment using a related reference sequence as scaffold, and PFOR (Wu et al., 2012), which is specifically designed for discovery of viroids. Visitor (Antoniewski, 2011), viRome (Watson et al., 2013), SearchSmallRNA (de Andrade and Vaslin, 2014) and MISIS (Seguin et al., 2014) are graphical interface tools that can only provide alignments of the siRNA reads to a chosen virus reference genome; therefore they cannot be used as general tools for identification of novel and known viruses.

In this study, we present VirusDetect, an automated bioinformatics pipeline that can detect both known and novel viruses efficiently and accurately using deep sequencing of plant or animal siRNAs. We demonstrate the high efficiency and sensitivity of VirusDetect in detecting known and novel viruses using sRNA data generated from plants and animals. Furthermore, we demonstrate that VirusDetect can also efficiently identify viruses using other types of NGS datasets, suggesting VirusDetect can be used as a universal tool for virus discovery using NGS technologies.

2. Results and discussion

2.1. Design of VirusDetect

The workflow of VirusDetect is shown in Fig. 1A. VirusDetect accepts cleaned sRNA sequences in fasta or fastq format. The program first maps the sRNA reads to known virus reference sequences using BWA (Li and Durbin, 2009). The mapped sRNA reads are considered as virus-derived and then assembled into virus contigs using the reference-guided approach. Next VirusDetect performs de novo assembly of sRNAs. De novo assembly in VirusDetect is performed using Velvet (Zerbino and Birney, 2008). The ‘hash_length’ (the length of k-mer) is the most important parameter that determines the assembly quality of Velvet and the ‘cov_cutoff’ (the coverage cutoff) is another important parameter (Zerbino, 2010). VirusDetect first determines the best ‘hash_length’ by performing Velvet assemblies using different k-mer lengths. It then determines the best ‘cov_cutoff’ by performing assemblies using the determined best ‘hash_length’ and different coverage cutoffs. Since host siRNAs are distributed sporadically across the host genome, when de novo assembled, they normally result in very short contigs; while virus-derived siRNAs are distributed densely across entire virus genomes and thus can be de novo assembled into much longer contigs. Therefore, in VirusDetect, the best de novo assemblies are defined as those containing the longest assembled contigs. The de novo assembled contigs are pooled together with those generated from reference-guided assemblies, and then processed to remove redundant sequences using the same method described in Zheng et al. (2011). VirusDetect then employs a homology-dependent strategy to identify known and novel virus sequences from the assembled contigs. The program first compares the contigs against reference virus nucleotide sequences using the BLASTN program (Altschul et al., 1990), and those having hits to virus nucleotide sequences are identified as virus contigs. The remaining contigs are further compared against the reference virus protein sequences using the BLASTX program (Altschul et al., 1990), and those having hits to the virus protein sequences are also identified as virus contigs. Contigs matched to the same reference sequence are merged to form the final output, and used to derive the coverage of the reference by virus contigs. In addition, the depth of each virus contig covered by sRNA reads is calculated and used to derive the average sRNA sequencing depth of the matched virus contigs. To make it comparable among different samples, raw depth is normalized to reads per million (RPM). VirusDetect derives the coverage and the depth information since they are the major parameters indicating whether the identified viruses are true or due to alignment artifacts, as true viruses generally have higher coverage and depth. Finally, VirusDetect generates user-friendly hyperlinked html tables to display the list of viruses identified from BLASTN and BLASTX searches, respectively (Fig. 1B). Each reference virus sequence in the table is linked to a page, which shows the detailed sequence alignment information of the virus contigs to the reference (Fig. 1C). In addition, VirusDetect also provides the sequences of detected virus contigs and their corresponding reference viruses in fasta format. The alignment information between the contigs and the reference viruses is also provided in files of Excel and SAM (Li et al., 2009) format.

It has been reported that subtraction of host derived sRNAs enriches virus-derived siRNAs and reduces noise introduced from host sRNAs, thus improving the assembly of virus siRNAs and increasing the efficiency of virus detection (Isakov et al., 2011; Li et al., 2012). Therefore, prior to de novo assembly, VirusDetect can optionally map the sRNA reads to host reference sequences to subtract host-derived sRNAs. In addition, the contigs generated from de novo assemblies are also aligned to the host sequences and those having high nucleotide identity (e.g., > 90%) to the host sequences are discarded. Some host genomes contain integrated viral sequences that are related to extant replicating viruses, but are mostly inactive fragments and could be falsely identified as an infecting virus if the genome subtraction is not employed. On the other hand, host genomes used for subtraction should be carefully examined to ensure they do not contain any inadvertent viral sequence contamination.

It is worth pointing out that homology-dependent strategies employed by VirusDetect would not be able to identify novel viruses whose genomes do not have similarity to any known virus sequences at both nucleotide and protein levels. It has been reported that virus-derived siRNAs are predominantly 21 or 22 nt in length (Ding, 2010; Webster et al., 2015). Using this type of siRNA profile to identify novel viral contigs in a sequence independent manner has been demonstrated in insects and plants (Aguiar et al., 2015; Webster et al., 2015); however, its general applicability is not yet clear. Nonetheless, for assembled contigs showing no viral sequence homology, VirusDetect generates an html table listing their siRNA size profiles and highlighting those with viral-like siRNA signatures (i.e., enrichment of 21-nt and 22-nt siRNAs) (Supplementary Fig. 1). However, it is suggested that although the majority of VirusDetect-reported viral-like contigs based on siRNA size profiles are virus derived, they should be further evaluated in more detail to confirm their authenticity.

For virus detection using VirusDetect, the quality and size of the reference virus database are important factors affecting the efficiency of sRNA alignments and homology searches of virus contigs. For this
Fig. 1. Design of VirusDetect. (A) Flowchart of VirusDetect for virus identification using sRNA sequencing data. (B, C) Screenshots of VirusDetect output html pages showing the
list of identified viruses (B) and alignments of virus contigs to the reference virus genome (C).
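The two-stage parameter search described in Section 2.1 (first the k-mer length, then the coverage cutoff, with each assembly scored by its longest contig) can be sketched as follows. This is a minimal illustration and not VirusDetect's Perl code: `assemble` is a hypothetical callable standing in for a Velvet run, and the candidate values and the default cutoff used in the first stage are assumptions supplied by the caller.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: assemble(hash_length, cov_cutoff) -> list of contig sequences.
Assembler = Callable[[int, float], List[str]]


def longest_contig(contigs: List[str]) -> int:
    """Score an assembly by the length of its longest contig (0 if empty)."""
    return max((len(c) for c in contigs), default=0)


def optimize_parameters(
    assemble: Assembler,
    hash_lengths: Iterable[int],
    cov_cutoffs: Iterable[float],
    default_cov_cutoff: float,
) -> Tuple[int, float]:
    """Pick the best hash_length first, then the best cov_cutoff for it."""
    # Stage 1: vary the k-mer length at a fixed (assumed) coverage cutoff.
    best_k = max(
        hash_lengths,
        key=lambda k: longest_contig(assemble(k, default_cov_cutoff)),
    )
    # Stage 2: vary the coverage cutoff at the chosen k-mer length.
    best_cov = max(
        cov_cutoffs,
        key=lambda c: longest_contig(assemble(best_k, c)),
    )
    return best_k, best_cov
```

Passing the assembler in as a callable keeps the search logic separate from how the assembler is actually invoked, so the sketch can be exercised with a stub before wiring in a real assembly step.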
reason, we classified the GenBank virus entries into eight different host kingdoms: vertebrate, invertebrate, plant, protozoa, algae, fungus, bacteria, and archaea. sRNA samples generated from a specific host would only need to be compared to the reference virus sequences from the kingdom the host belongs to, thus substantially reducing the running time and the potential noise introduced from virus sequences of other unrelated host kingdoms. Nonetheless, reference virus sequences from multiple host kingdoms can be combined and used to identify viruses that may have hosts from different kingdoms. Currently GenBank (release 211) contains approximately two million virus entries, among which ~94.8% were classified into the vertebrate kingdom, ~4.4% into the plant kingdom, ~2.6% into the invertebrate kingdom, and very few (< 0.5%) into each of the other five kingdoms. It is worth noting that ~2% of the virus entries could be classified into more than one kingdom. To further improve the efficiency of VirusDetect, the virus databases from different host kingdoms were processed to remove redundant sequences. The number of virus sequences in the resulting non-redundant databases (95% redundancy level) is approximately 11.6%, 13%, and 14.3% of that in the original GenBank virus database for vertebrate, invertebrate, and plant viruses, respectively (Supplementary Table 1).

2.2. Identification of known viruses using VirusDetect

We first evaluated the performance of VirusDetect in detection of known viruses using sRNA sequences generated from a set of 16 potato accessions. These accessions were evaluated using the ISO17025 certified standard virus indexing procedure implemented at the International Potato Center (CIP) for the infection of known potato

Table 1. Comparison of potato viruses detected by standard virus indexing procedure (indicator plant, ELISA, NASH) and small RNA sequencing and assembly (sRSA) through VirusDetect.

Sample (CIP germplasm accession number) | Standard indexing (from potato and/or indicator plants grown in greenhouse) | sRSA(a) (from in vitro potato plant extractions) | PCR confirmation (from in vitro potato plant extractions)
706735 | PVX(b,c,d) | PVX, PVA | PVX, PVA
396009.258 | − | − | −
703471 | PVS(b) | PVS | PVS
705268 | PLRV(b), PVX(b,c,d) | PLRV, PVX | PLRV, PVX
700744 | PVS(b,c,d), PVT(b) | PVS, PVT | PVS, PVT
706851 | PVX(c,d), PVS(b) | PVX, PVS, PVT | PVX, PVS, PVT
703518 | PVS(b) | PVS | PVS
704832 | PLRV(d), APMMV(b,d), PVX(d) | PLRV, APMMV, PVT | PLRV, APMMV, PVT
703573 | − | − | −
308328.32 | − | − | −
398098.203 | − | − | −
396272.22 | PVS(b,d) | PVS | PVS
396063.16 | PLRV(d) | PLRV | PLRV
598198.4 | − | − | −
304413.45 | − | − | −
393046.7 | PVX(b,c,d), PVS(b,c) | PVX, PVS | PVX, PVS

(a) By sRSA using VirusDetect.
(b) Potato plant.
(c) Indicator plant (mechanical inoculation).
(d) Datura stramonium (graft-inoculation).
[Fig. 2 panels, labeled by accession (virus): 706735 (PVX, PVA); 703471 (PVS); 705268 (PLRV, PVX); 700744 (PVS, PVT); 706851 (PVX, PVS, PVT); 703518 (PVS); 704832 (PLRV, PVT, APMMV); 396272.22 (PVS); 396063.16 (PLRV); 393046.7 (PVX, PVS). Identity color scale: 100% to 75%.]
Fig. 2. Viruses identified from the potato samples by VirusDetect. Alignments of identified virus contigs to the reference virus genomes. Blue tracks represent reference virus
genomes, and red tracks represent assembled virus contigs. PVA, Potato virus A; PVS, Potato virus S; PVX, Potato virus X; PLRV, Potato leafroll virus; PVT, Potato virus T; and APMMV,
Andean potato mild mosaic virus.
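Fig. 2 shows how much of each reference genome is covered by assembled contigs, and Section 2.1 describes normalizing sequencing depth to reads per million (RPM). A rough sketch of both quantities is given below. It assumes that contig alignments are available as 1-based inclusive (start, end) intervals on the reference and that RPM scales the raw depth by the total number of reads in the sample; the exact formula and all function names are illustrative assumptions, not taken from the VirusDetect package.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end) on the reference


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def reference_coverage(intervals: Iterable[Interval], ref_length: int) -> float:
    """Fraction of reference positions covered by at least one contig."""
    covered = sum(end - start + 1 for start, end in merge_intervals(intervals))
    return covered / ref_length


def depth_rpm(raw_depth: float, total_reads: int) -> float:
    """Scale a raw depth to reads per million (assumed normalization)."""
    return raw_depth * 1_000_000 / total_reads
```

Merging intervals before summing avoids counting positions twice when contigs overlap on the same reference.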
viruses including Potato virus S (PVS), Potato virus X (PVX), Potato leafroll virus (PLRV), Potato virus T (PVT) and Andean potato mild mosaic virus (APMMV) (Table 1). A total of 0.5–1.3 million high-quality sRNA reads were generated for each of these samples, among which 60–70% could be mapped to the potato genome while 0.6–5.6% mapped to the plant virus database (Supplementary Table 2). Consistent with the results from the virus indexing analysis, VirusDetect did not detect any viruses in six accessions (396009.258, 703573, 308328.32, 398098.203, 598198.4, and 304413.45) while it identified the corresponding viruses in the remaining accessions except PVX in accession 704832 (Table 1). For the majority of the identified viruses (15 out of 18), the assembled contigs from sRNAs covered more than 50% of their genomes, and genomes of approximately half of the identified viruses were covered by more than 95% (Fig. 2 and Supplementary Table 3). In addition, Potato virus A (PVA) was detected in accession 706735 by VirusDetect with assembled contigs covering nearly the entire PVA genome (Fig. 2 and Supplementary Table 3), but not identified by the current indexing procedure since an ELISA test for this virus was not included in the standard procedure and symptoms corresponding to PVA infection may have been masked by those of PVX in the indicator hosts. Furthermore, PVT was detected in accessions 706851 and 704832 by VirusDetect, but not by nucleic acid spot hybridization (NASH) or indicator host ranges, likely because the concentration of the virus was below the threshold of those tests. All viruses identified by VirusDetect were confirmed by RT-PCR tests from the original in vitro plants (Table 1). Sequencing of PCR fragments corresponding to the viruses detected by VirusDetect also confirmed accurate sequence assembly over the amplified regions. Conversely, PVX was detected in accession 704832 by ELISA only from grafted Datura stramonium but not by VirusDetect, nor PCR from the original plant, suggesting a possible greenhouse contamination during the indexing process. Indeed, PVX is highly contagious and other indicator plants inoculated with this accession should have shown symptoms of infection; however, this was not the case.

For this analysis, VirusDetect was run on a laptop with i5-3320M CPU using one thread. It took 4–13 min for VirusDetect to analyze each of these 16 samples for virus detection (Supplementary Table 2). The main workflow of VirusDetect does not support multi-threading, but several time-consuming steps, including read mapping using BWA, removing redundant virus contigs, and comparing assembled contigs against the virus reference database and the host sequences using BLAST, can be run using multiple CPUs to increase the speed. Nonetheless, given the fact that VirusDetect can process large datasets on desktop computers with reasonable time, coupled with its high sensitivity in virus discovery, VirusDetect can serve as a tool for efficient identification of viruses from deep siRNA sequencing datasets.

2.3. Detection of a novel plant virus using VirusDetect based on sequence homology

We further evaluated the efficiency of VirusDetect using an sRNA dataset we generated from an unspecified weed species from Brazil, showing virus symptoms. The dataset contained a total of 710,854 high-quality cleaned sRNA reads. The sizes of these sRNA reads ranged predominantly from 21 to 24 nt (Fig. 3A). VirusDetect mapped a total of 16,175 reads (2.28%) to the non-redundant plant virus database. However, the reference-guided assembly did not yield any contigs from these mapped reads as these sRNA reads were sporadically aligned to the reference sequences. De novo assembly of these sRNAs without
Fig. 3. VirusDetect identified a novel potyvirus, Brazilian weed virus Y (BWVY), from a weed sRNA dataset. (A) Size distribution of total sRNAs and siRNAs derived
from BWVY. (B) siRNA distribution across the genome of BWVY in both positive (+) and negative (−) strands. (C) Alignments of virus contigs (blue lines) assembled from different
depths of sRNAs to the genome (black line) of BWVY identified in the weed sample.
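The sensitivity test summarized in panel (C) relies on random subsets of the reads at increasing depths. One way to draw such subsets reproducibly is sketched below. The nested design (each smaller subset is a prefix of a single shuffled order) and the fixed seed are assumptions made for illustration, since the text only states that reads were randomly selected; the function name is hypothetical.

```python
import random
from typing import Dict, List, Sequence


def subsample_reads(
    reads: Sequence[str], depths: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Return one random subset of reads per requested depth.

    Reads are shuffled once with a seeded generator, and each subset is a
    prefix of that shuffled order, so smaller subsets nest inside larger ones.
    """
    if max(depths) > len(reads):
        raise ValueError("requested depth exceeds the number of reads")
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    return {depth: shuffled[:depth] for depth in depths}
```

Nesting the subsets means that any change in coverage between two depths comes from added reads rather than from an unrelated random draw.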
subtraction of host sRNAs (since no host references were available) generated a total of 73 non-redundant contigs. Homology search of these contigs against the plant virus database identified one virus contig, which was the longest one of 9837 nt and took up 66.1% of total assembled bases (14,891 nt). The open reading frame (ORF) of this contig had a length of 9237 nt, which could be translated into a single polyprotein with 3078 amino acids. It was identified as a novel potyvirus by VirusDetect since the translated protein sequence of its ORF had significant and complete alignments (59.2% amino acid sequence identity) to the genome of Verbena virus Y (GenBank accession no. ACB69755), which belongs to the potyvirus genus. Mapping sRNA reads back to the complete genome of the newly identified virus (tentatively named Brazilian weed virus Y or BWVY) indicated that these virus-derived siRNAs were predominantly of 21 and 22 nt (Fig. 3A) and they covered the entire genome of BWVY at both positive and negative strands (Fig. 3B).

To validate the authenticity of BWVY, we performed Sanger sequencing through genome walking using overlapping RT-PCR products that covered the entire virus genome. A complete genome sequence of BWVY (9837 nt) was assembled from the Sanger sequences. Alignment of the genome sequence assembled from Sanger sequences and that assembled from sRNA sequences indicated that the two genomes are largely similar, with 99% sequence identity. A small percentage of sequence variations (~1%) between the two genomes generated by two different methods were due to the presence of single nucleotide polymorphisms scattered throughout the genome. The virus genome sequence generated by Sanger sequencing was assembled from a limited number of individual cloned molecules. On the other hand, the virus genome sequence generated by deep sRNA sequencing likely resulted from millions of individual virus molecules. Due to the existence of different virus haplotypes in a field population, such a low level of sequence variation (~1%) between the consensus sequences generated from the two different sequencing technologies was not unexpected, as reported in other studies (e.g., Di Giallonardo et al., 2014).

We further evaluated the sensitivity of VirusDetect in detecting the novel BWVY using randomly selected 15 K, 30 K, 60 K, 100 K, 150 K, 300 K, 450 K and 600 K sRNA reads from the dataset. VirusDetect could confidently detect the novel virus, BWVY, even with only 60 K sRNA reads, with 96.6% of the BWVY genome covered by the assembled contigs (Supplementary Table 4), although most of the assembled virus contigs were relatively short (Fig. 3C). However, increasing the depth improved the coverage of the BWVY genome by assembled contigs at low depths (< 20 X). Further increase of the sRNA sequencing depth would not improve the coverage very much but would generate much longer virus contigs (Fig. 3C and Supplementary Table 4).

2.4. Identification of novel viruses using VirusDetect based on siRNA profile

The efficiency of VirusDetect in identifying novel viruses based on siRNA profiles was evaluated using an sRNA dataset from the D. melanogaster ovary somatic sheet (OSS) cell line described in Robine
et al. (2009) (GenBank GEO accession no. GSE15378). First, using the homology-based approaches, VirusDetect identified all six known viruses that were reported in Wu et al. (2010), including the Drosophila X virus (DXV), Drosophila birnavirus (DBV), Drosophila C virus (DCV), and Nora virus, as well as the Flock house virus (FHV) and Drosophila A virus (DAV) which were highly similar to the American nodavirus (ANV; 93% and 95% nt identity for RNA2 and RNA1, respectively) and Drosophila tetravirus (DTrV; 98% nt identity) reported in Wu et al. (2010), respectively. VirusDetect was able to detect additional viruses in the cell line, including Bloomfield virus, Autographa californica nucleopolyhedrovirus (AcMNPV), and Spodoptera frugiperda rhabdovirus (Sf-rhabdovirus) (Supplementary Fig. 2).

Bloomfield virus is a new virus that was discovered recently (Webster et al., 2015). To evaluate the ability of VirusDetect to identify new viruses with no sequence similarity to known viruses, we ran VirusDetect on the OSS sRNA dataset using the reference virus database in which all sequences homologous to the segments of the Bloomfield virus were removed. As expected, homology-based approaches (BLASTN and BLASTX) failed to identify any segments of this virus. However, based on the siRNA profile (proportion of 21-nt and 22-nt siRNAs of at least 40%), VirusDetect was able to identify viral-like contigs corresponding to all the reported nine segments of the Bloomfield virus, with at least 80% of each segment covered by the identified viral contigs except the segment 5, which was covered by 57.5% (Fig. 4). For all the assembled contigs derived from the Bloomfield virus, only seven short contigs (53–79 nt) of segment 5 and one (41 bp) of segment 9 were not identified as viral-like contigs (Supplementary Table 5). In addition, VirusDetect identified several novel viral-like contigs that were not reported before, including one of 1034 nt. However, further confirmation of these newly identified viral-like contigs is required. It is worth noting that here we chose a proportion of 21-nt and 22-nt siRNAs of 40% as the cutoff because all the assembled long contigs (e.g., > 500 nt), which were highly likely to be of viral origin, had the proportion > 40% (41.7–68.9%) (Supplementary Table 5).

In summary, our results suggest that by combining homology-dependent and homology-independent strategies, VirusDetect is highly sensitive in identifying novel viruses.

2.5. Testing VirusDetect on other types of NGS data

Besides sRNA sequencing, several other strategies involving NGS have been used for virus detection such as those employing sequencing of polyA-enriched or rRNA-depleted RNAs and dsRNAs (Wu et al., 2015). Therefore, we tested the performance of VirusDetect on one of these types of NGS datasets.

We applied VirusDetect to a polyA enriched RNA-Seq dataset from Prunus domestica that included two samples infected with Plum pox virus (PPV) and two healthy samples (Rodamilans et al., 2014). The genome of P. mume (Zhang et al., 2012) was used for host read subtraction and HISAT (Kim et al., 2015) was included in VirusDetect to replace BWA for aligning RNA-Seq reads to the P. mume genome. Consistent with findings in Rodamilans et al. (2014), VirusDetect successfully assembled contigs covering 98% of PPV-Rec and PPV-D genomes from the two PPV-infected samples (Supplementary Fig. 3), while no virus was identified from the two healthy samples. Furthermore, VirusDetect was able to identify Cherry virus A in the infected samples that was not described in Rodamilans et al. (2014). The assembled contigs of Cherry virus A have high sequence identity to the reference virus genome (Supplementary Fig. 3), indicating that it is very likely that the identified virus is real. Our analysis supports that VirusDetect is also highly efficient and sensitive in virus discovery using RNA-Seq datasets.

3. Conclusion

We have developed and described an automated bioinformatics
[Fig. 4 panel labels (coverage in parentheses): Bloomfield virus segment 1 (81.2%); segment 6 (98.5%); segment 7 (79.9%).]
Fig. 4. Identification of the Bloomfield virus by VirusDetect using siRNA size profiles. Blue tracks represent reference virus genomes, and red tracks represent assembled
virus contigs.
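The size-profile criterion behind this figure (a contig is flagged as viral-like when 21-nt and 22-nt siRNAs make up at least 40% of the reads mapped to it) can be expressed in a few lines. The sketch assumes that the lengths of the reads mapped to a contig are already available as a list of integers; how VirusDetect itself tallies them is not described here, and the function names are illustrative.

```python
from collections import Counter
from typing import Iterable


def fraction_21_22(read_lengths: Iterable[int]) -> float:
    """Fraction of mapped reads that are 21 or 22 nt long (0.0 if no reads)."""
    counts = Counter(read_lengths)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts[21] + counts[22]) / total


def is_viral_like(read_lengths: Iterable[int], cutoff: float = 0.40) -> bool:
    """Apply the 40% cutoff on the combined 21-nt and 22-nt proportion."""
    return fraction_21_22(read_lengths) >= cutoff
```

Keeping the cutoff as a parameter makes it easy to revisit the 40% threshold for hosts whose siRNA size distributions differ from the datasets evaluated here.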
pipeline, VirusDetect, for virus discovery from sRNA sequencing datasets. Extensive validation revealed that VirusDetect is a sensitive and efficient tool for detection of known and novel viruses from plants, animals and other kingdoms. The high sensitivity and efficiency of VirusDetect in virus discovery are achieved by employing a combination of the following strategies: applying both reference-guided and de novo assemblies of virus-derived siRNAs, using a curated and classified virus reference database, and performing host sRNA subtraction and automated parameter optimization for de novo assemblies. In some cases, VirusDetect can assemble virus contigs that cover nearly the entire genomes of most identified viruses from relatively small-scale sRNA datasets. This, combined with the increasing throughput and decreasing cost of NGS technologies, represents a cost-effective approach for virus detection and discovery through sequencing sRNAs from a large number of multiplexed samples, and thus may facilitate metagenomic research of viral communities (viromes) on different species across geographic ranges. Furthermore, although VirusDetect is mainly designed for virus discovery using sRNA sequences, we demonstrate that VirusDetect can be used as a universal bioinformatics tool for virus discovery using different types of NGS datasets.

4. Materials and methods

4.1. Implementation of VirusDetect

The VirusDetect package is implemented in Perl. BWA (Li and Durbin, 2009) is employed to align siRNA reads to the reference virus sequences, host sequences or assembled virus contigs. For reference-guided assembly, SAMtools (Li et al., 2009) is used to process BWA alignments and generate per-position alignment information in pileup format, which is used to guide the construction of virus contigs. De novo assembly of viral siRNAs is performed using Velvet (Zerbino and Birney, 2008). VirusDetect uses the BLAST program (Altschul et al., 1990) to compare assembled contigs against virus reference nucleotide and protein sequence databases. The Perl module Bio::Graphics (Stajich et al., 2002) is used to generate the track images according to the BLAST result of virus contigs.

4.2. Curation and classification of GenBank virus sequence database

Virus sequences downloaded from GenBank were classified into eight different host kingdoms: vertebrate, invertebrate, plant, protozoa, algae, fungus, bacteria, and archaea, based on the virus taxonomy information provided by the International Committee on Taxonomy of Viruses (ICTV; http://www.ictvonline.org/) using a perl script, which is included in the VirusDetect package. Virus sequences that were not

USA) and the quality was checked by agarose gel electrophoresis. sRNAs of 20–30 nt were purified after excising the band from a 3.5% agarose gel. To demonstrate the efficiency of VirusDetect in identifying and assembling genomes of novel viruses from sRNA sequences, a leaf sample of an unspecified weed species with yellowing and mosaic-like symptoms was collected from a tomato field in Barbacena, Brazil in February 2013. Total RNA from this sample was purified using TRIzol reagents following the manufacturer's instructions (Invitrogen, USA). After quantification in a NanoDrop (Thermo Fisher Scientific, USA), RNA molecules, ranging from 18 to 28 nt, were excised from a polyacrylamide gel and extracted. sRNA libraries of the weed and potato samples were constructed following the protocol described in Chen et al. (2012) and sequenced on an Illumina HiSeq 2500 system with the 50-bp single-end mode.

4.4. sRNA read processing

Raw sRNA reads were first processed to identify the 3′ adaptor sequences using an in-house perl script, which is included in the VirusDetect package. Briefly, the adaptor sequences were identified in the sRNA reads if the first eleven nucleotides of the adaptor could match the sRNA reads with at most one mismatch. sRNA reads with no adaptor sequences identified were discarded and the remaining reads were then processed to trim the 3′ adaptor sequences. The processed reads that were of low quality (containing ambiguous bases) or shorter than 15 nt were excluded from downstream analysis. The cleaned sRNA reads were used for virus discovery using VirusDetect. The non-redundant (95%) plant virus database was used as the reference. The potato genome (Xu et al., 2011) was used as the reference for host sRNA subtraction for the potato sRNA datasets.

4.5. Standard virus indexing of potato

Indexing of the 16 potato accessions was performed by the virology service unit at the CIP following an ISO/IEC 17025 accredited procedure which includes the following tests: i) NASH for Potato spindle tuber viroid (PSTVd) and PVT from in vitro and greenhouse-grown plants, respectively, ii) DAS-ELISA from three replicates of greenhouse grown plants with antibodies for PVX, PLRV, PVS, APMMV, Potato virus Y (PVY), Potato yellowing virus (PYV), Arracacha virus B-Oca strain (AVB-O), and Andean potato mottle virus (APMoV), and iii) mechanical inoculation and symptom evaluation of eleven biological indicator plants: Nicotiana benthamiana, N. clevelandii×N. bigelovii, N. debneyii, N. glutinosa, N. tabacum "White Burley", Chenopodium murale, C. quinoa, Datura stramonium, Gomphrena globosa, Solanum lycopersicum "Rutgers" and Physalis
classified in ICTV were then manually classified according to the floridana. In addition graft inoculation of D. stramonium was also
descriptions of the virus sequences and their high sequence similarity performed.
to those classified viruses. Virus sequences classified into each of the
eight host kingdoms were further processed to remove redundant 4.6. Validation of known and novel plant viruses detected by
sequences with sequence identity cutoff of 95%, 97% and 100%, VirusDetect
respectively, using cd-hit (Li and Godzik, 2006). The corresponding
protein sequences of each virus were also extracted. All the classified Potato viruses were confirmed by RT-PCR using virus specific
virus nucleotide and protein sequences are available at the VirusDetect primers for PVX, PVS, PVT, PLRV, and APMMV (Supplementary
website. Table 6). Same total RNA used for sRNA library construction was
used for PCR. For cDNA synthesis 1 μg of total RNA and 250 ng/μl of
4.3. Plant material, RNA extraction and sRNA library construction random hexamer primers were used in a total reaction mix of 20 μl
and sequencing using 200 U of M-MLV reverse transcriptase (Invitrogen). After
incubation at 37 °C for 50 min, the reaction was diluted 1:10 with
For potato sRNA sequencing and standard virus indexing, two nuclease free water and 5 μl was used for PCR. The PCR was performed
copies of 16 in vitro accessions were selected from the germplasm using GoTaq® DNA Polymerase (Promega) in a volume of 20 μl with a
collection at the CIP (Table 1). One copy of in vitro plants was final primer concentration of 0.5 µM each. The PCR protocol consisted
processed for standard virus indexing and the other set was processed of 5 min initial denaturation at 94 °C, followed by 35 cycles of 94 °C for
for sRNA sequencing. Total RNA was extracted from 1 g of fresh leaf 30 s, 50–60 °C for 30 s and 72 °C between 30s to 1 min, with final
tissue using the CTAB method. The quantity of the total RNA was extension at 72 °C for 10 min. PCR products were visualized on 1%
determined using a NanoDrop analyzer (Thermo Fisher Scientific, agarose gels stained with GelRed (Biotium). PCR fragments amplified
136
for each isolate were purified using the High Pure PCR product purification kit (Roche), then cloned into the pGEM-T Easy vector (Promega) following standard procedures and transformed into Escherichia coli (DH 5α). Samples were sent to Macrogen (Korea) for Sanger sequencing.

To validate the genome sequence of the novel virus (BWVY) assembled by VirusDetect, a series of 16 primer pairs (Supplementary Table 7) were designed to amplify overlapping fragments covering the entire virus genome. Amplicons generated using the above primer pairs in reverse transcription-PCR (RT-PCR) were sequenced directly or cloned using a TOPO TA cloning kit (Invitrogen) and sequenced using the Sanger technology by Functional Biosciences, Inc. (Madison, WI).

4.7. RNA-Seq data processing

The raw RNA-Seq dataset of P. domestica described in Rodamilans et al. (2014) was downloaded from the NCBI SRA database under accession SRP041925. Adaptor and low-quality sequences were removed using Trimmomatic (Bolger et al., 2014). The remaining high-quality reads were aligned to the rRNA database (Quast et al., 2013) using bowtie (Langmead et al., 2009) allowing up to 3 mismatches. The unaligned reads were used for virus discovery using VirusDetect.

Competing interests

The authors declare that they have no competing interests.

Author contributions

Z.F., Y.Z., and S.G. designed the project. Y.Z. and S.G. implemented the program. Y.Z. performed data analyses. C.P., R.L. and K.L. constructed the weed sRNA library and performed experiments to validate the novel weed virus. D.G., S.F., M.G. and J.K. contributed to the design and evaluation of the program. J.K. and S.F. designed, M.G. and D.G. constructed sRNA libraries, and S.F. and M.G. performed the validation experiment with potato samples. Z.F., Y.Z., K.L. and J.K. wrote the manuscript. All authors read and approved the manuscript.

Availability of supporting data

The genome sequence of BWVY is available in NCBI GenBank under accession no. KU685505. The nearly complete genome sequences of viruses assembled from the potato samples are available in NCBI GenBank under the following accession nos.: KU58645, KU586452, KU586454, KU586455, KU586456, KU586451, and KU586453. The sRNA read data from the Brazil weed and the 16 potato samples are available in NCBI Sequence Read Archive under the BioProject ID PRJNA311127.

Acknowledgements

The collection of the weed sample in Brazil by HM.CLAUSE is greatly appreciated. We thank Andrea Gilliard and Alan Wilder for their excellent technical assistance. This work was supported by grants from the National Science Foundation (IOS-1110080) to Z.F. and J.K., the USDA SCRI (2012-01507-229756) to Z.F. and K.L., the USDA SCRI (2010-600-25320) to K.L., and the CGIAR research program on roots, tubers and bananas to J.K.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.virol.2016.10.017.

References

Aguiar, E.R., Olmo, R.P., Paro, S., Ferreira, F.V., de Faria, I.J., Todjro, Y.M., Lobo, F.P., Kroon, E.G., Meignin, C., Gatherer, D., Imler, J.L., Marques, J.T., 2015. Sequence-independent characterization of viruses based on the pattern of viral small RNAs produced by the host. Nucleic Acids Res. 43, 6191–6206.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410.
de Andrade, R.R., Vaslin, M.F., 2014. SearchSmallRNA: a graphical interface tool for the assemblage of viral genomes using small RNA libraries data. Virol. J. 11, 45.
Antoniewski, C., 2011. Visitor, an informatic pipeline for analysis of viral siRNA sequencing datasets. Methods Mol. Biol. 721, 123–142.
Bolger, A.M., Lohse, M., Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120.
Chen, Y., Yao, H., Thompson, E.J., Tannir, N.M., Weinstein, J.N., Su, X., 2013. VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics 29, 266–267.
Chen, Y.-R., Zheng, Y., Liu, B., Zhong, S., Giovannoni, J., Fei, Z., 2012. A cost-effective method for Illumina small RNA-Seq library preparation using T4 RNA ligase 1 adenylated adapters. Plant Methods 8, 4.
Di Giallonardo, F., Topfer, A., Rey, M., Prabhakaran, S., Duport, Y., Leemann, C., Schmutz, S., Campbell, N.K., Joos, B., Lecca, M.R., Patrignani, A., Daumer, M., Beisel, C., Rusert, P., Trkola, A., Gunthard, H.F., Roth, V., Beerenwinkel, N., Metzner, K.J., 2014. Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations. Nucleic Acids Res. 42, e115.
Ding, S.W., 2010. RNA-based antiviral immunity. Nat. Rev. Immunol. 10, 632–644.
Ho, T., Tzanetakis, I.E., 2014. Development of a virus detection and discovery pipeline using next generation sequencing. Virology 471–473, 54–60.
Hunt, M., Gall, A., Ong, S.H., Brener, J., Ferns, B., Goulder, P., Nastouli, E., Keane, J.A., Kellam, P., Otto, T.D., 2015. IVA: accurate de novo assembly of RNA virus genomes. Bioinformatics 31, 2374–2376.
Isakov, O., Modai, S., Shomron, N., 2011. Pathogen detection using short-RNA deep sequencing subtraction and assembly. Bioinformatics 27, 2027–2030.
Kim, D., Langmead, B., Salzberg, S.L., 2015. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360.
Kostic, A.D., Ojesina, A.I., Pedamallu, C.S., Jung, J., Verhaak, R.G., Getz, G., Meyerson, M., 2011. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396.
Kreuze, J., 2014. siRNA deep sequencing and assembly: piecing together viral infections. In: Gullino, M.L., Bonants, P.J.M. (Eds.), Detection and Diagnostics of Plant Pathogens. Springer, Dordrecht, 21–38.
Kreuze, J.F., Perez, A., Untiveros, M., Quispe, D., Fuentes, S., Barker, I., Simon, R., 2009. Complete viral genome sequence and discovery of novel viruses by deep sequencing of small RNAs: a generic method for diagnosis, discovery and sequencing of viruses. Virology 388, 1–7.
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L., 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25.
Li, H., Durbin, R., 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 2009. The sequence alignment/map format and SAM tools. Bioinformatics 25, 2078–2079.
Li, R., Gao, S., Hernandez, A.G., Wechter, W.P., Fei, Z., Ling, K.-S., 2012. Deep sequencing of small RNAs in tomato for virus and viroid identification and strain differentiation. PLoS One 7, e37127.
Li, W., Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659.
Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., Peplies, J., Glöckner, F.O., 2013. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596.
Rodamilans, B., San León, D., Mühlberger, L., Candresse, T., Neumüller, M., Oliveros, J.C., García, J.A., 2014. Transcriptomic analysis of Prunus domestica undergoing hypersensitive response to plum pox virus infection. PLoS One 9, e100477.
Robine, N., Lau, N.C., Balla, S., Jin, Z., Okamura, K., Kuramochi-Miyagawa, S., Blower, M.D., Lai, E.C., 2009. A broadly conserved pathway generates 3'UTR-directed primary piRNAs. Curr. Biol. 19, 2066–2076.
Seguin, J., Otten, P., Baerlocher, L., Farinelli, L., Pooggin, M.M., 2014. MISIS: a bioinformatics tool to view and analyze maps of small RNAs derived from viruses and genomic loci generating multiple small RNAs. J. Virol. Methods 195, 120–122.
Stajich, J.E., Block, D., Boulez, K., Brenner, S.E., Chervitz, S.A., Dagdigian, C., Fuellen, G., Gilbert, J.G.R., Korf, I., Lapp, H., et al., 2002. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12, 1611–1618.
Vodovar, N., Goic, B., Blanc, H., Saleh, M.C., 2011. In silico reconstruction of viral genomes from small RNAs improves virus-derived small interfering RNA profiling. J. Virol. 85, 11016–11021.
Wang, Q., Jia, P., Zhao, Z., 2013. VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS One 8, e64465.
Watson, M., Schnettler, E., Kohl, A., 2013. ViRome: an R package for the visualization and analysis of viral small RNA sequence datasets. Bioinformatics 29, 1902–1903.
Webster, C.L., Waldron, F.M., Robertson, S., Crowson, D., Ferrari, G., Quintana, J.F., Brouqui, J.-M., Bayne, E.H., Longdon, B., Buck, A.H., et al., 2015. The discovery, distribution, and evolution of viruses associated with Drosophila melanogaster. PLoS Biol. 13, e1002210.
Wu, Q., Ding, S.-W., Zhang, Y., Zhu, S., 2015. Identification of viruses and viroids by next-generation sequencing and homology-dependent and homology-independent algorithms. Annu. Rev. Phytopathol. 53, 425–444.
Wu, Q., Luo, Y., Lu, R., Lau, N., Lai, E.C., Li, W.-X., Ding, S.-W., 2010. Virus discovery by deep sequencing and assembly of Virus-derived small silencing RNAs. Proc. Natl. Acad. Sci. USA 107, 1606–1611.
Wu, Q., Wang, Y., Cao, M., Pantaleo, V., Burgyan, J., Li, W.-X., Ding, S.-W., 2012. Homology-independent discovery of replicating pathogenic circular RNAs by deep sequencing and a new computational algorithm. Proc. Natl. Acad. Sci. USA 109, 3938–3943.
Xu, X., Pan, S., Cheng, S., Zhang, B., Mu, D., Ni, P., Zhang, G., Yang, S., Li, R., Wang, J., et al., 2011. Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195.
Zerbino, D.R., 2010. Using the Velvet de novo assembler for short-read sequencing technologies. Curr. Protoc. Bioinform., (Chapter 11:Unit 11.5).
Zerbino, D.R., Birney, E., 2008. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829.
Zhang, Q., Chen, W., Sun, L., Zhao, F., Huang, B., Yang, W., Tao, Y., Wang, J., Yuan, Z., Fan, G., et al., 2012. The genome of Prunus mume. Nat. Commun. 3, 1318.
Zhang, Z., Qi, S., Tang, N., Zhang, X., Chen, S., Zhu, P., Ma, L., Cheng, J., Xu, Y., Lu, M., et al., 2014. Discovery of replicating circular RNAs by RNA-Seq and computational algorithms. PLoS Pathog. 10, e1004553.
Zhao, G., Krishnamurthy, S., Cai, Z., Popov, V.L., Travassos da Rosa, A.P., Guzman, H., Cao, S., Virgin, H.W., Tesh, R.B., Wang, D., 2013. Identification of novel viruses using VirusHunter–an automated data analysis pipeline. PLoS One 8, e78470.
Zheng, Y., Zhao, L., Gao, J., Fei, Z., 2011. iAssembler: a package for de novo assembly of Roche-454/Sanger transcriptome sequences. BMC Bioinform. 12, 453.
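
Illustrative sketch of the adaptor rule in Section 4.4. The read-cleaning criteria given there (the first eleven adaptor nucleotides matched with at most one mismatch, reads without an adaptor discarded, reads with ambiguous bases or shorter than 15 nt removed) can be written out as a short Python sketch. This is not the Perl script shipped with VirusDetect. The function names, the choice of the leftmost match, the requirement that the whole 11-nt seed lies within the read, and the adaptor sequence used in the example are all assumptions made for illustration.

```python
from typing import Optional

SEED_LEN = 11      # leading adaptor bases used for matching (Section 4.4)
MAX_MISMATCH = 1   # mismatches tolerated within the seed (Section 4.4)
MIN_LEN = 15       # trimmed reads shorter than this are dropped (Section 4.4)

# Placeholder adaptor for illustration only; not necessarily the one used in the study.
EXAMPLE_ADAPTOR = "TGGAATTCTCGGGTGCCAAGG"


def find_adaptor(read: str, adaptor: str) -> Optional[int]:
    """Return the leftmost position where the adaptor seed matches, else None.

    Assumption: the full 11-nt seed must fit inside the read.
    """
    seed = adaptor[:SEED_LEN]
    for start in range(len(read) - len(seed) + 1):
        window = read[start:start + len(seed)]
        mismatches = sum(1 for a, b in zip(window, seed) if a != b)
        if mismatches <= MAX_MISMATCH:
            return start
    return None


def clean_read(read: str, adaptor: str) -> Optional[str]:
    """Trim the 3' adaptor; return None if the read should be discarded."""
    pos = find_adaptor(read, adaptor)
    if pos is None:
        return None                      # no adaptor identified
    trimmed = read[:pos]
    if len(trimmed) < MIN_LEN:
        return None                      # too short after trimming
    if set(trimmed) - set("ACGT"):
        return None                      # contains ambiguous bases such as N
    return trimmed


if __name__ == "__main__":
    raw = "ACGTACGTACGTACGTACGTA" + EXAMPLE_ADAPTOR[:15]
    print(clean_read(raw, EXAMPLE_ADAPTOR))
```

A seed-based scan of this kind tolerates a single sequencing error in the adaptor while keeping the search simple. A production script would also need to handle FASTQ parsing and adaptors truncated at the read end, which this sketch leaves out.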