
WgSim Operation Installation and Reference Guide

Table of Contents
I. Introduction
II. History
III. Installation Guide
    A. Installation
    B. Screenshot for steps 1 and 2
    C. Screenshot for steps 3 and 4
IV. Simulation and Evaluation
    A. Running the Simulation
    B. Options
    C. Example
    D. For Genome Evaluation
    E. Arguments
V. Table of Results

Introduction
Wgsim is a small tool for simulating sequence reads from a reference genome.
It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL)
polymorphisms, and simulate reads with uniform substitution sequencing errors.
It does not generate INDEL sequencing errors, but this can be partly
compensated by simulating INDEL polymorphisms.
Wgsim outputs the simulated polymorphisms, and writes the true read coordinates
as well as the number of polymorphisms and sequencing errors in read names.
One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl
that comes with the package.

History
Wgsim was modified from MAQ's read simulator by dropping dependencies on other
source code in the MAQ package and incorporating patches from Colin Hercus
that allow simulating INDELs longer than 1 bp. Wgsim was originally released
in the SAMtools software package and was forked out in 2011 as a standalone
project. A few improvements were also added in the process.

Installation guide
WgSim is compiled with the GCC compiler, which is typically installed by default on Mac and Linux
systems. Instructions to install GCC can be found here if it is not present on your system.
Git is needed to clone (copy) the code repository from the online source. It is also typically
installed by default on Mac and Linux systems. Instructions to install Git can be found here if it is
not present on your system.
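
To confirm that both prerequisites are available before starting, a small shell check along the following lines may help. This is a minimal sketch and not part of WgSim: the helper name check_tool is hypothetical, and it relies only on the standard "command -v" lookup.

```shell
#!/bin/sh
# Report whether a command-line tool can be found on the PATH.
# Returns 0 when the tool is found and 1 when it is missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# The installation described in this guide needs a C compiler and git.
for tool in gcc git; do
    check_tool "$tool" || true
done
```

Any tool reported as missing should be installed before moving on to the installation steps.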

Installation:
Step 1:
Open terminal in source directory
cd ~/src
Step 2:
Download source code using git clone
git clone https://github.com/lh3/wgsim.git
Step 3:
Move to WgSim folder
cd wgsim
Step 4:
Compile program with GCC compiler
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm
Step 5:
Link program to bin folder
ln -s ~/src/wgsim/wgsim ~/bin/wgsim
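
Step 5 only makes wgsim callable by name if ~/bin exists and is listed in the PATH variable (if the folder is missing, mkdir -p ~/bin creates it). The sketch below checks the second condition. It is an illustration under those assumptions; the helper name path_contains is hypothetical.

```shell
#!/bin/sh
# Return 0 if directory $1 appears in the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

bin_dir="$HOME/bin"
if path_contains "$bin_dir" "$PATH"; then
    echo "$bin_dir is on PATH"
else
    # Print a suggestion only; nothing is modified by this script.
    echo "$bin_dir is not on PATH; it can be added in your shell profile with:"
    echo "export PATH=\"$bin_dir:\$PATH\""
fi
```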

Screenshot for Steps 1 and 2

Screenshot for Steps 3 and 4

Simulation and Evaluation


Running the simulation:
Usage:

wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options:
Brackets contain default values
-e FLOAT    base error rate [0.020]
-d INT      outer distance between the two ends [500]
-s INT      standard deviation [50]
-N INT      number of read pairs [1000000]
-1 INT      length of the first read [70]
-2 INT      length of the second read [70]
-r FLOAT    rate of mutations [0.0010]
-R FLOAT    fraction of indels [0.15]
-X FLOAT    probability an indel is extended [0.30]
-S INT      seed for random generator [-1]
-A FLOAT    disregard if the fraction of ambiguous bases higher than FLOAT [0.05]
-h          haplotype mode


<in.ref.fa>:
Reference genome file in fasta format.
<out.read1.fq>, <out.read2.fq>:
Names of the output files for the first and second read of each pair. Reads are written in FASTQ format.
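
Because the output is FASTQ, one simple sanity check after a run is to count the records in an output file and compare the count with the value given to -N. The sketch below assumes four lines per record with no line wrapping. The helper name count_fastq_reads and the hand-written demo file are hypothetical and serve only as an illustration.

```shell
#!/bin/sh
# Count reads in a FASTQ file, assuming four lines per record
# (header, sequence, '+', qualities) and no line wrapping.
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Demo on a tiny hand-written FASTQ file containing two records.
demo=$(mktemp)
cat > "$demo" <<'EOF'
@read_1/1
ACGT
+
IIII
@read_2/1
TTGA
+
IIII
EOF
count_fastq_reads "$demo"   # prints 2
rm -f "$demo"
```

On real output, the same function would be called with the name chosen for <out.read1.fq>.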

Example:

The command line for simulation:

wgsim -Nxxx -1yyy -d0 -S11 -e0 -rzzz <in.ref.fa> read1.fq read2.fq

xxx     number of read pairs
yyy     read length
-d0     outer distance between the two read ends
-S11    seed for the random number generator
-e0     base error rate
zzz     rate of mutations
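
To cover several read lengths and rates in one go, the example command can be generated in a loop. The sketch below only prints the commands so they can be reviewed before anything is run. It assumes that the 2%, 5% and 10% settings are passed through -r, as in the example above. The function name print_sweep, the value 1000 for -N, and all file names are placeholders chosen for illustration.

```shell
#!/bin/sh
# Print (do not run) one wgsim command per read length / rate combination.
# $1 is the reference FASTA file; output names are derived from the settings.
print_sweep() {
    ref=$1
    for len in 100 200 500 1000 10000; do
        for rate in 0.02 0.05 0.10; do
            prefix="len${len}_r${rate}"
            echo "wgsim -N 1000 -1 $len -d 0 -S 11 -e 0 -r $rate $ref ${prefix}_1.fq ${prefix}_2.fq"
        done
    done
}

print_sweep ref.fa
```

Piping the printed lines to sh would execute them, once the reference file name and the number of read pairs have been adjusted.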

For genome evaluation:


Usage:

wgsim_eval.pl <command> <arguments>


Commands:
alneval    evaluate alignment in the SAM format
vareval    evaluate variant calls in the pileup format
unique     keep the top scoring hit in SAM
uniqcmp    compare two alignments without multiple hits

Arguments:
This function accepts only a SAM file as its argument. SAM files are produced by aligners such as
BWA (Burrows-Wheeler Aligner), found here.
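
Put together, an evaluation run might look like the sketch below: index the reference, align the simulated reads, and pass the resulting SAM file to alneval. Using bwa index and bwa mem is only one possible way to produce the SAM file. The function name evaluate_alignment, the DRY_RUN switch, and the file names are hypothetical. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch: align simulated reads with BWA, then evaluate the alignment.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

evaluate_alignment() {
    ref=$1
    read1=$2
    read2=$3
    sam=$4
    if [ "$DRY_RUN" = 1 ]; then
        echo "bwa index $ref"
        echo "bwa mem $ref $read1 $read2 > $sam"
        echo "wgsim_eval.pl alneval $sam"
    else
        # Stop at the first failing step.
        bwa index "$ref" &&
        bwa mem "$ref" "$read1" "$read2" > "$sam" &&
        wgsim_eval.pl alneval "$sam"
    fi
}

evaluate_alignment ref.fa read1.fq read2.fq aln.sam
```

Setting DRY_RUN=0 in the environment runs the three steps for real, provided bwa and wgsim_eval.pl are on the PATH.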

Table comparing read length and percent error of simulated genome. Results generated with genome evaluation function.

Program  Metrics  100bp                   200bp                   500bp                   1,000bp                 10,000bp
                  2%      5%      10%     2%      5%      10%     2%      5%      10%     2%      5%      10%     2%      5%      10%
BWA-SW   CPU      249.00  198.00  136.00  325.00  262.00  163.00  332.00  243.00  232.00  320.00  235.00  215.00  235.00  197.00  189.00
         Q20%     85.10   63.60   21.40   93.70   88.90   53.50   96.40   95.70   89.20   96.60   96.20   95.10   97.70   98.30   97.70
         err%     0.01    0.06    0.20    0.00    0.01    0.14    0.00    0.01    0.01    0.00    0.00    0.01    0.00    0.00    0.00
         one%     94.60   77.40   35.70   97.50   95.10   67.60   98.60   98.50   93.40   99.00   98.90   98.30   99.70   99.80   99.70

AGILE (nine values per metric, listed in source order; the read lengths they belong to are not recoverable from the source layout)
Metrics  2%        5%        10%       2%        5%        10%       2%        5%        10%
CPU      330.00    352.00    607.00    381.00    480.00    919.00    302.00    484.00    1,060.00
Q20%     98.60     98.40     98.40     98.40     98.40     98.60     98.20     98.60     99.30
err%     0.66      0.69      2.31      0.34      0.40      0.70      0.10      0.00      0.20
one%     100.00    99.40     0.00      100.00    100.00    100.00    100.00    100.00    100.00
