
VCFDataPy: A software tool to analyze human genome variation data

By Siddharth Krishnakumar
Thomas Jefferson High School For Science and Technology, Alexandria, Virginia

ABSTRACT

The variant call format (VCF) is a popular standardized text file format for storing genetic data such as SNPs, insertions, deletions and structural variants. The format was developed by the 1000 Genomes Project. Tools have been developed to work on VCF files for genetic analyses such as population genetics analysis and variant prediction. Despite such tools, tools for other genetic analyses are lacking for the VCF format. VCFDataPy is a software tool that performs several such analyses: Identity By State (IBS), inheritance patterns in trios (mother, father, and child), parent of origin of a chromosomal duplication, copy number variation, and complete visualization of chromosomes in VCF files. The existing tools for analyzing IBS and inheritance patterns and for visualizing chromosomes all work on the outdated SNP array file format.

INTRODUCTION

The variant call format (VCF) is a common text-based genetic file format that lists variants in genomes. It includes sequencing data from one or more individuals. Each row corresponds to a genetic variant and includes its chromosome, position, the nature of the variant, and annotation such as read depth. A VCF file has from thousands to millions of rows. There are relatively few genetic visualization tools for this file format. Additionally, each genetic tool typically analyzes just one type of genetic abnormality, so it is very time consuming to get a full picture of an individual’s genome. Therefore, a genetic visualization tool that works on the VCF format and covers many types of chromosomal analysis (duplications, deletions, genetic relatedness, etc.) is needed.

VCF GENOTYPES

Within the format section, there can be multiple individuals, each with several genotype fields (GT, AD, DP). A format line could look like:

FATHER_ID               MOTHER_ID               SON_ID
GT:AD:DP 0|1:41|40:81   GT:AD:DP 0|1:41|40:81   GT:AD:DP 0|1:41|40:81

The GT for the father in this case is 0|1 and his allelic depth is 41|40. The GT field represents the genotype of the organism: 0 represents a copy of the reference allele and 1 represents a copy of the alternate allele at the locus given by POS. The AD field represents the allelic depth, written as (no. of REF allele reads | no. of ALT allele reads). The father has one copy of 0 and one copy of 1, so he should have nearly equal numbers of reads of the ref and alt alleles. A son with two copies of 0 (GT 0|0) would instead have reads of the reference allele only. The DP field generally equals the ref reads plus the alt reads (here 41 + 40 = 81), but can sometimes be more than the sum because it includes all reads in the calculation rather than just confident reads.

RESULTS (CONTD.)

I analyzed one dozen families, confirming chromosomal abnormalities and expected relationships from cases that had been independently validated. In one family, the software revealed a deletion (Figure 1). The father-child and mother-child relationships had very little IBS=0 until the end of the maternal chromosome. The inheritance pattern graph for Figure 1 shows mainly biparental inheritance (BPI) and no-calls until the end of the chromosome, where it also shows Mendelian inconsistencies and paternal UPD. We also output the chromosome and position of any non-BPI inheritance variants into a file. The father-child IBS2* plot has an IBS2* fraction close to 1, and the mother-child IBS2* plot for Figure 1 has an IBS2* fraction close to 1 until the very end of the chromosome. The read depth plots in Figure 1 were similar in the mother and the father, but the child has a lower read depth close to the end of the chromosome. The allelic depth of the mother’s copy drops at the end, but the allelic depth of the father’s copy is constant throughout. From these results, we can identify a deletion of the mother’s copy close to the end of the chromosome. For Figure 2, the IBS plots and the inheritance plots were normal, but we can see a read depth elevation in the child and an allelic depth elevation of the mother’s copy, which indicates a duplication of the mother’s copy. We had independently observed this duplication using other technologies.

METHODOLOGY

VCFDataPy is a software program written in the Python programming language. It extracts certain fields from the VCF for all individuals, including chromosome, position, genotypes, read depth, and allelic depth, and creates a table. VCFDataPy provides a visualization of identity-by-state (IBS) between pairs of individuals in a mother/father/child trio. The visualization of IBS states between any two individuals identifies related individuals, such as parent-child pairs and distantly related individuals. IBS is calculated as the number of alleles the two individuals have in common. VCFDataPy also uses the calculated IBS values to find the type and origin of inheritance patterns such as biparental inheritance, uniparental disomy, and Mendelian inconsistency. IBS and inheritance analyses can help identify deletions and single nucleotide variants. VCFDataPy plots the IBS2* fraction, a metric to evaluate relatedness between two individuals. The formula is IBS2* / (IBS2* + IBS0), yielding a ratio close to 1 for parent-child relationships and 2/3 for unrelated individuals (such as mother-father). VCFDataPy plots the read depth of the father, mother, and child to analyze copy number variations between the individuals, as read depth is a proxy for copy number. If the read depth at a certain place in the chromosome is different for the child compared to the parents, then there is a chance of a copy number variation (duplication or deletion) in the child. VCFDataPy also plots the allelic depth of the father’s copy passed down to the child and of the mother’s copy passed down to the child. We calculated p-values over a sliding window to determine whether chromosomal abnormalities in the allelic depth and IBS plots were significant or just due to chance.
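The IBS counting and the IBS2* fraction described in the methodology can be sketched as follows. This is a minimal illustration in Python 3, not VCFDataPy's own code (which was written for Python 2.7). The function names are hypothetical, IBS2* is assumed to count sites where both individuals are heterozygous (this poster does not define it), no-calls are coded as 3 following the IBS=3 convention used in the result plots, and windows are taken as consecutive blocks rather than a true sliding window.

```python
NO_CALL = 3  # assumption: mirrors the poster's use of IBS=3 as a proxy for a no-call


def alleles(gt):
    """Split a GT string such as '0|1' or '0/1' into a sorted allele tuple (None for a no-call)."""
    parts = gt.replace("|", "/").split("/")
    if len(parts) != 2 or "." in parts:
        return None
    return tuple(sorted(parts))


def ibs(gt_a, gt_b):
    """Count the alleles (0, 1 or 2) two genotypes share; return NO_CALL if either is missing."""
    a, b = alleles(gt_a), alleles(gt_b)
    if a is None or b is None:
        return NO_CALL
    remaining = list(b)
    shared = 0
    for allele in a:
        if allele in remaining:
            remaining.remove(allele)  # each allele of b can be matched only once
            shared += 1
    return shared


def is_het(gt):
    """True when the genotype is a called heterozygote."""
    a = alleles(gt)
    return a is not None and a[0] != a[1]


def ibs2_star_fraction(gts_a, gts_b, window_size):
    """Return IBS2* / (IBS2* + IBS0) for each consecutive window of sites."""
    fractions = []
    for start in range(0, len(gts_a), window_size):
        ibs2_star = ibs0 = 0
        pairs = zip(gts_a[start:start + window_size], gts_b[start:start + window_size])
        for ga, gb in pairs:
            shared = ibs(ga, gb)
            if shared == 0:
                ibs0 += 1
            elif shared == 2 and is_het(ga) and is_het(gb):
                ibs2_star += 1  # assumed definition: both individuals heterozygous
        total = ibs2_star + ibs0
        fractions.append(ibs2_star / total if total else None)
    return fractions
```

For a parent-child pair, IBS0 sites should be rare, so the fraction stays near 1; a run of windows where it falls well below 1 is the kind of signal the poster associates with a deletion.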

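Because read depth is treated as a proxy for copy number, one simple way to compare the child with the parents is a per-window depth ratio. The sketch below is an assumption-laden illustration in Python 3, not the program's actual method: the function names are hypothetical, windows are consecutive blocks rather than sliding, and no significance test or coverage normalization is applied.

```python
def window_means(depths, window_size):
    """Mean read depth in consecutive, non-overlapping windows of sites."""
    means = []
    for start in range(0, len(depths), window_size):
        window = depths[start:start + window_size]
        means.append(sum(window) / len(window))
    return means


def child_to_parent_ratio(child_dp, father_dp, mother_dp, window_size):
    """Ratio of the child's windowed mean depth to the average of the parents' windowed means."""
    child = window_means(child_dp, window_size)
    father = window_means(father_dp, window_size)
    mother = window_means(mother_dp, window_size)
    ratios = []
    for c, f, m in zip(child, father, mother):
        parent_mean = (f + m) / 2
        # None marks windows where the parents have no coverage at all
        ratios.append(c / parent_mean if parent_mean else None)
    return ratios
```

With equal overall coverage in all three samples, a ratio near 1 suggests no copy number difference, a ratio near 0.5 is what a single-copy deletion in the child would be expected to produce, and a ratio near 1.5 is what a single-copy duplication would be expected to produce.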
VCF FORMAT

A VCF file consists of a header section and a data section. The header contains an arbitrary number of meta-information lines, each starting with the characters ‘##’, and a TAB delimited field definition line, starting with a single ‘#’ character. The meta-information header lines provide a standardized description of tags and annotations used in the data section. The use of meta-information allows the information stored within a VCF file to be tailored to the dataset in question. It can also be used to provide information about the means of file creation, date of creation, version of the reference sequence, software used and any other information relevant to the history of the file. The field definition line names eight mandatory columns, corresponding to data columns representing the chromosome (CHROM), a 1-based position of the start of the variant (POS), unique identifiers of the variant (ID), the reference allele (REF), a comma separated list of alternate non-reference alleles (ALT), a phred-scaled quality score (QUAL), site filtering information (FILTER) and a semicolon separated list of additional, user extensible annotation (INFO). In addition, if samples are present in the file, the mandatory header columns are followed by a FORMAT column and an arbitrary number of sample IDs that define the samples included in the VCF file. The FORMAT column is used to define the information contained within each subsequent genotype column, which consists of a colon separated list of fields. For example, the FORMAT field GT:GQ:DP indicates that the subsequent entries contain information regarding the genotype, genotype quality and read depth for each sample. All data lines are TAB delimited and the number of fields in each data line must match the number of fields in the header line. It is strongly recommended that all annotation tags used are declared in the VCF header section.

PROGRAM SPECIFICS

VCFDataPy accepts a full-genome VCF file and a trio information file (ID names of the mother, father, and child). It outputs an image file with all the plots, a SNP array file, files with the p-values over a sliding window, and a file with all the non-BPI and non-no-call positions for each chromosome. Only the VCF file and the trio info file have to exist beforehand. The code was written in Python 2.7. The VCF file should be unzipped through zcat or gunzip. Optional arguments include the window size for the IBS2* fraction, the window size for the t-test, and options to specify whether the read depth and allelic depth are present in the VCF. If the read depth and allelic depth are not present, then fewer plots will be produced. The program is run on the command line. WinSCP should be used to transfer the files to the local directory for viewing.

A typical command is shown below:
VCFinputFile.vcf trioinfo.txt SNPoutputfile.txt inheritanceAbnormalityfile.txt imageoutputfile.txt ttestcnvoutput.txt ttestsnvoutput.txt --ttestwindow 1000

Since all of the files are made for each chromosome, the name of the chromosome is appended to each output file name given on the command line. The chromosome 1 image output file would be imageoutputfile1.txt.

If the VCF includes allelic depth but not read depth:
VCFinputFile.vcf trioinfo.txt SNPoutputfile.txt inheritanceAbnormalityfile.txt imageoutputfile.txt ttestcnvoutput.txt ttestsnvoutput.txt --ttestwindow 1000 --noDP

If the VCF does not include allelic depth but includes read depth:
VCFinputFile.vcf trioinfo.txt SNPoutputfile.txt inheritanceAbnormalityfile.txt imageoutputfile.txt ttestcnvoutput.txt ttestsnvoutput.txt --ttestwindow 1000 --noAD

The total list of optional arguments:
--ttestwindow is the window size for the CNV t-test.
--ttestsnv is the window size for the SNV t-test.
--windowSize is the window size for the IBS2* fraction calculation.
--noDP to specify that DP is not present in the file.
--noAD to specify that AD is not present in the file.
--minDepth to specify the minimum depth to include as a non-no-call.
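The command-line interface listed above could be declared with Python's standard argparse module roughly as follows. This is a hypothetical sketch in Python 3 rather than the program's real argument handling: the option names come from the list above, but the positional argument names, the integer types, and the absence of defaults are all assumptions, since the poster does not state them.

```python
import argparse


def build_parser():
    """Build a parser mirroring the positional files and optional flags listed above."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the VCFDataPy command line")
    # Positional arguments, in the order of the typical command (names are assumptions).
    for name in ("vcf_input", "trio_info", "snp_output", "inheritance_output",
                 "image_output", "ttest_cnv_output", "ttest_snv_output"):
        parser.add_argument(name)
    # Window sizes and minimum depth; real defaults are not given in the source.
    parser.add_argument("--ttestwindow", type=int, help="window size for the CNV t-test")
    parser.add_argument("--ttestsnv", type=int, help="window size for the SNV t-test")
    parser.add_argument("--windowSize", type=int,
                        help="window size for the IBS2* fraction calculation")
    parser.add_argument("--minDepth", type=int,
                        help="minimum depth to include as a non-no-call")
    # Flags stating that a genotype field is absent from the VCF.
    parser.add_argument("--noDP", action="store_true", help="DP is not present in the file")
    parser.add_argument("--noAD", action="store_true", help="AD is not present in the file")
    return parser
```

Parsing the second example command with this sketch would set ttestwindow to 1000 and noDP to True, leaving noAD as False.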

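The header and data layout described under VCF FORMAT can be read with a few lines of Python. The sketch below is illustrative only (Python 3, hypothetical function name) and assumes an uncompressed file in which the FORMAT column and at least one sample column are present. It keeps every per-sample value as a string and does not interpret the INFO column.

```python
def read_vcf(lines):
    """Split VCF text lines into meta-information lines, sample IDs, and variant records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)  # meta-information line
        elif line.startswith("#"):
            columns = line.lstrip("#").split("\t")
            samples = columns[9:]  # sample IDs follow the 8 mandatory columns and FORMAT
        else:
            fields = line.split("\t")
            keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ:DP
            genotypes = {
                sample: dict(zip(keys, value.split(":")))
                for sample, value in zip(samples, fields[9:])
            }
            records.append({
                "CHROM": fields[0],
                "POS": int(fields[1]),
                "REF": fields[3],
                "ALT": fields[4].split(","),
                "genotypes": genotypes,
            })
    return meta, samples, records
```

Applied to a genotype entry written as in the VCF GENOTYPES example, a FORMAT of GT:AD:DP with the value 0|1:41|40:81 becomes a dictionary with GT of 0|1, AD of 41|40 and DP of 81.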
RESULTS

[IBS plots] The father-child and mother-child relationships have very little IBS=0 (noise), but the father-mother relationship has large amounts of IBS=0. IBS=3 is a proxy for a no-call, where the read depth is not enough to definitively state the nucleotide at the position.

[IBS plots] There is a large stretch of IBS=0 in the mother-child chromosome relationship, indicating a deletion.

[Inheritance pattern diagram] This is the inheritance pattern diagram for a fairly normal chromosome. Each number on the scale represents a different inheritance pattern, and 5 is a no-call.

[Inheritance pattern diagram] In a chromosome where there is a genetic abnormality, we can observe Mendelian inconsistencies (MIS) and paternal UPD (UPDP), indicating a deletion in the mother’s copy.

[IBS2* fraction plots] These are the IBS2* plots for a normal chromosome. The father-child and mother-child plots have IBS2* fractions close to 1, while the father-mother plot was around 2/3. Position represents the window number.

[IBS2* fraction plots] These are the IBS2* fraction plots for a maternal deletion. The IBS2* fraction for the mother-child relationship suddenly drops to 0.25. This indicates a maternal deletion.

[Read depth plots] These are the read depth plots and zoomed read depth plots for a fairly normal chromosome. All three individuals have fairly similar plots, which indicates that there is no genetic abnormality. The spikes indicate regions with a large number of repeats, possibly pericentromeric repeats.

[Read depth plots] These are the read depth plots of a chromosome where the child has a duplication. On close inspection, there is an abnormally elevated read depth in the child close to the end of the chromosome. This is the duplication.

[Read depth plot] This is a read depth plot of a chromosome with a deletion in the child close to the end of the chromosome. The difference between the parent plots and the child’s plot signifies a deletion.

CONCLUSION

The project was successful in providing an accurate and complete visualization of a whole-genome VCF file. The plots were successfully made for each chromosome of the individual. Our result in Figure 1 (chromosome 10 image file) was what we expected from an individual with a maternal copy deletion, as we can observe some IBS=0 in the mother-child relationship, read depth drops, and allelic depth drops. Our result in Figure 2 (chromosome 3 image file) matched the prior expectation that the chromosome carries a duplication.

ACKNOWLEDGEMENTS

I would like to thank Dr. Jonathan Pevsner of the Kennedy Krieger Institute for guiding me in this project.
