
Bioinformatic databases and their applications

CONTENTS
1) Keywords
2) Introduction
3) Definition
4) Databases
   a) Types of databases
   b) Search tools
   c) Software
5) Applications
6) Challenges
7) Conclusion
8) References

1.KEYWORDS
NCBI, GenBank, EMBL, BLAST, FASTA, Data curation, Redundancy

2.INTRODUCTION
Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to an explosive growth in the biological information generated by the scientific community. Recognizing the importance of information transmission, accumulation and processing in biological systems, Paulien Hogeweg and Ben Hesper coined the term "bioinformatics" in the 1970s for the study of information processes in biotic systems.

Terminologies
Data mining - the extraction of previously unknown and potentially useful information from data.
Accession number - a unique identifier given to a biological polymer sequence (DNA, protein) when it is submitted to a sequence database.
Annotation - the process of marking the genes in a DNA sequence.
Data curation - the management activities required to maintain research data over the long term, so that it is preserved and available for reuse.
Data integrity - the accuracy, validity and consistency of data.
Algorithm - a step-by-step procedure for calculations.
Redundancy - the presence of more than one mechanism in a biological system, so that another can take over a function if one mechanism fails. In sequence databases, the term also refers to several records describing the same sequence.

3.DEFINITION
Bioinformatics is the science concerned with the development and application of computer hardware and software to the acquisition, storage, analysis and visualization of biological information. Synonyms: Computational Biology, Computational Molecular Biology, Biocomputing.

Aims of bioinformatics
- To let researchers access existing information and submit new entries.
- To develop tools and resources that aid in the analysis of data.
- To use these tools to analyze the data and interpret the results in a biologically meaningful manner.

4.DATABASES
Traditionally, biologists relied on textbooks and research articles published in scientific journals as their main sources of information. Biological databases are the libraries of the life sciences. They contain information collected from scientific experiments, published literature, high-throughput experimental technology and computational analyses.

Biological databases
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query and retrieve components of the data stored within the system. Its purposes are:
- To make biological data available to scientists.
- To make biological data available in computer-readable form.
Biological sequence databases:

NCBI
GenBank - DNA sequence database from the National Center for Biotechnology Information (NCBI), USA. It is split into smaller, discrete divisions, which makes searching more efficient.
EMBL - DNA sequence database maintained by the European Bioinformatics Institute (EBI), the UK-based outstation of the European Molecular Biology Laboratory. It includes sequences from direct submissions, genome sequencing projects, scientific literature and patent applications.
DDBJ (DNA Data Bank of Japan) - the sole nucleotide sequence data bank in Asia, organized by the National Institute of Genetics (NIG). It collects sequence data mainly from Japanese researchers, but also accepts data from, and issues accession numbers to, researchers in any other country.

4a.TYPES OF DATABASES
Primary Databases: Databases consisting of experimentally derived data, such as nucleotide sequences and three-dimensional structures, are known as primary databases.
Secondary Databases: Databases consisting of data derived from the analysis or treatment of primary data (e.g. secondary structures) are known as secondary databases.

Flatfiles
In biological databases, a flat file is a text file that usually contains one (sequence) record. Flat files are the individual units of all sequence databases. The data can be displayed in a variety of formats. One of the most common formats for a sequence record is FASTA.

A closer look at Flatfiles


The first line is called the header. Its fields, as annotated on the example record:
- Name identifier: a unique identifier for each sequence. This is also known as the primary accession number.
- Length of the mRNA.
- Linear molecule, i.e. not a circular molecule like a plasmid.
- Molecule type: in this case, the sequence was submitted as an mRNA sequence. In the identifier, the N means nucleotide and the M means mRNA.
- Taxonomic code.
- Date when last updated.

Flatfiles continued
The second line is called the Definition Line, the goal of which is to summarize the essential biological information encoded by the entry. It contains:
- Genus and species.
- Gene name. Note: gene naming can be confusing. In this case, the gene is named after a fruit fly mutant.
- Basic description of structure and function.

Accession number: the most important entry. If this sequence is used in a publication, the accession number is cited to refer readers to the database entry used or created.

GI: the GenBank-specific GenInfo identifier.

Version: very similar to the accession number, but if the sequence is updated, either because it was wrong or because it was incomplete, the number after the decimal point indicates the version.
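The accession/version relationship described above can be sketched in a few lines of Python. This is only an illustration: the function name `parse_version` and the placeholder identifier `NM_000000` are assumptions made for the example and do not refer to a real record.

```python
from typing import NamedTuple


class VersionedAccession(NamedTuple):
    accession: str  # stable part, the one cited in publications
    version: int    # number after the decimal point, raised when the sequence changes


def parse_version(identifier: str) -> VersionedAccession:
    """Split an 'accession.version' string at the last dot."""
    accession, dot, version = identifier.rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(accession, int(version))


# "NM_000000" is a made-up placeholder accession, not a real database entry.
old = parse_version("NM_000000.1")
new = parse_version("NM_000000.2")
same_record = old.accession == new.accession    # both point at the same entry
sequence_changed = new.version > old.version    # the sequence itself was revised
print(old, new, same_record, sequence_changed)
```

Keeping the two parts separate makes it easy to check whether two identifiers refer to the same entry while still noticing that the underlying sequence has been updated.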

Source organism.

Citation: all GenBank entries must be associated with a citation. This ensures that the means by which the sequences were acquired have been reviewed, and it lends scientific credibility to the database.

This is an EMBL accession number, which means that the sequence was not originally submitted through the GenBank portal.

Source feature: all sequences must come from somewhere, so the minimum data (organism and type of molecule) is entered here, with a link to the Taxonomy Browser.

The list of acceptable database cross-references (i.e. db_xref).

Translation: all annotated nucleotide entries contain a virtual translation into an amino acid sequence. In this case, the translation is derived directly from an mRNA sequence, so there is a good chance it is correct. If the translation comes from a computationally derived genomic sequence, it should be validated against a curated database.

Sequence data: the sequence data in the flat file can be displayed or downloaded in a variety of different ways. A FASTA file is a very common format.

4b. Search tools : FASTA


FASTA stands for FAST-All, reflecting the fact that it can be used for a fast protein comparison or a fast nucleotide comparison.
- A database search tool used to compare a nucleotide or peptide sequence to a sequence database.
- Described by David J. Lipman and William R. Pearson in 1985.
- The first widely used algorithm for database similarity searching.
Servers:
http://fasta.bioch.virginia.edu/
http://www.ebi.ac.uk/fasta33/
http://www.ebi.ac.uk/fasta33/genomes.html

FASTA file
>sequence
AGTCCGATCGATCGTAGCTACGTACGTACGTAGCT
AGCTACGTACGTACGATCGATGATCGATCGATCGA
TCGATCGATCGATCGATCGATCGATCGATCGATCG
This FASTA sequence file has all of the necessary elements for a database entry, but it is not very informative. For example, we don't know what database it is from, what organism it has come from, or what molecule it encodes, if any.
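A file like the one above can be read with a few lines of code. The following is a minimal sketch, assuming that each record starts at a line beginning with ">" and that the sequence may wrap over several lines. The function name `parse_fasta` is chosen for this example and is not part of any particular library.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted lines.

    Records with identical headers overwrite each other in this sketch.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # text after '>' names the record
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[header] += line  # join wrapped sequence lines
    return records


example = """>sequence
AGTCCGATCGATCGTAGCTACGTACGTACGTAGCT
AGCTACGTACGTACGATCGATGATCGATCGATCGA
TCGATCGATCGATCGATCGATCGATCGATCGATCG
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The parser recovers only what the file contains: a name and a sequence. Organism, molecule type and source database are absent, which is exactly the limitation noted above.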

FASTA format
A text-based format for representing either nucleotide sequences or peptide sequences. The ">" character denotes the beginning of a new sequence.
>A sequence
CAGCTGACAGATCGTACGATCGATGCGCACGAAGCACTACTAGCTAGGT
>Another sequence
CGCTAGCTCGCGATCGTATCAACGCGCGCGCGCGCGCATACTCACGCGC

BLAST
Basic Local Alignment Search Tool.
- First described in 1990, with gapped BLAST and PSI-BLAST following in 1997 (S. Altschul and colleagues).
- A method for performing local alignments by searching for High-scoring Segment Pairs (HSPs).
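To make the idea of a local alignment concrete, the sketch below computes the best local alignment score between two sequences with the Smith-Waterman dynamic programming method. This is not BLAST itself: BLAST is a faster heuristic that approximates this kind of search. The scoring values (match +2, mismatch -1, gap -2) are assumptions picked for illustration.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for ch_a in a:
        curr = [0]  # first column is always 0
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            up = prev[j] + gap          # gap in b
            left = curr[j - 1] + gap    # gap in a
            cell = max(0, diag, up, left)  # 0 lets an alignment restart anywhere
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared segment "ACG" is found even though the flanking bases differ.
print(local_alignment_score("TTACGTTT", "GGACGGG"))
```

Because every cell is floored at zero, the score reflects the best-matching segment pair rather than the similarity of the full sequences, which is the sense in which the alignment is local.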

Different Kinds of BLAST


BLASTP - protein query against a protein database.
BLASTN - DNA/RNA query against GenBank (DNA).
BLASTX - six-frame translated DNA query against a protein database.
TBLASTN - protein query against the six-frame translation of GenBank.
TBLASTX - six-frame translated DNA query against the six-frame translation of GenBank.
PSI-BLAST - protein profile query against a protein database.
PHI-BLAST - protein pattern query against a protein database.
MEGABLAST - for comparison of large sets of long DNA sequences.
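Several of the programs above rely on a six-frame translation: three reading frames on the given strand and three on its reverse complement. The following sketch shows what that means in practice, assuming the standard genetic code. The function names and the frame labels ("+1" to "-3") are conventions chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; '*' is a stop, 'X' an unrecognized codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Return the translations of all six reading frames."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

Only one of the six frames usually corresponds to a real protein, which is why translated searches compare every frame against the database.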

NCBI BLAST - http://www.ncbi.nlm.nih.gov/BLAST/
Canadian Bioinformatics Resource BLAST - http://cbr-rbc.nrc-cnrc.gc.ca/blast/
European Bioinformatics Institute BLAST - http://www.ebi.ac.uk/blastall/ and http://www.ebi.ac.uk/blast2/

Nucleotide Sequence Databases


There are 3 main databases: EMBL, GenBank and DDBJ.
The 3 databases are synchronized on a daily basis, and accession numbers are assigned consistently among them. There are no legal restrictions on the use of these databases. However, some of the sequences they contain are patented.

UniGene (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene) - a collection of ESTs (Expressed Sequence Tags) and full-length mRNA sequences organized into clusters, each representing a unique known gene, annotated with mapping and expression information and cross-references to other sources.
SGD (http://www.yeastgenome.org/) - the Saccharomyces Genome Database is a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae.

EBI Genomes (www.ebi.ac.uk/genomes/) - provides access to, and statistics for, the completed genomes, and information about ongoing projects.
Genome Biology (www.ncbi.nlm.nih.gov/Genomes/) - contains information about the available complete genomes.
Ensembl (www.ensembl.org) - a joint project between EMBL-EBI and the Sanger Centre to develop a software system that produces and maintains automatic annotation of eukaryotic genomes.

The most popular and widely used source of nucleotide sequences is the collaboration of three member databases: EMBL, GenBank and DDBJ. Each of the three groups collects a portion of the total sequence data reported worldwide. All new and updated database entries are exchanged between the groups on a daily basis. These databases are designed to provide and encourage access to sequence data within the scientific community.

Protein Sequence Databases


SWISS-PROT (www.expasy.ch/sprot/) - a protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases.
PIR (pir.georgetown.edu) - the Protein Information Resource is a division of the National Biomedical Research Foundation (NBRF), US. It includes a database of sequences extracted from the three-dimensional structures in the Protein Data Bank (PDB).

Genome sequence database


- A database of publicly available nucleotide sequences and their associated biological and bibliographic information.
- Operated by the National Center for Genome Resources (NCGR).
- Contains both genomic and expressed nucleotide sequences.

Microbial Database
Data about the protein-coding regions in the microbial genome sequences.
Organism: name, accession number, genome size, release date, genome center, sequence.
Gene (protein-coding region): name, accession number, organism, location on the chromosome (start, end), strand, size, product, sequence.

Relational Database
- Data is organized into tables of rows and columns.
- Each row represents an instance of an entity.
- Each column represents an attribute of an entity.
- Relationships between entities are represented by values stored in the columns of the corresponding tables (keys).
- Accessible through Structured Query Language (SQL), whose statements are used to retrieve and update data in a database.

Metadata
Data that describes the properties or characteristics of other data. Does not include sample data. Allows database designers and users to understand the meaning of the data.

Web Databases
- Data is accessible through the Internet.
- Web databases can have different underlying database models.
Example: biological databases
- Molecular data: NCBI, Swiss-Prot, PDB.
- Organism specific: Mouse, Worm, Yeast.
- Literature: PubMed.
- Disease.

Data Manipulation Language


Used for executing queries and for updating, inserting and deleting records.
SELECT - extracts data from one or more tables.
INSERT INTO - inserts new data into a table.
UPDATE - updates data in a table.
DELETE FROM - deletes data from a table.
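The four statements can be tried out on a small relational schema. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table layout loosely follows the organism and gene attributes listed under Microbial Database, but the column names and every value inserted are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database held in memory
conn.executescript("""
CREATE TABLE organism (
    accession   TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    genome_size INTEGER
);
CREATE TABLE gene (
    accession TEXT PRIMARY KEY,
    name      TEXT NOT NULL,
    organism  TEXT NOT NULL REFERENCES organism(accession),  -- foreign key
    start_pos INTEGER,
    end_pos   INTEGER,
    strand    TEXT,
    product   TEXT
);
""")

# INSERT INTO: add one organism and two of its genes (all values are made up).
conn.execute("INSERT INTO organism VALUES (?, ?, ?)",
             ("ORG0001", "Example bacterium", 4600000))
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("GENE0001", "geneA", "ORG0001", 100, 1000, "+", "hypothetical protein"),
    ("GENE0002", "geneB", "ORG0001", 1200, 2100, "-", "hypothetical protein"),
])

# UPDATE: refine the annotation of one gene.
conn.execute("UPDATE gene SET product = ? WHERE accession = ?",
             ("example kinase", "GENE0001"))

# DELETE FROM: remove the other gene.
conn.execute("DELETE FROM gene WHERE accession = ?", ("GENE0002",))

# SELECT: join the two tables through the key column.
rows = conn.execute("""
    SELECT gene.name, gene.product, organism.name
    FROM gene JOIN organism ON gene.organism = organism.accession
""").fetchall()
print(rows)
```

The join in the final query illustrates how a relationship between entities is carried by a key value stored in a column.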

Entrez
Many of the databases are linked through a single search and retrieval system called Entrez, which gives access to integrated information from many NCBI databases. Example - the Entrez Protein database is cross-linked to the Entrez Taxonomy database. This allows a researcher to find taxonomic information for the species from which a protein sequence was derived.
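Cross-links of this kind can also be requested programmatically. The sketch below only builds a request URL and does not contact any server. It assumes NCBI's E-utilities web interface and its `elink.fcgi` endpoint with the `dbfrom`, `db` and `id` parameters, none of which is mentioned in the slides. The identifier "12345" is a placeholder, not a real protein record.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities (not taken from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(dbfrom: str, db: str, uid: str) -> str:
    """Build a URL asking for records in `db` linked to `uid` in `dbfrom`."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


# From a protein record to the taxonomy entry of its source species.
url = elink_url("protein", "taxonomy", "12345")  # "12345" is a placeholder ID
print(url)
```

Fetching the URL and parsing the response is left out on purpose, since usage rules and response formats should be checked against the current NCBI documentation.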

4c.List of Software
Software - Description

Bhageerath - Predicts native-like structures for small globular proteins.
Protein Structure Generation - Structure generation from given dihedrals.
Persistence Length - Filter for globular protein evaluation.
Radius of Gyration - Filter for globular protein evaluation.
Hydrophobicity - Filter for globular protein evaluation.
ProRegIn - Protein Regularity Index.
Protein structure optimizer - Energy minimizer for proteins.
ProSEE (Scoring Function for Protein Structure Evaluation) - Calculates the intramolecular energy of a protein, broken down by component.

Genomics
Gene Evaluator (ChemGenome1.1) - Characterizes a DNA sequence as gene or non-gene.
Gene Predictor (ChemGenome2.0) - Whole genome analysis.

Drug Design
Binding Affinity Prediction of Protein-Ligand (BAPPL) server - Computes the binding free energy of a protein-ligand complex.
Binding Affinity Prediction of Protein-Ligand complex containing Zinc (BAPPL-Z) server - Computes the binding free energy of a metalloprotein-ligand complex containing zinc.
Drug-DNA Interaction Energy (PreDDICTA) - Calculates the drug-DNA interaction energy.

DBMS
A DBMS (database management system) is a software package for defining and managing a database.
Examples:
- Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase.
- Open source: MySQL, PostgreSQL.
Advantages:
- Improved data consistency and quality.
- Access control.
- Transaction control.
- Improved accessibility and data sharing.
- Increased productivity of application development.

INDIAN DATABASES
GM Crops database - developed by the NRC on Plant Biotechnology, New Delhi.
- Interactive web resource storing information on the biosafety of transgenics.
- 139 transgenic lines using 4 genes (cry1Ab, cry2Ab & vip3A) are covered.
Vanshanudhan - an outcome of the Indian rice genome initiative; contains information on 56,298 rice genes.

- Bioinformatics Axis DNA Research Centre India, Andhra Pradesh.
- Bioinformatics Centre, Pondicherry.
- Bioinformatics Centre, Kerala.
- Bioinformatics Centre, Department of Biotechnology, Bhopal, Madhya Pradesh.
- Institute of Bioinformatics, Bangalore.

5.APPLICATIONS OF BIOINFORMATICS
Nucleic acid sequence database - serves as the basis for statistical analysis and for the comparison of large numbers of sequences.
Sequence retrieval - to find the nucleotide sequence for a gene of interest.
Sequence identification - to find the function and possible origin of a gene from a sequence.
Sequence analysis - with these applications we can align two sequences, align multiple sequences and perform phylogenetic analyses.

GenBank - provides a computer database of all published DNA and RNA sequences and associated biological information.
EMBL data library - collects, organizes and distributes a database of nucleotide sequences and related descriptive information extracted from publications in scientific journals.

Protein sequence database


Holds information concerning the sequence and properties of a protein.
Determining protein sequence properties - molecular weight (MW), isoelectric point (pI), etc., for a particular protein.
Protein sequence alignment - align a single sequence against a database.
Pairwise sequence alignment - align two protein sequences to each other.
Multiple sequence alignment - align three or more sequences to one another.
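As an example of a sequence-derived property, the molecular weight of a protein can be estimated by adding up the average masses of its residues plus one water molecule for the free ends of the chain. The sketch below assumes commonly tabulated average residue masses, rounded to two decimals. The values should be checked against a reference table before being relied upon, and modifications such as disulfide bonds are ignored.

```python
# Average residue masses in daltons (amino acid minus one water), rounded.
# Assumption: commonly tabulated values, to be verified before serious use.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # one water is added back for the two chain termini


def molecular_weight(protein: str) -> float:
    """Approximate molecular weight (Da) of an unmodified peptide chain."""
    seq = protein.upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not seq:
        return 0.0
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


print(round(molecular_weight("GG"), 2))  # the dipeptide glycylglycine
```

The isoelectric point mentioned above needs the pKa values of the ionizable groups and an iterative charge calculation, so it is not covered by this sketch.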

Structure analysis - the three-dimensional shape of proteins and nucleotides. Examining a protein in 3D allows a greater understanding of protein function.
Taxonomy database - contains the names and lineages of every organism represented by at least one nucleotide or protein sequence in the NCBI genetic databases.

Hybrid technologies - combining biological processes with scientific advances and technological innovations gives rise to a synergistic set of new technologies.
Biosensor technology - measures nutritional value and freshness, and locates vital blood components, environmental pollutants, etc.
Measuring biodiversity - used to collect species names, descriptions, distributions, genetic information, the status and size of populations, habitat needs and how each organism interacts with other species.

Other applications
Molecular medicine - genes associated with diseases, and the molecular basis of a particular disease, can be understood more clearly. This enables better treatment and the development of preventive tests.
Personalized medicine - a patient's genetic profile can be analyzed to prescribe the best available drug therapy and dosage from the beginning.
Preventative medicine - with the specific details of the genetic mechanisms of diseases, diagnostic tests are developed to measure a person's susceptibility to different diseases.
Gene therapy - the approach used to treat, cure or even prevent disease by changing the expression of genes (e.g. cancer clinical trials).

Drug development - at present only about 500 proteins are targeted by drugs. With an improved understanding of disease mechanisms and the use of computational tools, specific drugs with few side effects can be developed.
Waste cleanup - Deinococcus radiodurans (known as the world's toughest bacterium) can be used to clean up toxic chemicals.

Alternative energy sources - Chlorobium tepidum generates energy from light.
Antibiotic resistance - in Enterococcus faecalis, a cause of bacterial infection, a virulence region known as a pathogenicity island, made up of a number of antibiotic-resistance genes, has been discovered. This provides useful markers for detecting pathogenic strains, which helps prevent the spread of infection.

Forensic analysis of microbes - using genomic tools, closely related strains of Bacillus anthracis, the anthrax bacterium, can be distinguished.
Evolutionary studies - the sequencing of genomes from all 3 domains of life (archaea, bacteria and eukarya) helps to determine the tree of life and the last universal common ancestor.

Biotechnology:
- Corynebacterium glutamicum - lysine (an amino acid important in animal nutrition).
- Xanthomonas campestris - xanthan gum (a viscosifying and stabilising agent in industry).
- Lactococcus lactis - for dairy products and to prepare beer, wine and other fermented foods.

Crop improvement
- Insect resistance - Bacillus thuringiensis.
- Improved nutritional quality - vitamin A in rice.
- Extended shelf life - a gene from yeast inserted into tomato.
- Development of drought-resistant varieties, and of cereal varieties tolerant of soil alkalinity, free aluminium and iron toxicities.

Veterinary science - production of therapeutic proteins using farm animals.
Comparative studies - the numbers, locations and biochemical functions of genes in different organisms are compared using bioinformatics tools. Organisms that are suitable for use in experimental research are termed model organisms.

The future role for bioinformatics in plant and crop research Detection of allergenicity of genetically modified crops. Plant science community to extend genomics from models to crops.

Problems in Bioinformatics
With the exponential growth of knowledge in biology, there are rapidly growing problems such as storage, retrieval & analysis of the data.

6.CHALLENGES
- The origin, structure, and fate of the universe
- The fundamental structure of matter
- Earth's physical systems
- The diversity of life on Earth
- The tree of life
- The brain and artificial thinking machines
- Protein structure prediction
- Multiple alignment and phylogeny construction
- Genomic sequence analysis and gene-finding

7.CONCLUSION
Bioinformatics is information technology applied to the management and analysis of biological data. Recognizing the importance of biological information, NCBI, in collaboration with other organizations, built biological databases that serve as the libraries of the life sciences, and developed the tools and software that support their many applications.

8.REFERENCES
Zvelebil, M. and Baum, J.O. 2008. Bioinformatics. Garland Science, New York. Pp : 46-65.
Polanski, A. and Kimmel, M. 2007. Bioinformatics. Springer Berlin Heidelberg, New York. Pp : 349-352.
Ranga, M.M. 2007. Bioinformatics. Agrobios, India. Pp : 164-190.
Dear, P.H. 2007. Bioinformatics. Scion publishers, Cambridge, UK. Pp : 1-13.
Brown, S.M. 2000. Bioinformatics. Eaton publishers, New York. Pp : 47-82.

http://www.ncbi.nlm.nih.gov/About/glance/programs.html
http://www.scfbioiitd.res.in/bioinformatics/bioinformaticssoftware.htm
http://www.ebi.ac.uk/2can/databases/dna.html
http://www.ncbi.nlm.nih.gov/blast/blastcgihelp.shtml
http://www.ncbi.nlm.nih.gov/books/NBK25461/
http://www.flickr.com/photos/emsl/4478768673/ (image source)
