
CCS HAU Bioinformatics

Bio(-)informatics

Dr. Sudhir Kumar


CCS HAU, Hisar
sudhir@hau.ernet.in

Bio = Biology/biological

Informatics = Information Science, including technology

What is Bioinformatics?
Mathematical, statistical and
computing methods that aim to solve
biological problems using DNA and
amino acid sequences and related
information.
Bioinformatics is conceptualizing biology in terms
of macromolecules and then applying
“informatics” techniques to understand and
organize the information associated with these
molecules, on a large scale.

Bioinformatics
• Bioinformatics is the application of
information technology to analyze, process,
and manage biological data.

• Bioinformatics provides computational tools
to facilitate the process of
Data → Information → Knowledge → Discovery

Suggestive Biology-Language Homologies


Cell                            Human Language
Nucleotide Bases                Alphabet
Amino Acids                     Words
Exons                           Phrases
Folding                         Syntax
Proteins                        Word Senses
Protein Circuits                Sentences
Biological Functions            Semantics
Regulation of gene expression   Language generation
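
The alphabet-to-words analogy can be illustrated with a small sketch: a DNA string over the four-letter nucleotide alphabet is read three letters at a time and mapped to amino acids using the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions, not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (third base varies fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in reading frame 0 into one-letter amino acids.

    Trailing bases that do not fill a codon are ignored; codons containing
    unknown letters are reported as "X".
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATGGCCTAA"))  # prints MA*
```

Here the "alphabet" is the set of bases and each three-letter codon plays the role of a syllable that selects one amino acid.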

Overview
• Biological databases are being produced at a phenomenal
rate
• As a result, computers are becoming indispensable for
biological research
• Aims
1- organize the data
2- develop tools to analyze the data
3- apply those tools to biological questions

Bioinformatics

- genome and protein databases
- aligning sequences
- searching
- visualizing protein structure
- homology modeling
- molecular mechanics and molecular dynamics
- structure prediction
- docking
- drug design
- metabolic pathways
- NMR and x-ray crystallography
and many more …

Definitions:
Biocomputing and computational biology are synonyms and
describe the use of computers and computational techniques to
analyze any type of biological system, from individual molecules
to organisms to overall ecology.
Bioinformatics describes using computational techniques to
access, analyze, and interpret the biological information in any
type of biological database.
Sequence analysis is the study of molecular sequence data for
the purpose of inferring the function, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes
(the total DNA content of an organism) within the same and/or
across different genomes.
Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the proteome, of
organisms, both within and between different organisms.

First “Behind the Screen”


• Biological databases are largely
devoted to search.
– Also, integrity, security, etc.
• Search means taking a query
and retrieving some database
entry that matches it.
• Efficiency is key: we want to find
things fast, regardless of how
big the database gets.
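
One common way to keep search fast as a database grows is to build an index once and consult it for every query. The sketch below is only an illustration under stated assumptions: the entry IDs, the toy sequences, the k-mer length, and the function names `build_kmer_index` and `search` are all hypothetical, and real biological databases use far more elaborate indexing.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_kmer_index(entries: Dict[str, str], k: int = 4) -> Dict[str, Set[str]]:
    """Map every length-k substring (k-mer) to the IDs of entries containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for entry_id, seq in entries.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(entry_id)
    return index


def search(entries: Dict[str, str], index: Dict[str, Set[str]],
           query: str, k: int = 4) -> List[str]:
    """Return the sorted IDs of entries that contain the query string exactly."""
    if len(query) < k:
        # Query too short for the index: fall back to scanning every entry.
        candidates = entries.keys()
    else:
        # Only entries sharing the query's first k-mer can possibly match.
        candidates = index.get(query[:k], set())
    return sorted(eid for eid in candidates if query in entries[eid])


# Hypothetical toy database.
ENTRIES = {
    "seq1": "ATGGCGTACGTTAGC",
    "seq2": "TTGACGTACCA",
    "seq3": "GGGCCCAAATTT",
}
INDEX = build_kmer_index(ENTRIES)
print(search(ENTRIES, INDEX, "CGTAC"))  # prints ['seq1', 'seq2']
```

The index lookup narrows the candidates with a single dictionary access, so the expensive full comparison runs only on entries that share a k-mer with the query.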

Rate of growth [figure not preserved]

Bioinformatics: post-genomic era


• High-throughput technologies generate petabytes of data
– Sequencing, microarrays, combinatorial chemistry, high-throughput screening, mass spectrometry, …
• Rapid growth of data and databases in the public and private domains
– Genomics, gene expression profiles, proteomics, pharmacogenomics, clinical trials, literature, …
• Proliferation of computational tools for data analysis and processing
– Statistical analysis tools for sequence analysis and gene finding, clustering algorithms, protein folding and structure prediction, drug docking, visualization tools, data mining tools, …

The Promises
• Digitization of biological systems and processes
– Simulation and modeling of protein-protein interactions, protein pathways, genetic networks, biochemical and cellular processes, normal and disease physiological states, …
• Blurring of the boundary between experimentally generated data and computational data search and analysis
• In silico discovery as a complement to wet-lab experiments

The Landscape of Biological Data Sources
[Figure not preserved; the names shown in it were:]
PRINTS, Patent USPTO, BLOCKS, PFAMB, PIR, GENEPEPT, PFAMA, Patent PCT, PROSITEDOC, LOCUS LINK, DOMO, NRL3D, Patent JPO, TFCLASS, SWISSFAM, PROSITE, TREMBL, Medline, TFMATRIX, PRODOM, UNIGENE, EMBL, TFSITE, DSSP, DDBJ, GSDB, TIGR, DBSTS, TFCELL, SWISSPROT, Entrez, TAXONOMY, EBI, Celera, PDB, RHDB, GENBANK, GENETICCODE, HUGO, Microbial Genomes, STKE, GDB, SNP, WIT, OMIM, Fly Base, KEGG, ENZYME, dbSNP Contact, FASTA, C. elegans, Clinical DB, SSEARCH, BLAST, dbSNP Population, CLUSTALW, SNP Consortium

Databases are of two types - Primary & Secondary

PRIMARY DATABASES

• Primary source of information; can be considered a reservoir of sequence information.
• Primary repository for newly discovered sequences.
• e.g. GenBank at NCBI, EMBL, DDBJ

SECONDARY DATABASES

• These databases derive their information by analyzing the primary databases.
• They express a particular attribute of the primary databases (such as motifs or patterns).
• They add value to the information present in the primary databases.
• e.g. Pfam, BLOCKS, PRINTS etc.

Primary Nucleotide Repository


• NCBI (http://www.ncbi.nlm.nih.gov)
• EMBL (http://www.ebi.ac.uk/embl)
• DDBJ (http://www.ddbj.nig.ac.jp/)

Primary Protein Repository

• PIR (http://pir.georgetown.edu)
• Swiss-Prot/UniProt (http://www.ebi.ac.uk/swissprot)
• Protein Data Bank (http://www.rcsb.org/pdb)



Secondary ‘pattern’ databases

Secondary database   Primary source          Stored information
PROSITE              SWISS-PROT              Regular expressions (patterns)
PRINTS               SWISS-PROT/TrEMBL       Aligned motifs (fingerprints)
Pfam                 SWISS-PROT/TrEMBL       Hidden Markov Models (HMMs)
Profiles             SWISS-PROT              Weight matrices (profiles)
BLOCKS               PRINTS/InterPro/Domo    Weighted motifs (blocks)
IDENTIFY             PRINTS/InterPro         Permissive regular expressions
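
To make the "regular expressions (patterns)" entry concrete, the sketch below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It is a simplified illustration, not PROSITE's own software: the function name `prosite_to_regex` and the test sequence are assumptions, and rarer syntax such as `>` inside square brackets is not handled. The example pattern `N-{P}-[ST]-{P}` is the widely cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regex string.

    Handles the common elements: single residues, x (any residue),
    [ABC] (allowed set), {ABC} (forbidden set), repeats such as x(3) or
    x(2,4), and the < and > anchors for the sequence ends.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        repeat = ""
        if element.endswith(")"):
            # Split "x(2,4)" into the residue part and the repeat count.
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # single residue or [ABC] set
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTQ"  # hypothetical protein fragment
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)  # prints N[^P][ST][^P] [(2, 'NGSA')]
```

The second asparagine in the test sequence is followed by proline, so the `{P}` element correctly excludes it.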

NUCLEOTIDE REPOSITORY
• EMBL: European Molecular Biology Laboratory, at Cambridge, UK.
• GenBank: at NCBI, a division at the NIH campus, USA.
• DDBJ: DNA Data Bank of Japan, Mishima, Japan.

• The three have worked in collaboration since 1982.
• Each collects information from its own region.
• They automatically update each other every 24 hours.
• To organize the huge amount of information, the database
has been split into numerous divisions (17), and each
division has a specific 3-letter code, e.g.
Human   HUM
Virus   VRL
Fungi   FUN

NCBI, EMBL, DDBJ [figure not preserved]

Bioinformatics Centre, BISR, Jaipur

The Biological data and databases


• Complex
– data types range from protein and nucleic acid sequences, texts, 3-dimensional molecular structures, images of cells and tissues
• Hierarchical
– data organizations range from molecules, biochemical pathways, cells, tissues, organisms, populations
• Heterogeneous
– database locations, storage formats, and access methods
• Dynamic
– data contents and database schema are constantly changing

The computational tools and algorithms
• Input/Output data formats
– Each application program requires specific I/O data formats that may impede data flow from one program to the next
• Rapidly evolving
– New algorithms development and improvement of old ones
• Require graphical display or presentation of results
– viewers for sequence alignments, 3-D structures, multi-dimensional plots, …
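
The point about I/O formats impeding data flow can be illustrated with a small converter. The sketch below reads FASTA-formatted text and rewrites it as tab-separated lines that a spreadsheet or relational table could ingest. The function names `parse_fasta` and `to_tabular` and the sample records are hypothetical; established libraries cover many more formats and edge cases.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence.

    The record name is the first whitespace-delimited word after ">".
    Sequence lines belonging to one record are concatenated.
    """
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Records with an empty header get a placeholder name.
            current = header[0] if header else "unnamed_%d" % (len(chunks) + 1)
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def to_tabular(records):
    """Render records as 'name<TAB>sequence' lines."""
    return "\n".join("%s\t%s" % (name, seq) for name, seq in records.items())


# Hypothetical two-record FASTA input.
FASTA_TEXT = """>seqA hypothetical example
ATGC
GGTA
>seqB
TTAA
"""
records = parse_fasta(FASTA_TEXT)
print(to_tabular(records))
```

Small adapters of this kind are what lets the output of one program become the input of the next.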

Integration: Databases and Scientific Algorithms
[Diagram not preserved; it showed the sources below, each with its native format, arranged around a central "Integration / Bioinformatics" label]
Medline            ASN.1
OMIM               Text file
Entrez/NCBI        ASN.1
ClustalW           FASTA
Microarray data    RDBMS, Excel
BLAST              FASTA
KEGG               HTML text, binary images
PDB                Oracle, 3D images

Examples of Bioinformatics
• Database interfaces
– Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …
• Sequence alignment
– BLAST, FASTA
• Multiple sequence alignment
– Clustal, MultAlin, DiAlign
• Gene finding
– Genscan, GenomeScan, GeneMark, GRAIL
• Protein domain analysis and identification
– Pfam, BLOCKS, ProDom
• Pattern identification/characterization
– Gibbs Sampler, AlignACE, MEME
• Protein folding prediction
– PredictProtein, SWISS-MODEL
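
As a rough idea of what the sequence alignment tools compute, the sketch below scores the best local alignment between two sequences with the Smith-Waterman dynamic programming recurrence. This is not how BLAST or FASTA are implemented; both use faster heuristics. The scoring values (match 2, mismatch -1, linear gap -2) and the function name `smith_waterman_score` are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses the Smith-Waterman recurrence with a linear gap penalty and
    keeps only two rows of the scoring matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))  # prints 6 (shared "ACG")
```

Recovering the aligned residues themselves would require keeping the full matrix and tracing back from the best cell, which is omitted here for brevity.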

Five websites that all biologists should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
– http://www.rcsb.org/PDB/

Database Growth (cont.)


The Human Genome Project and numerous smaller
genome projects have kept the data coming at
alarming rates. As of February 2001, 45 complete,
finished genomes were publicly available for
analysis, not counting the virus and viroid
genomes.
The International Human Genome Sequencing
Consortium announced the completion of a
"Working Draft" of the human genome in June
2000.

What is bioinformatics, genomics, sequence analysis, computational
molecular biology . . . ?
The Reverse Biochemistry Analogy.
Biochemists no longer have to begin a research project by
isolating and purifying massive amounts of a protein from its
native organism in order to characterize a particular gene
product. Rather, now scientists can amplify a section of
some genome based on its similarity to other genomes,
sequence that piece of DNA and, using sequence analysis
tools, infer all sorts of functional, evolutionary, and, perhaps,
structural insight into that stretch of DNA!
The computer and molecular databases are a
necessary, integral part of this entire process.

Vaccine development in the post-genomic era: the reverse vaccinology approach.
COMPND 123.PDB
HETATM 1 O -1.250 -2.964 0.008
HETATM 2 C -0.398 -2.223 0.438
HETATM 3 N -0.056 -1.110 -0.255
HETATM 4 N 0.215 -2.505 1.614
HETATM 5 C -0.732 -0.857 -1.489
HETATM 6 C 0.943 -0.166 0.171
HETATM 7 C 1.170 -1.673 2.096
HETATM 8 O -0.192 0.337 -2.121
HETATM 9 C -2.208 -0.564 -1.230
HETATM 10 C 1.548 -0.444 1.330
HETATM 11 O 1.716 -1.925 3.144
HETATM 12 C -1.205 1.278 -2.349

HETATM 24 H -2.768 0.082 -3.214


HETATM 25 H 3.574 0.173 1.498
HETATM 26 H 2.610 0.443 2.943
HETATM 27 H 2.407 1.487 1.544
HETATM 28 H -1.351 1.949 -0.315
HETATM 29 H -0.176 2.831 -1.281
HETATM 30 H -2.056 4.016 -0.887
CONECT 1 2
CONECT 2 1 3 4
CONECT 3 2 5 6
CONECT 4 2 7 18

CONECT 29 15
CONECT 30 17
END
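
The listing above can be read programmatically. The sketch below parses HETATM and CONECT records and computes the length of each listed bond, using the first four atoms from the listing. It assumes the whitespace-separated layout shown here (record type, serial number, element, x, y, z); standard PDB files use fixed column positions and carry more fields, so this parser is an illustration only. The function names `parse_records` and `bond_lengths` are hypothetical.

```python
import math

# First four atoms and their connectivity, copied from the listing above.
PDB_TEXT = """\
HETATM 1 O -1.250 -2.964 0.008
HETATM 2 C -0.398 -2.223 0.438
HETATM 3 N -0.056 -1.110 -0.255
HETATM 4 N 0.215 -2.505 1.614
CONECT 1 2
CONECT 2 1 3 4
"""


def parse_records(text):
    """Return (atoms, bonds) from whitespace-separated HETATM/CONECT lines.

    atoms maps serial number -> (element, (x, y, z));
    bonds is a set of (lower_serial, higher_serial) pairs.
    """
    atoms, bonds = {}, set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "HETATM" and len(fields) >= 6:
            coords = tuple(float(v) for v in fields[3:6])
            atoms[int(fields[1])] = (fields[2], coords)
        elif fields[0] == "CONECT" and len(fields) >= 3:
            first, *partners = (int(v) for v in fields[1:])
            for partner in partners:
                bonds.add(tuple(sorted((first, partner))))
    return atoms, bonds


def bond_lengths(atoms, bonds):
    """Euclidean distance for every bond whose two atoms are both known."""
    return {
        pair: math.dist(atoms[pair[0]][1], atoms[pair[1]][1])
        for pair in sorted(bonds)
        if pair[0] in atoms and pair[1] in atoms
    }


atoms, bonds = parse_records(PDB_TEXT)
for (i, j), length in bond_lengths(atoms, bonds).items():
    print("%s%d-%s%d: %.3f" % (atoms[i][0], i, atoms[j][0], j, length))
```

Bonds that reference atoms missing from the input are skipped, which matters for the listing above because some records are elided.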

Challenges in bioinformatics
• Explosion of information
– Need for faster, automated analysis to process large
amounts of data
– Need for integration between different types of
information (sequences, literature, annotations, protein
levels, RNA levels etc…)
– Need for “smarter” software to identify interesting
relationships in very large data sets
• Lack of “bioinformaticians”
– Software needs to be easier to access, use and
understand
– Biologists need to learn about the software, its
limitations, and how to interpret its results

New areas in Bioinformatics


• Microarrays
• Functional Genomics
• Structural Genomics
• Comparative Genomics
• Pharmacogenomics
• Medical Informatics

Your Turn:

Any question(s)?
