Bio(-)informatics
Dr. Sudhir Kumar

CCS HAU, Hisar
sudhir@hau.ernet.in
Bio = Biology/biological
Informatics = Information Science,
including technology
What is Bioinformatics?
Mathematical, statistical and
computing methods that aim to solve
biological problems using DNA and
amino acid sequences and related
information.
Bioinformatics is conceptualizing biology in terms
of macromolecules and then applying
“informatics” techniques to understand and
organize the information associated with these
molecules, on a large scale.
Bioinformatics
• Bioinformatics is the application of
information technology to analyze, process,
and manage biological data.
• Bioinformatics provides computational tools
to facilitate the process of
Data → Information → Knowledge → Discovery
Suggestive Biology-Language Homologies

Cell                             Human Language
Nucleotide Bases                 Alphabet
Amino Acids                      Words
Exons                            Phrases
Folding                          Syntax
Proteins                         Word Senses
Protein Circuits                 Sentences
Biological Functions             Semantics
Regulation of gene expression    Language generation
Overview
• Biological databases are being produced at a phenomenal
rate
• As a result computers are becoming indispensable for
biological research
• Aims
1- organize data
2- develop tools
3- apply the tools to biological problems
Bioinformatics
-Genome and protein databases
-aligning sequences
-searching
-visualizing protein structure
-homology modeling
-molecular mechanics and
molecular dynamics
-structure prediction
-docking
-drug design
-metabolic pathways
-NMR and x-ray crystallography
and many more ….
Definitions:
Biocomputing and computational biology are synonyms and
describe the use of computers and computational techniques to
analyze any type of biological system, from individual molecules
to organisms to overall ecology.
Bioinformatics describes using computational techniques to
access, analyze, and interpret the biological information in any
type of biological database.
Sequence analysis is the study of molecular sequence data for
the purpose of inferring the function, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes
(the total DNA content of an organism) within the same and/or
across different genomes.
Proteomics is the subdivision of genomics concerned with
analyzing the complete protein complement, i.e. the proteome, of
organisms, both within and between different organisms.
First “Behind the Screen”

• Biological databases are largely
devoted to search.
– Also, integrity, security, etc.
• Search means taking a query
and retrieving some database
entry that matches it.
• Efficiency is key: we want to find
things fast, regardless of how
big the database gets.
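The efficiency point can be illustrated with a minimal sketch. It assumes a toy in-memory "database" of accession-to-sequence records (the accessions and sequences are invented for illustration) and contrasts a linear scan, whose cost grows with the database, with a hash index, whose average lookup cost stays roughly constant. Real biological databases use far more elaborate indexing.

```python
# Toy records: (accession, sequence). Hypothetical values, for illustration only.
RECORDS = [
    ("ACC0001", "ATGGCGTACGTTAG"),
    ("ACC0002", "ATGCCCGGGTTTAA"),
    ("ACC0003", "ATGAAATTTCCCGG"),
]


def linear_search(records, accession):
    """Scan every record until the accession matches: O(n) per query."""
    for acc, seq in records:
        if acc == accession:
            return seq
    return None


def build_index(records):
    """Build a hash index once so that each later query is O(1) on average."""
    return {acc: seq for acc, seq in records}


def indexed_search(index, accession):
    """Look the accession up in the prebuilt index; None if it is absent."""
    return index.get(accession)


INDEX = build_index(RECORDS)
```

Both functions return the same entry for a given query; only the cost per query differs as the number of records grows.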
Rate of growth
Bioinformatics: post-genomic era

• High-throughput technologies generate petabytes
of data
– Sequencing, microarrays, combinatorial chemistry,
high-throughput screening, mass spectrometry, …
• Rapid growth of data and databases in the public
and private domains
– Genomics, gene expression profiles, proteomics,
pharmacogenomics, clinical trials, literature, …
• Proliferation of computational tools for data
analysis and processing
– Statistical analysis tools for sequence analysis and
gene finding, clustering algorithms, protein folding and
structure prediction, drug docking, visualization tools,
data mining tools, …
The Promises
• Digitization of the biological systems and
processes
Simulation and Modeling of protein-protein
interactions, protein pathways, genetic networks,
biochemical and cellular processes, normal and
disease physiological states,…
• Blurring of the boundary between
experimentally generated data and
computational data search and analysis
• In silico discovery in complement with
wet lab experiments
The Landscape of Biological Data Sources
[Figure: a cloud of data sources and tools, in no particular order]
PRINTS, Patent USPTO, BLOCKS, PFAMB, PIR, GENEPEPT, PFAMA, Patent PCT,
PROSITEDOC, LOCUS LINK, DOMO, NRL3D, Patent JPO, TFCLASS, SWISSFAM,
PROSITE, TREMBL, Medline, TFMATRIX, PRODOM, UNIGENE, EMBL, TFSITE, DSSP,
DDBJ, GSDB, TIGR, DBSTS, TFCELL, SWISSPROT, Entrez, TAXONOMY, EBI, Celera,
PDB, RHDB, GENBANK, GENETICCODE, HUGO, Microbial Genomes, STKE, GDB, SNP,
WIT, OMIM, Fly Base, KEGG, ENZYME, dbSNP Contact, FASTA, C. Elegans,
Clinical DB, SSEARCH, BLAST, dbSNP Population, CLUSTALW, SNP Consortium
Databases are of two types - Primary & Secondary

PRIMARY DATABASES
• Primary source of information; can be considered a
reservoir of sequence information.
• Primary repository for newly discovered sequences.
• e.g. GenBank at NCBI, EMBL, DDBJ

SECONDARY DATABASES
• These databases derive their information by analysing
the primary databases.
• They express a particular attribute of the primary
databases (such as a motif or pattern).
• They add value to the information present in the
primary databases.
• e.g. Pfam, BLOCKS, PRINTS, etc.
Primary Nucleotide Repository

• NCBI (http://www.ncbi.nlm.nih.gov)
• EMBL (http://www.ebi.ac.uk/embl)
• DDBJ (http://www.ddbj.nig.ac.jp/)
Primary Protein Repository

• PIR (http://pir.georgetown.edu)
• Swissprot/Uniprot (http://www.ebi.ac.uk/swissprot)
• Protein Data Bank (http://www.rcsb.org/pdb)

Secondary ‘pattern’ databases

Secondary database   Primary source          Stored information
PROSITE              SWISS-PROT              Regular expressions (patterns)
PRINTS               SWISS-PROT/TrEMBL       Aligned motifs (fingerprints)
Pfam                 SWISS-PROT/TrEMBL       Hidden Markov Models (HMMs)
Profiles             SWISS-PROT              Weight matrices (profiles)
BLOCKS               PRINTS/InterPro/Domo    Weighted motifs (blocks)
IDENTIFY             PRINTS/InterPro         Permissive regular expressions
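To make "regular expressions (patterns)" concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression. The function name `prosite_to_regex` is hypothetical, and only the common elements are handled (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` anchors). The sample pattern follows the form of the PROSITE C2H2 zinc-finger signature and serves only as sample input.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: rarer PROSITE syntax is not supported.
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")  # '<' anchors at the N-terminus
        anchor_end = element.endswith(">")      # '>' anchors at the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        residue, count = m.groups()
        if residue == "x":
            residue = "."                          # x = any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # {..} = any residue except these
        # [..] classes and single letters are already valid regex syntax.
        piece = residue + ("{" + count + "}" if count else "")
        parts.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(parts)


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
ZINC_FINGER_RE = re.compile(prosite_to_regex(ZINC_FINGER))
```

Scanning a protein sequence then reduces to `ZINC_FINGER_RE.search(sequence)`, which is why pattern databases of this kind are cheap to search.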
NUCLEOTIDE REPOSITORY
• EMBL - European Molecular Biology Laboratory, at Cambridge, UK.
• GenBank - at NCBI, a division on the NIH campus, USA.
• DDBJ - DNA Data Bank of Japan, Mishima, Japan.
• They have worked in collaboration since 1982.
• Each collects information from its own region.
• They automatically update each other every 24 hours.
• To organize the huge amount of information, the database
has been split into numerous divisions (17), each with a
specific 3-letter code, e.g.
Human HUM
Virus VRL
Fungi FUN
[Figure: data exchange among NCBI, EMBL and DDBJ]
Bioinformatics Centre, BISR, Jaipur
The Biological data and databases

• Complex
– data types range from protein and nucleic acid
sequences and texts to 3-dimensional molecular
structures and images of cells and tissues
• Hierarchical
– data organization ranges from molecules and biochemical
pathways to cells, tissues, organisms and populations
• Heterogeneous
– database locations, storage formats, and access
methods all differ
• Dynamic
– data contents and database schemas are constantly
changing
The computational tools and
algorithms
• Input/Output data formats
– Each application program requires specific I/O data
formats, which may impede data flow from one program to
the next
• Rapidly evolving
– Development of new algorithms and improvement of old
ones
• Require graphical display or presentation of
results
– viewers for sequence alignments, 3-D structures,
multi-dimensional plots, …
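The I/O format problem can be sketched as follows. This assumes, hypothetically, that one program emits FASTA text while the next expects tab-separated "id, sequence" lines; the function names and the example records are invented for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_tsv(text):
    """Convert FASTA text to 'id<TAB>sequence' lines (id = first header word)."""
    rows = []
    for header, seq in parse_fasta(text):
        seq_id = (header.split() or [""])[0]
        rows.append(f"{seq_id}\t{seq}")
    return "\n".join(rows)


# Hypothetical example records, for illustration only.
EXAMPLE = """>seq1 hypothetical example record
ATGGCG
TACGTT
>seq2
ATGCCC
"""
```

Small adapters like this are typical of the glue code needed whenever two tools disagree on file format.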
Integration: Data Bases and Scientific Algorithms
[Figure: data sources and tools, each with its native format, feeding into "BioInformatics Data Integration"]
• Medline (ASN.1)
• OMIM (text file)
• Entrez/NCBI (ASN.1)
• ClustalW (FASTA)
• Microarray data (RDBMS, Excel)
• BLAST (FASTA)
• KEGG (HTML text, binary images)
• PDB (Oracle, 3D images)
Examples of Bioinformatics
• Database interfaces
– Genbank/EMBL/DDBJ, Medline, SwissProt, PDB, …
• Sequence alignment
– BLAST, FASTA
• Multiple sequence alignment
– Clustal, MultAlin, DiAlign
• Gene finding
– Genscan, GenomeScan, GeneMark, GRAIL
• Protein Domain analysis and identification
– Pfam, BLOCKS, ProDom
• Pattern Identification/Characterization
– Gibbs Sampler, AlignACE, MEME
• Protein Folding prediction
– PredictProtein, SWISS-MODEL
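As an illustration of what pairwise sequence alignment computes, here is a minimal sketch of Smith-Waterman local alignment scoring. This is not how BLAST or FASTA work internally: both are faster heuristics that approximate this kind of exhaustive dynamic programming. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The function keeps only two rows of the matrix, so memory grows with the shorter dimension, but it returns the score alone; recovering the alignment itself would require a traceback over the full matrix.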
Five websites that all biologists should know
• NCBI (The National Center for Biotechnology Information)
– http://www.ncbi.nlm.nih.gov/
• EBI (The European Bioinformatics Institute)
– http://www.ebi.ac.uk/
• The Canadian Bioinformatics Resource
– http://www.cbr.nrc.ca/
• SwissProt/ExPASy (Swiss Bioinformatics Resource)
– http://expasy.cbr.nrc.ca/sprot/
• PDB (The Protein Databank)
– http://www.rcsb.org/PDB/
Database Growth (cont.)

The Human Genome Project and numerous smaller
genome projects have kept the data coming at
alarming rates. As of February 2001, 45 complete,
finished genomes are publicly available for
analysis, not counting the virus and viroid
genomes available.
The International Human Genome Sequencing
Consortium announced the completion of a
"Working Draft" of the human genome in June
2000.
What is bioinformatics, genomics,
sequence analysis, computational
molecular biology . . . ?
The Reverse Biochemistry Analogy.
Biochemists no longer have to begin a research project by
isolating and purifying massive amounts of a protein from its
native organism in order to characterize a particular gene
product. Rather, now scientists can amplify a section of
some genome based on its similarity to other genomes,
sequence that piece of DNA and, using sequence analysis
tools, infer all sorts of functional, evolutionary, and, perhaps,
structural insight into that stretch of DNA!
The computer and molecular databases are a
necessary, integral part of this entire process.
Vaccine development
In the post-genomic era:
the Reverse Vaccinology approach.
COMPND 123.PDB
HETATM 1 O -1.250 -2.964 0.008
HETATM 2 C -0.398 -2.223 0.438
HETATM 3 N -0.056 -1.110 -0.255
HETATM 4 N 0.215 -2.505 1.614
HETATM 5 C -0.732 -0.857 -1.489
HETATM 6 C 0.943 -0.166 0.171
HETATM 7 C 1.170 -1.673 2.096
HETATM 8 O -0.192 0.337 -2.121
HETATM 9 C -2.208 -0.564 -1.230
HETATM 10 C 1.548 -0.444 1.330
HETATM 11 O 1.716 -1.925 3.144
HETATM 12 C -1.205 1.278 -2.349
HETATM 24 H -2.768 0.082 -3.214
HETATM 25 H 3.574 0.173 1.498
HETATM 26 H 2.610 0.443 2.943
HETATM 27 H 2.407 1.487 1.544
HETATM 28 H -1.351 1.949 -0.315
HETATM 29 H -0.176 2.831 -1.281
HETATM 30 H -2.056 4.016 -0.887
CONECT 1 2
CONECT 2 1 3 4
CONECT 3 2 5 6
CONECT 4 2 7 18
CONECT 29 15
CONECT 30 17
END
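A minimal sketch of reading such a listing: it assumes the simplified whitespace-separated layout shown above (record type, serial number, element, x, y, z) rather than the fixed-column layout of real PDB files, and the function names are hypothetical. It parses the first two HETATM records and computes the distance between atoms 1 (O) and 2 (C).

```python
import math

# First two atom records of the listing above, in its whitespace-separated layout.
LISTING = """\
HETATM 1 O -1.250 -2.964 0.008
HETATM 2 C -0.398 -2.223 0.438
CONECT 1 2
"""


def parse_atoms(text):
    """Map atom serial number -> (element, (x, y, z)) for HETATM lines."""
    atoms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] == "HETATM":
            serial = int(fields[1])
            coords = tuple(float(value) for value in fields[3:6])
            atoms[serial] = (fields[2], coords)
    return atoms


def distance(atoms, i, j):
    """Euclidean distance between atoms i and j, in the file's coordinate units."""
    return math.dist(atoms[i][1], atoms[j][1])


ATOMS = parse_atoms(LISTING)
```

For these two atoms the distance comes out near 1.21, which is in the range expected for a carbon-oxygen double bond if the coordinates are in ångströms.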
Challenges in bioinformatics
• Explosion of information
– Need for faster, automated analysis to process large
amounts of data
– Need for integration between different types of
information (sequences, literature, annotations, protein
levels, RNA levels etc…)
– Need for “smarter” software to identify interesting
relationships in very large data sets
• Lack of “bioinformaticians”
– Software needs to be easier to access, use and
understand
– Biologists need to learn about the software, its
limitations, and how to interpret its results
New areas in Bioinformatics

•Microarrays
•Functional Genomics
•Structural Genomics
•Comparative Genomics
•Pharmacogenomics
•Medical Informatics
What is bioinformatics?
Your Turn:
ANY Question(s)
