
BIOINFORMATICS & COMPUTING METHODS

Presented by:

Sudhakar Tripathi
Research Scholar, Computer Engineering Department, IT-BHU

Supervisor:

Prof. R.B.Mishra

Bioinformatics Definition
An interdisciplinary field involving biology, computer science, mathematics and statistics to analyze biological sequence data, genome content, arrangement and to predict the function & structure of macromolecules.
-David C. Mount

What is Bioinformatics?
The creation and development of advanced information and computational technologies for problems in biology, most commonly molecular biology (but increasingly in other areas of biology). As such, it deals with methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequences, structures, functions, pathways and genetic interactions.

Need for and Use of Bioinformatics
Bioinformatics plays a key role in modern biology and is especially important in:
- Molecular biology
- Genomics
- Functional genomics
- Systems biology
- Protein design and engineering
- Pharmaceutical development
- Medicine
- Ecology / population genetics

Need for and Use of Bioinformatics
- Finding genes, locating coding regions, predicting function
- Function, Evolution, Sequence, Structure (FESS relationships)
- Metabolic genotype, phenotype, redundancy
- Genes to pathways; genes to biological knowledge
- Proteomics: proteome of an organism
- Assigning gene sets to different species: homologs vs paralogs
- Expression profiles and their relation to metabolic pathways / genetic networks: experimentally analyse thousands of genes simultaneously
- Gene synteny between species: gene adjacency in genomes
- Polymorphisms, haplotypes, propensity for genetic disease
- Searching databases for nucleotide or amino acid sequences that match sequences in unknown samples
- Inferring a protein's shape and function from a given sequence of amino acids
- Finding all the genes and proteins in a given genome
- Determining sites in the protein structure where drug molecules can be attached

Aim of research in Bioinformatics
Understand the functioning of living things, to improve the quality of life:
- drug design
- identification of genetic risk factors
- gene therapy
- genetic modification of food crops and animals, etc.
- application to e.g. biotechnology

How will this benefit humanity?
- Genetically modified crops: contamination, escapes
- Genetically modified [...]: whisky?
- Genes & behaviour: really?
- Testing on animals: why?
- Gene therapy: do benefits outweigh dangers?
- Bio weapons?

DNA
- Genetic material

RNA
- Information transfer (mRNA)
- Protein synthesis (tRNA/mRNA)
- Some catalytic activity

Proteins
Most cellular functions are performed or facilitated by proteins.
- Primary biocatalyst
- Cofactor transport/storage
- Mechanical motion/support
- Immune protection
- Control of growth/differentiation

Genome Sequence
- Finding genes in genomic DNA: introns, exons, promoters
- Characterizing repeats in genomic DNA: statistics, patterns
- Duplications in the genome

The Complexity of Biological Data
- Nucleotide sequences
- Nucleotide structures
- Gene expressions
- Protein structures
- Protein functions
- Protein-protein interaction (pathways)
- Cell
- Cell signaling
- Tissues
- Organs
- Physiology
- Organisms

Basic cell architecture


Cells are the smallest functional units of life

Types of cell
Prokaryotes
- Single cell
- No nucleus
- No organelles
- One piece of circular DNA (often with additional plasmids)
- No mRNA post-transcriptional modification

Eukaryotes
- Single or multi cell
- Nucleus
- Organelles
- Chromosomes
- Exon/intron splicing

Proteins
Proteins are biological molecules of primary importance to the functioning of living organisms. They perform many and varied functions:
- Structural proteins: the organism's basic building blocks, e.g. collagen, nails, hair, etc.
- Enzymes: biological engines which mediate a multitude of biochemical reactions. Usually enzymes are very specific and catalyze only a single type of reaction, but they can play a role in more than one pathway.
- Transmembrane proteins: the cell's housekeepers, e.g. regulating cell volume, extracting and concentrating small molecules from the extracellular environment, and generating the ionic gradients essential for muscle and nerve cell function (the sodium/potassium pump is an example).

Understanding protein structure is key to understanding function and dysfunction


Amino Acid Sequences
- Amino acids (AAs) are polymerised into chains (residues)
- Gene sequence determines protein sequence

Protein Structure
- Chains fold into specific compact structures
- Structure formation (folding) is spontaneous
- Sequence determines structure
- Structure determines function

Information flow in the cell - Central Dogma
- DNA (4 bases, {A,C,G,T}) is transcribed into RNA (4 bases, {A,C,G,U})
- RNA is translated into protein (20 amino acid residues, {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}) by triplets (codons) of RNA bases
  - UCA -> Serine (S)
  - Start codon AUG -> Methionine (M)
  - 3 stop codons (UGA, UAA, UAG) in most species
- As always in biology, there are exceptions! Some species use different stop codons. The codon table (codon -> AA) is not the same for all species; mitochondria, for example, use a different codon table.
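The DNA -> RNA -> protein flow described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code only (no mitochondrial or other variant tables) and a coding-strand DNA input; the function names `transcribe` and `translate` are illustrative, not taken from the slides.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the conventional U, C, A, G ordering;
# '*' marks the three stop codons (UAA, UAG, UGA).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon reached
            break
        protein.append(aa)
    return "".join(protein)


print(translate(transcribe("ATGTCATGA")))  # AUG UCA UGA
```

Handling a species-specific or mitochondrial code would only require swapping in a different `CODON_TABLE`; the sketch raises `KeyError` on characters outside A, C, G, U.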

How DNA Codes for Protein

Various Problem areas in Bioinformatics


- Sequencing
- Sequence analysis
- Sequence alignment
- RNA secondary structure prediction
- Identifying gene regulatory networks
- Protein structure analysis
- Protein structure comparison
- Protein folding
- Domain pattern recognition
- Sequence representation
- Genotype analysis
- Splicing site prediction

- Protein-protein interaction
- Database development
- Modeling genetics history
- Ancient DNA
- cDNAs
- Population genetics simulations
- Finding SNPs
- Genome-wide association studies
- Homology search
- The sequence DB search problem: efficient searching in large data sets
- Interfacing with data to support genomics research: software, databases
- HGT analysis

- Finding signal in the datasets: statistical and computational methods
- Becoming more efficient in how the data is processed, organized, and accessed
- How do we represent the large amount of data? Dynamically and interactively?
- Gene network reconstruction from time series data
- Gene function prediction
- Clustering of gene expression data
- Characterization of metabolic pathways between different genomes
- Organizing biological knowledge in databases
- Signal transduction and other biochemical pathways
- Phylogenetics: predicting the genetic or evolutionary relation of a set of organisms

- Alternative splicing
- Gene-disease relationships
- Microarray data collection, calibration and analysis
- Polymorphism analysis and visualization
- Pathway analysis: sequence comparison, searches in sequence databases
- Sequence matching: tracing phylogeny, finding family relationships between species by tracking similarities between species
- Molecular networks
- Protein threading
- Sequence comparisons and sequence-based database searches
- Clinical diagnosis
- Gene expression prediction
- Genetic linkage analysis
- Protein function prediction

Various Computational methods used in Bioinformatics


- Mathematical computing methods
- Statistical computing methods
- Intelligent computing methods
- Neural network approaches
- Integrated differential fuzzy clustering
- Fuzzy computing
- Genetic and evolutionary computing algorithms
- Probabilistic computing and belief networks
- Hybrid intelligent systems
- Swarm intelligence
- Rough set theory
- Granular computing
- Artificial immune systems
- Chaos theory
- The differential evolution algorithm
- Soft computing
- Dynamic programming & various algorithmic computations
- Simulated annealing
- Neural fuzzy systems
- Fuzzy adaptive resonance theory
- Quantum computing
- Data mining
- Theory of computation
- Quantum evolutionary/genetic algorithm
- Artificial intelligence
- Identification (decision) trees
- Genetic algorithms
- Genetic programming
- Cellular automata
- Computer science algorithms
- Evolutionary computation
- Optimization techniques
- Agent-based computing
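As one concrete instance of dynamic programming applied to a listed problem area (sequence alignment), the sketch below computes a global alignment score in the Needleman-Wunsch style. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function name are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch
            )
            score[i][j] = max(
                diagonal,                  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,     # gap in b
                score[i][j - 1] + gap,     # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

The table is filled in O(len(a) * len(b)) time; recovering the alignment itself would additionally need a traceback through the table, which is omitted here.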

Gene Prediction
- Overview of steps & strategies
- What sequence signals can be used?
- What other types of information can be used?
- Algorithms: HMMs, discriminant functions, neural nets
- Gene prediction software: 3 major types; many, many programs!

Overview of gene prediction strategies


What sequence signals can be used?
- Transcription: TF binding sites, promoter, initiation site, terminator
- Processing signals: splice donor/acceptors, polyA signal
- Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage
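The translation signals can be used directly to find open reading frames (ORFs). The sketch below scans the three forward reading frames of a DNA string for ATG ... stop stretches, using the DNA equivalents of the start codon (ATG) and the standard stop codons (TAA, TAG, TGA). The name `find_orfs` and the `min_codons` cutoff are illustrative assumptions; real gene finders also scan the reverse strand and score codon usage.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, sequence) for ORFs on the forward strand.

    start/end are 0-based, end-exclusive, and the stop codon is included.
    min_codons counts the codons before the stop (including ATG).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close it and keep scanning
    return orfs


print(find_orfs("CCATGAAATGGTAACC"))
```

An ATG with no in-frame stop before the end of the sequence is not reported, which is a deliberate simplification.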

What other types of information can be used?
- cDNAs & ESTs (experimental data, pairwise alignment)
- Homology (sequence comparison, BLAST)

Automated gene prediction strategies


1) Similarity-based or Comparative
- BLAST: do other organisms have a similar sequence? (Is the sequence similar to a known gene or protein?)

2) Ab initio = "from the beginning"
- Predict without explicit comparison with cDNA or proteins, via rule-based gene models (but the rules are derived from statistical analysis of datasets)

3) Combined "evidence-based"
- Combine gene models with alignment to known ESTs & protein sequences

BEST RESULTS? Combined

Examples of gene prediction software


1) Similarity-based or Comparative
- BLAST
- SGP2 (extension of GeneID)

2) Ab initio = "from the beginning"
- GeneID
- GENSCAN
- GeneMark.hmm

3) Combined "evidence-based"
- GeneSeqer (Brendel et al., ISU)

BEST? GENSCAN, GeneMark.hmm, GeneSeqer

but depends on organism & specific task

Signals: Pre-mRNA Splicing


[Figure: genomic DNA (with start and stop codons) is transcribed into pre-mRNA carrying a cap and poly(A) tail; introns are removed by splicing at the GT donor and AG acceptor splice sites, leaving the exons joined in the mRNA, which is translated into protein.]
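The GT donor / AG acceptor signals named in this figure can be searched for with a simple scan. The sketch below lists every GT ... AG span as a candidate intron; it is a naive illustration under the assumption that only these two dinucleotides are used, whereas real splice-site predictors rely on much richer statistical models. The name `candidate_introns` and the `min_len` parameter are hypothetical.

```python
def candidate_introns(seq: str, min_len: int = 4):
    """Return (start, end) spans that begin with GT and end with AG.

    Coordinates are 0-based and end-exclusive, so seq[start:end] is the
    candidate intron, including both splice-site dinucleotides.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


pre_mrna = "AAGTCCCCAGTT"
for start, end in candidate_introns(pre_mrna):
    spliced = pre_mrna[:start] + pre_mrna[end:]
    print(pre_mrna[start:end], "->", spliced)
```

Because GT and AG occur frequently by chance, most candidates on real sequence are false positives, which is why gene finders combine splice signals with other evidence.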

Post Transcription Splicing


[Figure: genomic DNA and the mature mRNA after splicing, showing the cap, 5'-UTR, start codon, stop codon, 3'-UTR and poly(A) tail.]

Horizontal Gene Transfer


- The movement of genetic material BETWEEN prokaryotes
- Common in prokaryotes
- Useful for environmental adaptation (better than point mutations)

Horizontal Gene Transfer


- Also called Lateral Gene Transfer (HGT and LGT for short)
- 3 ways to do it:
  - Transformation: naked DNA, short pieces, common in bacteria that transform
    (clay: 28 hrs; ocean surface: 45-83 hrs; ocean sediment: 235)
  - Transduction: phage, donor/recipient share receptors, closely related bacteria, DNA: amount in phage head
  - Conjugation: plasmids/transposons, cell-to-cell contact, distant relations, long DNA

PHYLOGENY

Homology & Similarity


Homology
Conserved sequences arising from a common ancestor
- Orthologs: homologous genes that share a common ancestor in the absence of any gene duplication (mouse and human hemoglobin)
- Paralogs: genes related through gene duplication (one gene is a copy of another, e.g. fetal and adult hemoglobin)

Similarity
Genes that share common sequences but are not necessarily related

Phylogenetics
What is Phylogenetics?
- Science of identifying and interpreting evolutionary relationships between biological entities (species, genes, etc.)

What is a phylogenetic tree?
- Dendrogram (tree) composed of nodes and branches representing the putative genealogy of the taxonomic units

Phylogenetic Trees
A graph representing the evolutionary history of sequences
- Relationship of sequences to one another (how everything is connected)
- Dissect the order of appearance of insertions, deletions, and mutations
- Identify related sequences, predict function, observe epidemiology (analyze changes in viral strains)

Tree Characteristics

Tree Properties
- Clade: all the descendants of a common ancestor represented by a node
- Distance: number of changes that have taken place along a branch

Tree Types
- Cladogram: shows the branching order of nodes
- Phylogram: shows branching order and distances

[Figure: example phylogram with branch lengths labelled]

Methods
- Distance-based
- Parsimony
- Maximum likelihood
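To make the distance-based approach concrete, the sketch below builds a tree topology with UPGMA, one simple distance-based clustering algorithm chosen here for illustration (the slides do not name a specific one). It assumes the sequences are already aligned and of equal length, uses the plain proportion of differing sites as the distance, and returns only the branching order as nested tuples, without branch lengths. All function names are illustrative.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs: dict):
    """Cluster named, aligned sequences; return the topology as nested tuples."""
    clusters = {name: 1 for name in seqs}          # cluster -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster is the size-weighted average.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


aligned = {
    "A": "AAAAAAAA",
    "B": "AAAAAAAT",
    "C": "AAAATTTT",
    "D": "AATTTTTT",
}
print(upgma(aligned))
```

UPGMA assumes a constant rate of change along all lineages; other distance-based methods such as neighbor joining relax that assumption.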

Levels of Protein Structure


- Primary (1°) structure: amino acid sequence of the protein
- Secondary (2°) structure: local structure (alpha helices or beta strands)
- Tertiary (3°) structure: 3-dimensional structure of the protein
- Quaternary (4°) structure: structure of a multiple-protein complex

Protein structures Prediction


- Protein structures can be determined experimentally (in most cases) by
  - X-ray crystallography
  - nuclear magnetic resonance (NMR)
- But this is very expensive and time-consuming
- Can we predict structures by computational means instead?

[Figure: PDB Content Growth]

Methods for Secondary Structure Prediction
- Chou-Fasman method
  - Based on the propensities of different amino acids to adopt different secondary structures
  - Predictions are made using a rules-based approach to identify groups of amino acids with shared secondary structure propensities
- Garnier, Osguthorpe, Robson (GOR) method
  - Statistical method of secondary structure prediction based on information theory & Bayesian probability
- Multiple Sequence Alignment (MSA) methods
  - Perform secondary structure prediction on a multiple sequence alignment as opposed to a single protein sequence
- Neural network-based methods
  - Example: Profile network from Heidelberg (PHD)
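The propensity idea behind the Chou-Fasman method can be illustrated with a sliding-window average. The sketch below is a heavy simplification, not the published algorithm: it uses only helix propensities, a fixed window, and a single threshold, and it omits the nucleation/extension rules and the sheet and turn tables. The propensity values are the commonly tabulated Chou-Fasman helix parameters and should be treated as approximate; `window=6` and `threshold=1.03` are assumptions for this example.

```python
# Approximate Chou-Fasman alpha-helix propensities (assumed values).
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def predict_helix(seq: str, window: int = 6, threshold: float = 1.03) -> str:
    """Label residues 'H' if any window covering them has a high mean propensity."""
    seq = seq.upper()
    labels = ["-"] * len(seq)
    for i in range(len(seq) - window + 1):
        scores = [HELIX_PROPENSITY[aa] for aa in seq[i:i + window]]
        if sum(scores) / window >= threshold:
            labels[i:i + window] = "H" * window   # mark the whole window
    return "".join(labels)


print(predict_helix("EEEEEEEE"))
print(predict_helix("GGGGGGGG"))
```

Sequences shorter than the window are returned unlabelled, and residues outside the 20 standard amino acids raise `KeyError`.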

Methods for Tertiary Structure Prediction


Tertiary (3D) Structure Prediction
- Homology modeling
- Fold recognition
- Protein threading
- Ab initio structure prediction

Quaternary Structure

DRUG DISCOVERY & DESIGN


Rational Approach to Drug Discovery
1) Identify target
2) Clone gene encoding target
3) Express target in recombinant form

DNA Microarrays

THANKS!
