18.417
Introduction to Computational Molecular Biology
Foundations of Structural Bioinformatics
Sebastian Will
MIT, Math Department
Credits: Slides borrow from slides of Jérôme Waldispühl
and Dominic Rose/Rolf Backofen
Fall 2011
S.Will, 18.417, Fall 2011
Before we start
Instructor: Sebastian Will
Contact: wills@mit.edu
Office hours: by appointment, Office: 2-155
Lecture: Tuesday, Thursday, 9:30-11:00 am
Room: 8-205
Web: http://math.mit.edu/classes/18.417/
(slides, further information)
Credits/Evaluation: no assignments, no exam, but Final Project
Final Project:
study a paper in depth, implement/extend an
algorithm, or give a theoretical proof
project report (2-4 pages), talk (20 min)
find a topic during term
What is Computational Molecular Biology (a.k.a. Bioinformatics)?
Short answer: the study of computational approaches to studying
biological systems (at the molecular level)
Today: somewhat longer answer, including
What are the components of biological systems?
How do they work together?
What is their chemistry and structure?
What is Structural Bioinformatics?
What can you learn in this course?
Which aspects do we want to study in Computational Biology?
Components of Biological Systems

Three classes of biological macromolecules:
DNA (= deoxyribonucleic acid)
RNA (= ribonucleic acid)
Protein
Single molecules are linear chains of building blocks, specified
by the sequence of their building blocks, e.g. ACTGGAGCGTC.

Molecules form 3D structures. Folding is a physical process
(energy minimization)
Levinthal paradox: fast folding despite a huge conformation space
Structure = Function, e.g. lock & key
Structure allows macromolecules to interact.
Information Flow: Central Dogma

Replication: DNA → DNA, Transcription: DNA → RNA, Translation: RNA → Protein
DNA: stores genetic information (e.g. in genome);
regular double helix structure
building blocks: 4 nucleotides A, C, G, and T
(Adenine, Cytosine, Guanine, Thymine)
RNA: intermediate for protein synthesis (messenger RNA),
catalytic and regulatory function (non-coding RNA)
building blocks: 4 nucleotides A, C, G, and U
(U = Uracil) and some rare other nucleotides
Protein: catalytic and regulatory function (enzymes)
building blocks: 20 amino acids + 1 rare aa
Genetic code
Transcription: A,C,G,T ↦ A,C,G,U
Translation: Triplets from alphabet {A,C,G,U} (= codons)
redundantly code for amino acids
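The two mappings of the genetic code can be sketched in a few lines. This is a minimal illustration, not part of the original slides: transcription is modeled as replacing T by U on the coding strand, and translation looks up codons in the standard genetic code. The names `transcribe`, `translate`, and `CODON_TABLE` are assumptions chosen for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(dna):
    """Map a DNA coding-strand sequence to RNA (T -> U)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate codon by codon until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

rna = transcribe("ATGGCCTGGTAA")
print(rna, translate(rna))
```

The table has 64 codons but only 20 amino acids plus stop, so several codons map to the same amino acid (for example, six codons encode leucine, L). That is the redundancy mentioned on the slide.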
Information Flow (Cell Compartments)
Important for molecular mechanism: complementarity of
nucleotides G-C, A-T, A-U
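Complementarity is what lets one strand determine its partner. As a small sketch (the function name `reverse_complement` and the `rna` flag are assumptions for illustration), the complementary strand of a sequence, read in 5' to 3' direction, can be computed as follows:

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(seq, rna=False):
    """Return the complementary strand, read 5' to 3'."""
    table = RNA_COMPLEMENT if rna else DNA_COMPLEMENT
    # Complement each base, then reverse because the strands are antiparallel.
    return seq.translate(table)[::-1]

print(reverse_complement("ACCGA"))            # DNA: A-T, G-C
print(reverse_complement("ACCGA", rna=True))  # RNA: A-U, G-C
```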
Protein Bio-Synthesis
Evolution
[Figure: phylogenetic tree (Animals, Fungi, Plants, Algae, Protozoa, bacterial and archaeal groups) annotated with diverging example sequences such as ACCGA, ACCTA, ACCCGA, TCCTA, ACTA]
variation (imperfect replication: point mutation, deletion,
insertion, ...)
selection
homologous sequences
What can we study (computationally)?
Evolutionary relation between homologous
molecules/fragments of molecules
Structural relation between molecules
Relation between sequence and structure
Interaction between molecules
Interaction networks, Regulatory networks, Metabolic networks
Structure of genomes, Relation between genomes
...
Areas of Bioinformatics
1. Genomics: Study of entire genomes.
Huge amount of data, fast algorithms,
limited to sequence.
2. Systems Biology: Study of complex interactions in biological systems. High
level of representation.
3. Structural Bioinformatics: Study of the
folding process of bio-molecules. Less
structural data than sequence data available, step toward function, fills gap between genomics and systems biology.
Some Organic Chemistry

Biological macromolecules (and most organic compounds) are built
from only a few different types of atoms
C Carbon, H Hydrogen, O Oxygen,
N Nitrogen, P Phosphorus, S Sulfur
CHNO: 99% of cell mass
Organic Chemistry = Chemistry of Carbon
Special properties of Carbon
small size
binds up to 4 other atoms
(tetrahedral conformation), e.g. Methane
strong covalent bonds
chains and rings
large, stable, complex molecules
[Figure: covalent bond between two hydrogen atoms (shared electron pair)]
Non-covalent bonds
Covalent
[Figure: covalent bond between two hydrogen atoms (shared electron pair)]
Non-covalent
Van der Waals (sum of the attractive or repulsive forces
between molecules, caused by correlations in the fluctuating
polarizations of nearby particles)
hydrogen bonds (attractive interaction of a hydrogen atom
with an electronegative atom)
ionic bonds (electrostatic attraction between two oppositely
charged ions, e.g. Na+ Cl-)
[Figure: energy scale in kcal/mol with ticks at 0.1, 1, 10, 100, 1000; labels: thermal movement, noncovalent bond, C-C bond, complete glucose oxidation]
Functional groups
organic molecules: carbon skeleton + functional groups
functional groups are involved in specific chemical reactions
Alcohol: hydroxyl group (-OH)
Ketone/Aldehyde: carbonyl group (C=O)
Carboxylic Acid: carboxyl group (-COOH)
Amine: amino group (-NH2)
Small organic molecules

Small: 30 atoms
4 families:
sugars: component of building blocks, main energy source
fats / fatty acids: cell membrane, energy source
amino acids: proteins
nucleotides: DNA + RNA, energy currency
Sugars
component of building blocks, main energy source
general formula (CH2O)n,
different lengths (e.g. n=5, n=6)
linear, cyclic
For example, saccharose (glucose+fructose):
[Figure: structural formula of saccharose]
Fats
Fat = Triglyceride of fatty acids
cell membrane (lipid bilayer), energy source
Amino Acids
all aa share the same basic build
aa differ in their side chains R
size
charge: positive/negative (acidic/basic)
hydrophobicity: hydrophobic/hydrophilic
in naturally occurring proteins: 21 different amino acids
Amino Acids
Nucleotides
[Figure: nucleotide structure: base linked to a pentose by a glycosidic bond (OH = ribose, H = deoxyribose); nucleoside, nucleotide monophosphate, nucleotide diphosphate, nucleotide triphosphate]
Purines: Adenine, Guanine
Pyrimidines: Cytosine, Uracil, Thymine
Nucleotides work as energy currency of metabolism
NTP → P + NDP + E
(split of nucleoside triphosphate into phosphate + nucleoside
diphosphate releases energy)
Complementarity of Organic Bases
[Figure: hydrogen bonds between the complementary bases Adenine-Thymine and Guanine-Cytosine]
DNA structure
Primary structure: chain of nucleotides
Tertiary Structure: antiparallel double helix
[Figure: two antiparallel DNA strands (5' end to 3' end) with phosphate-deoxyribose backbone and paired bases Adenine-Thymine and Guanine-Cytosine]
RNA primary structure similar, but
ribose not deoxyribose, U not T, single stranded
RNA structure
[Figure: 3D structures of a Hammerhead Ribozyme and a tRNA]
mainly stabilized by contacts between complementary bases
(H-bonds)
RNA secondary structure = set of base pairs
RNA secondary structure

set of pairs of (complementary) bases that form H-bonds
2D representation (typical tRNA clover-leaf)
[Figure: clover-leaf drawing of the tRNA secondary structure]
linear representation
GGGCGUGUGGCGUAGUCGGUAGCGCGCUCCCUUAGCAUGGAGAGGUCUCCGGUUCGAUUCCGGACACGCCCACCA
(((((((..((((........)))).(((((.......)).)))...(((((.......))))))))))))....
note: example is pseudoknot-free
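The linear (dot-bracket) representation encodes exactly the set of base pairs: each "(" pairs with its matching ")" and "." marks an unpaired base. A minimal sketch of reading it back into a set of pairs, with the hypothetical helper name `parse_dot_bracket` and 0-based positions as an assumption of this example:

```python
def parse_dot_bracket(structure):
    """Return the sorted list of base pairs (i, j), 0-based, with i < j."""
    stack, pairs = [], []
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)                  # remember the opening position
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            pairs.append((stack.pop(), j))   # pair with the most recent open '('
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)

print(parse_dot_bracket("((..)).(...)"))
```

A single bracket type can only describe nested (pseudoknot-free) structures, which is why the stack suffices here.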
Protein Primary Structure
Protein = chain of amino acids (AA)
aa connected by peptide bonds
and so on . . .
Protein Structure Formation / Folding
minimization of free energy

Forces between amino acid side chains
hydrophobic interaction
H-bonds
electro-static force
van-der-Waals force
disulfide bonds
Protein secondary structure: α-helix
Features:
3.6 amino acids per turn
hydrogen bond between
residues n and n + 4
local motif
approximately 40% of the
structure
Protein secondary structure: β-sheets
Features:
2 amino acids per turn
hydrogen bond between
residues of different strands
involve long-range
interactions
approximately 20% of the
structure
Protein secondary structure: Turns
Features:
up to 5 residues in length
hydrogen bonds depend on
type
local interactions
approximately 5-10% of the
structure
Protein structure hierarchy
DNA sequencing
A very incomplete overview
= determining the order of nucleotides in DNA
early 1970s: first DNA sequencing, but laborious
1977: Sanger Chain-Termination rapid sequencing
whole genome sequencing, 2001 draft version of Human
genome published
high throughput sequencing (454, Illumina/Solexa, . . . )
2011 sequencing of a human genome costs about USD 10,000
constant progress in technology (speed & accuracy)
RNA and protein sequences are usually inferred from DNA
Experimental Structure Determination

How can we know the 3D structure of a protein/RNA?
X-ray crystallography
Requires crystals of the macromolecule.
Often extremely difficult and time-intensive
X-rays sent through the crystal produce specific patterns
Angles and intensities allow reconstruction of the 3D electron density
From this, one can determine atom positions, bonds, etc.
Nuclear magnetic resonance spectroscopy (NMR)
uses phenomenon of nuclear magnetic resonance
only relatively small molecules
does not require crystals
measures distances between pairs of atoms within the molecule
structure has to be predicted using these constraints
Experimentally resolved structures are available in the Protein
Data Bank (PDB) in a machine-readable format.
The number of resolved structures grows exponentially, but
more slowly than the number of known sequences.
Topics of the Class
Sequence Alignment
pairwise alignment
Sequence A: ACGTGAACT
Sequence B: AGTGAGT
align A and B
Sequence A: ACGTGAACT
Sequence B: A-GTGA-GT
global and local alignment
multiple alignment (NP-complete; heuristics)
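Global pairwise alignment is classically solved by dynamic programming (Needleman-Wunsch). The sketch below only computes the optimal score, not the alignment itself. The scoring scheme (match +1, mismatch -1, gap -1) and the function name `global_alignment_score` are assumptions for illustration; the slides do not fix a scoring.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score of an optimal global alignment of a and b."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGTGAACT", "AGTGAGT"))
```

Under this scoring, the alignment shown above for sequences A and B has 6 matches, 1 mismatch, and 2 gaps, i.e. score 3, which is also the optimum the function returns. Keeping only two rows of the table uses linear memory; recovering the alignment would need the full table or a traceback.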
RNA Secondary Structure Prediction

Predict minimal free energy structure for single sequence
Predict minimal free energy structure for aligned sequences
Predict common structure and alignment for unaligned
sequences:
Simultaneous Alignment and Folding
        ((..((((((((...(((.................))).))))))))..))
fdhA    CGC-CACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAG-GUGG-CG  48
fwdB    AUG-UUGGAGGGGAACCCGU-------------AAGGGACCCUCCAAG-AU  36
selD    UUACGAUGUGCCGAACCCUU------------UAAGGGAGGCACAUCGAAA  39
vhuD    GU--UCUCUCGGGAACCCGU------------CAAGGGACCGAGAGA--AC  35
vhuU    AGC-UCACAACCGAACCCAU-------------UUGGGAGGUUGUGAG-CU  36
fruA    CC--UCGAGGG-GAACCCGA-------------AA-GGGACCCGAGA--GG  32
hdrA    GG--CACCACUCGAAGGCUA-------------AG-CCAAAGUGGUG--CU  33
        .........10........20........30........40........50
[Figure: clover-leaf drawing of the tRNA secondary structure]
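Structure prediction for a single sequence is a dynamic program over subsequences. As a simplified stand-in for free energy minimization, the sketch below maximizes the number of base pairs (Nussinov-style) instead of minimizing an energy. The function name `max_base_pairs`, the allowed pairs (Watson-Crick plus G-U), and the minimum hairpin loop size of 3 are assumptions for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov-style recursion)."""
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            result = best[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with some k
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    result = max(result, left + 1 + best[k + 1][j - 1])
            best[i][j] = result
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))
```

The recursion takes cubic time and quadratic space in the sequence length. Energy-based methods follow the same decomposition but score loops rather than counting pairs.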
Studying the Structure Ensemble of an RNA

Prediction of the structure ensemble
probabilities of structures
probabilities of structure elements and features
Suboptimal Structures
Shape Abstraction of RNA Structure
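One common way to assign probabilities to the structures of an ensemble is the Boltzmann distribution, where a structure with free energy E has weight exp(-E/RT). The slides do not spell out the model, so treating it as Boltzmann-distributed is an assumption of this sketch, and the three structures and their energies are made-up values for illustration only.

```python
import math

def boltzmann_probabilities(energies, temperature=310.15):
    """Map structure -> probability from free energies given in kcal/mol."""
    gas_constant = 0.0019872  # kcal / (mol * K)
    rt = gas_constant * temperature
    weights = {s: math.exp(-e / rt) for s, e in energies.items()}
    partition_function = sum(weights.values())  # normalizing constant Z
    return {s: w / partition_function for s, w in weights.items()}

# Hypothetical energies for three structures of one short sequence.
example = {"((...))": -2.0, "(.....)": -0.5, ".......": 0.0}
for structure, p in boltzmann_probabilities(example).items():
    print(structure, round(p, 3))
```

Explicit enumeration like this only works for tiny examples. For real sequences the number of structures grows exponentially, so the partition function has to be computed by dynamic programming.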
RNA Pseudoknot Prediction

Usually: for RNA structure analysis, assume no pseudoknots
Pseudoknot (PK) prediction is NP-complete
Efficient PK prediction for restricted classes of PKs
[Figure: example RNA structure with a pseudoknot]
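A structure contains a pseudoknot when two of its base pairs cross, i.e. pairs (i, j) and (k, l) with i < k < j < l. A minimal check for this condition, with the hypothetical helper name `has_pseudoknot` and base pairs given as index tuples:

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """True if two base pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    # Normalize each pair to (smaller, larger) and sort by opening position.
    ordered = sorted((min(p), max(p)) for p in pairs)
    return any(i < k < j < l for (i, j), (k, l) in combinations(ordered, 2))

print(has_pseudoknot([(0, 8), (1, 7), (10, 15)]))  # nested and adjacent pairs
print(has_pseudoknot([(0, 5), (3, 8)]))            # crossing pairs
```

This pairwise test is quadratic in the number of base pairs, which is fine for illustration; it only detects pseudoknots and says nothing about how to predict them.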
RNA-RNA Interaction
Prediction of interaction complex of two RNAs
Similar to Pseudoknot-prediction, the unrestricted problem is
NP-complete
Efficient variants exist for restricted types of interaction
RNA 3D Structure Modeling

De-novo prediction of 3D structure from sequence
MC-Fold / MC-Sym
MC-Fold predicts secondary structure
including non-canonical base pairs
MC-Sym builds tertiary from secondary structure
Stochastic Context-Free Grammars
SCFGs are a generalization of
HMMs, which can model
secondary structure
Consensus Models for
describing RNA families.
Tool Infernal scans database for
family members
input multiple alignment:
[structure] . : : <<< _ _ _ _ > - >> : << - < . _ _ _ . >>> .
human . A A G A C U U C G G A U C U G G C G . A C A . C C C .
mouse a U A C A C U U C G G A U G - C A C C . A A A . G U G a
orc . A G G U C U U C - G C A C G G G C A g C C A c U U C .
[Figure: example structure, guide tree (ROOT, MATL, MATP, MATR, BIF, BEGL, BEGR, END nodes) and model states (S, MP, ML, MR, D, IL, IR, B, E) with "split sets" and inserts]
De-novo Prediction of Structural RNA
scan whole genome
alignments for potential
structural RNA
structural stability
conservation of structure
Fast methods: RNAz,
EvoFold
Protein Structure Prediction

De-novo Protein Structure Prediction
Homology-based prediction: Protein Threading
Protein-Protein Interaction
3D Lattice Protein Models

protein structure prediction is NP-complete even in simple
protein models
optimal ab-initio prediction in HP-lattice protein models (3D
cubic and fcc)
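In the HP model each residue is either hydrophobic (H) or polar (P), a conformation is a self-avoiding walk on the lattice, and the energy is -1 for every pair of H residues that are lattice neighbors without being consecutive in the chain. A minimal sketch of this energy function for the 3D cubic lattice only (the fcc lattice is not covered); the function name `hp_energy` and the coordinate representation are assumptions of this example.

```python
def hp_energy(sequence, coords):
    """HP-model energy of a conformation on the 3D cubic lattice.

    sequence: string over {'H', 'P'}; coords: one (x, y, z) tuple per residue.
    """
    if len(sequence) != len(coords):
        raise ValueError("need one coordinate per residue")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    for p, q in zip(coords, coords[1:]):
        if sum(abs(x - y) for x, y in zip(p, q)) != 1:
            raise ValueError("consecutive residues must be lattice neighbors")
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbors
            dist = sum(abs(x - y) for x, y in zip(coords[i], coords[j]))
            if sequence[i] == sequence[j] == "H" and dist == 1:
                energy -= 1  # one H-H contact
    return energy

square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(hp_energy("HPPH", square))
```

Structure prediction in this model means finding the self-avoiding walk of minimal energy; the function above only evaluates one given conformation.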
Beyond Energy Minimization:
Kinetics of Protein and RNA folding
Predicting Protein Folding-Pathways (Motion Planning)
Modeling of Folding as Markov Process, Energy Landscapes
Simulated and Exact Folding Kinetics
