
18.417
Introduction to Computational Molecular Biology
Foundations of Structural Bioinformatics
Sebastian Will
MIT, Math Department

Credits: Slides borrow from slides of Jérôme Waldispühl and Dominic Rose/Rolf Backofen

S.Will, 18.417, Fall 2011

Fall 2011

Before we start
Instructor: Sebastian Will
Contact: wills@mit.edu
Office hours: by appointment, Office: 2-155
Lecture: Tuesday, Thursday, 9:30-11:00 am
Room: 8-205
Web: http://math.mit.edu/classes/18.417/
(slides, further information)

Credits/Evaluation: no assignments, no exam, but Final Project

Final Project:
study paper in depth, implement/extend algorithm, or theoretical proof
project report (2-4 pages), talk (20 min)
find a topic during term

What is Computational Molecular Biology (a.k.a. Bioinformatics)?

Short answer: the study of computational approaches to studying biological systems (at the molecular level)
Today: somewhat longer answer, including
What are the components of biological systems?
How do they work together?
What is their chemistry and structure?
Which aspects do we want to study in Computational Biology?
What is Structural Bioinformatics?
What can you learn in this course?

Components of Biological Systems

Three classes of biological macromolecules:
DNA (= deoxyribonucleic acid)
RNA (= ribonucleic acid)
Protein
Single molecules are linear chains of building blocks, specified by the sequence of their building blocks, e.g. ACTGGAGCGTC.
Molecules form 3D structures. Folding is a physical process (minimize energy)
Levinthal Paradox: fast folding but huge conformation space
Structure = Function, e.g. lock & key
Structure allows macromolecules to interact.
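
To get a feeling for the Levinthal paradox, here is a rough back-of-the-envelope sketch. The three inputs (chain length, conformations per residue, sampling rate) are assumed illustration values, not numbers from the slides.

```python
# Back-of-the-envelope Levinthal estimate. All three inputs are assumptions
# chosen for illustration only.
RESIDUES = 100              # assumed chain length
STATES_PER_RESIDUE = 3      # assumed number of conformations per residue
SAMPLES_PER_SECOND = 1e13   # assumed rate at which conformations are tried

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(residues, states, rate):
    """Years needed to try all states**residues conformations at the given rate."""
    conformations = states ** residues
    return conformations / rate / SECONDS_PER_YEAR


years = exhaustive_search_years(RESIDUES, STATES_PER_RESIDUE, SAMPLES_PER_SECOND)
print(f"{STATES_PER_RESIDUE ** RESIDUES:.2e} conformations, about {years:.1e} years")
```

Under these assumptions a blind search would take vastly longer than the age of the universe, while real proteins fold in fractions of a second to minutes. Folding therefore cannot be an exhaustive search.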

Information Flow: Central Dogma

DNA (Replication) → Transcription → RNA → Translation → Protein

DNA: store genetic information (e.g. in genome); regular double helix structure
building blocks: 4 nucleotides A, C, G, and T (Adenine, Cytosine, Guanine, Thymine)
RNA: intermediate for protein synthesis (messenger RNA), catalytic and regulatory function (non-coding RNA)
building blocks: 4 nucleotides A, C, G, and U (U = Uracil) and some rare other nucleotides
Protein: catalytic and regulatory function (enzymes)
building blocks: 20 amino acids + 1 rare aa

Genetic code
Transcription: A,C,G,T → A,C,G,U
Translation: Triplets from alphabet {A,C,G,U} (= codons) redundantly code for amino acids
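
The two mappings can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: transcription is modeled as the letter replacement T → U on the coding strand, translation uses the standard genetic code only, and it starts at the first position and stops at the first stop codon. The function names are made up for this example.

```python
BASES = "UCAG"
# Standard genetic code with codons ordered UUU, UUC, UUA, UUG, UCU, ...
# ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(dna):
    """Map the DNA alphabet {A,C,G,T} to the RNA alphabet {A,C,G,U}."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate codon by codon from position 0 until the first stop codon."""
    protein = []
    for pos in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate(transcribe("ATGGCCTGGTAA")))
```

The redundancy is visible in the table itself: 64 codons map to 20 amino acids plus stop, so for example leucine (L) is encoded by six different codons.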

Information Flow (Cell Compartments)

Important for molecular mechanism: complementarity of nucleotides G-C, A-T, A-U
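
Complementarity is easy to make concrete. The following small sketch (helper names are hypothetical) computes the reverse complement of a DNA sequence, i.e. the sequence of the opposite strand read in its own 5' to 3' direction.

```python
# Watson-Crick complements for DNA (G-C and A-T).
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna):
    """Return the complementary strand, read in the opposite direction."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna))


print(reverse_complement("ACTGGAGCGTC"))
```

Applying the function twice gives back the original sequence, which mirrors the fact that each strand of the double helix determines the other.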

Protein Bio-Synthesis

Evolution

[Figure: tree of life (Animals, Fungi, Plants, Algae, Protozoa, Slime moulds, bacterial groups such as Proteobacteria and Cyanobacteria, archaeal groups such as Crenarchaeota and Euryarchaeota), annotated with diverging example sequences ACCGA, ACCTA, ACCCGA, TCCTA, ACTA]

variation (imperfect replication: point mutation, deletion, insertion, ... )
selection
homologous sequences

What can we study (computationally)?

Evolutionary relation between homologous molecules/fragments of molecules
Structural relation between molecules
Relation between sequence and structure
Interaction between molecules
Interaction networks, Regulatory networks, Metabolic networks
Structure of genomes, Relation between genomes
...

Areas of Bioinformatics
1. Genomics: Study of entire genomes. Huge amount of data, fast algorithms, limited to sequence.
2. Systems Biology: Study of complex interactions in biological systems. High level of representation.
3. Structural Bioinformatics: Study of the folding process of bio-molecules. Less structural data than sequence data available, step toward function, fills gap between genomics and systems biology.

Some Organic Chemistry

Biological macromolecules (and most organic compounds) are built from only a few different types of atoms
C Carbon, H Hydrogen, O Oxygen, N Nitrogen, P Phosphorus, S Sulfur
CHNO: 99% of cell mass
Organic Chemistry = Chemistry of Carbon
Special properties of Carbon
small size
binds up to 4 other atoms (tetrahedron conformation), e.g. Methane
strong covalent bonds
chains and rings
large, stable, complex molecules

[Figure: covalent bond, two hydrogen atoms (H H) form HH; the two nuclei (+1 each) share 2 electrons]

Non-covalent bonds

Covalent
[Figure: covalent bond between two hydrogen atoms sharing 2 electrons]

Non-covalent
Van der Waals (sum of the attractive or repulsive forces between molecules, caused by correlations in the fluctuating polarizations of nearby particles)
hydrogen bonds (attractive interaction of a hydrogen atom with an electronegative atom)
ionic bonds (electrostatic attraction between two oppositely charged ions, e.g. Na+ Cl-)

[Figure: energy scale in kcal/mol, logarithmic axis from 0.1 to 1000, with marks for thermal movement, noncovalent bond, C-C bond, and complete glucose oxidation]

Functional groups
organic molecules: carbon skeleton + functional groups
functional groups are involved in specific chemical reactions
Alcohol: hydroxyl group (-OH)
Ketone/Aldehyde: carbonyl group (C=O)
Carboxylic Acid: carboxyl group (-COOH)
Amine: amino group (-NH2)

Small organic molecules

Small: 30 atoms
4 families:
sugars: component of building blocks, main energy source
fats / fatty acids: cell membrane, energy source
amino acids: proteins
nucleotides: DNA + RNA, energy currency

Sugars
component of building blocks, main energy source
general formula (CH2O)n, different lengths (e.g. n=5, n=6)
linear, cyclic
For example, saccharose (glucose + fructose):
[Figure: structural formula of saccharose]

Fats

cell membrane (lipid bilayer), energy source
Fat = Triglyceride of fatty acids

Amino Acids

all amino acids (aa) share the same basic build
aa differ in side chains R
size
charge: positive/negative (acidic/basic)
hydrophobicity: hydrophobic/hydrophilic
in naturally occurring proteins: 21 different amino acids

Amino Acids

Nucleotides

[Figure: nucleotide structure, a base attached to a pentose by a glycosidic bond (OH = ribose, H = deoxyribose); nucleoside, nucleotide monophosphate, nucleotide diphosphate, nucleotide triphosphate]
Purines: Adenine, Guanine
Pyrimidines: Cytosine, Uracil, Thymine

Nucleotides work as energy currency of metabolism
NTP → P + NDP + E
(split of nucleoside triphosphate into phosphate + nucleoside diphosphate releases energy)

Complementarity of Organic Bases

[Figure: hydrogen bonds between the complementary bases Adenine-Thymine and Guanine-Cytosine]

DNA structure
Primary structure: chain of nucleotides
Tertiary Structure: antiparallel double helix
[Figure: chemical structure of two antiparallel DNA strands, each running from 5' end to 3' end, with phosphate-deoxyribose backbone and the base pairs Adenine-Thymine and Guanine-Cytosine]
RNA primary structure similar, but ribose not deoxyribose, U not T, single stranded

RNA structure

[Figures: Hammerhead Ribozyme, tRNA]
mainly stabilized by contacts between complementary bases (H-bonds)
RNA secondary structure = set of base pairs

RNA secondary structure

set of pairs of (complementary) bases that form H-bonds
2D representation (typical tRNA clover-leaf)
[Figure: clover-leaf drawing of the tRNA example]
linear representation
GGGCGUGUGGCGUAGUCGGUAGCGCGCUCCCUUAGCAUGGAGAGGUCUCCGGUUCGAUUCCGGACACGCCCACCA
(((((((..((((........)))).(((((.......)).)))...(((((.......))))))))))))....
note: example is pseudoknot-free
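
Converting the linear (dot-bracket) representation back into a set of base pairs takes a single pass with a stack. This is a minimal sketch; the function name and the 0-based positions are choices made for this example.

```python
def parse_dot_bracket(structure):
    """Return the base pairs (i, j), 0-based with i < j, of a dot-bracket string."""
    stack, pairs = [], []
    for position, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(position)          # remember the opening partner
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {position}")
            pairs.append((stack.pop(), position))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


print(parse_dot_bracket("((..))"))
```

A single bracket type can only encode nested (pseudoknot-free) structures, which is why the note on the example matters.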

Protein Primary Structure

Protein = chain of amino acids (AA)
aa connected by peptide bonds
and so on . . .

Protein Structure Formation / Folding

minimization of free energy
Forces between amino acid side chains
hydrophobic interaction
H-bonds
electro-static force
van-der-Waals force
disulfide bonds

Protein secondary structure: α-helix

Features:
3.6 amino acids per turn
hydrogen bond between residues n and n + 4
local motif
approximately 40% of the structure

Protein secondary structure: β-sheets

Features:
2 amino acids per turn
hydrogen bond between residues of different strands
involve long-range interactions
approximately 20% of the structure

Protein secondary structure: Turns

Features:
Up to 5 residue length
hydrogen bonds depend on type
local interactions
approximately 5-10% of the structure

Protein structure hierarchy

DNA sequencing
A very incomplete overview

= determining the order of nucleotides in DNA
early 1970s: first DNA sequencing, but laborious
1977: Sanger Chain-Termination rapid sequencing
whole genome sequencing, 2001 draft version of Human genome published
high throughput sequencing (454, Illumina/Solexa, . . . )
2011 sequencing of a human genome costs about USD 10,000
constant progress in technology (speed & accuracy)
RNA and protein sequences are usually inferred from DNA

Experimental Structure Determination

How can we know the 3D structure of a protein/RNA?
X-ray crystallography
Requires crystals of the macromolecule.
Often extremely difficult and time-intensive
X-rays sent through the crystal produce specific patterns
Angles and intensities allow reconstruction of the 3D electron density
From this, one can determine atom positions, bonds, etc.
Nuclear magnetic resonance spectroscopy (NMR)
uses phenomenon of nuclear magnetic resonance
only relatively small molecules
does not require crystals
measures distances between pairs of atoms within the molecule
structure has to be predicted using these constraints

Experimentally resolved structures are available in the Protein Data Bank (PDB) in a machine-readable format.
The number of resolved structures grows exponentially, but slower than that of known sequences.

Topics of the Class

Sequence Alignment
pairwise alignment
Sequence A: ACGTGAACT
Sequence B: AGTGAGT
align A and B
Sequence A: ACGTGAACT
Sequence B: A-GTGA-GT
global and local alignment
multiple alignment (NP-complete, hence heuristics)
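
As a preview, global pairwise alignment can be computed by dynamic programming (Needleman-Wunsch). The sketch below uses an assumed scoring scheme (match +1, mismatch -1, linear gap cost -1) and returns one optimal alignment; other optimal alignments may exist.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: return (score, aligned_a, aligned_b) for one optimum."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = global_align("ACGTGAACT", "AGTGAGT")
print(best)
print(top)
print(bottom)
```

With this scoring, the alignment shown on the slide (6 matches, 1 mismatch, 2 gaps) is optimal. Local alignment changes only the initialization, adds 0 to the maximum, and starts the traceback at the best cell.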

RNA Secondary Structure Prediction

Predict minimal free energy structure for single sequence
Predict minimal free energy structure for aligned sequences
Predict common structure for alignment for unaligned sequences: Simultaneous Alignment and Folding
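
To illustrate the dynamic-programming idea behind single-sequence prediction, here is a sketch of base-pair maximization (Nussinov-style). Note the swap: this is a simplified stand-in that counts base pairs, not a minimum free energy method with a realistic energy model. The allowed pairs (including G-U) and the minimum hairpin loop size of 3 are assumptions of the example.

```python
# Allowed pairs: Watson-Crick plus the G-U wobble pair (an assumption here).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs with at least min_loop unpaired
    bases enclosed by every hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]; 0 for short spans
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            value = best[i][j - 1]                 # case: j stays unpaired
            for k in range(i, j - min_loop):       # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    value = max(value, left + 1 + best[k + 1][j - 1])
            best[i][j] = value
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

The table has O(n^2) entries and each takes O(n) time, giving O(n^3) overall. Energy-based folding algorithms keep this recursion pattern but score loops instead of counting pairs.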

        ((..((((((((...(((.................))).))))))))..))
fdhA    CGC-CACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAG-GUGG-CG   48
fwdB    AUG-UUGGAGGGGAACCCGU-------------AAGGGACCCUCCAAG-AU   36
selD    UUACGAUGUGCCGAACCCUU------------UAAGGGAGGCACAUCGAAA   39
vhuD    GU--UCUCUCGGGAACCCGU------------CAAGGGACCGAGAGA--AC   35
vhuU    AGC-UCACAACCGAACCCAU-------------UUGGGAGGUUGUGAG-CU   36
fruA    CC--UCGAGGG-GAACCCGA-------------AA-GGGACCCGAGA--GG   32
hdrA    GG--CACCACUCGAAGGCUA-------------AG-CCAAAGUGGUG--CU   33
        .........10........20........30........40........50

[Figure: 2D drawing of an RNA secondary structure]

Studying the Structure Ensemble of an RNA

Prediction of the structure ensemble
probabilities of structures
probabilities of structure elements and features
Suboptimal Structures
Shape Abstraction of RNA Structure
[Figure: plot with the sequence GGGCGUGUGGCGUAGUCGGUAGCGCGCUCCCUUAGCAUGGAGAGGUCUCCGGUUCGAUUCCGGACACGCCCACCA along its axes]
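
Probabilities of structures are usually defined through the Boltzmann distribution: a structure with free energy E gets weight exp(-E/RT), normalized by the sum over all structures (the partition function). The sketch below applies this to a tiny, explicitly enumerated set of structures. The energies are made-up illustration values, and the real task of summing over all structures without enumerating them is not shown.

```python
import math

RT = 0.6163  # kcal/mol at 37 °C (gas constant times 310.15 K)


def boltzmann_probabilities(energies, rt=RT):
    """Map each structure to exp(-E/RT) / Z for energies given in kcal/mol."""
    weights = {s: math.exp(-energy / rt) for s, energy in energies.items()}
    partition_function = sum(weights.values())
    return {s: weight / partition_function for s, weight in weights.items()}


# Hypothetical energies for three structures of one 12-nucleotide sequence.
example = {
    "((((....))))": -3.0,
    "(((......)))": -2.0,
    "............": 0.0,
}
for structure, probability in boltzmann_probabilities(example).items():
    print(structure, round(probability, 3))
```

Even a structure only 1 kcal/mol above the optimum keeps a noticeable share of the probability, which is the motivation for looking at suboptimal structures and ensemble features instead of a single prediction.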

RNA Pseudoknot Prediction

Usually: for RNA structure analysis, assume no pseudoknots
Pseudoknot (PK) prediction is NP-complete
Efficient PK prediction from restricted classes of PKs
[Figure: example RNA structure with a pseudoknot]
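
In terms of base pairs, a pseudoknot means two pairs cross: (i, j) and (k, l) with i < k < j < l. A small sketch of that test follows (hypothetical helper; it assumes every pair is given as (i, j) with i < j, and the quadratic loop is for clarity, not speed).

```python
def has_pseudoknot(pairs):
    """True if any two base pairs (i, j) and (k, l) cross, i.e. i < k < j < l."""
    ordered = sorted(pairs)
    for index, (i, j) in enumerate(ordered):
        # Pairs later in the sorted list open at or after i.
        for k, l in ordered[index + 1:]:
            if i < k < j < l:
                return True
    return False


print(has_pseudoknot([(0, 5), (1, 4)]))   # nested pairs
print(has_pseudoknot([(0, 4), (2, 6)]))   # crossing pairs
```

Nested and side-by-side pairs are exactly what the standard folding recursions can decompose. Crossing pairs break that decomposition, which is why general pseudoknot prediction is so much harder.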

RNA-RNA Interaction
Prediction of interaction complex of two RNAs
Similar to Pseudoknot-prediction, the unrestricted problem is NP-complete
Efficient variants exist for restricted types of interaction

RNA 3D Structure Modeling

De-novo prediction of 3D structure from sequence
MC-Fold / MC-Sym
MC-Fold predicts secondary structure including non-canonical base pairs
MC-Sym builds tertiary from secondary structure
[Figure: MC-Sym]

Stochastic Context-Free Grammars

SCFGs are a generalization of HMMs, which can model secondary structure
Consensus Models for describing RNA families.
Tool Infernal scans database for family members

input multiple alignment:
[structure]  . : : < < < _ _ _ _ > - > > : < < - < . _ _ _ . > > > .
human        . A A G A C U U C G G A U C U G G C G . A C A . C C C .
mouse        a U A C A C U U C G G A U G - C A C C . A A A . G U G a
orc          . A G G U C U U C - G C A C G G G C A g C C A c U U C .

[Figure: example structure, guide tree built from the alignment (nodes ROOT, MATL, MATP, MATR, BIF, BEGL, BEGR, END), and its expansion into model states (S, IL, IR, ML, MR, MP, D, B, E) with "split sets" and inserts]

De-novo Prediction of Structural RNA

scan whole genome alignments for potential structural RNA
structural stability
conservation of structure
Fast methods RNAz, EvoFold

Protein Structure Prediction


De-novo Protein Structure Prediction
Homology-based prediction: Protein Threading


Protein-Protein Interaction

3D Lattice Protein Models

protein structure prediction is NP-complete even in simple protein models
optimal ab-initio prediction in HP-lattice protein models (3D cubic and fcc)
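
For orientation, the energy function of the HP model is simple to state: every amino acid is either hydrophobic (H) or polar (P), the chain is a self-avoiding walk on the lattice, and each pair of H monomers that are lattice neighbors but not chain neighbors contributes -1. The sketch below evaluates this for the cubic lattice only (fcc is not covered); the function name and input format are assumptions of the example.

```python
def hp_energy(sequence, coords):
    """HP-model energy on the 3D cubic lattice: -1 per H-H contact between
    monomers that are lattice neighbors but not adjacent in the chain."""
    if len(sequence) != len(coords):
        raise ValueError("need exactly one lattice position per monomer")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")

    def are_neighbors(p, q):
        # Unit Manhattan distance means adjacent points on the cubic lattice.
        return sum(abs(x - y) for x, y in zip(p, q)) == 1

    for p, q in zip(coords, coords[1:]):
        if not are_neighbors(p, q):
            raise ValueError("consecutive monomers must be lattice neighbors")

    contacts = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H" \
                    and are_neighbors(coords[i], coords[j]):
                contacts += 1
    return -contacts


# A four-monomer chain folded into a square: the two H ends touch.
print(hp_energy("HPPH", [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]))
```

Structure prediction in this model asks for the self-avoiding walk of minimal energy. Evaluating one conformation is easy, whereas finding the optimum over all walks is the hard part.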

Beyond Energy Minimization: Kinetics of Protein and RNA folding

Predicting Protein Folding-Pathways (Motion Planning)
Modeling of Folding as Markov Process, Energy Landscapes
Simulated and Exact Folding Kinetics
