Computational Biology, Part 2 Sequence Comparison with Dot Matrices

Robert F. Murphy Copyright 1996, 1999-2006. All rights reserved.

Sequence Alignment

Definition: A procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences

Pair-wise alignment: compare two sequences
Multiple sequence alignment: compare more than two sequences

Example sequence alignment


Task: align abcdef with abdgf
Write second sequence below the first

abcdef
abdgf

Move sequences to give maximum match between them
Show characters that match using vertical bar

Example sequence alignment


abcdef
||
abdgf

Insert gap between b and d on lower sequence to allow d and f to align

Example sequence alignment


abcdef
|| | |
ab-dgf


Note that e and g do not match
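The vertical-bar line in the example can be generated mechanically once the gap has been placed. The sketch below is only an illustration: the function name match_line and the use of "-" as the gap character are assumptions made here, not part of the lecture.

```python
def match_line(top, bottom, gap="-"):
    """Return a marker line with '|' under each column where the two
    aligned sequences carry the same (non-gap) character."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != gap else " "
        for a, b in zip(top, bottom)
    )


if __name__ == "__main__":
    top, bottom = "abcdef", "ab-dgf"
    print(top)
    print(match_line(top, bottom))  # prints "|| | |"
    print(bottom)
```

The function only marks identities in an alignment that already exists; deciding where to put the gap is the actual alignment problem.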

Matching Similarity vs. Identity


Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
More on how to define similarity later

Global vs. Local Alignment

We distinguish
Global alignment algorithms, which optimize overall alignment between two sequences
Local alignment algorithms, which seek only relatively conserved pieces of sequence
  Alignment stops at the ends of regions of strong similarity
  Favors finding conserved patterns in otherwise different pairs of sequences

Global vs. Local Alignment

Global

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local

--------GKG--------
        |||
--------GKG--------

Global vs. Local Alignment

Global

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local

-------TGKG--------
        |||
-------AGKG--------

Why do sequence alignments?


To find whether two (or more) genes or proteins are evolutionarily related to each other
To find structurally or functionally similar regions within proteins

Origin of similar genes


Similar genes arise by gene duplication
Copy of a gene inserted next to the original
Two copies mutate independently
Each can take on separate functions
All or part can be transferred from one part of genome to another

Methods for Pairwise Alignment


Dot matrix analysis
Dynamic programming
Word or k-tuple methods (FASTA and BLAST)

Sequence comparison with dot matrices

Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)

Sequence comparison with dot matrices

Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
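A minimal sketch of the basic method, assuming plain Python strings as sequences. The function names dot_matrix and render, and the choice of "*" for a dot and "." for an empty cell, are illustrative choices made here rather than anything prescribed by the lecture.

```python
def dot_matrix(seq_top, seq_left):
    """Build the grid: one row per element of seq_left, one column per
    element of seq_top. A cell is True only where the two elements are
    identical."""
    return [[a == b for b in seq_top] for a in seq_left]


def render(seq_top, seq_left):
    """Return a text picture of the grid ('*' = dot, '.' = no dot)."""
    grid = dot_matrix(seq_top, seq_left)
    lines = ["  " + seq_top]
    for ch, row in zip(seq_left, grid):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # The two sequences from the earlier alignment example.
    print(render("abcdef", "abdgf"))
```

With these two sequences the dots for a and b lie on one diagonal and the dots for d and f on the next diagonal over, which is the one-position gap seen in the alignment example.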

Examples for protein sequences


(Demonstration A6, Sequence 1 vs. 2)
(Demonstration A6, Sequence 2 vs. 3)

Interpretation of dot matrices


Regions of similarity appear as diagonal runs of dots
Reverse diagonals (perpendicular to diagonal) indicate inversions
Reverse diagonals crossing diagonals (Xs) indicate palindromes
  (Demonstration A6, Sequence 4 vs. 4)

Interpretation of dot matrices

Can link or "join" separate diagonals to form alignment with "gaps"
  Each a.a. or base can only be used once
    Can't trace vertically or horizontally
    Can't double back
  A gap is introduced by each vertical or horizontal skip

Uses for dot matrices


Can use dot matrices to align two proteins or two nucleic acid sequences
Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
  (Demonstration A6, Sequence 5 vs. 6)

Uses for dot matrices


Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
Excellent approach for finding sequence transpositions
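The self base-pairing comparison can be sketched as follows. This is an illustration under stated assumptions: only Watson-Crick pairs (A-U, G-C) are considered and G-U wobble pairs are ignored, the sequence is upper-case RNA, and the names reverse_complement and self_pairing_dots are hypothetical.

```python
# Assumption: Watson-Crick pairing only (A-U, G-C); no G-U wobble pairs.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement every base, then reverse the result."""
    return rna.translate(COMPLEMENT)[::-1]


def self_pairing_dots(rna):
    """Dot matrix of an RNA (rows) against its reverse complement (columns).

    With n = len(rna), a dot at (i, j) means base i could pair with
    base n - 1 - j of the same molecule.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]


if __name__ == "__main__":
    hairpin = "GGGAAAUCCC"  # made-up stem-loop used only for illustration
    for base, row in zip(hairpin, self_pairing_dots(hairpin)):
        print(base, "".join("*" if hit else "." for hit in row))
```

In this made-up hairpin, the three G's at the 5' end can pair with the three C's at the 3' end, so the stem shows up as a diagonal run of dots.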

Filtering to remove noise


A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A matching another A)
Solution: use a window and a threshold
  compare character by character within a window (have to choose window size)
  require a certain fraction of matches within the window in order to display it with a dot
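The window-and-threshold filter can be sketched as below. Two assumptions are made here that the slides do not specify: the window is anchored at its first position (some programs center it instead), and the threshold is given as a minimum match count (called stringency here) rather than a fraction. The function name windowed_dot_matrix is hypothetical.

```python
def windowed_dot_matrix(seq_top, seq_left, window, stringency):
    """Place a dot at (i, j) only if at least `stringency` of the `window`
    positions along the diagonal starting at (i, j) are identical.

    Assumption: the window starts at (i, j) rather than being centered
    on it, so the grid shrinks by window - 1 in each dimension.
    """
    rows = max(len(seq_left) - window + 1, 0)
    cols = max(len(seq_top) - window + 1, 0)
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identities along the diagonal inside the window.
            matches = sum(
                seq_left[i + k] == seq_top[j + k] for k in range(window)
            )
            row.append(matches >= stringency)
        grid.append(row)
    return grid


if __name__ == "__main__":
    for row in windowed_dot_matrix("abcdef", "abdgf", window=2, stringency=2):
        print("".join("*" if hit else "." for hit in row))
```

With a window of 1 and a stringency of 1 this reduces to the unfiltered dot matrix. Larger windows suppress isolated single-character matches and keep only runs along a diagonal.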

Example spreadsheet with window

(Demonstration A7)

How do we choose a window size?

Window size changes with goal of analysis
  size of average exon
  size of average protein structural element
  size of gene promoter
  size of enzyme active site

How do we choose a threshold value?

Threshold based on statistics
  using shuffled actual sequence
    find average (m) and s.d. (σ) of match scores of shuffled sequence
    convert original (unshuffled) scores (x) to Z scores
      Z = (x - m)/σ
    use threshold Z of 3 to 6
  using analysis of other sets of sequences
    provides objective standard of significance
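The shuffling procedure can be sketched for a single pair of windows as follows. This is a simplified illustration, not the lecture's exact protocol: it assumes the match score is a plain count of identical positions, shuffles only the second window, and uses a fixed random seed so the result is repeatable. The names match_score and z_score, and the defaults of 1000 shuffles and seed 0, are assumptions.

```python
import random
import statistics


def match_score(a, b):
    """Number of identical positions between two equal-length windows."""
    return sum(x == y for x, y in zip(a, b))


def z_score(window_a, window_b, shuffles=1000, seed=0):
    """Z = (x - m) / sd, where x is the unshuffled score and m, sd are the
    mean and standard deviation of scores against shuffled copies of
    window_b (same composition, random order)."""
    rng = random.Random(seed)
    x = match_score(window_a, window_b)
    letters = list(window_b)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)
        shuffled_scores.append(match_score(window_a, letters))
    m = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z is undefined")
    return (x - m) / sd


if __name__ == "__main__":
    z = z_score("ACGTACGTACG", "ACGTACGTACG")
    print(f"Z = {z:.2f}; exceeds threshold of 3: {z > 3}")
```

Shuffling preserves the composition of the sequence while destroying its order, so the shuffled scores estimate how many matches would be expected by chance alone.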

Dot matrix analysis with DNA Strider (Mount, Fig 3.4)


Get phage λ cI and phage P22 c2 repressor sequences from GenBank (X00166 and V01153, respectively)
Use DNA Strider 1.4 (contact TA to get a copy)
Use window size of 11 and stringency of 7

Dot matrix (Mount Fig 3.4)


[Figure: dot matrix plot; axes labeled in sequence positions, ticks every 100 (100-600 on one axis, 100-700 on the other)]

Note the set of diagonals in the lower right that do not line up, due to an insertion near 475 on cI

Dot matrix analysis with DNA Strider (Mount, Fig 3.6)


Get human LDL receptor protein sequence from Genbank (P01130)
Use weighting Identity
Use window size of 1 and stringency of 1
Use window size of 23 and stringency of 7

Dot matrix (Mount Fig 3.6)


W=1, S=1
Note set of stacked diagonals in upper left

[Figure: dot matrix plot; both axes labeled in sequence positions, ticks every 100 from 100 to 800]

Dot matrix (Mount Fig 3.6)


W=23, S=7
Note set of stacked diagonals in upper left

[Figure: dot matrix plot; both axes labeled in sequence positions, ticks every 100 from 100 to 800]

Reading for next class


Mount, Chapter 3 through page 93
Look over paper by Needleman and Wunsch on web site
(03-510/710) Durbin et al, pp 17-32
