Sequence Comparison With Dot Matrices

Computational Biology, Part 2 Sequence Comparison with Dot Matrices
Robert F. Murphy Copyright 1996, 1999-2006. All rights reserved.
Sequence Alignment
Definition: Procedure for comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences
Pair-wise alignment: compare two sequences
Multiple sequence alignment: compare more than two sequences
Example sequence alignment

Task: align abcdef with abdgf
Write second sequence below the first

    abcdef
    abdgf

Move sequences to give maximum match between them
Show characters that match using vertical bar

    abcdef
    ||
    abdgf

Insert gap between b and d on lower sequence to allow d and f to align

    abcdef
    || | |
    ab-dgf

Note e and g don't match
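
The vertical-bar convention above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name match_line is hypothetical, and it assumes the two sequences have already been aligned (equal length, with '-' marking a gap).

```python
def match_line(top, bottom):
    """Return a marker line with '|' wherever two aligned strings agree.

    Assumes both strings are already aligned: same length, '-' marks a gap.
    A gap never counts as a match.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return "".join(
        "|" if a == b and a != "-" else " "
        for a, b in zip(top, bottom)
    )


top, bottom = "abcdef", "ab-dgf"
print(top)
print(match_line(top, bottom))
print(bottom)
```

Printing the three lines reproduces the gapped alignment shown above, with no bar under the c/gap column or under the e/g mismatch.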
Matching Similarity vs. Identity

Alignments can be based on finding only identical characters, or (more commonly) can be based on finding similar characters
More on how to define similarity later
Global vs. Local Alignment
We distinguish
  Global alignment algorithms, which optimize overall alignment between two sequences
  Local alignment algorithms, which seek only relatively conserved pieces of sequence
    Alignment stops at the ends of regions of strong similarity
    Favors finding conserved patterns in otherwise different pairs of sequences
Global

    LGPSSKQTGKGS-SRIWDN
    |    |  |||   |  |
    LN-ITKSAGKGAIMRLGDA

Local

    -------TGKG--------
            |||
    -------AGKG--------
Why do sequence alignments?

To find whether two (or more) genes or proteins are evolutionarily related to each other
To find structurally or functionally similar regions within proteins
Origin of similar genes

Similar genes arise by gene duplication
  Copy of a gene inserted next to the original
  Two copies mutate independently
  Each can take on separate functions
  All or part can be transferred from one part of genome to another
Methods for Pairwise Alignment

Dot matrix analysis
Dynamic Programming
Word or k-tuple methods (FASTA and BLAST)
Sequence comparison with dot matrices
Goal: Graphically display regions of similarity between two sequences (e.g., domains in common between two proteins of suspected similar function)
Sequence comparison with dot matrices
Basic Method: For two sequences of lengths M and N, lay out an M by N grid (matrix) with one sequence across the top and one sequence down the left side. For each position in the grid, compare the sequence elements at the top (column) and to the left (row). If and only if they are the same, place a dot at that position.
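
The basic method can be sketched as follows. This is a minimal illustration rather than the course's demonstration spreadsheet: the names dot_matrix and render are hypothetical, and the text rendering ('*' for a dot, '.' for an empty cell) is just one convenient choice.

```python
def dot_matrix(seq_top, seq_left):
    """Build the grid: one row per character of seq_left,
    one column per character of seq_top.

    grid[i][j] is True if and only if seq_left[i] == seq_top[j].
    """
    return [[a == b for b in seq_top] for a in seq_left]


def render(seq_top, seq_left, grid):
    """Draw the grid as text, '*' for a dot and '.' for an empty cell."""
    lines = ["  " + seq_top]
    for ch, row in zip(seq_left, grid):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


top, left = "abcdef", "abdgf"
print(render(top, left, dot_matrix(top, left)))
```

For the two example sequences used earlier, the dots for a, b fall on one diagonal and the dots for d and f on a diagonal shifted by one column, which is the gap seen in the alignment.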
Examples for protein sequences

(Demonstration A6, Sequence 1 vs. 2)
(Demonstration A6, Sequence 2 vs. 3)
Interpretation of dot matrices

Regions of similarity appear as diagonal runs of dots
Reverse diagonals (perpendicular to diagonal) indicate inversions
Reverse diagonals crossing diagonals (Xs) indicate palindromes
  (Demonstration A6, Sequence 4 vs. 4)
Interpretation of dot matrices
Can link or "join" separate diagonals to form alignment with "gaps"
  Each a.a. or base can only be used once
    Can't trace vertically or horizontally
    Can't double back
  A gap is introduced by each vertical or horizontal skip
Uses for dot matrices

Can use dot matrices to align two proteins or two nucleic acid sequences
Can use to find amino acid repeats within a protein by comparing a protein sequence to itself
  Repeats appear as a set of diagonal runs stacked vertically and/or horizontally
  (Demonstration A6, Sequence 5 vs. 6)
Uses for dot matrices

Can use to find self base-pairing of an RNA (e.g., tRNA) by comparing a sequence to itself complemented and reversed
Excellent approach for finding sequence transpositions
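
The RNA self-comparison might be sketched like this. It is an illustration under stated assumptions: only Watson-Crick pairs (A-U, G-C) are considered, G-U wobble pairs are ignored, and the names reverse_complement and self_pairing_dots are hypothetical.

```python
# Assumption: plain Watson-Crick complement over the RNA alphabet ACGU.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Complement every base, then reverse the result."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


def self_pairing_dots(rna):
    """Dot matrix of an RNA (rows) against its reverse complement (columns).

    A dot at row i, column j means rna[i] can pair with the base at
    position len(rna) - 1 - j of the same molecule, so a diagonal run
    of dots corresponds to a potential antiparallel stem.
    """
    rc = reverse_complement(rna)
    return [[a == b for b in rc] for a in rna]


hairpin = "GGGAAAUCCC"  # made-up sequence: GGG can pair with CCC
grid = self_pairing_dots(hairpin)
print(reverse_complement(hairpin))
```

In this made-up hairpin, the first three rows carry dots on the main diagonal, reflecting that the leading GGG can pair with the trailing CCC.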
Filtering to remove noise

A problem with dot matrices for long sequences is that they can be very noisy due to lots of insignificant matches (e.g., a single A)
Solution: use a window and a threshold
  compare character by character within a window (have to choose window size)
  require certain fraction of matches within window in order to display it with a dot
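
The window-and-threshold filter can be sketched as below. This is an assumption-laden illustration: the name windowed_dot_matrix and its parameters are hypothetical, the threshold is given as a minimum match count rather than a fraction, and the dot is placed at the window's start position (real programs may place it at the window's center instead).

```python
def windowed_dot_matrix(seq_top, seq_left, window, min_matches):
    """Place a dot at (i, j) when at least min_matches of the `window`
    character pairs seq_left[i + k], seq_top[j + k] are identical.

    Assumption: the dot is recorded at the window's start position.
    """
    rows = max(len(seq_left) - window + 1, 0)
    cols = max(len(seq_top) - window + 1, 0)
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                seq_left[i + k] == seq_top[j + k] for k in range(window)
            )
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


# Window of 3, at least 2 matches required.
for row in windowed_dot_matrix("abcdef", "abdgf", window=3, min_matches=2):
    print("".join("*" if hit else "." for hit in row))
```

With window=1 and min_matches=1 this reduces to the unfiltered dot matrix; larger windows suppress isolated single-character matches while tolerating an occasional mismatch inside a similar region.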
Example spreadsheet with window
(Demonstration A7)
How do we choose a window size?
Window size changes with goal of analysis

  size of average exon
  size of average protein structural element
  size of gene promoter
  size of enzyme active site
How do we choose a threshold value?
Threshold based on statistics

  using shuffled actual sequence
    find average (m) and s.d. (σ) of match scores of shuffled sequence
    convert original (unshuffled) scores (x) to Z scores
      Z = (x - m)/σ
    use threshold Z of 3 to 6
  using analysis of other sets of sequences
    provides objective standard of significance
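
The shuffling procedure might look like the sketch below. Several details are assumptions not stated on the slide: only the second sequence is shuffled, 20 shuffles are pooled, the population standard deviation is used for σ, and the names window_scores, shuffled_background and z_score are hypothetical.

```python
import random
import statistics


def window_scores(seq_a, seq_b, window):
    """Match counts for every pair of window start positions."""
    return [
        sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
        for i in range(len(seq_a) - window + 1)
        for j in range(len(seq_b) - window + 1)
    ]


def shuffled_background(seq_a, seq_b, window, n_shuffles=20, seed=0):
    """Estimate m and sigma of window scores against shuffled copies of seq_b.

    Assumptions: only seq_b is shuffled, and scores from all shuffles
    are pooled before taking the mean and population s.d.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(seq_b)
        rng.shuffle(shuffled)
        scores.extend(window_scores(seq_a, "".join(shuffled), window))
    return statistics.mean(scores), statistics.pstdev(scores)


def z_score(x, m, sigma):
    """Z = (x - m) / sigma, as defined above."""
    if sigma == 0:
        raise ValueError("background scores have zero spread")
    return (x - m) / sigma


seq_a = "ACGTTGCAAGCTTAGGCTAACGTTAGCCATGA"  # made-up example sequences
seq_b = "TTGACGTTGCAAGGCTAACCGTAGCTAGGATC"
m, sigma = shuffled_background(seq_a, seq_b, window=5)
z_values = [z_score(x, m, sigma) for x in window_scores(seq_a, seq_b, 5)]
print(max(z_values))
```

A window would then be displayed with a dot only when its Z score reaches the chosen threshold, for example somewhere in the 3 to 6 range given above.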
Dot matrix analysis with DNA Strider (Mount, Fig 3.4)

Get phage λ cI and phage P22 c2 repressor sequences from Genbank (X00166 and V01153 respectively)
Use DNA Strider 1.4 (contact TA to get a copy)
Use window size of 11 and stringency of 7
Dot matrix (Mount Fig 3.4)

Note set of diagonals in lower right that do not line up due to insertion near 475 on cI
[Figure: dot matrix plot; one axis ticked 100-600, the other 100-700]
Dot matrix analysis with DNA Strider (Mount, Fig 3.6)

Get human LDL receptor protein sequence from Genbank (P01130)
Use weighting "Identity"
Use window size of 1 and stringency of 1
Use window size of 23 and stringency of 7

Dot matrix (Mount Fig 3.6)
W=1 S=1
Note set of stacked diagonals in upper left
[Figure: dot matrix plot; both axes ticked 100-800]

Dot matrix (Mount Fig 3.6)
W=23 S=7
Note set of stacked diagonals in upper left
[Figure: dot matrix plot; both axes ticked 100-800]
Reading for next class

Mount, Chapter 3 through page 93
Look over paper by Needleman and Wunsch on web site (03-510/710)
Durbin et al, pp 17-32
