Lecture 3, 4 and 5: Homology, similarity and identity

Homology: two sequences are homologous if they share a common evolutionary ancestry.
There are no degrees of homology: the answer is yes or no (two sequences either share significant homology or they do not).
Homology: qualitative inference.
Similarity and identity: quantitative inference.
Notably, two molecules may be homologous without sharing statistically significant amino acid (or nucleotide) identity.

Orthologs and paralogs
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; they are presumed to have similar biological functions.
Paralogs: homologous sequences that arose by a mechanism such as gene duplication; they have distinct but related functions.

Pairwise sequence alignment - BLAST tools

Pairwise alignment: lining up two sequences to achieve maximal levels of identity, in order to assess the degree of similarity and the possibility of homology.
Identity: the extent to which two amino acid (or nucleotide) sequences are invariant.
Similarity: the extent to which aligned residues are related because they share similar biochemical properties (structurally or functionally related residues).
Percent similarity: identical matches + similar matches.

- Gaps
Gaps represent insertions and deletions.
Gaps are added so that the overall length of each aligned sequence is exactly the same.
Gap penalty: a penalty for gap creation plus a penalty for each additional residue that the gap extends.

Scoring system - Dayhoff model
Accepted point mutation (PAM): a replacement of one amino acid in a protein by another residue that has been accepted by natural selection.
Accepted by natural selection: a gene undergoes a DNA mutation such that it encodes a different amino acid, and the entire species adopts that change as the predominant form of the protein.
Observed sequences were compared to the inferred common ancestor of those sequences, rather than directly to each other, to avoid the effects of multiple substitutions occurring at an aligned pair of residues.
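The identity, similarity and gap penalty definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity groups, the choice of alignment length as the denominator, and the default penalty values are hypothetical and are not taken from the lecture. Real tools decide similarity from a substitution matrix.

```python
# Hypothetical groups of biochemically similar residues (illustration only).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def is_similar(a, b):
    """True if two different residues fall in the same assumed similarity group."""
    return any(a in group and b in group for group in SIMILAR_GROUPS)


def alignment_stats(row1, row2):
    """Return (percent identity, percent similarity) for two aligned rows.

    Both rows must have the same length, with '-' marking gaps.
    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    identical = similar = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            continue  # gap columns count toward length but not toward matches
        if a == b:
            identical += 1
        elif is_similar(a, b):
            similar += 1
    length = len(row1)
    return 100.0 * identical / length, 100.0 * (identical + similar) / length


def affine_gap_penalty(gap_length, gap_open=11, gap_extend=1):
    """Gap creation cost plus a cost for each additional residue in the gap.

    Assumption: the first gapped residue is covered by gap_open.
    """
    if gap_length <= 0:
        return 0
    return gap_open + gap_extend * (gap_length - 1)


identity, similarity = alignment_stats("KLVE-TW", "KIVDATF")
print(round(identity, 1), round(similarity, 1))  # 42.9 85.7
print(affine_gap_penalty(1), affine_gap_penalty(3))  # 11 13
```

Note that conventions differ between programs, both for the denominator of percent identity and for whether the first gapped residue also pays an extension cost.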

Relative mutability of the amino acids: the number of times each amino acid was observed to mutate, divided by the overall frequency of occurrence of that amino acid.
Accepted mutations + probabilities of occurrence of each amino acid = mutation probability matrix.
PAM1 matrix: the protein undergoes 1% change.
One PAM: the unit of evolutionary divergence in which 1% of the amino acids have been changed between the two protein sequences. The underlying alignments were of closely related protein sequences that were at least 85% identical within a protein family.
Scoring matrix: scores for the interchange of residues i and j. The scoring matrix further incorporates a logarithm to generate log-odds scores.
Approach: define a set of scores for the comparison of aligned amino acid residues.
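PAM matrices for larger evolutionary distances are obtained by multiplying the PAM1 mutation probability matrix by itself (PAM250 is PAM1 raised to the 250th power). The sketch below shows the idea on a made-up three-letter alphabet. The values in TOY_PAM1 are hypothetical and chosen only so that each row sums to 1 with 0.99 on the diagonal; they are not Dayhoff's data.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy "PAM1" for a hypothetical 3-residue alphabet: 99% of residues unchanged.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.006, 0.990, 0.004],
    [0.003, 0.007, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal entries are far below 0.99, which mirrors why identical residues contribute less in PAM250 than in PAM10.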

Score s_{i,j}: the score for aligning any two residues i and j.
Probability q_{i,j}: the observed frequency of substitution for each pair of amino acids.
Different scoring for different PAM matrices:
- Identical amino acids score higher in PAM10 than in PAM250.
- Mismatch penalties are greater in PAM10.
- PAM10 gives negative scores to substitutions that are scored positively in the PAM250 matrix.
Different scoring matrices vary in their sensitivity to protein sequences (or DNA sequences) of varying relatedness.
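A log-odds score compares how often a pair of residues is observed aligned with how often it would be expected by chance. A minimal sketch follows, assuming the general form s = scale x log10(q / (p_i x p_j)). The background frequencies p_i and p_j, the scale factor of 10, and the numbers in the example are assumptions for illustration and do not come from the notes.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=10.0):
    """Rounded log-odds score for aligning residues i and j.

    q_ij     : observed frequency of the aligned pair (i, j)
    p_i, p_j : assumed background frequencies of each residue
    scale    : assumed scaling factor applied before rounding
    """
    return round(scale * math.log10(q_ij / (p_i * p_j)))


# Pair seen more often than chance: positive score.
print(log_odds_score(0.02, 0.05, 0.05))    # 9
# Pair seen exactly as often as chance: score of zero.
print(log_odds_score(0.0025, 0.05, 0.05))  # 0
# Pair seen less often than chance: negative score.
print(log_odds_score(0.0005, 0.05, 0.05))  # -7
```

A positive score means the substitution is seen more often than chance, which is the sense in which a PAM250 matrix can score positively a substitution that PAM10 penalizes.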

BLOSUM scoring matrix
BLOSUM62: merges all proteins in an alignment that have 62% amino acid identity or greater into one sequence.
E.g. if a block of aligned globin orthologs includes several that have 62%, 80%, and 95% amino acid identity, these would all be weighted (grouped) as one sequence.
BLOSUM: especially useful for identifying weakly scoring alignments; based on empirical observations of more distantly related protein alignments.
PAM: extrapolates the probabilities for more distantly related proteins.
Note: percent identity is not an exact indicator of the number of mutations that have occurred across a protein sequence; it is also not the sole determinant of homology.
The twilight zone: the evolutionary distance corresponding to about 20% identity between two proteins. Proteins with this degree of amino acid sequence identity may be homologous, but such homology is difficult to detect. Multiple sequence alignment and structural alignment are used to detect homology in this range.
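The grouping step described above for BLOSUM62 can be sketched as a simple clustering of the sequences in an ungapped block. This is a hedged illustration, not the published procedure: the function names, the single-linkage rule (a sequence joins a cluster if it reaches the threshold with any member) and the toy sequences are assumptions.

```python
def percent_identity(a, b):
    """Percent of positions that are identical in two equal-length sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, threshold=62.0):
    """Group sequences so that each cluster can be weighted as one sequence."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            close = any(percent_identity(seq, member) >= threshold
                        for member in cluster)
            (linked if close else rest).append(cluster)
        # Merge the new sequence with every cluster it is linked to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = rest + [merged]
    return clusters


block = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDEFGHAAA", "WWWWWWWWKL"]
for cluster in cluster_block(block):
    print(cluster)
```

In this toy block the first three sequences end up in one cluster and the fourth stays alone, so the block would contribute as two sequences rather than four.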

Global and local alignment
- Setting up the matrix
- Scoring the matrix
Four outcomes are possible at each position:
  Two residues may be perfectly matched (i.e., identical).
  They may be mismatched.
  A gap may be introduced from the first sequence.
  A gap may be introduced from the second sequence.
- Identify the optimal alignment
- Limitation of the Smith-Waterman algorithm: it is very slow. Both the computer space and the time required to align two sequences are proportional to at least the product of the two sequence lengths, m x n. For a search of a database of size N, this is m x N. To solve this, BLAST sacrifices some sensitivity for speed.
- Statistical significance for local alignment: E value
E value: the number of matches with a particular score (or better) that are expected to occur by chance.

BLAST
- Selecting the BLAST program
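The matrix steps listed above (set up, score, trace back the optimal alignment) can be sketched as a small Smith-Waterman implementation. It is a minimal sketch under simplifying assumptions: a fixed match/mismatch score instead of a substitution matrix, and a linear gap penalty. The parameter values are hypothetical.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of seq1, aligned part of seq2)."""
    m, n = len(seq1), len(seq2)
    # Set up an (m+1) x (n+1) matrix of zeros: space grows as m x n.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)

    # Score the matrix: each cell takes the best of four outcomes.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(
                0,                       # start a new local alignment
                H[i - 1][j - 1] + pair,  # match or mismatch
                H[i - 1][j] + gap,       # gap in seq2
                H[i][j - 1] + gap,       # gap in seq1
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    row1, row2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            row1.append(seq1[i - 1]); row2.append(seq2[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row1.append(seq1[i - 1]); row2.append("-")
            i -= 1
        else:
            row1.append("-"); row2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(row1)), "".join(reversed(row2))


print(smith_waterman("AAAGCTTTT", "CCGCTGG"))  # (6, 'GCT', 'GCT')
```

The two nested loops visit every cell once, which is where the m x n cost in time and space comes from.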

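- Selecting the database
- Optimize the optional parameters

Expect threshold: the cutoff applied to the E value, i.e. to the number of different alignments with scores equal to or greater than some score S that are expected to occur in a database search by chance. Alignments with an E value above the threshold are not reported.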

Composition-based statistics: improves the calculation of the E value statistic and reduces false positive search results in specialized circumstances, such as subjects matching queries of very different lengths.
Filtering and masking: low-complexity sequences contain commonly found stretches of amino acids (or nucleotides) with limited information content, e.g. dinucleotide repeats and Alu sequences.

BLAST algorithm
- BLAST compiles a list of words of a fixed length w that are derived from the query sequence.
- Threshold value T: the score of the aligned words. Words scoring at or above the threshold are collected and used to identify database matches.
- Lowering the threshold increases the time required to perform the search and may increase the sensitivity.
- With a high threshold some matches are missed, although the reported matches are more likely to be true positives; with lower threshold values there are somewhat more successful extensions.
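The word-list step above can be sketched as follows. This is an illustration only: toy_score is a hypothetical scoring function (a real protein search would score word pairs with a substitution matrix), and the threshold values are chosen to suit that toy scoring rather than any BLAST default.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Hypothetical residue score: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


def query_words(query, w=3):
    """All overlapping words of length w derived from the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length whose score against `word` is >= threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(query_words("LVNKW"))             # ['LVN', 'VNK', 'NKW']
print(len(neighborhood("LVN", 15)))     # 1  (only the exact word)
print(len(neighborhood("LVN", 9)))      # 58 (exact word plus single mismatches)
```

Lowering the threshold from 15 to 9 grows the list from 1 to 58 words for a single query word, which illustrates why a lower T costs more search time while allowing more potential hits.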

- After compiling a list of word pairs at or above threshold T, the BLAST algorithm scans a database for hits.
- BLAST extends hits to find alignments called high-scoring segment pairs (HSPs).
- For sufficiently high-scoring alignments, a gapped extension is triggered.
- The extension process is terminated when the score falls below a cutoff.
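The extension of a word hit can be sketched as an ungapped extension in both directions that stops once the running score has dropped too far below the best score seen so far. This is a simplified illustration and not BLAST's actual code: the function name, the match/mismatch scores and the x_drop cutoff are assumptions.

```python
def extend_hit(query, subject, q_start, s_start, w,
               x_drop=5, match=2, mismatch=-3):
    """Extend a word hit of length w without gaps.

    Returns (segment score, (start, end)) where start:end is the slice of the
    query covered by the extended segment.
    """
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))

    def extend(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += score(query[qi], subject[si])
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break  # score fell too far below the best: stop extending
            qi += step
            si += step
        return best, best_len

    right, right_len = extend(q_start + w, s_start + w, 1)
    left, left_len = extend(q_start - 1, s_start - 1, -1)
    return seed + left + right, (q_start - left_len, q_start + w + right_len)


query = "AAACGTACGTTT"
subject = "CCACGTACGCCC"
total, (start, end) = extend_hit(query, subject, q_start=3, s_start=3, w=3)
print(total, query[start:end])  # 14 ACGTACG
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches explored before the cutoff do not lower the reported segment score.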


The value of E decreases exponentially with increasing S

Raw score and bit score
- Raw score: calculated from the substitution matrix and the gap penalty parameters that are chosen.
- Bit score: calculated from the raw score by normalizing with the statistical variables that define a given scoring system; this allows comparison of scores obtained with different substitution matrices and databases.
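The relation between raw score, bit score and E value can be sketched with the standard local alignment statistics, E = K x m x n x e^(-lambda x S) and S' = (lambda x S - ln K) / ln 2. The symbols lambda and K are the statistical parameters of the scoring system, m is the query length and n is the database length; the numeric values below are illustrative assumptions, not values from the lecture.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def e_value_from_bits(bits, m, n):
    """Same E value, computed from the bit score alone."""
    return m * n * 2.0 ** (-bits)


# Illustrative assumptions: lambda, K, query length m, database length n.
LAM, K = 0.267, 0.041
M, N = 250, 50_000_000

for raw in (40, 60, 80):
    bits = bit_score(raw, LAM, K)
    print(raw, round(bits, 1), e_value(raw, M, N, LAM, K))
```

Each increase in the raw score multiplies E by a constant factor smaller than 1, which is the exponential decrease of E with increasing S. Because lambda and K are absorbed into the bit score, e_value_from_bits needs only the search space size.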