Sie sind auf Seite 1von 24

eDICTIONARY OF BIOINFORMATICS

ARUN JAGOTA
Bioinformatics By The Bay Press
http://bioinformaticsbythebay.hypermart.net

©Copyright, Arun Jagota, 2001. For personal use only.


Preface

This dictionary describes a number of terms, concepts, tools, and databases in Bioinformatics.
Descriptions range from a few sentences to a few paragraphs each. (Neither URLs, nor references
are included. The reader can easily find more information about a term by querying a search
engine on the web with the term.) The terms described are specific to bioinformatics; molecular
biology or statistical concepts not known to the author to be used significantly in bioinformatics (to
date) are excluded.

If some fonts are not coming out right, make sure you are reading in Acrobat Reader (or Adobe
Acrobat) 4.0 or higher.

About the Author

Arun Jagota teaches a number of bioinformatics courses in the national award winning certificate
program at the University of California, Santa Crux Extension. He is a PhD in Computer Science
from the State University of New York at Buffalo in 1993. He has been a visiting faculty member in
the Department of Mathematical Sciences at the University of Memphis, in the Department of
Computer Science at the University of North Texas, an affiliated faculty member in the Department
of Computer Science at the University of California, Santa Cruz, and an adjunct faculty member in
the Department of Mathematical Sciences at San Jose State University. He is presently a part-time
instructor and research scientist at the University of California, Santa Cruz, and an adjunct faculty
member in the Department of Applied Mathematics as well as at the Department of Computer
Engineering at Santa Clara University. Arun Jagota has taught more than thirty different courses in
Computer Science, some in Discrete Mathematics, and a few in Probability and Statistics.

2
AAComp{Ident,Sim}

These are search tools at the ExPasy server that use the composition of an amino acid sequence,
rather than the sequence itself, to find sequences in a database (SWISS-PROT) with similar
composition. (The composition of a query is searched against a database of the compositions of
sequences in SWISS-PROT.)

AACompIdent inputs the composition of the query and is useful when the query sequence is not
known. In fact, it may be used to identify the sequence from its composition. AACompSim inputs a
query sequence and computes its composition internally.

ACCESSION NUMBER

This is a unique identifier to a Genbank sequence record. The typical format is 1-2 letters followed
by 4-6 digits. Here are a few fictitious though syntactically correct examples: AG123456, BF43251.

AFFINE GAP COST

This is a model for scoring gaps in a pairwise sequence alignment. It has two parameters, a gap-
open penalty o applied once to each contiguous block of gaps, and a gap-extension penalty
e < o applied to each gap in a block except the first one. For example, a gap block ---- would incur
a cost of 6 if o = 3 and e = 1 .

ASN.1

This is the data model that underlies the set of integrated databases at NCBI. This model is similar
to object-oriented data structures and databases, and offers rich support for nested and inter-
related data structures. The richness of this model facilitates the seamless and well-navigable
access to these databases that the NCBI web browser Entrez provides. Contrast this model to the
flat-file format of Genbank, which has its deficiencies.

BLAST

Basic Local Alignment Search Tool. This is a popular sequence-based database searching
program. It performs, one-by-one, a heuristic, local alignment of the query sequence to each
sequence in the specified database. (Heuristic means non-optimal. Local means as opposed to
global.) BLAST is much faster than dynamic programming, while remaining effective. (Dynamic
programming computes optimal local alignments.) BLAST returns a list of hits ranked by E-value.
The lower the E-value of a hit, the more significant it is.

BLASTN

This is the version of BLAST that searches a nucleotide sequence against a nucleotide database.
It uses a primitive substitution matrix (match=1, mismatch=-2).

BLASTP

This is the version of BLAST that searches a protein sequence against a protein database. It
supports searches using the BLOSUM or the PAM family of substitution matrices. BLOSUM62 is
the most popular matrix used in BLAST searches.

3
BLASTX

This is the version of BLAST that searches a nucleotide sequence against a protein database by
translating the nucleotide sequence in all frames.

BLAST2SEQUENCES

This is a version of BLAST which locally aligns two sequences, instead of searching one sequence
against a database.

BLOCKS

This is a collection of blocks, where a block is an ungapped multiple alignment of highly similar
fragments of protein sequences. A block represents a motif with some structural or functional
significance. The popular BLOSUM62 substitution matrix is constructed from this database.

BLOSUM

This is a family of substitution matrices for scoring alignments, for example when doing BLAST
searches. BLOSUM62 is the most widely used member of this family.

BOOTSTRAP

This is a statistical method used to assess the robustness of phylogenetic trees produced by
various tree-building methods. The original multiple alignment from which a phylogenetic tree was
first produced is used to generate numerous bootstrap alignments. Each bootstrap alignment is
obtained from the original multiple alignment by sampling without replacement one column at
random from the n columns in the original alignment, n times. Thus a bootstrap alignment has the
same number of columns as the original alignment. A phylogenetic tree is built from each bootstrap
alignment. Once all the trees have been built, these trees may be compared to yield a consensus
tree or consensus subtrees. Well-conserved branches (or more generally, subtrees) are often
deemed as reliable.

BRANCH-AND-BOUND

This is a method for speeding up brute-force search for an optimal solution to a combinatorial
optimization problem. A combinatorial optimization problem is defined by a set S of solutions, with
a quality q ( s ) assigned to every solution s in S . A brute-force search generates and tests each
solution in S to find a solution with maximum quality. Branch-and-bound uses the following clever
idea to prune the search space. Suppose that a subset of solutions has been visited and a best
solution in this set found with quality p . Consider another subset Q of unvisited solutions, and an
upper bound q on the quality of any solution in Q . If p ≥ q , then no solution in Q need be
visited, since our present best solution cannot be improved by visiting Q . The better the upper
bound we can place on Q and the larger the size of the set Q , the more the speedup will be.

In bioinformatics, the branch-and-bound method has been used to optimally thread a protein
sequence through a tertiary structure template and to find a maximum parsimony phylogenetic
tree.

4
CATH

This is a database which classifies protein domain structures in PDB. Each structure is classified at
four levels. One of the levels is Class, which characterizes the secondary structure content of the
structure as mainly-alpha (contains mainly alpha-helices), mainly-beta (contains mainly beta-
sheets), alpha/beta (contains alternating alpha/beta structures, or many of each), or neither
(structure has low secondary structure content). Another level is Architecture, which characterizes
the overall shape of the domain structure in terms of the orientation of the secondary structure
elements. A third level is Topology, which characterizes the fold family of the structure based on
the orientation and the inter-connectivity of the secondary structure elements. A fourth level is
Homologous Superfamily, which groups together protein domain structures that appear to be
evolutionarily related.

CE

This is a tertiary structure-to-structure alignment method for structures in PDB. It is accessible from
the PDB web site. CE builds a long alignment from short alignments called aligned fragment pairs
(AFPs). An AFP is a short fragment in one structure aligned with a short fragment in the other,
based on local geometry. CE first picks the most similar AFP, then extends the alignment by
adding more AFPs to it. The AFPs in the extended alignment are required to form a continuous
chain, with no overlapping fragments but with gaps permitted, as illustrated below.

The two axes are the protein structures, the solid diagonal lines are the AFPs, and the continuous
chain of the three AFPs (with intervening gaps) forms the alignment. The dashed lines show the
projection of this alignment onto the two structures.

CLUSTALW

This is a popular multiple alignment tool based on the progressive alignment method.

CLUSTER ANALYSIS

This is a computational and statistical procedure for partitioning a data set into subsets of similar
items. It is often used to group together genes with similar expression patterns (or experiments with
similar response patterns of genes) from microarray gene expression data. It is also used to group
together proteins with similar sequences or similar structure (many protein classification databases
are constructed this way).

5
CLUSTR

This is a database of an automated clustering of the proteins in SWISS-PROT plus TrEMBL into
groups of related proteins. The clustering is based on a similarity matrix of all pairs of protein
sequences. Similarity of two sequences is measured as the score of an optimal local alignment, as
computed via the Smith-Waterman algorithm.

CONSERVED DOMAIN DATABASE

This is a database at NCBI which may be used to locate domains in a protein sequence. Domain
motifs, represented as position-specific weight matrix profiles, are scanned against the query
sequence to find which motifs hit where.

Cn3D

This is a 3D structure viewer for protein structures in MMDB. It includes the ability to display
structure-to-structure alignments and structure-based sequence-to-sequence alignments. In
contrast to conventional multiple alignments, these ones are of a master (the query structure) to
slaves (the structures aligned to the query structure). It uses VAST to find and align slave
structures to the master structure. Cn3D also permits one to annotate the view of a structure and
also has other improved features (e.g. ability to view correlated disorder in Xray structures,
something often ignored by earlier viewers.

CPHMODELS

This is a homology-based program for predicting the tertiary structure of a protein from its
sequence. The sequence is compared with other sequences to find a highly similar sequence
whose tertiary structure is known. This structure is returned as the predicted structure of the
original sequence.

DALI

This is a tertiary structure-to-structure alignment method and program. One can query PDB with a
structure and find similar structures in PDB, using DALI. Pre-computed alignments are stored in
the FSSP database.

DALI works by aligning the distance matrices of the two protein structures. (The distance matrix of
a protein tertiary structure stores the distances between all pairs of residues, specifically the
distances between their alpha-carbons.) DALI compares the two distance matrices by finding
highly similar "aligned submatrix pairs", then putting together many such pairs to form an
alignment. This is illustrated below.

The outer boxes denote the distance matrices of the two structures. The enclosed boxes denote
the submatrices. Solid submatrices in the two structures form one aligned pair, the dashed

6
submatrices form another pair. DALI scores an aligned pair, i.e. a pair of submatrices, one from
each structure, by quantifying the similarity of the submatrices being compared, specifically by
aggregating, component-wise, the values in their cells.

DIRICHLET DISTRIBUTION

This is a continuous distribution that is conjugate to the multinomial distribution. It is used to specify
a prior distribution over probability parameters. Let us explain this distribution and this use of it with
the following example. We toss a coin n times, and observe k heads. The maximum likelihood
estimate (of the probability that the coin will hand a head on the next toss) is k / n . When n is
small, this estimate is very unreliable. The Bayesian alternative would be to assume a prior
distribution over the probability parameter PH that the coin lands a head in any one toss. From the
data, in which k heads have been observed in n tosses, one could then compute the posterior
density P ( PH | k , n ) using Bayes rule. (In practice, this is not usually done, rather the mean of the
posterior is computed, which is much simpler.) The posterior is affected by our choice of the prior
distribution (which models our belief of what we expect to happen when a coin is tossed) as well as
by our observations: k heads in n tosses. For instance, if assume, reasonably, that all coins are
fair unless we are shown evidence to the contrary, then this prior distribution would be
concentrated around PH → 0.5 . On the other hand, we might equally reasonably assume that all
possible values of PH , ranging from 0 to 1 , are equally likely. (In other words, that all degrees of
skewness of the coin are equally likely.) In this case, we would have a flat Dirichlet density. More
generally, we can span the spectrum of flat Dirichlet densities (all degrees of skewness of the coin
equally likely) to biased Dirichlet densities (which concentrate the prior probabilities on particular
values of PH ) by certain parameters, α , which control the shape of the distribution.

It turns out that, when the prior distribution P ( PH ) is a Dirichlet distribution, the posterior
distribution P ( PH | k , n ) is also a Dirichlet distribution.

In bioinformatics, a Dirichlet distribution (and sometimes a mixture of Dirichlet distributions) is often


used to model the prior distribution of the emission probabilities of alphabet symbols (the four
nucleotides in the case of DNA sequences) in a column in a multiple alignment, or in a position in a
functional site. This is usually done for protein sequences, although our example below is a DNA
one.

For example, let us suppose that we would like to build a position-weight matrix profile for a set of
aligned donor site sequences. Consider the position +1 in this profile, the first base into the intron.
We expect in advance that this position will almost surely contain a G . We will choose the Dirichlet
distribution to model the prior distribution of the probability parameters of the four nucleotides at this
position in such a way that it is concentrated around PG → 1 . This way, the posterior probability
P ( PG → 1| data ) will remain close to 1 unless we find enough sequences in which there is a
letter other than G at this position (which is extremely unlikely). We will use flatter Dirichlet
distributions at other columns in the multiple alignment where we do not have a prior expectation of
a strong bias in favor of some nucleotides over others.

The approach described in the previous example works well even when there is limited data. For
instance, if we constructed a position-weight matrix profile from an aligned set of just (say) three
sequences, the profile would reveal that we can trust the +1 column of the profile (even though our

7
multiple alignment only has three sequences) because there is a strong prior expectation of a G
at this position, and our method has exploited this knowledge. On the other hand, our profile would
reveal a wider spread at some positions further away from the site boundary, which would indicate
that we don't have enough data to offset the flatter prior that reflects our uncertainty of what
compositions we would expect at these positions.

DOT MATRIX

This is a nice way to visualize a local alignment, especially many alignments of pairs of, possibly
overlapping, fragments in the two sequences, as illustrated below.

The horizontal and vertical axes correspond to the two sequences. Solid diagonal lines denote
aligned fragments.

DDBJ

This is the Japanese sibling of Genbank.

DYNAMIC PROGRAMMING

This is an algorithm for computing an optimal local or global alignment of two sequences.

EMBL

This is the European sibling of Genbank, an annotated collection of all public domain DNA and
protein sequences.

ENSEMBL

This is a web server which provides access to automatically generated annotations of the
genomes of complex organisms such as humans. One can search the human genome, browse
chromosomes, find genes, find genomic sequences similar to a given protein sequence, etc.

ENTREZ

This is a text-based search engine for bioinformatics databases, at NCBI. It provides access to a
literature database (PubMed), nucleotide database (Genbank), protein sequence database, 3D
structure database (MMDB), Genome database (complete genome assemblies), population sets,
and Online Mendelian Inheritance in Man (OMIM).

8
ENTROPY

This is an information-theoretic concept that quantifies the amount of disorder in a probability


function. For example, the entropy H ( P ) of random DNA, i.e. PA = PC = PG = PT , is 2 . Entropy
ranges from 0 to log 2 Σ , where Σ denotes the number of symbols in the alphabet Σ .

Entropy is useful in quantifying the degree of conservedness in a column in a multiple alignment (or
in a position in a functional site). The lower the entropy, the more conserved the column is.

Also see information content, sequence logos.

E-VALUE

This is the expected number of chance hits that score as well or better than the given hit. The lower
the E-value of a hit, the more significant the hit is. BLAST reports the E-value of a hit, and ranks
hits by their E-values.

EXTREME VALUE DISTRIBUTION

This is the distribution used by BLAST to compute its E-values. This distribution models the score
of a highest-scoring hit of a query sequence to a database composed of random sequences.

For concreteness, consider the setting of ungapped global alignment. We have a query sequence
q of length n , to be compared individually against m sequences in a database D composed of
random sequences, each also of length n . For simplicity, let us define our score function S ( x, y )
that quantifies the similarity of two equal-length sequences x and y as percent identity. For any
fixed sequence d in the random database, the distribution of S ( q, d ) is binomial, and may be
approximated by a normal distribution. On the other hand, the distribution of the best score,
S (q, d * ) = max d in D ( S (q, d ) ) is not normal, it is an extreme-value distribution associated with
the normal distribution.

FASTA

This is a program similar to BLAST for heuristic local alignment and for sequence-based database
searching.

FASTA FORMAT

This is a widely used format for DNA and protein sequences. Many search tools accept sequences
in this format. For each sequence, there is a header line beginning with > followed by one or more
lines containing the sequence itself, as illustrated below.

>sequence header
ACAGAAA
>Sequence header
>ACAAAGA

9
FRAGMENT ASSEMBLY PROBLEM

A somewhat idealized version of the problem is as follows. We wish to determine the sequence
X = X 1 X 2 ... X n of a DNA molecule, say 1000s of base pairs long. Experimentally, we sequence
its fragments (typically about 400 bases each) at random, unknown locations. Moreover, some of
these fragments may come from the forward strand, others from the backward strand. We don't
know which fragment comes from which strand. The problem is to chain these fragments together
in such a way (based on their overlaps) that we get a good reconstruction of X , as illustrated
below.

Reconstructed X
Fragments

Note that some fragments may need to be reverse-complemented in the assembly process.

If the strands of all the fragments are known, then the problem may be modeled as one of finding
the shortest common superstring of the fragments.

A greedy algorithm is commonly used as the center-piece of a fragment assembly algorithm. The
greedy method combines, at each stage, the two fragments with the highest overlap and replaces
them with their overlapped concatenation (contig), as depicted below.

Contig

Fragments

FSSP

This is a database of all-against-all structure-to-structure alignments of structures in PDB,


alignments done via the program Dali.

In more detail, all protein chains from PDB of length at least 30 are taken and partitioned into two
sets: a representative set R and a set of sequence homologs H to structures in R . The
sequences of any two structures in R have no more than 25% identity. The sequence of each
structure in H has identity of more than 25% to the sequence of at least one structure in H . First,
all pairs of structures in R are aligned by Dali. Next, each structure in R is aligned with its
sequence homologs in H .

GCG

This is a large commercial package for sequence analysis containing more than one hundred
programs for various tasks: pairwise alignment, multiple alignment, phylogenetic analysis, format
conversion, fragment assembly, protein structure prediction, etc. It is also called the Wisconsin
package.

10
GENBANK

This is the premier database of nucleotide sequences. It is in flat file format.

GENSCAN

This is a popular genefinding program. It uses hidden Markov models.

GLOBAL ALIGNMENT

This is a full alignment of two DNA or protein sequences, with gaps inserted in one or both
sequences, as needed.

GRAIL

This is a popular genefinding program. It uses neural networks to combine information about
predicted local sites such as splice sites with predicted coding regions.

HIDDEN MARKOV MODEL

This is a sophisticated probabilistic model for biological sequences. It is especially useful in


describing a multiple alignment of protein sequences, with gaps in the alignment. See Profile
Hidden Markov Models. Many modern genefinders also use hidden Markov models.

HITS

This is a database of protein domains and motifs. Motifs take the form of regular expressions (from
Prosite), generalized profiles (from Prosite), and Hidden Markov Models (from Pfam).

HMMER

This is a program for building a profile hidden Markov model as well as a multiple alignment from a
set of protein sequences.

HSSP

This is a database of homology-derived secondary structure of proteins produced by aligning


protein sequences of unknown structure to the sequences of known structures.

INFORMATION CONTENT

This is the amount of information in a probability function P . It is defined as


IC( P) = H ( R) − H ( P) . Here H denotes entropy, and R is a reference probability function,
typically one that assigns the same probability to each outcome in the sample space. For example,
if the sample space is the DNA alphabet and R is the uniformly random probability function,
H ( R) equals 2 .

Information content is used to quantify the amount of information in a column of a multiple


alignment (or of a position in a functional site). Information content is used in sequence logos , a

11
visualization tool for multiple alignments that reveals the information content of each column, as
well as its composition, in a visual format.

JUKES-CANTOR MODEL

This is a time-dependent substitution matrix for nucleotide sequences. The matrix contains just two
time-dependent quantities P (α | α , t ) and P ( β | α , t ) which are independent of α and β . Here
α denotes a nucleotide and β a nucleotide different from α . The JC substitution matrix does not
distinguish different types of substitutions, say transitions versus transversions.

KEGG

This is a database of biochemical pathways.

KIMURA MODEL

This is a variant of the Jukes-Cantor substitution matrix for nucleotide sequences which
distinguishes between transition substitutions and transversion substitutions.

LALIGN

This is a tool for local alignment of two sequences, where the local alignment is allowed to be
fragmented.

LIGAND

This is a chemical database of enzyme reactions.

LINEAR CORRELATION COEFFICIENT

This is a normalized measure of the correlation ρ ( X , Y ) between two random variables X and
Y . It is defined as

covariance( X , Y ) (standard deviation( X ) × standard deviation(Y ) )

ρ ( X , Y ) ranges from −1 to +1 , with −1 indicating that X and Y are maximally anti-correlated,


+1 indicating that X and Y are maximally positively correlated, and 0 indicating that X and
Y are uncorrelated.
Linear correlation coefficient is widely used to compare the expression vectors of genes (or
experiments) in microarray data, especially during cluster analysis.

LOCAL ALIGNMENT

This is the process of finding and aligning highly similar regions of two DNA or protein sequences.

12
LOW-COMPLEXITY REGION

This is a portion of a DNA or a protein sequence with many repeats of short sequences or other
forms of biased composition. It can cause database searching programs like BLAST to give false
hits. It is a good idea to have the low-complexity regions removed from a query sequence before
doing a BLAST search (this is the default action in BLAST).

LPFC

Library of Protein Family Cores. Think of this as a collection of structure profiles. A structure profile
corresponds to a structurally conserved region and is formed by a structural alignment of proteins
in a family. Functional sites in a 3D structure may be detected by comparison with structure
profiles. These profiles are also useful in threading.

MARKOV MODEL

This is a probability model that permits bounded dependencies. Specifically, in a first-order Markov
model, the probability of a letter at a particular position in a sequence is permitted to depend on the
letter at the previous position. That is,P ( X i | X i −1 ,..., X 1 ) = P ( X i | X i −1 ) , where X i denotes
the ith letter in a sequence X . In a kth -order Markov model,

P ( X i | X i −1 ,..., X 1 ) = P ( X i | X i −1 ,..., X i − k )

In bioinformatics, Markov models are used to model short well-conserved patterns called signals
as well as the statistics of variable-length sequences called content. (A donor splice site is an
example of a signal, an exon an example of content.)

MEME

This is a tool for automatically discovering motifs (represented by ungapped position-weight matrix
profiles) from a set of related protein or DNA sequences. The motif discovery algorithm is based
on fitting a two-component mixture model to the given set of sequences, using the EM algorithm.
One component describes the motif by a fixed-width position-weight matrix profile. The other
component models background, i.e. all other positions in the sequences.

META-MEME

This is a tool that inputs a set of related protein sequences and a set of motifs in them (discovered
by MEME, a related tool) and produces a hidden Markov model. The HMM may then be used to
multiply align the protein sequences, or to search (the HMM) against a database of sequences.
These uses are similar to those of profile HMMs.

MMDB

Molecular Modeling Database. This is the NCBI database of protein tertiary structures. Think of it
as a value-added, validated version of PDB. The distinguishing features of MMDB are explicit and
consistent chemical graph information following extensive validation, uniformly derived secondary
structure definitions and domain descriptions, citation matching to MEDLINE, taxonomy
assignment to each protein, and structure neighbors for all structure chains and individual domains.
Structure neighbors of each MMDB structure in the database are pre-computed using the VAST

13
structure-to-structure alignment tool. Sequence neighbors of each MMDB structure in the database
are pre-computed using BLAST on the sequence of the structure. The Cn3D viewer is useful for
viewing not only a structure but also alignments of a structure to other structures as well as
corresponding alignments of their sequences.

MULTINOMIAL DISTRIBUTION

This is the generalization of the binomial distribution to more than two outcomes. For example,
given the background probabilities P ≡p PA , PC , PG , PT f of A, C , G , T , this computes the
probability that a sequence of length n in which the letter at each position is generated randomly
according to P , and independently of other positions, has a composition vector
p #As,#Cs,#Gs,#Ts f . (Here all four compositions must add up to n .)

This distribution is suited to modeling the composition of DNA or protein sequences. It has
applications to scoring the composition of a column in a multiple alignment to a model of the
expected composition, to quantifying codon usage bias in protein sequences, to trying to
distinguish exons from introns by composition, and others.

MULTIPLE ALIGNMENT

This is a global alignment of more than two DNA or protein sequences. The alignment is typically
scored by scoring each column of the alignment and adding up the column scores. Gaps costs
may be incurred in the score of a column, or separately. One way to score a column is by its
degree of conservedness (more conserved columns indicate better-scoring alignments). This can
be done by computing the information content of the column. A more widely used method is called
the sum-of-pairs method. In this method, the score of a column is the sum of the scores of all
distinct pairs of letters in the column, where a pair of letters is scored via a substitution matrix (such
as BLOSUM or PAM in the case of protein sequences).

Multiple alignments have many uses including towards building descriptors for functional sites, for
domains, and for families. A descriptor built from a multiple alignment is often used to find more
sequences that fit the descriptor (more sequences that belong to a functional site, domain, or
protein family, as the case may be) and add these to the alignment. A multiple alignment is also a
first step in phylogenetic tree-building.

MUTUAL INFORMATION

This quantity, M ( X , Y ) , quantifies the amount of dependence between two symbolic random
variables X and Y . (Symbolic here means that X and Y need not take numeric values. For
example, X and Y could range over the nucleotides A, C , G , T .) M ( X , Y ) is defined as the
relative entropy H ( P ( X , Y ) || P ( X ) P (Y ) ) , where P ( X , Y ) is the joint distribution of X and
Y and P( X ) and P(Y ) are its marginal distributions. M ( X , Y ) is always nonnegative, the more
positive it is, the more dependent X and Y are.

Mutual information has been used in RNA structure analysis, specifically in quantifying the base-
pairing strength of two columns iand j in a multiple alignment of RNA sequences. The higher the
mutual information of these two columns, the more likely it is that they are base-pairing columns.
This calculation assists in the prediction of RNA secondary structure from a multiple alignment.

14
Mutual information may also be used to reveal and quantify the amount of dependence between
(adjacent or non-adjacent) positions in a functional site, for example a donor splice site.

Mutual information may also be used to quantify the dependence between pairs of residues in near
proximity, in protein tertiary structures.

NEIGHBOR-JOINING METHOD

This is a phylogenetic tree-building method that is "one notch above" the UPGMA tree-building
method.

NEURAL NETWORK

This is an algorithm for learning from a given set of examples, loosely inspired by real neural
circuits in the brain. Neural networks are widely used for pattern recognition, for instance to classify
an input vector into one of k classes. A neural network is trained on a set of labeled examples, i.e.
the correct class of each example is given to the network during training. After training, the neural
network is used to classify new examples.

In bioinformatics, neural networks have found use in protein secondary structure prediction, in
predicting short semi-conserved sites such as donor and acceptor splice sites, in genefinding, and
other areas.

NUSSINOV RNA FOLDING ALGORITHM

This is a dynamic programming algorithm for computing the maximum number of nested canonical
base pairs in an RNA sequence. The sequence ACAACGU has two nested canonical base pairs,
the outer pair being A-U and the inner pair being C-G and this is the maximum number in this
sequence. The sequence ACAGCGU on the other hand has three nested canonical base pairs.

PARSIMONY

This is a character-based method for constructing a phylogenetic tree from a multiply aligned set of
sequences. The parsimony of a tree is defined as the minimum number of substitutions required to
produce a given set of sequences, placed in a particular way at the leaves of the tree. Here is an
example on the DNA alphabet, for a fixed tree on whose leaves one-letter sequences are placed.

C C A C

The minimum number of substitutions on this tree to yield the leaves is 1, achieved by placing the
same 1-letter sequence, C, at all the internal nodes of the tree, including the root.

The parsimony of a tree may be computed efficiently by a clever method. It is much more time
consuming to find the tree which has maximum parsimony (minimum number of substitutions). A

15
naïve method would enumerate all trees. This would be prohibitively time consuming since the
number of trees is exponential in the number of sequences. A branch-and-bound method, which
prunes the search space, is used. It is still slow, though.

PAM

Point Accepted Mutation. This is a family of protein substitution matrices for scoring alignments,
especially when doing BLAST searches. One PAM is a unit of evolutionary divergence in which 1%
of the residues have changed. PAMX is thus the matrix for "X PAM units". First, a PAM matrix is
constructed for a small value of X from a correct multiple alignment of closely related protein
sequences. This PAM is then extrapolated to construct PAMY, where Y >> X, by essentially matrix
multiplication. To align highly divergent sequences use PAM250.

PAML

This is a software package for Phylogenetic Analysis By Maximum Likelihood. According to the
author, this package is not good for building trees, but is good for evaluating and comparing trees
built by other methods such as PAUP.

PAUP

Phylogenetic Analysis Using Parsimony. This is a software package for constructing a phylogenetic
tree from biomolecular sequences or other data. It includes distance-based and maximum-
likelihood methods as well.

PDB

Protein Data Bank. This is the repository of experimentally determined tertiary structures of
proteins. For each structure, it stores the 3D coordinates of the atoms in the structure, and various
types of annotations including the references (authors, titles, journals, etc), bonding information,
secondary structures, and so on.

Pfam

This is a database of profile Hidden Markov Model motifs. Each motif is constructed from a seed
multiple alignment of highly similar sequences, which grows to include more distant members
iteratively.

PHD

This is a class of neural network methods for predicting secondary structure (PHDsec), solvent
accessibility (PHDacc), and transmembrane helices (PHDhtm).

Predictions are made as follows. First, using blastp, the sequence x is searched against SWISS-
PROT to find similar sequences. From the hits, a multiple alignment is generated by a program
called MaxHom. From this multiple alignment, a profile is built. This profile is then searched against
SWISS-PROT, to find more related sequences. (During this process, motifs in x are found by
searching the ProSite database, domains in x found by searching the ProDom database, and low-
complexity regions in x marked using the program SEG.)

The final profile is fed to PHDsec which makes the secondary structure prediction. Each residue is
labeled as being in an alpha-helix, a beta-sheet, or neither. Confidence values are also returned.

16
The reported accuracy of PHDsec is about 73%. This same profile may be fed to PHDacc which
makes a solvent accessibility prediction. The output is ten states of relative accessibility. This same
profile may be fed to PHDhtm which predicts transmembrane helices. The output is two states:
each residue is tagged as being in a transmembrane helix or not. The string of outputs is refined by
a dynamic programming type method, to filter out errors using context. (For instance, if the
predictions of three consecutive residues are TNT where T denotes "in transmembrane helix" and
N not, then the N will be replaced by T in the refinement phase.) The reported accuracy of PHDhtm
is about 95%.

PHI-BLAST

Pattern Hit Initiated BLAST. This is a version of BLAST that inputs a protein sequence and a
regular expression pattern in it. It may be used to find other protein sequences that not only contain
the same pattern but are also similar to the input protein sequence in the proximity of the
occurrence of the pattern, in the two sequences. This is illustrated below.

Query sequence

Similar regions

Hit sequence

PHYLIP

Phylogeny Inference Package. This is a free software package for constructing a phylogenetic tree
from molecular sequences, gene sequences, or other data. It supports parsimony, distance matrix
methods, and likelihood methods. It also supports bootstrap analysis and consensus tree building.

PHYLOGENETIC ANALYSIS

This is the process of building a phylogenetic tree from a given set of sequences (or other data).
The first step is to do a multiple alignment of the sequences, using CLUSTALW perhaps. The
second step is to clean up the multiple alignment (remove outlying sequences, handle gaps,
downweight overrepresented sequences, etc). The third step is to build one or more trees, using a
distance-based method, a parsimony method, or a likelihood-based method. The fourth step
(which may interact with earlier steps) is to assess the quality of the built trees, perhaps using
bootstrap.

PIR

This is a well-annotated and nonredundant protein sequence database. A closely related database
is iProClass, which has classified more than 200,000 protein sequences from PIR and SWISS-
PROT into about 30,000 superfamilies, about 100,000 families, about 2,500 domains, and so on.
Interestingly, iProClass is implemented on a relational database system.

17
POSITION-WEIGHT MATRIX PROFILE

This is a matrix that describes the composition of each column in a multiple alignment (or of a
functional site). Here is a DNA example. Consider the ungapped alignment

ACTAG, ATCCG, TTCAG

One version of a profile associated with this aligned set is the 3 × 5 matrix P given by

P[ A,1] = P[ A,5] = 2; P[C , 2] = P[C , 4] = 1; P[C ,3] = 2;


P[G,5] = 3; P[T ,1] = P[T ,3] = 1; P[T , 2] = 2

with the rest of the entries being zero. In practice, what are stored in the profile are not counts, but
rather log-probabilities. For instance, the log-probability version of P[ A,1] would be log 2 / 3 .

Let X denote a sequence and P a profile, both of the same length n . The alignment of X to P
is scored as follows:

S ( X , P) = ∑ P ( X ,i)
i =1,..., n
i

The score of the sequence ACCAG against the version of the profile P that stores the counts is

P[ A,1] + P[C , 2] + P[C ,3] + P[ A, 4] + P[G,5] = 2 + 1 + 2 + 2 + 2 = 9

When P stores log-probabilities, the score S ( X , P) has a simple probabilistic interpretation: it is


the log-probability of generating the sequence X from the profile P under the assumption that the
positions in P are independent.

PREDICTPROTEIN

This web server features the class of PHD programs for predicting secondary structure, solvent
accessibility, and transmembrane helices, as well as programs for prediction of tertiary structure,
coiled-coil regions, etc. At the same site, in the Meta PredictProtein section, is another secondary
structure prediction program, JPRED, which takes a consensus between a number of methods.
There are also three other programs for predicting transmembrane helices, TMHMM, TOPRED,
and DAS. The tertiary structure prediction programs include TOPITS, SWISS-MODEL, and
CPHmodels.

PRINCIPAL COMPONENTS ANALYSIS

This is a method of visualizing high-dimensional data by transforming it into a very low-dimensional


space (usually 1, 2, or 3D). The transformation is achieved (by rotating and translating the axes of
the original space) in such a way that the first few axes describe the data as best as possible. The
remaining axes may then be "thrown away", with possibly some (but hopefully not a lot) of loss of
information. This is illustrated in the example below.

18
Here the original data is in the two-dimensional space x-y but most of it may be described in the
one-dimensional space PC1.

Principal components analysis is often used to visualize microarray data because of its high
dimensionality (think of the expression levels of thousands of genes in a single experiment as
forming a vector in a thousands-dimensional space).

PRODOM

This is a database of protein domains.

PROFILE HIDDEN MARKOV MODELS

This is a sophisticated probabilistic method, employing a hidden Markov model, to multiply align a
set of protein sequences. The trained hidden Markov model itself serves as a descriptor for the
alignment. New sequences may be aligned by aligning them to the HMM. The HMM is superior to
a simple (position-weight matrix) profile in that it explicitly models gap columns in an alignment, as
illustrated below.

Insert column 2

A T T G
C A
ACTTA
ATTT
ATTA
ATTG
ATCA 1 3 4 5

This HMM has five states. The directed arcs denote transitions among states. They have
probabilities on them, not shown here. The circular states denote match (i.e., non-gap) columns.
The square state denotes an insert column. Each state emits nucleotides. The first state emits only
As, the second state emits all four nucleotides with equal probability, the third state emits only Ts,
the fourth state emits Ts with higher probability than Cs but no other nucleotides, and the fifth state
emits As with higher probability than Gs but no other nucleotides. Imagine that the five sequences
shown to the left are the ones from which this HMM was trained. Now consider a new sequence,
ATCG. The alignment algorithm will realize that A is best interpreted as being emitted from state 1,
T from state 3, C from state 4, and G from state 5. The alignment algorithm will thus produce A-
TCG, the aligned version of ATCG.

The Pfam database of HMM profiles is a profile HMM of this type. See Pfam.

19
PROGRESSIVE ALIGNMENT

This is a class of heuristic methods for multiple alignment. A multiple alignment is built
progressively by doing repeated pairwise alignments. Here is an illustrative example of a
progressive alignment method. First, the two sequences that are most similar in the given set are
identified, and aligned. Next, a third sequence that is most similar to the aligned set in step 1 is
identified, and aligned to this set. Next, a fourth sequence that is most similar to the current aligned
set is identified, and aligned to the current set. And so on.

PROSITE

This is a database of short biologically significant motifs in protein sequences. The motifs are
stored as regular expression patterns and also as position-weight matrix profiles. The PROSITE
web server supports tools for scanning a protein sequence to find patterns in the patterns database
that occur in the sequence, and profiles in the profiles database that score sufficiently well against
portions of the sequence. The PROSITE web server also supports tools for finding sequences in a
sequence database (SWISS-PROT) that match a given pattern.

PROTOMAP

This is a database of an automated hierarchical clustering of the proteins in SWISS-PROT into


groups of related proteins. The clustering uses transitivity (if protein A is strongly similar to protein
B, and B strongly similar to protein C, then A is judged to be similar to C), with similarity measured
in the usual way via substitution matrices (BLOSUM50 and BLOSUM62).

PSI-BLAST

Position-Specific Iterated BLAST. This is a variant of BLAST for doing a heuristic version of a
profile search. PSI-BLAST inputs a query sequence and first attempts to find strong hits in a
database via the usual gapped BLAST. The hits are then multiply aligned, and a profile constructed
from this alignment. Next, this profile is searched against the database to try to pick up more hits.
PSI-BLAST is designed with the hope that this method will better facilitate the detection of remote
homologues. On the other hand, this method is more prone to false hits than the usual BLASTs.
False positives that get included in the first alignment may tarnish subsequent hits.

PSEUDO-COUNT METHOD

This is a method of incorporating prior knowledge into a multiple alignment in the form of pseudo-
counts. Let us illustrate it with a simple example. Consider n independent tosses of a coin in which
k heads are observed. The maximum likelihood estimate of the probability PH that the next toss
lands a head is k n . This estimate is very unreliable when n is small. (For example, when
n equals 1 , the maximum likelihood estimate sets PH to be 0 or to 1 , both unreasonable values.)
If we assume that the coin is fair unless we see enough evidence to the contrary---a reasonable
assumption---we can generate virtual examples (under this assumption) to increase the number of
examples. We do this as follows. We make w virtual tosses, each with a coin assumed to be fair.
The expected number of heads in the w virtual tosses is then 0.5w . We now compute the
maximum likelihood estimate on this enlarged data set, which yields PH = ( k + 0.5w ) ( n + w )
The parameter w allows us to control the trade-off between how much we trust the data versus
how much we trust our prior assumption.

20
In the previous paragraph, we may replace the assumed prior probability of landing heads, one-
prior
half, by any prior probability PH . We may even replace it with a distribution over prior
probabilities, usually the so-called Dirichlet distribution.

In bioinformatics, the pseudo-count method is used to construct position-weight matrix profiles for
multiple alignments, or for functional sites, when only limited data is available but some prior
knowledge is also available.

RASMOL

This is a widely used and powerful free viewer for PDB structures. RasMol allows one to
manipulate the view of the structure being viewed by a combination of mouse operations, drop-
down menus, and command-line commands. RasMol supports a powerful set of commands,
entered in a separate command-line window.

RELATIVE ENTROPY

This is an information-theoretic concept that quantifies the extent of the difference between two
probability functions. For example, let PA , PC , PG , PT and QA , QC , QG , QT denote two probability
functions on the DNA alphabet. The relative entropy H ( P || Q) quantifies how different Q is from
P . Relative entropy is always nonnegative. Relative entropy is not necessarily symmetric, it is
rarely the case that H ( P || Q ) equals H (Q || P ) .

Relative entropy may be used to compare the composition of a column in a multiple alignment to
the expected composition at that column (assuming one exists). This in turn may be used to score
how well the column is aligned.

REPEAT MASKER

This is a tool that removes (masks) repeats from a DNA sequence. (Repeats are low-complexity
regions that can produce spurious hits when the sequence containing repeats is searched against
databases using BLAST or other tools. By default, BLAST masks these in the query sequence
before initiating a database search.)

RESTRICTION MAP CONSTRUCTION

Restriction enzymes cut foreign DNA at locations called restriction sites. A restriction map is a map
of these locations in the DNA. These locations are not determined experimentally. What is
determined experimentally are some properties of fragments formed after the DNA has been cut
by various restriction enzymes in various ways. From this data, a restriction map is then
constructed by a nontrivial computer algorithm. Graph theory and algorithms are often used.

SCOP

Structural Classification of Proteins. This is a database which hierarchically classifies proteins at


four levels: classes, folds, superfamilies, and families. Proteins are classified from their 3D
structure, taken from PDB. The database is constructed by manual comparison of structures,
aided by software tools as found useful. Sometimes proteins are grouped together if they share a
common domain, other times a wider context of multiple domains is used.

21
Proteins are placed together in the same family if they are found to be clearly evolutionarily related.
Usually this determination is made only when the percentage identity of their aligned sequences is
found to be 30% or greater. Proteins are placed together in the same superfamily if they are
judged to be probably evolutionarily related. Usually this determination is made from commonality
in the structural and functional features of the two proteins, even though their sequences have low
identity (significantly less than 30%). Proteins are placed together in the same fold category if they
have major secondary structure elements in the same order and topology. Proteins in the same
fold category are structurally but not necessarily evolutionarily related.

SEG

This is a program that identifies low-complexity regions in protein sequences, and replaces the
amino acids at these positions by Xs. A low-complexity region is one with repeats or other types of
biased composition. A low-complexity region in a sequence input to BLAST can lead to false hits
that score well (have low E-values) but are unrelated to the query sequence except in that they
have similar low-complexity regions. It is therefore a good idea to filter out low-complexity regions
from a sequence before doing a BLAST query.

SEQUENCE LOGO

This is a method for visualizing the composition of the various columns in a multiple alignment, or
that of various positions in a functional site. Consider the case of nucleotide sequences. The height
of each position in a sequence logo (a position corresponds to a column in the multiple alignment)
indicates the information content of the corresponding column in the multiple alignment. In our
case, this would range from 0 to 2, with 0 indicating that this column has zero information content
and 2 indicating that this column has the maximum possible information content, 2 bits. Each
position in a sequence logo also displays, in addition to the height, the relative contributions of each
of the four nucleotides A, C, G, T to the total information content (i.e., this height).

SRS

This is a text-based search engine for over 80 databases, supported by EBI. It comes with a
powerful query language that is reminiscent of query languages for relational databases. Check
out especially the link operators, resembling the join operator in relational databases.

STOCHASTIC CONTEXT-FREE GRAMMAR

This is a probabilistic version of a context-free grammar. It is a generalization of a hidden Markov


model to the context-free setting. It is suited to modeling palindrome languages such as those
arising in RNA secondary structures (nested base pairings).

SUBSTITUTION MATRIX

This is a 20-by-20 matrix S where S (a, b) indicates the likelihood of a mutatation of a residue a
to residue b (or vice-versa). It is used to score an alignment of two sequences, most significantly
the alignment of a database hit to the query sequence. The BLOSUM and PAM family of
substitution matrices are the most widely used, in tools such as BLAST and many others.

22
SUFFIX TREE

This is a clever data structure used to speed up the search for exact matches of a query sequence
to fragments in a database. The database is preprocessed and stored in the form of a suffix tree.
Locating the first match then only takes time proportional to the length of the query sequence,
which is amazingly fast. Suffix trees have other uses, including finding repeats in a sequence.

SWISS-PROT

This is a curated database of protein sequences with a high level of annotation (functions of a
protein, domains in it, etc) and low redundancy.

SWISS-MODEL

This is a program for predicting the tertiary structure of a protein from its sequence, via the
prediction by homology method. The query sequence x is first searched against sequences in the
crystallographic database ExPdb. If a clear homolog y is found (i.e., y has sufficient identity with x)
an atomic model for the predicted structure of x is then built, using y. If no clear homolog is found,
the process is aborted.

TANDEM REPEAT FINDER

This is a tool that finds tandem repeats---two or more exact or near-exact copies of the same
sequence fragment that are adjacent---in a nucleotide sequence.

TBLASTN

This is the version of BLAST that searches a protein query sequence against a version of a
nucleotide database translated on-the-fly in all reading frames.

TBLASTX

This is the version of BLAST that searches a nucleotide sequence translated in all reading frames
against a nucleotide database translated on-the-fly in all reading frames.

tRNA-SCAN-SE

THREADING

This is an alignment of a protein sequence to the 3D structure of another protein.

TOPITS

This is a program for predicting the tertiary structure of a protein from its sequence. It uses a
database of secondary structure strings derived from tertiary structures in PDB. (This database has
one such string for each tertiary structure in PDB.) First, this program uses the PHDsec program
to predict the secondary structure string sse(x) of a protein sequence x. Next, it aligns, one by one,
this predicted string sse( x) with each secondary structure string in the database. (Alignments are
done via dynamic programming.) Finally, it picks a high-scoring alignment (sse(x), sse(y)) where

23
sse(y) is a string from the database derived from tertiary structure y from PDB. Finally, it predicts
the structure of x to be y.

TREMBL

This is an automatically annotated adjunct to SWISS-PROT.

UNGAPPED ALIGNMENT

This is the process of finding highly similar regions of two DNA or protein sequences. Gaps are not
permitted to be inserted.

UPGMA

This is a distance-based method for building a phylogenetic tree from a multiply aligned set of
sequences. It performs a hierarchical clustering of the given data set, which yields this tree.

VAST

This is a structure-to-structure alignment method for structures in MMDB. It works by aligning the
secondary structure elements in the two proteins.

24

Das könnte Ihnen auch gefallen