Bioinfo Training Material

BIOINFORMATICS
TRAINING PROGRAM
May 16-18, 2007
Reference Material
Indian Institute of Advanced Research
www.helpBIOTECH.blogspot.com | Your Gate Way to Life Science Career

BIOINFORMATICS
1.0 Introduction
2.0 Protein Structure
3.0 Genome Analysis
4.0 Phylogeny
5.0 Modeling
6.0 Tools for Structure based drug design and docking
7.0 Computational Resources
1.0 Introduction
Bioinformatics is the field of science in which biology, computer science, and
information technology merge to form a single discipline. The ultimate goal of the field is
to enable the discovery of new biological insights as well as to create a global perspective
from which unifying principles in biology can be discerned. At the beginning of the
"genomic revolution", a bioinformatics concern was the creation and maintenance of a
database to store biological information, such as nucleotide and amino acid sequences.
Development of this type of database involved not only design issues but also the
development of complex interfaces through which researchers could both access existing
data and submit new or revised data.
Ultimately, however, all of this information must be combined to form a comprehensive
picture of normal cellular activities so that researchers may study how these activities are
altered in different disease states. Therefore, the field of bioinformatics has evolved such
that the most pressing task now involves the analysis and interpretation of various types
of data, including nucleotide and amino acid sequences, protein domains, and protein
structures. It is hoped that the process of analyzing and interpreting these data will lead
to the elucidation of the underlying principles of biological phenomena.
Some Important Landmarks in the development of Bioinformatics:
1962 The first theory of molecular evolution; the Molecular Clock concept (Linus
Pauling and Emile Zuckerkandl)
1965 Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and
coworkers)
1970 Needleman-Wunsch algorithm for global protein sequence alignment
1977 New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers);
bacteriophage φX174 sequence
1977 First software for sequence analysis (Roger Staden)
1977 Phylogenetic taxonomy; archaea discovered; the notion of the three primary
kingdoms of life introduced (Carl Woese and coworkers)
1981 Smith-Waterman algorithm for local protein sequence alignment
1981 Human mitochondrial genome sequenced

1981 The concept of a sequence motif (Russell Doolittle)
1982 GenBank Release 3 made public
1982 Phage λ genome sequenced (Fred Sanger and coworkers)
1983 The first practical sequence database searching algorithm (John Wilbur and David
Lipman)
1985 FASTP/FASTN: fast sequence similarity searching (William Pearson and David
Lipman)
1986 Introduction of Markov models for DNA analysis (Mark Borodovsky and
coworkers)
1987 First profile search algorithm (Michael Gribskov, Andrew McLachlan, David
Eisenberg)
1988 National Center for Biotechnology Information (NCBI) created at NIH/NLM
1988 EMBnet network for database distribution created
1990 BLAST: fast sequence similarity searching with rigorous statistics (Stephen
Altschul, David Lipman and coworkers)
1991 EST: expressed sequence tag sequencing (Craig Venter and coworkers)
1994 Hidden Markov Models of multiple alignments (David Haussler and coworkers;
Pierre Baldi and coworkers)
1994 SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and
coworkers)
1995 First bacterial genomes completely sequenced
1996 First archaeal genome completely sequenced
1996 First eukaryotic genome (yeast) completely sequenced
1997 Introduction of gapped BLAST and PSI-BLAST
1997 COGs: Evolutionary classification of proteins from complete genomes
1998 Worm genome, the first multicellular genome, (nearly) completely sequenced
1999 Fly genome (nearly) completely sequenced
2001 Human genome (nearly) completely sequenced

2.0 Protein Structure
A set of 20 different subunits, called amino acids, can be arranged in any order to form a
polypeptide that can be thousands of amino acids long. These chains can then loop about
each other or fold, in a variety of ways, but only one of these ways allows a protein to
function properly. The critical feature of a protein is its ability to fold into a conformation
that creates structural features, such as surface grooves, ridges, and pockets, which allow
it to fulfill its role in a cell. A protein's conformation is usually described in terms of
levels of structure. Traditionally, proteins are looked upon as having four distinct levels
of structure, with each level of structure dependent on the one below it. In some proteins,
functional diversity may be further amplified by the addition of new chemical groups
after synthesis is complete.
The stringing together of the amino acid chain to form a polypeptide is referred to as the
primary structure. The secondary structure is generated by the folding of the primary
sequence and refers to the path that the polypeptide backbone of the protein follows in
space. Certain types of secondary structures are relatively common. Two well-described
secondary structures are the alpha helix and the beta sheet. In the first case, certain types
of bonding between groups located on the same polypeptide chain cause the backbone to
twist into a helix, most often in a form known as the alpha helix. Beta sheets are formed
when a polypeptide chain bonds with another chain that is running in the opposite
direction. Beta sheets may also be formed between two sections of a single polypeptide
chain that is arranged such that adjacent regions are in reverse orientation.
The tertiary structure describes the organization in three dimensions of all of the atoms
in the polypeptide. If a protein consists of only one polypeptide chain, this level then
describes the complete structure.
Multimeric proteins, or proteins that consist of more than one polypeptide chain, require
a higher level of organization. The quaternary structure defines the conformation
assumed by a multimeric protein. In this case, the individual polypeptide chains that
make up a multimeric protein are often referred to as the protein subunits. The four
levels of protein structure are hierarchical, that is, each level of the build process is
dependent upon the one below it.
A protein's primary amino acid sequence is crucial in determining its final structure. In
some cases, amino acid sequence is the sole determinant, whereas in other cases,
additional interactions may be required before a protein can attain its final conformation.
For example, some proteins require the presence of a cofactor, or a second molecule that
is part of the active protein, before they can attain their final conformation. Multimeric
proteins often require one or more subunits to be present for another subunit to adopt the
proper higher order structure. The entire process is cooperative, that is, the formation of
one region of secondary structure determines the formation of the next region.
Allosteric Proteins: These are proteins that, under certain conditions, have a stable
alternate conformation, or shape, that enables them to carry out a different biological
function. The interaction of an allosteric protein with a specific cofactor, or with another
protein, may influence the transition of the protein between shapes. In addition, any
change in conformation brought about by an interaction at one site may lead to an
alteration in the structure, and thus function, at another site. One should bear in mind,
though, that this type of transition affects only the protein's shape, not the primary amino
acid sequence. Allosteric proteins play an important role in both metabolic and genetic
regulation.
Protein structure determination: Traditionally, a protein's structure was determined
using one of two techniques: X-ray crystallography or nuclear magnetic resonance
(NMR) spectroscopy.
X-ray Crystallography: Crystals are a solid form of a substance in which the component
molecules are present in an ordered array called a lattice. The basic building block of a
crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's
components, the smallest possible set that is fully representative of the crystal. When the
crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam;
therefore, many molecules are in the same orientation with respect to the incoming X-
rays. The X-ray beam enters the crystal and a number of smaller beams emerge: each one
in a different direction, each one with a different intensity. If an X-ray detector, such as a
piece of film, is placed on the opposite side of the crystal from the X-ray source, each
diffracted ray, called a reflection, will produce a spot on the film. However, because only
a few reflections can be detected with any one orientation of the crystal, an important
component of any X-ray diffraction instrument is a device for accurately setting and
changing the orientation of the crystal. The set of diffracted, emerging beams contains
information about the underlying crystal structure.
The major drawback associated with this technique is that crystallization of the proteins
is a difficult task. Crystals are formed by slowly precipitating proteins under conditions
that maintain their native conformation or structure. These exact conditions can only be
discovered by repeated trials that entail varying certain experimental conditions, one at a
time. This is a very time consuming and tedious process.
Nuclear Magnetic Resonance (NMR) Spectroscopy: The basic phenomenon of NMR
spectroscopy was discovered in 1945. In this technique, a sample is immersed in a
magnetic field and irradiated with radio waves. As each positively charged nucleus spins,
the moving charge creates what is called a magnetic moment. When the radio waves hit the
spinning nuclei, they tilt even more, sometimes flipping over. These resonating nuclei emit
a unique signal that is then picked up by a detector and processed by the Fourier Transform
algorithm, a mathematical procedure that converts the recorded signal into a spectrum of
frequencies that a scientist can interpret. By measuring the frequencies at which different nuclei flip, scientists can
determine molecular structure, as well as many other interesting properties of the
molecule. In the past 10 years, NMR has proven to be a powerful alternative to X-ray
crystallography for the determination of molecular structure. NMR has the advantage
over crystallographic techniques in that experiments are performed in solution as opposed
to a crystal lattice. However, the principles that make NMR possible tend to make this
technique very time consuming and limit the application to small- and medium-sized
molecules.
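The role of the Fourier transform mentioned above can be illustrated with a small sketch. It does not process real NMR data: the "detector signal" is a synthetic sum of two decaying oscillations whose frequencies, amplitudes and decay time are made-up values, and the transform is a plain discrete Fourier transform written out for clarity (real instruments use the much faster FFT).

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitude of each frequency bin (up to half the length) of the DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2)
    ]

# Synthetic time-domain signal (assumed values): two groups of nuclei that
# resonate at 20 and 45 cycles per record, both decaying over time.
n_points = 256
signal = [
    math.exp(-t / 80) * (1.0 * math.cos(2 * math.pi * 20 * t / n_points)
                         + 0.5 * math.cos(2 * math.pi * 45 * t / n_points))
    for t in range(n_points)
]

spectrum = dft_magnitudes(signal)

# A peak is a bin larger than both neighbours and above 25% of the maximum.
peaks = [
    k for k in range(1, len(spectrum) - 1)
    if spectrum[k] > spectrum[k - 1]
    and spectrum[k] > spectrum[k + 1]
    and spectrum[k] > 0.25 * max(spectrum)
]
print("resonance frequencies found at bins:", peaks)
```

The time-domain signal is hard to read by eye, whereas the spectrum shows one peak per resonance frequency, which is the form in which NMR data are usually interpreted.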

3.0 Genome Analysis
Homology refers to two genes sharing a common evolutionary history. Scientists also use
the term homology, or homologous, to simply mean similar, regardless of the
evolutionary relationship. In comparative genomics, one of the major tasks is the
identification of homologous genes in different organisms. An important tool
for this task is BLAST (Basic Local Alignment Search Tool).
The BLAST algorithm is a heuristic program, which means that it relies on some smart
shortcuts to perform the search faster. BLAST performs "local" alignments. Most
proteins are modular in nature, with functional domains often being repeated within the
same protein as well as across different proteins from different species. The BLAST
algorithm is tuned to find these domains or shorter stretches of sequence similarity. The
local alignment approach also means that an mRNA can be aligned with a piece of
genomic DNA, as is frequently required in genome assembly and analysis. If instead
BLAST started out by attempting to align two sequences over their entire lengths (known
as a global alignment), fewer similarities would be detected, especially with respect to
domains and motifs.
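The difference between local and global alignment can be made concrete with a small dynamic-programming sketch. This is not how BLAST works internally (BLAST is a heuristic); it uses the exact Needleman-Wunsch (global) and Smith-Waterman (local) recurrences, with an assumed toy scoring scheme (match +2, mismatch -1, gap -2) and made-up sequences that share a single domain.

```python
def alignment_score(a, b, local, match=2, mismatch=-1, gap=-2):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + s,   # align two residues
                       score[i - 1][j] + gap,     # gap in b
                       score[i][j - 1] + gap)     # gap in a
            if local:
                cell = max(cell, 0)               # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

# Two hypothetical proteins that share one domain but nothing else.
domain = "HEAGAWGHEE"
protein_a = "MKVLT" + domain + "PQRS"
protein_b = "DDDDDDDD" + domain + "CCCCCCCCCCCC"

local_score = alignment_score(protein_a, protein_b, local=True)
global_score = alignment_score(protein_a, protein_b, local=False)
print("local :", local_score)
print("global:", global_score)
```

The local score reflects only the shared domain, while the global score is dragged down by the unrelated flanking regions, which is why a global approach would miss many domain-level similarities.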
When a query is submitted via one of the BLAST Web pages, the sequence, plus any
other input information such as the database to be searched, word size, expect value, and
so on, are fed to the algorithm on the BLAST server. BLAST works by first making a
look-up table of all the "words" (short subsequences, three letters long by default for
proteins) in the query sequence, together with their "neighboring words", i.e., similar
words. The sequence database is then scanned for these "hot spots". When a match is
identified, it is used to initiate gap-free and gapped extensions of the "word".
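The seeding step described above can be sketched as follows. This is a simplified illustration rather than the BLAST implementation: it indexes exact words only (neighboring words, which BLAST scores with a substitution matrix, are omitted), and the gap-free extension simply continues while residues are identical instead of using a score-based cut-off. The function names and the two sequences are hypothetical.

```python
from collections import defaultdict

def build_word_table(query, word_size=3):
    """Map every word of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table

def find_seeds(query, subject, word_size=3):
    """Scan the subject once and report (query_pos, subject_pos) word hits."""
    table = build_word_table(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            seeds.append((i, j))
    return seeds

def extend_ungapped(query, subject, i, j, word_size=3):
    """Grow a seed in both directions while residues stay identical."""
    start_q, start_s = i, j
    while (start_q > 0 and start_s > 0
           and query[start_q - 1] == subject[start_s - 1]):
        start_q -= 1
        start_s -= 1
    end_q, end_s = i + word_size, j + word_size
    while (end_q < len(query) and end_s < len(subject)
           and query[end_q] == subject[end_s]):
        end_q += 1
        end_s += 1
    return query[start_q:end_q], start_q, start_s

query = "MKVLTHEAGAWGHEE"      # made-up query sequence
subject = "DDDHEAGAWGHEECC"    # made-up database sequence
seeds = find_seeds(query, subject)
print("first seed:", seeds[0])
print("extended  :", extend_ungapped(query, subject, *seeds[0]))
```

Because the look-up table is built once from the query, each database sequence is scanned in a single pass, which is the shortcut that makes the search fast.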
BLAST Scores and Statistics: Once BLAST has found a similar sequence to the query
in the database, it is helpful to have some idea of whether the alignment is "good" and
whether it portrays a possible biological relationship, or whether the similarity observed
is attributable to chance alone. BLAST uses statistical theory to produce a bit score and
expect value (E-value) for each alignment pair (query to hit).
The bit score gives an indication of how good the alignment is; the higher the score, the
better the alignment. In general terms, this score is calculated from a formula that takes
into account the alignment of similar or identical residues, as well as any gaps introduced
to align the sequences. A key element in this calculation is the "substitution matrix",
which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix
is the default for most BLAST programs, the exceptions being blastn and MegaBLAST
(programs that perform nucleotide-nucleotide comparisons and hence do not use protein-
specific matrices). Bit scores are normalized, which means that the bit scores from
different alignments can be compared, even if different scoring matrices have been used.
The E-value gives an indication of the statistical significance of a given pairwise
alignment and reflects the size of the database and the scoring system used. The lower the
E-value, the more significant the hit. The E-value is the number of alignments with an
equal or better score that would be expected to occur by chance in a search of that
database; an E-value of 0.05 therefore corresponds to roughly a 5 in 100 (1 in 20) chance
of seeing such a hit by chance alone.
Although a statistician might consider this to be significant, it still may not represent a
biologically meaningful result, and analysis of the alignments (see below) is required to
determine "biological" significance. The score and E-value are calculated using the
equations given below:
\[ S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m n \, 2^{-S'} \]
where S' is the normalized score, S is the raw score, λ and K are constants, and m and n
are the lengths of the query and hit sequences.
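The two formulas above can be evaluated directly. The sketch below is a minimal illustration: the values of λ and K depend on the scoring matrix and gap penalties, and the numbers used here are assumed for demonstration only, as are the sequence lengths. In a database search, n is usually taken to be the total length of the database, which is how the E-value comes to reflect database size.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(raw_score, m, n, lam, k):
    """Expected number of chance hits, E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, k))

# Illustrative constants and lengths (assumptions, not taken from this text).
LAMBDA, K = 0.267, 0.041
QUERY_LENGTH, SEARCH_SPACE = 250, 1_000_000

for raw in (50, 100, 150):
    bits = bit_score(raw, LAMBDA, K)
    expect = e_value(raw, QUERY_LENGTH, SEARCH_SPACE, LAMBDA, K)
    print(f"raw score {raw}: {bits:.1f} bits, E = {expect:.3g}")
```

Each additional bit of score halves the E-value, so a modest increase in the raw score makes a chance match far less likely.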
Tools for Comparative Genomics:

1. CisMols
CisMols (Cis-regulatory Modules) is a tool that identifies compositionally
predicted cis-clusters that occur in groups of co-regulated genes within each of their
ortholog-pair evolutionarily conserved cis-regulatory regions.
2. CMR
The Comprehensive Microbial Resource (CMR) gives access to a central
repository of the sequence and annotation of all complete public prokaryotic genomes as
well as comparative genomics tools across all of the genomes in the database.
3. DAVID Bioinformatic Resources
The Database for Annotation, Visualization and Integrated Discovery (DAVID)
provides a comprehensive set of functional annotation tools for investigators to
understand the biological meaning behind large lists of genes.
4. DCODE.ORG
The dcode.org website provides access to tools for comparative genomic analyses
developed by the Comparative Genomics Center at the Lawrence Livermore National
Laboratory. Tools include: zPicture, Mulan, eShadow, rVista, CREME, and the ECR
Browser.
5. EnteriX
EnteriX is a collection of tools for viewing pairwise and multiple alignments for
bacterial genome sequences.
6. FootPrinter
FootPrinter is a program for phylogenetic footprinting that identifies regions of
DNA that are well conserved across a set of orthologous sequences in order to infer
phylogenetic relationships.
7. FootPrinter3
FootPrinter3 is a web server for predicting transcription factor binding sites
(TFBS) by using phylogenetic footprinting. FootPrinter3 extends the motif discovery
algorithms of FootPrinter by making use of local multiple sequence alignment blocks
when those are available and reliable, while also allowing motifs to be found in
unalignable regions.
8. GenomeTraFaC
GenomeTraFaC is a comparative genomics based resource for initial
characterization of gene models and the identification of putative cis-regulatory regions
of RefSeq gene orthologs.
9. GENSTYLE
GENSTYLE is based on the genomic signature paradigm and allows the user to
classify and characterize nucleotide sequences using oligonucleotide frequencies.
10. IBM Genome Annotation Page
IBM's Bio-Dictionary-based Annotations Of Completed Genomes page lists
annotations for over 75 complete genomes (archaea, bacteria, eukaryotes, and viruses).
You can query these annotations at the sequence level as well as search/compare across
genomes.
11. Integrated Microbial Genomes (IMG)
The Integrated Microbial Genomes (IMG) system facilitates the comparison of
genomes sequenced by the Joint Genome Institute (JGI). It can be searched using
keywords or BLASTp, and the gene records displayed include biochemical properties,
protein domains, chromosomal location and neighbourhood and lists of paralogues and
orthologues. One can easily build a list of genomes to be considered or excluded from the
search and the Phylogenetic Profiler tool allows one to refine the selection by building a
list of homologues either common to or excluded from specific organisms.
12. ISC Large-scale Sequencing Project Database

The International Sequencing Consortium (ISC) Large-scale Sequencing Project
Database contains information on current and completed sequencing projects, including
project timelines, funding agencies, sequencing strategy and links out to project web
pages.
13. Mauve
Mauve is a stand-alone software tool for constructing multiple genome
alignments.
14. MicroFootPrinter
MicroFootPrinter identifies the conserved motifs in regulatory regions of
prokaryotic genomes using the phylogenetic footprinting program FootPrinter.
15. MIPS

Munich Information Centre for Protein Sequences projects include: fungal
genome analysis, plant genome bioinformatics, structural genomics, proteomics and
genome annotation. Projects and databases include: CYGD, MNCDB, NGFN, MPPI,
SIMAP, QUIPOS, MATDB, MOsDB, SPUTNIK, and PEDANT.
16. MLST
MLST (Multi Locus Sequence Typing) is a nucleotide sequence based approach
for the unambiguous characterisation of isolates of bacteria and other organisms using the
sequences of internal fragments of seven house-keeping genes.
17. NEMBASE2
NEMBASE2 is a database resource for EST datasets for 37 species of nematode.
Sequences are clustered to reduce redundancy. Comparisons can be made by library and at
the sequence level; a visualisation tool is included. Coding region predictions for each
cluster, and further annotations such as GO terms and physical properties, are also included.
18. PartiGeneDB
PartiGeneDB is a database of about 300 partial genomes from eukaryotic
organisms that have been assembled from EST data.
19. PhenomicDB
PhenomicDB integrates the genotype and phenotype information of several
organisms from public data sources. The mapping of phenotypic data fields allows cross-
species phenotype comparison.
20. Phydbac2
Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and
explore the phylogenomic profiles of bacterial protein sequences. It also allows the user
to view sequence similarity across different organisms, access other genes with similar
conservation profiles, and view genes that are found nearby a selected gene in multiple
genomes.
21. Projector 2
Projector 2 allows users to map completed portions of the genome sequence of an
organism onto the finished (or unfinished) genome of a closely-related species or strain.
Using the related genome sequence as a template can facilitate sequence assembly and
the sequencing of the remaining gaps.
22. Sockeye
Sockeye is a visualization tool allowing one to assemble and analyze genomic
information in a three dimensional workspace. It can be used to view features at various
levels, ranging from SNPs to karyotypes. Sockeye displays genomic features along
tracks, and links to the Ensembl database.
23. SPRING

Sorting Permutation by Reversals and Block Interchanges (SPRING) is a tool for
the analysis of genome rearrangements. SPRING takes two or more chromosomes as its
input and then computes a minimum series of reversals and/or block-interchanges for
transforming one chromosome into another. Phylogenetic trees based on the
rearrangement analysis are also shown as part of the results.
24. SVC
SVC (Structured Visualization of Evolutionary Conserved Sequences) is a tool
that can search for pairs of orthologous genes, align the protein coding sequences, and
visualize the evolutionary sequence conservation mapped back onto the gene structure
scaffold.
25. T-STAG
Tissue-Specific Transcripts And Genes (T-STAG) is a system integrating EST,
gene expression, alternative splicing and human-mouse orthology information for the
analysis of tissue-specific gene and transcript expression patterns.
26. TIGR Software Tools

A list of open-source software packages available for free from The Institute for
Genomic Research (TIGR).
27. TraFaC
TraFaC (Transcription Factor Binding Site Comparison) is a tool that identifies
regulatory regions using a comparative sequence analysis approach.
28. Viral Bioinformatics

Viral Bioinformatics provides access to viral genomes and a variety of tools for
comparative genomic analyses.
29. YOGY
Eukaryotic Orthology (YOGY) is a resource for retrieving orthologous proteins
from nine eukaryotic organisms. Using a gene or protein identifier as a query, this
database provides comprehensive, combined information on orthologs in other species
using data from five independent resources: KOGs, Inparanoid, Homologene,
OrthoMCL, and a table of curated orthologs between budding yeast and fission yeast.
Associated Gene Ontology (GO) terms of orthologs can also be retrieved.

4.0 Phylogeny
New insight into the molecular basis of a disease may come from investigating the
function of homologs of a disease gene in model organisms. In this case, homology refers
to two genes sharing a common evolutionary history. Scientists also use the term
homology, or homologous, to simply mean similar, regardless of the evolutionary
relationship.
Equally exciting is the potential for uncovering evolutionary relationships and patterns
between different forms of life. With the aid of nucleotide and protein sequences, it
should be possible to find the ancestral ties between different organisms. Thus far,
experience has taught us that closely related organisms have similar sequences and that
more distantly related organisms have more dissimilar sequences. Proteins that show
significant sequence conservation, indicating a clear evolutionary relationship, are said to
be from the same protein family. By studying protein folds (distinct protein building
blocks) and families, scientists are able to reconstruct the evolutionary relationship
between two species and to estimate the time of divergence between two organisms since
they last shared a common ancestor.
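The observation that closely related organisms have more similar sequences can be quantified with a simple distance measure. The sketch below computes the p-distance (the fraction of aligned positions that differ) for made-up, pre-aligned toy sequences; the species labels are hypothetical. The raw p-distance tends to underestimate divergence between distant sequences, so in practice corrected distances are often used.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned positions that differ, ignoring gap columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(seq1, seq2):
        if x == "-" or y == "-":
            continue                      # skip columns containing a gap
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared

# Made-up aligned sequences for three hypothetical species.
alignment = {
    "species_A": "ATGCTAGCTAGGCTA",
    "species_B": "ATGCTAGCTAGGCTT",
    "species_C": "ATGTTAGATAGG-TA",
}

names = sorted(alignment)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(alignment[first], alignment[second])
        print(f"{first} vs {second}: {d:.3f}")
```

In this toy data set, species A and B differ at one position while species C differs from both at several, which would place A and B closer together in a tree.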
The Three Domains of Life: In the mid-1970s, while studying some unusual groups of
bacteria, thermophilic methanogens and halophiles, Carl Woese and colleagues concluded
that these organisms were not really bacteria but should be assigned to a separate domain
of life with the same status as bacteria and eukaryotes. This group was originally referred
to as archaebacteria and later renamed archaea. The uniqueness of the archaea was
apparent, even from some of their biochemical features, such as the unusual structure of
lipids and the topology of phylogenetic trees of 16S rRNA. These trees clearly indicated
that archaea comprised a unique branch of life, distinct from both bacteria and
eukaryotes. Furthermore, although, phenotypically, archaea are obviously prokaryotes,
like bacteria, i.e. have small cells without nuclei or organelles, they are, in some
important respects, closer to eukaryotes than to bacteria. These eukaryote-like features of
archaea include the structure of the ribosomes, which have a number of proteins shared
with eukaryotes but not with bacteria, the presence of histones (in one of the two major
branches of archaea), the organization of the basal transcriptional apparatus, with several
transcription factors of the eukaryotic variety, and the organization of the DNA
replication apparatus, which is also conserved in archaea and eukaryotes but not in
bacteria.
The evolutionary process:
Genetic Variation: Evolution is not always discrete with clearly defined boundaries that
pinpoint the origin of a new species, nor is it a steady continuum. Evolution requires
genetic variation which results from changes within a gene pool, the genetic make-up of a
specific population. A gene pool is the combination of all the alleles (alternative forms
of a genetic locus) for all traits that population may exhibit. Changes in a gene pool can
result from mutation (variation within a particular gene) or from changes in gene
frequency (the proportion of an allele in a given population).
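As a small worked example of gene (allele) frequency, the sketch below counts the alleles in a hypothetical sample of ten diploid individuals at a single locus with two alleles, written A and a. The genotype counts are invented for illustration.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Proportion of each allele among all gene copies in the sample."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical sample: 4 AA, 4 Aa and 2 aa individuals (20 gene copies).
population = ["AA"] * 4 + ["Aa"] * 4 + ["aa"] * 2
frequencies = allele_frequencies(population)
print(frequencies)
```

Here allele A accounts for 12 of the 20 gene copies and allele a for the remaining 8; a change in these proportions from one generation to the next is a change in the gene pool.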

Every organism possesses a genome that contains all of the biological information needed
to construct and maintain a living example of that organism. The biological information
contained in a genome is encoded in the nucleotide sequence of its DNA or RNA
molecules and is divided into discrete units called genes. The information stored in a
gene is read by proteins, which attach to the genome and initiate a series of reactions
called gene expression.
Every time a cell divides, it must make a complete copy of its genome, a process called
DNA replication. DNA replication must be extremely accurate to avoid introducing
mutations or changes in the nucleotide sequence of a short region of the genome.
Inevitably, some mutations do occur, usually in one of two ways: either from errors in
DNA replication or from the damaging effects of mutagens, such as chemicals and radiation.
Mutations in the coding regions of genes are much more important than those in noncoding
regions. Those mutations that
do have an evolutionary effect can be divided into two categories, loss-of-function
mutations and gain-of-function mutations. A loss-of-function mutation results in reduced
or abolished protein function. Gain-of-function mutations, which are much less common,
confer an abnormal activity on a protein.
Phylogenetic Trees: Systematics describes the pattern of relationships among taxa and is
intended to help us understand the history of all life. But history is not something we can
see—it has happened once and leaves only clues as to the actual events. Scientists use
these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the
most convenient way of visually presenting evolutionary relationships among a group of
organisms is through illustrations called phylogenetic trees.
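To show how a tree can be derived from pairwise distances, the sketch below implements UPGMA, one of the simplest distance-based clustering methods. It is offered only as an illustration and is not a method prescribed by this text: the taxa and distances are made up, branch lengths are omitted, and UPGMA assumes that sequences evolve at a constant rate. The result is printed in Newick format, a text notation for trees that many phylogeny programs can read.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA and return the topology as a Newick string.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance for every pair
    """
    clusters = {name: 1 for name in names}   # Newick label -> number of leaves
    d = dict(dist)                           # work on a copy
    while len(clusters) > 1:
        pair = min(d, key=d.get)             # the two closest clusters
        a, b = sorted(pair)
        merged = "(%s,%s)" % (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for other in clusters:
            d_a = d.pop(frozenset((a, other)))
            d_b = d.pop(frozenset((b, other)))
            d[frozenset((merged, other))] = (
                (size_a * d_a + size_b * d_b) / (size_a + size_b)
            )
        del d[pair]
        clusters[merged] = size_a + size_b
    return next(iter(clusters)) + ";"

# Made-up pairwise distances between four taxa (illustrative values only).
pairs = {
    ("human", "chimp"): 0.02,
    ("human", "gorilla"): 0.03,
    ("chimp", "gorilla"): 0.03,
    ("human", "orangutan"): 0.06,
    ("chimp", "orangutan"): 0.06,
    ("gorilla", "orangutan"): 0.06,
}
distances = {frozenset(p): v for p, v in pairs.items()}
tree = upgma(["human", "chimp", "gorilla", "orangutan"], distances)
print(tree)
```

The nested parentheses group the taxa that were joined first, so the most closely related pair appears innermost in the output.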
Tools for phylogeny reconstruction
1. Bioinformatics Toolkit
This Toolkit is a collection of a wide range of tools and links for sequence analysis,
function, and structure prediction. This resource offers convenient web interfaces for
many freely available tools.
2. CIPRes
The Cyberinfrastructure for Phylogenetic Research (CIPRes) project aims to develop a
computational infrastructure for systematics. Other goals of the project include providing
a central resource enabling computational systematics and education and training
initiatives. The website also contains a substantial list of links to related software.
3. Codon Usage Database

Find GC content and frequency of codon usage for any organism that has a sequence in
GenBank.
4. ConSeq
ConSeq is a tool for predicting functionally and structurally important amino acid
residues in protein sequences. The predictions are based on the assumptions that residues
of functional importance are often conserved and solvent-accessible, and those of
structural importance are often conserved and located in the protein core. A multiple
sequence alignment is used to predict the relative solvent accessibility state and the
evolutionary rate at each residue.
5. cpnDB
cpnDB is a curated collection of chaperonin sequence data collected from public
databases or generated by a network of collaborators exploiting the cpn60 target in
clinical, phylogenetic and microbial ecology studies. The database contains all available
sequences for both group I and group II chaperonins. cpnDB is built and maintained with
open source tools.
6. IBM Genome Annotation Page
IBM's Bio-Dictionary-based Annotations Of Completed Genomes page lists annotations
for over 75 complete genomes (archaea, bacteria, eukaryotes, and viruses). You can query
these annotations at the sequence level as well as search/compare across genomes.
7. JEvTrace
JEvTrace is a tool that combines multiple sequence alignment, phylogenetic, and
structural data for identification of functional sites in proteins.
8. Joe's Site - Phylogeny Programs
Comprehensive list of phylogeny packages, compiled by Joe Felsenstein, creator of
PHYLIP.
9. MEGA
MEGA (Molecular Evolutionary Genetics Analysis) is a software package for
phylogenetic analysis with a graphical user interface. It allows viewing and editing of the
aligned input sequence data and provides many tools for phylogenetic and statistical
analysis of the alignments.
10. Mesquite
Mesquite is an open source software project designed to deal with comparative data
about organisms and evolutionary analyses. Mesquite contains modules for phylogenetic
analysis, population genetics, and non-phylogenetic multivariate analysis.
11. MIGenAS Toolkit

Max-Planck Integrated Gene Analysis System (MIGenAS) provides access to many
different bioinformatics software tools and databases for sequence similarity searching,
multiple sequence alignments, phylogenetic analysis, and protein structure prediction.
Users can also configure "meta"-tools as a pipeline of individual tools and intermediate
filters.
12. MINER

MINER is a tool for the identification and visualization of phylogenetic motifs (regions
within a multiple sequence alignment (MSA) that conserve the overall phylogeny of the
complete family).
13. MPI Toolkit

Max-Planck Institute Bioinformatics Toolkit provides access to many different
bioinformatics software tools and databases for sequence similarity searching, multiple
sequence alignments, phylogenetic analysis, and protein structure prediction.
14. NCBI Taxonomy Database

Taxonomic classification of all organisms with sequences in GenBank.
15. NEWT
NEWT is the taxonomy database maintained by the UniProt group.
16. NJplot
NJplot is a tool for visualizing binary trees such as the phylogenetic trees output from the
PHYLIP programs. Available for several platforms including Windows, MacOS, Linux
and Solaris.
17. Orthologue Search Service

BLAST a protein sequence then perform automated phylogenetic analysis to detect
orthologous sequences.
18. PAL2NAL
PAL2NAL converts a multiple sequence alignment of proteins and the
corresponding DNA (or mRNA) sequences into a codon alignment. Synonymous (Ks)
and non-synonymous (Ka) substitution rates can be calculated.
19. Phydbac2
Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and
explore the phylogenomic profiles of bacterial protein sequences. It also allows the user
to view sequence similarity across different organisms, access other genes with similar
conservation profiles, and view genes that are found nearby a selected gene in multiple
genomes.
20. PHYLIP
Comprehensive set of programs for phylogenetic analyses; available for PC and Mac;
source code available for easy compiling in UNIX.
21. PhyloBLAST
BLAST a protein sequence, then perform automated phylogenetic analysis on hits or on
uploaded sequences; PHYLIP-based analyses.

22. PhyloDome
PhyloDome is a tool with which you can visualize and analyze the phylogenetic
distribution of one or more eukaryotic domains.
23. PHYML
Phyml is a program that constructs phylogenetic trees from sequence alignments using
the maximum likelihood method.
24. POWER
The Phylogenetic Web Repeater (POWER) allows users to perform phylogenetic
analysis using the PHYLIP package. The POWER pipeline can start with processing
either multiple sequence alignments (MSA) or can proceed directly with aligned
sequences.
25. ProtTest
ProtTest is a program that determines the best-fit model of evolution, among a set
of candidate models, for a given protein sequence alignment.
26. Puzzleboot
Puzzleboot is a UNIX shell script facilitating bootstrap analysis using TREE-PUZZLE
and PHYLIP. It enhances TREE-PUZZLE by allowing one to analyse multiple datasets,
and can be used for both protein and DNA distance bootstrap analysis.
27. Ribosomal Database Project

Highly curated database of aligned and annotated rRNA sequences with accompanying
phylogenies; data available for download.
28. STING Millennium

STING is a suite of tools for the analysis of protein sequence, structure, stability
and function - and the relationships between them.
29. SWAKK
Sliding Window Analysis of Ka and Ks (SWAKK) is a tool for detecting positive
selection in proteins using a sliding window substitution rate analysis. The program can
display the results on a 3D protein structure.
30. T-COFFEE
The T-COFFEE site includes links to a collection of tools for computing,
evaluating, and manipulating multiple alignments of protein sequences and structures.
T-COFFEE is a protein multiple sequence alignment tool that is more accurate than
ClustalW for sequences with less than 30% identity. Expresso (or 3DCoffee) aligns
sequences using structural information. PROTOGENE turns amino acid alignments into
CDS nucleotide alignments.
31. Tree Editors

Tree Editors is an annotated listing of software for the visualization and manipulation of
phylogenetic trees.

32. Tree of Life
Multi-authored project attempting to represent online the entire phylogeny of life on
earth.
33. TREE-PUZZLE
Tree-puzzle is a program that constructs phylogenetic trees from sequence alignments
using the maximum likelihood method.
34. TreeDomViewer
TreeDomViewer is a tool for the visualization of phylogeny and protein domain
structure. TreeDomViewer constructs phylogenetic trees and projects the corresponding
protein domain information onto the multiple sequence alignment.
35. TreeJuxtaposer
TreeJuxtaposer is a free software tool that allows a visual comparison of two trees
in Newick format (phylogenies, taxonomies, gene trees, etc.). It can work with trees
having up to 500,000 nodes, and automatically calculates and marks the differences.
36. TreeView
Generates nice graphics of trees; reads multiple tree file formats; available for
download to Mac or PC.
37. TSEMA
The Server for Efficient Mapping Assessment (TSEMA) predicts possible
protein-protein interactions based on the comparison of phylogenetic trees derived from
sequences of associated protein families.
38. UCMP Phylogeny Wing

"Phylogeny-Diversity of Life Through Time" is an on-line exhibit at the
University of California Museum of Paleontology website. There is an introduction to
phylogenetics and cladistics, and you can navigate through a very informative
phylogenetic tree rooted at the three main domains of life (Archaea, Bacteria and
Eukaryota). At each level of the tree there is a brief summary, and links to more
information about the Fossil Record, Life History and Ecology, Systematics and
Morphology.
39. Understanding Evolution

A fantastic site for teaching/understanding evolution.
40. Weighbor
Weighbor is a tool for building phylogenetic trees from distance matrices. It
employs a weighted version of the neighbour-joining method in which longer distances in
the matrix are given less weight.
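
The pair-selection step of the neighbour-joining method that Weighbor builds on can be
sketched as follows. This is a minimal illustration of plain (unweighted) neighbour
joining, not of Weighbor's weighting scheme; the function names and the five-taxon
distance matrix are invented for the example.

```python
def nj_pick_pair(dist):
    """Return (i, j, q) for the pair of taxa with the smallest Q value."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Q criterion: short mutual distance, long distance to all others
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[2]:
                best = (i, j, q)
    return best


def branch_lengths(dist, i, j):
    """Distances from taxa i and j to the new internal node that joins them."""
    n = len(dist)
    to_i = 0.5 * dist[i][j] + (sum(dist[i]) - sum(dist[j])) / (2 * (n - 2))
    return to_i, dist[i][j] - to_i


# Illustrative symmetric distance matrix for five taxa (made-up values)
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

i, j, q = nj_pick_pair(example)
print("join taxa", i, "and", j, "with Q =", q)
print("branch lengths:", branch_lengths(example, i, j))
```

A full tree is obtained by replacing the joined pair with the new node, recomputing the
distances, and repeating until only two nodes remain.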

5.0 Protein Modeling
The process of evolution has resulted in the production of DNA sequences that encode
proteins with specific functions. In the absence of a protein structure that has been
determined by X-ray crystallography or nuclear magnetic resonance (NMR)
spectroscopy, researchers can try to predict the three-dimensional structure using protein
or molecular modeling. This method uses experimentally determined protein structures
(templates) to predict the structure of another protein that has a similar amino acid
sequence (target).
Identifying a protein's shape, or structure, is key to understanding its biological function
and its role in health and disease. Illuminating a protein's structure also paves the way for
the development of new agents and devices to treat a disease. Yet solving the structure of
a protein is no easy feat. It often takes scientists working in the laboratory months,
sometimes years, to experimentally determine a single structure. Therefore, scientists
have begun to turn toward computers to help predict the structure of a protein based on its
sequence. The challenge lies in developing methods for accurately and reliably
understanding this intricate relationship.
Although molecular modeling may not be as accurate at determining a protein's structure
as experimental methods, it is still extremely helpful in proposing and testing various
biological hypotheses. Molecular modeling also provides a starting point for researchers
wishing to confirm a structure through X-ray crystallography and NMR spectroscopy.
Because the different genome projects are producing more sequences and because novel
protein folds and families are being determined, protein modeling will become an
increasingly important tool for scientists working to understand normal and disease-
related processes in living organisms. Protein modeling involves identification of the
proteins with known three-dimensional structures that are related to the target sequence,
constructing a model for the target sequence based on its alignment with the template
structure(s) and evaluating the model against a variety of criteria to determine if it is
satisfactory.
Evaluating the Alignment: The best way to assess the accuracy is to compare
alignments from sequence comparisons with alignments from protein three-dimensional
structures.
Identification of Structurally Conserved and Structurally Variable Regions:
After the known structures are aligned, they are examined to identify the structurally
conserved regions (SCRs) from which an average structure, or framework, can be
constructed for these regions of the proteins. Variable regions (VRs), in which each of the
known structures may differ in conformation, also must be identified because special
techniques must be applied to model these regions of the unknown protein.
Generating Coordinates for the Unknown Structure: When generating coordinates for
the unknown structure, one needs to model main chain atoms and side chain atoms, both
in SCRs and VRs. For the SCRs, it is straightforward to generate the coordinates of the
main chain atoms of the unknown structure from those of the known structure(s). Side
chain coordinates are copied if the residue type in the unknown is identical or very
similar to that in the known homologues. For other side chain coordinates one can apply a
side chain rotamer library in a systematic approach to explore possible side chain
conformations. It may be desirable to weight the contribution of each homologue in each
SCR based on the extent of similarity with the unknown. In the event that some
coordinates in the unknown are undefined in the SCRs, regularization can be used to
build and relax both main chain and side chain atoms in those regions. Note that this
procedure should be used only if the region of undefined atoms is one or two residues in
length.
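
The idea of weighting each homologue's contribution to an SCR can be illustrated with a
small sketch. It assumes the known structures have already been superimposed and that
equivalent atoms are listed in the same order; the function name, the coordinates, and
the similarity weights are hypothetical.

```python
def weighted_framework(homologues, weights):
    """Average equivalent atom positions, weighting each homologue by similarity.

    homologues: one list of (x, y, z) tuples per known structure, all of the
                same length and already superimposed (an assumption here).
    weights:    one similarity weight per homologue, for example the percent
                identity to the unknown within this SCR.
    """
    total = sum(weights)
    n_atoms = len(homologues[0])
    framework = []
    for a in range(n_atoms):
        framework.append(tuple(
            sum(w * h[a][k] for h, w in zip(homologues, weights)) / total
            for k in range(3)
        ))
    return framework


# Two invented homologues, the first three times as similar to the unknown
known_a = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
known_b = [(4.0, 0.0, 0.0), (0.0, 6.0, 0.0)]
scr_coords = weighted_framework([known_a, known_b], [3.0, 1.0])
print(scr_coords)
```

The resulting positions lie closer to the more similar homologue, which is the intended
effect of the weighting.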
For the VRs, a variety of approaches may be applied in assigning coordinates to the
unknown. Recall that these regions will correspond most often to the loops on the surface
of the protein. If a loop in one of the known structures is a good model for that of the
unknown, then the main chain coordinates of that known structure can be copied. Side
chain coordinates of residues that are similar in length and character also may be copied.
Rotamer libraries can be used to define other side chain coordinates.
When a good model for a loop cannot be found among the known structures, one can
search fragment databases for loops in other proteins that may provide a suitable model
for the unknown. A residue range is chosen to include the undefined loop as well as a few
residues (e.g., three) on either side of the loop for which coordinates have been defined.
Fragments are examined for their ability to fit in the undefined region without making
bad contacts with other atoms and to overlap well with the residues on either side of the
loop. The loop may then be subjected to conformational searching to identify low energy
conformers if desired. Coordinates for side chain atoms in these loop regions may be
copied if residues are similar, though it is likely that considerable application of side
chain rotamer libraries will be required to define coordinates in these regions.
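
A first screen of candidate loop fragments can be sketched as below. This is a
deliberate simplification: it only checks whether a fragment spans the same distance as
the gap between the two anchor residues, whereas a real fragment search would also
superimpose the flanking residues and test for bad contacts. The fragment names,
coordinates, and tolerance are invented for illustration.

```python
import math


def span(ca_trace):
    """End-to-end distance of a fragment given its alpha-carbon coordinates."""
    return math.dist(ca_trace[0], ca_trace[-1])


def rank_fragments(fragments, anchor_start, anchor_end, tolerance=1.0):
    """Return names of fragments whose span matches the gap, best match first."""
    target = math.dist(anchor_start, anchor_end)
    kept = []
    for name, ca_trace in fragments.items():
        mismatch = abs(span(ca_trace) - target)
        if mismatch <= tolerance:
            kept.append((mismatch, name))
    return [name for mismatch, name in sorted(kept)]


# Hypothetical fragment library (alpha-carbon traces only)
library = {
    "fragA": [(0, 0, 0), (3, 0, 0), (6, 0, 0)],
    "fragB": [(0, 0, 0), (3, 2, 0), (10, 0, 0)],
    "fragC": [(0, 0, 0), (2, 3, 0), (6.5, 0, 0)],
}

# Anchor residues on either side of the undefined loop, 6 units apart
candidates = rank_fragments(library, (1, 1, 1), (7, 1, 1))
print(candidates)
```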
Databases of Structures from Homology Modeling: Databases are now available that
contain large numbers of protein structures that have been obtained by comparative
(homology) modeling. Two of these databases are ModBase and 3DCrunch.
ModBase was created by Sali and co-workers, using their program Modeller, which
creates models based on the satisfaction of spatial restraints. That is, restraints are
identified from the alignments of homologues of known structure, and these restraints are
then applied to the unknown sequence. Restraints can include distances between alpha
carbons, other distances within the main-chain, and main-chain and side-chain dihedral
angles. Routines to satisfy the restraints optimally include conjugate gradient
minimization and molecular dynamics with simulated annealing.
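
To make the notion of restraint satisfaction concrete, the sketch below scores a set of
coordinates against distance restraints using a simple harmonic penalty. This is an
assumed, simplified form chosen for illustration and is not Modeller's actual objective
function; the restraint values and coordinates are invented.

```python
import math


def restraint_penalty(coords, restraints):
    """Sum of squared, scaled deviations from the target distances.

    restraints: list of (i, j, target, sigma) where i and j index atoms,
                target is the distance taken from the template alignment and
                sigma expresses how tightly the restraint should be held.
    """
    total = 0.0
    for i, j, target, sigma in restraints:
        distance = math.dist(coords[i], coords[j])
        total += ((distance - target) / sigma) ** 2
    return total


# Three alpha carbons in a straight line, 4 units apart (illustrative only)
model = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
restraints = [
    (0, 1, 4.0, 0.5),  # already satisfied
    (0, 2, 7.0, 0.5),  # model is 1 unit too long, i.e. two sigma
]
penalty = restraint_penalty(model, restraints)
print(penalty)
```

An optimizer such as conjugate gradient minimization would then adjust the coordinates
to reduce this total.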
3DCrunch is a large scale modeling project that aims to submit all entries from protein
sequence databases to SWISS-MODEL. Currently the database contains 64,000 entries.
Automated Web-Based Homology Modeling: Web-based tools are now available to
generate models of protein 3-dimensional structures using comparative modeling
techniques.
SWISS-MODEL is available through Glaxo Wellcome Experimental Research in
Geneva, Switzerland.
WHAT IF, available on EMBL servers, includes three components, one to generate the
homology models, one to evaluate the quality of the homology models, and one to
evaluate models of proteins for which the structure is already known, thereby providing
for evaluation of the quality of the modeling program.
Evaluation and Refinement of the Structure: For a homology model from any source,
it is important to demonstrate that the structural features of the model are reasonable in
terms of what is known about protein structures in general. That is, researchers have
analyzed three-dimensional structures of proteins from which basic principles of protein
structure and folding have been developed. Several programs are available to assist in this
analysis of correctness of a homology model. Programs that provide structure analysis
along with output that is useful for publication include PROCHECK and 3D-Profiler.
PROCHECK is based on an analysis of (phi,psi) angles, peptide bond planarity, bond
lengths, bond angles, hydrogen-bond geometry, and side-chain conformations of known
protein structures as a function of atomic resolution. Thus, the expected values of these
parameters are known and can be compared to a modeled structure based on the atomic
resolution of the structures from which the model was developed. 3D-profiler compares a
homology model to its sequence using a 3D profile. The profile is based on the statistical
preferences of each of the 20 amino acids for particular environments within the protein.
Each residue position in a 3D model can be characterized by its environment. Preferred
environments for amino acids are derived from known three-dimensional structures and
are defined by three parameters: (1) the area of each residue that is buried, (2) the fraction
of side-chain area that is covered by polar atoms (i.e., O and N), and (3) the local
secondary structure. Based on these environment variables, a 3D structure is converted
into a 1D profile that describes each residue in the folded protein structure. Examination
of these profiles reveals which regions of a sequence appear to be folded correctly and
which do not.
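
The (phi,psi) angles examined in this kind of analysis are dihedral angles defined by
four consecutive backbone atoms. A minimal sketch of the underlying calculation is given
below; it is a generic dihedral routine, not code from PROCHECK, and the test points are
chosen only to give easily checked angles.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees about the p1-p2 bond, in the range -180 to 180."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / length for x in b1)
    # Remove the components along the central bond, leaving the two
    # projections onto the plane perpendicular to it
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# For phi the four atoms would be C(i-1), N(i), CA(i), C(i);
# for psi they would be N(i), CA(i), C(i), N(i+1).
cis = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 1, 0))
right_angle = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1))
trans = dihedral((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0))
print(cis, right_angle, trans)
```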
Once any irregularities have been resolved, the entire structure may then be subjected to
further refinement. This process may consist of energy minimization with restraints,
especially for the SCRs. The restraints then may be gradually removed for subsequent
minimizations. It also may be advantageous to apply molecular dynamics in conjunction
with energy minimization. For any of these refinement procedures, the structure should
be solvated, using for example crystallographic waters from the known homologues, a
solvent shell, or a periodic box of pre-equilibrated water molecules.

6.0 Tools for structure based drug design and docking
Docking Software
Protein-Ligand Docking
1. Affinity (Accelrys Inc.): automated, flexible docking; uses the energy of the
ligand/receptor complex to automatically find the best binding modes of the ligand to
the receptor (energy-driven method).
2. AutoDock (The Scripps Research Institute): automated docking of flexible ligands to
macromolecules; designed to predict how small molecules, such as substrates or drug
candidates, bind to a receptor of known 3D structure.
3. CombiBUILD (Sandia National Labs): structure-based drug design program created to
aid the design of combinatorial libraries; screens a library of possible reactants on the
computer and predicts which ones will be the most potent; successfully applied to find
nanomolar inhibitors of Cathepsin D; roughly an order of magnitude superior to standard
diversity approaches.
4. DockVision (University of Alberta): docking package created by scientists for
scientists; includes Monte Carlo, Genetic Algorithm, and database screening docking
algorithms.
5. FRED (OpenEye): accurate and extremely fast multiconformer docking program;
examines all possible poses within a protein active site, filtering for shape
complementarity and optional pharmacophoric features before scoring with more
traditional functions.
6. FlexiDock (Tripos): simple, flexible docking of ligands into binding sites on proteins;
fast genetic algorithm for generation of configurations; rigid, partially flexible, or fully
flexible receptor side chains provide optimal control of ligand binding characteristics;
conformationally flexible ligands; tunable energy evaluation function with special H-bond
treatment; very fast run times.
7. FlexX (BioSolveIT GmbH) fast computer program for predicting protein-ligand

interactions two main applications: complex prediction (create and rank a series of
possible protein-ligand complexes), virtual screening (selecting a set of compounds for
experimental testing) conformational flexibility of the ligand; rigid protein placement
algorithm based on the interactions occurring between the molecules (limited to low-
energy structures)
8. MIMUMBA torsion angle database used for the creation of conformers; interaction
geometry database used to exactly describe intermolecular interaction patterns
Boehm function (with minor adaptions necessary for docking) applied for scoring
9. GLIDE (Schrödinger GmbH): high-throughput ligand-receptor docking for fast library
screening; fast and accurate docking program; identifies the best binding mode through
Monte Carlo sampling; provides an accurate scoring function for ranking of binding
affinities; can enrich the fraction of suitable lead candidates in a chemical database by
predicting binding affinity rapidly and with a reasonable level of accuracy, which will
greatly enhance the probability of success in a drug discovery program.
10. GOLD (CCDC): calculates docking modes of small molecules in protein binding
sites; genetic algorithm for protein-ligand docking; full ligand and partial protein
flexibility; energy functions partly based on conformational and non-bonded
contact information from the CSD; choice of scoring functions: GoldScore, ChemScore
and User defined score; virtual library screening.
11. HINT! (Virginia Commonwealth University): Hydropathic Interactions; empirical
molecular modeling system with new methods for de novo drug design and protein or
nucleic acid structural analysis; translates the well-developed Medicinal Chemistry and
QSAR formalism of LogP and hydrophobicity into a free energy interaction model for all
biomolecular systems based on the experimental data from solvent partitioning; calculates
3D hydropathy fields and 3D hydropathic interaction maps; estimates LogP for modeled
molecules or data files; numerically and graphically evaluates binding of drugs or
inhibitors into protein structures and scores DOCK orientations; constructs hydropathic
(LOCK and KEY) complementarity maps (can be used to predict an ideal substrate from
a known receptor or protein structure or to propose the hydropathic structure from known
agonists or antagonists); evaluates/predicts effects of site-directed mutagenesis on protein
structure and stability.
12. LIGPLOT (University College of London): program for automatically plotting
protein-ligand interactions; generates schematic diagrams of protein-ligand interactions
for a given PDB file; interactions shown are those mediated by hydrogen bonds (dashed
lines between the atoms involved) and by hydrophobic contacts (represented by an arc
with spokes radiating towards the ligand atoms they contact).
13. SITUS (Scripps Research Institute): program package for modeling of atomic
resolution structures into low-resolution density maps; software supports both rigid-body
and flexible docking using a variety of fitting strategies.
14. SenSitus: interactive docking and visualization program for low-resolution density
maps and atomic structures; GUI-based alternative to certain Situs docking programs that
can benefit from an interactive user interface and 3D visualization methods.
15. VEGA (Milan University) calculation of ligand-receptor interaction energy
Protein-Ligand & Protein-Protein Docking
16. DOCK (UCSF Molecular Design Institute): generates many possible orientations (and
more recently, conformations) of a putative ligand within a user-selected region of a
receptor structure; orientations may be scored using several schemes designed to measure
steric and/or chemical complementarity of the receptor-ligand complex; can be used to
evaluate likely orientations of a single ligand or to rank molecules from a database, to
search databases for DNA-binding compounds, to examine possible binding orientations
of protein-protein and protein-DNA complexes, and to design combinatorial libraries.
17. GRAMM (SUNY): Global Range Molecular Matching; empirical approach to
smoothing the intermolecular energy function by changing the range of the atom-atom
potentials; requires only the atomic coordinates of the two molecules to predict the
complex structure (no binding site information needed); performs an exhaustive
6-dimensional search through the relative translations and rotations of the molecules;
see also the database of Protein-Protein Decoys for the validation of energy functions and
refinement procedures.
18. ICM-Dock (MolSoft LLC): fast and accurate docking simulations; unique set of tools
for accurate individual ligand-protein docking, peptide-protein docking, and
protein-protein docking, including interactive graphics tools.
Protein-Protein (Peptide) Docking
19. 3D-Dock Suite (BioMolecular Modeling, Cancer Research UK): incorporates
FTDock, RPScore and MultiDock. FTDock (Fourier Transform Dock) performs
rigid-body docking on two biomolecules in order to predict their correct binding geometry
and outputs multiple predictions that can be screened using biochemical information.
RPScore (Residue level Pair potential Score) uses a single distance constraint empirically
derived pair potential to screen the output from FTDock and can dramatically reduce the
list of possible complexes within which a correct solution can be found. MultiDock
(Multiple copy side-chain refinement Dock)
20. Bielefeld Protein Docking (Bielefeld University) detects geometrical and chemical
complementarities between surfaces of proteins and estimates docking positions
21. BiGGER (BioTecnol, S.A.): Biomolecular complex Generation with Global
Evaluation and Ranking; efficient protein-docking algorithm; predicts the structure of
binary protein complexes from the unbound structures; searches the complete binding
space and selects a set of candidate complexes; evaluates and ranks each candidate
according to the estimated probability of being an accurate model of the native complex;
integrated in Chemera, a molecular graphics and modeling program for studying protein
structures and interactions.
22. ClusPro (Boston University): integrated approach to protein-protein docking; the
docking algorithm includes the following steps: (1) rigid body docking based on the
Fourier correlation approach (using the DOT and ZDOCK docking programs); (2) selection
of structures with favorable desolvation and electrostatic properties; (3) clustering the
retained complexes using a pairwise RMSD criterion; (4) refinement of the 25 largest
clusters by the flexible docking algorithm SmoothDock.
23. DOT (San Diego Supercomputer Center): Daughter Of TURNIP (TURNIP is a program
developed by V. Roberts at The Scripps Research Institute for use in the study of
macromolecular docking); computation of the electrostatic potential energy between two
proteins or other charged molecules.
24. ESCHER NG (Milan University): enhanced version of the original ESCHER
protein-protein automatic docking system developed in 1997 by G. Ausiello, G. Cesareni
and M. Helmer Citterich; the new release, with reengineered code, includes some new
features: protein-protein and DNA-protein docking capability; fast surface calculation
based on the NSC algorithm.
25. HADDOCK (Utrecht University, Netherlands): High Ambiguity Driven
protein-protein Docking; biochemical and/or biophysical interaction data, such as chemical
shift perturbation data resulting from NMR titration experiments or mutagenesis data, are
introduced as ambiguous interaction restraints (AIRs) to drive the docking process; an
AIR is defined as an ambiguous distance between all residues shown to be involved in the
interaction.
26. HEX (University of Aberdeen): protein docking and molecular superposition program;
uses spherical polar Fourier correlations to accelerate docking calculations.
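
Several of the protein-protein docking programs listed above score rigid-body placements
by correlating a grid representation of one molecule with a grid representation of the
other. The toy sketch below shows that scoring idea on a small 2D grid using direct
summation over translations only; the real programs work on 3D grids, also search
rotations, and use Fourier transforms to speed up the same sum. The penalty value, the
contact-count surface weights, and the shapes are all assumptions made for illustration.

```python
CORE_PENALTY = -15  # assumed penalty for a ligand cell overlapping the receptor


def surface_weights(core):
    """Give each empty cell next to the receptor a weight equal to the number
    of receptor cells it touches (a simplification chosen for this sketch)."""
    weights = {}
    for (x, y) in core:
        for cell in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if cell not in core:
                weights[cell] = weights.get(cell, 0) + 1
    return weights


def score(core, weights, ligand, shift):
    """Correlation score of the ligand translated by shift = (dx, dy)."""
    dx, dy = shift
    total = 0
    for (x, y) in ligand:
        cell = (x + dx, y + dy)
        total += CORE_PENALTY if cell in core else weights.get(cell, 0)
    return total


def best_translation(core, ligand, search=range(-4, 9)):
    """Exhaustively scan translations and return the best-scoring one."""
    weights = surface_weights(core)
    shifts = [(dx, dy) for dx in search for dy in search]
    return max(shifts, key=lambda s: score(core, weights, ligand, s))


# Receptor: a 5 x 2 block with two raised ends, leaving a pocket on top
receptor = {(x, y) for x in range(5) for y in range(2)} | {(0, 2), (4, 2)}
# Ligand: a three-cell bar that fits the pocket
ligand = {(0, 0), (1, 0), (2, 0)}

best = best_translation(receptor, ligand)
print(best, score(receptor, surface_weights(receptor), ligand, best))
```

The highest score is obtained when the bar sits in the pocket, where it touches the
receptor on three sides without overlapping it.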

7.0 Computational Resources
Web services: The Web describes information using HyperText Markup Language
(HTML) and transmits it using the Hypertext Transfer Protocol (HTTP). The current
common name, The Web, is a contraction of its original name, the World Wide Web, also
abbreviated as WWW or W3. A Web browser performs multiple tasks. First, any Web
browser is an HTTP client; it knows how to transfer data using the HTTP protocol.
Second, any Web browser also knows how to interpret and display HTML, the content
markup language used on the Web. Different browsers have different display capabilities
and display the same HTML code in different ways (which is why HTML is referred to
as a content markup language instead of a page description language) but all of them can
understand (parse) HTML and do something reasonable with it.
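
What it means for a client to parse HTML, independently of how it chooses to display the
result, can be sketched with the parser in Python's standard library. The class name and
the sample page are invented for the example; the sketch only records which text belongs
to which element and makes no display decisions.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect the text found inside heading and paragraph elements."""

    def __init__(self):
        super().__init__()
        self.current = None   # tag we are currently inside, if of interest
        self.items = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lower case whatever their case in the page
        if tag in ("h1", "p"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.items.append((self.current, data.strip()))


page = "<HTML><BODY><H1>Phylogeny tools</H1><P>PHYLIP is a package.</P></BODY></HTML>"
parser = OutlineParser()
parser.feed(page)
parser.close()
print(parser.items)
```

A browser would go on to choose a font and layout for each element; that choice is where
different browsers diverge.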
Some of the differences in the way different Web browsers display the same Web page
come from different design decisions ("what font should be used for <H1> text?") and
some come from the fact that different Web clients have different capabilities.
Some of these differences, such as the ability to display various kinds of still or moving
images as part of the Web page or to run programs written in Java, ActiveX, or
JavaScript, represent extensions to HTML. These extra capabilities may be built into the
browser or may be added by "plugins", software extensions which give the browser new
functionality. Finally, the behavior of a Web browser can frequently be changed by
configuring its preferences; if you find the default font too small, it can often be
increased.
Many new computer users assume that the Web and the Internet are synonymous.
However, many protocols other than HTTP flow over the Internet. In part, the new user is
confused by the fact that, in addition to supporting extensions to HTML, many popular
web browsers have support for other protocols such as email (SMTP, POP, IMAP),
newsgroups, ftp, and gopher for example. What this really means is that the particular
piece of software (e.g. Netscape Communicator) is more than just a Web client, it is also
an email client, an FTP client and a Gopher client. Finally, HTTP does not have to be
transmitted over the Internet, and HTML doesn't have to be transmitted via HTTP. Web
technology has become a common interface tool for communication between computers
on a local network (sometimes called an Intranet), and every Web client I have worked
with has the ability to read and display local HTML files.
Because virtually every Web client is also a limited FTP client, many people choose to
use them that way. In the case where a Web page contains a link to an FTP server, simply
selecting the link downloads the file. If, however, you are given the following
instructions to retrieve a file:

Networking
Telnet: Telnet is one of the oldest of the network services and perhaps the easiest to
understand. Telnet allows one computer to "log on" to another computer as if it were a
terminal. Once logged on, you frequently will have all the privileges of a local user; you
can run programs, create and delete files. This is probably the most common way that
users with accounts will use a computer.
Although "full service logins" as described above are perhaps the most common use of
the telnet protocol, in fact as much control as the host's system administrator desires may
be imposed on a telnet connection. Thus, a telnet service may be advertised with a public
login name and password. Login with this name, however, is likely to be restricted to a
limited number of commands. The National Institutes of Health in the United States used,
at one point, such a telnet login to disseminate information as to the membership of study
sections. Such specialized telnet services have become much less common since the rise
in popularity of the Web.
A telnet session can negotiate a range of different protocols, but this almost always
includes ASCII text. Because many protocols for other services (e.g. SMTP, HTTP) are
encoded as ASCII text, a telnet client can sometimes be used to connect to a server for
these other protocols. Most people will use a telnet client the first time connecting to a
MOO, and some people will continue to use telnet as their client, although most of us
find dedicated clients to be significantly more convenient. Similarly, it is possible to
connect to a Web server with a telnet client if you understand the syntax of HTTP. This is
almost never done to use a Web server, but is occasionally done when debugging.
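
The text that one would type into a telnet session to talk to a Web server can be
sketched as follows. The sketch only builds the request and reads the first line of a
reply; it does not open a network connection. The host name is a placeholder and the
sample reply is made up.

```python
def build_get_request(host, path="/"):
    """Return the ASCII text of a minimal HTTP/1.0 GET request.

    Each line ends with carriage return plus linefeed, and an empty line
    marks the end of the request.
    """
    lines = ["GET " + path + " HTTP/1.0", "Host: " + host, "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(response):
    """Split the first line of a server reply into version, code and reason."""
    first_line = response.split(b"\r\n", 1)[0].decode("ascii")
    version, code, reason = first_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("www.example.org", "/index.html")
print(request.decode("ascii"))

# A made-up reply of the kind a server sends back
reply = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<HTML></HTML>"
print(parse_status_line(reply))
```

Typing the same request lines by hand after connecting with telnet to port 80 of a Web
server is the debugging technique mentioned above.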
From a practical point of view, every telnet host will be different, and thus you will need
to learn about each one as you have occasion to use it.
ftp: Telnet is useful for interactive computer access, but is much less useful for
transferring files. Ftp is an older service designed specifically for file transfer. Originally
it, like telnet, was intended for account owners. However, as it became apparent that it
was useful to make files available to the world at large without giving all those wanting
the files an account, the variant of "anonymous ftp" developed. In this variant, logging in
with a "magic" user name (most commonly "anonymous" or "ftp") eliminates the
requirement for a password.
Once logged on via ftp, access to the host filesystem is accomplished by a series of
commands. On a unix ftp client, the commands are unix-like; cd to Change Directory and
ls to LiSt the files in that directory. To transfer files, you execute either get, to
retrieve a file from the host computer, or put, to place a file onto it (where allowed).
These commands do not depend on
the host computer running UNIX! These are ftp commands, some of which happen to be
similar to unix commands. A client may choose to hide these commands; a client with a
graphical user interface (GUI), for example, might not have typed commands at all, but
buttons.
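
A client that hides the typed commands behind function calls can be sketched with
Python's ftplib module. The sketch is a minimal illustration: the helper names are
invented, open_anonymous needs network access and is therefore not called here, and a
small stand-in session is included so that the cd, ls and get sequence can be tried
offline.

```python
import io
from ftplib import FTP


def open_anonymous(host):
    """Connect to host and log in as the "anonymous" user (needs a network)."""
    session = FTP(host)
    session.login()  # with no arguments ftplib logs in as "anonymous"
    return session


def fetch(session, directory, filename):
    """Equivalent of typing cd, ls and get at an ftp prompt."""
    session.cwd(directory)                                # cd
    names = session.nlst()                                # ls
    if filename not in names:
        raise FileNotFoundError(filename)
    buffer = io.BytesIO()
    session.retrbinary("RETR " + filename, buffer.write)  # get, unmodified bytes
    return buffer.getvalue()


class FakeSession:
    """Offline stand-in offering the three methods that fetch() uses."""

    def __init__(self, files):
        self.files = files
        self.directory = "/"

    def cwd(self, directory):
        self.directory = directory

    def nlst(self):
        return sorted(self.files)

    def retrbinary(self, command, callback):
        callback(self.files[command.split(" ", 1)[1]])


demo = FakeSession({"README": b"text\n", "tree.bin": b"\x00\x01\x02"})
payload = fetch(demo, "/pub", "tree.bin")
print(len(payload), "bytes received")
```

With a real server, fetch(open_anonymous(host), directory, filename) would perform the
same steps over the network.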

One pair of ftp commands that is especially important to understand is binary and
ascii. Ftp transfers occur in ascii (text) mode by default. In ascii mode, the file received
may not be identical to the one on the host, as ftp may make changes in the file during
transfer, to allow for differences in how different operating systems handle text. For
example, UNIX terminates lines with the linefeed character (ASCII 10 decimal), the
Macintosh operating system with a carriage return (ASCII 13 decimal) and MSDOS uses
one of each. These differences are corrected for during an ascii transfer. This is highly
desirable for text files, but catastrophic for binary files like program object code and
pictures. Thus, before getting such a file, it is important to issue the binary command.
This instructs ftp to transfer files unmodified.
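
The effect of an ascii-mode transfer on the two kinds of file can be imitated in a few
lines. The function below mimics only one direction of the conversion (a file with
MSDOS line endings arriving on a UNIX host) and is an illustration, not ftp's actual
code. The 8-byte signature that begins every PNG image is used as the binary example
because it happens to contain a carriage return and linefeed pair.

```python
def ascii_mode_receive(data):
    """Imitate an ascii-mode transfer onto a UNIX host: CR LF becomes LF."""
    return data.replace(b"\r\n", b"\n")


text_file = b"line one\r\nline two\r\n"
png_signature = b"\x89PNG\r\n\x1a\n"   # first 8 bytes of any PNG image

converted_text = ascii_mode_receive(text_file)
converted_png = ascii_mode_receive(png_signature)

print(converted_text)   # the text is still the same two lines
print(len(png_signature), "bytes sent,", len(converted_png), "bytes received")
```

The text file keeps its meaning, but the image has lost a byte and no longer begins with
a valid signature, which is why the binary command must be issued first.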
Email: Both ftp and telnet are interactive, more or less real time programs. Sometimes it
is useful, however, to communicate with another computer, or more commonly, a user on
another computer, by leaving them a message which they can read and respond to at their
convenience. This is done over the Internet by using email.
Email is a generic term for a variety of processes which can use different protocols and
network technology, and which, in many cases, use a more complex client/server model.
At present, most email is transmitted by SMTP (Simple Mail Transfer Protocol) via
TCP/IP over the Internet. SMTP transmits email on port 25 between two dedicated, full
time servers. Although the assumption is that both SMTP servers will be generally
available, should the receiving server not be reachable when the transmitting server needs
to send email, the email message will be held and the transmission will be retried several
times over a period of days until a successful transmission occurs or until the maximum
retry time has been exceeded, at which point an error message will be returned to the
sender.
The SMTP programs discussed above are typically symmetrical (e.g. the program can
alternatively serve as client or server), and are complex. Typically, you will not interact
with these programs directly. Rather, dedicated client software is used to compose, send,
receive, and read email, and it is that software which communicates with the SMTP
server. If you send and receive email via a computer that is always on and always
connected to a network reachable by your mail server (e.g. a Unix workstation), then
incoming mail is saved to a mail spool file on your computer from whence your client
software retrieves it, and outgoing email is passed to the SMTP server. Examples of
client software running on Unix workstations are mail, mailx, mush, elm, mutt, and pine.
Also, as noted in the discussion of Web browsers, a browser can sometimes be used as an
email client.
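
The division of labour between client software and the SMTP server can be sketched with
Python's standard email and smtplib modules. The addresses and the function names are
placeholders; hand_to_smtp_server needs a reachable mail server and is therefore only
defined, not called.

```python
import smtplib
from email.message import EmailMessage


def compose(sender, recipient, subject, body):
    """What the client does on its own: build the message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def hand_to_smtp_server(message, host, port=25):
    """Pass the finished message to an SMTP server, which delivers it onward."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(message)


note = compose(
    "student@example.org",      # placeholder addresses
    "instructor@example.org",
    "Alignment results",
    "The PHYLIP run has finished.",
)
print(note["Subject"])
print(note.get_content().strip())
```

Any queueing and retrying described above is then the responsibility of the server, not
of the client program.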
If you send and receive email via a computer that is not always on and/or not always
connected to the network, sending email proceeds as above, but receiving email is
different in that the SMTP server cannot necessarily get incoming email onto your
computer's file system. In that case, a different protocol is used, most commonly POP3.
(IMAP is a newer protocol for accomplishing the same task about which you may hear
more in the future.) The SMTP server stores your email on a remote host and your local
client retrieves it from a POP3 server when you check for mail. Typically, a POP3
account will be provided by whoever provides your Internet access. Thus, to install an
email client on a Mac or Windows computer, you typically have to provide the domain
name and/or IP address of the SMTP and POP3 servers (frequently the same) and the
user name and password for the POP3 account.
Important commands in LINUX/UNIX operating systems
1. cat - display or concatenate files

cat takes a copy of a file and sends it to the standard output (i.e. to be displayed on your
terminal, unless redirected elsewhere), so it is generally used either to read files, or to
string together copies of several files, writing the output to a new file.
cat ex
displays the contents of the file ex.
cat ex1 ex2 > newex
creates a new file newex containing copies of ex1 and ex2, with the contents of ex2
following the contents of ex1.
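The two uses of cat described above can be tried safely in a scratch directory. This is a minimal sketch, assuming a POSIX-style shell such as bash; the directory and the file contents are invented for illustration.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'first line\n' > ex1
printf 'second line\n' > ex2

# Display one file on the standard output.
cat ex1

# Concatenate two files into a new one; the contents of ex2 follow ex1.
cat ex1 ex2 > newex
combined=$(cat newex)
printf '%s\n' "$combined"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

Note that the > redirection overwrites newex if it already exists.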
2. cd - change directory
cd is used to change from one directory to another.
cd dir1
changes directory so that dir1 is your new current directory. dir1 may be either the full
pathname of the directory, or its pathname relative to the current directory.
cd
changes directory to your home directory.
cd ..
moves to the parent directory of your current directory.
3. chmod - change the permissions on a file or directory

chmod alters the permissions on files and directories using either symbolic or octal
numeric codes. The symbolic codes are given here:-
u  user     +  to add a permission                  r  read
g  group    -  to remove a permission               w  write
o  other    =  to assign a permission explicitly    x  execute (for files),
                                                       access (for directories)
The following examples illustrate how these codes are used.
chmod u=rw file1
sets the user's permissions on the file file1 to exactly read and write (removing
execute permission if it was set). The permissions for the group and for others are
not altered.
chmod u+x,g+w,o-r file1

alters the permissions on the file file1 to give the user execute permission on file1, to
give members of the user's group write permission on the file, and prevent any users
not in this group from reading it.
chmod u+w,go-x dir1
gives the user write permission in the directory dir1, and prevents all other users
from entering that directory with cd. (They can still list its contents using ls.)
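The effect of a chmod command can be checked with ls -l. The sketch below assumes a POSIX-style shell; the starting permissions (644) and the octal equivalent (760) are chosen for illustration and do not come from the examples above.

```shell
workdir=$(mktemp -d)
file="$workdir/file1"
: > "$file"                 # create an empty file

# Start from a known state: rw-r--r-- (octal 644).
chmod 644 "$file"

# Symbolic form: add execute for the user, add write for the group,
# and remove read from others.
chmod u+x,g+w,o-r "$file"

# The first ten characters of "ls -l" are the file type and permission bits.
perms=$(ls -l "$file" | cut -c1-10)
echo "$perms"               # -rwxrw----

# The same final state written as an octal code (7 = rwx, 6 = rw-, 0 = ---).
chmod 760 "$file"
perms_octal=$(ls -l "$file" | cut -c1-10)
echo "$perms_octal"

rm -rf "$workdir"
```

The symbolic form changes only the bits it names, whereas an octal code always sets all nine permission bits at once.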
4. cp - copy a file
The command cp is used to make copies of files and directories.
cp file1 file2
copies the contents of the file file1 into a new file called file2. cp cannot copy a file
onto itself.
cp file3 file4 dir1
creates copies of file3 and file4 (with the same names), within the directory dir1. dir1
must already exist for the copying to succeed.
cp -r dir2 dir3
recursively copies the directory dir2, together with its contents and subdirectories, to
the directory dir3. If dir3 does not already exist, it is created by cp, and the contents
and subdirectories of dir2 are recreated within it. If dir3 does exist, a subdirectory
called dir2 is created within it, containing a copy of all the contents of the original dir2.
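The difference between copying to a new directory and copying to an existing one is easy to miss. The following sketch, assuming a POSIX-style shell and using made-up file names, runs the same cp -r command twice and lists the result each time.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p dir2/sub
echo data > dir2/sub/a.txt

# dir3 does not exist yet: it is created as a copy of dir2.
cp -r dir2 dir3
first=$(find dir3 -type f | LC_ALL=C sort)
printf '%s\n' "$first"

# dir3 now exists: the second copy is placed inside it, as dir3/dir2.
cp -r dir2 dir3
second=$(find dir3 -type f | LC_ALL=C sort)
printf '%s\n' "$second"

cd / && rm -rf "$workdir"
```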
5. date - display the current date and time
date returns information on the current date and time in the format shown below:-
Tue Mar 25 15:21:16 GMT 1997
It is possible to alter the format of the output from date. For example, using the command
line
date '+The date is %d/%m/%y, and the time is %H:%M:%S.'
at exactly 3.10pm on 14th December 1997, would produce the output
The date is 14/12/97, and the time is 15:10:00.
6. diff - display differences between text files

diff file1 file2 reports line-by-line differences between the text files file1 and file2. The
default output will contain lines such as n1 a n2,n3 and n4,n5 c n6,n7 , (where n1 a n2,n3
means that file2 has the extra lines n2 to n3 following the line that has the number n1 in
file1, and n4,n5 c n6,n7 means that lines n4 to n5 in file1 differ from lines n6 to n7 in file2).
After each such line, diff prints the relevant lines from the text files, with < in front of
each line from file1 and > in front of each line from file2.
There are several options to diff, including diff -i, which ignores the case of letters when
comparing lines, and diff -b, which ignores trailing blanks and treats other runs of
blanks as equivalent.
diff -cn
produces a listing of differences within n lines of context, where n is a number and the
default is three lines (the POSIX spelling of this option is diff -C n). The form of the
output is different from that given by plain diff, with + indicating lines which have been
added, - indicating lines which have been removed, and ! indicating lines which have
been changed.
diff dir1 dir2
will sort the contents of directories dir1 and dir2 by name, and then run diff on the text
files which differ.
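A small worked example makes the default output format concrete. This is a sketch assuming a POSIX-style shell; the two three- and four-line files are invented for illustration.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/file1"
printf 'alpha\nBETA\ngamma\ndelta\n' > "$workdir/file2"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# script that runs under "set -e" from stopping here.
normal=$(diff "$workdir/file1" "$workdir/file2" || true)
printf '%s\n' "$normal"
# 2c2
# < beta
# ---
# > BETA
# 3a4
# > delta

# With -i the case-only change on line 2 is no longer reported.
nocase=$(diff -i "$workdir/file1" "$workdir/file2" || true)
printf '%s\n' "$nocase"
# 3a4
# > delta

rm -rf "$workdir"
```

Here 2c2 says that line 2 of file1 was changed into line 2 of file2, and 3a4 says that line 4 of file2 was added after line 3 of file1.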
7. file - determine the type of a file

file tests named files to determine the categories their contents belong to.
file file1
can tell if file1 is, for example, a source program, an executable program or shell
script, an empty file, a directory, or a library, but (a warning!) it does sometimes
make mistakes.
8. find - find files of a specified name or type

find searches for files in a named directory and all its subdirectories.
find . -name '*.f' -print
searches the current directory and all its subdirectories for files ending in .f, and
writes their names to the standard output. In some versions of Unix the names of the
files will only be written out if the -print option is used.
find /local -name core -user user1 -print
searches the directory /local and its subdirectories for files called core belonging to the
user user1 and writes their full file names to the standard output.
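The sketch below builds a small directory tree and searches it. It assumes a POSIX-style shell; the directory layout is hypothetical, and the -type f test (which restricts matches to regular files) is an addition not shown in the examples above.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/src/lib"
: > "$workdir/src/main.f"
: > "$workdir/src/lib/util.f"
: > "$workdir/src/notes.txt"

# Quote the pattern so that the shell passes *.f to find unexpanded.
found=$(cd "$workdir" && find . -name '*.f' -print | LC_ALL=C sort)
printf '%s\n' "$found"
# ./src/lib/util.f
# ./src/main.f

# -type f restricts the matches to regular files; wc -l counts them.
count=$(cd "$workdir" && find . -type f -name '*.f' | wc -l | tr -d ' ')
echo "$count"

rm -rf "$workdir"
```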
9. grep - searches files for a specified string or expression

grep searches for lines containing a specified pattern and, by default, writes them to the
standard output.
grep motif1 file1
searches the file file1 for lines containing the pattern motif1. If no file name is given,
grep acts on the standard input. grep can also be used to search a series of files, so
grep motif1 file1 file2 ... filen
will search the files file1, file2, ... , filen, for the pattern motif1.
grep -c motif1 file1
will give the number of lines containing motif1 instead of the lines themselves.
grep -v motif1 file1
will write out the lines of file1 that do NOT contain motif1.
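In sequence work these options are often applied to FASTA files, where every record starts with a header line beginning with >. The following is a minimal sketch assuming a POSIX-style shell; the file name and the two short sequences are invented for illustration.

```shell
workdir=$(mktemp -d)
fasta="$workdir/seqs.fa"
cat > "$fasta" <<'EOF'
>seq1 hypothetical protein
MKTAYIAKQR
>seq2 kinase
MSDNELLK
EOF

# Count the header lines, i.e. the number of records in the file.
n_records=$(grep -c '^>' "$fasta")
echo "$n_records"           # 2

# Invert the match to print only the sequence lines.
residues=$(grep -v '^>' "$fasta")
printf '%s\n' "$residues"

rm -rf "$workdir"
```

The pattern is quoted because an unquoted > would be treated by the shell as output redirection.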

10. gzip - compress a file
gzip reduces the size of named files, replacing them with files of the same name extended
by .gz . The amount of space saved by compression varies.
gzip file1
results in a compressed file called file1.gz, and deletes file1.
gzip -v file2
compresses file2 and gives information, in the format shown below, on the percentage
of the file's size that has been saved by compression:-
file2 : Compression 50.26 -- replaced with file2.gz
To restore files to their original state use the command gunzip. If you have a compressed
file file2.gz, then
gunzip file2
will replace file2.gz with the uncompressed file file2.
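A round trip through gzip and gunzip should return exactly the original file. The sketch below checks this with cksum; it assumes a POSIX-style shell with gzip installed, and the repetitive test data is made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a repetitive (and therefore very compressible) test file.
i=0
while [ "$i" -lt 1000 ]; do
    echo ACGTACGTACGTACGT
    i=$((i + 1))
done > file1
size_before=$(wc -c < file1 | tr -d ' ')
sum_before=$(cksum < file1)

# gzip replaces file1 with file1.gz.
gzip file1
size_gz=$(wc -c < file1.gz | tr -d ' ')
original_gone=yes
[ -e file1 ] && original_gone=no

# gunzip restores the original file and removes file1.gz.
gunzip file1.gz
sum_after=$(cksum < file1)

echo "$size_before bytes before, $size_gz bytes compressed"
cd / && rm -rf "$workdir"
```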
11. help - display information about bash builtin commands

help gives access to information about builtin commands in the bash shell. Using help on
its own will give a list of the commands it has information about. help followed by the
name of one of these commands will give information about that command. help history,
for example, will give details about the bash shell history listings.
12. info - read online documentation

info is a hypertext information system. Using the command info on its own will enter the
info system, and give a list of the major subjects it has information about. Use the
command q to exit info. For example, info bash will give details about the bash shell.
13. kill - kill a process
Killing a process with kill requires its process id (PID), which can be found using ps.
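As a sketch of how the PID is used, the example below starts a harmless background process, kills it, and inspects the resulting exit status. It assumes bash or a similar shell in which $! holds the PID of the most recent background job; the sleep command is only a stand-in for a real long-running program.

```shell
# Start a long-running command in the background; $! holds its PID.
sleep 300 &
pid=$!
echo "started process $pid"

# By default kill sends the TERM signal, which asks the process to stop.
kill "$pid"

# wait returns the exit status of the job. In bash and dash a process
# ended by a signal reports 128 + the signal number (TERM is signal 15).
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```

If a process ignores the TERM signal, kill -9 followed by the PID sends the KILL signal, which cannot be caught, and so should be kept as a last resort.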
14. lpr - print out a file

lpr is used to send the contents of a file to a printer. If the printer is a laserwriter, and the
file contains PostScript, then the PostScript will be interpreted and the results of that
printed out.
lpr -Pprinter1 file1
will send the file file1 to be printed out on the printer printer1. To see the status of the
job on the printer queue use
lpq -Pprinter1
for a list of the jobs queued for printing on printer1. (This may not work for remote
printers.)

15. ls - list names of files in a directory
ls lists the contents of a directory, and can be used to obtain information on the files and
directories within it.
ls dir1
lists the names of the files and directories in the directory dir1, (excluding files whose
names begin with . ). If no directory is named, ls lists the contents of the current
directory.
ls -a dir1
will list the contents of dir1, (including files whose names begin with . ).
ls -l file1
gives details of the access permissions for the file file1, its size in bytes, and the time
it was last altered.
ls -l dir1
gives such information on the contents of the directory dir1. To obtain the information
on dir1 itself, rather than its contents, use
ls -ld dir1
16. man - display an on-line manual page

man displays on-line reference manual pages.
man command1
will display the manual page for command1, e.g. man cp, man man.
man -k keyword
lists the manual page subjects that have keyword in their headings. This is useful if
you do not yet know the name of a command you are seeking information about.
man -Mpath command1
is used to change the set of directories that man searches for manual pages on
command1
17. mkdir - make a directory

mkdir is used to create new directories. In order to do this you must have write permission
in the parent directory of the new directory.
mkdir newdir
will make a new directory called newdir.
mkdir -p can be used to create a new directory, together with any parent directories
required.
mkdir -p dir1/dir2/newdir
will create newdir and its parent directories dir1 and dir2, if these do not already exist.

18. more - scan through a text file page by page
more displays the contents of a file on a terminal one screenful at a time.
more file1
starts by displaying the beginning of file1. It will scroll up one line every time the
return key is pressed, and one screenful every time the space bar is pressed. Type ?
for details of the commands available within more. Type q if you wish to quit more
before the end of file1 is reached.
more -n file1
will cause n lines of file1 to be displayed in each screenful instead of the default
(which is two lines less than the number of lines that will fit into the terminal's
screen).
19. mv - move or rename files or directories

mv is used to change the name of files or directories, or to move them into other
directories. On some systems mv cannot move directories from one file-system to another,
so, if it is necessary to do that, use cp -r instead.
mv file1 file2
changes the name of a file from file1 to file2.
mv dir1 dir2
changes the name of a directory from dir1 to dir2, unless dir2 already exists, in which
case dir1 will be moved into dir2.
mv file1 file2 dir3
moves the files file1 and file2 into the directory dir3.
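Whether mv renames a directory or moves it depends on whether the target already exists. The sketch below shows both cases; it assumes a POSIX-style shell, and the name dir9 is an invented example of a target that does not yet exist.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
: > file1
mkdir dir1 dir2

# Rename a file.
mv file1 file2

# dir2 already exists, so dir1 is moved inside it rather than renamed.
mv dir1 dir2

# dir9 does not exist, so this is a plain rename of dir2.
mv dir2 dir9

layout=$(find . | LC_ALL=C sort)
printf '%s\n' "$layout"
# .
# ./dir9
# ./dir9/dir1
# ./file2

cd / && rm -rf "$workdir"
```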
20. nice - change the priority at which a job is being run

nice causes a command to be run at a lower than usual priority. nice can be particularly
useful when running a long program that could cause annoyance if it slowed down the
execution of other users' commands. An example of the use of nice is
nice compress file1
which will execute the compression of file1 at a lower priority.
21. passwd - change your password

Use passwd when you wish to change your password. You will be prompted once for your
current password, and twice for your new password. Neither password will be displayed
on the screen.

22. ps - list processes
ps displays information on processes currently running on your machine. This
information includes the process id, the controlling terminal (if there is one), the cpu time
used so far, and the name of the command being run.
ps
gives brief details of your own processes in your current session.
To obtain full details of all your processes, including those from previous sessions use:-
ps -fu user1
using your own user name in place of user1.
ps is a command whose options vary considerably in different versions of Unix (such as
BSD and SystemV). Use man ps for details of all the options available on the machine you
are using.
23. pwd - display the name of your current directory

The command pwd gives the full pathname of your current directory.
24. quota - disk quota and usage

quota gives information on a user's disk space quota and usage.
quota
will only give details of where you have exceeded your disk quota on local disks,
whereas
quota -v
will display your quota and usage, whether the quota has been exceeded or not, and
includes information on disks mounted from other machines, as well as the local
disks.
25. rm - remove files or directories

rm is used to remove files. In order to remove a file you must have write permission in its
directory, but it is not necessary to have read or write permission on the file itself.
rm file1
will delete the file file1. If you use
rm -i file1
instead, you will be asked if you wish to delete file1, and the file will not be deleted
unless you answer y. This is a useful safety check when deleting lots of files.
rm -r dir1
recursively deletes the contents of dir1, its subdirectories, and dir1 itself, and should be
used with suitable caution.

26. rmdir - remove a directory
rmdir removes named empty directories. If you need to delete a non-empty directory rm -r
can be used instead.
rmdir exdir
will remove the empty directory exdir.
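The following sketch contrasts rmdir with rm -r on an empty and a non-empty directory. It assumes a POSIX-style shell; the directory name full and its contents are invented for illustration.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/exdir" "$workdir/full"
: > "$workdir/full/file1"

# An empty directory can be removed with rmdir.
rmdir "$workdir/exdir"

# rmdir refuses a non-empty directory: it prints a message on the
# standard error and returns a non-zero exit status.
if rmdir "$workdir/full" 2>/dev/null; then
    refused=no
else
    refused=yes
fi
echo "rmdir refused non-empty directory: $refused"

# rm -r removes the directory and everything in it; there is no undo.
rm -r "$workdir/full"
remaining=$(ls -A "$workdir")
rmdir "$workdir"
```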
27. sort - sort and collate lines

The command sort sorts and collates lines in files, sending the results to the standard
output. If no file names are given, sort acts on the standard input. By default, sort sorts
lines using a character by character comparison, working from left to right, and using the
order of the ASCII character set.
sort -d
uses "dictionary order", in which only letters, digits, and white-space characters are
considered in the comparisons.
sort -r
reverses the order of the collating sequence.
sort -n
sorts lines according to the arithmetic value of leading numeric strings. Leading
blanks are ignored when this option is used (except in some System V versions of
sort, which treat leading blanks as significant; to be certain of ignoring leading
blanks use sort -bn instead).
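The difference between the default character comparison and sort -n shows up as soon as numbers have different lengths. This is a minimal sketch assuming a POSIX-style shell; the three numbers are arbitrary, and LC_ALL=C is set so that the default order is the ASCII order described above.

```shell
# Three numbers, one per line.
data=$(printf '10\n9\n100\n')

# Default: character-by-character comparison, so 10 < 100 < 9.
lexical=$(printf '%s\n' "$data" | LC_ALL=C sort)
printf '%s\n' "$lexical"

# -n compares by numeric value: 9 < 10 < 100.
numeric=$(printf '%s\n' "$data" | sort -n)
printf '%s\n' "$numeric"

# -r reverses the order, and can be combined with -n.
reverse=$(printf '%s\n' "$data" | sort -nr)
printf '%s\n' "$reverse"
```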
28. slogin - secure remote login program

slogin is used for logging onto a remote machine and for executing commands on a remote
machine, and provides secure encrypted communications between the local and remote
machines using an SSH protocol. The remote machine must be running an SSH server for
such connections to be possible.
29. telnet - remote login program

telnet communicates with another computer using the TELNET protocol.
telnet host1
will connect to the remote machine host1 (if it allows telnet connections). For
example, to use telnet to connect to the Central Unix Service, type
telnet cus.cam.ac.uk
You can then log in using your user name on cus.cam.ac.uk. If you type the escape
character (by default Ctrl-]) instead, you will enter telnet's command mode (you'll get
the prompt telnet>), and the command quit will get you back to the command line of
your local machine.
As communications between the two machines are not encrypted when using telnet, it
is preferable to use ssh, if that is available.
Some Bioprogramming tools:
1. Bioconductor
Bioconductor is an open source and open development software project that aims to
provide access to a wide range of powerful statistical and graphical methods for the
analysis of genomic data.
2. BioDAS
This site is the center of development of an Open Source system for exchanging
annotations on genomic sequence data.
3. BioJava
The BioJava Project is an open-source project dedicated to providing Java tools for
processing biological data.
4. BioMoby
BioMOBY is an international research project involving biological data hosts,
biological data service providers, and coders whose aim is to explore various
methodologies for biological data representation, distribution, and discovery.
5. BioPax
The BioPAX web site provides information about a collaborative effort to create a data
exchange format for biological pathways.
6. BioPerl
The BioPerl Project is an international association of developers of open source Perl
tools for bioinformatics, genomics and life science research.
7. BioPerl course
Great tutorial for those interested in the bioperl group of modules.
8. BioPHP
Open Source PHP code for bioinformatics. Includes functions and minitools (one-page
scripts for basic tasks in bioinformatics that can be copied and pasted). A wiki-like
service allows modification and improvement of code.
9. BioPipe
BioPipe is a workflow framework that seeks to address some of the complexity
involved in carrying out large-scale bioinformatics analysis. It has been designed to work
closely with the BioPerl package.

10. BioPython
The Biopython Project is an international association of developers of freely available
Python tools for computational molecular biology.
11. BioRuby
The BioRuby project aims to implement an integrated environment for bioinformatics
with Ruby.
12. CCT
CCT (Current Comparative Table) is a software package that you can install and set-up
on your own system to help you to maintain and search databases.
13. Ensembl API

Ensembl is a freely available software system for genomic analysis. The documentation
page at Ensembl is the best place to get information on the Ensembl application
programming interface (API). In particular, the tutorial document includes lots of
examples of scripts and exercises for you to try.
14. Human Ageing Genomic Resources

The Human Ageing Genomic Resources (HAGR) website provides tools and curated
databases relevant to the genetics of human ageing. GenAge is a database of genes
related to human ageing, and AnAge is a multi-species database facilitating the
comparative biology of ageing. The Ageing Research Computational Tools (ARCT) is a
collection of Perl modules to assist comparative genomics research.
15. NCBI C++ toolkit

The NCBI C++ Toolkit is a collection of C++ modules developed by the NCBI for
writing bioinformatics software and applications.
16. Open Bioinformatics Foundation

The Open Bioinformatics Foundation is a non profit, volunteer run organization focused
on supporting open source programming in bioinformatics.
17. PyMOL
PyMOL is a molecular graphics system with an embedded Python interpreter designed
for real-time visualization and rapid generation of high-quality molecular graphics
images and animations.
18. R
System for statistical computation and graphics; an interpreted computer language which
allows branching and looping as well as modular programming using functions.
19. Seqhound API

SeqHound is a bioinformatics application programming platform that provides access to
biological sequence, structure and functional annotation data. An application
programming interface (API) is available to programmers using C, C++, Java and PERL.
20. Systems Biology Markup Language (SBML)

The Systems Biology Markup Language (SBML) is a computer-readable format for
representing models of biochemical reaction networks. SBML is applicable to metabolic
networks, cell-signaling pathways, genomic regulatory networks, and many other areas in
systems biology.

