

May 16-18, 2007

Reference Material
Indian Institute of Advanced Research | Your Gate Way to Life Science Career






Protein Structure



Genome Analysis









Tools for Structure based drug design and docking



Computational Resources


1.0 Introduction


Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned. At the beginning of the "genomic revolution", a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but the development of complex interfaces whereby researchers could both access existing data as well as submit new or revised data.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. It is hoped that the process of analyzing and interpreting these data will lead to the elucidation of the principles underlying biological phenomena.

Some Important Landmarks in the development of Bioinformatics:




The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl)

Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers)

Needleman-Wunsch algorithm for global protein sequence alignment






New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage φX174 sequence

First software for sequence analysis (Roger Staden)

Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers)

Smith-Waterman algorithm for local protein sequence alignment

Human mitochondrial genome sequenced






















The concept of a sequence motif (Russell Doolittle)

GenBank Release 3 made public

Phage λ genome sequenced (Fred Sanger and coworkers)

The first practical sequence database searching algorithm (John Wilbur and David Lipman)

FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman)

Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers)

First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg)

EMBnet network for database distribution created
EST: expressed sequence tag sequencing (Craig Venter and coworkers)
First bacterial genomes completely sequenced
First archaeal genome completely sequenced
First eukaryotic genome (yeast) completely sequenced
Introduction of gapped BLAST and PSI-BLAST
COGs: Evolutionary classification of proteins from complete genomes
Fly genome (nearly) completely sequenced
Human genome (nearly) completely sequenced

Worm genome, the first multicellular genome, (nearly) completely sequenced

National Center for Biotechnology Information (NCBI) created at NIH/NLM

BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers)

Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers)

SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers)

2.0 Protein Structure

A set of 20 different subunits, called amino acids, can be arranged in any order to form a polypeptide that can be thousands of amino acids long. These chains can then loop about each other, or fold, in a variety of ways, but only one of these ways allows a protein to function properly. The critical feature of a protein is its ability to fold into a conformation that creates structural features, such as surface grooves, ridges, and pockets, which allow it to fulfill its role in a cell. A protein's conformation is usually described in terms of levels of structure. Traditionally, proteins are looked upon as having four distinct levels of structure, with each level of structure dependent on the one below it. In some proteins, functional diversity may be further amplified by the addition of new chemical groups after synthesis is complete.

The stringing together of the amino acid chain to form a polypeptide is referred to as the primary structure. The secondary structure is generated by the folding of the primary sequence and refers to the path that the polypeptide backbone of the protein follows in space. Certain types of secondary structures are relatively common. Two well-described secondary structures are the alpha helix and the beta sheet. In the first case, certain types of bonding between groups located on the same polypeptide chain cause the backbone to twist into a helix, most often in a form known as the alpha helix. Beta sheets are formed when a polypeptide chain bonds with another chain that is running in the opposite direction. Beta sheets may also be formed between two sections of a single polypeptide chain that is arranged such that adjacent regions are in reverse orientation.

The tertiary structure describes the organization in three dimensions of all of the atoms in the polypeptide. If a protein consists of only one polypeptide chain, this level then describes the complete structure.

Multimeric proteins, or proteins that consist of more than one polypeptide chain, require a higher level of organization. The quaternary structure defines the conformation assumed by a multimeric protein. In this case, the individual polypeptide chains that make up a multimeric protein are often referred to as the protein subunits. The four levels of protein structure are hierarchical, that is, each level of the build process is dependent upon the one below it.

A protein's primary amino acid sequence is crucial in determining its final structure. In some cases, amino acid sequence is the sole determinant, whereas in other cases, additional interactions may be required before a protein can attain its final conformation. For example, some proteins require the presence of a cofactor, or a second molecule that is part of the active protein, before they can attain their final conformation. Multimeric proteins often require one or more subunits to be present for another subunit to adopt the proper higher order structure. The entire process is cooperative, that is, the formation of one region of secondary structure determines the formation of the next region.

Allosteric Proteins: These are proteins that, under certain conditions, have a stable alternate conformation, or shape, that enables them to carry out a different biological function. The interaction of an allosteric protein with a specific cofactor, or with another protein, may influence the transition of the protein between shapes. In addition, any change in conformation brought about by an interaction at one site may lead to an alteration in the structure, and thus function, at another site. One should bear in mind, though, that this type of transition affects only the protein's shape, not the primary amino acid sequence. Allosteric proteins play an important role in both metabolic and genetic regulation.

Protein structure determination: Traditionally, a protein's structure was determined using one of two techniques: X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy.

X-ray Crystallography: Crystals are a solid form of a substance in which the component molecules are present in an ordered array called a lattice. The basic building block of a crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's components, the smallest possible set that is fully representative of the crystal. When the crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam; therefore, many molecules are in the same orientation with respect to the incoming X-rays. The X-ray beam enters the crystal and a number of smaller beams emerge, each one in a different direction and each one with a different intensity. If an X-ray detector, such as a piece of film, is placed on the opposite side of the crystal from the X-ray source, each diffracted ray, called a reflection, will produce a spot on the film. However, because only a few reflections can be detected with any one orientation of the crystal, an important component of any X-ray diffraction instrument is a device for accurately setting and changing the orientation of the crystal. The set of diffracted, emerging beams contains information about the underlying crystal structure.

The major drawback associated with this technique is that crystallization of the proteins is a difficult task. Crystals are formed by slowly precipitating proteins under conditions that maintain their native conformation or structure. These exact conditions can only be discovered by repeated trials that entail varying certain experimental conditions, one at a time. This is a very time consuming and tedious process.

Nuclear Magnetic Resonance (NMR) Spectroscopy: The basic phenomenon of NMR spectroscopy was discovered in 1945. In this technique, a sample is immersed in a magnetic field and exposed to radio waves. As the positively charged nucleus spins, the moving charge creates what is called a magnetic moment. When the radio waves hit the spinning nuclei, they tilt even more, sometimes flipping over. These resonating nuclei emit a unique signal that is then picked up by a detector and processed by the Fourier Transform algorithm, a complex equation that translates the language of the nuclei into something a scientist can understand. By measuring the frequencies at which different nuclei flip, scientists can determine molecular structure, as well as many other interesting properties of the molecule. In the past 10 years, NMR has proven to be a powerful alternative to X-ray crystallography for the determination of molecular structure. NMR has the advantage over crystallographic techniques in that experiments are performed in solution as opposed to a crystal lattice. However, the principles that make NMR possible tend to make this technique very time consuming and limit the application to small- and medium-sized molecules.
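The role of the Fourier transform in the NMR description above can be illustrated with a toy example. The sketch below is not NMR processing software: it applies a naive discrete Fourier transform to a synthetic sine wave (an assumption standing in for a detector signal) and reports the strongest frequency. The function names `dft_magnitudes` and `dominant_frequency` are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each frequency bin from 0 up to the Nyquist bin (naive O(n^2) DFT)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        magnitudes.append(abs(total))
    return magnitudes


def dominant_frequency(samples, sample_rate):
    """Frequency (in Hz) of the strongest non-constant component of the signal."""
    magnitudes = dft_magnitudes(samples)
    # Skip bin 0, which only measures the constant offset of the signal.
    peak_bin = max(range(1, len(magnitudes)), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / len(samples)


# Synthetic signal (assumption): a 5 Hz sine wave sampled 64 times over one second.
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
print(dominant_frequency(signal, sample_rate=64))
```

Real spectra contain many overlapping frequencies and decay over time, so practical software uses fast Fourier transform routines and further processing steps that this sketch leaves out.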


Homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship. One of the major tasks in comparative genomics is the identification of homologous genes in different organisms. An important tool for this task is BLAST (Basic Local Alignment Search Tool).

The BLAST algorithm is a heuristic program, which means that it relies on some smart shortcuts to perform the search faster. BLAST performs "local" alignments. Most proteins are modular in nature, with functional domains often being repeated within the same protein as well as across different proteins from different species. The BLAST algorithm is tuned to find these domains or shorter stretches of sequence similarity. The local alignment approach also means that an mRNA can be aligned with a piece of genomic DNA, as is frequently required in genome assembly and analysis. If instead BLAST started out by attempting to align two sequences over their entire lengths (known as a global alignment), fewer similarities would be detected, especially with respect to domains and motifs.
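To make the idea of a local alignment concrete, the sketch below computes the best local alignment score of two sequences with the Smith-Waterman recurrence listed among the landmarks earlier. It is not the BLAST heuristic itself; the match, mismatch, and gap values are illustrative assumptions rather than a real substitution matrix, and `local_alignment_score` is a hypothetical name.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: the alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch "GATTACA" is found even though the flanking regions differ.
print(local_alignment_score("AAAGATTACA", "TTTGATTACAGGG"))
```

Because every cell is floored at zero, unrelated flanking sequence does not penalize a well-conserved domain, which is the property the paragraph above contrasts with global alignment.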


When a query is submitted via one of the BLAST Web pages, the sequence, plus any other input information such as the database to be searched, word size, expect value, and so on, is fed to the algorithm on the BLAST server. BLAST works by first making a look-up table of all the "words" (short subsequences; for proteins the default length is three letters) in the query sequence, together with their "neighboring words", i.e., similar words. The sequence database is then scanned for these "hot spots". When a match is identified, it is used to initiate gap-free and gapped extensions of the "word".
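The look-up table and scanning steps described above can be sketched as follows. This is a simplified illustration, not BLAST's implementation: it indexes only exact words, omits the "neighboring words" and the extension step, and the function names and example sequences are hypothetical.

```python
from collections import defaultdict


def build_word_table(query, word_size=3):
    """Map every word (k-mer) of the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly."""
    table = build_word_table(query, word_size)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


# Hypothetical protein fragments sharing the word "KVL".
print(find_seeds("MKVLA", "GGKVLT"))
```

Each seed marks a "hot spot" from which an extension would start. Building the table once from the query keeps the scan over the database to a single pass per sequence.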

BLAST Scores and Statistics: Once BLAST has found a similar sequence to the query in the database, it is helpful to have some idea of whether the alignment is "good" and whether it portrays a possible biological relationship, or whether the similarity observed is attributable to chance alone. BLAST uses statistical theory to produce a bit score and expect value (E-value) for each alignment pair (query to hit).

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences. A key element in this calculation is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix is the default for most BLAST programs, the exceptions being blastn and MegaBLAST (programs that perform nucleotide-nucleotide comparisons and hence do not use protein-specific matrices). Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.


The E-value gives an indication of the statistical significance of a given pairwise alignment and reflects the size of the database and the scoring system used. The lower the E-value, the more significant the hit. An alignment with an E-value of 0.05 means that a similarity this strong has roughly a 5 in 100 (1 in 20) chance of occurring by chance alone.

Although a statistician might consider this to be significant, it still may not represent a biologically meaningful result, and analysis of the alignments (see below) is required to determine "biological" significance. The score and E value are calculated using the equations given below:

\[
S' = \frac{\lambda S - \ln K}{\ln 2}
\]

\[
E = m n \, 2^{-S'}
\]

where S' is the normalized score, S is the raw score, λ and K are constants, and m and n are the lengths of the query and hit sequences.
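As a worked illustration of the two equations above, the sketch below converts a raw score into a normalized score and an E-value. The values used for λ, K, and the sequence lengths are illustrative assumptions only, since in practice they depend on the scoring system; the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, m, n, lam, k):
    """Expected number of chance alignments, E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bit_score(raw_score, lam, k))


# Assumed example values, not taken from any particular BLAST run.
LAMBDA, K = 0.267, 0.041
print(bit_score(100, LAMBDA, K))
print(e_value(100, m=250, n=300, lam=LAMBDA, k=K))
```

Substituting the first equation into the second gives E = K m n e^{-λS}, which shows why the E-value falls exponentially as the raw score rises and grows with the lengths being compared.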
Tools for Comparative Genomics:

1. CisMols

CisMols (Cis-regulatory Modules) is a tool that identifies compositionally predicted cis-clusters that occur in groups of co-regulated genes within each of their ortholog-pair evolutionarily conserved cis-regulatory regions.

2. Comprehensive Microbial Resource (CMR)

The Comprehensive Microbial Resource (CMR) gives access to a central repository of the sequence and annotation of all complete public prokaryotic genomes as well as comparative genomics tools across all of the genomes in the database.

3. DAVID Bioinformatic Resources

The Database for Annotation, Visualization and Integrated Discovery (DAVID) provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.

The website provides access to tools for comparative genomic analyses developed by the Comparative Genomics Center at the Lawrence Livermore National Laboratory. Tools include: zPicture, Mulan, eShadow, rVista, CREME, and the ECR

5. EnteriX

EnteriX is a collection of tools for viewing pairwise and multiple alignments for bacterial genome sequences.

6. FootPrinter

FootPrinter is a program for phylogenetic footprinting that identifies regions of DNA that are well conserved across a set of orthologous sequences in order to infer phylogenetic relationships.

7. FootPrinter3

FootPrinter3 is a web server for predicting transcription factor binding sites (TFBS) by using phylogenetic footprinting. FootPrinter3 extends the motif discovery algorithms of FootPrinter by making use of local multiple sequence alignment blocks when those are available and reliable, but also allowing finding motifs in unalignable regions.

8. GenomeTraFaC

GenomeTraFaC is a comparative genomics based resource for initial characterization of gene models and the identification of putative cis-regulatory regions of RefSeq gene orthologs.

9. GENSTYLE

GENSTYLE is based on the genomic signature paradigm and allows the user to classify and characterize nucleotide sequences using oligonucleotide frequencies.

10. IBM Genome Annotation Page

IBM's Bio-Dictionary-based Annotations Of Completed Genomes page lists annotations for over 75 complete genomes (archaea, bacteria, eukaryotes, and viruses). You can query these annotations at the sequence level as well as search/compare across genomes.

11. Integrated Microbial Genomes (IMG)

The Integrated Microbial Genomes (IMG) system facilitates the comparison of genomes sequenced by the Joint Genome Institute (JGI). It can be searched using keywords or BLASTp, and the gene records displayed include biochemical properties, protein domains, chromosomal location and neighbourhood, and lists of paralogues and orthologues. One can easily build a list of genomes to be considered or excluded from the search, and the Phylogenetic Profiler tool allows one to refine the selection by building a list of homologues either common to or excluded from specific organisms.

12. ISC Large-scale Sequencing Project Database

The International Sequencing Consortium (ISC) Large-scale Sequencing Project Database contains information on current and completed sequencing projects, including project timelines, funding agencies, sequencing strategy and links out to project web

14. MicroFootPrinter

prokaryotic genomes using the phylogenetic footprinting program FootPrinter.

15. MIPS

Munich Information Centre for Protein Sequences projects include: fungal genome analysis, plant genome bioinformatics, structural genomics, proteomics and genome annotation. Projects and databases include: CYGD, MNCDB, NGFN, MPPI, SIMAP, QUIPOS, MATDB, MOsDB, SPUTNIK, and PEDANT.

16. MLST

MLST (Multi Locus Sequence Typing) is a nucleotide sequence based approach for the unambiguous characterisation of isolates of bacteria and other organisms using the sequences of internal fragments of seven house-keeping genes.

17. NEMBASE2

NEMBASE2 is a database resource for EST datasets for 37 species of nematode. Sequences are clustered to reduce redundancy. Comparisons can be made by library and at a sequence level; a visualisation tool is included. Coding region predictions for each cluster and further annotations, such as GO terms and physical properties, are also included.

18. PartiGeneDB

PartiGeneDB is a database of about 300 partial genomes from eukaryotic organisms that have been assembled from EST data.

19. PhenomicDB

PhenomicDB integrates the genotype and phenotype information of several organisms from public data sources. The mapping of phenotypic data fields allows cross-species phenotype comparison.

20. Phydbac2

Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and explore the phylogenomic profiles of bacterial protein sequences. It also allows the user to view sequence similarity across different organisms, access other genes with similar conservation profiles, and view genes that are found nearby a selected gene in multiple genomes.

21. Projector 2

Projector 2 allows users to map completed portions of the genome sequence of an organism onto the finished (or unfinished) genome of a closely-related species or strain. Using the related genome sequence as a template can facilitate sequence assembly and the sequencing of the remaining gaps.

22. Sockeye

Sockeye is a visualization tool allowing one to assemble and analyze genomic information in a three dimensional workspace. It can be used to view features at various levels, ranging from SNPs to karyotypes. Sockeye displays genomic features along tracks, and links to the Ensembl database.

23. SPRING

Sorting Permutation by Reversals and Block Interchanges (SPRING) is a tool for the analysis of genome rearrangements. SPRING takes two or more chromosomes as its input and then computes a minimum series of reversals and/or block-interchanges for transforming one chromosome into another. Phylogenetic trees based on the rearrangement analysis are also shown as part of the results.

24. SVC

SVC (Structured Visualization of Evolutionary Conserved Sequences) is a tool that can search for pairs of orthologous genes, align the protein coding sequences, and visualize the evolutionary sequence conservation mapped back onto the gene structure scaffold.

25. T-STAG

Tissue-Specific Transcripts And Genes (T-STAG) is a system integrating EST, gene expression, alternative splicing and human-mouse orthology information for the analysis of tissue-specific gene and transcript expression patterns.

26. TIGR Software Tools

A list of open-source software packages available for free from The Institute for Genomic Research (TIGR).

27. TraFaC

TraFaC (Transcription Factor Binding Site Comparison) is a tool that identifies regulatory regions using a comparative sequence analysis approach.

28. Viral Bioinformatics

Viral Bioinformatics provides access to viral genomes and a variety of tools for comparative genomic analyses.

29. YOGY

Eukaryotic Orthology (YOGY) is a resource for retrieving orthologous proteins from nine eukaryotic organisms. Using a gene or protein identifier as a query, this database provides comprehensive, combined information on orthologs in other species using data from five independent resources: KOGs, Inparanoid, Homologene, OrthoMCL, and a table of curated orthologs between budding yeast and fission yeast. Associated Gene Ontology (GO) terms of orthologs can also be retrieved.

4.0 Phylogeny

New insight into the molecular basis of a disease may come from investigating the function of homologs of a disease gene in model organisms. In this case, homology refers to two genes sharing a common evolutionary history. Scientists also use the term homology, or homologous, to simply mean similar, regardless of the evolutionary relationship.

Equally exciting is the potential for uncovering evolutionary relationships and patterns between different forms of life. With the aid of nucleotide and protein sequences, it should be possible to find the ancestral ties between different organisms. Thus far, experience has taught us that closely related organisms have similar sequences and that more distantly related organisms have more dissimilar sequences. Proteins that show significant sequence conservation, indicating a clear evolutionary relationship, are said to be from the same protein family. By studying protein folds (distinct protein building blocks) and families, scientists are able to reconstruct the evolutionary relationship between two species and to estimate the time of divergence between two organisms since they last shared a common ancestor.
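The observation that closely related organisms have more similar sequences can be quantified in a simple way. The sketch below computes the proportion of differing positions (often called the p-distance) between two sequences that are assumed to be already aligned; the sequences and the function name `p_distance` are hypothetical, and real analyses usually apply corrections for multiple substitutions that are omitted here.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue  # positions with a gap in either sequence are skipped
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


# Hypothetical aligned fragments from three species.
species_1 = "ACGTACGTAC"
species_2 = "ACGTACGTAA"   # one difference from species_1
species_3 = "ACTTGCGAAA"   # several differences from species_1
print(p_distance(species_1, species_2), p_distance(species_1, species_3))
```

A matrix of such pairwise distances is one common starting point for building the phylogenetic trees discussed later in this section.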


The Three Domains of Life:

In the mid-1970s, while studying some unusual groups of bacteria, thermophilic methanogens and halophiles, Carl Woese and colleagues concluded that these organisms were not really bacteria but should be assigned to a separate domain of life with the same status as bacteria and eukaryotes. This group was originally referred to as archaebacteria and later renamed archaea. The uniqueness of the archaea was apparent, even from some of their biochemical features, such as the unusual structure of lipids and the topology of phylogenetic trees of 16S rRNA. These trees clearly indicated that archaea comprised a unique branch of life, distinct from both bacteria and eukaryotes. Furthermore, although archaea are phenotypically obvious prokaryotes, like bacteria, i.e., they have small cells without nuclei or organelles, they are, in some important respects, closer to eukaryotes than to bacteria. These eukaryote-like features of archaea include the structure of the ribosomes, which have a number of proteins shared with eukaryotes but not with bacteria, the presence of histones (in one of the two major branches of archaea), the organization of the basal transcriptional apparatus, with several transcription factors of the eukaryotic variety, and the organization of the DNA replication apparatus, which is also conserved in archaea and eukaryotes but not in bacteria.

The evolutionary process:

Genetic Variation: Evolution is not always discrete with clearly defined boundaries that pinpoint the origin of a new species, nor is it a steady continuum. Evolution requires genetic variation, which results from changes within a gene pool, the genetic make-up of a specific population. A gene pool is the combination of all the alleles (alternative forms of a genetic locus) for all traits that a population may exhibit. Changes in a gene pool can result from mutation (variation within a particular gene) or from changes in gene frequency (the proportion of an allele in a given population).
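Gene frequency, as defined above, is straightforward to compute from genotype counts. The sketch below assumes a single locus with two alleles, written A and a, in a diploid population; the counts and the function name `allele_frequencies` are hypothetical.

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (frequency of A, frequency of a) from diploid genotype counts."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)
    if total_alleles == 0:
        raise ValueError("population is empty")
    # Each AA individual carries two copies of A, each Aa individual one.
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return freq_A, 1.0 - freq_A


# Hypothetical population sampled at two time points.
print(allele_frequencies(30, 40, 30))
print(allele_frequencies(10, 20, 70))
```

Comparing the frequencies between the two samples shows a change in the gene pool even though no new allele has appeared.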

Every organism possesses a genome that contains all of the biological information needed to construct and maintain a living example of that organism. The biological information contained in a genome is encoded in the nucleotide sequence of its DNA or RNA molecules and is divided into discrete units called genes. The information stored in a gene is read by proteins, which attach to the genome and initiate a series of reactions called gene expression.

Every time a cell divides, it must make a complete copy of its genome, a process called DNA replication. DNA replication must be extremely accurate to avoid introducing mutations, or changes in the nucleotide sequence of a short region of the genome. Inevitably, some mutations do occur, usually in one of two ways: either from errors in DNA replication or from damage to the DNA. Mutations in the coding regions of genes are much more important than those in noncoding regions. Those mutations that do have an evolutionary effect can be divided into two categories, loss-of-function mutations and gain-of-function mutations. A loss-of-function mutation results in reduced or abolished protein function. Gain-of-function mutations, which are much less common, confer an abnormal activity on a protein.

Phylogenetic Trees: Systematics describes the pattern of relationships among taxa and is intended to help us understand the history of all life. But history is not something we can see: it has happened once and leaves only clues as to the actual events. Scientists use these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the most convenient way of visually presenting evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.

Tools for phylogeny reconstruction

1. Bioinformatics Toolkit

This Toolkit is a collection of a wide range of tools and links for sequence analysis, function, and structure prediction. This resource offers convenient web interfaces for many freely available tools.

2. CIPRes

The Cyberinfrastructure for Phylogenetic Research (CIPRes) project aims to develop a computational infrastructure for systematics. Other goals of the project include providing a central resource enabling computational systematics and education and training initiatives. The website also contains a substantial list of links to related software.

3. Codon Usage Database

Find GC content and frequency of codon usage for any organism that has a sequence in GenBank.

4. ConSeq

ConSeq is a tool for predicting functionally and structurally important amino acid residues in protein sequences. The predictions are based on the assumptions that residues of functional importance are often conserved and solvent-accessible, and those of structural importance are often conserved and located in the protein core. A multiple sequence alignment is used to predict the relative solvent accessibility state and the evolutionary rate at each residue.

5. cpnDB

cpnDB is a curated collection of chaperonin sequence data collected from public databases or generated by a network of collaborators exploiting the cpn60 target in clinical, phylogenetic and microbial ecology studies. The database contains all available sequences for both group I and group II chaperonins. cpnDB is built and maintained with open source tools.

6. IBM Genome Annotation Page

IBM's Bio-Dictionary-based Annotations Of Completed Genomes page lists annotations for over 75 complete genomes (archaea, bacteria, eukaryotes, and viruses). You can query these annotations at the sequence level as well as search/compare across genomes.

7. Jevtrace

Jevtrace is a tool that combines multiple sequence alignments, phylogenetic, and structural data for identification of functional sites in proteins.

8. Joes Site - Phylogeny Programs

Comprehensive list of phylogeny packages, compiled by Joe Felsenstein, creator of

9. MEGA

MEGA (Molecular Evolutionary Genetics Analysis) is a software package for phylogenetic analysis with a graphical user interface. It allows viewing and editing of the aligned input sequence data and provides many tools for phylogenetic and statistical analysis of the alignments.

10. Mesquite

Mesquite is an open source software project designed to deal with comparative data about organisms and evolutionary analyses. Mesquite contains modules for phylogenetic analysis, population genetics, and non-phylogenetic multivariate analysis.

11. MIGenAS Toolkit

Max-Planck Integrated Gene Analysis System (MIGenAS) provides access to many different bioinformatics software tools and databases for sequence similarity searching, multiple sequence alignments, phylogenetic analysis, and protein structure prediction. Users can also configure "meta"-tools as a pipeline of individual tools and intermediate filters.


12. MINER

MINER is a tool for the identification and visualization of phylogenetic motifs (regions within a multiple sequence alignment (MSA) that conserve the overall phylogeny of the complete family).

13. MPI Toolkit

Max-Planck Institute Bioinformatics Toolkit provides access to many different bioinformatics software tools and databases for sequence similarity searching, multiple sequence alignments, phylogenetic analysis, and protein structure prediction.

14. NCBI Taxonomy Database

Taxonomic classification of all organisms with sequences in GenBank.

15. NEWT

NEWT is the taxonomy database maintained by the UniProt group.

16. NJplot

NJplot is a tool for visualizing binary trees such as the phylogenetic trees output from the PHYLIP programs. Available for several platforms including Windows, MacOS, Linux and Solaris.

17. Orthologue Search Service

BLAST a protein sequence, then perform automated phylogenetic analysis to detect orthologous sequences.

18. PAL2NAL

PAL2NAL converts a multiple sequence alignment of proteins and the corresponding DNA (or mRNA) sequences into a codon alignment. Synonymous (Ks) and non-synonymous (Ka) substitution rates can be calculated.

19. Phydbac2

Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and explore the phylogenomic profiles of bacterial protein sequences. It also allows the user to view sequence similarity across different organisms, access other genes with similar conservation profiles, and view genes that are found near a selected gene in multiple genomes.


20. PHYLIP

Comprehensive set of programs for phylogenetic analyses; available for PC and Mac; source code available for easy compiling in UNIX.

21. PhyloBLAST

BLAST a protein sequence, then perform automated phylogenetic analysis on hits or on uploaded sequences; PHYLIP-based analyses.



PhyloDome is a tool with which you can visualize and analyze the phylogenetic distribution of one or more eukaryotic domains.


Phyml is a program that constructs phylogenetic trees from sequence alignments using the maximum likelihood method.


The Phylogenetic Web Repeater (POWER) allows users to perform phylogenetic
analysis using the PHYLIP package. The POWER pipeline can start with processing
either multiple sequence alignments (MSA) or can proceed directly with aligned
ProtTest is a program that determines the best-fit model of evolution, among a set
of candidate models, for a given protein sequence alignment.
Puzzleboot is a UNIX shell script facilitating bootstrap analysis using TREE-PUZZLE
and PHYLIP. It enhances TREE-PUZZLE by allowing one to analyse multiple datasets,
and can be used for both protein and DNA distance bootstrap analysis.
Ribosomal Database Project
Highly curated database of aligned and annotated rRNA sequences with accompanying
phylogenies; data available for download.
STING Millennium
STING is a suite of tools for the analysis of protein sequence, structure, stability
and function - and the relationships between them.
Sliding Window Analysis of Ka and Ks (SWAKK) is a tool for detecting positive
selection in proteins using a sliding window substitution rate analysis. The program can
display the results on a 3D protein structure.

The T-COFFEE site includes links to a collection of tools for computing, evaluating, and manipulating multiple alignments of protein sequences and structures. T-COFFEE is a protein multiple sequence alignment tool that is more accurate than ClustalW for sequences with less than 30% identity. Expresso (or 3DCoffee) aligns sequences using structural information. PROTOGENE turns amino acid alignments into CDS nucleotide alignments.

31. Tree Editors

Tree Editors is an annotated listing of software for the visualization and manipulation of phylogenetic trees.


Tree of Life

Multi-authored project attempting to represent online the entire phylogeny of life on




Tree-puzzle is a program that constructs phylogenetic trees from sequence alignments using the maximum likelihood method.

34. TreeDomViewer

TreeDomViewer is a tool for the visualization of phylogeny and protein domain structure. TreeDomViewer constructs phylogenetic trees and projects the corresponding protein domain information onto the multiple sequence alignment.

35. TreeJuxtaposer

TreeJuxtaposer is a free software tool that allows a visual comparison of two trees in Newick format (phylogenies, taxonomies, gene trees, etc.). It can work with trees having up to 500,000 nodes, and automatically calculates and marks the differences.

36. TreeView

Generates nice graphics of trees; reads multiple tree file formats; available for download to Mac or PC.

37. TSEMA

The Server for Efficient Mapping Assessment (TSEMA) predicts possible protein-protein interactions based on the comparison of phylogenetic trees derived from sequences of associated protein families.

38. UCMP Phylogeny Wing

"Phylogeny-Diversity of Life Through Time" is an on-line exhibit at the University of California Museum of Paleontology website. There is an introduction to phylogenetics and cladistics, and you can navigate through a very informative phylogenetic tree rooted at the three main domains of life (Archaea, Bacteria and Eukaryota). At each level of the tree there is a brief summary, and links to more information about the Fossil Record, Life History and Ecology, Systematics and Morphology.

39. Understanding Evolution

A fantastic site for teaching/understanding evolution.

40. Weighbor

Weighbor is a tool for building phylogenetic trees from distance matrices. It employs a weighted version of the neighbour-joining method in which longer distances in the matrix are given less weight.
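The codon-alignment idea behind the PAL2NAL entry in the list above can be illustrated with a short sketch. This is not PAL2NAL's own code: the function name `thread_codons` and the example sequences are hypothetical, and the sketch assumes that each coding sequence is in frame and exactly matches its protein row (it does not check the translation or handle frameshifts).

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an ungapped coding sequence onto one gapped protein alignment row."""
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is too short for the protein row")
    out = []
    pos = 0  # index of the next unread nucleotide in the coding sequence
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)           # a protein gap becomes a three-column gap
        else:
            out.append(cds[pos:pos + 3])  # the next codon belongs to this residue
            pos += 3
    return "".join(out)


# Hypothetical two-sequence protein alignment and the matching coding sequences.
protein_rows = {"seqA": "MK-L", "seqB": "MKVL"}
cds_rows = {"seqA": "ATGAAACTG", "seqB": "ATGAAGGTTCTG"}

codon_alignment = {
    name: thread_codons(protein_rows[name], cds_rows[name])
    for name in protein_rows
}
```

Because every residue column expands to exactly three nucleotide columns, the resulting rows stay aligned codon by codon, which is the form needed before Ks and Ka can be estimated.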

5.0 Protein Modeling

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target). Identifying a protein's shape, or structure, is key to understanding its biological function and its role in health and disease. Illuminating a protein's structure also paves the way for the development of new agents and devices to treat a disease. Yet solving the structure of a protein is no easy feat. It often takes scientists working in the laboratory months, sometimes years, to experimentally determine a single structure. Therefore, scientists have begun to turn toward computers to help predict the structure of a protein based on its sequence. The challenge lies in developing methods for accurately and reliably understanding this intricate relationship.

Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. Because the different genome projects are producing more sequences and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms. Protein modeling involves identification of the proteins with known three-dimensional structures that are related to the target sequence, constructing a model for the target sequence based on its alignment with the template structure(s) and evaluating the model against a variety of criteria to determine if it is satisfactory.

Evaluating the Alignment: The best way to assess the accuracy is to compare alignments from sequence comparisons with alignments from protein three-dimensional


Identification of Structurally Conserved and Structurally Variable Regions:

After the known structures are aligned, they are examined to identify the structurally conserved regions (SCRs) from which an average structure, or framework, can be constructed for these regions of the proteins. Variable regions (VRs), in which each of the known structures may differ in conformation, also must be identified because special techniques must be applied to model these regions of the unknown protein.
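The separation into SCRs and VRs described above can be sketched for the simplest case of two known structures that have already been superimposed. The function name, the 1.0 angstrom cutoff, the minimum run length and the coordinates below are all illustrative assumptions, not values taken from any particular modeling package.

```python
import math


def conserved_regions(coords_a, coords_b, cutoff=1.0, min_len=3):
    """Return inclusive (start, end) index ranges where aligned C-alpha atoms
    of two superimposed structures lie within `cutoff` angstroms of each other.
    Positions outside these ranges are treated as variable regions."""
    regions = []
    start = None
    for i, (a, b) in enumerate(zip(coords_a, coords_b)):
        close = math.dist(a, b) <= cutoff
        if close and start is None:
            start = i                      # a conserved run begins here
        elif not close and start is not None:
            if i - start >= min_len:       # keep only runs that are long enough
                regions.append((start, i - 1))
            start = None
    n = min(len(coords_a), len(coords_b))
    if start is not None and n - start >= min_len:
        regions.append((start, n - 1))     # close a run that reaches the end
    return regions


# Hypothetical C-alpha traces: identical except for a displaced loop at 3-4.
structure_a = [(3.8 * i, 0.0, 0.0) for i in range(8)]
structure_b = [(x, 3.0, z) if i in (3, 4) else (x, y, z)
               for i, (x, y, z) in enumerate(structure_a)]

scrs = conserved_regions(structure_a, structure_b)
```

With more than two homologues the same test would be applied to every pair, or to the deviation of each structure from the average framework.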

Generating Coordinates for the Unknown Structure: When generating coordinates for the unknown structure, one needs to model main chain atoms and side chain atoms, both in SCRs and VRs. For the SCRs, it is straightforward to generate the coordinates of the main chain atoms of the unknown structure from those of the known structure(s). Side chain coordinates are copied if the residue type in the unknown is identical or very similar to that in the known homologues. For other side chain coordinates one can apply a side chain rotamer library in a systematic approach to explore possible side chain conformations. It may be desirable to weight the contribution of each homologue in each SCR based on the extent of similarity with the unknown. In the event that some coordinates in the unknown are undefined in the SCRs, regularization can be used to build and relax both main chain and side chain atoms in those regions. Note that this procedure should be used only if the region of undefined atoms is one or two residues in length.

For the VRs, a variety of approaches may be applied in assigning coordinates to the unknown. Recall that these regions will correspond most often to the loops on the surface of the protein. If a loop in one of the known structures is a good model for that of the unknown, then the main chain coordinates of that known structure can be copied. Side chain coordinates of residues that are similar in length and character also may be copied. Rotamer libraries can be used to define other side chain coordinates.

When a good model for a loop cannot be found among the known structures, one can search fragment databases for loops in other proteins that may provide a suitable model for the unknown. A residue range is chosen to include the undefined loop as well as a few residues (e.g., three) on either side of the loop for which coordinates have been defined. Fragments are examined for their ability to fit in the undefined region without making bad contacts with other atoms and to overlap well with the residues on either side of the loop. The loop may then be subjected to conformational searching to identify low energy conformers if desired. Coordinates for side chain atoms in these loop regions may be copied if residues are similar, though it is likely that considerable application of side chain rotamer libraries will be required to define coordinates in these regions.

Databases of Structures from Homology Modeling: Databases are now available that contain large numbers of protein structures that have been obtained by comparative (homology) modeling. Two of these databases are ModBase and 3DCrunch.

ModBase was created by Sali and co-workers, using their program Modeller, which creates models based on the satisfaction of spatial restraints. That is, restraints are identified from the alignments of homologues of known structure, and these restraints are then applied to the unknown sequence. Restraints can include distances between alpha carbons, other distances within the main-chain, and main-chain and side-chain dihedral angles. Routines to satisfy the restraints optimally include conjugate gradient minimization and molecular dynamics with simulated annealing.
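The idea of satisfying spatial restraints can be made concrete with a toy objective. The sketch below is not Modeller's objective function; it is a simplified harmonic stand-in that only handles distance restraints, and the function name, coordinates and target distances are hypothetical.

```python
import math


def restraint_penalty(coords, restraints):
    """Sum of squared violations of distance restraints.

    coords     -- list of (x, y, z) positions, e.g. alpha carbons of a model
    restraints -- list of (i, j, target_distance) derived from the templates
    """
    total = 0.0
    for i, j, target in restraints:
        violation = math.dist(coords[i], coords[j]) - target
        total += violation ** 2
    return total


# Hypothetical restraints between three atoms (distances in angstroms).
restraints = [(0, 1, 3.0), (1, 2, 4.0), (0, 2, 5.0)]

model_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]  # satisfies all three
model_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]  # stretches the 0-2 pair

penalty_a = restraint_penalty(model_a, restraints)
penalty_b = restraint_penalty(model_b, restraints)
```

An optimizer such as conjugate gradient minimization would move the coordinates to drive a penalty of this kind towards zero; here the function is only used to compare two candidate models.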

3DCrunch is a large scale modeling project that aims to submit all entries from protein sequence databases to SWISS-MODEL. Currently the database contains 64,000 entries.

Automated Web-Based Homology Modeling: Web-based tools are now available to generate models of protein 3-dimensional structures using comparative modeling techniques.

SWISS-MODEL is available through Glaxo Wellcome Experimental Research in Geneva, Switzerland.

WHAT IF, available on EMBL servers, includes three components, one to generate the homology models, one to evaluate the quality of the homology models, and one to evaluate models of proteins for which the structure is already known, thereby providing for evaluation of the quality of the modeling program.

Evaluation and Refinement of the Structure: For a homology model from any source, it is important to demonstrate that the structural features of the model are reasonable in terms of what is known about protein structures in general. That is, researchers have analyzed three-dimensional structures of proteins from which basic principles of protein structure and folding have been developed. Several programs are available to assist in this analysis of correctness of a homology model. Programs that provide structure analysis along with output that is useful for publication include PROCHECK and 3D-Profiler.

PROCHECK is based on an analysis of (phi,psi) angles, peptide bond planarity, bond lengths, bond angles, hydrogen-bond geometry, and side-chain conformations of known protein structures as a function of atomic resolution. Thus, the expected values of these parameters are known and can be compared to a modeled structure based on the atomic resolution of the structures from which the model was developed.

3D-Profiler compares a homology model to its sequence using a 3D profile. The profile is based on the statistical preferences of each of the 20 amino acids for particular environments within the protein. Each residue position in a 3D model can be characterized by its environment. Preferred environments for amino acids are derived from known three-dimensional structures and are defined by three parameters: (1) the area of each residue that is buried, (2) the fraction of side-chain area that is covered by polar atoms (i.e., O and N), and (3) the local secondary structure. Based on these environment variables, a 3D structure is converted into a 1D profile that describes each residue in the folded protein structure. Examination of these profiles reveals which regions of a sequence appear to be folded correctly and which do not.
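The conversion of a 3D structure into a 1D profile can be sketched as follows. Everything numeric here is a placeholder: the burial and polarity thresholds, the environment labels and the preference scores are hypothetical values chosen for illustration, not the statistics used by any real 3D-profile program.

```python
def environment_class(buried_area, polar_fraction, secondary):
    """Assign a coarse environment label from the three parameters in the text."""
    if buried_area < 40.0:            # hypothetical threshold, square angstroms
        burial = "exposed"
    elif polar_fraction < 0.5:        # hypothetical threshold
        burial = "buried_apolar"
    else:
        burial = "buried_polar"
    return (burial, secondary)


# HYPOTHETICAL preference scores keyed by (amino acid, burial class, secondary
# structure). A real table would be derived from known 3D structures.
PREFERENCE = {
    ("L", "buried_apolar", "H"): 1.0,
    ("L", "exposed", "H"): -0.5,
    ("K", "exposed", "H"): 0.8,
    ("K", "buried_apolar", "H"): -1.2,
}


def profile_scores(sequence, environments, default=0.0):
    """Score each residue of the sequence against the environment at its position."""
    return [PREFERENCE.get((aa, burial, ss), default)
            for aa, (burial, ss) in zip(sequence, environments)]


def windowed_average(values, window=3):
    """Smooth per-residue scores so that poorly fitting stretches stand out."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical model: (buried area, polar fraction, secondary structure) per residue.
features = [(120.0, 0.1, "H"), (10.0, 0.9, "H"), (15.0, 0.2, "H"), (130.0, 0.2, "H")]
environments = [environment_class(*f) for f in features]
scores = profile_scores("LKLK", environments)
smoothed = windowed_average(scores)
```

In this toy model the last two residues sit in environments their amino acids disfavour, so the smoothed score drops towards the C-terminal end, which is the kind of signal used to flag a region that may be folded incorrectly.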

Once any irregularities have been resolved, the entire structure may then be subjected to further refinement. This process may consist of energy minimization with restraints, especially for the SCRs. The restraints then may be gradually removed for subsequent minimizations. It also may be advantageous to apply molecular dynamics in conjunction with energy minimization. For any of these refinement procedures, the structure should be solvated, using for example crystallographic waters from the known homologues, a solvent shell, or a periodic box of pre-equilibrated water molecules.

6.0 Tools for structure based drug design and docking

Docking Software

Protein-Ligand Docking

1. Affinity (Accelrys Inc.): automated, flexible docking; uses the energy of the ligand/receptor complex to automatically find the best binding modes of the ligand to the receptor (energy-driven method).

2. AutoDock (The Scripps Research Institute): automated docking of flexible ligands to macromolecules; designed to predict how small molecules, such as substrates or drug candidates, bind to a receptor of known 3D structure.


3. CombiBUILD (Sandia National Labs): structure-based drug design program created to aid the design of combinatorial libraries; screens a library of possible reactants on the computer and predicts which ones will be the most potent; successfully applied to find nanomolar inhibitors of Cathepsin D; roughly an order of magnitude superior to standard diversity approaches.

4. DockVision (University of Alberta): docking package created by scientists for scientists, including Monte Carlo, Genetic Algorithm, and database screening docking algorithms.

5. FRED (OpenEye): accurate and extremely fast multiconformer docking program; examines all possible poses within a protein active site, filtering for shape complementarity and optional pharmacophoric features before scoring with more traditional functions.

6. FlexiDock (Tripos): simple, flexible docking of ligands into binding sites on proteins; fast genetic algorithm for generation of configurations; rigid, partially flexible, or fully flexible receptor side chains provide optimal control of ligand binding characteristics; conformationally flexible ligands; tunable energy evaluation function with special H-bond treatment; very fast run times.

7. FlexX (BioSolveIT GmbH): fast computer program for predicting protein-ligand interactions; two main applications: complex prediction (create and rank a series of possible protein-ligand complexes) and virtual screening (selecting a set of compounds for experimental testing); conformational flexibility of the ligand; rigid protein; placement algorithm based on the interactions occurring between the molecules (limited to low-energy structures).

8. MIMUMBA: torsion angle database used for the creation of conformers; interaction geometry database used to exactly describe intermolecular interaction patterns; Boehm function (with minor adaptations necessary for docking) applied for scoring.

9. GLIDE (Schrödinger GmbH): high-throughput ligand-receptor docking for fast library screening; fast and accurate docking program; identifies the best binding mode through Monte Carlo sampling; provides an accurate scoring function for ranking of binding affinities; can enrich the fraction of suitable lead candidates in a chemical database; by predicting binding affinity rapidly and with a reasonable level of accuracy, will greatly enhance the probability of success in a drug discovery program.

10. GOLD (CCDC): calculating docking modes of small molecules into protein binding sites; genetic algorithm for protein-ligand docking; full ligand and partial protein flexibility; energy functions partly based on conformational and non-bonded contact information from the CSD; choice of scoring functions: GoldScore, ChemScore and user-defined score; virtual library screening.

11. HINT! (Virginia Commonwealth University): Hydropathic Interactions; empirical molecular modeling system with new methods for de novo drug design and protein or nucleic acid structural analysis; translates the well-developed Medicinal Chemistry and QSAR formalism of LogP and hydrophobicity into a free energy interaction model for all biomolecular systems, based on the experimental data from solvent partitioning; calculates 3D hydropathy fields and 3D hydropathic interaction maps; estimates LogP for modeled molecules or data files; numerically and graphically evaluates binding of drugs or inhibitors into protein structures and scores DOCK orientations; constructs hydropathic (LOCK and KEY) complementarity maps (can be used to predict an ideal substrate from a known receptor or protein structure or to propose the hydropathic structure from known agonists or antagonists); evaluates/predicts effects of site-directed mutagenesis on protein structure and stability.

12. LIGPLOT (University College London): program for automatically plotting protein-ligand interactions; generates schematic diagrams of protein-ligand interactions for a given PDB file; interactions shown are those mediated by hydrogen bonds (dashed lines between the atoms involved) and by hydrophobic contacts (represented by an arc with spokes radiating towards the ligand atoms they contact).

13. SITUS (Scripps Research Institute): program package for modeling of atomic resolution structures into low-resolution density maps; software supports both rigid-body and flexible docking using a variety of fitting strategies.

14. SenSitus: interactive docking and visualization program for low-resolution density maps and atomic structures; GUI-based alternative to certain Situs docking programs that can benefit from an interactive user interface and 3D visualization methods.

15. VEGA (Milan University): calculation of ligand-receptor interaction energy.

Protein-Ligand & Protein-Protein Docking

16. DOCK (UCSF Molecular Design Institute): generates many possible orientations (and more recently, conformations) of a putative ligand within a user-selected region of a receptor structure; orientations may be scored using several schemes designed to measure steric and/or chemical complementarity of the receptor-ligand complex; evaluate likely orientations of a single ligand, or rank molecules from a database; search databases for DNA-binding compounds; examine possible binding orientations of protein-protein and protein-DNA complexes; design combinatorial libraries.

17. GRAMM (SUNY): Global Range Molecular Matching; empirical approach to smoothing the intermolecular energy function by changing the range of the atom-atom potentials; requires only the atomic coordinates of the two molecules to predict the complex structure (no binding site information needed); performs an exhaustive 6-dimensional search through the relative translations and rotations of the molecules; see also the database of Protein-Protein Decoys for the validation of energy functions and refinement procedures.

18. ICM-Dock (MolSoft LLC): fast and accurate docking simulations; unique set of tools for accurate individual ligand-protein docking, peptide-protein docking, and protein-protein docking, including interactive graphics tools.

Protein-Protein (Peptide) Docking

19. 3D-Dock Suite (BioMolecular Modeling, Cancer Research UK): incorporates FTDock, RPScore and MultiDock; FTDock (Fourier Transform Dock) performs rigid-body docking on two biomolecules in order to predict their correct binding geometry and outputs multiple predictions that can be screened using biochemical information; RPScore (Residue level Pair potential Score) uses a single distance constraint empirically derived pair potential to screen the output from FTDock and can dramatically reduce the list of possible complexes within which a correct solution can be found; MultiDock (Multiple copy side-chain refinement Dock).

20. Bielefeld Protein Docking (Bielefeld University): detects geometrical and chemical complementarities between surfaces of proteins and estimates docking positions.

21. BiGGER (BioTecnol, S.A.): Biomolecular complex Generation with Global Evaluation and Ranking; efficient protein-docking algorithm; predicts the structure of binary protein complexes from the unbound structures; searches the complete binding space and selects a set of candidate complexes; evaluates and ranks each candidate according to the estimated probability of being an accurate model of the native complex; integrated in Chemera, a molecular graphics and modeling program for studying protein structures and interactions.

22. ClusPro (Boston University): integrated approach to protein-protein docking; docking algorithm includes the following steps: rigid body docking based on the Fourier correlation approach (using the DOT and ZDOCK docking programs); selection of structures with favorable desolvation and electrostatic properties; clustering the retained complexes using a pairwise RMSD criterion; refinement of the 25 largest clusters by the flexible docking algorithm SmoothDock.

23. DOT (San Diego Supercomputer Center): Daughter Of TURNIP; TURNIP is a program developed by V. Roberts at The Scripps Research Institute for use in the study of macromolecular docking; computation of the electrostatic potential energy between two proteins or other charged molecules.

24. ESCHER NG (Milan University): enhanced version of the original ESCHER protein-protein automatic docking system developed in 1997 by G. Ausiello, G. Cesareni and M. Helmer Citterich; new release, with reengineered code, includes some new features: protein-protein and DNA-protein docking capability; fast surface calculation based on the NSC algorithm.

25. HADDOCK (Utrecht University, Netherlands): High Ambiguity Driven protein-protein Docking; biochemical and/or biophysical interaction data, such as chemical shift perturbation data resulting from NMR titration experiments or mutagenesis data, are introduced as ambiguous interaction restraints (AIRs) to drive the docking process; an AIR is defined as an ambiguous distance between all residues shown to be involved in the interaction.

26. HEX (University of Aberdeen): protein docking and molecular superposition program; uses spherical polar Fourier correlations to accelerate docking calculations.
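Several of the rigid-body programs listed above (for example FTDock and the first step of ClusPro) are described as correlation-based. The quantity being correlated can be sketched directly, without the Fourier transform that those programs use to compute it quickly. The sketch below is a two-dimensional toy under stated assumptions: the grid values (+1 for receptor surface, -10 for receptor interior, +1 for ligand), the shapes and the function names are all hypothetical, and rotations are not searched.

```python
def correlation_score(receptor, ligand, dx, dy):
    """Sum of receptor * ligand grid values with the ligand shifted by (dx, dy).
    Cells that are absent from the receptor grid count as empty space (0)."""
    score = 0
    for (x, y), ligand_value in ligand.items():
        score += receptor.get((x + dx, y + dy), 0) * ligand_value
    return score


def best_translation(receptor, ligand, shifts):
    """Exhaustively score every translation and return (score, (dx, dy))."""
    return max(
        (correlation_score(receptor, ligand, dx, dy), (dx, dy))
        for dx in shifts
        for dy in shifts
    )


# Toy receptor: a 3 x 3 interior block (penalized) with a two-cell surface
# patch on its right-hand face (rewarded).
receptor = {(x, y): -10 for x in range(3) for y in range(3)}
receptor[(3, 0)] = 1
receptor[(3, 1)] = 1

# Toy ligand: two stacked cells.
ligand = {(0, 0): 1, (0, 1): 1}

best = best_translation(receptor, ligand, range(-1, 5))
```

The highest score is obtained where the ligand covers the surface patch without penetrating the interior. Evaluating this sum for every shift at once is what a Fourier correlation accelerates.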

7.0 Computational Resources

Web services: The Web describes information using HyperText Markup Language (HTML) and transmits it using HyperText Transfer Protocol (HTTP). The current common name, The Web, is a contraction of its original name, the World Wide Web, also abbreviated as WWW or W3. A Web browser performs multiple tasks. First, any Web browser is an HTTP client; it knows how to transfer data using the HTTP protocol. Second, any Web browser also knows how to interpret and display HTML, the content markup language used on the Web. Different browsers have different display capabilities and display the same HTML code in different ways (which is why HTML is referred to as a content markup language instead of a page description language) but all of them can understand (parse) HTML and do something reasonable with it.
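What it means for a client to parse HTML can be shown with Python's standard `html.parser` module. The class name `HeadingCollector` and the sample page are made up for illustration; the sketch only extracts top-level headings and leaves all decisions about display to the caller, which mirrors the point that HTML marks up content rather than describing a page layout.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every <h1> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case, so <H1> arrives as "h1".
        if tag == "h1":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.headings[-1] += data


# Hypothetical page content.
page = "<html><body><H1>Docking Software</H1><p>AutoDock, GOLD</p></body></html>"

collector = HeadingCollector()
collector.feed(page)
collector.close()
```

After parsing, `collector.headings` holds the heading text, and a program is free to render it in whatever font or size it chooses.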

Some of the differences in the way different Web browsers display the same Web page come from different design decisions ("what font should be used for <H1> text?") and some come from the fact that different Web clients have different capabilities. Some of these differences, such as the ability to display various kinds of still or moving images as part of the Web page or to run programs written in Java, ActiveX, or Javascript, represent extensions to HTML. These extra capabilities may be built into the browser or may be added by "plugins", software extensions which give the browser new functionality. Finally, the behavior of a Web browser can frequently be changed by configuring its preferences; if you find the default font too small, it can often be increased.

Many new computer users assume that the Web and the Internet are synonymous. However, many protocols other than HTTP flow over the Internet. In part, the new user is confused by the fact that, in addition to supporting extensions to HTML, many popular web browsers have support for other protocols such as email (SMTP, POP, IMAP), newsgroups, ftp, and gopher for example. What this really means is that the particular piece of software (e.g. Netscape Communicator) is more than just a Web client, it is also an email client, an FTP client and a Gopher client. Finally, HTTP does not have to be transmitted over the Internet, and HTML doesn't have to be transmitted via HTTP. Web technology has become a common interface tool for communication between computers on a local network (sometimes called an Intranet), and every Web client I have worked with has the ability to read and display local HTML files.

Because virtually every Web client is also a limited FTP client, many people choose to use them that way. In the case where a Web page contains a link to an FTP server, simply selecting the link downloads the file. If, however, you are given the following instructions to retrieve a file:


Telnet: Telnet is one of the oldest of the network services and perhaps the easiest to understand. Telnet allows one computer to "log on" to another computer as if it were a terminal. Once logged on, you frequently will have all the privileges of a local user; you can run programs, create and delete files. This is probably the most common way that users with accounts will use a computer.

Although "full service logins" as described above are perhaps the most common use of the telnet protocol, in fact as much control as the host's system administrator desires may be imposed on a telnet connection. Thus, a telnet service may be advertised with a public login name and password. Login with this name, however, is likely to be restricted to a limited number of commands. The National Institutes of Health in the United States used, at one point, such a telnet login to disseminate information as to the membership of study sections. Such specialized telnet services have become much less common since the rise in popularity of the Web.

A telnet session can negotiate a range of different protocols, but this almost always includes ASCII text. Because many protocols for other services (e.g. SMTP, HTTP) are encoded as ASCII text, a telnet client can sometimes be used to connect to a server for these other protocols. Most people will use a telnet client the first time connecting to a MOO, and some people will continue to use telnet as their client, although most of us find dedicated clients to be significantly more convenient. Similarly, it is possible to connect to a Web server with a telnet client if you understand the syntax of HTTP. This is almost never done to use a Web server, but is occasionally done when debugging.
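The text that one would type into a telnet session connected to a Web server is short enough to write out. The sketch below builds a minimal HTTP/1.0 request as plain ASCII and, in a separate function, sends it over a raw socket instead of a telnet client. The helper names and the host `example.org` are illustrative assumptions; `fetch` is defined but not called here because it needs network access.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Return the ASCII text of a minimal HTTP/1.0 GET request.
    Each line ends with CR LF, and an empty line ends the request."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0) -> bytes:
    """Send the request over a plain TCP connection and return the raw reply
    (status line, headers and body), much as a telnet session would show it."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


request_text = build_get_request("example.org", "/index.html")
```

Printing `request_text.decode("ascii")` shows exactly the lines a person would type by hand when debugging a server this way.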

From a practical point of view, every telnet host will be different, and thus you will need to learn about each one as you have occasion to use it.

ftp: Telnet is useful for interactive computer access, but is much less useful for transferring files. Ftp is an older service designed specifically for file transfer. Originally, ftp, like telnet, was intended for account owners. However, as it became apparent that it was useful to make files available to the world at large without giving all those wanting the files an account, the variant of "anonymous ftp" developed. In this variant, logging in with a "magic" user name (most commonly "anonymous" or "ftp") eliminates the requirement for a password.

Once logged on via ftp, access to the host filesystem is accomplished by a series of commands. On a unix ftp client, the commands are unix-like; cd to Change Directory and ls to LiSt the files in that directory. To transfer files, you execute either get, to retrieve a file from the host computer, or put, to place a file onto it (where allowed). These commands do not depend on the host computer running UNIX! These are ftp commands, some of which happen to be similar to unix commands. A client may choose to hide these commands; a client with a graphical user interface (GUI), for example, might not have typed commands at all, but buttons.

One pair of ftp commands which is especially important to understand is binary and ascii. Ftp transfers occur in ascii (text) mode by default. In ascii mode, the file received may not be identical to the one on the host, as ftp may make changes in the file during transfer, to allow for differences in how different operating systems handle text. For example, UNIX terminates lines with the linefeed character (ASCII 10 decimal), the Macintosh operating system with a carriage return (ASCII 13 decimal) and MSDOS uses one of each. These differences are corrected for during an ascii transfer. This is highly desirable for text files, but catastrophic for binary files like program object code and pictures. Thus, before getting such a file, it is important to issue the binary command. This instructs ftp to transfer files unmodified.
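The effect of an ascii-mode transfer can be imitated with a few lines of code. This is a simplified sketch of the line-ending rewrite only, not an implementation of the ftp protocol; the function name `to_local_text` is hypothetical. The eight-byte PNG file signature is used as the binary example because it contains both a carriage return and linefeed characters.

```python
def to_local_text(data: bytes, newline: bytes) -> bytes:
    """Rewrite every line ending in `data` to `newline`, as a text-mode
    transfer to a system using that convention would."""
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return normalized.replace(b"\n", newline)


# A UNIX text file arriving on an MSDOS-style system: harmless and useful.
unix_text = b"line one\nline two\n"
dos_text = to_local_text(unix_text, b"\r\n")

# The first eight bytes of any PNG image, sent through the same conversion.
png_signature = b"\x89PNG\r\n\x1a\n"
damaged = to_local_text(png_signature, b"\r\n")
```

The text file gains a carriage return on each line and remains readable, while the image header grows from eight bytes to nine and no longer matches what an image viewer expects. Issuing the binary command before the transfer avoids this rewrite.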

Email: Both ftp and telnet are interactive, more or less real time programs. Sometimes it is useful, however, to communicate with another computer, or more commonly, a user on another computer, by leaving them a message which they can read and respond to at their convenience. This is done over the Internet by using email.

Email is a generic term for a variety of processes which can use different protocols and network technology, and which, in many cases, use a more complex client/server model. At present, most email is transmitted by SMTP (Simple Mail Transfer Protocol) via TCP/IP over the Internet. SMTP transmits email on port 25 between two dedicated, full time servers. Although the assumption is that both SMTP servers will be generally available, should the receiving server not be reachable when the transmitting server needs to send email, the email message will be held and the transmission will be retried several times over a period of days until a successful transmission occurs or until the maximum retry time has been exceeded, at which point an error message will be returned to the sender.

The SMTP programs discussed above are typically symmetrical (i.e. the same program can serve as either client or server), and are complex. Typically, you will not interact with these programs directly. Rather, dedicated client software is used to compose, send, receive, and read email, and it is that software which communicates with the SMTP server. If you send and receive email via a computer that is always on and always connected to a network reachable by your mail server (e.g. a Unix workstation), then incoming mail is saved to a mail spool file on your computer, from which your client software retrieves it, and outgoing email is passed to the SMTP server. Examples of client software running on Unix workstations are mail, mailx, mush, elm, mutt, and pine. Also, as is discussed below, web browsers can sometimes be used as email clients.

If you send and receive email via a computer that is not always on and/or not always connected to the network, sending email proceeds as above, but receiving email is different in that the SMTP server cannot necessarily get incoming email onto your computer's file system. In that case, a different protocol is used, most commonly POP3. (IMAP is a newer protocol for accomplishing the same task about which you may hear more in the future.) The SMTP server stores your email on a remote host and your local client retrieves it from a POP3 server when you check for mail. Typically, a POP3 account will be provided by whoever provides your Internet access. Thus, to install an email client on a Mac or Windows computer, you typically have to provide the domain name and/or IP address of the SMTP and POP3 servers (frequently the same) and the user name and password for the POP3 account.

Important commands in LINUX/UNIX operating systems

1. cat - display or concatenate files

cat takes a copy of a file and sends it to the standard output (i.e. to be displayed on your terminal, unless redirected elsewhere), so it is generally used either to read files, or to string together copies of several files, writing the output to a new file.

cat ex

displays the contents of the file ex.

cat ex1 ex2 > newex

creates a new file newex containing copies of ex1 and ex2, with the contents of ex2 following the contents of ex1.
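
As a small illustration of concatenation, the sketch below builds two one-line files in a scratch directory and joins them. The directory comes from mktemp and the file contents are invented for the example.

```shell
# Sketch: concatenating two files into a new one with cat.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf 'first\n'  > ex1
printf 'second\n' > ex2

cat ex1 ex2 > newex      # ex2 follows ex1 in the new file
combined=$(cat newex)
echo "$combined"
```

Note that the output file should not be one of the input files: the shell empties the redirection target before cat starts reading.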

2. cd - change directory

cd is used to change from one directory to another.

cd dir1

changes directory so that dir1 is your new current directory. dir1 may be either the full pathname of the directory, or its pathname relative to the current directory.

cd

changes directory to your home directory.

cd ..

moves to the parent directory of your current directory.
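
The difference between relative and full pathnames can be checked with pwd after each move. This is a minimal sketch in a scratch directory; dir1 is a hypothetical name.

```shell
# Sketch: moving between directories with relative and full pathnames.
tmp=$(mktemp -d)
mkdir "$tmp/dir1"
cd "$tmp" || exit 1
start=$(pwd)

cd dir1 || exit 1            # pathname relative to the current directory
inside=$(pwd)

cd .. || exit 1              # back to the parent directory
back=$(pwd)

cd "$start/dir1" || exit 1   # the same directory, given by its full pathname
full=$(pwd)

echo "$inside"
echo "$back"
echo "$full"
```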

3. chmod - change the permissions on a file or directory

chmod alters the permissions on files and directories using either symbolic or octal numeric codes. The symbolic codes are given here:

u user
g group
o other
a all
+ to add a permission
- to remove a permission
= to assign a permission explicitly
r read
w write
x execute (for files), access (for directories)

The following examples illustrate how these codes are used.

chmod u=rw file1

sets the permissions on the file file1 to give the user read and write permission on file1. No other permissions are altered.

chmod u+x,g+w,o-r file1

alters the permissions on the file file1 to give the user execute permission on file1, to give members of the user's group write permission on the file, and prevent any users not in this group from reading it.

chmod u+w,go-x dir1

gives the user write permission in the directory dir1, and prevents all other users from entering that directory with cd. (They can still list its contents using ls.)
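
The symbolic and octal forms can be compared directly. The sketch below starts from a known mode (644, set explicitly so the result does not depend on your umask), applies one of the symbolic examples, and then sets the same permissions with the octal code 760. The file name is hypothetical, and the first ten characters of the ls -l listing are used to read the mode back.

```shell
# Sketch: symbolic versus octal chmod on a scratch file.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch file1

chmod 644 file1                     # known starting point: rw-r--r--
chmod u+x,g+w,o-r file1             # add x for user, w for group, remove r for others
mode=$(ls -l file1 | cut -c1-10)
echo "$mode"                        # -rwxrw----

chmod 644 file1                     # reset, then set the same result in octal
chmod 760 file1                     # 7 = rwx, 6 = rw-, 0 = ---
mode_octal=$(ls -l file1 | cut -c1-10)
echo "$mode_octal"
```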

4. cp - copy a file

The command cp is used to make copies of files and directories.

cp file1 file2
copies the contents of the file file1 into a new file called file2. cp cannot copy a file
onto itself.
cp file3 file4 dir1
creates copies of file3 and file4 (with the same names), within the directory dir1. dir1
must already exist for the copying to succeed.
cp -r dir2 dir3
recursively copies the directory dir2, together with its contents and subdirectories, to
the directory dir3. If dir3 does not already exist, it is created by cp, and the contents
and subdirectories of dir2 are recreated within it. If dir3 does exist, a subdirectory
called dir2 is created within it, containing a copy of all the contents of the original dir2 .
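
The two behaviours of cp -r can be seen by running the same command twice. This is a sketch with invented file names in a scratch directory.

```shell
# Sketch: cp -r when the destination does not exist, and when it does.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir -p dir2/sub
echo data > dir2/a.txt
echo more > dir2/sub/b.txt

cp -r dir2 dir3    # dir3 does not exist yet: it becomes a copy of dir2
cp -r dir2 dir3    # dir3 now exists: the copy is placed in dir3/dir2

find dir3 -type f | sort
```

Running the command a second time is a common source of unexpected nested directories, so it is worth checking whether the destination already exists.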

5. date - display the current date and time

date returns information on the current date and time in the format shown below:

Tue Mar 25 15:21:16 GMT 1997

It is possible to alter the format of the output from date. For example, the command

date '+The date is %d/%m/%y, and the time is %H:%M:%S.'

produces output of the form

The date is 14/12/97, and the time is 15:10:00.
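
Format strings are also convenient inside scripts, where the result can be stored in a variable. A short sketch follows; the variable names are arbitrary, and the %Y code for a four-digit year is added here as a common alternative to %y.

```shell
# Sketch: capturing formatted dates in shell variables.
stamp=$(date '+%d/%m/%y')     # day/month/two-digit year
clock=$(date '+%H:%M:%S')     # 24-hour time
iso=$(date '+%Y-%m-%d')       # four-digit year, convenient for file names

echo "The date is $stamp, and the time is $clock."
echo "backup-$iso.tar"
```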

6. diff - display differences between text files

diff file1 file2

reports line-by-line differences between the text files file1 and file2. The default output will contain lines such as n1 a n2,n3 and n4,n5 c n6,n7 (where n1 a n2,n3 means that file2 has the extra lines n2 to n3 following the line that has the number n1 in file1, and n4,n5 c n6,n7 means that lines n4 to n5 in file1 differ from lines n6 to n7 in file2). After each such line, diff prints the relevant lines from the text files, with < in front of each line from file1 and > in front of each line from file2.

There are several options to diff, including diff -i, which ignores the case of letters when comparing lines, and diff -b, which ignores trailing blanks and treats other runs of blanks as equal.

diff -cn

produces a listing of differences within n lines of context, where the default is three lines. The form of the output is different from that given by diff, with + indicating lines which have been added, - indicating lines which have been removed, and ! indicating lines which have been changed.

diff dir1 dir2

will sort the contents of directories dir1 and dir2 by name, and then run diff on the text files which differ.
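
The default output format is easiest to read on a tiny example. The sketch below uses two invented files that differ in one changed line and one added line. diff returns a non-zero exit status when the files differ, so the calls are guarded with || true.

```shell
# Sketch: reading the default diff output on two small files.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'a\nb\nc\n'    > file1
printf 'a\nB\nc\nd\n' > file2

out=$(diff file1 file2 || true)
echo "$out"
# 2c2      line 2 of file1 differs from line 2 of file2
# < b
# ---
# > B
# 3a4      file2 has an extra line 4 after line 3 of file1
# > d

out_i=$(diff -i file1 file2 || true)   # ignoring case, only the added line remains
echo "$out_i"
```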

7. file - determine the type of a file

file tests named files to determine the categories their contents belong to.

file file1
can tell if file1 is, for example, a source program, an executable program or shell
script, an empty file, a directory, or a library, but (a warning!) it does sometimes
make mistakes.

8. find - find files of a specified name or type
find searches for files in a named directory and all its subdirectories.
find . -name '*.f' -print
searches the current directory and all its subdirectories for files ending in .f, and
writes their names to the standard output. In some versions of Unix the names of the
files will only be written out if the -print option is used.
find /local -name core -user user1 -print
searches the directory /local and its subdirectories for files called core belonging to the
user user1 and writes their full file names to the standard output.
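
A small directory tree makes the recursive search visible. In this sketch the directory and file names are invented. The pattern is quoted so that the shell passes it to find unchanged instead of expanding it itself.

```shell
# Sketch: searching a directory tree by file name pattern.
tmp=$(mktemp -d)
mkdir -p "$tmp/src/lib"
touch "$tmp/src/main.f" "$tmp/src/lib/util.f" "$tmp/src/notes.txt"
cd "$tmp" || exit 1

found=$(find . -name '*.f' -print | sort)
echo "$found"
count=$(echo "$found" | wc -l | tr -d ' ')
echo "$count files end in .f"
```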

9. grep - searches files for a specified string or expression

grep searches for lines containing a specified pattern and, by default, writes them to the standard output.

grep motif1 file1

searches the file file1 for lines containing the pattern motif1. If no file name is given, grep acts on the standard input. grep can also be used to search a string of files, so

grep motif1 file1 file2 ... filen

will search the files file1, file2, ..., filen, for the pattern motif1.

grep -c motif1 file1

will give the number of lines containing motif1 instead of the lines themselves.

grep -v motif1 file1

will write out the lines of file1 that do NOT contain motif1.
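
The three forms can be tried on a pair of small files. The contents below are invented for the example and are not real sequence data.

```shell
# Sketch: matching, counting, and inverting with grep.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ATGmotif1CCA\nGGGTTT\nmotif1 again\n' > file1
printf 'no match here\nmotif1\n'              > file2

grep motif1 file1                  # prints the two matching lines
matches=$(grep -c motif1 file1)    # counts them instead
others=$(grep -v motif1 file1)     # the lines that do not match
echo "$matches matching lines; non-matching: $others"

grep motif1 file1 file2            # with several files, each line is prefixed by its file name
```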


10. gzip - compress a file

gzip reduces the size of named files, replacing them with files of the same name extended by .gz . The amount of space saved by compression varies.

gzip file1

results in a compressed file called file1.gz, and deletes file1.

gzip -v file2

compresses file2 and gives information, in the format shown below, on the percentage of the file's size that has been saved by compression:

file2 : Compression 50.26 -- replaced with file2.gz

To restore files to their original state use the command gunzip. If you have a compressed file file2.gz, then

gunzip file2

will replace file2.gz with the uncompressed file file2.
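
A round trip through gzip and gunzip shows that the original file is replaced and then restored unchanged. This sketch assumes gzip is installed. The repetitive test data and file names are invented.

```shell
# Sketch: compressing a file and restoring it again.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Repetitive text compresses well.
i=0
while [ "$i" -lt 200 ]; do
    echo "ACGTACGTACGTACGT"
    i=$((i + 1))
done > file1
cp file1 original            # keep a copy to compare against

gzip file1                   # file1 is replaced by file1.gz
if [ -f file1.gz ] && [ ! -e file1 ]; then gz_state=replaced; else gz_state=unexpected; fi
echo "after gzip: $gz_state"

gunzip file1.gz              # file1.gz is replaced by file1
cmp -s file1 original && echo "round trip preserved the contents"
```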

11. help - display information about bash builtin commands

help gives access to information about builtin commands in the bash shell. Using help on its own will give a list of the commands it has information about. help followed by the name of one of these commands will give information about that command. help history, for example, will give details about the bash shell history listings.

12. info - read online documentation
info is a hypertext information system. Using the command info on its own will enter the
info system, and give a list of the major subjects it has information about. Use the
command q to exit info. For example, info bash will give details about the bash shell.

13. kill - kill a process

To kill a process using kill requires the process id (PID), which can be found by using ps.

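In a script, the PID of a background job is available in the shell variable $!, so no lookup is needed. The sketch below starts a harmless sleep process and ends it. The exit status reported by wait is typically 143 (128 plus the number of the default signal, SIGTERM), although the exact value can vary between shells.

```shell
# Sketch: ending a background process by its process id.
sleep 60 &                 # start a long-running background process
pid=$!                     # $! holds the PID of the most recent background job

kill "$pid"                # send the default signal (SIGTERM)

status=0
wait "$pid" 2>/dev/null || status=$?   # collect the exit status of the ended process
echo "process $pid ended with status $status"
```
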
14. lpr - print out a file

lpr is used to send the contents of a file to a printer. If the printer is a laserwriter, and the file contains PostScript, then the PostScript will be interpreted and the results of that printed out.

lpr -Pprinter1 file1

will send the file file1 to be printed out on the printer printer1. To see the status of the job on the printer queue use

lpq -Pprinter1

for a list of the jobs queued for printing on printer1. (This may not work for remote printers.)


15. ls - list names of files in a directory

ls lists the contents of a directory, and can be used to obtain information on the files and directories within it.

ls dir1

lists the names of the files and directories in the directory dir1, (excluding files whose names begin with . ). If no directory is named, ls lists the contents of the current directory.

ls -a dir1

will list the contents of dir1 (including files whose names begin with . ).

ls -l file1

gives details of the access permissions for the file file1, its size in bytes, and the time it was last altered.

ls -l dir1
gives such information on the contents of the directory dir1. To obtain the information
on dir1 itself, rather than its contents, use
ls -ld dir1
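
The effect of -a and -d can be seen on a scratch directory containing one ordinary file and one file whose name begins with a dot. The names are invented for the example.

```shell
# Sketch: ls with and without -a, and -ld for the directory itself.
tmp=$(mktemp -d)
mkdir "$tmp/dir1"
touch "$tmp/dir1/visible" "$tmp/dir1/.hidden"

plain=$(ls "$tmp/dir1")          # dot files are left out
all=$(ls -a "$tmp/dir1")         # dot files, . and .. are included
long_dir=$(ls -ld "$tmp/dir1")   # one line describing dir1 itself

echo "$plain"
echo "$all"
echo "$long_dir"                 # the listing starts with d for a directory
```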

16. man - display an on-line manual page

man displays on-line reference manual pages.

man command1

will display the manual page for command1, e.g. man cp, man man.

man -k keyword

lists the manual page subjects that have keyword in their headings. This is useful if you do not yet know the name of a command you are seeking information about.

man -Mpath command1

is used to change the set of directories that man searches for manual pages on command1.

17. mkdir - make a directory

mkdir is used to create new directories. In order to do this you must have write permission in the parent directory of the new directory.

mkdir newdir

will make a new directory called newdir.

mkdir -p can be used to create a new directory, together with any parent directories required.

mkdir -p dir1/dir2/newdir

will create newdir and its parent directories dir1 and dir2, if these do not already exist.
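
The difference between mkdir and mkdir -p shows up as soon as a parent directory is missing. This is a sketch in a scratch directory.

```shell
# Sketch: creating nested directories with and without -p.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Without -p the command fails, because dir1 and dir2 do not exist yet.
if mkdir dir1/dir2/newdir 2>/dev/null; then plain=ok; else plain=failed; fi
echo "plain mkdir: $plain"

mkdir -p dir1/dir2/newdir    # creates dir1, dir2 and newdir in one step
mkdir -p dir1/dir2/newdir    # repeating the command is harmless
ls -d dir1/dir2/newdir
```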


18. more - scan through a text file page by page

more displays the contents of a file on a terminal one screenful at a time.

more file1

starts by displaying the beginning of file1. It will scroll up one line every time the return key is pressed, and one screenful every time the space bar is pressed. Type ? for details of the commands available within more. Type q if you wish to quit more before the end of file1 is reached.

more -n file1

will cause n lines of file1 to be displayed in each screenful instead of the default (which is two lines less than the number of lines that will fit into the terminal's screen).

19. mv - move or rename files or directories

mv is used to change the name of files or directories, or to move them into other directories. Some versions of mv cannot move directories from one file-system to another, so, if it is necessary to do that, use cp instead.

mv file1 file2

changes the name of a file from file1 to file2.

mv dir1 dir2

changes the name of a directory from dir1 to dir2, unless dir2 already exists, in which case dir1 will be moved into dir2.

mv file1 file2 dir3

moves the files file1 and file2 into the directory dir3.
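
Whether mv renames or moves depends on whether the last argument is an existing directory. The sketch below walks through both cases in a scratch directory with invented names.

```shell
# Sketch: renaming versus moving with mv.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
echo one > file1
mkdir dir1 dir3

mv file1 file2         # file2 does not exist: file1 is renamed
mv dir1 dir2           # dir2 does not exist: dir1 is renamed

mkdir dir1
mv dir1 dir2           # dir2 now exists: dir1 is moved inside it

echo two > file1
mv file1 file2 dir3    # several sources: both files are moved into dir3

find . | sort
```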

20. nice - change the priority at which a job is being run
nice causes a command to be run at a lower than usual priority. nice can be particularly
useful when running a long program that could cause annoyance if it slowed down the
execution of other users' commands. An example of the use of nice is
nice compress file1

which will execute the compression of file1 at a lower priority.

21. passwd - change your password

Use passwd when you wish to change your password. You will be prompted once for your current password, and twice for your new password. Neither password will be displayed on the screen.


22. ps - list processes

ps displays information on processes currently running on your machine. This information includes the process id, the controlling terminal (if there is one), the cpu time used so far, and the name of the command being run.

ps

gives brief details of your own processes in your current session.

To obtain full details of all your processes, including those from previous sessions use:-

ps -fu user1

using your own user name in place of user1.
ps is a command whose options vary considerably in different versions of Unix (such as
BSD and SystemV). Use man ps for details of all the options available on the machine you
are using.

23. pwd - display the name of your current directory
The command pwd gives the full pathname of your current directory.

24. quota - disk quota and usage

quota gives information on a user's disk space quota and usage.

quota

will only give details of where you have exceeded your disc quota on local disks, whereas

quota -v

will display your quota and usage, whether the quota has been exceeded or not, and includes information on disks mounted from other machines, as well as the local disks.

25. rm - remove files or directories
rm is used to remove files. In order to remove a file you must have write permission in its
directory, but it is not necessary to have read or write permission on the file itself.

rm file1

will delete the file file1. If you use

rm -i file1

instead, you will be asked if you wish to delete file1, and the file will not be deleted unless you answer y. This is a useful safety check when deleting lots of files.

rm -r dir1

recursively deletes the contents of dir1, its subdirectories, and dir1 itself, and should be used with suitable caution.


26. rmdir - remove a directory

rmdir removes named empty directories. If you need to delete a non-empty directory rm -r can be used instead.

rmdir exdir

will remove the empty directory exdir .
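
The following sketch shows rmdir refusing to remove a directory that still contains a file, and rm -r removing it. Everything happens inside a scratch directory, and fulldir and file1 are invented names.

```shell
# Sketch: rmdir only removes empty directories; rm -r removes everything.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir exdir fulldir
touch fulldir/file1

rmdir exdir                  # succeeds: exdir is empty

if rmdir fulldir 2>/dev/null; then nonempty=removed; else nonempty=refused; fi
echo "rmdir on a non-empty directory: $nonempty"

rm -r fulldir                # removes fulldir together with its contents
ls -a
```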

27. sort - sort and collate lines
The command sort sorts and collates lines in files, sending the results to the standard
output. If no file names are given, sort acts on the standard input. By default, sort sorts
lines using a character by character comparison, working from left to right, and using the
order of the ASCII character set.
sort -d
uses "dictionary order", in which only letters, digits, and white-space characters are
considered in the comparisons.
sort -r
reverses the order of the collating sequence.
sort -n
sorts lines according to the arithmetic value of leading numeric strings. Leading
blanks are ignored when this option is used, (except in some System V versions of
sort, which treat leading blanks as significant. To be certain of ignoring leading
blanks use sort -bn instead.).
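
The difference between character-by-character and numeric sorting is clearest on a short list of numbers. In this sketch the locale is set to C so that the default order is the plain ASCII order described above. The input values are arbitrary.

```shell
# Sketch: default, numeric, and reversed numeric sorting.
tmp=$(mktemp -d)
printf '10\n9\n100\n2\n' > "$tmp/numbers"
LC_ALL=C
export LC_ALL

by_char=$(sort "$tmp/numbers" | tr '\n' ' ')       # 10 100 2 9
by_value=$(sort -n "$tmp/numbers" | tr '\n' ' ')   # 2 9 10 100
reversed=$(sort -rn "$tmp/numbers" | tr '\n' ' ')  # 100 10 9 2

echo "character order: $by_char"
echo "numeric order:   $by_value"
echo "reverse numeric: $reversed"
```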

28. slogin - secure remote login program

slogin is used for logging onto a remote machine and for executing commands on a remote machine, and provides secure encrypted communications between the local and remote machines using an SSH protocol. The remote machine must be running an SSH server for such connections to be possible.

29. telnet - remote login program

telnet communicates with another computer using the TELNET protocol.

telnet host1

will connect to the remote machine host1 (if it allows telnet connections). You can then log in using your user name on that machine. If you type telnet's escape character instead, you will enter telnet's command mode (you'll get the prompt telnet> ), and the command quit will get you back to the command line of your local machine.

As communications between the two machines are not encrypted when using telnet, it is preferable to use ssh, if that is available.

Some Bioprogramming tools:

1. Bioconductor

Bioconductor is an open source and open development software project that aims to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data.

2. BioDAS

This site is the center of development of an Open Source system for exchanging annotations on genomic sequence data.

3. BioJava

The BioJava Project is an open-source project dedicated to providing Java tools for processing biological data.

4. BioMOBY

A project exploring methodologies for biological data representation, distribution, and discovery.

5. BioPAX

The BioPAX web site provides information about a collaborative effort to create a data exchange format for biological pathways.

6. BioPerl

The BioPerl Project is an international association of developers of open source Perl tools for bioinformatics, genomics and life science research.

7. BioPerl course

A tutorial for those interested in the BioPerl group of modules.

8. BioPHP

Open Source PHP code for bioinformatics. Includes functions and minitools (copy and paste one page scripts for basic tasks in bioinformatics). A wiki-like service allows modification and improvement of code.

9. BioPipe

The biopipe is a workflow framework that seeks to address some of the complexity involved in carrying out large scale bioinformatics analysis. It has been designed to work intimately with the bioperl package.



10. Biopython

The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

11. BioRuby

The BioRuby project aims to implement an integrated environment for bioinformatics with Ruby.

12. CCT

CCT (Current Comparative Table) is a software package that you can install and set up on your own system to help you to maintain and search databases.

13. Ensembl API

Ensembl is a freely available software system for genomic analysis. The documentation page at Ensembl is the best place to get information on the Ensembl application programming interface (API). In particular, the tutorial document includes lots of examples of scripts and exercises for you to try.

14. Human Ageing Genomic Resources

The Human Ageing Genomic Resources (HAGR) website provides tools and curated databases relevant to the genetics of human ageing. GenAge is a database of genes related to human ageing, and AnAge is a multi-species database facilitating the comparative biology of ageing. The Ageing Research Computational Tools (ARCT) is a collection of Perl modules to assist comparative genomics research.

15. NCBI C++ toolkit

The NCBI C++ Toolkit is a collection of C++ modules developed by the NCBI for writing bioinformatics software and applications.

16. Open Bioinformatics Foundation

The Open Bioinformatics Foundation is a non profit, volunteer run organization focused on supporting open source programming in bioinformatics.

17. PyMOL

PyMOL is a molecular graphics system with an embedded Python interpreter designed for real-time visualization and rapid generation of high-quality molecular graphics images and animations.

18. R

System for statistical computation and graphics; an interpreted computer language which allows branching and looping as well as modular programming using functions.

19. SeqHound API

SeqHound is a bioinformatics application programming platform that provides access to biological sequence, structure and functional annotation data. An application programming interface (API) is available to programmers using C, C++, Java and PERL.

20. Systems Biology Markup Language (SBML)

The Systems Biology Markup Language (SBML) is a computer-readable format for representing models of biochemical reaction networks. SBML is applicable to metabolic networks, cell-signaling pathways, genomic regulatory networks, and many other areas in systems biology.


[1] Fitch, W.M. (1983) "Random sequences." J. Mol. Biol. 163:171-176.

[2] Lipman, D.J., Wilbur, W.J., Smith T.F. & Waterman, M.S. (1984) "On the statistical significance of nucleic acid similarities." Nucl. Acids Res. 12:215-226.

[3] Altschul, S.F. & Erickson, B.W. (1985) "Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage." Mol. Biol. Evol. 2:526-538.

[4] Deken, J. (1983) "Probabilistic behavior of longest-common-subsequence length." In "Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison." D. Sankoff & J.B. Kruskal (eds.), pp. 55-91, Addison-Wesley, Reading, MA.

[5] Reich, J.G., Drabsch, H. & Daumler, A. (1984) "On the statistical assessment of similarities in DNA sequences." Nucl. Acids Res. 12:5529-5543.

[6] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

[7] Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular subsequences." J. Mol. Biol. 147:195-197.

[8] Sellers, P.H. (1984) "Pattern recognition in genetic sequences by mismatch density." Bull. Math. Biol. 46:501-514.

[9] Gumbel, E. J. (1958) "Statistics of extremes." Columbia University Press, New York, NY.

[10] Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268.

[11] Dembo, A., Karlin, S. & Zeitouni, O. (1994) "Limit distribution of maximal non-aligned two-sequence segmental score." Ann. Prob. 22:2022-2039.

[12] Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448.

[13] Pearson, W.R. (1995) "Comparison of methods for searching protein sequence databases." Prot. Sci. 4:1145-1160.

[14] Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth. Enzymol. 266:460-480.

[15] Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.

[16] Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656.

[17] Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of protein sequence similarities." Comput. Appl. Biosci. 4:67-71.

[18] Mott, R. (1992) "Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores." Bull. Math. Biol. 54:59-75.

[19] Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628.

[20] Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and Poisson approximation." Stat. Sci. 9:367-381.

[21] Pearson, W.R. (1998) "Empirical statistical estimates for sequence similarity searches." J. Mol. Biol. 276:71-84.

[22] Arratia, R. & Waterman, M.S. (1994) "A phase transition for the score in matching random sequences allowing deletions." Ann. Appl. Prob. 4:200-225.

[23] McLachlan, A.D. (1971) "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c-551." J. Mol. Biol. 61:409-424.

[24] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary change in proteins." In "Atlas of Protein Sequence and Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), pp. 345-352. Natl. Biomed. Res. Found., Washington, DC.

[25] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant relationships." In "Atlas of Protein Sequence and Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), p. 353-358. Natl. Biomed. Res. Found., Washington, DC.

[26] Feng, D.F., Johnson, M.S. & Doolittle, R.F. (1984) "Aligning amino acid sequences: comparison of commonly used methods." J. Mol. Evol. 21:112-125.

[27] Wilbur, W.J. (1985) "On the PAM matrix model of protein evolution." Mol. Biol. Evol. 2:434-447.

[28] Taylor, W.R. (1986) "The classification of amino acid conservation." J. Theor. Biol. 119:205-218.

[29] Rao, J.K.M. (1987) "New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters." Int. J. Peptide Protein Res. 29:276-281.

[30] Risler, J.L., Delorme, M.O., Delacroix, H. & Henaut, A. (1988) "Amino acid substitutions in structurally related proteins. A pattern recognition approach. Determination of a new and efficient scoring matrix." J. Mol. Biol. 204:1019-1029.

[31] Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565.

[32] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices." Methods 3:66-70.

[33] Gonnet, G.H., Cohen, M.A. & Benner, S.A. (1992) "Exhaustive matching of the entire protein sequence database." Science 256:1443-1445.

[34] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.

[35] Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) "The rapid generation of mutation data matrices from protein sequences." Comput. Appl. Biosci. 8:275-282.

[36] Overington, J., Donnelly, D., Johnson M.S., Sali, A. & Blundell, T.L. (1992) "Environment-specific amino acid substitution tables: Tertiary templates and prediction of protein folds." Prot. Sci. 1:216-226.

[37] Henikoff, S. & Henikoff, J.G. (1993) "Performance evaluation of amino acid substitution matrices." Proteins 17:49-61.

[38] Gotoh, O. (1982) "An improved algorithm for matching biological sequences." J. Mol. Biol. 162:705-708.

[39] Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.

[40] Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using affine gap costs." Bull. Math. Biol. 48:603-616.

[41] Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space." Comput. Appl. Biosci. 4:11-17.

[42] Claverie, J.-M. & States, D.J. (1993) "Information enhancement methods for large-scale sequence-analysis." Comput. Chem. 17:191-201.

[43] Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17:149-163.

[44] Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994) "Issues in searching molecular sequence databases." Nature Genet. 6:119-129.

[45] Blundell, T.L., Sibanda, B.L., Sternberg, M.J.E., and Thornton, J.M. (1987) Knowledge-Based Prediction of Protein Structures and the Design of Novel Molecules. Nature 326: 347-352.

[46] Fetrow, J.S. and Bryant, S.H. (1993) New Programs for Protein Tertiary Structure Prediction. Bio/Technology 11: 479-484.

[47] Greer, J. (1991) Comparative Modeling of Homologous Proteins. Meth. Enzymol. 202: 239-252.

[48] Johnson, M.S., Srinivasan, N., Sowdhamini, R., and Blundell, T.L. (1994) Knowledge-Based Protein Modeling. Crit. Rev. Biochem. Mol. Biol. 29: 1-68.

[49] Sali, A., Overington, J.P., Johnson, M.S., and Blundell, T.L. (1990) From Comparisons of Protein Sequences and Structures to Protein Modelling and Design. Trends Biochem. Sci. 15: 235-240.

[50] Lewin, R. (1987) When Does Homology Mean Something Else? Science 237: 1570.

[51] Reeck, G.R. et al. (1987) "Homology" in Proteins and Nucleic Acids: A Terminology Muddle and a Way out of It. Cell 50: 667.

[52] Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol. 48:


[53] Dayhoff, M.O. and Eck, R.V. (1968) A Model of Evolutionary Change in Proteins. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 3, pp. 33-41, National Biomedical Research Foundation, Washington, D.C.

[54] Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C. (1978) A Model for Evolutionary Change. In Atlas of Protein Sequence and Structure (Dayhoff, M.O., ed.), vol. 5, suppl. 3, pp. 345-358, National Biomedical Research Foundation, Washington, D.C.

[55] Dayhoff, M.O., Barker, W.C., and Hunt, L.T. (1983) Establishing Homologies in Protein Sequences. Meth. Enzymol. 91: 524-545.

[56] Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

[57] Johnson, M.S. and Overington, J.P. (1993) A Structural Basis for Sequence Comparisons - An Evaluation of Scoring Methodologies. J. Mol. Biol. 233: 716-738.

[58] Pearson, W.R. (1995) Comparison of Methods for Searching Protein Sequence Databases. Protein Sci. 4: 1145-1160.

[59] Kabsch, W. and Sander, C. (1983) Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 22:


[60] Sali, A. and Blundell, T.L. (1993) Comparative Protein Modelling by Satisfaction of Spatial Restraints. J. Mol. Biol. 234: 779-815.

[61] Luthy, R., Bowie, J.U., and Eisenberg, D. (1992) Assessment of Protein Models with Three-Dimensional Profiles. Nature 356: 83-85.

[62] Bowie, J.U., Luthy, R., and Eisenberg, D. (1991) A Method to Identify Protein Sequences That Fold into a Known Three-Dimensional Structure. Science 253: 164-170.

[63] Terwilliger, T.C., Waldo, G., Peat, T.S., Newman, J.M., Chu, K., and Berendzen, J. (1998) Class-directed Structure Determination: Foundation for a Protein Structure Initiative. Protein Sci. 7: 1851-1856.