TRAINING PROGRAM
May 16-18, 2007
Reference Material
1.0 Introduction
1962 The first theory of molecular evolution; the Molecular Clock concept (Linus
Pauling and Emile Zuckerkandl)
1965 Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and
coworkers)
1970 Needleman-Wunsch algorithm for global protein sequence alignment
1977 New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers);
bacteriophage φX174 sequence
1977 First software for sequence analysis (Roger Staden)
1977 Phylogenetic taxonomy; archaea discovered; the notion of the three primary
kingdoms of life introduced (Carl Woese and coworkers)
1981 Smith-Waterman algorithm for local protein sequence alignment
1981 Human mitochondrial genome sequenced
The linear sequence of amino acids strung together to form a polypeptide is referred to as the
primary structure. The secondary structure is generated by the folding of the primary
sequence and refers to the path that the polypeptide backbone of the protein follows in
space. Certain types of secondary structures are relatively common. Two well-described
secondary structures are the alpha helix and the beta sheet. In the first case, certain types
of bonding between groups located on the same polypeptide chain cause the backbone to
twist into a helix, most often in a form known as the alpha helix. Beta sheets are formed
when a polypeptide chain bonds with another chain that is running in the opposite
direction. Beta sheets may also be formed between two sections of a single polypeptide
chain that is arranged such that adjacent regions are in reverse orientation.
The tertiary structure describes the organization in three dimensions of all of the atoms
in the polypeptide. If a protein consists of only one polypeptide chain, this level then
describes the complete structure.
Multimeric proteins, or proteins that consist of more than one polypeptide chain, require
a higher level of organization. The quaternary structure defines the conformation
assumed by a multimeric protein. In this case, the individual polypeptide chains that
make up a multimeric protein are often referred to as the protein subunits. The four
levels of protein structure are hierarchical, that is, each level of the build process is
dependent upon the one below it.
A protein's primary amino acid sequence is crucial in determining its final structure. In
some cases, amino acid sequence is the sole determinant, whereas in other cases,
additional interactions may be required before a protein can attain its final conformation.
For example, some proteins require the presence of a cofactor, or a second molecule that
is part of the active protein, before they can attain their final conformation. Multimeric
proteins often require one or more subunits to be present for another subunit to adopt the
proper higher order structure. The entire process is cooperative, that is, the formation of
one region of secondary structure determines the formation of the next region.
Allosteric Proteins: These are proteins which under certain conditions have a stable
alternate conformation, or shape, that enables them to carry out a different biological
function. The interaction of an allosteric protein with a specific cofactor, or with another
protein, may influence the transition of the protein between shapes. In addition, any
X-ray Crystallography: Crystals are a solid form of a substance in which the component
molecules are present in an ordered array called a lattice. The basic building block of a
crystal is called a unit cell. Each unit cell contains exactly one unique set of the crystal's
components, the smallest possible set that is fully representative of the crystal. When the
crystal is placed in an X-ray beam, all of the unit cells present the same face to the beam;
therefore, many molecules are in the same orientation with respect to the incoming X-
rays. The X-ray beam enters the crystal and a number of smaller beams emerge: each one
in a different direction, each one with a different intensity. If an X-ray detector, such as a
piece of film, is placed on the opposite side of the crystal from the X-ray source, each
diffracted ray, called a reflection, will produce a spot on the film. However, because only
a few reflections can be detected with any one orientation of the crystal, an important
component of any X-ray diffraction instrument is a device for accurately setting and
changing the orientation of the crystal. The set of diffracted, emerging beams contains
information about the underlying crystal structure.
The major drawback associated with this technique is that crystallization of the proteins
is a difficult task. Crystals are formed by slowly precipitating proteins under conditions
that maintain their native conformation or structure. These exact conditions can only be
discovered by repeated trials that entail varying certain experimental conditions, one at a
time. This is a very time consuming and tedious process.
The BLAST algorithm is a heuristic program, which means that it relies on some smart
shortcuts to perform the search faster. BLAST performs "local" alignments. Most
proteins are modular in nature, with functional domains often being repeated within the
same protein as well as across different proteins from different species. The BLAST
algorithm is tuned to find these domains or shorter stretches of sequence similarity. The
local alignment approach also means that an mRNA can be aligned with a piece of
genomic DNA, as is frequently required in genome assembly and analysis. If instead
BLAST started out by attempting to align two sequences over their entire lengths (known
as a global alignment), fewer similarities would be detected, especially with respect to
domains and motifs.
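The difference between local and global alignment can be illustrated with a small sketch. The code below is not BLAST (which is heuristic); it uses exact dynamic programming in the style of Smith-Waterman (local) and Needleman-Wunsch (global). The scoring values (match 2, mismatch -1, linear gap penalty -2) and the function names are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local (Smith-Waterman style) alignment score of strings a and b."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch style) alignment score of strings a and b."""
    # Leading gaps are penalised because both sequences must be aligned end to end.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


# A short "domain" embedded in a longer, otherwise unrelated sequence.
long_seq = "XXXXGATTACAYYYY"
domain = "GATTACA"
print(local_alignment_score(long_seq, domain))   # 14: the shared stretch scores fully
print(global_alignment_score(long_seq, domain))  # -2: end gaps cancel the matches
```

Under this scoring the local score reflects only the shared stretch, while the global score is dragged down by the unaligned flanks, which is why a global approach would miss many domain-level similarities.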
When a query is submitted via one of the BLAST Web pages, the sequence, plus any
other input information such as the database to be searched, word size, expect value, and
so on, are fed to the algorithm on the BLAST server. BLAST works by first making a
look-up table of all the "words" (short subsequences, three letters long by default for
proteins) and "neighboring words", i.e., similar words, in the query sequence. The
sequence database is then scanned for these "hot spots". When a match is identified, it is
used to initiate gap-free and gapped extensions of the "word".
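The word look-up and scanning steps described above can be sketched as follows. This is a deliberately simplified illustration, not the BLAST implementation: it indexes only exact query words (no "neighboring words" and no score threshold), and it stops at reporting the seed positions rather than extending them. The function names are hypothetical.

```python
from collections import defaultdict


def build_word_table(query, word_size=3):
    """Map every word of length word_size in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_seeds(table, subject, word_size=3):
    """Scan a database sequence and return (query_pos, subject_pos) word hits."""
    seeds = []
    for j in range(len(subject) - word_size + 1):
        word = subject[j:j + word_size]
        for i in table.get(word, ()):
            seeds.append((i, j))
    return seeds


query = "MKVLAAGIV"
subject = "TTKVLAAPP"
table = build_word_table(query)
print(find_seeds(table, subject))  # [(1, 2), (2, 3), (3, 4)]
```

Each reported pair is a "hot spot" from which an extension step would start. Building the table once per query is what makes scanning a large database fast.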
BLAST Scores and Statistics: Once BLAST has found a similar sequence to the query
in the database, it is helpful to have some idea of whether the alignment is "good" and
whether it portrays a possible biological relationship, or whether the similarity observed
is attributable to chance alone. BLAST uses statistical theory to produce a bit score and
expect value (E-value) for each alignment pair (query to hit).
The bit score gives an indication of how good the alignment is; the higher the score, the
better the alignment. In general terms, this score is calculated from a formula that takes
into account the alignment of similar or identical residues, as well as any gaps introduced
to align the sequences. A key element in this calculation is the "substitution matrix",
which assigns a score for aligning any possible pair of residues. The BLOSUM62 matrix
is the default for most BLAST programs, the exceptions being blastn and MegaBLAST
(programs that perform nucleotide-nucleotide comparisons and hence do not use
protein-specific matrices). Bit scores are normalized, which means that the bit scores from
different alignments can be compared, even if different scoring matrices have been used.
$$S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m\,n\,2^{-S'}$$
where S' is the normalized (bit) score, S is the raw score, λ and K are constants, and m and
n are the lengths of the query and hit sequences.
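As a worked illustration of the bit score and E-value formulas above, the following sketch evaluates them directly. The function names are hypothetical, and the λ and K values in the usage lines are illustrative assumptions only; the actual constants depend on the scoring matrix and gap penalties and are reported in BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, m, n):
    """E = m * n * 2**(-S'), with m and n the two sequence lengths."""
    return m * n * 2.0 ** (-bits)


# Illustrative constants (assumed, not taken from a real search).
bits = bit_score(raw_score=100, lam=0.267, k=0.041)
print(round(bits, 1), expect_value(bits, m=250, n=300))
```

Because E falls by half for every additional bit, modest gains in bit score correspond to large drops in the number of chance alignments expected.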
2. CMR
The Comprehensive Microbial Resource (CMR) gives access to a central
repository of the sequence and annotation of all complete public prokaryotic genomes as
well as comparative genomics tools across all of the genomes in the database.
4. DCODE.ORG
The dcode.org website provides access to tools for comparative genomic analyses
developed by the Comparative Genomics Center at the Lawrence Livermore National
Laboratory. Tools include: zPicture, Mulan, eShadow, rVista, CREME, and the ECR
Browser.
5. EnteriX
EnteriX is a collection of tools for viewing pairwise and multiple alignments for
bacterial genome sequences.
6. FootPrinter
FootPrinter is a program for phylogenetic footprinting that identifies regions of
DNA that are well conserved across a set of orthologous sequences in order to infer
phylogenetic relationships.
7. FootPrinter3
FootPrinter3 is a web server for predicting transcription factor binding sites
(TFBS) by using phylogenetic footprinting. FootPrinter3 extends the motif discovery
8. GenomeTraFaC
GenomeTraFaC is a comparative genomics based resource for initial
characterization of gene models and the identification of putative cis-regulatory regions
of RefSeq gene orthologs.
9. GENSTYLE
GENSTYLE is based on the genomic signature paradigm and allows the user to
classify and characterize nucleotide sequences using oligonucleotide frequencies.
13. Mauve
Mauve is a stand-alone software tool for constructing multiple genome
alignments.
14. MicroFootPrinter
MicroFootPrinter identifies the conserved motifs in regulatory regions of
prokaryotic genomes using the phylogenetic footprinting program FootPrinter.
15. MIPS
16. MLST
MLST (Multi Locus Sequence Typing) is a nucleotide sequence based approach
for the unambiguous characterisation of isolates of bacteria and other organisms using the
sequences of internal fragments of seven house-keeping genes.
17. NEMBASE2
NEMBASE2 is a database resource for EST datasets for 37 species of nematode.
Sequences are clustered to reduce redundancy. Comparisons can be by library and at a
sequence level; a visualisation tool is included. Coding region predictions for each
cluster, further annotations such as GO terms and physical properties are also included.
18. PartiGeneDB
PartiGeneDB is a database of about 300 partial genomes from eukaryotic
organisms that have been assembled from EST data.
19. PhenomicDB
PhenomicDB integrates the genotype and phenotype information of several
organisms from public data sources. The mapping of phenotypic data fields allows cross-
species phenotype comparison.
20. Phydbac2
Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and
explore the phylogenomic profiles of bacterial protein sequences. It also allows the user
to view sequence similarity across different organisms, access other genes with similar
conservation profiles, and view genes that are found nearby a selected gene in multiple
genomes.
21. Projector 2
Projector 2 allows users to map completed portions of the genome sequence of an
organism onto the finished (or unfinished) genome of a closely-related species or strain.
Using the related genome sequence as a template can facilitate sequence assembly and
the sequencing of the remaining gaps.
22. Sockeye
Sockeye is a visualization tool allowing one to assemble and analyze genomic
information in a three dimensional workspace. It can be used to view features at various
levels, ranging from SNPs to karyotypes. Sockeye displays genomic features along
tracks, and links to the Ensembl database.
23. SPRING
24. SVC
SVC (Structured Visualization of Evolutionary Conserved Sequences) is a tool
that can search for pairs of orthologous genes, align the protein coding sequences, and
visualize the evolutionary sequence conservation mapped back onto the gene structure
scaffold.
25. T-STAG
Tissue-Specific Transcripts And Genes (T-STAG) is a system integrating EST,
gene expression, alternative splicing and human-mouse orthology information for the
analysis of tissue-specific gene and transcript expression patterns.
27. TraFaC
TraFaC (Transcription Factor Binding Site Comparison) is a tool that identifies
regulatory regions using a comparative sequence analysis approach.
29. YOGY
Eukaryotic Orthology (YOGY) is a resource for retrieving orthologous proteins
from nine eukaryotic organisms. Using a gene or protein identifier as a query, this
database provides comprehensive, combined information on orthologs in other species
using data from five independent resources: KOGs, Inparanoid, Homologene,
OrthoMCL, and a table of curated orthologs between budding yeast and fission yeast.
Associated Gene Ontology (GO) terms of orthologs can also be retrieved.
Equally exciting is the potential for uncovering evolutionary relationships and patterns
between different forms of life. With the aid of nucleotide and protein sequences, it
should be possible to find the ancestral ties between different organisms. Thus far,
experience has taught us that closely related organisms have similar sequences and that
more distantly related organisms have more dissimilar sequences. Proteins that show
significant sequence conservation, indicating a clear evolutionary relationship, are said to
be from the same protein family. By studying protein folds (distinct protein building
blocks) and families, scientists are able to reconstruct the evolutionary relationship
between two species and to estimate the time of divergence between two organisms since
they last shared a common ancestor.
The Three Domains of Life: In the mid-1970s, while studying some unusual groups of
bacteria, thermophilic methanogens and halophiles, Carl Woese and colleagues concluded
that these organisms were not really bacteria but should be assigned to a separate domain
of life with the same status as bacteria and eukaryotes. This group was originally referred
to as archaebacteria and later renamed archaea. The uniqueness of the archaea was
apparent, even from some of their biochemical features, such as the unusual structure of
lipids and the topology of phylogenetic trees of 16S rRNA. These trees clearly indicated
that archaea comprised a unique branch of life, distinct from both bacteria and
eukaryotes. Furthermore, although archaea are phenotypically prokaryotes like
bacteria, i.e. they have small cells without nuclei or organelles, they are, in some
important respects, closer to eukaryotes than to bacteria. These eukaryote-like features of
archaea include the structure of the ribosomes, which have a number of proteins shared
with eukaryotes but not with bacteria, the presence of histones (in one of the two major
branches of archaea), the organization of the basal transcriptional apparatus, with several
transcription factors of the eukaryotic variety, and the organization of the DNA
replication apparatus, which is also conserved in archaea and eukaryotes but not in
bacteria.
Genetic Variation: Evolution is not always discrete with clearly defined boundaries that
pinpoint the origin of a new species, nor is it a steady continuum. Evolution requires
genetic variation which results from changes within a gene pool, the genetic make-up of a
specific population. A gene pool is the combination of all the alleles —alternative forms
of a genetic locus—for all traits that population may exhibit. Changes in a gene pool can
result from mutation—variation within a particular gene—or from changes in gene
frequency—the proportion of an allele in a given population.
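The idea of gene (allele) frequency as a proportion can be made concrete with a little arithmetic. The sketch below assumes a diploid population in which each individual carries two alleles at the locus; the function name and the example counts are hypothetical.

```python
def allele_frequencies(genotype_counts):
    """Return the proportion of each allele in a diploid population.

    genotype_counts maps a pair of alleles, e.g. ("A", "a"), to the number
    of individuals carrying that genotype.
    """
    allele_counts = {}
    for (first, second), individuals in genotype_counts.items():
        # Every individual contributes one copy of each allele in its genotype.
        allele_counts[first] = allele_counts.get(first, 0) + individuals
        allele_counts[second] = allele_counts.get(second, 0) + individuals
    total = sum(allele_counts.values())
    if total == 0:
        return {}
    return {allele: count / total for allele, count in allele_counts.items()}


# 30 AA, 50 Aa and 20 aa individuals: 200 allele copies in total.
print(allele_frequencies({("A", "A"): 30, ("A", "a"): 50, ("a", "a"): 20}))
# {'A': 0.55, 'a': 0.45}
```

A change in these proportions from one generation to the next is a change in the gene pool, even if no new mutation has occurred.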
Every time a cell divides, it must make a complete copy of its genome, a process called
DNA replication. DNA replication must be extremely accurate to avoid introducing
mutations, that is, changes in the nucleotide sequence of a short region of the genome.
Inevitably, some mutations do occur, usually in one of two ways: either from errors in
DNA replication or from damaging
Mutations in the coding regions of genes are much more important. Those mutations that
do have an evolutionary effect can be divided into two categories, loss-of-function
mutations and gain-of-function mutations. A loss-of-function mutation results in reduced
or abolished protein function. Gain-of-function mutations, which are much less common,
confer an abnormal activity on a protein.
Phylogenetic Trees: Systematics describes the pattern of relationships among taxa and is
intended to help us understand the history of all life. But history is not something we can
see—it has happened once and leaves only clues as to the actual events. Scientists use
these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the
most convenient way of visually presenting evolutionary relationships among a group of
organisms is through illustrations called phylogenetic trees.
1. Bioinformatics Toolkit
This Toolkit is a collection of a wide range of tools and links for sequence analysis,
function, and structure prediction. This resource offers convenient web interfaces for
many freely available tools.
2. CIPRes
The Cyberinfrastructure for Phylogenetic Research (CIPRes) project aims to develop a
computational infrastructure for systematics. Other goals of the project include providing
a central resource enabling computational systematics and education and training
initiatives. The website also contains a substantial list of links to related software.
4. ConSeq
ConSeq is a tool for predicting functionally and structurally important amino acid
residues in protein sequences. The predictions are based on the assumptions that residues
5. cpnDB
cpnDB is a curated collection of chaperonin sequence data collected from public
databases or generated by a network of collaborators exploiting the cpn60 target in
clinical, phylogenetic and microbial ecology studies. The database contains all available
sequences for both group I and group II chaperonins. cpnDB is built and maintained with
open source tools.
7. JEvTrace
JEvTrace is a tool that combines multiple sequence alignment, phylogenetic, and
structural data for identification of functional sites in proteins.
9. MEGA
MEGA (Molecular Evolutionary Genetics Analysis) is a software package for
phylogenetic analysis with a graphical user interface. It allows viewing and editing of the
aligned input sequence data and provides many tools for phylogenetic and statistical
analysis of the alignments.
10. Mesquite
Mesquite is an open source software project designed to deal with comparative data
about organisms and evolutionary analyses. Mesquite contains modules for phylogenetic
analysis, population genetics, and non-phylogenetic multivariate analysis.
12. MINER
15. NEWT
NEWT is the taxonomy database maintained by the UniProt group.
16. NJplot
NJplot is a tool for visualizing binary trees such as the phylogenetic trees output from the
PHYLIP programs. Available for several platforms including Windows, MacOS, Linux
and Solaris.
18. PAL2NAL
PAL2NAL converts a multiple sequence alignment of proteins and the
corresponding DNA (or mRNA) sequences into a codon alignment. Synonymous (Ks)
and non-synonymous (Ka) substitution rates can be calculated.
19. Phydbac2
Phydbac2 (Phylogenomic display of bacterial genes) is a tool to visualize and
explore the phylogenomic profiles of bacterial protein sequences. It also allows the user
to view sequence similarity across different organisms, access other genes with similar
conservation profiles, and view genes that are found nearby a selected gene in multiple
genomes.
20. PHYLIP
Comprehensive set of programs for phylogenetic analyses; available for PC and Mac;
source code available for easy compiling in UNIX.
21. PhyloBLAST
BLAST a protein sequence, then perform automated phylogenetic analysis on hits or on
uploaded sequences; PHYLIP-based analyses.
23. PHYML
PHYML is a program that constructs phylogenetic trees from sequence alignments using
the maximum likelihood method.
24. POWER
The Phylogenetic Web Repeater (POWER) allows users to perform phylogenetic
analysis using the PHYLIP package. The POWER pipeline can start either by performing
multiple sequence alignment (MSA) or directly from already aligned
sequences.
25. ProtTest
ProtTest is a program that determines the best-fit model of evolution, among a set
of candidate models, for a given protein sequence alignment.
26. Puzzleboot
Puzzleboot is a UNIX shell script facilitating bootstrap analysis using TREE-PUZZLE
and PHYLIP. It enhances TREE-PUZZLE by allowing one to analyse multiple datasets,
and can be used for both protein and DNA distance bootstrap analysis.
30. T-COFFEE
The T-COFFEE site includes links to a collection of tools for computing,
evaluating, and manipulating multiple alignments of protein sequences and structures. T-
COFFEE is a protein multiple sequence alignment tool that is more accurate than
ClustalW for sequences with less than 30% identity. Expresso (or 3DCoffee) aligns
sequences using structural information. PROTOGENE turns amino acid alignments into
CDS nucleotide alignments.
33. TREE-PUZZLE
TREE-PUZZLE is a program that constructs phylogenetic trees from sequence alignments
using the maximum likelihood method.
34. TreeDomViewer
TreeDomViewer is a tool for the visualization of phylogeny and protein domain
structure. TreeDomViewer constructs phylogenetic trees and projects the corresponding
protein domain information onto the multiple sequence alignment.
35. TreeJuxtaposer
TreeJuxtaposer is a free software tool that allows a visual comparison of two trees
in Newick format (phylogenies, taxonomies, gene trees, etc.). It can work with trees
having up to 500,000 nodes, and automatically calculates and marks the differences.
36. TreeView
Generates nice graphics of trees; reads multiple tree file formats; available for
download to Mac or PC.
37. TSEMA
The Server for Efficient Mapping Assessment (TSEMA) predicts possible
protein-protein interactions based on the comparison of phylogenetic trees derived from
sequences of associated protein families.
40. Weighbor
Weighbor is a tool for building phylogenetic trees from distance matrices. It
employs a weighted version of the neighbour-joining method in which longer distances in
the matrix are given less weight.
For the VRs, a variety of approaches may be applied in assigning coordinates to the
unknown. Recall that these regions will correspond most often to the loops on the surface
of the protein. If a loop in one of the known structures is a good model for that of the
unknown, then the main chain coordinates of that known structure can be copied. Side
chain coordinates of residues that are similar in length and character also may be copied.
Rotamer libraries can be used to define other side chain coordinates.
When a good model for a loop cannot be found among the known structures, one can
search fragment databases for loops in other proteins that may provide a suitable model
for the unknown. A residue range is chosen to include the undefined loop as well as a few
residues (e.g., three) on either side of the loop for which coordinates have been defined.
Fragments are examined for their ability to fit in the undefined region without making
bad contacts with other atoms and to overlap well with the residues on either side of the
loop. The loop may then be subjected to conformational searching to identify low energy
conformers if desired. Coordinates for side chain atoms in these loop regions may be
copied if residues are similar, though it is likely that considerable application of side
chain rotamer libraries will be required to define coordinates in these regions.
Databases of Structures from Homology Modeling: Databases are now available that
contain large numbers of protein structures that have been obtained by comparative
(homology) modeling. Two of these databases are ModBase and 3DCrunch.
ModBase was created by Sali and co-workers, using their program Modeller, which
creates models based on the satisfaction of spatial restraints. That is, restraints are
identified from the alignments of homologues of known structure, and these restraints are
then applied to the unknown sequence. Restraints can include distances between alpha
carbons, other distances within the main-chain, and main-chain and side-chain dihedral
angles. Routines to satisfy the restraints optimally include conjugate gradient
minimization and molecular dynamics with simulated annealing.
3DCrunch is a large scale modeling project that aims to submit all entries from protein
sequence databases to SWISS-MODEL. Currently the database contains 64,000 entries.
Automated Web-Based Homology Modeling: Web-based tools are now available to
generate models of protein 3-dimensional structures using comparative modeling
techniques.
SWISS-MODEL is available through Glaxo Wellcome Experimental Research in
Geneva, Switzerland.
WHAT IF, available on EMBL servers, includes three components, one to generate the
homology models, one to evaluate the quality of the homology models, and one to
6. FlexiDock (Tripos): simple, flexible docking of ligands into binding sites on proteins;
fast genetic algorithm for generation of configurations; rigid, partially flexible, or fully
flexible receptor side chains provide optimal control of ligand binding characteristics;
conformationally flexible ligands; tunable energy evaluation function with special H-bond
treatment; very fast run times
8. MIMUMBA: torsion angle database used for the creation of conformers; interaction
geometry database used to exactly describe intermolecular interaction patterns;
Boehm function (with minor adaptations necessary for docking) applied for scoring
10. GOLD (CCDC): calculating docking modes of small molecules into protein binding
sites; genetic algorithm for protein-ligand docking; full ligand and partial protein
flexibility; energy functions partly based on conformational and non-bonded
contact information from the CSD; choice of scoring functions: GoldScore, ChemScore
and user-defined score; virtual library screening
13. SITUS (Scripps Research Institute): program package for modeling of atomic
resolution structures into low-resolution density maps; software supports both rigid-body
and flexible docking using a variety of fitting strategies
14. SenSitus: interactive docking and visualization program for low-resolution density
maps and atomic structures; GUI-based alternative to certain Situs docking programs that
can benefit from an interactive user interface and 3D visualization methods
16. DOCK (UCSF Molecular Design Institute): generates many possible orientations (and
more recently, conformations) of a putative ligand within a user-selected region of a
receptor structure; orientations may be scored using several schemes designed to measure
steric and/or chemical complementarity of the receptor-ligand complex; evaluate likely
orientations of a single ligand, or to rank molecules from a database; search databases for
18. ICM-Dock (MolSoft LLC): fast and accurate docking simulations; unique set of tools
for accurate individual ligand-protein docking, peptide-protein docking, and
protein-protein docking, including interactive graphics tools
20. Bielefeld Protein Docking (Bielefeld University): detects geometrical and chemical
complementarities between surfaces of proteins and estimates docking positions
23. DOT (San Diego Supercomputer Center): Daughter Of TURNIP; TURNIP: program
developed by V. Roberts at The Scripps Research Institute for use in the study of
24. ESCHER NG (Milan University): enhanced version of the original ESCHER
protein-protein automatic docking system developed in 1997 by G. Ausiello, G. Cesareni and M.
Helmer Citterich; new release, with a reengineered code, includes some new features:
protein-protein and DNA-protein docking capability; fast surface calculation based on the
NSC algorithm
26. HEX (University of Aberdeen): protein docking and molecular superposition program;
uses spherical polar Fourier correlations to accelerate docking calculations
Some of the differences in the way different Web browsers display the same Web page
come from different design decisions ("what font should be used for <H1> text?") and
some come from the fact that different Web clients have different capabilities.
Some of these differences, such as the ability to display various kinds of still or moving
images as part of the Web page or to run programs written in Java, Active X, or
Javascript, represent extensions to HTML. These extra capabilities may be built into the
browser or may be added by "plugins", software extensions which give the browser new
functionality. Finally, the behavior of a Web browser can frequently be changed by
configuring its preferences; if you find the default font too small, its size can often be
increased.
Many new computer users assume that the Web and the Internet are synonymous.
However, many protocols other than HTTP flow over the Internet. In part, the new user is
confused by the fact that, in addition to supporting extensions to HTML, many popular
web browsers have support for other protocols such as email (SMTP, POP, IMAP),
newsgroups, ftp, and gopher for example. What this really means is that the particular
piece of software (e.g. Netscape Communicator) is more than just a Web client; it is also
an email client, an FTP client and a Gopher client. Finally, HTTP does not have to be
transmitted over the Internet, and HTML doesn't have to be transmitted via HTTP. Web
technology has become a common interface tool for communication between computers
on a local network (sometimes called an Intranet), and every Web client I have worked
with has the ability to read and display local HTML files.
Because virtually every Web client is also a limited FTP client, many people choose to
use them. In the case where a Web page contains a link to an FTP server, simply
selecting the link downloads the file. If, however, you are given the following
instructions to retrieve a file:
Telnet: Telnet is one of the oldest of the network services and perhaps the easiest to
understand. Telnet allows one computer to "log on" to another computer as if it were a
terminal. Once logged on, you frequently will have all the privileges of a local user; you
can run programs, create and delete files. This is probably the most common way that
users with accounts will use a computer.
Although "full service logins" as described above are perhaps the most common use of
the telnet protocol, in fact as much control as the host's system administrator desires may
be imposed on a telnet connection. Thus, a telnet service may be advertised with a public
login name and password. Login with this name, however, is likely to be restricted to a
limited number of commands. The National Institutes of Health in the United States used,
at one point, such a telnet login to disseminate information as to the membership of study
sections. Such specialized telnet services have become much less common since the rise
in popularity of the Web.
A telnet session can negotiate a range of different protocols, but this almost always
includes ASCII text. Because many protocols for other services (e.g. SMTP, HTTP) are
encoded as ASCII text, a telnet client can sometimes be used to connect to a server for
these other protocols. Most people will use a telnet client the first time they connect to a
MOO, and some people will continue to use telnet as their client, although most of us
find dedicated clients to be significantly more convenient. Similarly, it is possible to
connect to a Web server with a telnet client if you understand the syntax of HTTP. This is
almost never done to use a Web server, but is occasionally done when debugging.
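To show what "understanding the syntax of HTTP" amounts to, here is a minimal sketch that sends by hand the same ASCII text one would type into a telnet session connected to a Web server. It uses a plain TCP socket rather than a telnet client; the function names are hypothetical, and the host in the usage comment is only a placeholder.

```python
import socket


def build_http_request(host, path="/"):
    """Return the ASCII text of a minimal HTTP/1.0 GET request."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    # Header lines end with CR LF; an empty line marks the end of the request.
    return "\r\n".join(lines).encode("ascii")


def fetch_raw(host, path="/", port=80, timeout=10.0):
    """Send the request over a plain TCP connection and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_http_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)


print(build_http_request("example.org").decode("ascii"))
# To try it against a real server (requires network access):
# print(fetch_raw("example.org")[:200])
```

The reply begins with a status line and headers in readable text, which is what makes this approach useful for debugging.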
From a practical point of view, every telnet host will be different, and thus you will need
to learn about each one as you have occasion to use it.
ftp: Telnet is useful for interactive computer access, but is much less useful for
transferring files. Ftp is an older service designed specifically for file transfer. Originally
it, like telnet, was intended for account owners. However, as it became apparent that it
was useful to make files available to the world at large without giving everyone who wanted
the files an account, a variant called "anonymous ftp" developed. In this variant, logging in
with a "magic" user name (most commonly "anonymous" or "ftp") eliminates the
requirement for a personal account and password; by convention, you supply your email
address when asked for a password.
Once logged on via ftp, access to the host filesystem is accomplished by a series of
commands. On a Unix ftp client, the commands are Unix-like: cd to Change Directory and
ls to LiSt the files in that directory. To transfer files, you use either get to fetch a file from the
host computer or put to place a file onto it (where allowed). These commands do not depend on
the host computer running Unix! They are ftp commands, some of which happen to be
similar to Unix commands. A client may choose to hide these commands; a client with a
graphical user interface (GUI), for example, might offer buttons instead of typed
commands.
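A scripted client is another way of hiding the typed commands. The following is a minimal sketch using Python's standard ftplib module, assuming the host permits anonymous login. The function name list_and_fetch and its arguments are hypothetical; the host, directory, and file names must be supplied by the caller.

```python
from ftplib import FTP


def list_and_fetch(host, directory, filename):
    """Log in anonymously, list a directory, and download one file."""
    with FTP(host) as ftp:
        ftp.login()            # no arguments means an anonymous login
        ftp.cwd(directory)     # same role as "cd" in a command-line client
        names = ftp.nlst()     # same role as "ls"
        with open(filename, "wb") as out:
            # same role as "get"; RETR is the protocol command behind it
            ftp.retrbinary("RETR " + filename, out.write)
    return names
```

The function returns the directory listing and leaves the downloaded file in the current local directory.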
Email: Both ftp and telnet are interactive, more or less real time programs. Sometimes it
is useful, however, to communicate with another computer, or more commonly, a user on
another computer, by leaving them a message which they can read and respond to at their
convenience. This is done over the Internet by using email.
Email is a generic term for a variety of processes which can use different protocols and
network technologies, and which in many cases use a more complex client/server model.
At present, most email is transmitted by SMTP (Simple Mail Transfer Protocol) via
TCP/IP over the Internet. SMTP transmits email on port 25 between two dedicated, full
time servers. Although the assumption is that both SMTP servers will be generally
available, should the receiving server not be reachable when the transmitting server needs
to send email, the email message will be held and the transmission will be retried several
times over a period of days until a successful transmission occurs or until the maximum
retry time has been exceeded, at which point an error message will be returned to the
sender.
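The hold-and-retry behavior described above can be sketched as a small simulation. This is not how any particular mail server is implemented; the retry interval and maximum age are illustrative values (real servers take them from their configuration), and the two "receiving server" functions are hypothetical stand-ins.

```python
def deliver_with_retry(send, retry_interval_hours, max_age_hours):
    """Simulate a sending server's handling of one queued message.

    send is a callable that takes the elapsed hours and returns True
    when the receiving server accepted the message.
    Returns ("delivered", hours) or ("bounced", hours).
    """
    elapsed = 0
    while elapsed <= max_age_hours:
        if send(elapsed):
            return ("delivered", elapsed)
        elapsed += retry_interval_hours  # hold the message and try again later
    # Maximum retry time exceeded: an error goes back to the sender.
    return ("bounced", elapsed)


def back_after_30_hours(elapsed):
    """Hypothetical receiving server that is unreachable for 30 hours."""
    return elapsed >= 30


def never_reachable(elapsed):
    """Hypothetical receiving server that never comes back."""
    return False


print(deliver_with_retry(back_after_30_hours, 4, 120))
print(deliver_with_retry(never_reachable, 4, 120))
```

With a 4 hour interval, the first message is delivered on the attempt at hour 32, and the second is bounced once the 120 hour limit has passed.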
The SMTP programs discussed above are typically symmetrical (i.e. the same program can
serve as either client or server), and are complex. Typically, you will not interact
with these programs directly. Rather, dedicated client software is used to compose, send,
receive, and read email, and it is that software which communicates with the SMTP
server. If you send and receive email via a computer that is always on and always
connected to a network reachable by your mail server (e.g. a Unix workstation), then
incoming mail is saved to a mail spool file on your computer, from which your client
software retrieves it, and outgoing email is passed to the SMTP server. Examples of
client software running on Unix workstations are mail, mailx, mush, elm, mutt, and pine.
Also, as discussed below, web browsers can sometimes be used as email clients.
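The division of labor between client software and the SMTP server can be illustrated with Python's standard email and smtplib modules. This is a minimal sketch: the addresses are placeholders, and the default server name "localhost" is an assumption that holds only where an SMTP server runs on the same machine.

```python
import smtplib
from email.message import EmailMessage


def compose(sender, recipient, subject, body):
    """Build a message, as a mail client does when you write an email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(msg, smtp_host="localhost", port=25):
    """Hand the finished message to an SMTP server for onward delivery."""
    with smtplib.SMTP(smtp_host, port) as server:
        server.send_message(msg)


message = compose("alice@example.org", "bob@example.org",
                  "Meeting", "See you at ten.")
```

Only send() touches the network; everything after that point, including any retries, is the SMTP server's responsibility.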
If you send and receive email via a computer that is not always on and/or not always
connected to the network, sending email proceeds as above, but receiving email is
different in that the SMTP server cannot necessarily get incoming email onto your
computer's file system. In that case, a different protocol is used, most commonly POP3.
(IMAP is a newer protocol for accomplishing the same task about which you may hear
more in the future.) The SMTP server stores your email on a remote host and your local
client retrieves it from a POP3 server when you check for mail. Typically, a POP3
account will be provided by whoever provides your Internet access. Thus, to install an
2. cd - change directory
cd is used to change from one directory to another.
cd dir1
changes directory so that dir1 is your new current directory. dir1 may be either the full
pathname of the directory, or its pathname relative to the current directory.
cd
changes directory to your home directory.
cd ..
moves to the parent directory of your current directory.
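The same three moves can be made from inside a program. The following Python sketch uses a temporary directory so that it leaves nothing behind; the directory name dir1 follows the example above, and the other variable names are illustrative.

```python
import os
import tempfile

start = os.getcwd()
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.realpath(tmp)
    os.mkdir(os.path.join(base, "dir1"))
    try:
        os.chdir(base)           # full pathname, like "cd /full/path"
        os.chdir("dir1")         # relative pathname, like "cd dir1"
        in_dir1 = os.getcwd()
        os.chdir("..")           # parent directory, like "cd .."
        in_parent = os.getcwd()
    finally:
        os.chdir(start)          # return to where we began

# "cd" with no argument corresponds to changing to this directory.
home = os.path.expanduser("~")
```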
4. cp - copy a file
The command cp is used to make copies of files and directories.
cp file1 file2
copies the contents of the file file1 into a new file called file2. cp cannot copy a file
onto itself.
cp file3 file4 dir1
creates copies of file3 and file4 (with the same names), within the directory dir1. dir1
must already exist for the copying to succeed.
cp -r dir2 dir3
recursively copies the directory dir2, together with its contents and subdirectories, to
the directory dir3. If dir3 does not already exist, it is created by cp, and the contents
and subdirectories of dir2 are recreated within it. If dir3 does exist, a subdirectory
called dir2 is created within it, containing a copy of all the contents of the original dir2.
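The two outcomes of cp -r can be reproduced in a short Python sketch. The function cp_r below is a hypothetical helper that mimics the behavior described above using the standard shutil module; it is not how cp itself is implemented.

```python
import os
import shutil
import tempfile


def cp_r(src, dst):
    """Copy directory src to dst, following the two cases described for cp -r."""
    if os.path.isdir(dst):
        # dst already exists: the copy becomes a subdirectory named after src
        dst = os.path.join(dst, os.path.basename(os.path.normpath(src)))
    shutil.copytree(src, dst)
    return dst


with tempfile.TemporaryDirectory() as work:
    dir2 = os.path.join(work, "dir2")
    dir3 = os.path.join(work, "dir3")
    os.makedirs(os.path.join(dir2, "sub"))
    with open(os.path.join(dir2, "sub", "notes.txt"), "w") as fh:
        fh.write("example\n")

    first = cp_r(dir2, dir3)     # dir3 does not exist yet, so it is created
    second = cp_r(dir2, dir3)    # dir3 now exists, so the copy is dir3/dir2
    first_ok = os.path.isfile(os.path.join(dir3, "sub", "notes.txt"))
    second_ok = os.path.isfile(os.path.join(dir3, "dir2", "sub", "notes.txt"))
```

Running the same copy twice is what exposes the difference: the first call creates dir3, the second nests a dir2 inside it.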
5. date - display the current date and time
date returns information on the current date and time in the format shown below:
Tue Mar 25 15:21:16 GMT 1997
It is possible to alter the format of the output from date. For example, using the command
line
date '+The date is %d/%m/%y, and the time is %H:%M:%S.'
at exactly 3.10pm on 14th December 1997, would produce the output
The date is 14/12/97, and the time is 15:10:00.
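The % codes used here are shared with many programming languages. As a sketch, Python's datetime.strftime accepts the same format string; the fixed moment below is the one from the example above, and the variable names are illustrative.

```python
from datetime import datetime

# 3.10pm on 14th December 1997, as in the example above
moment = datetime(1997, 12, 14, 15, 10, 0)
text = moment.strftime("The date is %d/%m/%y, and the time is %H:%M:%S.")
print(text)  # The date is 14/12/97, and the time is 15:10:00.
```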
13. kill - kill a process
Killing a process with kill requires the process ID (PID).
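The role of the PID can be seen in a short Python sketch, assuming a Unix-like system. It starts a throwaway child process, reads its PID, and sends the termination signal, which is also the signal kill sends by default. The variable names are illustrative.

```python
import os
import signal
import subprocess
import sys

# Start a child process that would otherwise sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
pid = child.pid                       # the process ID that kill needs

os.kill(pid, signal.SIGTERM)          # equivalent in effect to "kill <pid>"
exit_code = child.wait(timeout=10)    # negative value: ended by that signal
```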
1. Bioconductor
Bioconductor is an open source and open development software project that aims to
provide access to a wide range of powerful statistical and graphical methods for the
analysis of genomic data.
2. BioDAS
This site is the center of development of an Open Source system for exchanging
annotations on genomic sequence data.
3. BioJava
The BioJava Project is an open-source project dedicated to providing Java tools for
processing biological data.
4. BioMoby
BioMOBY is an international research project involving biological data hosts,
biological data service providers, and coders whose aim is to explore various
methodologies for biological data representation, distribution, and discovery.
5. BioPax
The BioPAX web site provides information about a collaborative effort to create a data
exchange format for biological pathways.
6. BioPerl
The BioPerl Project is an international association of developers of open source Perl
tools for bioinformatics, genomics and life science research.
7. BioPerl course
Great tutorial for those interested in the BioPerl group of modules.
8. BioPHP
Open Source PHP code for bioinformatics. Includes functions and minitools (copy and
paste one page scripts) for basic tasks in bioinformatics. A wiki-like service allows
modification and improvement of code.
9. BioPipe
BioPipe is a workflow framework that seeks to address some of the complexity
involved in carrying out large-scale bioinformatics analyses. It has been designed to work
closely with the BioPerl package.
11. BioRuby
The BioRuby project aims to implement an integrated environment for bioinformatics
with Ruby.
12. CCT
CCT (Current Comparative Table) is a software package that you can install and set up
on your own system to help you maintain and search databases.
17. PyMOL
PyMOL is a molecular graphics system with an embedded Python interpreter designed
for real-time visualization and rapid generation of high-quality molecular graphics
images and animations.
18. R
System for statistical computation and graphics; an interpreted computer language which
allows branching and looping as well as modular programming using functions.