
What is a Computer?

A computer is a
programmable machine.

A MACHINE is a device that uses energy to
perform some activity, and
a DEVICE is a piece of equipment
made for a particular purpose, especially a
mechanical or electrical one.

[Diagram labels: Programmable Machine, Input/Output, DATA, Electronic Device]

So a COMPUTER is an
electronic device that is used to
process, store and retrieve DATA

DATA

Here I want to emphasize that
DATA is central to COMPUTERS
Where are these
Computers applied?
Information technology (IT) is "the study, design, development, implementation, support or
management of computer-based information systems".

Information technology is a general term that describes any technology that helps to produce,
manipulate, store, communicate, and/or disseminate information.

Information, in its most restricted technical sense, is an ordered sequence of symbols.
As a concept, however, information has many meanings.
Moreover, the concept of information is closely related to notions of constraint, communication, control,
form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation.

INFORMATION = Organized DATA

I want to emphasize that INFORMATION is nothing but meaningful, ordered DATA

COMPUTERS are applied in the field of INFORMATION TECHNOLOGY
What is Bioinformatics?
Biology is a natural science
concerned with the study of life
and living organisms,
including their structure,
function, growth, origin,
evolution, distribution, and
taxonomy.

Biology is a vast subject
containing many subdivisions,
topics, and disciplines.

A biologist is a scientist
devoted to and producing
results in biology through the
study of life.
Rough Overview

DATABASES of early biologists
Cabinets of curiosities, such as that of Ole Worm, were centers of biological
knowledge in the early modern period, bringing organisms from across the world
together in one place. Before the Age of Exploration, naturalists had little idea of
the sheer scale of biological diversity.
In the course of his travels, Alexander von Humboldt mapped the distribution of plants
across landscapes and recorded a variety of physical conditions such as pressure and
temperature.
Why is there Bioinformatics?

Huge datasets

 Lots of new sequences being added
- Automated sequencers
- Genome Projects
- EST sequencing, microarray studies, proteomics

Nature of Data

Most forms of raw data make visual inspection ineffective

Patterns in datasets that can be analyzed
using computers
The Human Genome Project (HGP)
was an international scientific research project
with a primary goal to determine the sequence of
chemical base pairs which make up DNA
and to identify and map the approximately
20,000–25,000 genes
of the human genome from
both a physical and functional standpoint.

Human Genome Project
has been called a Mega Project because of the following factors:
1. The human genome has approx. 3.3 billion base-pairs; if the cost of sequencing is US $3
per base-pair, then the approx. cost will be US $10 billion.

2. If the sequence obtained were to be stored in a typed form in books and if each page
contains 1000 letters and each book contains 1000 pages, then 3300 such books would be
needed to store the complete information.

However, expressed in computer storage units: (3.3 billion base pairs) x (2 bits per base pair) =
6.6 billion bits = 825 megabytes of raw data, which is about the size of one music CD. If further
compressed, this data can be expected to fit in less than 20 megabytes.
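The figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: it uses the slide's own numbers as inputs, and the function names are made up for this example.

```python
# Back-of-the-envelope figures for the Human Genome Project.
# All inputs are the approximate numbers quoted on the slide.
BASE_PAIRS = 3_300_000_000   # approximate size of the human genome
COST_PER_BASE_USD = 3        # assumed sequencing cost per base pair
LETTERS_PER_PAGE = 1000
PAGES_PER_BOOK = 1000
BITS_PER_BASE = 2            # four bases (A, C, G, T) fit in 2 bits


def sequencing_cost_usd(base_pairs: int = BASE_PAIRS) -> int:
    """Total cost if every base pair costs COST_PER_BASE_USD."""
    return base_pairs * COST_PER_BASE_USD


def books_needed(base_pairs: int = BASE_PAIRS) -> int:
    """Number of 1000-page books, rounded up to whole books."""
    letters_per_book = LETTERS_PER_PAGE * PAGES_PER_BOOK
    return -(-base_pairs // letters_per_book)  # ceiling division


def raw_megabytes(base_pairs: int = BASE_PAIRS) -> float:
    """Raw storage size at 2 bits per base, in megabytes (10**6 bytes)."""
    bits = base_pairs * BITS_PER_BASE
    return bits / 8 / 1_000_000


if __name__ == "__main__":
    print(sequencing_cost_usd())  # 9900000000, i.e. roughly US $10 billion
    print(books_needed())         # 3300
    print(raw_megabytes())        # 825.0
```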
The first printout of the human genome to be presented as a series of
books, displayed at the Wellcome Collection, London
Think – Pair – Share!

The Biologist in the Age of Information
The job of the biologist is changing...
Bioinformatics
- is the combination of biology and information technology.

Major Bioinformatics Tasks

- Data organization and curation
- Data analysis
- Software development
Bioinformatics is the combination of biology and information technology. The
discipline encompasses any computational tools and methods used to manage,
analyze and manipulate large sets of biological data. Essentially, bioinformatics has
three components:

• The creation of databases
allowing the storage and
management of large biological
data sets.
• The development of algorithms
and statistics to determine
relationships among members of
large data sets.
• The use of these tools for the
analysis and interpretation of
various types of biological data,
including DNA, RNA and
protein sequences, protein
structures, gene expression
profiles, and biochemical
pathways.
The term bioinformatics first came into use
in the 1990s and was originally synonymous
with the management and analysis of DNA,
RNA and protein sequence data

Bioinformatics is largely, although not
exclusively, a computer-based discipline.
Computers are a must in bioinformatics for
two reasons:

First, many bioinformatics problems
require the same task to be repeated
millions of times.

Second, computers are required for their
problem-solving power.
Genetics related applications

There are three types of computational problems in genetics

Analysis of a single sequence to assess similarity with known genes.

Identification of typical features such as binding sites, or derivation of
evolutionary relationships through phylogenetic trees.

Complete genome analysis to identify members of gene families,
determination of the chromosomal location of the gene, etc.

Sequence Comparison
Linkage Analysis
Phylogenetic Analysis
Genomics
Microarrays
Sequence assembly
Genome annotation
Proteomics
Pharmacogenomics
Drug Discovery and computer aided drug design
Systems Biology
Implications for Biomedicine... and Bioinformatics

• Physicians will use genetic information to
diagnose and treat disease.

– Virtually all medical conditions (other than trauma)
have a genetic component
– Individualize drugs – reduce side effects
– Single Nucleotide Polymorphisms (SNPs)

• Faster drug development research
– More targets
– Faster clinical trials (selected trial populations)

• Most Biologists will analyze gene sequence
information in their daily work
Bioinformatics will help with....... DNA Sequencing

- Automated sequencers > 40,000 bp per day

- 500 bp reads must be assembled into complete
Sequences

- Detecting errors especially insertions and deletions

- Data flow management
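To make the assembly step concrete, here is a toy sketch of joining two reads by their overlapping ends. It is a simplified illustration, not how any particular assembler works: real reads contain errors, and the function names and the minimum overlap of 3 are assumptions chosen for this example.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on their exact overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    # The last four bases of the first read match the first four of the second.
    print(merge_reads("ATGCGTAC", "GTACCTGA"))  # ATGCGTACCTGA
```

Exact matching is used only to keep the sketch short. Handling sequencing errors, including the insertions and deletions mentioned above, requires approximate matching.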
Bioinformatics will help with.......

Similarity Searching Sequence Databases

- What is similar to my sequence?

- Searching gets harder as the databases
get bigger - and quality changes

- Tools: BLAST and FASTA = time saving
heuristics (approximate methods)

- Statistics + informed judgement of the
biologist
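The idea of a time-saving heuristic can be illustrated with a toy ranking by shared short words (k-mers). This is not the BLAST or FASTA algorithm, only a rough sketch of the general idea of comparing short words before doing anything expensive. The function names, the word size k=3 and the database entries are all made up for this example.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(query: str, subject: str, k: int = 3) -> float:
    """Fraction of the query's k-mers that also occur in the subject."""
    query_words = kmers(query.upper(), k)
    if not query_words:
        return 0.0
    return len(query_words & kmers(subject.upper(), k)) / len(query_words)


def rank_database(query: str, database: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Score every database entry against the query, best score first."""
    scores = [(name, shared_kmer_fraction(query, seq, k)) for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    demo_db = {"same": "ATGCGT", "half": "ATGCAA", "none": "TTTTTT"}  # hypothetical entries
    for name, score in rank_database("ATGCGT", demo_db):
        print(name, score)
```

A score like this only suggests candidates. As the slide notes, statistics and the informed judgement of the biologist are still needed to decide whether a hit is meaningful.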
Bioinformatics will help with.......

Structure-Function Relationships

Can we predict the function of protein
molecules from their sequence?

sequence > structure > function

Prediction of some simple 3-D structures (α-
helix, β-sheet, membrane spanning, etc.)
Bioinformatics will help with.......

Phylogenetics

Can we define evolutionary
relationships between organisms
by comparing DNA sequences

- What is the molecular clock?
- Lots of methods and software, what is
the "correct" analysis?
Yet another approach
Bioinformatics Applications
• Sequence data processing
– Base calling
– Quality determination
– Trace viewing
– Vector masking
– Repeat masking
– Assembly
• Sequence characterization
– Nucleotide composition
– Codon usage
– Gene finding
– Annotation
• Alignment
– Pairwise sequence alignment and database searching
– Genome alignment
– Multiple sequence alignment
• Image analysis and processing
• Phylogenetic analysis
• Clustering
• Group determinations
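Two of the sequence characterization tasks listed above, nucleotide composition and codon usage, are simple enough to sketch in a few lines. The function names below are assumptions made for this illustration, and the codon count simply reads the sequence in frame from its first base.

```python
from collections import Counter


def nucleotide_composition(seq: str) -> Counter:
    """Count how often each symbol occurs in the sequence."""
    return Counter(seq.upper())


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(seq: str) -> Counter:
    """Count codons in frame from the first base, ignoring leftover bases."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


if __name__ == "__main__":
    demo = "ATGATGTAA"  # made-up sequence
    print(nucleotide_composition(demo))
    print(gc_content(demo))
    print(codon_usage(demo))
```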
Molecular biology
is the study of biology at a molecular level.
This field overlaps with
other areas of biology and chemistry,
particularly genetics and biochemistry.
Molecular biology chiefly concerns itself
with understanding the interactions between the
various systems of a cell,
including the interactions between DNA, RNA
and protein biosynthesis
as well as learning
how these interactions are regulated.
Central Dogma of molecular biology
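The flow DNA -> RNA -> protein can be sketched in code. The example below is a simplified illustration: it assumes the input is the coding strand, uses the standard genetic code, and stops at the first stop codon. The function names and the demo sequence are made up for this example.

```python
# Standard genetic code, listed with the bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """DNA (coding strand) to mRNA: every T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """mRNA to protein (one-letter codes), stopping at the first stop codon.

    Raises KeyError if a codon contains symbols other than A, C, G, U or T.
    """
    dna = mrna.upper().replace("U", "T")
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGGCTAAATAA")  # made-up coding sequence
    print(mrna)             # AUGGCUAAAUAA
    print(translate(mrna))  # MAK
```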
Biological databases are libraries of life sciences information,
collected from scientific experiments, published literature, high-
throughput experiment technology, and computational analyses.
They contain information from research areas including genomics,
proteomics, metabolomics, microarray gene expression, and
phylogenetics.

Information contained in biological databases includes gene function,
structure, localization (both cellular and chromosomal), clinical effects
of mutations as well as similarities of biological sequences and
structures.

Biological databases are an important tool in assisting scientists
to understand and explain a host of biological phenomena from
the structure of biomolecules and their interaction, to the whole
metabolism of organisms and to understanding the evolution of
species. This knowledge helps facilitate the fight against diseases,
assists in the development of medications and in discovering basic
relationships amongst species in the history of life.
What is a Database?
Databases — Definition
A database is a set of data that has a
regular structure and that is organized in such a way that
a computer can easily find the desired information.

An organized body of related information.
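As a small illustration of "regular structure" and "easily find", the sketch below stores a few sequence records in an in-memory SQLite database and looks one up by its identifier. The table layout, the function names and the accession values are hypothetical and do not correspond to any real biological database.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_demo_database(records: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Create an in-memory database with one regularly structured table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


def find_by_accession(conn: sqlite3.Connection, accession: str) -> Optional[Tuple[str, str]]:
    """Return (organism, sequence) for the accession, or None if it is absent."""
    return conn.execute(
        "SELECT organism, sequence FROM sequences WHERE accession = ?",
        (accession,),
    ).fetchone()


if __name__ == "__main__":
    conn = build_demo_database([
        ("DEMO0001", "Homo sapiens", "ATGC"),      # made-up records
        ("DEMO0002", "Mus musculus", "GGCCTA"),
    ])
    print(find_by_accession(conn, "DEMO0002"))
```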

Major Databases in bioinformatics

Protein Databases
Primary databases, ex – SWISS-PROT, PIR
Secondary/Composite Databases, ex – OWL, NRDB
Structural Databases
PDB, CATH, SCOP
Nucleotide and Genome Sequences
GenBank, DDBJ, EMBL, SGD, EBI, COG
(GenBank at NCBI is in collaboration with DDBJ, EMBL)
Gene Expression Data
Other Databases, ex – GeneCards, KEGG
http://www.ncbi.nlm.nih.gov/
National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI)
is part of the United States National Library of Medicine (NLM).

The NCBI houses genome sequencing data in GenBank and
an index of biomedical, biotechnology research articles in PubMed.

All the databases are available online through the Entrez search engine.

The NCBI is directed by David Lipman, one of the original authors of the
BLAST sequence alignment program and a widely respected figure in Bioinformatics.

Since 1992, NCBI has grown to provide other databases in addition to GenBank.

NCBI provides

Online Mendelian Inheritance in Man,
the Molecular Modeling Database (3D protein structures),
dbSNP a database of single-nucleotide polymorphisms,
the Unique Human Gene Sequence Collection,
a Gene Map of the human genome,
a Taxonomy Browser,
and coordinates with the
National Cancer Institute to provide the Cancer Genome Anatomy Project.
The NCBI assigns a unique identifier (Taxonomy ID number) to each species of organism.
GenBank

The NCBI has had responsibility for making available the GenBank
DNA sequence database since 1992.

GenBank coordinates with individual laboratories and other sequence databases such as
those of the European Molecular Biology Laboratory (EMBL)
and the DNA Data Bank of Japan (DDBJ)

The NCBI has many software tools that are available by WWW browsing or by FTP.
For example, BLAST is a sequence similarity searching program.

BLAST can do sequence comparisons against the GenBank DNA database within 15 seconds.
A Model Organism
A Science Primer
Databases and Tools
http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
Different Data Centers, the last one is BioMed DataCenter
Bioinformatics
is the application of
statistics and computer science
to the field of molecular biology.

Informatics (academic field)
is a broad academic field encompassing
information science,
information technology,
algorithms, and social science.

Informatics is nothing but the study of information

Bioinformatics is nothing but the study of biological information

Bioinformatics = TECHNOLOGY + Biological DATA

Bioinformatics is the IT field of a Biologist

So we need to PROGRAM bioinformatics,
as we are applying statistics and computers to the field of molecular biology

By now we learned that computer is a programmable machine and is applied to
bioinformatics
What is a Program?
Difference between software (s/w) and hardware (h/w)

A computer program
is a sequence of instructions
written to perform
a specified task for a computer.

Software
is the collection of computer programs
and related data
that provide the instructions
telling a computer what to do.

Hardware
means the physical device.
In contrast to hardware,
software is intangible,
meaning it "cannot be touched".

This holds for a program of any size, big or small:

PROGRAM = LOGIC (algorithm) + DATA

Computer technology is applied to bioinformatics by building
bioinformatics software and hardware centred on biological DATA.
At least we do not need to build these software packages ourselves,
because many of them are bundled with a
readily available

Open-source

GNU GPL LINUX Operating System
Topic-Bio-Linux6

Bhargavi Saragadam
MSC Human Genetics
ANDHRA UNIVERSITY

Suresh Saragadam
BE cse
BHARATHIDASAN UNIVERSITY

23/Aug/2010
Bio-linux
Bio-Linux 6.0 Overview

Bio-Linux 6.0 is a fully featured,
powerful, configurable and easy to maintain bioinformatics
workstation.
Bio-Linux provides more than 500 bioinformatics programs
on an Ubuntu Linux 10.04 base.

Before we jump into Bio-Linux, let us learn the terminology

Open-Source
FSF
GNU
GPL
FOSS

All of these relate to the term Software Freedom

And let us understand

the philosophy of free software.
Open source is a development method for software that
harnesses the power of distributed peer review and transparency of process.
The promise of open source is better quality, higher reliability, more flexibility, lower cost, and
an end to predatory vendor lock-in.

The Open Source Initiative (OSI) is a non-profit corporation formed
to educate about and advocate for the benefits of open source and to build bridges
among different constituencies in the open-source community.

OSI was jointly founded by Eric Raymond and Bruce Perens in late February 1998,
with Raymond as its first president
The Open Source Definition
Introduction

Open source doesn't just mean access to the source code.
The distribution terms of open-source software must comply with the following criteria:

1. Free Redistribution

2. Source Code

3. Derived Works

4. Integrity of The Author's Source Code

5. No Discrimination Against Persons or Groups

6. No Discrimination Against Fields of Endeavor

7. Distribution of License

8. License Must Not Be Specific to a Product

9. License Must Not Restrict Other Software

10. License Must Be Technology-Neutral

No provision of the license may be predicated on any individual technology or style of interface
The Free Software Foundation (FSF)
is a non-profit corporation
founded by Richard Stallman
on 4 October 1985
to support the free software movement

# The FSF advocates for free software ideals as outlined in the Free Software Definition,
works for adoption of free software and free media formats, and organizes activist campaigns
against threats to user freedom like Windows 7, Apple's iPhone and OS X, DRM on ebooks and
movies, and software patents.

# The FSF promotes completely free software distributions of GNU/Linux, and advocates that
users of the GNU/Linux operating system switch to a distribution which respects their freedom.

# The FSF drives development of the GNU operating system and maintains a list of high-priority
free software projects to promote replacements for common proprietary applications.

# The FSF builds and updates resources useful for the free software community like the Free
Software and Hardware Directories, and the free software jobs board.
# The FSF also provides licenses for free software developers to share their code,
including the GNU General Public License.
The GNU General Public License (GNU GPL or simply GPL)
is the most widely used free software license, originally written by
Richard Stallman for the GNU project.

The GPL is the first and foremost copyleft license,
which means that derived works can only be distributed under the same license terms.
Under this philosophy, the GPL grants the recipients of a computer program
the rights of the free software definition and uses copyleft to ensure the
freedoms are preserved, even when the work is changed or added to.
This is in distinction to permissive free software licenses,
of which the BSD licenses are the standard examples.
The GNU operating system is a complete free software system, upward-compatible with Unix.
GNU stands for “GNU's Not Unix”.
Richard Stallman made the Initial Announcement of the GNU Project in September 1983.

The GNU Project aims to develop a complete Unix-like
operating system which is free software.

* GNU, a computer operating system
* GNU General Public License, a free software license
* GNU Free Documentation License, a copyleft license
for free documentation

The name “GNU” is a recursive acronym for “GNU's Not Unix!”.
It is pronounced g-noo, as one syllable with no vowel sound between the g and the n.
Free and open source software, also F/OSS, FOSS
is software that is liberally licensed to grant the right of users to
use, study, change, and improve its design through the availability of its source code.
Free software licences and open source licenses
are used by many software packages.

The licenses have important differences, which mirror the differences
in the ways the two kinds of software can be used and distributed
and reflect differences in the philosophy behind the two.

In the context of free and open source software,
"free" is intended to refer to the freedom to copy and re-use the software,
rather than to the price of the software.
What is free Software?

“Free software” is a matter of liberty, not price.
To understand the concept, you should think of “free” as in “free speech”, not as in “free beer”.

Free software is a matter of the users' freedom to run, copy, distribute, study, change and
improve the software.

More precisely, it refers to four kinds of freedom, for the users of the software:

* The freedom to run the program, for any purpose (freedom 0).
* The freedom to study how the program works, and adapt it to your needs (freedom 1).
Access to the source code is a precondition for this.
* The freedom to redistribute copies so you can help your neighbor (freedom 2).
* The freedom to improve the program, and release your improvements to the public,
so that the whole community benefits (freedom 3).
Access to the source code is a precondition for this.
The Free Software Foundation and Richard M. Stallman

Stallman started the GNU project in 1983 to develop the free operating system
GNU (a recursive acronym for GNU's Not Unix).

To pursue the Free Software Movement, in 1985 Stallman founded
the Free Software Foundation (FSF),
dedicated to promoting computer users' rights
to use, study, copy, modify and redistribute computer programs.

The FSF promotes the development and use of
free software and free documentation.
In particular,
FSF promotes the GNU operating system, used widely today in its GNU/Linux variant,
based on the Linux kernel developed by Linus Torvalds.

FSF believes that free software is a matter of freedom, not price.
The Free Software Foundation of India (FSF India), the official Indian affiliate of the FSF,
was formally inaugurated by Richard Stallman
at the Freedom First! Conference at Thiruvananthapuram, Kerala on 20 July 2001.

FSF INDIA will be the national agency for the promotion of the use of free software,
i.e. software distributed under the GNU General Public Licence (GNU GPL) or
other licences approved by FSF, in all domains.

SO FREE IS ALL ABOUT SOFTWARE FREEDOM
Who is he? Before we learn about him, let us first learn his name.
He is Mr. Tux
Linux is a free Unix-type operating system originally created by Linus Torvalds
with the assistance of developers around the world. Developed under the
GNU General Public License , the source code for Linux is freely available to everyone.

When Linus Torvalds first developed Linux back in August of 1991,
the operating system basically consisted of his kernel and some GNU tools.
With the help of others Linus added more and more tools and applications.
With time, individuals, university students and companies
began distributing Linux with their
own choice of packages bound around Linus' kernel.
This is where the concept of the "distribution" was born.

Today, creating and selling Linux distributions is a multi-million dollar business.
You can buy a boxed version of Linux from companies such as
Red Hat, Debian, SuSE, MandrakeSoft, and many others.

You can also download Linux from any number of companies and individuals.
Linux has an official mascot, Tux, the Linux penguin, which was selected by Linus Torvalds
to represent the image he associates with the operating system. Tux was created by Larry Ewing.

Apart from the fact that it is freely distributed, Linux's functionality, adaptability and
robustness have made it the main alternative to proprietary Unix and Microsoft operating systems.
What is an Operating System?
Operating systems provide a software platform on top of which other programs, called application programs, can run.
With the aid of the firmware and device drivers, the operating system provides the most basic level of control over
all of the computer's hardware devices.

The kernel is the part of the operating system that connects the application software to the hardware of a computer.
An operating system can be divided into many different parts.
One of the most important parts is the kernel,

which controls low-level processes that the average user usually cannot see:
it controls how memory is read and written, the order in which processes are executed,
how information is received and sent by devices like the monitor, keyboard and mouse, and
how information received from networks is interpreted.

The user interface is the part of the operating system that interacts with the computer user
directly, allowing them to control and use programs.
The user interface may be
graphical with icons and a desktop, or textual, with a command line.

A kernel connects the application software to the hardware of a computer.

With the aid of the firmware and device drivers, the operating system provides the most
basic level of control over all of the computer's hardware devices.
It manages memory access for programs in the RAM,
it determines which programs get access to which hardware resources,
it sets up or resets the CPU's operating states for optimal operation at all times, and
it organizes the data for long-term non-volatile
storage with file systems on such media as disks, tapes, flash memory, etc.
A Linux distribution, commonly called a "distro",
is a project that manages a remote collection of
system software and application software packages
available for download and installation through a network connection.
This allows the user to adapt the operating system to his/her specific needs.
Distributions are maintained by
individuals, loose-knit teams, volunteer organizations, and commercial entities.

A distribution is responsible for the default configuration of the
installed Linux kernel, general system security, and more generally
integration of the different software packages into a coherent whole.
Distributions typically use a package manager such as
Synaptic, YAST, or Portage to install, remove and update
all of a system's software from one central location.

Although Linux distributions are generally available without charge,
several large corporations sell, support, and contribute to the
development of the components of the system and of free software.

An analysis of Linux showed 75 percent of the code from December 2008 to January 2010
was developed by programmers working for corporations, leaving about 18 percent
to the traditional, open source community.

Some of the major corporations that contribute include
Dell, IBM, HP, Oracle, Sun Microsystems, Novell, Nokia.
A number of corporations, notably Red Hat,
have built their entire business around Linux distributions.
Wanna know more about TUX
Some of the Famous Linux distributions for a Desktop/ Laptop
Just as there are many distributions, each for a particular domain, bioinformatics has a few Linux
distributions of its own, such as BioBrew, Bio-Linux, PhyLIS, Vlinux, DNALinux and BioKnoppix
Many more Linux distributions for your better understanding of Linux
Which one is more famous?

My favorite is Ubuntu for the desktop/laptop; I assume Red Hat is the favorite for servers

Luckily, Bio-Linux is based on my favorite, Ubuntu
Coming to the point

Bio-Linux 6 is one of the Linux distros.

It is a bioinformatics platform and operating system
based on Ubuntu Linux 10.04.

Ubuntu - GNU/Linux - Bio-Linux 6
NEBC works to enable environmental research in the molecular age.
The NEBC collects and stores environmental 'omics data from researchers
in accordance with the NERC Data Policy.

The NERC Environmental Bioinformatics Centre (NEBC) was established in 2002
to provide bioinformatics, data management and computing support
to the NERC research community using 'omics technologies.

The NEBC supports environmental researchers who are generating and
using molecular data through the development and provision of tools
designed to fit their needs.

Many of these tools are developed in collaboration with others,
and are generally useful to anyone engaged in biological research.

The Natural Environment Research Council (NERC),
established by Royal Charter in 1965, is one of seven UK Government Research Councils.

The Centre for Ecology and Hydrology (CEH) is one of the Centres and Surveys of NERC,
and is the leading UK body for research, survey, and monitoring in terrestrial,
and freshwater environments.

The NEBC Toolbox includes the Bio-Linux computing platform.
An operating system provides a software platform on top of which other
programs, called application programs, can run.

Your choice of operating system, therefore, determines to a great extent the applications you
can run.

Bio-Linux 6 is an operating system that provides a software platform for bioinformatics,
where you can run and build bioinformatics applications.

More than 500 bioinformatics applications are bundled with Bio-Linux 6.

Bio-Linux 6 is not just a GNU/Linux operating system for bioinformatics but a
software platform for bioinformatics built upon GNU/Linux,

and a computer system for biologists.
What is the language of computers?
0
10100010011101010101
00110100111010101011
11010001001110101010
10100111001110101011
01010010101101010101
10100010010001011101
00110010111101110101
11100011111101010101
1 0 1 0 1 0 1 0 01 1 1 0 1 0 1 0 1 0 1
10100100111010101010

1
A programming language is an artificial language
designed to express computations that can be performed by a computer.

Programming languages can be used to create programs
that control the behavior of a machine, to express algorithms precisely,
or as a mode of human communication.

Just as we have many languages for communication among ourselves,
computers have acquired many languages over time.

Apart from these general-purpose languages,
the following language packages are specific to bioinformatics.

Biojava
Bioperl
Bioruby
Biopython
Eclipse
These language-specific packages are used to build bioinformatics
applications on the Bio-Linux platform
bio + informatics
Computers and statistics are
applied to the field of
molecular biology

in order to convert
biological data into
computer code /
digital DATA (0s and 1s)

Sequence Format
DATA FORMAT of Biological DATA
Data Formats

Many bioinformatics applications and their databases have their own data formats.

Indeed, since computer storage can store only bits,
the computer must have some way of converting information
to 0s and 1s and vice versa.
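The conversion between text and bits can be shown directly. The sketch below assumes plain ASCII text, where each character occupies one 8-bit byte. The function names are made up for this illustration.

```python
def text_to_bits(text: str) -> str:
    """Encode ASCII text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Decode space-separated 8-bit groups back into ASCII text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


if __name__ == "__main__":
    encoded = text_to_bits("ACGT")
    print(encoded)                # 01000001 01000011 01000111 01010100
    print(bits_to_text(encoded))  # ACGT
```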

There are many formats for different kinds of information.

Video files have many formats, e.g. mpg, mpeg, avi, dat, wmv ...
Audio files have many formats, e.g. mp3, wav ...
Image files have many formats, e.g. jpeg, png, gif, bmp ...
Text files have many formats, e.g. txt, rtf, pdf, ps ...

Just as we have different number systems for dealing with mathematics,
biological data (information) can have different formats.

Likewise, sequence formats come in many forms.
If you don't hold your sequence in a recognized standard format,
you will not be able to analyze your sequence easily.
Sequences can be read and written in a variety of formats.

What a sequence format IS

Sequence formats are ASCII TEXT

They are the required arrangement of characters, symbols and keywords
that specify what things such as the sequence, ID name and comments
look like in a sequence entry.

There are generally no hidden, unprintable 'control' characters
in any sequence format.

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

All standard sequence formats can be printed out or viewed simply by
displaying their file.
Sequence Database Formats

* EMBL
* GenBank
* SwissProt
* PIR

Sequence Files
Files can hold sequences in standard recognised formats.

Multiple sequences
Some sequence formats can hold multiple sequences in one file.

Preferably, you should stay away from formats that can't cope
with multiple sequences in a file.

An application may accept different Input/Output File Formats.
Identification
A sequence does not require any sort of identification, but it certainly helps!

Most sequence formats include at least one form of ID name,
usually placed somewhere at the top of the sequence format.

The simple format fasta has the ID name as the first word on its title line.
For example the ID name 'xyz':

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
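The layout described above (a title line starting with '>', the ID name as its first word, the rest as a comment, and the sequence on the following lines) can be read with a short parser. This is a minimal sketch rather than a full implementation of the format. The names FastaRecord and parse_fasta are assumptions made for this example.

```python
from typing import Iterator, List, NamedTuple, Optional


class FastaRecord(NamedTuple):
    seq_id: str        # first word of the title line
    description: str   # rest of the title line (may be empty)
    sequence: str      # sequence lines joined together


def parse_fasta(text: str) -> Iterator[FastaRecord]:
    """Yield one record per '>' title line; a file may hold several sequences."""
    seq_id: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield FastaRecord(seq_id, description, "".join(chunks))
            header = line[1:].split(maxsplit=1)
            seq_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield FastaRecord(seq_id, description, "".join(chunks))


if __name__ == "__main__":
    demo = ">xyz some other comment\nttcctctttctc\ngactccatcttc\n>abc\nGGCC\n"
    for record in parse_fasta(demo):
        print(record.seq_id, "|", record.description, "|", record.sequence)
```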
IDs and Accessions

An entry in a database must have some way of being uniquely identified in that database.
Most sequence databases have two such identifiers for each sequence
- an ID name and an Accession number.

EMBL, GenBank and SwissProt share an Accession numbering scheme
- an Accession number uniquely identifies a sequence within these three databases.
Annotation and Features

Most formats allow you to hold other description, annotation and comments,
for example fasta format holds comments in the title line:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

Other formats have specific fields for holding information such as references,
keywords, associated entries in other databases and feature tables
The Sequence
Nucleotide (DNA or RNA) sequences are usually stored in the IUBMB standard codes.
Similarly, protein sequences are usually stored in the IUPAC standard one-letter codes.

For example,
fasta format holds the sequence as anything after the '>' line until the next entry starts:

>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt

There are exceptions to this code,
for example, staden format uses non-standard ambiguity codes.
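A program that reads sequences usually checks that every symbol belongs to the expected alphabet. The sketch below does this for nucleotide sequences using the commonly used one-letter nucleotide codes, including the ambiguity codes. The table is written out by hand for this illustration, the function names are assumptions, and gap characters are deliberately not accepted.

```python
from typing import List

# One-letter nucleotide codes and the bases each one stands for.
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def invalid_symbols(seq: str) -> List[str]:
    """Sorted list of distinct symbols that are not recognised nucleotide codes."""
    return sorted({symbol for symbol in seq.upper() if symbol not in NUCLEOTIDE_CODES})


def is_valid_nucleotide_sequence(seq: str) -> bool:
    """True for a non-empty sequence made only of recognised codes."""
    return bool(seq) and not invalid_symbols(seq)


if __name__ == "__main__":
    print(is_valid_nucleotide_sequence("ttcctctttctcgactccatcttc"))
    print(invalid_symbols("ATGXZ"))
```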

Nearly every sequence analysis package written since programs were first used
to read and write sequences has invented its own format.

EMBOSS is the exception: it has not invented its own format.

The software automatically copes with data in a variety of formats
and even allows transparent retrieval of sequence data from the web.

EMBOSS breaks the historical trend towards commercial software packages.
EMBOSS is available as free, open-source software:
"The European Molecular Biology Open Software Suite"

EMBOSS is "The European Molecular Biology Open Software Suite".

EMBOSS is a free Open Source software analysis package
specially developed for the needs of the molecular biology (e.g. EMBnet) user community.

As extensive libraries are provided with the package,
it is also a platform that allows other scientists to develop and release software in
true open source spirit.

EMBOSS also integrates a range of currently available packages and tools
for sequence analysis into a seamless whole.

Jemboss is a graphical user interface to EMBOSS.
Jemboss is developed by the EMBOSS team.
The software is free and part of the EMBOSS distribution.
The uses of, and interfaces to, EMBOSS have long grown beyond anyone's ability to keep track of them.
EMBOSS is used extensively in production environments.
EMBOSS has several important advantages:

- A properly constructed toolkit for creating robust bioinformatics applications or workflows.
- A comprehensive set of sequence analysis programs.
- All sequence and many alignment and structural formats are handled.
- Extensive programming library for common sequence analysis tasks.
- Additional programming libraries for many other areas including string handling,
pattern-matching, list processing and database indexing.
- It is free of charge.
- It is an open-source project.
- It runs on practically every UNIX or GNU/Linux system you can think of and
some that you can't, plus MS Windows and MacOS.
- Each application has the same style of interface so master one and
you've mastered them all.
- The consistent user interface makes life easier for GUI designers and developers.
- It integrates other popular publicly available packages.
- It is free of arbitrary size limits:
there are no limits on the amount of data that can be processed.
For the programmer, memory management for objects such as
sequences and arrays is simplified.

EMBOSS is mature and stable. A major new version of EMBOSS is released each year.
What can I use EMBOSS for?

Within EMBOSS you will find hundreds of applications covering areas such as:

* Sequence alignment,
* Rapid database searching with sequence patterns,
* Protein motif identification, including domain analysis,
* Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats,
* Codon usage analysis for small genomes,
* Rapid identification of sequence patterns in large scale sequence sets,
* Presentation tools for publication, and much more.
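
To make the CpG island item in the list above concrete, here is a minimal Python sketch of the two numbers usually looked at: the GC fraction and the observed/expected CpG ratio. The function name cpg_stats is hypothetical. The formula (CpG count x length) / (C count x G count) is the commonly quoted one and is an assumption here; it is not claimed to be what any EMBOSS program computes.

```python
def cpg_stats(sequence):
    """Return (GC fraction, CpG observed/expected ratio) for a DNA string."""
    seq = sequence.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # 'CG' cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    expected = (c * g) / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


print(cpg_stats("ACGT"))
# prints (0.5, 4.0)
```

In practice these values are calculated in a sliding window along a long sequence, and regions where both stay high are reported as candidate islands.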

Popular applications include:

prophet       Gapped alignment for profiles.
infoseq       Displays some simple information about sequences.
water         Smith-Waterman local alignment.
pepstats      Protein statistics.
showfeat      Shows features of a sequence.
palindrome    Looks for inverted repeats in a nucleotide sequence.
eprimer3      Picks PCR primers and hybridization oligos.
profit        Scans a sequence or database with a matrix or profile.
extractseq    Extracts regions from a sequence.
marscan       Finds MAR/SAR sites in nucleic sequences.
tfscan        Scans DNA sequences for transcription factors.
patmatmotifs  Compares a protein sequence to the PROSITE motif database.
showdb        Displays information on the currently available databases.
wossname      Finds programs by keywords in their one-line documentation.
abiview       Reads an ABI file and displays the trace.
tranalign     Aligns nucleic coding regions given the aligned proteins.
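
To give a feel for what "Smith-Waterman local alignment" means, the Python sketch below fills in the scoring table and returns the best local score. The function name smith_waterman_score and the scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for illustration. It uses one simple gap penalty and no substitution matrix, so it is a teaching sketch of the technique and not the implementation used by water.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment here
                previous[j - 1] + step,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("GGACGTGG", "CCACGTCC"))
# prints 8
```

The score never drops below zero, which is what makes the alignment local: the shared ACGT stretch scores 4 x 2 = 8 regardless of the unrelated flanking bases.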
EMBOSS currently consists of more than 200 applications.

These may be used alone or in conjunction with one another to
assist in the computational analysis of biological problems.

22 applications can be applied to the alignment of two or more sequences.
In addition to applications for creating alignments, such as dot plots and local, global and
multiple alignment, there are a variety of programmes for determining consensus and
variation within existing alignments.

61 applications have been written with the purpose of providing analysis for nucleic acids.
Composition, codon usage and repeat motifs may all be established using this portion
of the software. Other analyses such as restriction mapping, primer design,
translation and mutation may also be performed.

41 protein analysis programmes are
currently available for the analysis of amino acid sequences, including secondary structure
prediction, pattern recognition and composition analysis.

12 separate utilities to create and index databases are also modules within the EMBOSS suite,
enabling you to build and query your own sequence repositories.

Further programs contribute to the general analysis content of the suite,
offering simulation opportunities such as Michaelis-Menten kinetics
and the creation of phylogenetic distance matrices.
EMBOSS can be installed on any GNU/Linux operating system.

Note: EMBOSS is written in the C programming language.

The Bio-Linux distribution contains many bioinformatics applications and utilities,
an IDE such as the Eclipse SDK, and the NEBC Tools. It is also packaged with EMBOSS.
Open the 'Bioinformatics Docs' icon on the desktop to see this HTML document:
an alphabetically ordered list of all the bioinformatics applications in Bio-Linux 6 (here, the Uncategorized section).
Want to install Bio-Linux?

http://nebc.nerc.ac.uk/
After downloading the Bio-Linux DVD image file (ISO),
you can burn it to a DVD by selecting the downloaded ISO as the
source. You can then install it fresh on your hard drive or on any other
system. You can even install Bio-Linux as a multi-boot system.

Dual-boot means installing Bio-Linux side by side with your existing OS.
If Windows is your OS, then after installing Bio-Linux as dual-boot you
will have a boot option for both Windows and Bio-Linux.

Once you have burned the DVD, if you do not want to install
Bio-Linux on your system, you can test Bio-Linux by simply setting
the boot order so that your system boots first from the
Bio-Linux DVD.

The best option if you don't have your own system:
create your own Bio-Linux USB start-up disk, which you
can carry with you to work with Bio-Linux on almost any system.

Once the USB start-up disk is created, you can even install Bio-Linux on
a friend's system, or simply use it to work
in Bio-Linux by setting the boot order so that the system boots first from
the USB stick.
UNetbootin is a utility that lets Windows users create Linux start-up disks.

The USB stick should be at least 4 GB.

If you have installed Bio-Linux from the DVD, or have a Bio-Linux USB start-up disk, you can make
your own start-up disk from within Bio-Linux.

In the menu bar you can find this utility to make a start-up disk:

Applications / Systems Tools / Usb memory stick maker
DNALinux VD is a preconfigured virtual machine (VM) with
applications targeted for bioinformatics (both DNA and protein
analysis). This virtual machine runs on top of the free VMware
Player.

With this distribution you just boot from the CD and you have a
fully functional Linux OS distribution with open source
applications targeted for the molecular biologist.

VLinux Bioinformatics Workbench is a Linux distribution for
bioinformatics. It is an easy-to-use, CD-based distribution
built on Knoppix 3.3 that requires no installation. It includes a variety of
sequence and structure analysis packages. It is an Open Source
product released under the GNU GPL License.
PhyLIS is a user-friendly, free Linux distribution for
phylogenetics. Install it and you have an instant phylogenetics
workstation.

BioBrew is a collection of open-source applications for life
scientists and an in-house project at Bioinformatics.Org.
Software Freedom from?
you deserve to use software that is:-

free from restriction
free to share and copy
free to learn and adapt
free to work with others

you deserve free software

gain complete freedom from software you possess
there are no boundaries for Linux except Linux
– sureshsaragadam
THANK YOU