Bioinformatics Softwares: by Rifat Shahriyar Student No: 100705037P

Bioinformatics Softwares
By
Rifat Shahriyar
Student No: 100705037P
A Project Submitted for CSE 6406 : Bioinformatics Algorithms Course

Department of Computer Science and Engineering
Bangladesh University of Engineering and Technology
Contents
1 Introduction
2 Databases
2.1 Protein Databases
2.2 Structural Databases
2.3 Nucleotide and Genome Databases
2.4 Others
3 Gene Mapping and Sequence Assembly
4 Sequence Alignment
4.1 Pair-wise Sequence Alignment
4.2 Multiple Sequence Alignment
5 Similarity Search
5.1 FASTA
5.2 BLAST
6 Phylogenetic Analysis
6.1 Software used
7 Hidden Markov Models
7.1 Software used
8 Protein Structure Prediction
8.1 Protein Identification and characterization
8.2 Primary structure analysis and prediction
8.3 Secondary structure analysis and prediction
9 Drug Discovery
9.1 Computer-Aided drug designing methods
10 Open Source Tools
10.1 BioJava
10.2 Bioperl
11 Conclusion
A Partial Digest Problem
B Motif Finding Problem
List of Figures
1 iProClass
2 Protein Databases
3 Most recently found protein: Phylloseptin-2
4 Structural Databases
5 Accelrys Discovery Studio: Protein Refinement
6 FASTA Search
7 FASTA Graphical Search Result
8 Hidden Markov Model
1 Introduction
Bioinformatics and computational biology (closely related to systems biology) involve the use or
development of techniques from applied mathematics, informatics, statistics, computer science,
artificial intelligence, chemistry, and biochemistry to solve biological problems, usually at the
molecular level. The primary goal of bioinformatics is to increase our understanding of biological
processes. It differs from other approaches in its focus on developing and applying
computationally intensive techniques to achieve this goal. Major research efforts in the field
include sequence alignment, gene finding, genome assembly, protein structure alignment, protein
structure prediction, prediction of gene expression and protein-protein interactions, and the
modeling of evolution. Most bioinformatics algorithms deal with very large DNA or protein
sequences, and storing and processing such sequences is expensive in both space and time.
Databases therefore store the sequences, and software manipulates them for many purposes.
Most of the software builds on a number of databases (including third-party ones). It takes
input from the user (for example, a search string) and provides output in both text and graphical form.
The areas of bioinformatics in which software is used are:

• Large Biological Data Management using Databases
• Gene Mapping and Sequence Assembly
• Sequence Alignment
• Similarity Search
• Phylogenetic Analysis
• Hidden Markov Models
• Protein Structure Prediction
• Drug Discovery
• Bioinformatics Programming Using Open Source Tools
2 Databases
At the most basic level, bioinformatics is used to organize biological data so that researchers can
access existing information and add or modify entries. The key is to make biological information
available for analysis and for developing applications. An approximate classification is:
• Protein Databases
• Structural Databases
• Nucleotide and Genome Sequences Databases

• Others
2.1 Protein Databases

Life depends on three critical classes of molecules (DNA, RNA and protein). Primary protein
databases contain over 300,000 protein sequences and function as repositories for raw data.
SWISS-PROT [http://us.expasy.org/sprot] It is a protein sequence database that strives to
provide a high level of annotation, a minimal level of redundancy and a high level of integration
with about 60 other databases ([http://us.expasy.org/cgi-bin/lists?dbxref.txt]). The annotations
cover the function(s) of the protein; post-translational modifications (carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.); domains and sites (calcium-binding regions,
ATP-binding sites, zinc fingers, homeobox, kringle, etc.); secondary structure; quaternary structure
(homodimer, heterotrimer, etc.); similarities to other proteins; diseases associated with
deficiencies in the protein; and sequence conflicts, variants, etc.
TrEMBL [http://us.expasy.org/sprot] It is the computer-annotated supplement of Swiss-Prot. It
contains all translations of EMBL nucleotide sequence entries not yet integrated into Swiss-Prot.
Protein Information Resource [http://pir.georgetown.edu/] It provides the scientific community
with a single, centralized, authoritative resource for protein sequences and functional
information. PIR has collaborated with EBI (European Bioinformatics Institute) and SIB (Swiss
Institute of Bioinformatics) to establish UniProt (the Universal Protein Resource), the central
resource for protein sequence and function. Although SWISS-PROT and PIR overlap extensively, there
are still many sequences that can be found in only one of them.
iProClass [http://pir.georgetown.edu/iproclass/] It provides summary descriptions of protein
family, function and structure for PIR-PSD, Swiss-Prot and TrEMBL sequences. It contains links
to over 90 biological databases. It is implemented in Oracle 8i.
Figure 1: iProClass
2.2 Structural Databases

Structural databases contain macromolecular structures.
Figure 2: Protein Databases
Protein Data Bank (PDB) [http://www.rscb.org/pdb] It is the single worldwide repository for the
processing and distribution of 3-D biological macromolecular structures. Most of them were solved
by X-ray crystallography or nuclear magnetic resonance (NMR). It is the primary archive of 3-D
structures of proteins, RNA, DNA, etc.
PDBSum [http://www.biochem.ucl.ac.uk/bsm/pdbsum/] Information can be difficult to extract from
raw PDB entries, so PDBSum provides summary information and derived data on the entries in the PDB.
Figure 3: Most recently found protein: Phylloseptin-2
Figure 4: Structural Databases
CATH [http://www.biochem.ucl.ac.uk/bsm/cath/] and SCOP [http://scop.mrc-lmb.cam.ac.uk/scop/]
These databases classify proteins by structure in order to identify structural and phylogenetic
relationships. CATH provides a hierarchical classification of structures, and SCOP deals with all
known protein folds.
2.3 Nucleotide and Genome Databases
The biggest excitement has been the availability of complete genome sequences for different
organisms. Completion of the sequencing of the human genome was announced on 14 April 2003.
GenBank [http://www.ncbi.nlm.nih.gov/GenBank/] is an annotated collection of all publicly
available DNA sequences. Other databases are EMBL [http://www.ebi.ac.uk/embl],
UniGene [http://www.ncbi.nlm.nih.gov/UniGene], EBIGenomes [http://www.ebi.ac.uk/genomes],
Ensembl [http://www.ensembl.org], GeneCensus [http://bioinfo.mbb.yale.edu/genome/] etc.
Entrez [http://www.ncbi.nlm.nih.gov/Entrez/] is a retrieval system for searching several linked
databases. It is one of the most widely used database search engines in this field.
2.4 Others
• Genecards : Database of human genes, their products and their involvement in diseases.
• PubMed : Provides access to the biomedical literature.
• PMD : Provides information on protein mutations.
• Prosite : Provides sequence motifs.
• Others : [http://www.netsci.org/Resources/Software/Bioinform/databases.html]
3 Gene Mapping and Sequence Assembly

Sequencing the genome is the starting point for several applications. For example, one may want
to find out whether a knocked-out gene has a backup copy elsewhere in the genome. A gene map is a
graphic representation that provides information about the location of genes, their order along
a chromosome and the distances between them. Reconstructing a sequence from overlapping
subsequences is known as the sequence assembly problem: given a set of substrings, find the
shortest string that contains every member of the set.
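The assembly formulation above can be illustrated with a small greedy sketch. It is not the method used by any of the tools listed in this section: it repeatedly merges the two fragments with the longest suffix-prefix overlap, which is a common heuristic and does not guarantee the shortest possible string. The class and method names are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GreedySuperstring {
    // Length of the longest suffix of a that is also a prefix of b.
    public static int overlap(String a, String b) {
        int max = Math.min(a.length(), b.length());
        for (int k = max; k > 0; k--) {
            if (a.endsWith(b.substring(0, k))) return k;
        }
        return 0;
    }

    // Greedy heuristic: merge the pair with the largest overlap until one string is left.
    public static String assemble(List<String> fragments) {
        List<String> frags = new ArrayList<>();
        for (String f : fragments) {
            boolean contained = false;
            for (String g : fragments) {
                if (!f.equals(g) && g.contains(f)) contained = true;
            }
            // Skip duplicates and fragments that lie inside another fragment.
            if (!contained && !frags.contains(f)) frags.add(f);
        }
        while (frags.size() > 1) {
            int bestI = 0, bestJ = 1, best = -1;
            for (int i = 0; i < frags.size(); i++) {
                for (int j = 0; j < frags.size(); j++) {
                    if (i == j) continue;
                    int k = overlap(frags.get(i), frags.get(j));
                    if (k > best) { best = k; bestI = i; bestJ = j; }
                }
            }
            String a = frags.get(bestI);
            String b = frags.get(bestJ);
            frags.remove(a);
            frags.remove(b);
            frags.add(a + b.substring(best)); // append only the non-overlapping tail of b
        }
        return frags.isEmpty() ? "" : frags.get(0);
    }

    public static void main(String[] args) {
        System.out.println(assemble(Arrays.asList("ATTAGA", "TAGACC", "GACCAT")));
    }
}
```

Production assemblers additionally have to cope with sequencing errors, repeats and reverse-complemented reads, none of which this sketch handles.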
The software used for gene mapping and sequence assembly includes:
• Finch Sequencing system [http://www.geospiza.com/products/tools/index.htm] - A complete
suite for genomic sequencing.
• Phrap [http://www.geospiza.com/products/tools/phrap.htm] - Locates overlapping regions
within individual sequences and assembles them into longer contigs.
• Phred [http://www.geospiza.com/products/tools/phred.htm] - Reads traces from DNA sequencing
instruments and calls bases.
• CAP3 [http://deepc2.zool.iastate.edu/aat/cap/cap.html] - Assembly tool that processes FASTA
inputs.
• Sequencher [http://www.genecodes.com] - Used for DNA sequence analysis, for example by
aligning 20-30 patient samples and looking for mutations at known loci.
A widely used package for this purpose is Accelrys Discovery Studio. It contains the
following components:
• Gelmerge - Uses the Wilbur and Lipman method to find overlapping regions among the fragments
and the Needleman and Wunsch method to align the fragments.
• GelAssemble - Assembles fragments
• GelView - Displays contigs and fragments
• GelDisassemble - Breaks contigs back into fragments
Figure 5: Accelrys Discovery Studio: Protein Refinement
4 Sequence Alignment
4.1 Pair-wise Sequence Alignment
Global alignment It uses the Needleman-Wunsch dynamic programming algorithm. A Java
implementation can be found here [http://www25.brinkster.com/denshade/NeedlemanWunsch.java.htm].
One of the major implementations is EMBOSS needle [http://bioweb2.pasteur.fr/docs/EMBOSS/needle.html].
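As a rough illustration of the dynamic programming behind global alignment, the sketch below computes only the optimal Needleman-Wunsch score. It assumes a simple scoring scheme with one match score, one mismatch score and a linear gap penalty; tools such as EMBOSS needle use substitution matrices and separate gap-open and gap-extend penalties, so this is not their implementation. The class name is hypothetical.

```java
public class NeedlemanWunsch {
    // Optimal global alignment score of a and b under a linear gap penalty.
    public static int score(String a, String b, int match, int mismatch, int gap) {
        int n = a.length(), m = b.length();
        int[][] f = new int[n + 1][m + 1];
        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= n; i++) f[i][0] = i * gap;
        for (int j = 1; j <= m; j++) f[0][j] = j * gap;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = f[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = f[i - 1][j] + gap;    // gap in b
                int left = f[i][j - 1] + gap;  // gap in a
                f[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        return f[n][m];
    }

    public static void main(String[] args) {
        System.out.println(score("GCATGCG", "GATTACA", 1, -1, -1));
    }
}
```

Recovering the alignment itself requires a traceback through the table from the bottom-right cell, which is omitted here to keep the sketch short.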
Local alignment It uses the Smith-Waterman dynamic programming algorithm. A Java implementation
can be found here [http://jaligner.sourceforge.net/]. One of the major implementations is
EMBOSS water [http://bioweb2.pasteur.fr/docs/EMBOSS/water.html]. Other implementations
are MPsrch (a fast implementation), Scanps (a library of DP algorithms) and SSearch.
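The local variant differs from global alignment in two ways: a cell is never allowed to drop below zero, and the result is the maximum over the whole table rather than the bottom-right cell. The following sketch shows this under the same simplifying assumptions as before (single match and mismatch scores, linear gap penalty); it is not the code of EMBOSS water or JAligner, and the class name is hypothetical.

```java
public class SmithWaterman {
    // Best local alignment score of any pair of substrings of a and b.
    public static int score(String a, String b, int match, int mismatch, int gap) {
        int n = a.length(), m = b.length();
        int[][] h = new int[n + 1][m + 1]; // first row and column stay 0
        int best = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int up = h[i - 1][j] + gap;
                int left = h[i][j - 1] + gap;
                // Clamping at 0 lets a new local alignment start anywhere.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACACACTA", "AGCACACA", 2, -1, -1));
    }
}
```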
Basic-Algorithms-of-Bioinformatics Applet [http://baba.sourceforge.net/] provides Java applets
for simulating and understanding the Needleman-Wunsch and Smith-Waterman algorithms.
4.2 Multiple Sequence Alignment

Sum-of-Pairs (SP) method The program MSA [http://www.psc.edu/general/software/package/msa]
uses a variant of multi-dimensional DP to produce an optimal global alignment of several
sequences. It uses advanced projective geometry techniques to reduce the space complexity.
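To make the sum-of-pairs objective concrete, the sketch below scores an existing multiple alignment by adding up the pairwise scores of every pair of rows in every column. The scoring values (match, mismatch, residue against gap, and 0 for gap against gap) are assumptions for illustration, and the code only evaluates an alignment; it does not search for the optimal one as MSA does.

```java
public class SumOfPairs {
    // Score of one aligned pair of symbols; '-' denotes a gap.
    public static int pairScore(char x, char y, int match, int mismatch, int gap) {
        if (x == '-' && y == '-') return 0;    // assumption: gap against gap is free
        if (x == '-' || y == '-') return gap;
        return x == y ? match : mismatch;
    }

    // Sum-of-pairs score of an alignment whose rows all have the same length.
    public static int score(String[] rows, int match, int mismatch, int gap) {
        if (rows.length == 0) return 0;
        int total = 0;
        int columns = rows[0].length();
        for (int c = 0; c < columns; c++) {
            for (int i = 0; i < rows.length; i++) {
                for (int j = i + 1; j < rows.length; j++) {
                    total += pairScore(rows[i].charAt(c), rows[j].charAt(c),
                            match, mismatch, gap);
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        String[] alignment = {"AC-", "ACG", "A-G"};
        System.out.println(score(alignment, 1, -1, -1));
    }
}
```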
Progressive alignment method ClustalW [http://www.ebi.ac.uk/clustalw] produces the best match
for the selected sequences. It arranges them so that the identities, similarities and differences
can be seen. Pileup uses the Feng and Doolittle DP method for pair-wise alignment. PIMA is also
known as a star alignment method and is based on a scoring method.
Iterative alignment method Dialign [http://www.gsf.de/biodv/dialign.html] aims at the
delineation of regions of similarity among the given sequences. It computes a hierarchical tree
based on pair-wise comparisons and the resulting scores. Sequence Alignment by Genetic Algorithm
(SAGA) creates many different multiple sequence alignments through rearrangements that simulate
gaps and genetic recombination events.
5 Similarity Search
FASTA and BLAST are the most popular tools for similarity search. They use heuristic methods
instead of DP because the latter is computationally intensive. Heuristic methods are approximate
but fast and effective. FASTA uses the Pearson and Lipman algorithm. It is a word-based exact
search method that uses dot plots, in which similar regions show up as diagonals. BLAST is also a
word-based method. It requires a pre-formatted search database and searches for the most unusual
or highest-scoring words.
5.1 FASTA
Algorithm The FASTA search proceeds in four steps:
• Find runs of identical words: identify the regions shared by the two sequences that have the
highest density of single-residue identities.
• Re-score using a PAM matrix: re-scan the best regions using the PAM matrix and keep the best score.
• Join segments using gaps and eliminate other segments: determine whether gaps can be used to
join the regions.
• Use DP to create the optimal alignment: construct an optimal alignment using the
Smith-Waterman algorithm.
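The first step, finding runs of identical words, can be sketched as counting exact word matches per diagonal of the dot plot. The code below is a simplified illustration and not the FASTA source: it indexes all words of length k in the query, scans the subject, and tallies hits by diagonal (subject position minus query position). The word length k and all names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KTupleDiagonals {
    // Number of exact k-word matches on each diagonal (subject index - query index).
    public static Map<Integer, Integer> diagonalHits(String query, String subject, int k) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i + k <= query.length(); i++) {
            index.computeIfAbsent(query.substring(i, i + k), w -> new ArrayList<>()).add(i);
        }
        Map<Integer, Integer> hits = new HashMap<>();
        for (int j = 0; j + k <= subject.length(); j++) {
            List<Integer> positions = index.get(subject.substring(j, j + k));
            if (positions == null) continue;
            for (int i : positions) {
                hits.merge(j - i, 1, Integer::sum);
            }
        }
        return hits;
    }

    // Returns {diagonal, hitCount} for the densest diagonal, or null if there are no hits.
    public static int[] bestDiagonal(Map<Integer, Integer> hits) {
        int[] best = null;
        for (Map.Entry<Integer, Integer> e : hits.entrySet()) {
            if (best == null || e.getValue() > best[1]) {
                best = new int[]{e.getKey(), e.getValue()};
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] best = bestDiagonal(diagonalHits("GATTACA", "TTGATTACA", 2));
        System.out.println("diagonal " + best[0] + " with " + best[1] + " word hits");
    }
}
```

A diagonal with many hits corresponds to a region where the two sequences match without gaps, which is what the later re-scoring and joining steps start from.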
Implementation FASTA3 [http://www.ebi.ac.uk/fasta33/] is the most popular FASTA implementation.
It provides sequence similarity and homology searching against complete proteome or
genome databases using the FASTA programs. For any sequence search, the output consists of a
histogram, a sequence listing, local alignments and the significance of the E-values. Other
implementations are TFASTA, LFASTA and PLFASTA.
5.2 BLAST
Algorithm BLAST first builds a list of high-scoring words from the query and locates all similar
words in the current database sequence. It compares the word list to the database and identifies
the exact matches. If similar words are found, it tries to expand the alignment to the adjacent
words without allowing gaps. After all words are tested, a set of Maximal Segment Pairs (MSPs) is
chosen for that database sequence. Several short, non-overlapping MSPs may be combined in a
statistical test to create a larger, more significant match.
Step 1: BLAST searches for exact matches of a small fixed length w between the query and the
sequences in the database. For example, for AGTTAC and ACTTAG with w = 3, the match (seed) is TTA.
Step 2: It tries to extend the match in both directions, starting at the seed. Insertions and
deletions are not considered during this stage.
Step 3: It performs a gapped alignment between the query sequence and the database sequence by
computing Maximal Segment Pairs using a variation of the Smith-Waterman algorithm. Statistically
significant alignments are then displayed to the user.
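Steps 1 and 2 can be sketched as follows. This is a deliberately simplified illustration rather than NCBI's implementation: it seeds only on identical words, scores with a single match and mismatch value instead of a substitution matrix, and stops an ungapped extension once the running score falls more than a chosen drop-off below the best score seen. The parameter names and the drop-off value are assumptions.

```java
public class SeedAndExtend {
    // Extends an exact seed of length w (at qi in q, si in s) without gaps in both directions.
    public static int extend(String q, String s, int qi, int si, int w,
                             int match, int mismatch, int xdrop) {
        int best = w * match;
        int run = best;
        // Extend to the right of the seed.
        for (int a = qi + w, b = si + w; a < q.length() && b < s.length(); a++, b++) {
            run += q.charAt(a) == s.charAt(b) ? match : mismatch;
            if (run > best) best = run;
            if (best - run > xdrop) break;
        }
        // Extend to the left, starting from the best score reached so far.
        run = best;
        for (int a = qi - 1, b = si - 1; a >= 0 && b >= 0; a--, b--) {
            run += q.charAt(a) == s.charAt(b) ? match : mismatch;
            if (run > best) best = run;
            if (best - run > xdrop) break;
        }
        return best;
    }

    // Best ungapped segment score over all exact seeds of length w; 0 if there is no seed.
    public static int bestSegmentScore(String q, String s, int w,
                                       int match, int mismatch, int xdrop) {
        int best = 0;
        for (int i = 0; i + w <= q.length(); i++) {
            for (int j = 0; j + w <= s.length(); j++) {
                if (q.regionMatches(i, s, j, w)) {
                    best = Math.max(best, extend(q, s, i, j, w, match, mismatch, xdrop));
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The example from the text: the only seed of length 3 is TTA.
        System.out.println(bestSegmentScore("AGTTAC", "ACTTAG", 3, 1, -1, 2));
    }
}
```

A real search avoids the quadratic seed scan shown here by looking words up in the pre-formatted database index.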
Implementation BLAST [http://www.ncbi.nlm.nih.gov/BLAST/] from NCBI is the most popular BLAST
implementation. For any sequence search, the output consists of a short description of the
program, the database list and program options, and a list of all the database sequences that
matched the query sequence together with the quality of each match. BLAST services from NCBI
include Nucleotide BLAST, MEGABLAST, Protein BLAST, PHI-BLAST etc. PSI-BLAST is a BLAST variant
for finding protein families.
Figure 6: FASTA Search
Figure 7: FASTA Graphical Search Result
6 Phylogenetic Analysis
Given a set of DNA sequences, one can reconstruct the evolutionary relationships among genes
and organisms. A phylogenetic tree shows the evolutionary relationships among various
biological species or other entities that are believed to have a common ancestor. Each node with
descendants represents the most recent common ancestor of those descendants.
6.1 Software used

PhyloBLAST [http://www.pathogenomics.bc.ca/phyloBLAST] It compares a protein sequence
to a SWISS-PROT/TrEMBL database using WU-BLAST2 and then allows the user to perform user-defined
phylogenetic analyses.
Phylip (Phylogeny Interface Package) [http://evolution.genetics.washington.edu/phylip.html]
It supports parsimony, distance matrix and likelihood methods, including bootstrapping and
consensus trees.
PAUP (Phylogenetic Analysis using Parsimony) [http://paup.csit.fsu.edu/] It supports a wide
range of DNA substitution models (Jukes-Cantor, Kimura 2P, HKY85, fastDNAml, Tamura-Nei).
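As an example of what a substitution model contributes, the sketch below computes the Jukes-Cantor distance between two aligned DNA sequences of equal length. With p the fraction of differing sites, the corrected distance is d = -(3/4) ln(1 - 4p/3). This is only an illustration of the simplest model in the list; it is not code from PAUP or Phylip, it assumes gap-free input, and the class name is hypothetical.

```java
public class JukesCantor {
    // Fraction of sites at which two aligned sequences of equal length differ.
    public static double pDistance(String a, String b) {
        if (a.length() != b.length() || a.isEmpty()) {
            throw new IllegalArgumentException("sequences must be non-empty and of equal length");
        }
        int diff = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) diff++;
        }
        return (double) diff / a.length();
    }

    // Jukes-Cantor corrected distance; infinite when the sequences are saturated (p >= 3/4).
    public static double distance(String a, String b) {
        double p = pDistance(a, b);
        if (p >= 0.75) return Double.POSITIVE_INFINITY;
        return -0.75 * Math.log(1.0 - 4.0 * p / 3.0);
    }

    public static void main(String[] args) {
        System.out.println(distance("ACGTACGTAC", "ACGTTCGTAA"));
    }
}
```

Distances of this kind, computed for every pair of sequences, form the input of distance matrix methods.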
PAML (Phylogenetic Analysis using Maximum Likelihood) [http://abacus.gene.ucl.ac.uk/paml.html]
It performs phylogenetic analysis of DNA or protein sequences using maximum likelihood (ML). It is
mainly used to study the process of sequence evolution and is not well suited to building trees.
PhyloDraw [http://pearl.cs.pusan.ac.kr/phylodraw/] It is a drawing tool for creating
phylogenetic trees. It supports the output of various multi-alignment programs and visualizes
various kinds of tree diagrams, such as the rectangular cladogram, slanted cladogram, phylogram,
free tree and radial tree. Users can manipulate the shape of a phylogenetic tree easily and
interactively by using several control parameters.
7 Hidden Markov Models

A Markov process is a stochastic model that moves between a number of states and emits a symbol
each time a state is visited. In a biological context, a sequence can be viewed as the record of
such a process. A hidden Markov model (HMM) treats a protein as the output of a hidden process
that generates a sequence of amino acid residues, where chance plays an important role in
determining the exact sequence produced. Working as a finite state machine, the HMM generates a
protein sequence by emitting amino acids as it progresses through a series of states. Each state
has a table of amino acid emission probabilities and a table of transition probabilities to other
states. The model is called hidden because only the emitted symbols are observable, not the
underlying random walk between states. HMMs are used in sequence alignment, gene prediction,
modeling of protein domains and pathway analysis.
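The emission and transition tables described above are enough to compute how likely an observed sequence is under a model. The sketch below uses the standard forward algorithm for this, summing over all hidden state paths. It is a minimal illustration, not code from HMMER or SAM: symbols are given as integer indices, there is no explicit end state, and probabilities are multiplied directly, which would underflow for long sequences (real tools work in log space or rescale). The two-state model in main uses made-up numbers.

```java
public class HmmForward {
    // start[i]    : probability of starting in state i
    // trans[i][j] : probability of moving from state i to state j
    // emit[i][k]  : probability that state i emits symbol k
    // obs         : observed symbols as indices into the emission tables
    public static double forward(double[] start, double[][] trans,
                                 double[][] emit, int[] obs) {
        int n = start.length;
        if (obs.length == 0) return 1.0;
        double[] alpha = new double[n];
        for (int i = 0; i < n; i++) {
            alpha[i] = start[i] * emit[i][obs[0]];
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) {
                    sum += alpha[i] * trans[i][j];
                }
                next[j] = sum * emit[j][obs[t]];
            }
            alpha = next;
        }
        double total = 0.0;
        for (double a : alpha) total += a;
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical two-state model over a two-letter alphabet (0 and 1).
        double[] start = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.9, 0.1}, {0.2, 0.8}};
        System.out.println(forward(start, trans, emit, new int[]{0, 0, 1}));
    }
}
```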
7.1 Software used

HMMER [http://hmmer.janelia.org] It is a freely distributable implementation of HMM software
for protein sequence analysis. It contains several programs for building a model, using a
model and working with HMM databases.
SAM (Sequence Alignment and Modeling system) [http://www.cse.ucsc.edu/compbio/sam.html]
It is a collection of tools for creating and using HMMs.
Figure 8: Hidden Markov Model
Meta-MEME [http://metameme.sdsc.edu/] It is a software toolkit for building and using HMMs
of DNA and proteins. Meta-MEME's HMMs differ from those of SAM and HMMER because its models are
motif-based.
HMMpro [http://www.netid.com/html/hmmpro.html] It is a general-purpose HMM simulator
for biological sequence analysis. It uses machine learning techniques to automatically build
statistical models of protein and DNA sequences.
8 Protein Structure Prediction

Predicting protein structure from sequence is one of the most actively studied problems in the
field. The 3-D structure of a protein is largely determined by its amino acid sequence. Several
formulations of the related protein folding problem are NP-complete. Because it is impractical to
predict structure from sequence manually, a large number of tools are used for this purpose.
8.1 Protein Identification and characterization

AACompIdent [http://us.expassy.org/tools/aacomp/] is an important tool for identifying a protein
by its amino acid composition. PepSea [http://195.41.108.38/PepSeaIntro.html] is a tool for
protein identification by peptide mapping or peptide sequencing.
8.2 Primary structure analysis and prediction

The sequence of amino acids is called the primary structure of the protein. SAPS is a
tool that evaluates a wide variety of protein sequence properties using statistical analysis.
ProtScale [http://ca.expassy.org/cgi-bin/protscale.pl] is used to calculate the hydrophobicity
profile of any given protein.
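A hydrophobicity profile of this kind can be sketched as a sliding-window average over per-residue values. The example below assumes the Kyte-Doolittle hydropathy scale, one of the amino acid scales ProtScale offers, and a plain unweighted window; it is not ProtScale's code, and the window size is left to the caller.

```java
public class Hydropathy {
    // Kyte-Doolittle hydropathy values, in the order of RESIDUES.
    private static final String RESIDUES = "ARNDCQEGHILKMFPSTWYV";
    private static final double[] KD = {
        1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
        3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2
    };

    public static double value(char residue) {
        int k = RESIDUES.indexOf(Character.toUpperCase(residue));
        if (k < 0) throw new IllegalArgumentException("unknown residue: " + residue);
        return KD[k];
    }

    // Average hydropathy of every window of the given size along the sequence.
    public static double[] profile(String sequence, int window) {
        if (window <= 0 || window > sequence.length()) return new double[0];
        double[] result = new double[sequence.length() - window + 1];
        for (int start = 0; start < result.length; start++) {
            double sum = 0.0;
            for (int i = start; i < start + window; i++) {
                sum += value(sequence.charAt(i));
            }
            result[start] = sum / window;
        }
        return result;
    }

    public static void main(String[] args) {
        for (double v : profile("MKTLLILAVVAAALA", 5)) {
            System.out.printf("%.2f%n", v);
        }
    }
}
```

Stretches with high average values are hydrophobic and are often read as candidates for buried or membrane-spanning regions.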
8.3 Secondary structure analysis and prediction
Secondary structure arises from local inter-residue interactions mediated by hydrogen bonds. The
most common structures are alpha helices and beta sheets. Prediction methods and representative
tools are:
• Chou-Fasman method - PeptideStructure
• GOR (Garnier, Osguthorpe and Robson) method - GOR IV
• Nearest neighbor methods - SIMPA96, NNSSP
• Hidden Markov models - Pfam
• Neural Networks - HNN, nnPredict, PSA, PSIPRED
• Multiple alignment based self-optimization method - SOPMA
9 Drug Discovery
The drug industry is one of the major players in the development of the field of bioinformatics.
Many pharmaceutical companies have internal teams conducting bioinformatics research.
Their main purpose is to reach solutions ahead of the competition and so give their company
a crucial edge in producing the next major drug. Most drugs are small molecules that are
designed to bind to, interact with and modulate the activity of specific biological receptors.
Receptors are proteins that bind and interact with other molecules to perform the numerous
functions required for the maintenance of life.
9.1 Computer-Aided drug designing methods

Computer-Aided Molecular Design (CAMD) It makes use of knowledge of the steric and electronic
aspects of receptor/ligand and enzyme/substrate interactions to aid drug design.
Quantum CACHe and Project Leader [http://www.accelrys.com/about/oxmol.html] It is used
to investigate a wide range of molecular properties.
Molecular Graphics [http://scsg9.unige.ch/fln/eng/toc.html] It provides graphical
representations of molecular structures.
Molecular modeling toolbox [http://webnet.mednet.gu.se/chemistry/molmod/] It serves the same
purpose as Molecular Graphics.
XED [http://www.ch.cam.ac.uk/SGTL/xed/] It places special emphasis on interactions.
10 Open Source Tools

Two widely used languages in bioinformatics are Java and Perl. Two major open source libraries are
BioJava [http://biojava.org/wiki/Main_Page] and BioPerl [http://bioperl.org/wiki/Main_Page].
10.1 BioJava
BioJava is an open source project dedicated to providing Java tools for processing biological
data. It includes objects for manipulating sequences, file parsers, CORBA interoperability, DAS,
access to various databases, dynamic programming, and simple statistical routines.
10.2 Bioperl
Bioperl is a collection of Perl modules that facilitate the development of Perl scripts for
bioinformatics applications. It does not include ready-to-use programs in the sense that many
commercial packages and free web-based interfaces do (e.g. Entrez, SRS). Bioperl does provide
reusable Perl modules that facilitate writing Perl scripts for sequence manipulation, for
accessing databases using a range of data formats, and for executing various molecular biology
programs, including Blast, ClustalW, Genscan and HMMER, and parsing their results.
11 Conclusion
My project was to explore open source tools (especially BioJava) so that bioinformatics
algorithmic problems can be mapped onto them and solved with them. I explored BioJava a little and
found implementations of some bioinformatics algorithms in it. Moreover, I started implementing a
library based on the algorithms we studied in this course. I have already implemented the Partial
Digest Problem and Motif Finding in Java. The source code is given in the appendix.
A Partial Digest Problem
PDP.java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.StringTokenizer;

// Backtracking solution of the Partial Digest Problem: given the multiset L of
// pairwise distances between points on a line, reconstruct the point set X.
public class PDP
{
    int width;          // largest distance in L, i.e. the position of the last point
    List<Integer> L;    // distances not yet explained by the points in X
    List<Integer> X;    // points placed so far

    public PDP(List<Integer> l)
    {
        this.L = l;
        this.X = new ArrayList<Integer>();
        this.width = this.getMax();
        this.delete(this.L, this.width);
        this.add(this.X, 0);
        this.add(this.X, this.width);
        this.place();
    }

    // Removes the first occurrence of element from a, if present.
    public void delete(List<Integer> a, int element)
    {
        a.remove(Integer.valueOf(element));
    }

    public void add(List<Integer> a, int element)
    {
        a.add(element);
    }

    // Prints the elements of a in ascending order.
    public void print(List<Integer> a)
    {
        this.sort(a);
        System.out.println("******************************************");
        for (int i = 0; i < a.size(); i++)
        {
            System.out.print(a.get(i) + " ");
        }
        System.out.println("");
        System.out.println("******************************************");
    }

    public void sort(List<Integer> a)
    {
        Collections.sort(a);
    }

    // Largest element of L (0 if L is empty).
    public int getMax()
    {
        int max = 0;
        for (int i = 0; i < this.L.size(); i++)
        {
            int num = this.L.get(i);
            if (i == 0 || num > max) max = num;
        }
        return max;
    }

    // Distances between value and every point already in X.
    public List<Integer> D(int value)
    {
        List<Integer> ret = new ArrayList<Integer>();
        for (int i = 0; i < this.X.size(); i++)
        {
            this.add(ret, Math.abs(value - this.X.get(i)));
        }
        return ret;
    }

    // True if a is contained in b as a multiset: a distance that occurs twice
    // in a must occur at least twice in b.
    public boolean isSubSet(List<Integer> a, List<Integer> b)
    {
        List<Integer> rest = new ArrayList<Integer>(b);
        for (int i = 0; i < a.size(); i++)
        {
            if (!rest.remove(a.get(i))) return false;
        }
        return true;
    }

    // Tries to place the largest remaining distance y either at position y or at
    // position width - y, recursing on each consistent choice and undoing it afterwards.
    public void place()
    {
        if (this.L.size() == 0)
        {
            this.print(this.X);
            return;
        }
        int y = this.getMax();
        List<Integer> dy = this.D(y);
        List<Integer> dwy = this.D(this.width - y);

        if (this.isSubSet(dy, this.L))
        {
            // dy contains y itself (the distance to point 0), so y leaves L here.
            this.add(this.X, y);
            for (int k = 0; k < dy.size(); k++) this.delete(this.L, dy.get(k));
            this.place();
            this.delete(this.X, y);
            for (int k = 0; k < dy.size(); k++) this.add(this.L, dy.get(k));
        }
        if (this.isSubSet(dwy, this.L))
        {
            // dwy contains y as well (the distance to point width).
            this.add(this.X, this.width - y);
            for (int k = 0; k < dwy.size(); k++) this.delete(this.L, dwy.get(k));
            this.place();
            this.delete(this.X, this.width - y);
            for (int k = 0; k < dwy.size(); k++) this.add(this.L, dwy.get(k));
        }
    }

    // Reads one problem instance per line of input.txt (distances separated by spaces).
    public static void main(String args[])
    {
        try
        {
            BufferedReader br = new BufferedReader(new FileReader("input.txt"));
            while (true)
            {
                List<Integer> a = new ArrayList<Integer>();
                String s = br.readLine();
                if (s == null) break;
                StringTokenizer st = new StringTokenizer(s);
                while (st.hasMoreTokens())
                {
                    a.add(Integer.valueOf(st.nextToken()));
                }
                new PDP(a);     // the constructor solves the instance and prints every solution
            }
            br.close();
        }
        catch (Exception e)
        {
            System.out.println("Exception : " + e);
        }
    }
}
Input
2 2 3 3 4 5 6 7 8 10
B Motif Finding Problem
MF.java
import java.io.BufferedReader;
import java.io.FileReader;

// Greedy motif search: picks the best pair of l-mers in the first two
// sequences, then extends the motif by one sequence at a time.
public class MF
{
    String[] dna;   // the t DNA sequences
    int l, n, t;    // motif length, sequence length, number of sequences
    int[] best;     // best starting positions found so far
    int[] s;        // starting positions currently being tried

    MF(String[] dna, int t, int n, int l)
    {
        this.dna = dna;
        this.t = t;
        this.n = n;
        this.l = l;
        this.best = new int[this.t];    // Java initializes both arrays to 0
        this.s = new int[this.t];
        this.greedyMF();
    }

    public void greedyMF()
    {
        // Try every pair of starting positions in the first two sequences.
        for (this.s[0] = 0; this.s[0] < (this.n - this.l + 1); this.s[0]++)
        {
            for (this.s[1] = 0; this.s[1] < (this.n - this.l + 1); this.s[1]++)
            {
                if (score(this.s, 2) > score(this.best, 2))
                {
                    this.best[0] = this.s[0];
                    this.best[1] = this.s[1];
                }
            }
        }
        this.s[0] = this.best[0];
        this.s[1] = this.best[1];

        // Keep the positions chosen so far and add one sequence at a time.
        for (int i = 2; i < this.t; i++)
        {
            for (this.s[i] = 0; this.s[i] < (this.n - this.l + 1); this.s[i]++)
            {
                // Sequences 0..i take part, so i + 1 rows are scored.
                if (score(this.s, i + 1) > score(this.best, i + 1))
                {
                    this.best[i] = this.s[i];
                }
            }
            this.s[i] = this.best[i];
        }

        // Report the chosen starting position and l-mer for each sequence.
        for (int i = 0; i < this.t; i++)
        {
            System.out.println(this.best[i] + " " + substring(this.dna[i], this.best[i], this.l));
        }
    }

    // Returns the len characters of s starting at index beg.
    public String substring(String s, int beg, int len)
    {
        String ret = "";
        for (int i = beg; i < beg + len; i++)
        {
            ret += s.charAt(i);
        }
        return ret;
    }

    // Consensus score of the l-mers starting at s[0..index-1]: for each of the
    // l columns, the count of the most frequent nucleotide, summed over columns.
    public int score(int s[], int index)
    {
        int sum = 0;
        for (int i = 0; i < this.l; i++)
        {
            int[] count = new int[4];   // occurrences of A, C, G, T in column i
            int max = 0;
            for (int j = 0; j < index; j++)
            {
                int k = "ACGT".indexOf(this.dna[j].charAt(s[j] + i));
                if (k >= 0 && ++count[k] > max)
                {
                    max = count[k];
                }
            }
            sum += max;
        }
        return sum;
    }

    // Reads t, n and l (one per line) followed by the t sequences from mfinput.txt.
    public static void main(String args[])
    {
        try
        {
            BufferedReader br = new BufferedReader(new FileReader("mfinput.txt"));
            String s = br.readLine();
            int t = Integer.parseInt(s);
            s = br.readLine();
            int n = Integer.parseInt(s);
            s = br.readLine();
            int l = Integer.parseInt(s);
            String[] dna = new String[t];
            for (int i = 0; i < t; i++)
            {
                dna[i] = br.readLine();
            }
            new MF(dna, t, n, l);   // the constructor runs the search and prints the result
            br.close();
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
Input
5
8
8
AGGTACTT
CCATACGT
ACGTTAGT
ACGTCCAT
CCGTACGG