Gene Recognition
A project report submitted to
Visvesvaraya Technological University, Belgaum in partial fulfillment for the award of the degree of
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING M. S. RAMAIAH INSTITUTE OF TECHNOLOGY (Autonomous Institute, Affiliated to VTU) BANGALORE-560054
www.msrit.edu
May 2011
CERTIFICATE
This is to certify that the project work titled Gene Recognition is carried out by 1MS07CS052 Mudra Hegde and 1MS07CS053 Nakul G V in partial fulfillment for the award of degree of Bachelor of Engineering in Computer Science and Engineering during the year 2011. The Project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for Bachelor of Engineering Degree. To the best of our understanding the work submitted in this report has not been submitted, in part or full, for the award of any diploma or degree of this or any other University.
Veena G S Guide
(External Examiner)
DECLARATION
We hereby declare that the entire work embodied in this report has been carried out by us at M. S. Ramaiah Institute of Technology under the supervision of Veena G S. This report has not been submitted in part or full for the award of any diploma or degree of this or any other University.
1MS07CS052 MUDRA HEGDE
1MS07CS053 NAKUL G V
Abstract
A gene consists of three regions: a promoter region, a coding region and a terminator region. Prokaryotic genes are relatively easy to find compared to eukaryotic genes because they lack introns. Eukaryotic genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (exons) interrupted by introns. Recognizing the promoter regions in a eukaryotic genome is a daunting task. A large amount of sequencing data has been generated, and it needs to be annotated. Annotation helps in gaining a better understanding of the organism, in drug discovery and in finding cures for various genetic disorders.
Gene finding, or gene prediction, in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.
In our work we try to find the coding and non-coding regions of an unlabeled string of DNA nucleotides using a Hidden Markov Model. A Hidden Markov Model (HMM) is a generalization of a Markov chain in which each (internal) state is not directly observable (hence the term hidden) but produces (emits) an observable random output (external) state, also called an emission, according to a given stationary probability law. HMMs employ dynamic programming algorithms such as the Viterbi and Forward-Backward algorithms to aid in gene recognition.
ACKNOWLEDGEMENT
We consider ourselves privileged to express gratitude and respect towards all those who guided us through the completion of this project. Firstly, we would like to thank Dr. R. Selva Rani, Head of Department, Department of Computer Science and Engineering, MSRIT, Bangalore for giving us the opportunity to do this project and also for the excellent lab facilities provided. We would like to express our gratitude to our project guide, Mrs. Veena G.S; we were privileged to experience her sustained, enthusiastic and involved interest. We would also like to express our gratitude to Dr K.G Srinivasa, Professor, Department of Computer Science and Engineering, for his support. We would also like to sincerely thank Mr. Shashidhara H S, Associate Professor, Information Science, for all the additional inputs he gave and for helping us gather more information on the various aspects involved in this project. We would like to thank the faculty of the Department of Computer Science and Engineering and the institute for extending a helping hand at every juncture of need and making this possible.
LIST OF FIGURES
Figure 3.1 Use-case Diagram
Figure 3.2 Flowchart
Figure 4.1 System Architecture
Figure 4.2 Input Component
Figure 4.3 Preprocessing
Figure 4.4 Output Component
Figure 4.5 Input Screen Shot1
Figure 4.6 Input Screen Shot2
Figure 4.7 Output Screen Shot
Figure 5.1 Gene Model
Contents
Abstract
Acknowledgements
List of Figures
Contents
1 Introduction
  1.1 General Introduction
  1.2 Statement of the Problem
  1.3 Objectives of the project
  1.4 Current Scope
  1.5 Future Scope
2 Literature Survey
  2.1 Prokaryotic Gene Structure
  2.2 Eukaryotic Gene Structure
  2.3 Hidden Markov Models
  2.4 GENSCAN Algorithm
3
  3.3
  3.4
  3.5
4 System Design
  4.1 Introduction and Design Overview
  4.2 System Architectural Design
    4.2.1 Chosen System Architecture
    4.2.2 Discussion of Alternative Designs
  4.3 Detailed Description of Components
    4.3.1 Input Component
    4.3.2 Preprocessing
    4.3.3 Computational Component
    4.3.4 Output Component
  4.4 User Interface Design
    4.4.1 Description of User Interface
    4.4.2 Screen Images
    4.4.3 Objects and Actions
  4.5 Test Plan
    4.5.1 Features to be Tested
5 Implementation
  5.1 Hidden Markov Models in Gene Recognition
6 Testing
  6.1 Introduction
    6.1.1 System Overview
    6.1.2 Test Approach
  6.2 Test Cases
    6.2.1 Case I
      6.2.1.1 Purpose
      6.2.1.2 Input
      6.2.1.3 Expected Output & Pass/ Fail Criteria
      6.2.1.4 Test Procedure
      6.2.1.5 Test Results
    6.2.2 Case II
      6.2.2.1 Purpose
      6.2.2.2 Input
      6.2.2.3 Expected Output & Pass/ Fail Criteria
      6.2.2.4 Test Procedure
      6.2.2.5 Test Results
7
8
Chapter 1
INTRODUCTION
1.1 General introduction
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have a well-defined nucleus; they have a single chromosome, which is contained within a nucleoid region. Their gene structure is much simpler than that of Eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes, which are large and linear.
A gene consists of three regions: a promoter region, a coding region and a terminator region. A promoter is a region of DNA that facilitates the transcription of a particular gene. Promoters are located near the genes they regulate, on the same strand and typically upstream (towards the 5' region of the sense strand). For transcription to take place, the enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences and response elements which provide a secure initial binding site for RNA polymerase and for proteins called transcription factors that recruit RNA polymerase.
The coding region of a gene is that portion of the gene's DNA or RNA, composed of exons, that codes for protein. The region is bounded nearer the 5' end by a start codon and nearer the 3' end by a stop codon. The coding region in mRNA is bounded by the five prime untranslated region and the three prime untranslated region, which are also parts of the exons.
Chapter 2
LITERATURE SURVEY
2.1 Prokaryotic Gene Structure
Organisms can be broadly classified into Prokaryotes and Eukaryotes. Prokaryotes are organisms that lack a nucleus and membrane-bound organelles. They have a single chromosome, contained within a nucleoid region rather than a membrane-bound nucleus, but may also have various small circular pieces of DNA called plasmids spread throughout the cell. Their gene structure is much simpler than that of Eukaryotic DNA.
The gene is the functional unit of DNA and a unit of heredity in a living organism. It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain that has a function in the organism. All living things depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring. A modern working definition of a gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and or other functional sequence regions".
The prokaryotic gene is made up of three regions, viz.
Promoter Region
Coding Region
Terminator Region
A promoter region is the part of the gene that facilitates the transcription process. Promoters are located near the genes that they regulate, on the same strand and upstream (towards the 5' region of the sense strand). For transcription to take place, the enzyme RNA polymerase has to bind to a location near the gene. The promoter region contains specific DNA sequences that form the binding site for the RNA polymerase. The promoter region in prokaryotes consists of two sequences. The first one, known as the Pribnow box, is the sequence of six nucleotides TATAAT. The other, located further upstream, has the six-nucleotide consensus TTGACA.
The coding region consists of the exons (prokaryotic genes are devoid of introns), which form the part of the DNA that is translated. The coding region starts with the initiation codon (ATG) and ends with a termination codon (TAG, TAA or TGA). The terminator region is the region that marks the end of the gene or the operon on genomic DNA for transcription (an operon is a functioning unit of the genomic material containing a cluster of genes under the control of a single promoter). Prokaryotic genes generally overlap with each other, which makes the detection of translation initiation sites and the prediction of prokaryotic genes difficult.
Many gene finding programs for prokaryotic genes have been developed, the earlier ones being ECOPARSE, ORPHEUS, GeneMark.hmm and the more recent ones such as GeneMark, GeneMarkS, EasyGene and GLIMMER. GeneMark uses a Bayesian method. GeneMark.hmm uses a modification of the Viterbi algorithm for an HMM (Hidden Markov Model) with duration to identify the most likely global path through hidden functional states given the DNA sequence. GeneMark.fba, an extension of the GeneMark algorithm, uses the forward-backward algorithm from HMM theory for local posterior decoding. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to a coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. In the Glimmer algorithm, Markov models of several orders are combined into an interpolated model for gene prediction.
2.2 Eukaryotic Gene Structure
The eukaryotic organism, as opposed to a prokaryotic organism, has a well-defined nucleus and membrane-bound organelles. In contrast to the prokaryotes, the eukaryotes have many chromosomes that are large and linear. The chromosomes in eukaryotes are also packaged by proteins into a structure known as chromatin, which allows very long chromosomes to fit into the nucleus. The eukaryotic gene is much more complex than a prokaryotic gene. The eukaryotic gene consists of
Promoter Region
Exons
Introns
Terminator Region
The Start Site contains a sequence of 7 bases (TATAAAA) called the TATA box. The basal or core promoter is found in all protein-coding genes. This is in sharp contrast to the upstream promoter, whose structure and associated binding factors differ from gene to gene. Many different genes and many different types of cells share the same transcription factors: not only those that bind at the basal promoter but even some of those that bind upstream. What turns on a particular gene in a particular cell is probably the unique combination of promoter sites and the transcription factors that are chosen.
Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically lie upstream of the gene and can have regulatory elements (enhancers) several kilobases away from the transcriptional start site. In eukaryotes, the transcriptional complex can cause the DNA to bend back on itself, which allows regulatory sequences to be placed far from the actual site of transcription. Many eukaryotic promoters, between 10 and 20% of all genes, contain a TATA box (sequence TATAAA), which binds a TATA binding protein that assists in the formation of the RNA polymerase transcriptional complex. The TATA box typically lies very close to the transcriptional start site (often within 50 bases).
Eukaryotic promoter regulatory sequences typically bind proteins called transcription factors, which are involved in the formation of the transcriptional complex.
An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule, either after portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by trans-splicing. The mature RNA molecule can be a messenger RNA or a functional form of a non-coding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the sequence in the DNA or its RNA transcript.
Genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called exons) interrupted by introns. The regions between genes are likewise not expressed, but may help with chromatin assembly, contain promoters, and so forth. Intron sequences contain some common features. Most introns begin with the sequence GT (GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among them. Intron sequences may be large relative to coding sequences; in some genes, over 90 percent of the sequence between the 5' and 3' ends of the mRNA is introns. RNA polymerase transcribes intron sequences. This means that eukaryotic mRNA precursors must be processed to remove introns as well as to add the cap at the 5' end and the polyadenylic acid (poly A) sequence at the 3' end.
2.3 Hidden Markov Models
A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine Learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and problems in computational biology. The main computational biology problems with HMM-based solutions are protein family profiling, protein binding site recognition and gene finding in DNA.
Gene finding, or gene prediction, in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.
A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov Model for application to more complex processes, including speech recognition and computational gene finding. In general, a Hidden Markov Model consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. A hidden model therefore involves two stochastic processes: the process of moving between states and the process of emitting an output sequence. The sequence of state transitions is a hidden process and is observed only through the sequence of emitted symbols.
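The four ingredients listed above (states, output alphabet, transition probabilities and emission probabilities) can be written down directly as data. The following is a minimal sketch, not the model used in this project: the two state names, the initial distribution and every probability value are assumptions chosen only for illustration. It shows the two stochastic processes at work, since the hidden path and the emitted sequence are generated side by side.

```python
import random

# Hypothetical two-state model; names and numbers are illustrative assumptions.
STATES = ("coding", "noncoding")
ALPHABET = ("A", "C", "G", "T")
INITIAL = {"coding": 0.5, "noncoding": 0.5}
TRANSITION = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMISSION = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def draw(distribution, rng):
    """Pick one key of `distribution` with probability equal to its value."""
    keys = list(distribution)
    weights = [distribution[key] for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]


def sample(length, seed=0):
    """Generate a hidden state path and the symbol sequence it emits."""
    rng = random.Random(seed)
    path, symbols = [], []
    state = draw(INITIAL, rng)
    for _ in range(length):
        path.append(state)                           # hidden process
        symbols.append(draw(EMISSION[state], rng))   # observable process
        state = draw(TRANSITION[state], rng)
    return path, "".join(symbols)


if __name__ == "__main__":
    hidden_path, observed = sample(12)
    print(observed)
    print(hidden_path)
```

An observer of this process sees only the printed nucleotide string; recovering the state path from it is the decoding problem that gene finding poses.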
The field of computational biology involves the application of computer science theories and approaches to biological and medical problems. Computational biology is motivated by newly available and abundant raw molecular datasets gathered from a variety of organisms.
Though the availability of this data marks a new era in biological research, it alone does not provide any biologically significant knowledge. The goal of computational biology is then to elucidate additional information regarding protein coding, protein function and many other cellular mechanisms from the raw datasets. This new information is required for drug design, medical diagnosis, medical treatment and countless fields of research.
Many effective tools based on HMMs have been created for the purpose of gene finding. Among the most successful tools are Genie, GeneID and HMMGene. Though each tool has a slightly different model, they all use the technique of combining several specialized submodels into a larger framework. The submodels correspond directly to different regions of DNA defined according to their function in the process of gene transcription. Most of the gene finding tools are hybrid models that include neural network components. In these tools, a neural network, instead of an HMM, models certain regions such as splice sites. The overall framework of an HMM-based gene finder combines the submodels into a larger model corresponding to the organization of a gene in DNA and its functional roles.
2.4 GENSCAN Algorithm
Identifying genes in DNA sequences by computational methods is a topic on which a lot of research has been done in the past few years. This problem deals with the precise sequence determinants of transcription, translation and RNA splicing. Software for exon prediction has become common in genome sequencing laboratories as a way to identify genes in newly sequenced regions. Early approaches to gene recognition concentrated on the prediction of individual functional elements such as promoters, coding regions, splice sites and so on. More recent approaches to gene finding focus on integrating all these factors. Some examples of such approaches are FGENEH, GENMARK, Gene ID, Genie, GeneParser and GRAIL II. Two important limitations of the existing algorithms are that they assume the input sequence contains exactly one gene, and that their accuracy, when measured on independent control sets, may be lower than was presumed: only about 50% of the exons are actually identified.
GENSCAN uses a general probabilistic model for human genomic sequences. The overall architecture that the model uses is the Generalized Hidden Markov Model. This algorithm differs from the other algorithms in the following aspects:
i) A double-stranded DNA sequence is considered, with potential genes on both strands of the DNA analyzed simultaneously and in an integrated fashion.
ii) The assumption made by other algorithms, that the input sequence has exactly one complete gene, is not made here. The model allows for an input sequence that has a partial gene, a complete gene, multiple complete genes or no gene at all.
iii) It introduces a new method, Maximum Dependence Decomposition, to model the functional signals in DNA sequences.
Chapter 3
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have a well-defined nucleus; they have a single chromosome, which is contained within a nucleoid region. Their gene structure is much simpler than that of Eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes, which are large and linear. A gene consists of three regions: a promoter region, a coding region and a terminator region.
Gene Recognition is a particularly difficult problem in Bioinformatics, and the complexity of DNA sequences makes the task even more daunting. The problem can be solved to a certain extent by using Hidden Markov Models. A Hidden Markov Model is a generalization of a Markov chain in which each (internal) state is not directly observable (hence the term hidden) but produces (emits) an observable random output (external) state, also called an emission, according to a given stationary probability law. In this case, the time evolution of the internal states can be inferred only through the sequence of the observed output states. If the number of internal states is N, the transition probability law is described by a matrix with N times N values; if the number of emissions is M, the emission probability law is described by a matrix with N times M values. A model is considered defined once these two matrices and the initial distribution of the internal states are given.
The most used algorithms in Hidden Markov Models are:
1. The Forward algorithm: computes the probability of the observed emissions (given a model), working from the beginning of the sequence.
2. The Backward algorithm: computes the probability of the observed emissions (given a model), working from the end of the sequence.
3. The Viterbi algorithm: finds the sequence of internal states that has, as a whole, the highest probability.
Of these, the Viterbi algorithm is the most widely used.
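The forward and backward passes listed above can be sketched as follows. This is an illustrative sketch rather than the project's own code: the two-state model (N = 2 states, M = 4 emissions), its state names and all of its probabilities are assumptions, and the function names `forward`, `backward` and `posterior` are hypothetical.

```python
# Hypothetical model: N = 2 internal states, M = 4 emissions (assumed values).
STATES = ("coding", "noncoding")
INIT = {"coding": 0.5, "noncoding": 0.5}              # initial distribution
TRANS = {                                             # N x N transition matrix
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.2, "noncoding": 0.8},
}
EMIT = {                                              # N x M emission matrix
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(obs, states, init, trans, emit):
    """f[t][s] = P(obs[0..t], state at position t is s)."""
    f = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for symbol in obs[1:]:
        prev = f[-1]
        f.append({s: emit[s][symbol] * sum(prev[r] * trans[r][s] for r in states)
                  for s in states})
    return f


def backward(obs, states, trans, emit):
    """b[t][s] = P(obs[t+1..end] | state at position t is s)."""
    b = [{s: 1.0 for s in states}]
    for symbol in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {s: sum(trans[s][r] * emit[r][symbol] * nxt[r] for r in states)
                     for s in states})
    return b


def posterior(obs, states, init, trans, emit):
    """Posterior probability of each state at each position of obs."""
    f = forward(obs, states, init, trans, emit)
    b = backward(obs, states, trans, emit)
    total = sum(f[-1].values())                       # P(obs | model)
    return [{s: f[t][s] * b[t][s] / total for s in states}
            for t in range(len(obs))]


if __name__ == "__main__":
    for row in posterior("ACGGCGTA", STATES, INIT, TRANS, EMIT):
        print({state: round(p, 3) for state, p in row.items()})
```

Both passes yield the same overall probability of the sequence, which makes a convenient consistency check on an implementation.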
3.2 General Description
3.3 Specific Requirements
3.4 Interface Requirements
[Figure 3.1 Use-case Diagram: user]
[Figure 3.2 Flowchart: START; INPUT COMPONENT; STARTS WITH '>' ? (NO: PRINT ERROR MESSAGE; YES: REMOVE DESCRIPTION); PREPROCESSING; COMPUTATIONAL COMPONENT; DISPLAY OUTPUT]
3.5 Performance
The forward and backward algorithms are expected to run in $O(N^2 L)$ time, where N is the number of states and L is the length of the sequence. At run time, warnings are issued when iterators follow an inconsistent or out-of-bounds path, and when negative probabilities are encountered. To improve efficiency, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups.
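The quadratic dependence on the number of states can be read off the forward recurrence. The notation below is assumed here for illustration and does not appear elsewhere in this report: $\pi_j$ is the initial probability of state $j$, $a_{ij}$ the transition probability from state $i$ to state $j$, and $e_j(x_t)$ the probability that state $j$ emits symbol $x_t$.

```latex
f_1(j) = \pi_j \, e_j(x_1), \qquad
f_t(j) = e_j(x_t) \sum_{i=1}^{N} f_{t-1}(i) \, a_{ij}, \quad t = 2, \ldots, L
```

There are $N \cdot L$ values $f_t(j)$ to fill in, and each requires a sum over $N$ predecessor states, which gives on the order of $N^2 L$ operations. The backward pass has the same shape and therefore the same cost.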
Chapter 4
SYSTEM DESIGN
4.1 Introduction and Design Overview
This project deals with one of the most challenging problems in Bioinformatics, i.e. Gene Recognition. Gene prediction in DNA has become one of the foremost computational biology problems for two reasons: first, because completely sequenced genomes have become readily available and, more importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge.
This document deals with the design of the architecture used for this project. The problem of Gene Recognition has been approached using numerous methods, of which Hidden Markov Models are by far the most widely used. Hidden Markov Models do not present the problems that arise when Gene Recognition is performed by pattern matching.
The input is an unannotated DNA sequence in FASTA format. It first undergoes preprocessing, in which it is checked for the correct format and copied to a temporary file without the description of the sequence. The sequence then goes through the computation phase, where the Viterbi and forward-backward algorithms are used to find the optimal path and hence the gene. This output is then displayed to the user.
One of the alternative designs proposed for the recognition of genes was pattern matching using parallel programming. It involves recognizing start codons, promoter sequences, terminator codons and intrinsic terminator sequences in a given DNA sequence, and hence recognizing the gene. This design failed for several reasons and was also found to be an inefficient way of locating genes.
The first step involved identifying the terminator codons in parallel by dividing the input sequence into blocks of a fixed number of nucleotides. The major drawback here was that certain sequences in the intergenic region also matched the terminator codons. Another major drawback is that the start codon, ATG, also codes for the amino acid methionine. Hence, the sequence ATG may also be present inside the coding region, which makes it difficult to recognize the start of the gene using pattern matching.
Another difficulty concerned the promoter sequences. Prokaryotic genes have two fixed promoter sequences that mark the start of a gene. For eukaryotic genes, however, the complexity of the promoter sequences increases. The sequences are extremely diverse and difficult to characterize. They may lie several kilobases away from the transcriptional start site, which makes it highly inefficient to search for them using pattern matching. The nucleotides in the consensus sequence may also vary from gene to gene, so it becomes very difficult to search for the consensus sequence in the given DNA sequence.
The termination of a gene may also be identified using sequences known as intrinsic terminators. An intrinsic terminator is an inverted repeat (5' CAGTTA|TAACTG 3') followed by up to six thymine nucleotides (TTTTTT). The complication here is that the inverted repeat may be of any length. Partial genes would also be difficult to recognize using pattern matching.
Hence, we planned to use Hidden Markov Models for gene recognition.
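The ambiguity of codon pattern matching described above is easy to reproduce. The sketch below is illustrative only: the sequence is a made-up example (assumed layout `CC | ATG AAA ATG GGT TAA | CC`) and `find_codons` is a hypothetical helper, not part of the rejected design.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_codons(sequence, codons):
    """Return every index where one of `codons` occurs, in any reading frame."""
    return [i for i in range(len(sequence) - 2) if sequence[i:i + 3] in codons]


# Hypothetical sequence with one short gene: CC | ATG AAA ATG GGT TAA | CC
sequence = "CCATGAAAATGGGTTAACC"

starts = find_codons(sequence, {"ATG"})
stops = find_codons(sequence, STOP_CODONS)
print("ATG hits: ", starts)
print("stop hits:", stops)
```

In this example the matcher reports two ATG hits although only the first begins the gene (the second is a methionine codon inside the coding region), and it reports an out-of-frame TGA that overlaps the start codon in addition to the true stop codon. Pattern matching alone cannot tell these hits apart, whereas a probabilistic model scores whole paths through the sequence.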
4.3.2 Preprocessing
The description in the input file is removed and the remaining input, which is the DNA sequence, is copied to a temporary file. If the user enters data in any format other than FASTA, an appropriate error message is displayed.
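A minimal sketch of this preprocessing step is given below, under a few assumptions: the function name `preprocess` is hypothetical, the sequence is returned as a string rather than written to a temporary file, and the check that only A, C, G and T remain is an extra validation not described in this report.

```python
def preprocess(fasta_text):
    """Validate FASTA input and return the bare DNA sequence in upper case."""
    lines = fasta_text.strip().splitlines()
    # A FASTA record begins with a description line that starts with '>'.
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Input is not in FASTA format: first line must start with '>'")
    # Remove the description and join the remaining lines into one sequence.
    sequence = "".join(line.strip() for line in lines[1:]).upper()
    if not sequence:
        raise ValueError("FASTA record contains no sequence")
    # Assumption: reject anything other than the four nucleotides.
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError("Unexpected symbols in sequence: " + ", ".join(sorted(invalid)))
    return sequence


if __name__ == "__main__":
    print(preprocess(">example description line\nATGGCT\ngaataa\n"))
```

Raising an exception keeps the error handling in one place; the user interface can catch it and display the message to the user.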
[Figure 4.3 Preprocessing: INPUT COMPONENT; REMOVE DESCRIPTION; COMPUTATIONAL COMPONENT]
4.3.3 Computational Component
The computational component of this project is the application of Hidden Markov Models. The preprocessed sequence is taken as input and the Viterbi algorithm and the forward-backward algorithm are applied to it. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that generates a given output sequence under the model parameters. The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions.
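As an illustration of the Viterbi recursion described above, here is a minimal log-space sketch for a generic HMM. It is not the project's implementation: the two-state model, its state names and its probabilities are assumptions chosen so that the result is easy to check by eye.

```python
import math

# Hypothetical model: GC-rich "coding" state, AT-rich "noncoding" state.
STATES = ("coding", "noncoding")
INIT = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def safe_log(p):
    """Logarithm that maps probability zero to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(obs, states, init, trans, emit):
    """Return the most likely hidden state path for the observed string."""
    score = [{s: safe_log(init[s]) + safe_log(emit[s][obs[0]]) for s in states}]
    back = []                                   # back[t][s]: best predecessor of s
    for symbol in obs[1:]:
        prev, row, pointers = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + safe_log(trans[r][s]))
            row[s] = prev[best] + safe_log(trans[best][s]) + safe_log(emit[s][symbol])
            pointers[s] = best
        score.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("AATTGGCCGGAATT", STATES, INIT, TRANS, EMIT))
```

Working with logarithms turns products of many small probabilities into sums, which avoids numerical underflow on long DNA sequences.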
Text Area1: To enter the input sequence directly. The text area is preceded by the statement "Enter the input sequence in FASTA format".
Upload File button: If the user does not want to input the DNA sequence directly, he may use this button to upload a file that contains the input sequence. When this button is clicked, a text area appears where the user may enter the file name.
Submit button: This button is clicked to submit the sequence or the file for annotation. Once it is clicked, the sequence is annotated and the output is displayed.
Text Area2: This text area is used to display the output, i.e. the annotated sequence. It is preceded by the statement "The annotated output sequence is".
Exit button: This button is used to exit the screen.
Chapter 5
IMPLEMENTATION
Input: A sequence of DNA $X = (x_1, \ldots, x_n) \in \Sigma^*$, where $\Sigma = \{A, C, G, T\}$.
Output: Correct labeling of each element in X as belonging to a coding region, non-coding region or intergenic region.
Gene finding becomes complicated when the problem is approached in more biological detail. A eukaryotic gene contains coding regions called exons, which may be interrupted by non-coding regions called introns. The exons and introns are separated by splice sites. Regions outside genes are called intergenic. The goal of gene finding is then to annotate the sets of genomic data with the location of genes and, within these genes, specific areas such as promoter regions, introns and exons.
The Hidden Markov Model that we have used in our project is shown below:
[Figure 5.1 Gene Model: states Start, Background, Start Codon, Gene, Stop and End; transitions labeled full and bgstart; emissions labeled emit background, emit start, emit codon, emit stop and empty]
The Gene Model shown above consists of six states, viz. Start, Background, Start Codon, Gene, Stop and End. These six states are divided into three blocks. Block 1 consists of just the Start state, which is where scanning of the sequence begins. Block 2 consists of four states: Background, Start Codon, Gene and Stop. Block 3 consists of a single state, End. These states constitute the finite set of states of the Hidden Markov Model. The alphabet of output symbols of this HMM is the set of nucleotides A, T, G and C. The model moves from one state to another according to the state transition probabilities.
The transition from the Start state to the Background state has probability full, i.e. 1.0. The Background state accounts for the intergenic region, where any of the nucleotides A, T, G or C may be encountered with equal probability. Its emission probability is therefore emitbackground = 0.25. The model remains in the Background state until a start codon is encountered; the probability of remaining in the Background state is bgbg = full - bgend - bgstart.
Once the start codon ATG is found, the model moves from the Background state to the Start Codon state with probability bgstart = Gene Density. The emission probability of the Start Codon state is emitstart: 1.0 if ATG is encountered and 0.0 otherwise. A start codon leads into a gene, so the model moves from the Start Codon state to the Gene state with probability full, i.e. 1.0.
The Gene state usually emits more than one codon, so the probability of staying in it is extend = 1.0 - (1.0 / Gene Length). Its emission probability is emitcodon. This state emits any of the 61 codons that code for the amino acids forming the protein, that is, every codon except the stop codons TAA, TGA and TAG. The emission probability is 1/61 for each of these codons and 0.0 for a stop codon.
When one of the stop codons (TAA, TAG or TGA) is encountered, the model moves from the Gene state to the Stop state with probability genestop = 1.0 / Gene Length. The Stop state emits the stop codons, so its emission probability is 1/3 for a stop codon and 0.0 otherwise. The stop codon marks the end of that particular gene, but the DNA sequence may contain more nucleotides, so the model returns from the Stop state to the Background state with probability full, i.e. 1.0. Once the entire sequence has been scanned, a transition from the Background state to the End state takes place with probability bgend = 0.0001. The End state belongs to Block 3 of the model and its emission is empty.
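The parameters described above can be collected into a small table, which also makes it easy to check that the probabilities leaving each state sum to 1. The following is a sketch under stated assumptions: the report does not give numeric values for Gene Density or Gene Length, so the values below are placeholders, Gene Length is assumed to be measured in codons, and the helper names `emission` and `path_probability` are hypothetical.

```python
GENE_DENSITY = 0.05   # assumption: no value is stated in the text
GENE_LENGTH = 10.0    # assumption: expected gene length, taken to be in codons

FULL = 1.0
BGEND = 0.0001                      # from the text
BGSTART = GENE_DENSITY
BGBG = FULL - BGEND - BGSTART
GENESTOP = 1.0 / GENE_LENGTH
EXTEND = 1.0 - GENESTOP

# The eight transitions of the gene model.
TRANSITIONS = {
    "Start": {"Background": FULL},
    "Background": {"Background": BGBG, "StartCodon": BGSTART, "End": BGEND},
    "StartCodon": {"Gene": FULL},
    "Gene": {"Gene": EXTEND, "Stop": GENESTOP},
    "Stop": {"Background": FULL},
}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def emission(state, chunk):
    """Emission probability of a nucleotide (Background) or codon (other states)."""
    if state == "Background":                                  # emitbackground
        return 0.25 if len(chunk) == 1 and chunk in "ACGT" else 0.0
    if state == "StartCodon":                                  # emitstart
        return 1.0 if chunk == "ATG" else 0.0
    if state == "Gene":                                        # emitcodon
        is_codon = len(chunk) == 3 and set(chunk) <= set("ACGT")
        return 1.0 / 61.0 if is_codon and chunk not in STOP_CODONS else 0.0
    if state == "Stop":                                        # emitstop
        return 1.0 / 3.0 if chunk in STOP_CODONS else 0.0
    return 0.0


def path_probability(steps):
    """Joint probability of a labeled path given as (state, emitted chunk) pairs."""
    probability, previous = 1.0, "Start"
    for state, chunk in steps:
        probability *= TRANSITIONS[previous].get(state, 0.0) * emission(state, chunk)
        previous = state
    return probability * TRANSITIONS[previous].get("End", 0.0)


if __name__ == "__main__":
    gene_path = [("Background", "C"), ("Background", "C"), ("StartCodon", "ATG"),
                 ("Gene", "GCT"), ("Gene", "GAA"), ("Stop", "TAA"),
                 ("Background", "C"), ("Background", "C")]
    background_path = [("Background", base) for base in "CCATGGCTGAATAACC"]
    print(path_probability(gene_path), path_probability(background_path))
```

With these placeholder values the path that labels the hypothetical sequence `CCATGGCTGAATAACC` as containing a gene is more probable than the path that treats it entirely as background, which is the comparison the decoding algorithms make implicitly over all paths.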
The algorithms used in the implementation of this project are the Forward-Backward algorithm and the Viterbi algorithm. The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations $o_{1:t} := o_1, \dots, o_t$, i.e. it computes, for all hidden state variables $X_k \in \{X_1, \dots, X_t\}$, the distribution $P(X_k \mid o_{1:t})$. The algorithm makes use of the principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward-backward algorithm.

The forward pass scans the sequence in the forward direction and accumulates, for each state in the finite set of states, the probability of the observations seen so far. In our implementation it is implemented in a function called Forward, which returns the next likely state. The 8 transitions of our gene model are defined in this function. The backward pass serves the same purpose but scans the sequence in the reverse direction. In our implementation it is implemented in a function called Backward, which also returns the next likely state.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources and, more generally, hidden Markov models. The algorithm makes a number of assumptions.
First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time. Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t, and the most likely sequence at point t - 1.
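To make the two passes of the forward-backward algorithm described earlier in this section concrete, here is a minimal sketch on a toy two-state model. It is not the project's Forward and Backward functions: the function name `forward_backward`, the state names and all probabilities below are assumptions chosen only for illustration. The sketch multiplies raw probabilities, so it would underflow on long sequences; a practical version would use scaling or log-probabilities.

```python
def forward_backward(obs, states, start, trans, emit):
    """Return (posteriors, likelihood) for a non-empty observation sequence.

    posteriors[t][s] is the probability that the hidden state at position t
    is s, given the whole observation sequence.
    """
    n = len(obs)
    fwd = [dict() for _ in range(n)]
    bwd = [dict() for _ in range(n)]

    # Forward pass: probability of obs[0..t] with the model in state s at t.
    for s in states:
        fwd[0][s] = start[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in states:
            reach = sum(fwd[t - 1][r] * trans[r][s] for r in states)
            fwd[t][s] = reach * emit[s][obs[t]]

    # Backward pass: probability of obs[t+1..] given state s at position t.
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in states
            )

    likelihood = sum(fwd[n - 1][s] for s in states)
    posteriors = [
        {s: fwd[t][s] * bwd[t][s] / likelihood for s in states} for t in range(n)
    ]
    return posteriors, likelihood


# Toy model (assumed values): a uniform background and a GC-rich "gene" state.
STATES = ("background", "gene")
START = {"background": 0.9, "gene": 0.1}
TRANS = {
    "background": {"background": 0.9, "gene": 0.1},
    "gene": {"background": 0.2, "gene": 0.8},
}
EMIT = {
    "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "gene": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

posteriors, likelihood = forward_backward("ACGGCGTA", STATES, START, TRANS, EMIT)
```

Each entry of `posteriors` is a distribution over the two states, so its values sum to 1 at every position of the sequence.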
We have implemented the Viterbi algorithm in two functions, Viterbi_trace and Viterbi_recurse. Viterbi_trace finds the most likely path in the forward direction, whereas Viterbi_recurse finds the path in the reverse direction. Both functions return the most likely path followed by the sequence. Another function, addEdge, assigns an edge from the "from" state to the "to" state, taking the transition, its probability and its emissions as parameters.
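The sketch below shows how a Viterbi search over the gene model of this chapter could look. It is an illustration under stated assumptions, not the Viterbi_trace and Viterbi_recurse functions of the project: the names `viterbi_genes`, `emission` and `EMIT_LEN` are hypothetical, the Background state is assumed to emit one nucleotide while the Start Codon, Gene and Stop states each emit one codon, and the Gene Density and Gene Length values in the usage example are toy values chosen so that a very short gene is preferred over pure background.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: Background emits one nucleotide, the other states one codon.
EMIT_LEN = {"Background": 1, "StartCodon": 3, "Gene": 3, "Stop": 3}


def emission(state, symbol):
    """Emission probabilities as described for the gene model."""
    if state == "Background":
        return 0.25
    if state == "StartCodon":
        return 1.0 if symbol == "ATG" else 0.0
    if state == "Gene":
        return 0.0 if symbol in STOP_CODONS else 1.0 / 61
    return 1.0 / 3 if symbol in STOP_CODONS else 0.0  # Stop state


def viterbi_genes(seq, gene_density, gene_length, bgend=0.0001):
    """Return half-open (start, end) intervals of genes on the best path."""
    trans = {
        ("Start", "Background"): 1.0,
        ("Background", "Background"): 1.0 - bgend - gene_density,
        ("Background", "StartCodon"): gene_density,
        ("StartCodon", "Gene"): 1.0,
        ("Gene", "Gene"): 1.0 - 1.0 / gene_length,
        ("Gene", "Stop"): 1.0 / gene_length,
        ("Stop", "Background"): 1.0,
    }
    n = len(seq)
    # best[i][s]: best log-probability of emitting seq[:i] with s emitting last.
    best = [dict() for _ in range(n + 1)]
    back = [dict() for _ in range(n + 1)]
    best[0]["Start"] = 0.0
    for i in range(1, n + 1):
        for state, length in EMIT_LEN.items():
            if i < length:
                continue
            e = emission(state, seq[i - length:i])
            if e == 0.0:
                continue
            for prev, score in best[i - length].items():
                p = trans.get((prev, state), 0.0)
                if p == 0.0:
                    continue
                cand = score + math.log(p) + math.log(e)
                if cand > best[i].get(state, -math.inf):
                    best[i][state] = cand
                    back[i][state] = prev
    # Only Background can move to End, so the path must finish in Background.
    # The Background -> End factor is common to all such paths and is omitted.
    if "Background" not in best[n]:
        return []
    genes, i, state, gene_end = [], n, "Background", None
    while state != "Start":
        if state == "Stop":
            gene_end = i
        elif state == "StartCodon":
            genes.append((i - 3, gene_end))
        prev = back[i][state]
        i -= EMIT_LEN[state]
        state = prev
    return genes[::-1]


# Toy parameters (assumed): one gene, ATG AAA CCC GGG TAA, at positions 2..17.
print(viterbi_genes("CCATGAAACCCGGGTAACC", gene_density=0.01, gene_length=5))
# [(2, 17)]
```

The search works in log space to avoid numerical underflow, and it records a back-pointer for every (position, state) pair so that the most likely path can be recovered by walking backwards from the end of the sequence.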
Chapter 6
TESTING
6.1 Introduction
6.2.1.2 Input
The input is a DNA sequence of Saccharomyces cerevisiae in FASTA format, supplied as a .fasta file. The sequence is:
>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACA TGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAA GCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACA ACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATT AGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCG TCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGT ACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCAC AACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAA CTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGA AGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATC TACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTAC TATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCC CCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGT AGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAG TTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTC
AATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTT GTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTAT GGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGAC CGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCG CCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATA AACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTT TCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAAT AGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTAT CTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGG GTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAAT CCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTG TCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGT TCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTAC TAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAG TGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAG CTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACG TCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAA ATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTT CATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAG TAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTA CCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTAC ACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATT GGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGA TGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAG AAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGG TCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTG TCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACT TTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGG ACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAAC 
GTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAAT CACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTT TTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAA ATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGA TTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATAC CCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTT TGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAA ATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAA CCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCT TCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCT
GTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTG TCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCG TCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTA AAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGT GCGTTAATTTAGAAAGAACATCCAGTATAAGTTCTTCTATATAGTCAATTAAAGCAGGATGCCTAT TAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGG TATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCT ACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCG CAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAAC AGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAA ATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTT ATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAG AATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAAT AACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCAT CACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAAT CATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGA ATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGC TCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAG CTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTT CAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTT TTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTT CTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTG ATC
CCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAAC GGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTC ACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAAT TGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATT GCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAG GTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATC AACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGAT CCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGAT AATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTT TCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGG ATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTT TTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCA AGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATT TCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACA TCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGT TCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCT CCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATA AAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTT GCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATT AGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAA CCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAA CACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAG ATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAG CAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGT ATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCT GATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGA AGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACA GCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCAT CGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAAC 
AATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCAT AGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATG TCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGAttatacgcaacgatattttgctt aattttattttcctgttttattttttattagtggtttacagataccctatattttatttagtttttatacttagagacatttaattttaattccattcttcaaatttcatttttgcacttaaaa caaagatccaaaaatgctctcgccctcttcatattgagaatacactccattcaaaattttgtcgtcaccgctgattaatttttcactaaactgatgaataatcaaaggcccc acgtcagaaccgactaaagaagtgagttttattttaggaggttgaaaaccattattgtctggtaaattttcatcttcttgacatttaacccagtttgaatccctttcaatttctg ctttttcctccaaactatcgaccctcctgtttctgtccaacttatgtcctagttccaattcgatcgcattaataactgcttcaaatgttattgtgtcatcgttgactttaggtaatt tctccaaatgcataatcaaactatttaaggaagatcggaattcgtcgaacacttcagtttccgtaatgatctgatcgtctttatccacatgttgtaattcactaaaatctaaa acgtatttttcaatgcataaatcgttctttttattaataatgcagatggaaaatctgtaaacgtgcgttaatttagaaagaacatccagtataagttcttctatatagtcaatta aagcaggatgcctattaatgggaacgaactgcggcaagttgaatgactggtaagtagtgtagtcgaatgactgaggtgggtatacatttctataaaataaaatcaaat
taatgtagcattttaagtataccctcagccacttctctacccatctattcataaagctgacgcaacgattactattttttttttcttcttggatctcagtcgtcgcaaaaacgta taccttctttttccgaccttttttttagctttctggaaaagtttatattagttaaacagggtctagtcttagtgtgaaagctagtggtttcgattgactgatattaagaaagtgg aaattaaattagtagtgtagacgtatatgcatatgtatttctcgcctgtttatgtttctacgtacttttgatttatagcaaggggaaaagaaatacatactattttttggtaaag gtgaaagcataatgtaaaagctagaataaaatggacgaaataaagagaggcttagttcatcttttttccaaaaagcacccaatgataataactaaaatgaaaaggattt gccatctgtcagcaacatcagttgtgtgagcaataataaaatcatcacctccgttgcctttagcgcgtttgtcgtttgtatcttccgtaattttagtcttatcaatgggaatc ataaattttccaatgaattagcaatttcgtccaattctttttgagcttcttcatatttgctttggaattcttcgcacttcttttcccattcatctctttcttcttccaaagcaacgatc cttctacccatttgctcagagttcaaatcggcctctttcagtttatccattgcttccttcagtttggcttcactgtcttctagctgttgttctagatcctggtttttcttggtgtagt tctcattattagatctcaagttattggagtcttcagccaattgctttgtatcagacaattgactctctaacttctccacttcactgtcgagttgctcgtttttagcggacaaag atttaatctcgttttctttttcagtgttagattgctctaattctttgagctgttctctcagctcctcatatttttcttgccatgactcagattctaattttaagctattcaatttctctttg atc
The gene should be correctly identified, with the start codon ATG and the stop codon TGA appropriately marked.
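A test of this kind can be checked mechanically by confirming that the marked region has the structure the gene model requires. The following is a minimal sketch; the function name `is_complete_gene` is hypothetical and is not part of the project code. It assumes, as in the gene model, that a gene consists of a start codon, at least one coding codon and a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_gene(region):
    """Check that a marked region is ATG, one or more coding codons, then a stop."""
    region = region.upper()
    # Start codon + at least one Gene codon + stop codon = 9 nucleotides minimum.
    if len(region) < 9 or len(region) % 3 != 0:
        return False
    codons = [region[i:i + 3] for i in range(0, len(region), 3)]
    starts_with_atg = codons[0] == "ATG"
    ends_with_stop = codons[-1] in STOP_CODONS
    no_internal_stop = not any(c in STOP_CODONS for c in codons[1:-1])
    return starts_with_atg and ends_with_stop and no_internal_stop


print(is_complete_gene("ATGAAACCCTGA"))  # True
print(is_complete_gene("ATGTAAAAATGA"))  # False: in-frame stop codon inside the gene
```

The check is deliberately structural: it does not decide whether the region is the biologically correct gene, only whether the marked start and stop codons are in the same reading frame with no stop codon between them.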
6.2.2.2 Input
The input is an mRNA sequence of Saccharomyces cerevisiae in FASTA format, supplied as a .fasta file. The sequence is:
>gi|296148533|ref|NM_001183937.1| Saccharomyces cerevisiae S288c Rny1p (RNY1) mRNA, complete cds ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA
AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
Chapter 7
Chapter 8