Bachelor of Engineering in Computer Science & Engineering: Gene Recognition

Gene Recognition
A project report submitted to
M S Ramaiah Institute of Technology

An Autonomous Institute, Affiliated to
Visvesvaraya Technological University, Belgaum in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science & Engineering

Submitted by Mudra Hegde Nakul G V Under the guidance of Veena G S Assistant Professor Computer Science and Engineering M S Ramaiah Institute of Technology 1MS07CS052 1MS07CS053
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING M.S.RAMAIAH INSTITUTE OF TECHNOLOGY (Autonomous Institute, Affiliated to VTU) BANGALORE-560054
www.msrit.edu
May 2011
Gene Recognition
A project report submitted to
M. S. Ramaiah Institute of Technology

An Autonomous Institute, Affiliated to
Visvesvaraya Technological University, Belgaum in partial fulfillment for the award of the degree of
Bachelor of Engineering in Computer Science & Engineering

Submitted by Mudra Hegde Nakul G V Under the guidance of Veena G S Assistant Professor Computer Science and Engineering M S Ramaiah Institute of Technology 1MS07CS052 1MS07CS053
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING M. S. RAMAIAH INSTITUTE OF TECHNOLOGY (Autonomous Institute, Affiliated to VTU) BANGALORE-560054
www.msrit.edu
May 2011
Department of Computer Science & Engineering

(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054
CERTIFICATE
This is to certify that the project work titled Gene Recognition is carried out by 1MS07CS052 Mudra Hegde and 1MS07CS053 Nakul G V in partial fulfillment for the award of degree of Bachelor of Engineering in Computer Science and Engineering during the year 2011. The Project report has been approved as it satisfies the academic requirements with respect to the project work prescribed for Bachelor of Engineering Degree. To the best of our understanding the work submitted in this report has not been submitted, in part or full, for the award of any diploma or degree of this or any other University.
Veena G S Guide
(Dr.R. Selvarani) Head, Dept of CSE
(External Examiner)
Department of Computer Science & Engineering

(Autonomous Institute, Affiliated to VTU)
BANGALORE-560054
DECLARATION
We hereby declare that the entire work embodied in this report has been carried out by us at M. S. Ramaiah Institute of Technology under the supervision of Veena G S. This report has not been submitted in part or full for the award of any diploma or degree of this or any other University. 1MS07CS052 1MS07CS053 MUDRA HEGDE NAKUL G V
Abstract
A gene consists of three regions, viz, a promoter region, a coding region and a terminator region. Prokaryotic genes are relatively easy to find compared to Eukaryotic genes because they lack introns. Genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (exons) interrupted by introns. The recognition of the promoter regions in a eukaryotic genome is a daunting task. There is a lot of sequencing data that has been generated and they need to be annotated. It helps in having a better understanding of the organism, in drug discovery and also for finding a cure for various genetic disorders. Gene Finding or gene prediction in DNA has become one of the foremost computational biology problems for two reasons. Firstly, because completely sequenced genomes have become readily available and most importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge. In our work we try to find the coding and non-coding regions of an unlabeled string of DNA nucleotides using Hidden Markov model. A Hidden Markov Model is a generalization of a Markov chain, in which each (internal) state is not directly observable (hence the term hidden) but produces (emits) an observable random output (external) state, also called emission, according to a given stationary probability law. recognition. HMM employs dynamic programming algorithms like Viterbi, Forward-Backward algorithms to aid in gene
ACKNOWLEDGEMENT We consider ourselves privileged to express gratitude and respect towards all those who guided us through the completion of this project. Firstly, we would like to thank Dr. R. Selva Rani, Head of Department, Department of Computer Science and Engineering, MSRIT, Bangalore for giving us the opportunity to do this project and also for the excellent lab facilities provided. We would like to express our gratitude to our project guide, Mrs. Veena G.S; we are privileged to experience a sustained enthusiastic and involved interest from her side. We would also like to express our gratitude to Dr K.G Srinivasa, Professor, Department of Computer Science and Engineering for his support. We would also like to sincerely thank Mr. Shashidhara H S, Associate Professor, Information Science, for all the additional inputs he has given and also helped us gather more information on the various aspects involved in this project. We would like to thank the faculty of the Department of Computer Science and Engineering and the institute for extending a helping hand at every juncture of need and making this possible.
Mudra Hegde Nakul G.V
ii
LIST OF FIGURES
Figure 3.1 Use-case Diagram........................................................................................ 14 Figure 3.2 Flowchart..................................................................................................... 15 Figure 4.1 System Architecture..................................................................................... 18 Figure 4.2 Input Component.......................................................................................... 20 Figure 4.3 Preprocessing.................................................................................................21 Figure 4.4 Output Component........................................................................................22 Figure 4.5 Input Screen Shot1.........................................................................................23 Figure 4.6 Input Screen Shot2.........................................................................................24 Figure 4.7 Output Screen Shot........................................................................................25 Figure 5.1 Gene Model................................................................................................... 30
iii
Contents
Abstract Acknowledgements List of Figures Contents i ii iii iv
Introduction
1.1 1.2 1.3 1.4 1.5 General Introduction Statement of the Problem Objectives of the project Current Scope Future Scope
Literature Survey
2.1 2.2 2.3 2.4 Prokaryotic Gene Structure Eukaryotic Gene Structure Hidden Markov Models GENSCAN Algorithm
Software Requirements Specification

3.1 3.2 Introduction 3.1.1 Purpose 3.1.2 Scope of the Project General Description 3.2.1 Project Perspective 3.2.2 End User Expectation 3.2.3 General Constraints 3.2.4 Assumptions and Dependencies Specific Requirements 3.3.1 Functional Requirements 3.3.2 Non Functional Requirements 3.3.3 Software Requirements 3.3.4 Hardware Requirements Interface Requirements 3.4.1 User Interface Performance
3.3
3.4 3.5
System Design
4.1 4.2 Introduction and Design Overview System Architectural Design 4.2.1 Chosen System Architecture 4.2.2 Discussion of Alternative Designs
4.3
4.4
4.5
Detailed Description of Components 4.3.1 Input Component 4.3.2 Preprocessing 4.3.3 Computational Component 4.3.4 Output Component User Interface Design 4.4.1 Description of User Interface 4.4.2 Screen Images 4.4.3 Objects and Actions Test Plan 4.5.1 Features to be Tested
5 6
Implementation
5.1 6.1 6.2 Hidden Markov Models in Gene Recognition Introduction 6.1.1 System Overview 6.1.2 Test Approach Test Cases 6.2.1 Case I 6.2.1.1 Purpose 6.2.1.2 Input 6.2.1.3 Expected Output & Pass/ Fail Criteria 6.2.1.4 Test Procedure 6.2.1.5 Test Results 6.2.2 Case II 6.2.2.1 Purpose 6.2.2.2 Input 6.2.2.3 Expected Output & Pass/ Fail Criteria 6.2.2.4 Test Procedure 6.2.2.5 Test Results
Testing
7 8
Conclusion & Scope for Future Work Bibliography & References
Chapter 1
INTRODUCTION
1.1 General introduction
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have a well-defined nucleus and they have a single chromosome which is contained within a nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes which are large and linear. A gene consists of three regions, viz, a promoter region, a coding region and a terminator region. A promoter is a region of DNA that facilitates the transcription of a particular gene. Promoters are located near the genes they regulate, on the same strand and typically upstream (towards the 5' region of the sense strand). In order for the transcription to take place, the enzyme that synthesizes RNA, known as RNA polymerase, must attach to the DNA near a gene. Promoters contain specific DNA sequences and response elements which provide a secure initial binding site for RNA polymerase and for proteins called transcription factors that recruit RNA polymerase. The coding region of a gene is that portion of a gene's DNA or RNA, composed of exons, that codes for protein. The region is bounded nearer the 5' end by a start codon and nearer the 3' end with a stop codon. The coding region in mRNA is bounded by the five prime untranslated region and the three prime untranslated region, which are also parts of the exons.
1.2 Statement of the Problem

This project aims at finding the annotations of a genomic sequence i.e the coding regions of a gene given an input DNA sequence which is not annotated.
1.3 Objective of the Problem

To find the coding and non-coding regions of an unlabelled string of DNA nucleotides.
1.4 Current Scope

Gene Recognition is still in the stages of research and certain algorithms like GENSCAN and GrailEXP implement it with certain constraints.
1.5 Future Scope

The biggest challenge that the field of bioinformatics is facing is the huge amount of unannotated data that it has to deal with. It has huge databases of sequences with it but does not know what they represent. So, if the genes are annotated it can lead to a lot of mindboggling inventions. Newer and better drugs can be developed for various genetic disorders. Any hereditary diseases can be detected in the offspring at a very early stage. It will also lead to the better understanding of various organisms.
Chapter 2
LITERATURE SURVEY
2.1 Prokaryotic Gene Structure
The organisms can be broadly classified into Prokaryotes and Eukaryotes. Prokaryotes are organisms that lack nucleus and membrane-bound organelles. Prokaryotes have a single chromosome, contained within a nucleoid region rather than a membrane-bound nucleus, but may also have various small circular pieces of DNA called plasmids spread throughout the cell. Their gene structure is much simpler than the gene structure of Eukaryotic DNA. The gene is the functional unit of the DNA. A gene is a unit of heredity in a living organism. It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain that has a function in the organism. All living things depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring. A modern working definition of a gene is "a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions, and or other functional sequence regions ". The prokaryotic gene is made up of three regions viz. Promoter Region Coding Region Terminator Region
A promoter region is the part of the gene that facilitates the transcription process. Promoters are located near the genes that they regulate, on the same strand and upstream (5 region of the sense strand). For transcription to take place, the enzyme, RNA polymerase has to bind to a location near the gene. The promoter region contains specific DNA sequences that form the binding site for the RNA polymerase. The promoter region in the prokaryotes consists of two sequences. The first one, known as the Pribnow box, is the sequence of six nucleotides TATAAT. The other sequence consists of the seven nucleotides TTGACAT.
The coding region is the exons (prokaryotic genes are devoid of introns) which form the part of the DNA that is translated. The coding region starts with the initiation codon (ATG) and end with the termination codon (TAG or TAA or TGA). The terminator region is the region that marks the end of the gene or the operon on genomic DNA for transcription (An operon is a functioning unit of the genomic material containing a cluster of genes under the control of a single promoter). Prokaryotic genes generally overlap with each other which make the detection of translation initiation sites and the predictions of prokaryotic genes difficult. Many gene finding programs for the prokaryotic genes have been developed, the earlier ones being ECOPARSE, ORPHEUS, GeneMark.hmm and the more recent ones such as GeneMark, GeneMarkS, EasyGene and GLIMMER. The GeneMark uses the Bayesian method. GeneMark.hmm uses the modification of the Viterbi algorithm of HMM (Hidden Markov Model) with duration to identify the most likely global path through hidden functional states given the DNA sequence. The extension of the GeneMark algorithm GeneMark.fba uses the forward-backward algorithm for local posterior decoding used in the HMM theory. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. Markov models of several orders were combined in the interpolated model for gene prediction in the Glimmer algorithm.
2.2
Eukaryotic Gene Structure
The eukaryotic organism, as opposed to a prokaryotic organism, has a well-defined nucleus and membrane-bound organelles. In contrast to the prokaryotes, the eukaryotes have many chromosomes that are large and linear. The chromosomes in eukaryotes are also packaged by proteins into a structure known as the chromatin due to which very long chromosomes fit into the nucleus. The Eukaryotic gene is much more complex than a prokaryotic gene. The eukaryotic gene consists of Promoter Region Exons Introns Terminator Region
The Start Site contains a sequence of 7 bases (TATAAAA) called the TATA box. The basal or core promoter is found in all protein-coding genes. This is in sharp contrast to the upstream promoter whose structure and associated binding factors differ from gene to gene. Many different genes and many different types of cells share the same transcription factors not only those that bind at the basal promoter but even some of those that bind upstream. What turns on a particular gene in a particular cell is probably the unique combination of promoter sites and the transcription factors that are chosen. Eukaryotic promoters are extremely diverse and are difficult to characterize. They typically lie upstream of the gene and can have regulatory elements several kilobases away from the transcriptional start site (enhancers). In eukaryotes, the transcriptional complex can cause the DNA to bend back on itself, which allows for placement of regulatory sequences far from the actual site of transcription. Many eukaryotic promoters, between 10 and 20% of all genes, contain a TATA box (sequence TATAAA), which in turn binds a TATA binding protein which assists in the formation of the RNA polymerase transcriptional complex. The TATA box typically lies very close to the transcriptional start site (often within 50 bases).
Eukaryotic promoter regulatory sequences typically bind proteins called transcription factors which are involved in the formation of the transcriptional complex. An exon is a nucleic acid sequence that is represented in the mature form of an RNA molecule after either portions of a precursor RNA (introns) have been removed by cis-splicing or when two or more precursor RNA molecules have been ligated by transsplicing. The mature RNA molecule can be a messenger RNA or a functional form of a noncoding RNA such as rRNA or tRNA. Depending on the context, exon can refer to the sequence in the DNA or its RNA transcript. Genes that are expressed usually have introns that interrupt the coding sequences. A typical eukaryotic gene, therefore, consists of a set of sequences that appear in mature mRNA (called exons) interrupted by introns. The regions between genes are likewise not expressed, but may help with chromatin assembly, contain promoters, and so forth. Intron sequences contain some common features. Most introns begin with the sequence GT (GU in RNA) and end with the sequence AG. Otherwise, very little similarity exists among them. Intron sequences may be large relative to coding sequences; in some genes, over 90 percent of the sequence between the 5 and 3 ends of the mRNA is introns. RNA polymerase transcribes intron sequences. This means that eukaryotic mRNA precursors must be processed to remove introns as well as to add the caps at the 5 end and polyadenylic acid (poly A) sequences at the 3 end.
2.3
Hidden Markov Models
A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine Learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and problems in computational biology. The main computational biology problems with HMM-based solutions are protein family profiling, protein binding site recognition and the gene finding in DNA. Gene Finding or gene prediction in DNA has become one of the foremost computational biology problems for two reasons. Firstly, because completely sequenced genomes have become readily available and most importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge. A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov model for application to more complex processes, including speech recognition and computational gene finding. A generalized Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. Therefore in a hidden model, there are two stochastic processes; the process of moving between states and the process of emitting an output sequence. The sequence of state transitions is a hidden process and is observed through the sequence of emitted symbols. The field of computational biology involves the application of computer science theories and approaches to biological and medical problems. Computational biology is motivated by newly available and abundant raw molecular datasets gathered from a variety of organisms.
Though the availability of this data marks a new era in biological research, it alone does not provide any biologically significant knowledge. The goal of computational biology is then to elucidate additional information regarding protein coding, protein function and many other cellular mechanisms from the raw datasets. This new information is required for drug design, medical diagnosis, medical treatment and countless fields of research. Many effective tools based on HMMs have been created for the purpose of gene finding. Among the most successful tools are Genie, GeneID and HMMGene. Though each tool has a slightly different model, they each use the technique of combining several specialized submodels into a larger framework. The submodels correspond directly to different regions of DNA defined according to their function in the process of gene transcription. Most of the gene finding tools is hybrid models that include neural network components. In these tools, instead of an HMM, a neural network models certain regions, such as splice sites. The overall framework of an HMM-based gene finder combines the submodels into a larger model corresponding to the organization of a gene in DNA and its functional roles.
2.4
GENSCAN Algorithm
Identifying genes in DNA sequences by computational methods is a topic on which a lot of research has been made in the past few years. This problem deals with the precise sequence determinants of transcription, translation and RNA splicing. Softwares for exon prediction have become common in genome sequencing laboratories to identify genes in newly sequenced regions. Early approaches to the gene recognition concentrated on the prediction of individual functional elements such as promoters, coding regions, splice sites and so on. But, the recent approaches to gene finding focus on integrating all these factors. Some examples of such approaches are: FGENEH, GENMARK, Gene ID, Genie, GeneParser and GRAIL II. Two important limitations of the currently existing algorithms are that they make an assumption that the input sequence has exactly one gene and the accuracy that is measured by independent control sets may be less than what was actually presumed. The accuracy is such that only 50% of the exons are actually identified. GENSCAN uses a general probabilistic model for the human genomic sequences. The overall architecture that the model uses is the Generalized Hidden Markov Model. This algorithm differs from the other algorithms in the followings aspects: i) A double-stranded DNA sequence is considered with potential genes on both the sides of the DNA which are analyzed simultaneously and in an integrated fashion. ii) The assumption that other algorithms have made, that the input sequence has exactly one complete gene is not made here. This model considers the fact that an input sequence may have a partial gene, a complete gene, multiple complete genes or no gene at all. iii) It introduces a new method, Maximum Dependence Decomposition, to model the functional signals in DNA sequences
Chapter 3
SOFTWARE REQUIREMENTS SPECIFICATION

3.1 Introduction
Organisms can basically be classified as Prokaryotic or Eukaryotic. Prokaryotes do not have a well-defined nucleus and they have a single chromosome which is contained within a nucleoid region. Their gene structure is much simpler than Eukaryotes. Eukaryotes have a well-defined nucleus with many chromosomes which are large and linear. A gene consists of three regions, viz, a promoter region, a coding region and a terminator region. Gene Recognition is a particularly difficult problem in Bioinformatics. The complexity that the DNA sequences involve makes the task even more daunting. The solution to this problem can be found to a certain extent by using Hidden Markov Models. A Hidden Markov Model is a generalization of a Markov chain, in which each (internal) state is not directly observable (hence the term hidden) but produces (emits) an observable random output (external) state, also called emission, according to a given stationary probability law. In this case, the time evolution of the internal states can be induced only through the sequence of the observed output states. If the number of internal states is N, the transition probability law is described by a matrix with N times N values; if the number of emissions is M, the emission probability law is described by a matrix with N times M values. A model is considered defined once given these two matrices and the initial distribution of the internal states. The most used algorithms in Hidden Markov Models are: 1. The Forward Algorithm: To find the probability of emission distribution (given a model) starting from the beginning of the sequence. 2. The Backward Algorithm: find the probability of emission distribution (given a model) starting from the end of the sequence.
3. Viterbi algorithm: To find the sequence of internal states that has, as a whole, the highest probability. The most used algorithm is the Viterbi algorithm.
3.1.1 Purpose of the Project

The purpose of this project is to annotate the unannotated DNA sequences of various species such as Saccharomyces cerevisiae, Homo sapiens etc.
3.1.2 Scope of the Project

This project aims to recognize the genes thus helping in better drug discovery and a better understanding of the organisms.
3.2
General Description
3.2.1 Project Perspective

In this project we aim recognize genes in an unannotated sequence of DNA. The gene will be recognized with the start and the stop codon appropriately marked.
3.2.2 End User Expectation

The output to be expected will be the recognized genomic sequence of a particular species with all the parts of the gene properly identified such as the promoter, the introns, exons and the terminator region.
3.2.3 General Constraints

We come across a large number of constraints in gene recognition. Since the start codon, ATG, codes for the amino acid methionine it may appear even in the coding region. This poses a problem in order to identify the start of the gene. The promoters help in identifying the start of the gene. But, in eukaryotic organisms the promoter sequences are many and quite complex. The stop codons, TAA, TGA and TAG, also occur in the intergenic region. This also is considered as a constraint. Another constraint that we may come across is that the model cannot be made to work for sequences of all the species.
3.2.4 Assumptions and Dependencies

The assumptions that we are making in our project is that the input sequence is complete and does not contain any partial genes.
3.3
Specific Requirements
3.3.1 Functional Requirements

The functional requirements are as follows: i) Take an unannotated input sequence and annotate it. ii) Visualize the annotated output sequence.
3.3.2 Non-Functional Requirements

The performance required from the Forward and Backward Algorithm must be O( N 2L) where L is the length of the sequence and N stands for the number of states in the model.
3.3.3 Software System Requirements

This gene recognition tool requires that the operating system be a Linux operating system with Apache server.
3.3.4 Hardware System Requirements

Processor: Intel Pentium (1.8 GHz) RAM: 1GB Hard Disk: 80 GB Cache: 2 MB L2 Cache
3.4
Interface Requirements
3.4.1 User Interface

We need one GUI for accepting the input sequence and more GUI for visualizing the output annotated sequence.
input the unannotated genome sequence
idenfication of start codons, promoter region and terminator region
visualize the results
user
Fig. 3.1 Use-case Diagram
START
ENTER INPUT FILE(UNANNOTATED DNA SEQUENCE)
FILE IN .fasta FORMAT?
NO
INPUT COMPONENT
YES
NO
STARTS WITH '>' ? PRINT ERROR MESSAGE
YES
REMOVE DESCRIPTION
ANY CHARACTERS OTHER THAN A,T, G,C?
YES
PRINT "FILE CONTAINS NON-NUCLEOTIDE CHARACTERS"
PREPROCESSING
NO
COMPUTATIONAL COMPONENT
DISPLAY OUTPUT
Fig 3.2 Flowchart
3.5
Performance
The efficiency expected of the forward and backward algorithm is of the order of O(N2L). At run time, warnings are issued when iterators follow an inconsistent or outof- bounds path, and when negative probabilities are encountered. Also, to improve the efficiency, constant transition probabilities are pre-computed outside the main loop whenever possible. Finally, all loops over transitions and states are unrolled, reducing the number of run-time decisions and lookups.
Chapter 4
SYSTEM DESIGN
4.1 Introduction and Design Overview
This project deals with one of the most challenging problems in Bioinformatics i.e. Gene Recognition. Gene prediction in DNA has become one of the foremost computational biology problems for two reasons. Firstly, because completely sequenced genomes have become readily available and most importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge. This document deals with the design of the architecture used for this project. The problem of Gene Recognition has been solved by using numerous methods. But, by far, Hidden Markov Models are most widely used. Hidden Markov Models do not present the problems that occur while Gene Recognition is being performed by pattern recognition. The input will be a DNA sequence in FASTA format that is not annotated. It then undergoes preprocessing in order to be checked for the correct format and to be copied on to a temporary file without the description of the sequence. After this is done, the sequence goes through the computation phase where the Viterbi, forward-backward algorithms are used to find the optimal path, hence finding the gene. This output is then displayed to the user.
4.2 System Architectural Design
4.2.1 Chosen System Architecture
Fig 4.1 System Architecture

The system has four components: 1. Input Component 2. Preprocessing 3. Computational Component 4. Output Component
4.2.2 Discussion of Alternative Designs
One of the alternative designs that were proposed for the recognition of genes was that of pattern matching using parallel programming. It involves the recognition of start codons, promoter sequences, terminator codons and intrinsic terminator sequences in a given DNA sequence, hence recognizing the gene. But this design failed due to several reasons and it was also found to be an inefficient way of locating genes. The first step involved identifying the terminator codons in parallel by dividing the input sequence into a fixed number of nucleotides. But, the major drawback here was that certain sequences in the intergenic region also matched the terminator codons. Another major drawback is that, the start codon, ATG, also codes for the protein methionine. Hence, the sequence ATG may be present in the coding region also. This makes it difficult to recognize the start of the gene using pattern matching. Another difficulty that was encountered was regarding the promoter sequences. Prokaryotic genes have two fixed promoter sequences that mark the start of a gene. But, when eukaryotic genes are considered the complexity of the promoter sequences increases. The sequences are extremely diverse and difficult to characterize. They lie several kilobases away from the transcriptional start site which makes it highly inefficient to search for it using pattern matching. The nucleotides in the consensus sequence may also vary from gene to gene. Hence, it becomes very difficult for the consensus sequence to be searched for in the given DNA sequence. The termination of a gene may also be identified using sequences known as intrinsic terminators. The intrinsic terminators are a sequence of inverted repeat (5 CAGTTA| TAACTG 3) followed by up to six thymine nucleotides (TTTTTT). But, the complexity involved in this would be that the inverted repeat may be of any length. Partial genes would also pose a problem to recognize a gene using pattern recognition. Hence, we planned to use Hidden Markov Models for gene recognition.
4.3 Detailed Description of Components
4.3.1 Input Component

The input component takes a text file containing DNA sequences in FASTA format as input. This acts as the sequence that has to be annotated.
Fig 4.2 Input Component
4.3.2 Preprocessing
The description in the input file is removed and the remaining input, which is the DNA sequence, is copied onto another temporary file. If the user enters data is any format other than FASTA then appropriate error messages are displayed.
INPUT COMPONENT
UNANNOTATED DNA SEQUENCE NO
STARTS WITH '>' ?
PRINT ERROR MESSAGE
YES
REMOVE DESCRIPTION
ANY CHARACTERS OTHER THAN A,T,G,C?
YES
PRINT "FILE CONTAINS NON-NUCLEOTIDE CHARACTERS"
NO
COMPUTATIONAL COMPONENT
Fig 4.3 Preprocessing
4.3.3 Computational Component
The computational component involved in this project is the application of Hidden Markov Models. The preprocessed sequence is taken as input and the Viterbi algorithm and forwardbackward algorithm are applied to it. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, which will generate a given output sequence given the model parameters. The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions.
4.3.4 Output Component

This component takes the annotated sequence from the computational component and displays the gene.
Fig 4.4 Output Component
4.4 User Interface Design
4.4.1 Description of the User Interface

The user interface for this project involves page where the user may input the DNA sequence or may upload a file containing the sequence. The output is also displayed on the same page.
4.4.2 Screen Images
Fig 4.5 Input Screen Shot 1

The user can enter his DNA sequence in the box provided with the label Enter the input sequence in FASTA format. If the user inputs a sequence that is not in FASTA format a dialog box opens with an error message asking the user to input the sequence in the correct format.
Fig 4.6 Input Screen Shot 2

If the user wants to upload a file containing the input sequence instead of directly inputting the sequence he may do so by clicking the Upload File button. This opens a box wherein the user may type the name of the file that contains the input sequence.
Fig 4.4 Output Screen Shot

Once the user inputs the sequence he may click Submit to obtain the annotated sequence in the box provided below The annotated sequence is.
4.4.3 Objects and Actions

The objects present on the user interface are:
Text Area1: To enter the input sequence directly. The text area is preceded by a statement Enter the input sequence in FASTA format. Upload File button: If the user does not want to input the DNA sequence directly he may use this button to upload a file that contains the input sequence. When this button is clicked a text area appears where the user may enter the file name.
Submit button: This button may be clicked when the sequence or the file has to be submitted for annotating the sequence that they contain. Once this button is clicked the sequence gets annotated and the output is displayed. Text Area2: This text area is used to display the output i.e. the annotated sequence. It is preceded by The annotated output sequence is. Exit button: This button is used to exit the screen.
4.5 Test Plan

4.4.1 Features to be Tested
The features to be tested are as follows: Features to be tested Input in FASTA format Input Sequence entered by the user Expected Output No- Error messages should be displayed Preprocessing Sequence entered by the user Yes- Proceed No- Error Yes- Temporary file containing only the sequence Start codon identification Temporary file with sequence Terminator codon identification Exons and Introns identification Temporary file with sequence Temporary file with sequence should be created ATG sequence should be identified at the proper position TAA/TAG/TGA should be identified at the correct position Exons and introns should follow the AG-GT rule and be identified at the proper position
Table 4.1 Features to be Tested
Chapter 5
IMPLEMENTATION
5.1 Hidden Markov Models in Gene Recognition

A Hidden Markov Model (HMM) is a stochastic model that captures the statistical properties of observed real world data. A good HMM accurately models the real world source of the observed data and has the ability to simulate the source. Machine Learning techniques based on HMMs have been successfully applied to problems including speech recognition, optical character recognition, and problems in computational biology. The main computational biology problems with HMM-based solutions are protein family profiling, protein binding site recognition and the problem that is the topic of this paper, gene finding in DNA. Gene finding or gene prediction in DNA has become one of the foremost computational biology problems for two reasons. Firstly, because completely sequenced genomes have become readily available and most importantly, because of the need to extract actual biological knowledge from this data to explain the molecular interactions that occur in cells and to define important cellular pathways. Discovering the location of genes on the genome is the first step towards building such a body of knowledge. A basic Markov model of a process is a model where each state corresponds to an observable event and the state transition probabilities depend only on the current and predecessor state. This model is extended to a Hidden Markov model for application to more complex processes, including speech recognition and computational gene finding. A generalized Hidden Markov Model (HMM) consists of a finite set of states, an alphabet of output symbols, a set of state transition probabilities and a set of emission probabilities. The emission probabilities specify the distribution of output symbols that may be emitted from each state. Therefore in a hidden model, there are two stochastic processes; the process of moving between states and the process of emitting an output sequence. The problem of finding genes in DNA has been studied for many years. It was one of the first problems tackled once sufficient genomic data became available. The problem is given a sequence of DNA, determine the locations of genes, which are the regions containing information that code for proteins. At a very general level, nucleotides can be classified as belonging to coding regions in a gene, non-coding regions in a gene or intergenic regions. The problem of gene finding can then be stated as follows:
Input: A sequence of DNA X = (x1.xn) *, where = A,C, G, T. Output: Correct labeling of each element in X as belonging to a coding region, non-coding region or intergenic region. Gene finding becomes complicated when the problem is approached in more biological detail. A eukaryotic gene contains coding regions called exons which may be interrupted by non-coding regions called introns. The exons and introns are separated by splice sites. Regions outside genes are called intergenic. The goal of gene finding is then to annotate the sets of genomic data with the location of genes and within these genes, specific areas such as promoter regions, introns and exons.
The Hidden Markov Model that we have used in our project is a shown below:
bgend bgbg BLOCK 1 BLOCK 2 extend gene stop BLOCK 3
Start
full
Background
bgstart
Start Codon
full
Gene
Stop
End
full
emit background
emit start
emit codon
emit stop
empty
Fig 5.1 Gene Model
The Gene Model shown above consists of six states, viz, Start, Background, Start Codon, Gene, Stop and End. These six states are divided into three blocks, viz, Block 1, Block 2 and Block 3. Block 1 consists of just the Start state which is the start of scanning the sequence. Block 2 consists of 4 states, Background, Start Codon, Gene and Stop. Block 3 consists of a single state, End. These states constitute the finite set of states of the Hidden Markov Model. The alphabet of output symbols in this HMM are the nucleotides A, T, G, and C. The sequence moves from one state to another using state transition probabilities. The transition from the Start state to the Background State has full probability, i.e. 1.0. The Background state has emission probability emitbackground. The Background state takes care of the intergenic region. Hence the nucleotides encountered may be A, T, G or C. Therefore, the emission probability for Background is emitbackground= 0.25. The model remains in the Background state until a start codon is encountered. The transition probability to remain in the Background state is bgbg i.e. full-bgend-bgstart. Once the start codon ATG is found, the model moves from Background state to Start Codon state. The transition from Background state to Start Codon state has the probability bgstart= Gene Density. The emission probability of Start Codon state is emitstart. If an ATG is encountered then the emission probability is 1.0 otherwise it is considered as 0.0. Once a start
codon is encountered it leads to a gene, hence the model moves from the Start Codon state to the Gene state. The transition probability for this transition is full i.e. 1.0 since once the start codon is found, the gene is found. There is more than a single codon that is encountered when we are in the Gene state. Hence, the probability of staying in the Gene state is extend= 1.0- (1.0/ Gene Length). The emission probability for the Gene state is emitcodon. This state emits any of the 61 codons (except TAA, TGA and TAG) that code for the amino acids that form the protein. Hence, if these codons are found the probability is 1/61 otherwise if stop codons are found the emission probability is 0.0. Then, a transition to the next state is made. The next state is the Stop Codon state which is entered if any of the stop codons (TAA, TAG or TGA) are encountered. The transition probability from Gene state to the Stop state is genestop= 1.0/ Gene Length. Once the stop codons are encountered it marks the end of that particular gene. But the DNA sequence may have more nucleotides. Hence, the transition takes place from Stop state to Background state. This transition probability is full i.e. 1.0. The Stop state emits the stop codons, hence the emission probability of the state is 1/3 if a stop codon is found otherwise it is 0.0. Once the transition has taken place from Stop to Background the entire sequence is scanned and finally a transition from Background to End state takes place with a transition probability of bgend=0.0001. The End state belongs to Block 3 of the model and its emission is empty.
The algorithms used in the implementation of this project are Forward-Backward Algorithm and the Viterbi Algorithm. The forward-backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of
observations/emissions , the distribution
, i.e. it computes, for all hidden state variables . The algorithm makes use of the
principle of dynamic programming to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward-backward algorithm. The forward algorithm is used to find the next likely state in the given finite set of states. In our implementation the forward algorithm is implemented in a function called Forward which returns the next likely state. In this function the 8 transitions are defined according to our gene model. It scans the sequence in the forward direction and finds the likely states. The backward algorithm is also used to find the next likely state in a given finite set of states but it scans the sequence in the reverse direction. In our implementation the backward algorithm has been implemented in the function called Backward. This function also returns the next likely state. The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed events, especially in the context of Markov information sources, and more generally, hidden Markov models. The algorithm makes a number of assumptions.
First, both the observed events and hidden events must be in a sequence. This sequence often corresponds to time. Second, these two sequences need to be aligned, and an instance of an observed event needs to correspond to exactly one instance of a hidden event. Third, computing the most likely hidden sequence up to a certain point t must depend only on the observed event at point t, and the most likely sequence at point t 1.
In the implementation we have implemented Viterbi algorithm in 2 functions, Viterbi_trace and Viterbi_recurse. Viterbi_trace finds the most likely path in the forward direction whereas Viterbi_recurse finds the path in the reverse direction. These two functions return the most likely path followed by the sequence. Another function called addEdge is used to assign an
edge from the from state to the to state with transitions, probability and emissions as parameters.
Chapter 6
TESTING
6.1 Introduction
6.1.1 System Overview

The implementation of our code has been done on the Linux operating system. The language that has been used for coding is C++ . The front end has been developed using PHP.
6.1.2 Test Approach

The input sequences have been taken from the GENBANK database. The output is tested manually in comparison to the details from the GENBANK database.
6.2 Test Cases

6.2.1 Case-1 6.2.1.1 Purpose
To test if the gene has been correctly identified with the start and stop codon appropriately marked.
6.2.1.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a file of .fasta format. The sequence is:
>gi|1293613|gb|U49845.1|SCU49845 Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACA TGAGACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAA GCCGCTGAAGTTCTACTAAGGGTGGATAACATCATCCGTGCAAGACCAAGAACCGCCAATAGACA ACATATGTAACATATTTAGGATATACCTCGAAAATAATAAACCGCCACACTGTCATTATTATAATT AGAAACAGAACGCAAAAATTATCCACTATATAATTCAAAGACGCGAAAAAAAAAGAACAACGCG TCATAGAACTTTTGGCAATTCGCGTCACAAATAAATTTTGGCAACTTATGTTTCCTCTTCGAGCAGT ACTCGAGCCCTGTCTCAAGAATGTAATAATACCCATCGTAGGTATGGTTAAAGATAGCATCTCCAC AACCTCAAAGCTCCTTGCCGAGAGTCGCCCTCCTTTGTCGAGTAATTTTCACTTTTCATATGAGAA CTTATTTTCTTATTCTTTACTCTCACATCCTGTAGTGATTGACACTGCAACAGCCACCATCACTAGA AGAACAGAACAATTACTTAATAGAAAAATTATATCTTCCTCGAAACGATTTCCTGCTTCCAACATC TACGTATATCAAGAAGCATTCACTTACCATGACACAGCTTCAGATTTCATTATTGCTGACAGCTAC TATATCACTACTCCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCC CCCAGTGGCAAGAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGT AGACAAGACAGCTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAG TTCTAGAACGTTCTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTC
AATGTAATACTCGAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTT GTTACAAACCGTCCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTAT GGTTATACTAACGGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGAC CGTTCAATGTTCACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCG CCGTTACCCAATTGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATA AACTCGGCGATTGCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTT TCTGCCGTTGAGGTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAAT AGTTTGATAATCAACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTAT CTCGATGACGATCCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGG GTGGCATTAGATAATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAAT CCTGCCAATTTTTCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTG TCTCCACAACGGATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGT TCTCCTACTATTTTTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTAC TAATTCAAGCCAAGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAG TGCCCAAGAATTTCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAG CTATATTTTAACATCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACG TCCACAAGAAGTTCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAA ATTTCTTCTACCTCCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTT CATCTCACAATAAAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAG TAGCTCTCATTTGCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTA CCGCATGCTATTAGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTAC ACCTTTGAACAACCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATT GGCTGCTTTGAACACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGA TGAAAAGAGAGATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAG AAGAATTATTAGCAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGG TCTTCTTCTGTGTATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTG TCACCAGTCTCTGATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACT TTTCGATTTAGAAGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGG ACCCTTGGAACAGCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAAC GTAACGAAGCATCGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAAT CACTCCCACAACAATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTT TTGCTGGGTCCATAGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAA ATAAGAGTAATGTCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGA TTATACGCAACGATATTTTGCTTAATTTTATTTTCCTGTTTTATTTTTTATTAGTGGTTTACAGATAC CCTATATTTTATTTAGTTTTTATACTTAGAGACATTTAATTTTAATTCCATTCTTCAAATTTCATTTT TGCACTTAAAACAAAGATCCAAAAATGCTCTCGCCCTCTTCATATTGAGAATACACTCCATTCAAA ATTTTGTCGTCACCGCTGATTAATTTTTCACTAAACTGATGAATAATCAAAGGCCCCACGTCAGAA CCGACTAAAGAAGTGAGTTTTATTTTAGGAGGTTGAAAACCATTATTGTCTGGTAAATTTTCATCT TCTTGACATTTAACCCAGTTTGAATCCCTTTCAATTTCTGCTTTTTCCTCCAAACTATCGACCCTCCT
GTTTCTGTCCAACTTATGTCCTAGTTCCAATTCGATCGCATTAATAACTGCTTCAAATGTTATTGTG TCATCGTTGACTTTAGGTAATTTCTCCAAATGCATAATCAAACTATTTAAGGAAGATCGGAATTCG TCGAACACTTCAGTTTCCGTAATGATCTGATCGTCTTTATCCACATGTTGTAATTCACTAAAATCTA AAACGTATTTTTCAATGCATAAATCGTTCTTTTTATTAATAATGCAGATGGAAAATCTGTAAACGT GCGTTAATTTAGAAAGAACATCCAGTATAAGTTCTTCTATATAGTCAATTAAAGCAGGATGCCTAT TAATGGGAACGAACTGCGGCAAGTTGAATGACTGGTAAGTAGTGTAGTCGAATGACTGAGGTGGG TATACATTTCTATAAAATAAAATCAAATTAATGTAGCATTTTAAGTATACCCTCAGCCACTTCTCT ACCCATCTATTCATAAAGCTGACGCAACGATTACTATTTTTTTTTTCTTCTTGGATCTCAGTCGTCG CAAAAACGTATACCTTCTTTTTCCGACCTTTTTTTTAGCTTTCTGGAAAAGTTTATATTAGTTAAAC AGGGTCTAGTCTTAGTGTGAAAGCTAGTGGTTTCGATTGACTGATATTAAGAAAGTGGAAATTAA ATTAGTAGTGTAGACGTATATGCATATGTATTTCTCGCCTGTTTATGTTTCTACGTACTTTTGATTT ATAGCAAGGGGAAAAGAAATACATACTATTTTTTGGTAAAGGTGAAAGCATAATGTAAAAGCTAG AATAAAATGGACGAAATAAAGAGAGGCTTAGTTCATCTTTTTTCCAAAAAGCACCCAATGATAAT AACTAAAATGAAAAGGATTTGCCATCTGTCAGCAACATCAGTTGTGTGAGCAATAATAAAATCAT CACCTCCGTTGCCTTTAGCGCGTTTGTCGTTTGTATCTTCCGTAATTTTAGTCTTATCAATGGGAAT CATAAATTTTCCAATGAATTAGCAATTTCGTCCAATTCTTTTTGAGCTTCTTCATATTTGCTTTGGA ATTCTTCGCACTTCTTTTCCCATTCATCTCTTTCTTCTTCCAAAGCAACGATCCTTCTACCCATTTGC TCAGAGTTCAAATCGGCCTCTTTCAGTTTATCCATTGCTTCCTTCAGTTTGGCTTCACTGTCTTCTAG CTGTTGTTCTAGATCCTGGTTTTTCTTGGTGTAGTTCTCATTATTAGATCTCAAGTTATTGGAGTCTT CAGCCAATTGCTTTGTATCAGACAATTGACTCTCTAACTTCTCCACTTCACTGTCGAGTTGCTCGTT TTTAGCGGACAAAGATTTAATCTCGTTTTCTTTTTCAGTGTTAGATTGCTCTAATTCTTTGAGCTGTT CTCTCAGCTCCTCATATTTTTCTTGCCATGACTCAGATTCTAATTTTAAGCTATTCAATTTCTCTTTG ATC
6.2.1.3 Expected Output & Pass/ Fail Criteria

The expected output is:
Gene Recognition: (upper case represents the Gene/s) gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattgccgacatgagacagttaggtatcgtcgagagttacaagctaaaacga gcagtagtcagctctgcatctgaagccgctgaagttctactaagggtggataacatcatccgtgcaagaccaagaaccgccaatagacaacatatgtaacatattta ggatatacctcgaaaataataaaccgccacactgtcattattataattagaaacagaacgcaaaaattatccactatataattcaaagacgcgaaaaaaaaagaacaa cgcgtcatagaacttttggcaattcgcgtcacaaataaattttggcaacttatgtttcctcttcgagcagtactcgagccctgtctcaagaatgtaataatacccatcgta ggtatggttaaagatagcatctccacaacctcaaagctccttgccgagagtcgccctcctttgtcgagtaattttcacttttcatatgagaacttattttcttattctttactct cacatcctgtagtgattgacactgcaacagccaccatcactagaagaacagaacaattacttaatagaaaaattatatcttcctcgaaacgatttcctgcttccaacatc tacgtatatcaagaagcattcacttaccATGACACAGCTTCAGATTTCATTATTGCTGACAGCTACTATATCACTACT CCATCTAGTAGTGGCCACGCCCTATGAGGCATATCCTATCGGAAAACAATACCCCCCAGTGGCAA GAGTCAATGAATCGTTTACATTTCAAATTTCCAATGATACCTATAAATCGTCTGTAGACAAGACAG CTCAAATAACATACAATTGCTTCGACTTACCGAGCTGGCTTTCGTTTGACTCTAGTTCTAGAACGTT CTCAGGTGAACCTTCTTCTGACTTACTATCTGATGCGAACACCACGTTGTATTTCAATGTAATACTC GAGGGTACGGACTCTGCCGACAGCACGTCTTTGAACAATACATACCAATTTGTTGTTACAAACCGT
CCATCCATCTCGCTATCGTCAGATTTCAATCTATTGGCGTTGTTAAAAAACTATGGTTATACTAAC GGCAAAAACGCTCTGAAACTAGATCCTAATGAAGTCTTCAACGTGACTTTTGACCGTTCAATGTTC ACTAACGAAGAATCCATTGTGTCGTATTACGGACGTTCTCAGTTGTATAATGCGCCGTTACCCAAT TGGCTGTTCTTCGATTCTGGCGAGTTGAAGTTTACTGGGACGGCACCGGTGATAAACTCGGCGATT GCTCCAGAAACAAGCTACAGTTTTGTCATCATCGCTACAGACATTGAAGGATTTTCTGCCGTTGAG GTAGAATTCGAATTAGTCATCGGGGCTCACCAGTTAACTACCTCTATTCAAAATAGTTTGATAATC AACGTTACTGACACAGGTAACGTTTCATATGACTTACCTCTAAACTATGTTTATCTCGATGACGAT CCTATTTCTTCTGATAAATTGGGTTCTATAAACTTATTGGATGCTCCAGACTGGGTGGCATTAGAT AATGCTACCATTTCCGGGTCTGTCCCAGATGAATTACTCGGTAAGAACTCCAATCCTGCCAATTTT TCTGTGTCCATTTATGATACTTATGGTGATGTGATTTATTTCAACTTCGAAGTTGTCTCCACAACGG ATTTGTTTGCCATTAGTTCTCTTCCCAATATTAACGCTACAAGGGGTGAATGGTTCTCCTACTATTT TTTGCCTTCTCAGTTTACAGACTACGTGAATACAAACGTTTCATTAGAGTTTACTAATTCAAGCCA AGACCATGACTGGGTGAAATTCCAATCATCTAATTTAACATTAGCTGGAGAAGTGCCCAAGAATT TCGACAAGCTTTCATTAGGTTTGAAAGCGAACCAAGGTTCACAATCTCAAGAGCTATATTTTAACA TCATTGGCATGGATTCAAAGATAACTCACTCAAACCACAGTGCGAATGCAACGTCCACAAGAAGT TCTCACCACTCCACCTCAACAAGTTCTTACACATCTTCTACTTACACTGCAAAAATTTCTTCTACCT CCGCTGCTGCTACTTCTTCTGCTCCAGCAGCGCTGCCAGCAGCCAATAAAACTTCATCTCACAATA AAAAAGCAGTAGCAATTGCGTGCGGTGTTGCTATCCCATTAGGCGTTATCCTAGTAGCTCTCATTT GCTTCCTAATATTCTGGAGACGCAGAAGGGAAAATCCAGACGATGAAAACTTACCGCATGCTATT AGTGGACCTGATTTGAATAATCCTGCAAATAAACCAAATCAAGAAAACGCTACACCTTTGAACAA CCCCTTTGATGATGATGCTTCCTCGTACGATGATACTTCAATAGCAAGAAGATTGGCTGCTTTGAA CACTTTGAAATTGGATAACCACTCTGCCACTGAATCTGATATTTCCAGCGTGGATGAAAAGAGAG ATTCTCTATCAGGTATGAATACATACAATGATCAGTTCCAATCCCAAAGTAAAGAAGAATTATTAG CAAAACCCCCAGTACAGCCTCCAGAGAGCCCGTTCTTTGACCCACAGAATAGGTCTTCTTCTGTGT ATATGGATAGTGAACCAGCAGTAAATAAATCCTGGCGATATACTGGCAACCTGTCACCAGTCTCT GATATTGTCAGAGACAGTTACGGATCACAAAAAACTGTTGATACAGAAAAACTTTTCGATTTAGA AGCACCAGAGAAGGAAAAACGTACGTCAAGGGATGTCACTATGTCTTCACTGGACCCTTGGAACA GCAATATTAGCCCTTCTCCCGTAAGAAAATCAGTAACACCATCACCATATAACGTAACGAAGCAT CGTAACCGCCACTTACAAAATATTCAAGACTCTCAAAGCGGTAAAAACGGAATCACTCCCACAAC AATGTCAACTTCATCTTCTGACGATTTTGTTCCGGTTAAAGATGGTGAAAATTTTTGCTGGGTCCAT AGCATGGAACCAGACAGAAGACCAAGTAAGAAAAGGTTAGTAGATTTTTCAAATAAGAGTAATG TCAATGTTGGTCAAGTTAAGGACATTCACGGACGCATCCCAGAAATGCTGTGAttatacgcaacgatattttgctt aattttattttcctgttttattttttattagtggtttacagataccctatattttatttagtttttatacttagagacatttaattttaattccattcttcaaatttcatttttgcacttaaaa caaagatccaaaaatgctctcgccctcttcatattgagaatacactccattcaaaattttgtcgtcaccgctgattaatttttcactaaactgatgaataatcaaaggcccc acgtcagaaccgactaaagaagtgagttttattttaggaggttgaaaaccattattgtctggtaaattttcatcttcttgacatttaacccagtttgaatccctttcaatttctg ctttttcctccaaactatcgaccctcctgtttctgtccaacttatgtcctagttccaattcgatcgcattaataactgcttcaaatgttattgtgtcatcgttgactttaggtaatt tctccaaatgcataatcaaactatttaaggaagatcggaattcgtcgaacacttcagtttccgtaatgatctgatcgtctttatccacatgttgtaattcactaaaatctaaa acgtatttttcaatgcataaatcgttctttttattaataatgcagatggaaaatctgtaaacgtgcgttaatttagaaagaacatccagtataagttcttctatatagtcaatta aagcaggatgcctattaatgggaacgaactgcggcaagttgaatgactggtaagtagtgtagtcgaatgactgaggtgggtatacatttctataaaataaaatcaaat
taatgtagcattttaagtataccctcagccacttctctacccatctattcataaagctgacgcaacgattactattttttttttcttcttggatctcagtcgtcgcaaaaacgta taccttctttttccgaccttttttttagctttctggaaaagtttatattagttaaacagggtctagtcttagtgtgaaagctagtggtttcgattgactgatattaagaaagtgg aaattaaattagtagtgtagacgtatatgcatatgtatttctcgcctgtttatgtttctacgtacttttgatttatagcaaggggaaaagaaatacatactattttttggtaaag gtgaaagcataatgtaaaagctagaataaaatggacgaaataaagagaggcttagttcatcttttttccaaaaagcacccaatgataataactaaaatgaaaaggattt gccatctgtcagcaacatcagttgtgtgagcaataataaaatcatcacctccgttgcctttagcgcgtttgtcgtttgtatcttccgtaattttagtcttatcaatgggaatc ataaattttccaatgaattagcaatttcgtccaattctttttgagcttcttcatatttgctttggaattcttcgcacttcttttcccattcatctctttcttcttccaaagcaacgatc cttctacccatttgctcagagttcaaatcggcctctttcagtttatccattgcttccttcagtttggcttcactgtcttctagctgttgttctagatcctggtttttcttggtgtagt tctcattattagatctcaagttattggagtcttcagccaattgctttgtatcagacaattgactctctaacttctccacttcactgtcgagttgctcgtttttagcggacaaag atttaatctcgttttctttttcagtgttagattgctctaattctttgagctgttctctcagctcctcatatttttcttgccatgactcagattctaattttaagctattcaatttctctttg atc
The gene should been correctly identified with the ATG and the stop codon TGA appropriately marked.
6.2.1.4 Test Procedure

The input file containing the DNA sequence is entered in the space provided and the program is run. The output that is obtained is matched with the GENBANK data to verify if the gene has been correctly identified.
6.2.1.5 Test Results

The gene has been correctly identified by the program with the ATG and TGA correctly marked as verified from the GENBANK record.
6.2.2 Case-2 6.2.2.1 Purpose

To test if the gene has been correctly identified with the start and stop codon appropriately marked when the sequence ends with the gene.
6.2.2.2 Input
The input is the DNA sequence of Saccharomyces cerevisiae in FASTA format specified in a file of .fasta format. The sequence is:
>gi|296148533|ref|NM_001183937.1| Saccharomyces cerevisiae S288c Rny1p (RNY1) mRNA, complete cds ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA
AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
6.2.2.3 Expected Output & Pass/ Fail Criteria

The expected output is:
Gene Recognition: (upper case represents the Gene/s) ATGTTACTGAAAAACTTACACAGTCTCTTACAACTACCAATTTTTTCGAATGGAGCAGATAAGGGT ATAGAACCAAACTGCCCTATAAACATTCCATTATCATGTTCCAATAAAACTGATATAGACAACTCG TGTTGTTTTGAATATCCAGGTGGAATATTTTTACAAACCCAATTCTGGAATTACTTTCCAAGCAAA AACGATTTAAATGAAACTGAATTAGTGAAGGAGTTAGGGCCTCTAGATTCATTTACAATTCACGG ATTATGGCCAGATAATTGTCATGGTGGCTACCAACAATTCTGTAATAGGTCCTTACAAATTGACGA TGTTTACTACTTATTGCATGACAAGAAATTTAATAATAATGATACATCCCTGCAAATATCGGGCGA AAAGCTGCTTGAATACCTAGACTTATATTGGAAGAGTAATAACGGGAATCATGAGTCCTTATGGA TACACGAGTTTAATAAACATGGCACGTGCATTAGCACAATTAGACCAGAGTGCTATACTGAGTGG GGTGCTAATAGTGTTGACAGAAAAAGAGCGGTCTATGATTATTTTAGAATAACTTATAATCTATTC AAGAAATTGGACACATTTTCAACACTAGAAAAAAATAATATTGTCCCAAGTGTGGACAATTCCTA TTCTTTGGAGCAGATAGAGGCAGCACTAAGTAAAGAGTTTGAAGGAAAAAAAGTCTTCATAGGCT GTGATAGACATAATTCCTTAAACGAAGTATGGTATTATAACCACTTGAAGGGTTCCCTTTTGAGCG AAATGTTTGTGCCCATGGACTCACTTGCCATTCGAACAAATTGTAAAAAAGATGGTATTAAGTTTT TTCCAAAAGGTTATGTCCCAACTTTCAGGAGGAGACCTAATAAGGGAGCAAGATACAGAGGAGTC GTTCGTCTATCAAATATTAATAATGGAGATCAGATGCAAGGCTTTCTAATCAAGAATGGACACTG GATGAGTCAAGGTACACCAGCGAATTACGAGTTGATTAAATCTCCCTATGGGAATTACTACTTGA GAACTAACCAAGGGTTTTGTGACATTATTTCGTCTTCATCTAATGAATTGGTCTGCAAATTCAGGA ACATTAAGGATGCAGGTCAATTCGATTTTGATCCAACGAAAGGAGGAGACGGATATATTGGTTAT
TCTGGTAACTACAACTGGGGCGGTGACACCTATCCAAGGAGAAGGAATCAAAGCCCCATTTTCTC TGTAGACGATGAACAAAATTCCAAGAAATATAAGTTTAAATTAAAATTCATCAAAAATTAA
The entire sequence should be in capitals since it is the entire gene.
6.2.2.4 Test Procedure

The input file containing the DNA sequence is entered in the space provided and the program is run. The output that is obtained is matched with the GENBANK data to verify if the gene has been correctly identified.
6.2.2.5 Test Results

The gene has not been correctly identified by the program with the ATG and TAA marked as verified from the GENBANK record. Hence, this is a fail case.
Chapter 7
CONCLUSION & FUTURE ENHANCEMENTS

In the previous chapters are described the model that we have used to find genes in the given DNA sequences. Our algorithm has been tested on Saccharomyces cerevisiae, Drosophila melanogaster, Zea Mays (maize) and Homo Sapiens sequences. It has been found to work partially on the sequences of Homo Sapiens and achieve about 80-90% success for the other sequences. This field has a lot of future work to be done. The genes of many other organisms still need to be identified and with greater efficiency. The partial genes also have not been able to be identified using this program and a significant future enhancement would be to be able to identify partial genes. Multiple genes have been taken care of to a certain extent. Also, the recognition of promoter sequences is a daunting task and future enhancements can also take care of this. The introns and exons also need to be identified with greater efficiency.
Chapter 8
BIBLIOGRAPHY & REFERENCES

[1] Rajeev K. Azad, Mark Borodovsky, Probabilistic methods of identifying genes in prokaryotic genomes: Connections to the HMM theory, Henry Stewart Publications 14774054. Briefings in bioinformatics. Vol 5. No 2. 118130. June 2004 [2] Alexandre Lomsadze, Vardges Ter-Hovhannisyan, Yury O. Chernoff, Mark Borodovsky, Gene identification in novel eukaryotic genomes by self-training algorithm, 64946506 Nucleic Acids Research, 2005, Vol. 33, No. 20 [3] David Sankoff, The Early Introduction Of Dynamic Programming Into Computational Biology, Oxford University Press, 2000, Vol 16 no1, Pages 41-47 [4] P.S. Novichkov, M.S. Gelfand, A.A. Mironov, Gene Recognition in Eukaryotic DNA by Comparison of Genomic Sequences, Oxford University Press, 2001, Vol 17 no11, Pages 1011-1018 [5] Chris Burge, Samuel Karlin, Prediction of Complete Gene Structures in Human Genomic DNA, J. Mol. Biol. (1997) 268, 78-94 [6] Steven L. Salzberg, Arthur L. Delcher, Simon Kasif, Owen White, Microbial gene identification using interpolated Markov models, Nucleic Acids Research, 1998, Vol. 26, No. 2, 544-548 [7] Sanja Rogic, B.F. Francis Ouellette, Alan K. Mackworth, Improving gene recognition accuracy by combining predictions from two gene-finding programs, Oxford University Press, 2002, Vol 18 no8, Pages 1034-1045 [8] Andrey A. Mironov, Pavel S. Novichkov, Mikhail S. Gelfand, Pro-Frame: similaritybased gene recognition in eukaryotic DNA sequences with errors, Oxford University Press, 2001, Vol 17 no 1, Pages 13-15 [9] Y.V. Kondrakhin, A.E. Kel, N.A. Kolchanov, A.G. Romashchenko, L. Milanesi, Eukaryotic Promoter Recognition by Binding Sites for Transcription Factors, Computer Applications in Biosciences 11: 477-488, 1995
[10] Anton M.Shmatkov, Arik A.Melikyan, Felix L.Chernousko, Mark Borodovsky, Finding Prokaryotic Genes by the frame-by-frame Algorithm: Targeting Gene Starts and Overlapping Genes, Bioinformatics, 15: 874-886, 1999 [11] Peter M. Hooper, Haiyan Zhang, David S. Wishart, Prediction of Genetic Structure in Eukaryotic DNA using Reference Point Logistic Regression and Sequence Alignment, Computer Applications in Biosciences, 16: 425 438, 2000 [12] Christopher Burge, Identification Of Genes In Human Genomic DNA, March 2007 [13] Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human DNA, 308-761 Machine Learning Projec,t Kaleigh Smith, January 17, 2002 [14] Ion I. Mandoiu and Alexander Zelikovsky, Bioinformatics Algorithms-Techniques and Applications, John Wiley & Sons, 2008 [15] Neil C. Jones and Pavel A. Pevzner, An Introduction to Bioinformatics Algorithms, The MIT Press, 2004 [16] David W. Mount, Bioinformatics- Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press [17] Valeria De Fonzo, Filippo Aluffi-Pentini, Valerio Parisi, Hidden Markov Models in Bioinformatics, Current Bioinformatics, 2007, 2, 49-61 [18] Catherine Mathe, Marie-France Sagot, Thomas Schiex, Pierre Rouze, Current methods of gene prediction, their strengths and weaknesses, Nucleic Acids Research, 2002, Vol. 30 No. 19 4103-4117

Bachelor of Engineering in Computer Science & Engineering: Gene Recognition

Hochgeladen von

Dokumentinformationen

Originaltitel

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Bachelor of Engineering in Computer Science & Engineering: Gene Recognition

Hochgeladen von

Copyright:

Verfügbare Formate

Gene Recognition

A project report submitted to

M S Ramaiah Institute of Technology

Bachelor of Engineering in Computer Science & Engineering

M. S. Ramaiah Institute of Technology

Bachelor of Engineering in Computer Science & Engineering

Department of Computer Science & Engineering

(Dr.R. Selvarani) Head, Dept of CSE

Department of Computer Science & Engineering

Mudra Hegde Nakul G.V

Software Requirements Specification

Conclusion & Scope for Future Work Bibliography & References

1.2 Statement of the Problem

1.3 Objective of the Problem

1.4 Current Scope

1.5 Future Scope

Eukaryotic Gene Structure

Hidden Markov Models

SOFTWARE REQUIREMENTS SPECIFICATION

3.1.1 Purpose of the Project

3.1.2 Scope of the Project

3.2.1 Project Perspective

3.2.2 End User Expectation

3.2.3 General Constraints

3.2.4 Assumptions and Dependencies

3.3.1 Functional Requirements

3.3.2 Non-Functional Requirements

3.3.3 Software System Requirements

3.3.4 Hardware System Requirements

3.4.1 User Interface

input the unannotated genome sequence

idenfication of start codons, promoter region and terminator region

visualize the results

Fig. 3.1 Use-case Diagram

ENTER INPUT FILE(UNANNOTATED DNA SEQUENCE)

FILE IN .fasta FORMAT?

ANY CHARACTERS OTHER THAN A,T, G,C?

PRINT "FILE CONTAINS NON-NUCLEOTIDE CHARACTERS"

Fig 3.2 Flowchart

4.2 System Architectural Design

4.2.1 Chosen System Architecture

Fig 4.1 System Architecture

4.2.2 Discussion of Alternative Designs

4.3 Detailed Description of Components

4.3.1 Input Component

Fig 4.2 Input Component

UNANNOTATED DNA SEQUENCE NO

STARTS WITH '>' ?

PRINT ERROR MESSAGE

ANY CHARACTERS OTHER THAN A,T,G,C?

PRINT "FILE CONTAINS NON-NUCLEOTIDE CHARACTERS"

Fig 4.3 Preprocessing

4.3.3 Computational Component

4.3.4 Output Component

Fig 4.4 Output Component

4.4 User Interface Design

4.4.1 Description of the User Interface

4.4.2 Screen Images

Fig 4.5 Input Screen Shot 1

Fig 4.6 Input Screen Shot 2

Fig 4.4 Output Screen Shot

4.4.3 Objects and Actions

4.5 Test Plan