Unit 3

Prediction methods using nucleic acid sequence
Sequence logo
In bioinformatics, a sequence logo is a graphical representation of the sequence conservation of
nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences). A sequence logo is
created from a collection of aligned sequences and depicts the consensus sequence and diversity of
the sequences. Sequence logos are frequently used to depict sequence characteristics such as
protein-binding sites in DNA or functional units in proteins.
A sequence logo consists of a stack of letters at each position. The relative sizes of the letters
indicate their frequency in the sequences. The total height of the letters depicts the information
content of the position, in bits.
Relative heights of letters reflect their abundance in the alignment.
Total height of stack = entropy-based measure of conservation.
Highly conserved = low entropy = tall stack. Highly variable = high entropy = short stack.
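The relationship between entropy and stack height can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular logo program: it assumes gap-free aligned DNA sequences, takes the stack height as log2(4) minus the Shannon entropy of the column, and omits the small-sample correction that real tools usually apply. The function name logo_heights and the example sites are hypothetical.

```python
import math
from collections import Counter

def logo_heights(aligned, alphabet="ACGT"):
    """Return, for each alignment column, a dict of letter heights in bits.

    Assumes gap-free sequences of equal length (an assumption of this sketch).
    """
    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    columns = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        total = sum(counts[a] for a in alphabet)
        freqs = {a: counts[a] / total for a in alphabet}
        # Shannon entropy of the column; zero-frequency letters contribute nothing
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = max_bits - entropy  # total stack height
        # each letter gets a share of the stack proportional to its frequency
        columns.append({a: f * info for a, f in freqs.items()})
    return columns

# Hypothetical aligned binding sites, for illustration only
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
heights = logo_heights(sites)
```

In this example the first column contains only T, so its stack reaches the full 2 bits, while the third column (three T, one C) has a shorter stack in which T is drawn taller than C.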
A consensus logo is a simplified variation of a sequence logo that can be embedded in text format.
Like a sequence logo, a consensus logo is created from a collection of aligned protein or DNA/RNA
sequences and conveys information about the conservation of each position of a sequence
motif or sequence alignment.
Gene and promoter prediction
A promoter is a regulatory region of DNA located upstream (towards the 5' region) of a gene,
providing a control point for regulated gene transcription. It is at this site that RNA
polymerase binds to initiate transcription. Multiple functional sites are involved in the
binding of the polymerase. Elements such as the TATA box, GC box and CAAT box serve as
binding sites for transcription factors. Analyzing these individual elements within
promoter sites, as well as their combinatorial effects, improves our understanding of
promoter strength and regulation, and thus of gene expression. Transcription factors play
a central role in gene regulation, and many databases are dedicated to them. Promoter
prediction programs work with such databases to predict and identify characteristics of
queried putative promoter sequences.
PROMOTER ELEMENTS
1. Core promoter - the minimal portion of the promoter required to properly initiate
transcription
   Transcription Start Site (TSS)
   Approximately -34
   A binding site for RNA polymerase
   General transcription factor binding sites
2. Proximal promoter - the proximal sequence upstream of the gene that tends to
contain primary regulatory elements
   Approximately -250
   Specific transcription factor binding sites
Prokaryotic promoters
In prokaryotes, the promoter consists of two short sequences at -10 and -35 positions
upstream from the transcription start site.
The sequence at -10 is called the Pribnow box, or the -10 element, and usually
consists of the six nucleotides TATAAT. The Pribnow box is absolutely
essential to start transcription in prokaryotes.
The other sequence at -35 (the -35 element) usually consists of the six
nucleotides TTGACA. Its presence allows a very high transcription rate.
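A simple way to look for these elements is to slide a six-base window along an upstream sequence and count mismatches against the consensus. The sketch below assumes this plain mismatch-counting approach; the function names and the example sequence are hypothetical, and real promoter predictors use richer models such as position weight matrices.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_element(seq, consensus, max_mismatches=1):
    """Return (position, window, mismatches) for windows close to the consensus."""
    seq = seq.upper()
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        d = hamming(window, consensus)
        if d <= max_mismatches:
            hits.append((i, window, d))
    return hits

# Hypothetical upstream sequence containing both consensus hexamers
upstream = "GGCTTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC"
minus10 = find_element(upstream, "TATAAT", max_mismatches=0)  # Pribnow box
minus35 = find_element(upstream, "TTGACA", max_mismatches=0)  # -35 element
```

Positions are 0-based offsets into the input string, not coordinates relative to the transcription start site. Raising max_mismatches finds weaker matches at the cost of more false positives.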
Eukaryotic promoters
Eukaryotic promoters are extremely diverse and are difficult to characterize. They
typically lie upstream of the gene and can have regulatory elements several kilobases
away from the transcriptional start site. In eukaryotes, the transcriptional complex can
cause the DNA to bend back on itself, which allows for placement of regulatory
sequences far from the actual site of transcription. Many eukaryotic promoters
contain a TATA box (sequence TATAAA), which binds a TATA-binding protein
that assists in the formation of the RNA polymerase transcriptional complex. The
TATA box typically lies very close to the transcriptional start site (often within 50
bases).
Promoter predicting tools

A. Prokaryotic
PromoterHunter - part of the phiSITE database, a collection of phage gene
regulatory elements, genes, genomes and other related information, plus tools.
(Reference: Klucar, L. et al. 2010. Nucleic Acids Res. 38(Database Issue): D366-D370).
Promoter Prediction by Neural Network (Martin Reese, Lawrence Berkeley
Laboratory, CA, U.S.A.) - applicable to eukaryotes and prokaryotes (Reference: Reese
MG, 2001. Comput Chem 26: 51-56). The tool is dated, and prokaryote results must be
viewed skeptically.
Promoters: (Softberry) - choose from BPROM (bacterial), TSSP (plant) and TSSG &
TSSW (human)
Virtual Footprint - offers two types of analyses: (a) Regulon analysis - analysis of a
whole prokaryotic genome with one regulator pattern and (b) Promoter analysis -
analysis of a promoter region with several regulator patterns (Reference: R. Münch et
al. 2005. Bioinformatics 21: 4187-4189).
PePPER (University of Groningen, The Netherlands) is a webserver for prediction of
prokaryote promoter elements and regulons (Reference: de Jong, A. et al. 2012. BMC
Genomics 13:299).
SCOPE (Suite for Computational identification Of Promoter Elements) - an ensemble
of programs aimed at identifying novel cis-regulatory elements from groups of upstream
sequences. (Reference: J.M. Carlson et al. 2007. Nucl. Acids Res. 35: W259-W264).
DOOR2 - Database of prOkaryotic OpeRons - offers a high-performance web service
for online operon prediction on user-provided genomic sequences, an intuitive
genome browser to support visualization of user-selected data, and a large database of
transcriptional units. (Reference: X. Mao et al. 2014. Nucleic Acids Res. 42(Database
issue): D654-9).
PATLOC (Pattern Locator) (Institute of Bioinformatics, University of Georgia, U.S.A.) -
a tool for finding sequence patterns in long DNA sequences. For this web-based
service, a restricted version of Pattern Locator is used, which estimates the time needed
for completion of the search and stops if the estimated CPU time exceeds a certain limit
(currently 90 seconds). The CPU time limit was introduced to protect the web
server from overloading due to requests involving overly complex sequence patterns. If
you want to search for Sigma-70 (RpoD)-like promoters, the pattern syntax for your
search is: <>{TTGACA(N)[15:18]TATAAT}[4]. N.B. the [4] allows for 4 mismatches - I
recommend a maximum of two. If you only want one strand screened, omit the <> at the
start. You can restrict the search to intergenic regions (but this will also eliminate
matches that partially overlap with genes) or use the .patvic.txt output file to find where
they are (Jan Mrázek, personal communication).
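To make the meaning of such a pattern concrete, the following sketch searches for TTGACA, a spacer of 15 to 18 arbitrary bases, and TATAAT, allowing a limited number of mismatches and optionally screening both strands. It is not PATLOC's implementation: the function names are hypothetical, and it assumes that the mismatch limit applies to the two hexamers combined.

```python
def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def sigma70_like(seq, max_mismatches=2, spacer=(15, 18), both_strands=True):
    """Return (strand, start, spacer_length, mismatches) for each match.

    Assumption: mismatches are summed over both hexamers. Start positions on
    the '-' strand are offsets into the reverse complement of the input.
    """
    box35, box10 = "TTGACA", "TATAAT"
    seq = seq.upper()
    strands = [("+", seq)]
    if both_strands:
        strands.append(("-", revcomp(seq)))
    hits = []
    for strand, s in strands:
        for i in range(len(s) - len(box35) + 1):
            d35 = hamming(s[i:i + 6], box35)
            if d35 > max_mismatches:
                continue  # the -35 box alone already exceeds the limit
            for gap in range(spacer[0], spacer[1] + 1):
                j = i + 6 + gap  # start of the candidate -10 box
                if j + 6 > len(s):
                    break  # longer spacers would run past the sequence end
                total = d35 + hamming(s[j:j + 6], box10)
                if total <= max_mismatches:
                    hits.append((strand, i, gap, total))
    return hits

# Hypothetical test sequence with a 17-base spacer between the two boxes
promoter = "GGCTTGACAGCTAGCTCAGTCCTAGGTATAATGCTAGC"
hits = sigma70_like(promoter, max_mismatches=0)
```

Passing both_strands=False corresponds to omitting the <> in the pattern, and max_mismatches plays the role of the bracketed number at the end.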
B. Eukaryotic
Not being a eukaryotic molecular biologist I cannot comment on the utility and
accuracy of the following promoter-prediction programs.
EPDnew (Eukaryotic Promoter Database) - a collection of experimentally
validated promoters in human, mouse, D. melanogaster and zebrafish genomes.
Evidence comes from TSS mapping in high-throughput experiments such as CAGE
and Oligocapping. ChIP-seq experiments on H2AZ, H3K4me3, Pol-II and DNA
methylation are also taken into account during the analysis. Includes promoter analysis
tools. (Reference: Dreos, R. et al. 2015. Nucl. Acids Res. 43 (D1):D92-D96).
Neural Network Promoter Prediction (Berkeley Drosophila Genome Project, U.S.A.)
- dated (Reference: M.G. Reese 2001. Comput. Chem. 26: 51-6).
Promoter 2.0 Prediction Server (S. Knudsen, Center for Biological Sequence Analysis,
Technical University of Denmark) - predicts transcription start sites of vertebrate Pol II
promoters in DNA sequences.
PROMOSER - Human, Mouse and Rat promoter extraction service (Boston
University, U.S.A.) - maps promoter sequences and transcription start sites in
mammalian genomes. (Reference: A.S. Halees et al. 2003. Nucl. Acids Res.
31: 3554-59).
Promoter and gene expression regulatory motifs search (Softberry, U.S.A.) - offers a
variety of promoter-scanning programs
GPMiner - identifies promoter regions and annotates regulatory features in user-input
sequences. The proposed promoter identification method, whose predictive sensitivity
and specificity are both ~80%, incorporates the support vector machine (SVM) with
nucleotide composition, over-represented hexamer nucleotides and DNA stability.
Additionally, the input sequence also can be analyzed for homogeneity of experimental
mammalian promoter sequences. After identifying the promoter regions, the
regulatory features such as transcription factor binding sites, CpG islands, tandem
repeats, the TATA box, the CCAAT box, the GC box, over-represented
oligonucleotides, DNA stability and GC-content are graphically visualized to
facilitate the observation of gene promoters.
Gene prediction in eukaryotes and prokaryotes
Gene prediction aims to determine the beginning and end positions of genes in a genome.
There are two important aspects to any program for gene identification: one is the type of
information used by the program, and the other is the algorithm that is employed to combine
that information into a coherent prediction. Three types of information are used in predicting
gene structures: signals in the sequence, such as splice sites; content statistics, such as
codon bias; and similarity to known genes. The first two types have been used since the early
days of gene prediction, whereas similarity information has been used routinely only in recent
years. One of the reasons that the accuracy of gene-prediction programs has improved in the
last few years is the enormous increase in the number of examples of known coding sequences.
This much larger sample size allows more reliable statistical measures to be developed and
makes it much more likely that a newly encountered gene is related to one that has been
identified previously.
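As a minimal illustration of the first type of information (signals), the sketch below reports open reading frames that run from an ATG start codon to the first in-frame stop codon. It assumes an intron-free sequence and scans only the forward strand, so it is closer to a naive prokaryotic gene finder than to a real predictor; content statistics and similarity searches are not modeled. The function name, the min_codons threshold and the example sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    start is 0-based and end is exclusive, so seq[start:end] includes the
    stop codon. min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Hypothetical toy sequence with one short ORF in the third reading frame
dna = "CCATGAAACGCATTTAGGC"
orfs = find_orfs(dna)
```

Here the only ORF found is ATGAAACGCATTTAG; a practical tool would also scan the reverse complement and use a far larger minimum length.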
