From DNA to protein: Why genetic code context of nucleotides for DNA signal processing? A review

Biomedical Signal Processing and Control 34 (2017) 44–63
journal homepage: www.elsevier.com/locate/bspc

Review Article

Muneer Ahmad a,∗, Low Tan Jung b, Al-Amin Bhuiyan a
a College of Computer Sciences, King Faisal University, Saudi Arabia
b Department of Computer Sciences, University Technology PETRONAS, Malaysia
Article info

Article history:
Received 21 July 2016
Received in revised form 6 January 2017
Accepted 9 January 2017
Available online 17 January 2017

Keywords:
Coding regions
1/f noise
Fuzzy logic
3-base periodicity
Digital signal processing

Abstract

Protein coding regions are commonly diffused with non-coding regions due to 1/f background noise in such a way that a viable discernment between the two regions becomes cumbersome. Commonly employed digital signal processing methodologies lack the fundamental genetic code context of nucleotides, since these approaches treat a DNA signal as a normal digital signal that can be processed by traditional DSP tools and techniques.
This paper reviews the prevailing approaches for protein coding regions identification that are based on common DSP concepts, and highlights the importance of considering genetic code context in any computational solution for protein coding regions identification. Nucleotides in a DNA signal carry certain natural characteristics, i.e. presence in a distinctive triplet format, maintaining a distinct structure, owning and further sharing a distribution of densities in codons, fuzzy behaviors, semantic similarities, and an unbalanced nucleotide distribution that produces a relatively high bias for nucleotide usage in coding regions. Computational solutions for protein coding regions identification that exploit these fundamental characteristics of nucleotides can significantly suppress the signal noise and hence contribute to better identification.
© 2017 Elsevier Ltd. All rights reserved.
Contents
1. Introduction
1.1. Coding measure schemes (indicator sequences)
1.1.1. Tetrahedron coding scheme
1.1.2. 4-bit binary coding scheme
1.1.3. Binary coding scheme
1.1.4. Molecular mass coding scheme
1.1.5. Z curve based coding scheme
1.1.6. Pathogenicity islands coding scheme
1.1.7. Entropic segmentation coding scheme
1.1.8. Paired nucleotide representation coding scheme
1.1.9. Integer number representation coding scheme
1.1.10. Autoregressive coding scheme
1.1.11. Gradient source localization coding scheme
1.1.12. Electron-Ion Interaction Potential (EIIP) coding scheme
1.1.13. Paired nucleotide atomic number coding scheme
1.1.14. Complex numbers coding scheme
1.2. Window functions
1.2.1. Rectangular window
1.2.2. Bartlett window
1.2.3. Hanning window
1.2.4. Hamming window
1.2.5. Blackman window
1.2.6. Kaiser window
1.3. Digital filters for coding regions identification
1.3.1. Finite impulse response filter (FIR)
1.3.2. Infinite impulse response filter (IIR)
1.4. Fourier transforms based methods for coding regions identification
1.4.1. Weaknesses of Fourier transforms based methods
1.5. Wavelet transform based methods for coding regions identification
1.5.1. Continuous wavelet transforms
1.5.2. Discrete wavelet transform
1.6. Spectrum estimation
1.6.1. Parametric methods
1.6.2. Non parametric methods
2. From DNA to protein: prevailing methodologies
2.1. Coding measure schemes
2.2. Window filters
2.2.1. Issues in adoption of window filter for coding regions identification
2.2.2. Importance of selecting appropriate factors of window filter
2.3. Noise suppression approaches
3. Analysis and discussion
4. Critical discussion on analysis of results
5. Conclusion
References

Abbreviations: bp, base pair; SNR, signal to noise ratio; DNA, deoxyribonucleic acid; DSP, digital signal processing.
∗ Corresponding author.
E-mail addresses: mmalik@kfu.edu.sa (M. Ahmad), lowtanjung@petronas.com.my (L.T. Jung), mbhuiyan@kfu.edu.sa (A.-A. Bhuiyan).
http://dx.doi.org/10.1016/j.bspc.2017.01.004
1746-8094/© 2017 Elsevier Ltd. All rights reserved.
1. Introduction

DNA is considered a repository that carries the hereditary information of organisms [1,2]. This genetic information is encoded in the DNA sequence in the form of four chemical bases called Adenine, Thymine, Guanine and Cytosine (abbreviated as A, T, G and C, and also known as nucleotide bases) [3,4]. A DNA sequence is composed of these four letters arranged in a specific order [5,6]. Biologically, DNA is found as a twisted double helix that carries two strands of nucleotide bases, in which base "A" always pairs with base "T" and base "G" always pairs with base "C", as shown in Fig. 1 [7].

Fig. 1. DNA double helix structure [7].

Since the nucleotide bases "A and T" and "G and C" are complements of each other, it is enough to take one strand into account for sequence analysis. In this case, a DNA sequence is considered as a long stretch of nucleotide bases (e.g. "ATCCGATCGATCCTAGGCAATG...") arranged in a specific order. Nearly every cell of an organism carries an almost identical copy of its DNA. However, DNA is unique to each species because of the arrangement of nucleotide bases along its strand [8–10]. DNA, which is around three billion bases in length, contains genes whose lengths vary from hundreds to more than two million bases. Human DNA, for instance, contains 20,000–25,000 genes [11,12]. Further, genes are divided into two kinds of regions, coding (called exons) and non-coding (called introns), as shown in Fig. 2.

Fig. 2. Exons and introns in gene.

The coding regions are sequences of nucleotides that actually code for protein, while non-coding regions do not code for protein [12–14]. Identifying these regions is challenging because exons and introns are mingled with each other in such a way that a viable identification of exons remains an open optimization problem in Bioinformatics [15–20].

The correlations, periodicities and patterns in DNA sequences that relate to exons and introns are of great interest to researchers, who apply signal processing techniques and methods to reveal useful facts and solutions [21–25].

It has already been reported that exons show a 3-base property, which produces strong frequency components at 2π/3 (a period of three bases), while introns show relatively weak frequency components there [19,26–28]. Non-uniform codon usage is the reason for this characteristic of exons. The 3-base property is a good measure for coding regions identification and has been exploited using time and frequency domain methods for the prediction of exons in DNA sequences [29–31].

Digital signal processing (DSP) approaches have been widely used in genome sequence analysis, especially in the area of protein coding regions identification [31–33]. Solutions relying on DSP approaches require nucleotide bases to be represented as numerical values and nucleotide sequences to be converted into time domain signals. Once a time domain signal of a DNA sequence is obtained, digital signal processing approaches can be applied to it for coding regions analysis. The subsequent sections present some important notions (algorithms, tools and
approaches) related to the coding regions identification in DNA sequences.

The first section describes a few popular coding measure schemes (also called nucleotide encoding sequences or indicator sequences) used for the numerical representation of DNA sequences.

1.1. Coding measure schemes (indicator sequences)

The coding measure schemes are used to encode a DNA sequence as a digital signal. Several coding measure schemes have been reported in the literature for translating nucleotides into a numeric format for further analysis.

1.1.1. Tetrahedron coding scheme

In this scheme [34], the transform was considered invariant with respect to how the bases (nucleotides) are represented and labeled. It was used as a measure to determine the periodicity in DNA segments. Each nucleotide was placed at one of the four corners of a regular tetrahedron. The vectors drawn from the center to the corners of the tetrahedron represented the four nucleotides.

1.1.2. 4-bit binary coding scheme

This scheme was proposed as part of a neural network optimization prediction method [35]. It uses a 4-bit binary encoding in which nucleotides A, C, G and T are mapped to the binary numbers 1000, 0010, 0001 and 0100 respectively.

1.1.3. Binary coding scheme

This method was proposed by Voss [36]. A binary indicator sequence assigns 1 or 0 for the existence or non-existence of a specific nucleotide at each position of the strand. For example, x[n] = [T T A G G T C C T] translates to [001000000] for Adenine. The binary indicator sequences of the other nucleotides are formed in the same way, so that the sum of all binary indicator sequences is 1 at every position:

$$u_A[n] + u_G[n] + u_C[n] + u_T[n] = 1 \quad \text{for } n = 0, 1, 2, \ldots, N-1.$$

1.1.4. Molecular mass coding scheme

This coding measure scheme [37] is based on a molecular mass representation of a DNA sequence, formed by mapping the four nucleotides to their molecular masses as C = 110, G = 150, A = 134, T = 125.

1.1.5. Z curve based coding scheme

The Z curve coding scheme was proposed [38] as a new Fourier transform approach for protein coding measure based on the format of the Z curve. Z-signals can be used to translate the DNA sequence into a set of signals based on such curves.

1.1.6. Pathogenicity islands coding scheme

The authors [39] employed a simple representation of base pairs in which C and G bases are assigned the numerical value 1 while A and T bases are assigned the numerical value 0, in order to detect G + C patterns in genome sequences.

1.1.7. Entropic segmentation coding scheme

The entropic segmentation method [40] was devised as a binary mapping of a 12-letter nucleotide formation and representation to identify larger segments of DNA for certain patterns.

1.1.8. Paired nucleotide representation coding scheme

Paired nucleotide representation [41] was introduced by assigning binary values to nucleotides in DNA sequences. Certain conventions were adopted to represent bases as numbers. For instance, in the 1st convention, A or T bases are assigned a weight of 0 while C or G bases are assigned 1. Similarly, in the 2nd convention, C or T bases are assigned 0 while A or G bases are assigned 1. In the 3rd convention, G or T bases are assigned 1 while A or C bases are assigned 0. Simply, the scheme thus becomes (A = 1 and T, C or G = 0; T = 1 and A, C or G = 0; C = 1 and A, T or G = 0; G = 1 and A, T or C = 0). Such a representation helped in converting a base sequence to a numerical sequence.

1.1.9. Integer number representation coding scheme

The author [42] introduced genetic signal representation and analysis by proposing a coding measure scheme based on an integer number representation, obtained by mapping the numerals {1, 3, 2, 0} to the four nucleotides as C = 1, G = 3, A = 2, T = 0.

1.1.10. Autoregressive coding scheme

Autoregressive modeling [43] was based on structural properties of DNA sequences. The authors employed propeller twist and DNA-bending stiffness.

1.1.11. Gradient source localization coding scheme

Biologically-inspired gradient source localization was considered in designing this indicator sequence [44]. A single indicator sequence was proposed using the nucleotide weights A = 0, C = 1, G = 3 and T = 2 in a DNA sequence.

1.1.12. Electron-Ion Interaction Potential (EIIP) coding scheme

In this scheme [45], the four binary indicator sequences presented by Voss [36] are replaced by just one indicator sequence, called the EIIP coding scheme, reducing the computational overhead by 75%. The coding scheme assigns A, T, G and C the values 0.1260, 0.1335, 0.0806 and 0.1340 respectively.

1.1.13. Paired nucleotide atomic number coding scheme

The paired nucleotides G-A and C-T were assigned the atomic numbers 62 and 42 respectively [46]. The coding measure scheme was formed by assigning the atomic number of each nucleotide as C = 58, G = 78, A = 70, T = 66 in a DNA sequence.

1.1.14. Complex numbers coding scheme

In this scheme [47], the four binary indicator sequences presented by Voss [36] are replaced by just one indicator sequence, called the Complex coding scheme, reducing the computational overhead by 75%. The coding scheme assigns A, T, G and C the values +1, +j, −1 and −j respectively.

The coding measure schemes produce a time domain signal of a DNA sequence, which can be further analyzed using digital signal processing techniques. Certain Window functions are also required to perform a frequency domain analysis of the signal after its convolution with the Window function. The literature reports that a suitable Window function greatly helps in suppressing noise, enhancing discernment between coding and non-coding regions, and identifying protein coding regions [48].

1.2. Window functions

Reliable identification of protein coding regions is closely associated with applying an appropriate Window function, which enhances the identification and suppresses signal noise. The literature highlights that the following Window functions [47–51] have been particularly used for protein coding regions identification in digital signal processing approaches.

1.2.1. Rectangular window

$$w(n) = \begin{cases} 1, & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
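As a minimal sketch of Eq. (1), assuming a Voss-style binary indicator sequence (Section 1.1.3) as the input signal, the Python below builds the indicator sequence for one nucleotide and weights a segment of it by the Rectangular Window. The function names (`voss_indicator`, `rectangular_window`, `window_segment`) are hypothetical and not taken from the reviewed literature, and the Window is applied here as a pointwise multiplication of the segment with w(n), which is an assumption about how the windowing step is carried out.

```python
def voss_indicator(sequence, base):
    """Binary indicator sequence: 1 where `base` occurs, 0 elsewhere."""
    return [1 if symbol == base else 0 for symbol in sequence]


def rectangular_window(n, M):
    """w(n) from Eq. (1): 1 for 0 <= n <= M - 1, 0 otherwise."""
    return 1 if 0 <= n <= M - 1 else 0


def window_segment(signal, start, M):
    """Weight M samples of `signal`, beginning at `start`, by w(n).

    Assumption: windowing is a pointwise product of the segment and w(n).
    """
    segment = signal[start:start + M]
    return [rectangular_window(n, M) * value for n, value in enumerate(segment)]


# Example sequence from Section 1.1.3: x[n] = T T A G G T C C T
x = "TTAGGTCCT"
u_a = voss_indicator(x, "A")  # indicator sequence for Adenine

# The four indicator sequences sum to 1 at every position.
totals = [sum(column) for column in zip(*(voss_indicator(x, b) for b in "ACGT"))]

# A window of M = 3 samples starting at position 2 covers the bases "AGG".
frame = window_segment(u_a, start=2, M=3)
```

The same `window_segment` helper could be reused with any of the single-sequence encodings of Section 1.1 (for example the EIIP values), since it only assumes a list of numbers.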
Here w(n) is a Rectangular Window function that is nonzero only for the M samples from 0 to M − 1. A DNA segment of interest is convoluted with this fixed-size Window.

1.2.2. Bartlett window

$$w(n) = \begin{cases} \dfrac{2n}{M-1}, & 0 \le n \le \dfrac{M-1}{2} \\ 2 - \dfrac{2n}{M-1}, & \dfrac{M-1}{2} \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

It is also called the Triangular Window and was developed to overcome the abrupt transition at the edges of the Rectangular Window.

1.2.3. Hanning window

$$w(n) = \begin{cases} 0.5\left(1 - \cos\dfrac{2\pi n}{M-1}\right), & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

This Window is defined by a single expression over the whole range of samples (unlike the Bartlett Window, in which the samples are split into two halves) and is zero outside the sample space.

1.2.4. Hamming window

$$w(n) = \begin{cases} 0.54 - 0.46\cos\dfrac{2\pi n}{M-1}, & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

This Window differs from the Hanning Window at the edges of the sample space: the Hamming Window ends at 0.08 rather than tapering to zero, which leaves a small discontinuity that the Hanning Window does not have.

1.2.5. Blackman window

$$w(n) = \begin{cases} 0.42 - 0.5\cos\dfrac{2\pi n}{M-1} + 0.08\cos\dfrac{4\pi n}{M-1}, & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

It contains an additional harmonic term and is otherwise comparable to the Hamming Window over the sample space M.

1.2.6. Kaiser window

$$w(n) = \begin{cases} \dfrac{I_0\left(\beta\left[1 - \left(\dfrac{n-\alpha}{\alpha}\right)^{2}\right]^{1/2}\right)}{I_0(\beta)}, & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

Here $I_0$ is the zeroth-order modified Bessel function of the first kind. The Kaiser Window contains two very important parameters: α and β. Normally, α is chosen over three sample spaces ranging from 0 to 21, 22–50 and 51 to the entire length of the signal. The default value for β, meanwhile, is 0.5, and it determines the leakage factor and side lobe attenuation. Testing different combinations, it can be observed that the optimal value of β may range from 10 to 60 for Window sizes of 120–350 samples.

1.3. Digital filters for coding regions identification

A digital filter is a particular class of discrete system that helps in examining the input signal over certain chosen frequency levels. There are numerous classes of digital filters, but only those that have been frequently employed in protein coding regions identification are described here, namely finite impulse response and infinite impulse response filters.

1.3.1. Finite impulse response filter (FIR)

Filters whose response to an impulse signal is finite are called FIR filters [30–33]. An FIR filter of length K can be described as

$$y[n] = \sum_{k=0}^{K-1} a_k\, x[n-k] \tag{7}$$

Here y represents the transformed vector and x is the input data. The filter takes a summation over the input samples, each multiplied by a constant coefficient. The output vector has the same length as the input vector. K, meanwhile, is the filter length and fixes the order of this filter (K − 1).

$$A(z) = \frac{Y(z)}{X(z)} \tag{8}$$

A(z) is the transfer function of this filter. It is obtained by dividing the z-transform of the output by that of the input. It can also be written as

$$A(z) = \sum_{k=0}^{K-1} a_k z^{-k} = a_0 + a_1 z^{-1} + \ldots + a_{K-1} z^{-(K-1)} \tag{9}$$

This equation is a polynomial in $z^{-1}$ and defines the same FIR filter. These filters are widely used because of their stability.

1.3.2. Infinite impulse response filter (IIR)

This filter has an infinite response to an impulse signal [38,39].

$$y[n] = -\sum_{k=1}^{N-1} a_k\, y[n-k] + \sum_{k=0}^{M-1} b_k\, x[n-k] \tag{10}$$

Here y represents a vector of length n that contains the transformed values of the IIR filter. The filter uses two kinds of coefficients, namely feedback coefficients $a_k$ and feedforward coefficients $b_k$.

$$H(z) = \frac{Y(z)}{X(z)} = \frac{B(z)}{A(z)} = \frac{\sum_{k=0}^{M-1} b_k z^{-k}}{1 + \sum_{k=1}^{N-1} a_k z^{-k}} \tag{11}$$

H is the transfer function in the z-domain, obtained when the z-transform of the output is divided by that of the input. The main differences between the two filters are stability, bandwidth and order of filter. The IIR filter with its extensions is widely used in DSP techniques for DNA signal analysis.

Further, Fourier transform based methods have also been widely employed in protein coding regions identification to reveal certain periodicities in a DNA signal for discernment between coding and non-coding regions.

1.4. Fourier transforms based methods for coding regions identification

Fourier transform based methods have been widely used for protein coding regions identification [52–67]. The transformation maps a complex valued function of a real variable to another complex valued function, or, more simply, maps a time domain function to a frequency domain function. At this point, the Fourier transform is normally used to visualize the frequency components of a signal [52]. It helps in the better understanding of a time-domain signal since the timed information at many instances
may not glimpse the nature, behavior and function of signal which Where x (t) is a function of time domain signal, (t) is a mother
can be better approximated using a frequency domain analysis. wavelet and |a| is normalizing (smoothing) factor for this wavelet.
1 t − b
∞ a,b (t) = √ dt (15)
a a
X(f ) = x(t)e−j2ft dt (12)
The child wavelets, meanwhile, are the detailed translated ver-
−∞
sions of the mother wavelet. A relatively fast transform of signal
Where x (t) is a continuous signal sampled over discrete time can be achieved employing the above equation as compared to the
intervals (nucleotide samples in a specified gene) and X (f) is a Fourier transform.
vector representing the frequency components of DNA signal.
1.5.2. Discrete wavelet transform

N−1 The discrete wavelet transform [60] involves the discretization
Xk = xn e−j2kn/N k = 1, 2, ..., N (13) concepts of continuous transform in which the discrete coefficients
n=0 can be calculated using the equation.

The above expression is the Discrete Fourier Transform (DFT) Xa,b = Xj,k = x[n]gj,k [n] (16)
[53] of DNA signal, in which xn represents a DNA signal sampled n∈Z
over N points. The exponent e serves as the cube root of unity and
Wherea = 2j , b = k2j , j ∈ N, k ∈ Z.
also provides the sinusoidal components of signal. Xk , meanwhile,
In this context, a convolution operation with a scaled wavelet is
stores the coefficients of this transformation which later can be
repeatable so that a set of approximate and detail coefficients can
used for analysis of frequency, magnitude and power of signal seg-
be obtained with different iterations. The discrete transform can be
ments. Short Time Fourier Transform (STFT) is also used in the same
defined as,
context which involves the concept of Windowing the DFT of a
signal [54]. 1−˛
Pk = (17)
1 + ˛2 − 2˛ cos(2k/N)
1.4.1. Weaknesses of fourier transforms based methods where k is stated as frequency index and ␣ as noise index.
The Fourier transform of a digital signal presents the frequency The DNA signal convoluted with a suitable Window function
components of signal (also known as frequency domain analy- objectively suppresses the signal noise to a great extent. It seg-
sis). A signal is expressed as a series of sine and cosine harmonics ments the signal for further analysis employing power spectrum
which only report the frequency components of signal without any depiction of time-domain analysis, i.e. the appearance of signal segments at certain instances of time [55]. This problem can be stated in terms of the Heisenberg uncertainty principle [56], which describes the impossibility of obtaining exact frequency and time information of signal components simultaneously. Further, the concept of spectral leakage is directly associated with frequency analysis of a signal employing discrete Fourier transforms [57]. In the spectral leakage phenomenon, some energy from the main lobe leaks into the side lobes, which makes it hard to clearly distinguish the main lobe from the side lobes. The degree of relevancy of side lobes with the main lobe is called signal attenuation. It causes energy to leak from the sharp energy levels to the low levels. Spectral leakage can be reduced by convolving signal components with appropriate Window functions [58].

1.5. Wavelet transform based methods for coding regions identification

The wavelet transform based methods have been reported to be very useful for background noise suppression of DNA signal. The limitation stated by the Heisenberg uncertainty principle is addressed by the concept of wavelets [59]. The wavelet transforms help to choose a globally scaled Window function for better signal analysis. The stated Window function can be used for a piecewise analysis of signal by changing the Window parameters instantly, which mark the frequency components' resolution at different scale levels [59,60]. At a broad level, there are two types of wavelet transforms.

1.5.1. Continuous wavelet transforms

The continuous wavelet transform [59] can be defined as

$$X_{a,b} = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{+\infty} x(t)\, \psi\!\left(\frac{t-b}{a}\right) dt \qquad (14)$$

density estimation. The spectrum estimation methods analyze the DNA signal for a viable discrimination between coding and non-coding regions.

1.6. Spectrum estimation

Spectrum estimation of a digital signal is commonly performed by a series of steps including discrete Fourier analysis, magnitude calculations of DFT components [61] and power estimation of the same signal components. A spectral estimation graph is plotted with two parameters, i.e. frequency and amplitude components of signal. Such analysis greatly helps to distinguish the amplitude strength of coding regions from that of non-coding regions. Spectral estimation comprises several different methods as follows.

1.6.1. Parametric methods

The parametric methods are considered more suitable when signal length is relatively short. Ultimately, the methods model the genomic data of system employing some real characteristics with involvement of white noise. The autoregressive process, i.e. an all-pole model, is the most common linear system used for the parametric estimation of signal [62].
The Yule-Walker method, in addition, generally calculates the autoregressive parameters based on a biased estimate of the autocorrelation function [63]. This method solves the least squares approximation of the system with a rectangular window applied to the data. It outputs the results as maximum entropy estimates.
Furthermore, the Burg method is based on the minimization of forward and backward prediction errors [63]. It usually avoids the calculation of an autocorrelation function. It is useful for signals with less noisy channels, may ensure a stable autoregressive model and is considered to be computationally efficient.
The covariance method, on the other hand, is concerned with the minimization of the forward prediction errors, while the modified covariance method minimizes the forward as well as backward prediction errors [64].
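As an illustration of the Yule-Walker approach outlined above, the following Python sketch estimates autoregressive coefficients from the biased autocorrelation and evaluates the corresponding all-pole power spectrum. It is a minimal sketch and not the procedure of [63]; the function name `yule_walker_psd`, the model order and the toy signal are assumptions chosen for illustration.

```python
import numpy as np

def yule_walker_psd(x, order, n_freq=512):
    """AR(order) fit via the Yule-Walker equations (biased autocorrelation).

    Returns the AR coefficients a, the driving-noise variance, the
    frequency grid (cycles per sample) and the all-pole power spectrum.
    Hypothetical helper written for illustration only.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimate r[k], k = 0..order
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R a = r[1:], for the model x[t] = sum_k a[k] x[t-k] + e[t]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:])
    sigma2 = r[0] - np.dot(a, r[1:])          # variance of the white-noise input
    freqs = np.linspace(0.0, 0.5, n_freq)
    k = np.arange(1, order + 1)
    denom = np.abs(1.0 - np.exp(-2j * np.pi * np.outer(freqs, k)) @ a) ** 2
    return a, sigma2, freqs, sigma2 / denom

# Toy signal with a period-3 component plus white noise (assumed data)
rng = np.random.default_rng(0)
t = np.arange(3000)
toy = np.cos(2 * np.pi * t / 3) + 0.1 * rng.standard_normal(t.size)
a, sigma2, freqs, psd = yule_walker_psd(toy, order=2)
peak_frequency = freqs[np.argmax(psd)]        # expected to lie close to 1/3
```

For a numerically encoded DNA segment, a spectral peak near 1/3 cycles per sample would correspond to the 3-base periodicity discussed in this review; the model order would have to be chosen for the data at hand.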
1.6.2. Non parametric methods
Non parametric methods [65] usually involve spectral estimation using PSD methods. The periodogram method, for instance, calculates the power spectral density of a signal as follows.

$$S = P_{xx}(f) = \frac{|X_L(f)|^2}{f_s L}$$

where

$$X_L(f) = \sum_{n=0}^{L-1} x_L(n)\, e^{-2\pi j f n / f_s} \qquad (18)$$

The above expression is the most general form for the calculation of power spectrum estimation of a signal.
Another enhanced method is called the modified periodogram, which Windows the time-domain signal prior to performing calculations for fast Fourier transforms [66]. This helps in reducing spectral leakage. Welch's method is another improved method which operates by dividing the time series of data into equally sized segments. Each segment is then passed through the steps of the modified periodogram [67], and the power spectral estimates of the segments are averaged.
Lastly, the multitaper method operates by filtering the signal through a filter bank of FIR bandpass filters. In this method, the periodogram of each filter output is essential for the calculation of the power spectral density estimate of the entire signal [67].
Commonly, digital signals are convolved with fixed length Window filters for signal analysis, but DNA signals differ in nature and characteristics from other signals due to their nucleotide contents. DNA signals formed from DNA sequences contain a specific order of nucleotides with certain frequencies and mostly depict an unbalanced distribution of nucleotides. Interestingly, the nucleotides of a DNA sequence also cause 3-base periodicity while forming a protein sequence, which is also evidence for the biological context of DNA signals in terms of coding regions identification. Here, the exons are special regions that appear as sequences of nucleotides that code for protein while introns are regions that do not code for protein [68,69]. The identification of coding regions is strongly tied with 1/f background noise that mingles these two regions so that a clear discrimination between the regions becomes challenging.
In this context, a general framework [36,45,47,50,61] for protein coding regions identification can be demonstrated as follows.
Fig. 3 presents a general framework for coding regions identification. The components of this framework are: selection and preprocessing of DNA sequence, signal formation employing a suitable DNA encoding scheme, application of an appropriate procedure for denoising DNA signal, signal segmentation using a suitable Window function, energy calculation of segments, segments' synthesis and coding regions identification. The subsequent section highlights the most important components focused on by researchers for enhancing coding regions identification in terms of noise suppression and spectral leakage minimization. These are,

• A coding measure scheme that encodes a DNA sequence to a time domain signal applying a numerical mapping to nucleotides.
• A Window function convolved with DNA signal to suppress signal noise and to increase the discrimination measure between coding and non-coding regions.
• Background noise suppression of DNA signal employing digital signal processing approaches, i.e. digital filters, Fourier transforms and Wavelet transforms etc.

It is interesting to note that only approximately 5% of a DNA sequence (i.e. coding regions) is translatable to protein sequence [2–4]. Even this small scale translation further becomes challenging due to strong diffusion of coding regions with non-coding regions by a 1/f background noise. A number of predictive models based on digital signal processing techniques suffered due to the presence of background noise in DNA sequences. This sequence noise mingles the two regions in such a way that a viable discrimination turns into a real challenge [7,70]. In this context, specific protein structure and function remain unknown if coding regions of a DNA segment are not significantly translated to protein segment. Such unidentified protein structure and function hinder our way to provide good remedies against several diseases despite the rich repositories of genomic sequences available nowadays.

2. From DNA to protein: prevailing methodologies

2.1. Coding measure schemes

The identification of periodicities (of different types) in DNA sequences has been an active area of research recently. These periodicities can be observed using digital signal processing techniques. The application of DSP approaches to DNA sequences for mining useful information (e.g. DNA sequence analysis and protein coding regions identification) is highly dependent on translation of a DNA segment to a corresponding digital signal [36,45,47]. It has been widely observed [33–47,51] that the coding measure scheme (synonymously called DNA coding measure scheme or DNA indicator sequence) carries a substantial importance in identifying protein coding regions efficiently from non-coding regions by reducing 1/f noise. This numerical representation depicts the biological characteristics of DNA in numerical domain.
The first coding measure scheme was the tetrahedron mapping scheme proposed by Silverman and Linsker [34]. In this scheme, the transform was considered as invariant to represent and label the nucleotide bases. It was used as a measure to determine the periodicity in DNA segments. The four nucleotides were placed along the four corners of a regular tetrahedron. The vectors drawn from the center to the corners of the tetrahedron represented each of the four nucleotides.
Later, Demeler and Zhou [35] proposed a neural network optimization prediction method using 4-bit binary encoding of nucleotides A, C, G, and T mapped into binary numbers in the form 1000, 0010, 0001 and 0100 respectively. Another coding measure scheme called ‘Binary indicator sequence’ was proposed by Voss [36], consisting of four indicator sequences each representing one of the four nucleotides. The binary indicator sequence assigns 1 and 0 for the presence or absence of a specific nucleotide in the strand. For example, x[n] = [T T A G G T C C T] translates to [001000000] for the nucleotide Adenine. Similarly, the binary indicator sequences of other nucleotides can be defined such that the sum of all binary indicator sequences is 1.

$$u_A[n] + u_G[n] + u_C[n] + u_T[n] = 1 \quad \text{for } n = 0, 1, 2, \ldots, N-1.$$

Based on this criterion, the DNA sequence can be mapped with these four indicator sequences that help us to translate the DNA sequence to a digital signal for further frequency based analysis.
Stanley et al. [37] presented a coding measure scheme based on molecular mass representation of a DNA sequence formed by mapping the nucleotides to their four molecular masses as C = 110, G = 150, A = 134, T = 125, respectively. Similarly, Yan et al. [38] proposed a new Fourier transform approach for protein coding measure based on the format of Z curve. Z-signals can be used to translate the DNA sequence into a set of signals based on such curves. In the same way, Lio and Vannucci [39] narrated pathogenicity islands and gene transfer events in genome data. The authors also employed certain conventions to represent bases as numbers. For instance, in 1st convention, A or T bases are assigned
Fig. 3. Framework for coding regions identification.
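The signal formation and energy calculation steps of this framework can be sketched in Python as follows. The sketch uses the binary indicator sequences of Voss [36] as the encoding and reproduces the Adenine example given in the text; the function names and the ratio used as a discrimination measure are illustrative assumptions and are not taken from the reviewed papers.

```python
import numpy as np

def voss_indicators(seq):
    """Return the four binary indicator sequences u_A, u_C, u_G, u_T."""
    seq = seq.upper()
    return {b: np.array([1.0 if s == b else 0.0 for s in seq]) for b in "ACGT"}

def period3_ratio(seq):
    """Power at k = N/3 summed over the four indicators, relative to the
    mean power over k = 1 .. N/2 - 1 (an illustrative measure only)."""
    n = len(seq) - len(seq) % 3            # trim so that N/3 is an integer bin
    ind = voss_indicators(seq[:n])
    power = sum(np.abs(np.fft.fft(u)) ** 2 for u in ind.values())
    return power[n // 3] / power[1:n // 2].mean()

# Example from the text: indicator of Adenine for x[n] = T T A G G T C C T
u = voss_indicators("TTAGGTCCT")
adenine = u["A"].astype(int).tolist()      # [0, 0, 1, 0, 0, 0, 0, 0, 0]
```

A strongly periodic toy sequence such as a repeated codon yields a large ratio, whereas a sequence without 3-base periodicity is expected to give a ratio of order one.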
a weight 0 while C or G bases with 1. Similarly, C or T bases are assigned 0 while A or G bases with 1 in 2nd convention. In the 3rd convention, G or T bases are assigned 1 while A or C bases with 0. Simply, the scheme thus becomes (A = 1 and T, C or G = 0; T = 1 and A, C or G = 0; C = 1 and A, T or G = 0; G = 1 and A, T or C = 0). Such representation helped in conversion of base sequence to a numerical sequence.
Bernaola-Galván et al. [41] proposed a solution for finding boundaries between coding and noncoding DNA regions by an entropic segmentation method using a binary mapping of 12-letter nucleotide formation and representation. Later, Zhang and Wang [71] introduced concepts of recognition for protein coding genes in the yeast genome at better accuracy based on Z curve. The Z-curves were used to reveal the distribution of bases at relevant codon positions. Dodin et al. [72] proposed a Fourier and wavelet transform analysis as a tool for visualizing regular patterns in DNA sequences. The correlation function was employed to relate a base with its neighbors in a way that calculates the score as 1 when the two bases are identical and 0 otherwise. Likewise, Anastassiou [73] presented a frequency-domain analysis of DNA sequences by their representation in the form of nucleotide base values. The assignment of numerical values to the alphabets of a DNA string helped to apply digital signal processing approaches for further analysis of frequency contents of signal.
Later, a paired nucleotide representation given as binary values to nucleotides in DNA sequences was proposed by Bernaola-Galván et al. [41]. In this criterion, the authors used certain conventions to represent bases as numbers. For instance, in 1st convention, A or T bases are assigned a weight 0 while C or G bases with 1. Similarly, C or T bases are assigned 0 while A or G bases with 1 in 2nd convention. In the 3rd convention, G or T bases are assigned 1 while A or C bases with 0. Simply, the scheme thus becomes (A = 1 and T, C or G = 0; T = 1 and A, C or G = 0; C = 1 and A, T or G = 0; G = 1 and A, T or C = 0). Such representation helped in conversion of base sequence to a numerical sequence. In the same way, Cristea [42] presented genetic signal representation and analysis by proposing a coding measure scheme based on Integer Number representation obtained by mapping numerals {1, 3, 2, 0} respectively to the four nucleotides as C = 1, G = 3, A = 2, T = 0.
Berger et al. [74] performed power spectrum analysis for DNA sequences. The authors introduced a binary mapping of 12-letter nucleotide formation and representation depicting the base composition for each codon position. Similarly, Chakravarthy et al. [26] proposed a coding measure scheme based on autoregressive modeling and feature analysis of DNA sequences. The nucleotide mappings thus become C = 0.5, G = −0.5, A = −1.5, T = 1.5, in which each of the C-G and A-T pairs holds the complementary property.
Likewise, Ranawana and Palade [76] proposed a multi-classifier system based on neural networks to identify genes in a DNA sequence. The nucleotide sequence was mapped by a 2-bit binary representation using binary numbers in the form 00, 11, 10, 01 that resulted in a 1-dimensional indicator sequence.
Similarly, Nair and Mahalakshmi [75] depicted a visualization of genomic data using inter-nucleotide distance signals. In such representation, a base symbol was replaced by the base distance, which was represented in the form of a one dimensional indicator sequence. Rosen [44] presented a signal processing approach for biologically-inspired gradient source localization and DNA sequence analysis. In this scheme, a single indicator sequence was proposed using nucleotide weights as A = 0, C = 1, G = 3, and T = 2 in a DNA sequence.
Nair and Sreenadhan [45] noticed certain frequency leakage that occurred when employing the Voss [36] coding scheme and proposed a coding measure scheme based on Electron-ion interaction pseudo potential (EIIP). This indicator sequence replaced the four binary indicator sequences by just one sequence, which was stated as the EIIP indicator sequence. The energy of delocalized electrons in amino acids and nucleotides had been calculated, which showed the peak at the right location (near N/3). At these instances, the binary sequences were found to fail. The experiments employing sliding Windows, done over common data sequences, showed that in a number of cases the EIIP indicator sequence gave a better discrimination between coding and non-coding regions. It was shown that a coding measure scheme using the EIIP indicator sequence could be utilized for gene finding procedures using genomic signal processing.
Later, Holden et al. [46] presented ATCG nucleotide fluctuation by using a representation of Paired Nucleotide Atomic Numbers. The pairs were assigned atomic numbers G, A = 62 and C, T = 42 respectively. The coding measure scheme was designed with assignment of atomic numbers to nucleotides as C = 58, G = 78, A = 70, T = 66.
In the same way, a 2-simplex mapping was proposed by Grandhi and Kumar [77] to identify exons in a DNA sequence. This triangle based mapping scheme represented the four bases of nucleotides along the vertices and center of a triangle. All bases were substituted with relevant vector values. This sequence was then passed through an IIR filter with a passband chosen so as to conserve the 3-base property. The method reduced the complexity to half as compared to the Voss method [36].
Similarly, Yin and Yau [49] introduced a numerical representation of segments of a DNA sequence based on complex numbers. Since there are 64 genetic codons to approximate 20 amino acids, the 20 amino acids were mapped with 20 distinctive complex numbers. The characteristics of amino acids actually represented the real and imaginary parts concerned with such complex numbers. For example, the codons from the DNA sequence ATGGTCCCG, assessed from reading frames ATG, TGG, GGT, GTC, TCC, CCC, CCG, with numerical vector of amino acids are [1.18 + 162i, 2.65 + 227.0i, 0.07 + 60.0i, 1.32 + 141.4i, 0.05 + 88.7i, 1.95 + 122.2i, 1.95 + 122.2i].
Keeping the track, Akhtar et al. [50] exploited the key statistical property applied to the rich collection of nucleotide contents (C < G and A < T) and employed DNA conventions with real numbers. For instance, real number “1” referred to T = 0, C = 1, A = 2, G = 3 and real number “2” referred to A = 0, G = 1, C = 2, T = 3, while real number “3” referred to T = −1.5, C = 0.5, A = 1.5, G = −0.5. Later, Hota and Srivastava [47] further improved the scheme presented in [45] by introducing a new mapping scheme named “Complex scheme”. This coding scheme used the complex numbers 1, −1, j and −j assigned to the four bases A, G, T and C, respectively. It was noticed that 75% of the computational overhead was reduced as compared with the Voss [36] scheme. Later, Kwan et al. [51] proposed new coding measure schemes based on Fourier transform of the DNA digital signal. The scheme was named Complex Twin-Pair with numerical values of nucleotides as (C, G = −1; A, T = j), where j is the imaginary unit.
Table 1 presents different coding measure schemes with context, applicability and robustness. Summarizing the coding measure schemes proposed in literature, Silverman and Linsker [34] proposed the first coding scheme, named the tetrahedron mapping scheme. Later, another coding scheme named “Binary indicator sequence” was introduced by Voss [36]. Researchers [77–79] subsequently employed this scheme for sequence analysis in digital signal processing methodologies. Nair and Sreenadhan [45] observed some frequency leakage from the spectrum because of applying the Voss coding scheme [36] and addressed these drawbacks by introducing a new coding scheme named “EIIP scheme”. The scheme replaced four binary schemes by a single scheme. This scheme was based on delocalized electrons in certain amino acids and the authors calculated nucleotide values as EIIP. Several researchers [47,80] employed EIIP to address certain issues that appeared in exons identification and DNA sequence analysis. Another important coding scheme, named “Complex mapping scheme”, was designed by Hota and Srivastava [47]. This scheme was the outcome of further improving the results depicted by the authors of [45] based on the EIIP indicator sequence. The Complex mapping scheme is considered equally well suited in DNA sequence analysis for coding regions identification [49,59,81], in context of minimizing the computational overhead (involved in taking the Voss coding scheme in exons identification).
It can be observed that the different coding schemes introduced in the literature are mainly based on chemical property based methods [71], fixed mapping methods [35,39,76] and statistical property based methods [44,72,75,81]. The coding measure schemes proposed in literature lack the important nucleotides' significance in context of genetic code information (i.e. presence as a distinctive triplet format, maintaining distinct structure, owning and further sharing distribution of densities in codons, fuzzy behaviors, semantic similarities, unbalanced nucleotides' distribution) and inter-nucleotide distance for proposing new DNA coding schemes. Sadikin et al. [82] have recently considered hierarchical clustering and inter-nucleotide distance for identification of fractal patterns (in DNA sequences). The authors have taken advantage of numerically mapping a DNA sequence to a digital signal to calculate inter-nucleotide distance.
From the discussions above, it can be summarized that DNA encoding schemes that are based on genetic code context of nucleotides would significantly help in DNA sequence analysis, especially for coding regions identification [81]. The authors of this research could not find satisfactory literature related to DNA encoding schemes that were designed purely based on genetic code context of nucleotides, i.e. nucleotide/amino acid information such as nucleotides' density distribution in codons, their relevant positions, codon composition and relevance of nucleotides with amino acids (degeneracy) etc.
Since nucleotides appear in the codon in a distinctive triplet format, it is interesting to reveal that each nucleotide has a special place in the codon and its occurrence in the codon determines the type of protein it produces; for instance the codon “ATG” produces amino acid “Methionine” while codon “TAG” does not code for an amino acid. A change in the occurrence of a nucleotide in a codon can change the amino acid in the protein sequence. Further, the DNA code is degenerate, meaning that some amino acids can be coded by more than one codon. Nucleotides in the codon preserve their structures (each nucleotide owns a particular place in the codon) in a unique fashion but, on the other hand, the same nucleotide may appear more than once in the codon cluster space [83]. The occurrence of a certain nucleotide in a cluster may be thrice, twice, singleton or zero.
Such natural characteristics of nucleotides in codons can be exploited to propose a coding measure scheme that could incorporate the occurrence of nucleotides in codons, their density distribution, specified locations and composition in codons (w.r.t. codon structure). We can term the sharing/overlapping of nucleotides' contribution in codons as a fuzzy behavior, since a single nucleotide can have either no density, single/double or triple density in a codon, and these densities also overlap when we draw a correlation between codons in terms of nucleotides' density distribution.

2.2. Window filters

The authors of this research reviewed the literature to seek common Window filters used for DNA signal processing in context of identifying protein coding regions. It was observed that a series of published papers addressing coding regions identification report the employment of conventional Window filters of some fixed length. In contrast, we could not find satisfactory literature related to Window filters based on genetic context of code [9] and unbalanced nucleotides' distribution that produces a high bias for nucleotides' usage in coding regions [10,11].
Table 2 describes different Window filters with proposed window length employed for tracing codons. We observed that the conventional Rectangular Window filter with filter length 351 bp was employed by Sahu and Panda [84], Hota and Srivastava [86], Bergen and Antoniou [88], Hota and Srivastava [90], Tiwari et al. [92], Yin and Yau [49], Kotlar and Lavner [93], Akhtar et al. [50], Datta and Asif [98], Kakumani et al. [96], Akhtar et al. [99], and George and Thomas [101]. Similarly, Chavan et al. [87], Andreas [89], Oppenheim and Schafer [91], and Nair and Sreenadhan [45] employed the Kaiser Window filter with length 351 bp. In the same way, Shakya et al. [48], Gunawan [94] and Datta and Asif [95] used the Bartlett Window filter with the same length, i.e. 351 bp. The Gaussian Window filter with filter length of 351 bp was employed by Mena-Chalco et al. [100], and Abbasi et al. [102] used the Hamming Window filter with filter length 351 bp.

2.2.1. Issues in adoption of window filter for coding regions identification
It can be observed that Rectangular, Kaiser and Bartlett Windows have been mostly used with a fixed Window length of 351 base pairs. Window filters (having a suitable Window length) play a very important role in digital processing based approaches for coding regions identification. The frequency components at period (N/3) reveal pronounced peaks of coding regions in power spectral estimation of DNA signal.
For instance, Table 3 glimpses important factors, i.e. filter length, leakage factor and relative side lobe attenuation, for selection of an appropriate Window filter for coding regions identification. The
Table 1
Coding measure schemes with context, applicability and robustness.
Coding Scheme | Context (Genetic/Non-genetic) | Applicability | Robustness (Scope)
Tetrahedron mapping scheme [34] | Non-genetic | DNA | Mostly Nucleotides
Neural network optimization predictor [35] | Non-genetic | DNA | Mostly Nucleotides
Binary indicator sequence [36] | Non-genetic | DNA | Mostly Nucleotides
Molecular mass scheme [37] | Genetic | DNA/RNA | Mostly Polymeric
Autoregressive modeling [38] | Non-genetic | DNA | Mostly Nucleotides
Pathogenicity islands scheme [39] | Non-genetic | DNA | Mostly Nucleotides
Entropic segmentation scheme [40] | Non-genetic | DNA | Mostly Nucleotides
Z-curve nucleotide position based scheme [71] | Non-genetic | DNA | Mostly Nucleotides
Correlation function based scheme [72] | Non-genetic | DNA | Mostly Nucleotides
FDA based scheme [73] | Non-genetic | DNA | Mostly Nucleotides
Paired nucleotide representation based scheme [41] | Genetic | DNA/RNA | Mostly Polymeric
Integer Number representation based scheme [42] | Non-genetic | DNA | Mostly Nucleotides
Binary mapping of 12-letter based scheme [74] | Non-genetic | DNA | Mostly Nucleotides
Autoregressive modeling based scheme [26] | Non-genetic | DNA | Mostly Nucleotides
2-bit binary representation based scheme [76] | Non-genetic | DNA | Mostly Nucleotides
Inter-nucleotide distance based scheme [75] | Genetic | DNA/RNA | Mostly Polymeric
Biologically-inspired gradient source localization based scheme [44] | Genetic | DNA/RNA | Mostly Polymeric
Electron-ion interaction pseudo potential based scheme [45] | Genetic | DNA/RNA | Mostly Polymeric
Paired Nucleotide Atomic Number representation based scheme [46] | Genetic | DNA/RNA | Mostly Polymeric
2-simplex mapping scheme [77] | Non-genetic | DNA | Mostly Nucleotides
Complex numbers based scheme [49] | Non-genetic | DNA | Mostly Nucleotides
Statistical property based scheme [50] | Non-genetic | DNA | Mostly Nucleotides
Complex indicator sequence [47] | Non-genetic | DNA | Mostly Nucleotides
Complex Twin-Pair [51] | Non-genetic | DNA | Mostly Nucleotides
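Several of the single-sequence schemes surveyed in Section 2.1 are plain lookup tables from nucleotides to numbers, so they can be sketched with a dictionary per scheme. The numerical values below are the ones quoted in the text for [42], [26], [47] and [37]; the dictionary keys and the `encode` helper are hypothetical names introduced for illustration.

```python
import numpy as np

# Nucleotide-to-number mappings as quoted in Section 2.1.
# The scheme names used as keys are illustrative labels only.
MAPPINGS = {
    "integer":        {"C": 1, "G": 3, "A": 2, "T": 0},            # Cristea [42]
    "real":           {"C": 0.5, "G": -0.5, "A": -1.5, "T": 1.5},  # Chakravarthy et al. [26]
    "complex":        {"A": 1, "G": -1, "T": 1j, "C": -1j},        # Hota and Srivastava [47]
    "molecular_mass": {"C": 110, "G": 150, "A": 134, "T": 125},    # Stanley et al. [37]
}

def encode(seq, scheme):
    """Map a DNA string to a one-dimensional numeric signal."""
    table = MAPPINGS[scheme]
    return np.array([table[base] for base in seq.upper()])

signal = encode("ATGGTCCCG", "integer")
```

Each scheme yields one numeric sequence per DNA sequence, in contrast to the four sequences of the binary indicator scheme, which is the source of the reduced computational overhead reported for single-sequence mappings.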
Table 2
Window filters with proposed window length employed for tracing codons.
Author(s) | Window filter used | Proposed Window length | In conjunction with other tools/approaches
Sahu and Panda [84] | Rectangular | 351 bp | A time-frequency based digital filters
Shakya et al. [85] | Bartlett | 351 bp | Notch digital filter
Hota and Srivastava [86] | Rectangular | 351 bp | Discrete Fourier transform with EIIP coding scheme
Chavan et al. [87] | Kaiser | 351 bp | FIR digital filter
Bergen and Antoniou [88] | Rectangular | 351 bp | Short time Fourier transforms
Andreas [89] | Kaiser | 351 bp | Fourier transforms
Hota and Srivastava [90] | Rectangular | 351 bp | Short time Fourier transforms
Oppenheim and Schafer [91] | Kaiser | 351 bp | Fourier transforms
Tiwari et al. [92] | Rectangular | 351 bp | Fourier analysis
Nair and Sreenadhan [45] | Kaiser | 351 bp | Fourier analysis
Anastassiou [2] | Rectangular | 351 bp | Fourier analysis
Kotlar and Lavner [93] | Rectangular | 351 bp | Fourier analysis
Akhtar et al. [50] | Rectangular | 351 bp | Fourier analysis
Gunawan [94] | Bartlett | 351 bp | Fourier analysis
Shakya et al. [64] | Bartlett | Adaptive | Digital filter
Datta and Asif [95] | Rectangular | 351 bp | Fourier transforms
Kakumani et al. [96] | Rectangular | 351 bp | Fourier transforms
Tuqan and Rushdi [97] | Rectangular | 234 bp | Digital filter
Datta and Asif [98] | Bartlett | 351 bp | Fourier analysis
Akhtar et al. [99] | Rectangular | 351 bp | Fourier analysis
Mena-Chalco et al. [100] | Gaussian | 351 bp | Short time Fourier transform
George and Thomas [101] | Rectangular | 351 bp | Fourier transform
Abbasi et al. [102] | Hamming | 351 bp | FIR Digital filter
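The common use of a fixed 351 bp Window listed in Table 2 can be sketched as a sliding-window measurement of the DFT power at k = N/3. The sketch below is a minimal illustration and does not reproduce any of the cited methods; the window functions are those provided by NumPy, and the helper name `sliding_period3_power`, the Kaiser beta of 3.5 and the toy signal are assumptions.

```python
import numpy as np

# Window generators keyed by illustrative names (all from NumPy)
WINDOWS = {
    "rectangular": np.ones,
    "bartlett": np.bartlett,
    "hamming": np.hamming,
    "kaiser": lambda m: np.kaiser(m, 3.5),
}

def sliding_period3_power(signal, window="rectangular", length=351, step=1):
    """Slide a window of `length` samples over a numeric DNA signal and
    return the DFT power at k = length/3 for every window position."""
    signal = np.asarray(signal, dtype=complex)
    w = WINDOWS[window](length)
    k = length // 3                        # 351 is a multiple of 3, so k = 117
    kernel = np.exp(-2j * np.pi * k * np.arange(length) / length)
    starts = range(0, len(signal) - length + 1, step)
    return np.array([abs(np.dot(signal[s:s + length] * w, kernel)) ** 2
                     for s in starts])

# Toy signal (assumed data): a period-3 stretch flanked by flat regions
toy = np.concatenate([np.zeros(600), np.tile([1.0, 0.0, 0.0], 200), np.zeros(600)])
power = sliding_period3_power(toy, window="rectangular")
```

Plotting `power` against the window position gives the kind of curve in which coding regions are expected to appear as peaks; changing the `window` argument allows the different Window filters of Table 2 to be compared on the same signal.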
power spectral estimation graph (for gene F56F11.5 having 5 coding regions) shows prominent peaks of coding regions in adopting suitable factors of Window filter.
It can be observed that a suitable Window size plays a significant role in identification of coding regions by reducing the signal noise. From this discussion, it can be concluded that the Kaiser Window outperformed the others for coding regions identification.

2.2.2. Importance of selecting appropriate factors of window filter
It has been observed that different Window functions with variant parameters showed different spectra. Some Window functions outperformed by reducing the signal noise while others could not provide any significance. Since enhancement of coding regions identification is directly associated with discriminating them from non-coding regions in noisy DNA signals, selection of an appropriate filter is very important to achieve best results. From experiments, the Kaiser Window with beta = 3.5 and window length 351 was concluded to be the best in all combinations for coding regions identification.
For the Kaiser Window of size 351 bp, the window is defined as

$$w(n) = \begin{cases} \dfrac{I_0\!\left(\beta \left[1 - \left(\dfrac{n-\alpha}{\alpha}\right)^{2}\right]^{1/2}\right)}{I_0(\beta)}, & 0 \le n \le M-1 \\ 0, & \text{otherwise} \end{cases} \qquad (19)$$

Fig. 4 describes the Kaiser Window of length 351 bp with numeric value of the β parameter = 3.5. This Window function was chosen to be the most suitable convolution factor for coding regions identification.
Yin and Yau [103] observed that a small Window size produces more statistical oscillations that result in prediction errors while
Table 3
Important factors in selection of a suitable Window filter.
Window Name | Window size | Parameters | Spectrum
Rectangular | 120 | Leakage Factor = 9.18; Relative side lobe attenuation = −13.3 dB | [spectrum plot]
Remarks: It can be observed that Window size of 351 for Rectangular Window reduces the 1/f noise and sharp peaks can be obtained for spectrum.
Hamming | 120 | Symmetric; Leakage Factor = 0.03 | [spectrum plot]
Remarks: It can also be seen that Window size of 351 for Hamming Window sharpens the peaks of exons and reduces the 1/f noise in signal.
Bartlett | 120 | Leakage Factor = 0.28 | [spectrum plot]
Remarks: Bartlett Window of size 120 bp produces 1/f noise, Window of size 240 further reduces the noise and Window of size 351 greatly reduces the 1/f noise and sharp peaks can be obtained for spectrum.
Blackman-Harris | 120 | Leakage Factor = 0; Relative side lobe attenuation = −92 dB | [spectrum plot]
Remarks: The same phenomenon can be observed that Window size of 351 for Blackman-Harris Window reduces the 1/f noise and sharp peaks can be obtained for spectrum.
Kaiser | 120 | Beta = 3.5 | [spectrum plot]
Remarks: Kaiser Window outperforms with involvement of an extra parameter beta that controls the main lobe width and reduces the leakage. It has been observed that Window size of 351 with beta = 3.5 produces the best spectrum value for coding regions identification.
Fig. 4. Kaiser Window of length 351 with beta = 3.5.
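The Kaiser Window of length 351 with beta = 3.5 shown in Fig. 4 can be generated directly from its defining expression, in which I0 is the zeroth-order modified Bessel function of the first kind. The sketch below is a minimal illustration; it assumes that the centre parameter α equals (M − 1)/2, which the text does not state explicitly, and the function name `kaiser_window` is hypothetical.

```python
import numpy as np

def kaiser_window(M=351, beta=3.5):
    """Evaluate the Kaiser window for n = 0 .. M-1.

    Assumption: alpha = (M - 1) / 2, i.e. the window is centred on its
    midpoint. np.i0 is the zeroth-order modified Bessel function.
    """
    n = np.arange(M)
    alpha = (M - 1) / 2.0
    # Clip at zero to guard against tiny negative values from rounding
    arg = np.maximum(1.0 - ((n - alpha) / alpha) ** 2, 0.0)
    return np.i0(beta * np.sqrt(arg)) / np.i0(beta)

w = kaiser_window(351, 3.5)
```

Under this assumption the result agrees with NumPy's built-in `np.kaiser(351, 3.5)`, so either form can be used when windowing a 351 bp segment of an encoded DNA signal.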
large Window sizes may miss small size coding and non-coding regions. Conventional Window filters are employed normally with a variety of digital signals [104,105] but a DNA signal contains biologically inspired nucleotides' data, in which each nucleotide holds a special genetic code context and its distribution is highly biased in coding regions. These special characteristics of nucleotides in codons imply that employment of a conventional Window filter (especially with a fixed Window size) does not suppress the 1/f background noise to a significant extent, which results in a feeble discrimination between coding and non-coding regions.

2.3. Noise suppression approaches

DNA signal is highly contaminated by 1/f background noise that has a strong influence on the accuracy of coding regions' prediction methods, since non-coding regions become greatly diffused with coding regions over a DNA chain, and viable identification of coding regions from non-coding regions over noisy channels grows into a challenging task. There have been several methods proposed for DNA signal noise suppression based on digital filters, frequency transforms (Fourier analysis of discrete and short time Fourier transforms) and Wavelet transforms. Following is a preview of existing approaches for noise suppression to discriminate coding regions from non-coding regions.
We noticed that Kakumani et al. [96] denoised DNA signal using a digital signal processing approach to maximize signal to noise ratio (SNR). The presence of the exonic region in DNA strands had been detected by the employment of some least squares optimization criteria. The DNA sequence was transformed to a digital signal using four binary sequences with a Rectangular Window of 351 bp size. Each subsequence was fed into the proposed framework and SNR gains were computed. The test was then made on genes in chromosome III of C-elegans for revelation of the exonic regions.
Similarly, Akhtar et al. [50] proposed a digital signal processing methodology for denoising DNA signal to achieve an exonic and intronic region prediction with certain comparisons to the existing techniques. A statistical property of sequences was incorporated describing the richness of introns with nucleotides A and T, while the exons with C and G. Comparisons were made over Burset/Guigo1996, HMR195 and GENSCAN datasets. Likewise, Tuqan and Rushdi [97] presented a DNA signal denoising approach for
for period-3 components was explained, secondly a direct relation was found between the components and nucleotides' bias in a DNA spectrum through a set of numerical sequences. A correlation of signal processing and genomic domain was done by a multirate DSP model.
Gupta et al. [106] proposed a digital signal processing method for noise suppression employing a time series approach for exon and intron prediction. The method was based on a feature extraction mechanism. The extracted information contained the distribution of nucleotides and hydrogen bonding, and a pattern recognition approach was applied for identifying bounds for exons and introns. The authors have taken support from Z-curve components reflecting the distribution of DNA components along its three axes (Dataset of Homo sapiens genomes). Each component of the proposed model was considered as discrete time series data. In the same way, Hamdani and Shukri [107] presented a DSP system for signal denoising as an application-based approach. This system was developed for the identification of protein region and functional property of genomics. The idea had been split into transcription, splicing and translation. Once the DNA had been converted to RNA, a splicing process was applied for identification of introns and exons with the help of a mathematical formulation. A translation then converted the messenger RNA to protein.
Sahu and Panda [108] proposed a signal processing approach to denoise DNA signal for protein coding regions identification. It was observed that the existing fixed AR method was not feasible owing to certain limitations, so an adaptive AR model based approach was presented to achieve efficient prediction. The process of the AR method can be regarded as a prediction error (adaptive) filter that manages its coefficients using the flattened spectrum of the signal under examination. The authors carried out a simulation for validating the performance of the introduced method. Likewise, Shakya et al. [85] denoised DNA sequence for enhancing coding regions identification as a digital signal processing approach. The authors used a Bartlett Window of length 351 bp since it was reported to perform best in the literature. The Window was slid by one sample in the entire process over the whole DNA sequence. It was observed that the encoded signal depicted the 3-base extent (w.r.t. components) of DNA signal, while non-coding regions' noise was not reduced effectively. To manage 1/f noise, which occurred due to large scale correlation shown by DNA sequences, the numeric DNA sequence was initially passed through a 2nd order all pass IIR
finding the complete periodicity in DNA sequences. The approach (Infinity Impulse Response) notch filter. It was found that IIR filters
was spliced into two channels. At first, the underlying mechanism
required less memory and computation than FIR filters and could be very effective in the stated scenario.

Datta and Asif [98] proposed a DFT based algorithm to denoise DNA sequences for exon identification. The performance of the algorithm was quantified using a Bartlett Window. The algorithm was run over a long stretch of chromosome III of C. elegans comprising 13783268 bp. The detection rate improved for longer coding regions. However, when exons were shorter than the Window size the detection rate decreased, and algorithms relying on DFT based splicing methods did not perform well. The authors found that the 3-base property could not remain valid over DFT based algorithms. In the same way, Akhtar et al. [99] denoised the DNA signal through frequency based analysis of the signal. The analysis was performed as an optimization of the DFT based method for exon identification. A frequency domain PWSR method was modified in terms of relative accuracy and computational complexity. The authors used the GENSCAN learning set consisting of 188 multi-exon sequences and a test set consisting of 64 multi-exon gene sequences (taken from human genomic sequences) to acquire the frequency of nucleotide occurrences in coding regions.

Working in the same direction, Yin and Yau [103] denoised the signal to predict exonic regions based on the 3-base property of exons. The DFT was used to extract Fourier coefficients from four indicator sequences formed from the DNA stretch. The DFT converts the signal from the time domain to the frequency domain. The method creates a subsequence if the DNA sequence has more than 2000 bp. The subsequence is further divided into a number of smaller sequences, and the algorithm is applied to some short sequences to obtain an exon-intron-exon combination. Later, Roy et al. [109] denoised the signal based on frequency analysis. A generic algorithm was proposed based on the frequency distribution of individual nucleotides in the DNA stretch. The Human Beta Globin dataset with 850 bp and 223 bp was used. The method denoised the signal in relation to the frequency distribution of individual nucleotides at three different positions of the N/3 component over a long stretch of DNA.

Moving ahead, Shuo and Yi-sheng [110] denoised the signal and presented a coding identification approach using a support vector machine (SVM). The SVM classifier identified the start codon region of an exonic region, in which the time frequency characteristics of the output were observed using the short time Fourier transform. Coding and non-coding regions were given as positive or negative patterns for the recognition of splice sites. The authors slid a Window of size 91 bp over the long DNA sequence. The Window's contents were then used as inputs for SVM classification. The classification of a single input showed the nucleotide in the center of the Window that presented the first nucleotide of a codon employing STFT. For signal denoising, the STFT was used as a time-frequency distribution of signal components with significant time-frequency resolution.

Similarly, Guo and Zhu [111] denoised the signal for coding region identification using an integrative method based on a Takagi-Sugeno fuzzy model. The model first identified the starting codon and then identified time frequency characteristics using short time Fourier transforms. Genome data was encoded and split into testing and training datasets. Learning and recognition were performed by a neural network, and a prediction was then made through short time Fourier transforms. Likewise, Mena-Chalco et al. [100] used the MGWT (Modified Gabor-Wavelet Transform) to suppress noise in the DNA signal for coding region identification. The authors modified the Gabor-wavelet function, defined for analysis of a signal at multiple ranges of frequencies and scales, to suppress noise. It was observed that the MGWT could be used to evaluate signals at diverse frequencies and scales, but with a certain limitation. The only limitation found was the frequency content range, which depended on a working scale. This dependency was stated to be undesirable for long DNA stretches since coding regions exhibit the period-3 property. The authors used a Binary encoding scheme for signal mapping in which four Binary sequences were used, each one representing the positions of an individual nucleotide, i.e. adenine (A), cytosine (C), guanine (G) and thymine (T).

The authors' proposed method comprised the four steps stated below,

1. Signal encoding with Binary indicator sequence
2. Each Binary sequence was exposed to MGWT components
3. The position axis was projected with spectra
4. Assigning a fixed threshold for the projection coefficients of coding regions

Later, George and Thomas [101] proposed a technique for suppressing noise in DNA signals. A discrete wavelet transform was used to better minimize noise and maximize prediction accuracy. The DFT was employed as a conventional frequency analysis tool. The authors calculated the DFTs of some small scale nucleotide segments and computed the STFT for enhanced time domain analysis by sliding the Window one entry at a time along the sequence. It was also observed that the STFT increased resolution in the time domain. A Window function of size 351 bp was then used over a long stretch of a C. elegans gene. The authors also described certain scenarios in which digital filters are applied to suppress noise for coding region identification.

In the same way, Abbasi et al. [102] employed discrete Wavelet transforms for denoising the DNA signal and introduced a new algorithm based on cross-correlation for identifying exonic regions. The indicator sequence was encoded as a Binary numerical sequence and passed through an FIR filter of order 8 (based on a Hamming Window) with the filter's main frequency set to 2π/3. Since FIR filters exhibit the least distortion, this was one stated reason for preferring FIR filters in the proposed approach. A cross-correlation was calculated between a periodic impulse train and the numerical DNA sequence to identify regions in the DNA sequence (w.r.t. 3-base characteristics). The size of the impulse sequence was set to 270 bp. The length of the impulse train signal was taken as the Window length, in place of the common Window function used in traditional Window based approaches. The discrete wavelet transform was employed in its usual form, decomposing the signal through a series of low and high pass filters to suppress noise.

Various noise suppression methods can be noted, based on generic digital filters, Fourier analysis and Wavelet transforms. Digital filters were widely used for analyzing the frequency contents of the DNA signal, but since these filters provide only a single-level frequency analysis of the signal, they were later replaced by multilevel frequency analysis techniques, i.e. Wavelet transforms. Similarly, Fourier analysis based methods were found to suffer from spectral leakage problems that hindered significant noise suppression when discriminating coding regions from non-coding regions.

The literature review shows that researchers have employed digital filters (i.e. FIR, IIR and notch filters) [85,96,108] for noise suppression. These filters can be applied for first-level analysis of the frequency components of the DNA signal, focusing on certain regions of interest. The filter design allows a certain range of frequencies to pass (low pass, high pass, all pass etc.), which helps to analyze particular frequency segments of the signal for noise suppression. Although digital filters have been used, to an extent, for DNA signal noise suppression, they do not perform as well as multilevel frequency analysis of the DNA signal using Wavelet transforms. In contrast, the approaches based on Fourier analysis (discrete, fast and short time transforms) of the signal [50,97–99,103,106,107,109–111] have
been used as popular solutions for DNA signal noise suppression in the recent decade. Fourier analysis based methods rely mainly on frequency spectrum analysis of the DNA signal convolved with a noise suppressing Window function. In frequency spectrum analysis, spectral leakage refers to signal energy shifting from the main lobe to the side lobes. This phenomenon is visible in the width of the main lobe and the sharp peaks of the side lobes around it, which result in energy leaking from the sharp energy levels to the lower levels. Because of the tradeoffs between viable spectrum analysis and spectral leakage, generic frequency transform based methods could not resolve the issues related to spectral leakage to a great extent. Researchers have therefore moved towards Wavelet transform based methods to improve the identification of coding regions by significantly reducing the DNA signal noise [100–102].

Recently, Yu et al. [112] presented a comprehensive review of emerging computational methods for gene identification in DNA sequences. The authors highlighted key emerging trends in the field. Similarly, Goel et al. [113] reviewed the literature on software computing techniques in gene prediction. The authors surveyed several predictors with their strengths and weaknesses.

3. Analysis and discussion

Performance evaluation of the prevailing solutions for identification of coding regions was carried out at the basic nucleotide level. For this purpose, the following evaluation measures are used,

\[ \text{Discrimination measure } (D) = \frac{\text{Lowest amplitude of exon}}{\text{Highest amplitude of intron}} \tag{20} \]

\[ \text{Sensitivity } (S_n) = \frac{TP}{TP + FN} \tag{21} \]

\[ \text{Specificity } (S_p) = \frac{TP}{TP + FP} \tag{22} \]

\[ \text{Prediction accuracy } (P) = \frac{TP + TN}{TP + FP + TN + FN} \tag{23} \]

\[ \text{Approximate correlation } (AC) = (ACP - 0.5) \times 2 \tag{24} \]

Where,

\[ ACP = \frac{1}{4} \times \left( \frac{TP}{TP + FN} + \frac{TP}{TP + FP} + \frac{TN}{TN + FN} + \frac{TN}{TN + FP} \right) \tag{25} \]

Here the Discrimination measure (D) [45,83,99,100] is defined as the ratio of the smallest peak value of the exons to the largest peak value of the introns in the power spectral density estimate of the two regions. A prediction method with a higher discrimination measure gives a more accurate result than other prediction methods.

Sensitivity (Sn) (true positive rate) is defined as the proportion of regions correctly identified as exons (coding regions), while Specificity (Sp) is described as the proportion of regions correctly identified as introns (non-coding regions) [85,100,102,108]. Note that Eq. (22) as written, TP/(TP + FP), follows the convention common in gene-finding work and measures the proportion of predicted exons that are true exons, rather than the true negative rate.

Prediction accuracy (P) is considered a good measure because it combines Sensitivity (Sn) and Specificity (Sp) [95]. Likewise, Approximate correlation (AC) [102] is taken as an equally suitable evaluation measure when Prediction accuracy (P) may not distinguish well between coding and non-coding regions because of a large Sensitivity (Sn) against a small Specificity (Sp), or vice versa. Further,

TP (True positive) is defined as the number of exons correctly identified as exons.
TN (True negative) is defined as the number of introns correctly identified as introns.
FP (False positive) is defined as the number of introns falsely identified as exons.
FN (False negative) is defined as the number of exons falsely identified as introns.

Here, two kinds of datasets (i.e. benchmarked and randomly taken) of different organisms were used for analysis.

Table 4 describes the datasets used for performance evaluation of different methods for coding region identification. Further, we added 1/f noise at certain thresholds (3% to 5% of signal power) to all datasets to observe how well the different methods predict coding regions under significant noise.

Table 4
Description of datasets used for performance evaluation.
Organism | No. of sequences | Average sequence length (bp)
Homo Sapiens | 15 | 4500
Serinus Canaria | 20 | 6250
Nicotiana Sylvestris | 17 | 3700
Yersinia Pestis | 1 | 4000
Limulus Polyphemus | 20 | 7880
Felis Catus | 23 | 4925
Vicugna Pacos | 18 | 2120
Sus Scrofa Mitochondrion | 1 | 8000
Cricetulus Griseus | 15 | 5812
Tursiops Truncatus | 18 | 3460
Ornithorhynchus Anatinus | 20 | 3933
Mus Musculus Domesticus | 1 | 7700
Meleagris Gallopavo | 25 | 2732
Canis Lupus Mitochondrion | 1 | 7800
Galeopterus Variegatus | 18 | 2267
Nicotiana Tomentosiformis | 20 | 4200
S. Cerevisiae chromosome III | 1 | 8000
Human, Mouse and Rat (HMR195) | 103, 82 and 10 respectively | 7096
Human (Beta Globin HUMHBB) | 1 | 73326

Fig. 5 shows the spectrum estimation, i.e. the discrimination measure achieved by different coding measure schemes. The three popular coding measure schemes described in the literature, i.e. Voss, EIIP and Complex, fall in a similar bracket (within 75% to 80% identification based on non-genetic code context of nucleotides) for the different datasets. These schemes can be used interchangeably, since they achieve almost similar identification.

Fig. 5. Comparison in terms of spectral estimation.

Similarly, we determined the other evaluation factors as follows,

Fig. 6 describes the performance evaluation of different coding measure schemes. Fig. 6(A) depicts the mean False Positive rate of the coding schemes, Fig. 6(B) shows mean specificity, and Figs. 6(C) and 6(D) present mean prediction accuracy and mean approximate correlation. Here too, Voss, EIIP and Complex fall in a similar bracket (within 75% to 80% identification based on non-genetic code context of nucleotides) for the different datasets and achieve almost similar identification. It is expected that coding measure schemes based on the genetic code context of nucleotides would achieve better identification than non-genetic context based schemes.

Similarly, we have analyzed the performance of different Window filters that are based on non-genetic code context of nucleotides as follows,

Fig. 7 shows a comparison among different Window filters in terms of signal to noise ratio. The Window filters have been analyzed at different Window lengths. Since conventional Window filters lack the biological aspect of nucleotides in the contents of the Window filter, a considerable improvement in DNA signal noise suppression cannot be achieved in this regard. It is therefore more appropriate to state that a Window filter based on the genetic code context of nucleotides would ensure significant noise suppression compared with conventional Window filters.

In the same way, we analyzed different filters described in the literature for noise suppression to enhance coding region identification: the Notch filter [85], Adaptive AR [108], Integrative STFT [111] and Wavelet filters [100–102]. We calculated the signal to noise ratios of the noise suppression filters on different datasets and observed the noise suppression tendency of non-genetically based filters. In light of the experimental findings, a genetic code context digital filter (capable of processing DNA/RNA related signals) could perform better for noise suppression than a common digital filter designed for traditional digital filtering processes.

4. Critical discussion on analysis of results

A digital signal is commonly processed for its digital contents in most digital applications. The literature reviewed in previous sections shows that biological signals are also being treated as normal digital signals. For instance, a digital signal composed from nucleotides of a DNA/RNA sequence contains very important biological information related to the area of interest of a specific gene/nucleotide sequence. Treating such signals as common digital
signals might not reflect significant results. Optimal goals for protein coding region identification could be achieved if biological signals are processed in the context of the nucleotides' contents. Digital signals composed from DNA/RNA sequences should reflect biologically meaningful aspects of nucleotides.

For instance, consider protein coding region identification. Protein coding regions are strongly diffused with non-coding regions due to 1/f DNA signal noise, so that discernment between the two regions becomes highly challenging. The numerical encoding of a DNA sequence for DNA signal processing requires the translation of the sequence to a digital signal. This translation is the basis for optimal identification of coding regions from non-coding regions. The results show that conventional non-genetic encoding sequences and convolution factors (Window filters) yield more or less similar findings for identification of coding regions. This is because these solutions are based only on DSP concepts and lack the biological aspects of nucleotides in the construction of encoding sequences and Window filters.

If we carefully examine the composition of DNA sequences and their translation to protein sequences, we notice 64 possible codons through which DNA chains are translated into protein chains at regions known as exons [13].

Table 5 presents the amino acids (with their abbreviated codes) produced by the pertinent codons [12]. Each amino acid can be produced by one or more codons, e.g. Tyrosine is produced by either “TAT” or “TAC”. In this sense the DNA code is said to be degenerate: one amino acid may be produced by more than one codon, while each codon is unique in its nature and composition [14–16].

Fig. 8 presents the codons' cluster space. The nature of codons reveals that some codons are totally disjoint while others share some common density distribution [114].

In Fig. 9, all four codons are disjoint and do not show any common attributes (e.g. density distribution).

All codons other than disjoint codons share some common density distribution. For instance, the codons shown in Fig. 10, containing nucleotides “A, T and G”, show a contributing factor in each of their counterparts. In such scenarios, since the three nucleotides have a uniform mass distribution, each nucleotide appears with an approximate shared value of “1/3”.
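The codon degeneracy described around Table 5 can be illustrated with a short Python sketch. The codon lists are transcribed from Table 5. The helper names (`translate`, `degeneracy`, `nucleotide_shares`) are hypothetical and are not taken from any of the reviewed methods. In particular, `nucleotide_shares` is only one possible reading of the “1/3” and “2/3” contributing factors, taken here as the fraction of a codon's three positions occupied by each nucleotide; the cluster model of [114] may define them differently.

```python
from collections import Counter
from fractions import Fraction

# Codon lists transcribed from Table 5 (DNA alphabet, one-letter amino acid codes).
CODONS_BY_AMINO_ACID = {
    "I": ["ATT", "ATC", "ATA"],
    "L": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
    "F": ["TTT", "TTC"],
    "M": ["ATG"],
    "C": ["TGT", "TGC"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "Y": ["TAT", "TAC"],
    "W": ["TGG"],
    "Q": ["CAA", "CAG"],
    "N": ["AAT", "AAC"],
    "H": ["CAT", "CAC"],
    "E": ["GAA", "GAG"],
    "D": ["GAT", "GAC"],
    "K": ["AAA", "AAG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "Stop": ["TAA", "TAG", "TGA"],
}

# Inverted map: every codon points to exactly one amino acid (or "Stop").
CODON_TO_AMINO_ACID = {
    codon: amino_acid
    for amino_acid, codons in CODONS_BY_AMINO_ACID.items()
    for codon in codons
}


def translate(dna):
    """Read the string codon by codon from position 0 and stop at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TO_AMINO_ACID[dna[start:start + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return "".join(protein)


def degeneracy(amino_acid):
    """Number of distinct codons that produce the given amino acid."""
    return len(CODONS_BY_AMINO_ACID[amino_acid])


def nucleotide_shares(codon):
    """Assumed reading of the 'contributing factor': share of the codon held by each base."""
    return {base: Fraction(count, 3) for base, count in Counter(codon).items()}


if __name__ == "__main__":
    print(translate("ATGTATTACTAA"))          # both Tyrosine codons give "Y"
    print(degeneracy("Y"), degeneracy("L"))   # codon counts for Tyrosine and Leucine
    print(nucleotide_shares("ATG"))           # three different bases, 1/3 each
    print(nucleotide_shares("AAT"))           # one base repeated, 2/3 and 1/3
```

Under this reading, a codon built from three different nucleotides gives each a share of 1/3, while a codon with one repeated nucleotide gives that nucleotide a share of 2/3.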
Fig. 6. (A, B, C, D). Comparative analysis in terms of performance evaluation parameters.
Fig. 7. Comparison of different Window filters in terms of SNR.
Fig. 8. Cluster space for nucleotides in codons [114].
Fig. 9. Disjoint clusters.

Similarly, the codons shown in Fig. 11, containing nucleotides “A, T, C and G”, depict a contributing factor of “2/3” in their nucleotides' density distribution.

This discussion leads us to the fact that nucleotides appear in a codon as a unique triplet combination; each nucleotide has a specific place in the codon, and its occurrence in the codon determines the type of protein produced. A DNA signal composed of such nucleotides would probably not provide optimal identification of coding regions if treated as a common digital signal. The common
DSP approaches are well suited to processing conventional digital signals but not to processing DNA signals, which contain genetically meaningful contents. New DSP approaches should be developed that process the DNA signal while taking care of its genetic contents (e.g. genetic code context of nucleotides).

Table 5
Amino acids produced by codons.
Amino Acid | Short Code | Codons
Isoleucine | I | ATT, ATC, ATA
Leucine | L | CTT, CTC, CTA, CTG, TTA, TTG
Valine | V | GTT, GTC, GTA, GTG
Phenylalanine | F | TTT, TTC
Methionine | M | ATG
Cysteine | C | TGT, TGC
Alanine | A | GCT, GCC, GCA, GCG
Glycine | G | GGT, GGC, GGA, GGG
Proline | P | CCT, CCC, CCA, CCG
Threonine | T | ACT, ACC, ACA, ACG
Serine | S | TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine | Y | TAT, TAC
Tryptophan | W | TGG
Glutamine | Q | CAA, CAG
Asparagine | N | AAT, AAC
Histidine | H | CAT, CAC
Glutamic acid | E | GAA, GAG
Aspartic acid | D | GAT, GAC
Lysine | K | AAA, AAG
Arginine | R | CGT, CGC, CGA, CGG, AGA, AGG
Stop codons | Stop | TAA, TAG, TGA

Fig. 10. Nucleotides sharing same contributing factor of “1/3”.
Fig. 11. Codons sharing same contributing factor of “2/3”.

5. Conclusion

This paper addresses a number of issues and challenges associated with significant identification of protein coding regions from non-coding regions in Eukaryotic DNA sequences. It has been critically observed that protein coding regions are commonly diffused with non-coding regions due to 1/f background noise over noisy DNA sequences, in such a way that a viable discernment between the two regions becomes cumbersome. A strong relationship has been observed between how well coding regions can be identified and the strength of the 1/f background noise. Further, protein coding regions have a fundamental 3-base periodicity which is absent in non-coding regions. This characteristic has been exploited by a large number of researchers to propose various methodologies and tools for coding region identification, but a significant solution is still required for enhanced identification.

The computational solutions based on genetic code context can be significantly helpful in,

• Identification of complete gene patterns over long noisy DNA sequences.
• Translation of particular DNA segments to complete protein sequence.
• Motif identification to discover patterns of nucleotides or protein sequences for understanding the structure and function of molecules in a better way.
• Devising new multilevel digital filters for signal's 1/f noise suppression and better discrimination between coding and non-coding regions.
• Identification of protein folding based on its structure.

References

[1] J. Lewis, B. Alberts, A. Johnson, P. Walter, Molecular Biology of the Cell, 5th ed., Garland Publishing, New York, 2007.
[2] D. Anastassiou, Genomic signal processing, IEEE Signal Process. Mag. 18 (4) (2001) 8–20.
[3] K.P. Soman, Insight into Wavelets: From Theory to Practice, PHI Learning Pvt. Ltd., 2010.
[4] S. Sarkar, Decoding coding: Information and DNA, Bioscience 46 (11) (1996) 857–864.
[5] T.J. Richmond, C.A. Davey, The structure of DNA in the nucleosome core, Nature 423 (6936) (2003) 145–150.
[6] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, P. Walter, Cell Junctions Cell Adhesion and the Extracellular Matrix, 2002.
[7] T. Strachan, A.P. Read, Human Molecular Genetics An Overview of Mutation, Polymorphism, and DNA Repair, 2004, pp. 2.
[8] L. Galleani, R. Garello, The minimum entropy mapping spectrum of a DNA sequence, IEEE Trans. Inf. Theory 56 (2) (2010) 771–783.
[9] S.K. Mitra, Y. Kuo, Digital Signal Processing: a Computer-based Approach, Vol. 2, McGraw-Hill, New York, 2006.
[10] S. Rogic, A.K. Mackworth, F.B. Ouellette, Evaluation of gene-finding programs on mammalian sequences, Genome Res. 11 (5) (2001) 817–832.
[11] M. Stanke, S. Waack, Gene prediction with a hidden Markov model and a new intron submodel, Bioinformatics 19 (suppl 2) (2003) ii215–ii225.
[12] E. Coward, Equivalence of two Fourier methods for biological sequences, J. Math. Biol. 36 (1) (1997) 64–70.
[13] W. Wang, D.H. Johnson, Computing linear transforms of symbolic signals, IEEE Trans. Signal Process. 50 (3) (2002) 628–634.
[14] Z. Wang, Y. Chen, Y. Li, A brief review of computational gene prediction methods, Genom. Proteom. Bioinform. 2 (4) (2004) 216–221.
[15] J.W. Fickett, The gene identification problem: an overview for developers, Comput. Chem. 20 (1) (1996) 103–118.
[16] Y. Cai, Z. He, L. Hu, B. Li, Y. Zhou, H. Xiao, H. Li, Gene finding by integrating gene finders, J. Biomed. Sci. Eng. 3 (11) (2010) 1061.
[17] A.S. Nair, S. Sreenadhan, An improved digital filtering technique using nucleotide frequency indicators for locating exons, J CSI 36 (1) (2006) 54–60.
[18] V. Afreixo, P.J. Ferreira, D. Santos, Spectrum and symbol distribution of nucleotide sequences, Phys. Rev. E 70 (3) (2004) 031910.
[19] N. Rao, S.J. Shepherd, Detection of 3-periodicity for small genomic sequences based on AR technique, Communications, Circuits and Systems, 2004. ICCCAS 2004. 2004 International Conference on 2004 June, IEEE Vol. 2 (2004) 1032–1036.
[20] D. Kotlar, Y. Lavner, Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions, Genome Res. 13 (8) (2003) 1930–1937.
[21] T.W. Fox, A. Carreira, A digital signal processing method for gene prediction with improved noise suppression, EURASIP J. Adv. Signal Process. 2004 (1) (2004) 1–7.
[22] P. Lio, Wavelets in bioinformatics and computational biology: state of art and perspectives, Bioinformatics 19 (1) (2003) 2–9.
[23] L. Taher, O. Rinner, S. Garg, A. Sczyrba, M. Brudno, S. Batzoglou, B. Morgenstern, AGenDA: homology-based gene prediction, Bioinformatics 19 (12) (2003) 1575–1577.
[24] A.K. Brodzik, O. Peters, Symbol-balanced quaternionic periodicity transform for latent pattern detection in DNA sequences, ICASSP, 2005 March 5 (2005) 373–376.
[25] T.M. Nair, S.S. Tambe, B.D. Kulkarni, Application of artificial neural networks for prokaryotic transcription terminator prediction, FEBS Lett. 346 (2) (1994) 273–277.
[26] N. Chakravarthy, A. Spanias, L.D. Iasemidis, K. Tsakalis, Autoregressive modeling and feature analysis of DNA sequences, EURASIP J. Appl. Signal Process. 2004 (2004) 13–28.
[27] R. Zhang, C.T. Zhang, Z curves, an intutive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dyn. 11 (4) (1994) 767–782.
[28] D. Kotlar, Y. Lavner, Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions, Genome Res. 13 (8) (2003) 1930–1937.
[29] A. Fuentes, J. Ginori, R. Abalo, A new predictor of coding regions in genomic sequences using a combination of different approaches, Int. J. Biol. Life Sci. 3 (2) (2007) 106–110.
[30] A.E. Cetin, O.N. Gerek, Y. Yardimci, Equiripple FIR filter design by the FFT algorithm, IEEE Signal Process Mag. 14 (2) (1997) 60–64.
[31] L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing, 777, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1975, pp. 1.
[32] S.J. Orfanidis, Introduction to Signal Processing, Prentice-Hall, Inc, 1995.
[33] John G. Proakis, Dimitris G. Manolakis, Digital Signal Processing, 1996, 511–608.
[34] B.D. Silverman, R. Linsker, A measure of DNA periodicity, J. Theor. Biol. 118 (3) (1986) 295–300.
[35] B. Demeler, G. Zhou, Neural network optimization for E. coli promoter prediction, Nucleic Acids Res. 19 (7) (1991) 1593–1599.
[36] R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. 68 (25) (1992) 3805.
[37] H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, Z.D. Goldberger, S. Havlin, R.N. Mantegna, M. Simons, Statistical mechanics in biology: how ubiquitous are long-range correlations? Physica A 205 (1-3) (1994) 214–253.
[38] M. Yan, Z.S. Lin, C.T. Zhang, A new fourier transform approach for protein coding measure based on the format of the Z curve, Bioinformatics 14 (8) (1998) 685–690.
[39] P. Liò, M. Vannucci, Finding pathogenicity islands and gene transfer events in genome data, Bioinformatics 16 (10) (2000) 932–940.
[40] P. Bernaola-Galván, I. Grosse, P. Carpena, J.L. Oliver, R. Román-Roldán, H.E. Stanley, Finding borders between coding and noncoding DNA regions by an entropic segmentation method, Phys. Rev. Lett. 85 (6) (2000) 1342.
[41] P. Bernaola-Galván, P. Carpena, R. Román-Roldán, J.L. Oliver, Study of statistical correlations in DNA sequences, Gene 300 (1) (2002) 105–115.
[42] P.D. Cristea, Genetic signal representation and analysis, International Symposium on Biomedical Optics, 2002 June, International Society for Optics and Photonics (2017) 77–84.
[43] Y. Nancy, Hong Yan, Autoregressive modeling of DNA features for short exon recognition, IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2010) 450–455.
[44] G.L. Rosen, Signal Processing for Biologically-inspired Gradient Source Localization and DNA Sequence Analysis, 2006.
[45] A.S. Nair, S.P. Sreenadhan, A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation 1 (6) (2006) 197–202.
[46] T. Holden, R. Subramaniam, R. Sullivan, E. Cheung, C. Schneider, G. Tremberger Jr., T.D. Cheung, ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes, Optical Engineering+ Applications, 2007 September, International Society for Optics and Photonics (2017) 669417.
[47] M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking complex indicator sequence, TENCON 2008-2008 IEEE Region 10 Conference, 2008 November, IEEE (2008) 1–6.
[48] D.K. Shakya, R. Saxena, S.N. Sharma, An adaptive window length strategy for eukaryotic CDS prediction, IEEE/ACM Trans. Comput. Biol. Bioinform. (TCBB) 10 (5) (2013) 1241–1252.
[66] P.D. Welch, The use of fast Fourier transform for the estimation of power spectra: a method based on time averaging over short, modified periodograms, IEEE Trans. Audio Electroacoust. 15 (2) (1967) 70–73.
[67] D.B. Percival, A.T. Walden, Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques, Cambridge Univ. Press, New York, 1993, pp. 583.
[68] Z. Ignatova, I. Martínez-Pérez, K.H. Zimmermann, DNA Computing Models, Springer Science & Business Media, 2008.
[69] F. Brueckner, K.J. Armache, A. Cheung, G.E. Damsma, H. Kettenberger, E. Lehmann, P. Cramer, Structure-function studies of the RNA polymerase II elongation complex, Acta Crystallogr. Sect D: Biol. Crystallogr. 65 (2) (2009) 112–120.
[70] M. Long, E. Betrán, K. Thornton, W. Wang, The origin of new genes: glimpses from the young and old, Nat. Rev. Genet. 4 (11) (2003) 865–875.
[71] C.T. Zhang, J. Wang, Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve, Nucleic Acids Res. 28 (14) (2000) 2804–2814.
[72] G. Dodin, P. Vandergheynst, P. Levoir, C. Cordier, L. Marcourt, Fourier and wavelet transform analysis, a tool for visualizing regular patterns in DNA sequences, J. Theor. Biol. 206 (3) (2000) 323–326.
[73] D. Anastassiou, Frequency-domain analysis of biomolecular sequences, Bioinformatics 16 (12) (2000) 1073–1081.
[74] J.A. Berger, S.K. Mitra, J. Astola, Power spectrum analysis for DNA sequences, Signal Processing and Its Applications, 2003. Proceedings. Seventh International Symposium on 2003 July, IEEE Vol. 2 (2003) 29–32.
[75] A.S.S. Nair, T. Mahalakshmi, Visualization of genomic data using inter-nucleotide distance signals, Proc. IEEE Genom. Signal Process. (2005) 408.
[76] R. Ranawana, V. Palade, A neural network based multi-classifier system for gene identification in DNA sequences, Neural Comput. Appl. 14 (2) (2005) 122–131.
[77] D.G. Grandhi, C.V. Kumar, 2-Simplex mapping for identifying the protein coding regions in DNA, TENCON 2007-2007 IEEE Region 10 Conference, 2007 October, IEEE (2017) 1–3.
[78] J. Mena-Chalco, H. Carrer, Y. Zana, R.M. Cesar Jr, Identification of protein coding regions using the modified Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinform. 5 (2) (2008) 198–207.
[79] C. Yin, S.S.T. Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, J. Theor. Biol. 247 (4) (2007) 687–694.
[80] M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction taking EIIP indicator sequence, Proceedings of the Second International Conference on Information Processing, 2008 January (2008) 117–123.
[81] H.K. Kwan, S.B. Arniker, Numerical representation of DNA sequences, 2009 IEEE International Conference on Electro/Information Technology, 2009 June, IEEE (2009) 307–310.
[49] C. Yin, S.S.T. Yau, Numerical representation of DNA sequences based on [82] I. Wasito, I. Veritawati, Fractal dimension approach for clustering of DNA
genetic code context and its applications in periodicity analysis of genomes, sequences based on internucleotide distance, Information and
Computational Intelligence in Bioinformatics and Computational Biology, Communication Technology (ICoICT), 2013 International Conference of IEEE,
2008. CIBCB’08. IEEE Symposium on 2008 September, IEEE (2017) 223–227. 2013 March (2013) 82–87.
[50] M. Akhtar, J. Epps, E. Ambikairajah, Signal processing in sequence analysis: [83] C. Fraley, A.E. Raftery, Model-based clustering, discriminant analysis, and
advances in eukaryotic gene prediction, IEEE J. Select. Topics Signal Process. density estimation, J. Am. Stat. Assoc. 97 (458) (2002) 611–631.
2 (3) (2008) 310–321. [84] S.S. Sahu, G. Panda, Identification of protein-coding regions in DNA
[51] B.Y. Kwan, J.Y. Kwan, H.K. Kwan, Spectral classification of short numerical sequences using a time-frequency filtering approach, Genom. Proteom.
exon and intron sequences, BMC Bioinf. 12 (11) (2011) 1. Bioinform. 9 (1) (2011) 45–55.
[52] M. Rahman, Applications of Fourier Transforms to Generalized Functions, [85] D.K. Shakya, R. Saxena, S.N. Sharma, A DSP-based approach for gene
WIT Press, 2011. prediction in eukaryotic genes, Int. J. Electr. Eng. Inform 3 (4) (2011).
[53] S. Gurevich, R. Hadani, On the diagonalization of the discrete Fourier [86] M.K. Hota, V.K. Srivastava, DSP technique for gene and exon prediction
transform, Appl. Comput. Harmon. Anal. 27 (1) (2009) 87–99. taking EIIP indicator sequence, Proceedings of the Second International
[54] H. Baher, The fast fourier transform and its applications, Signal Process. Conference on Information Processing, 2008 January (2008) 117–123.
Integr. Circuits (1990) 149–191. [87] M.S. Chavan, R.A. Agarwala, M.D. Uplane, Use of Kaiser window for ECG
[55] T.W. Fox, A. Carreira, A digital signal processing method for gene prediction processing, in: Proceedings of the 5th WSEAS Int. Conf. on Signal Processing,
with improved noise suppression, EURASIP J. Adv. Signal Process. 2004 (1) Robotics and Automation, 2006 February, Madrid, Spain, 2006.
(2004) 1–7. [88] S.W. Bergen, A. Antoniou, Application of parametric window functions to
[56] C. Sagiv, N.A. Sochen, Y.Y. Zeevi, Scale-space generation via uncertainty the STDFT method for gene prediction, Proceedings on Communication,
principles, in: International Conference on Scale-Space Theories in Computers and Signal Processing, (IEEE-PACRIM05) (2005) 324–327.
Computer Vision, 2005 April, Springer, Berlin, Heidelberg, 2017, pp. [89] A. Andreas, Digital Signal Processing: Signals Systems and Filters, 2006.
351–362. [90] M.K. Hota, V.K. Srivastava, Performance analysis of different DNA to
[57] D.A. Lyon, The discrete fourier transform, part 4: spectral leakage, J. Object numerical mapping techniques for identification of protein coding regions
Technol. 8 (7) (2009). using tapered window based short-time discrete Fourier transform, Power,
[58] M. Cerna, A.F. Harvey, The Fundamentals of FFT-based Signal Analysis and Control and Embedded Systems (ICPCES), 2010 International Conference on
Measurement, National Instruments, Junho, 2000. 2010 November, IEEE (2017) 1–4.
[59] C.K. Chui (Ed.), An Introduction to Wavelets, Vol. 1, Academic press, 2014. [91] A.V. Oppenheim, R.W. Schafer, Discrete-time signal processing, Pearson
[60] A. Grossmann, J. Morlet, Decomposition of Hardy functions into square High. Educ. (2010).
integrable wavelets of constant shape, SIAM J. Math. Anal. 15 (4) (1984) [92] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya, R.
723–736. Ramaswamy, Prediction of probable genes by Fourier analysis of genomic
[61] C. Bingham, M. Godfrey, J. Tukey, Modern techniques of power spectrum sequences, Comput. Appl. Biosci.: CABIOS 13 (3) (1997) 263–270.
estimation, IEEE Trans. Audio Electroacoust. 15 (2) (1967) 56–66. [93] D. Kotlar, Y. Lavner, Gene prediction by spectral rotation measure: a new
[62] B. Porat, Digital Processing of Random Signals: Theory and Methods, method for identifying protein-coding regions, Genome Res. 13 (8) (2003)
Prentice-Hall, Inc, 1994. 1930–1937.
[63] M.B. Priestly, Spectral Analysis and Time Series, Academic Press, San Diego, [94] T.S. Gunawan, On the optimal window shape for genomic signal processing,
1981. Computer and Communication Engineering, 2008. ICCCE 2008. International
[64] M.K. Steven, Modern Spectral Estimation: Theory and Application Signal Conference on 2008 May, IEEE (2008) 252–255.
Processing Series, 1988. [95] S. Datta, A. Asif, A fast DFT based gene prediction algorithm for identification
[65] G.W. Corder, D.I. Foreman, Nonparametric Statistics for Non-statisticians: a of protein coding regions, ICASSP, 2005 March 5 (2005) 653–656.
Step-by-step Approach, 2009.
[96] R. Kakumani, V. Devabhaktuni, M.O. Ahmad, Prediction of protein-coding regions in DNA sequences using a model-based approach, 2008 IEEE International Symposium on Circuits and Systems on 2008 May, IEEE (2008) 1918–1921.
[97] J. Tuqan, A. Rushdi, A DSP approach for finding the codon bias in DNA sequences, IEEE J. Sel. Top. Signal Process. 2 (3) (2008) 343–356.
[98] S. Datta, A. Asif, DFT based DNA splicing algorithms for prediction of protein coding regions, Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on IEEE, 2004 November Vol. 1 (2004) 45–49.
[99] M. Akhtar, J. Epps, E. Ambikairajah, On DNA numerical representations for period-3 based exon prediction, 2007 IEEE International Workshop on Genomic Signal Processing and Statistics on 2007 June, IEEE (2007) 1–4.
[100] J. Mena-Chalco, H. Carrer, Y. Zana, R.M. Cesar Jr, Identification of protein coding regions using the modified Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinform. 5 (2) (2008) 198–207.
[101] T.P. George, T. Thomas, Discrete wavelet transform de-noising in eukaryotic gene splicing, BMC Bioinform. 11 (1) (2010) 1.
[102] O. Abbasi, A. Rostami, G. Karimian, Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform, BMC Bioinform. 12 (1) (2011) 1.
[103] C. Yin, S.S.T. Yau, Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence, J. Theor. Biol. 247 (4) (2007) 687–694.
[104] J.W. Fickett, Recognition of protein coding regions in DNA sequences, Nucleic Acids Res. 10 (17) (1982) 5303–5318.
[105] C. Yin, S.S.T. Yau, A Fourier characteristic of coding sequences: origins and a non-Fourier approximation, J. Comput. Biol. 12 (9) (2005) 1153–1165.
[106] R. Gupta, A. Mittal, K. Singh, P. Bajpai, S. Prakash, A time series approach for identification of exons and introns, Information Technology, (ICIT 2007). 10th International Conference on 2007 December, IEEE (2007) 91–93.
[107] H.Y. Hamdani, S.R.M. Shukri, Gene prediction system, 2008 International Symposium on Information Technology on 2008 August, IEEE 2 (2008) 1–7.
[108] S.S. Sahu, G. Panda, A DSP approach for protein coding region identification in DNA sequence, Int. J. Signal Image Process. 1 (2) (2010).
[109] M. Roy, S. Biswas, S. Barman, Identification and analysis of coding and non-coding regions of a DNA sequence by positional frequency distribution of nucleotides (PFDN) algorithm, Computers and Devices for Communication, 2009. CODEC 2009. 4th International Conference on 2009 December, IEEE (2009) 1–4.
[110] G. Shuo, Z. Yi-sheng, Prediction of protein coding regions by support vector machine, Intelligent Ubiquitous Computing and Education, 2009 International Symposium on 2009 May, IEEE (2009) 185–188.
[111] S. Guo, Y.S. Zhu, An integrative algorithm for predicting protein coding regions, Circuits and Systems, 2008. APCCAS 2008. IEEE Asia Pacific Conference on 2008 November, IEEE (2008) 438–441.
[112] N. Yu, Z. Yu, B. Li, F. Gu, Y. Pan, A comprehensive review of emerging computational methods for gene identification, J. Inf. Process. Syst. 12 (1) (2016).
[113] N. Goel, S. Singh, T.C. Aseri, A review of soft computing techniques for gene prediction, ISRN Genom. 2013 (2013).
[114] M. Ahmad, L.T. Jung, M.A.A. Bhuiyan, On fuzzy semantic similarity measure for DNA coding, Comput. Biol. Med. 69 (2016) 144–151.
