Multiple Sequence Alignment Based On Chaotic PSO

Xiu-juan Lei, Jing-jing Sun, and Qian-zhi Ma

School of Computer Science, Shaanxi Normal University, Xi'an 710062, China
{xjlei168,jjsun1116,zhihui312}@163.com
Abstract. This paper introduces an improved algorithm, chaotic PSO (CPSO), which applies the idea of chaos optimization to multiple sequence alignment. First, chaotic variables between 0 and 1 are generated when the population is initialized, so that the particles are distributed uniformly in the solution space. Second, chaotic sequences are generated with the Logistic mapping function in order to perform a chaotic search and strengthen the diversity of the population. Simulation results on several benchmark data sets from BAliBase show that the improved algorithm is effective and performs well on data sets with different degrees of similarity.
Keywords: chaos, particle swarm optimization, multiple sequence alignment.
1 Introduction
As one of the most basic tasks of biological sequence analysis, multiple sequence alignment (MSA) is widely applied in sequence assembly, sequence annotation, the prediction of gene and protein structure and function, phylogeny and evolutionary analysis, and so on. It is one of the hot topics in current biological information science. Multiple sequence alignment is an NP-complete problem under sum-of-pairs scoring (SPS). Existing MSA algorithms can be roughly divided into four kinds: exact alignment algorithms, evolutionary alignment algorithms, algorithms based on graph theory, and iterative alignment algorithms. Exact alignment algorithms are based entirely on dynamic programming. The most classical exact algorithm is the Needleman-Wunsch algorithm [1], whose feasible calculation dimension is only 3-D. The evolutionary alignment algorithm was originally proposed by Hogeweg [2] and further refined by Feng and Taylor. The software package CLUSTALW, which is based on evolutionary alignment, is now widely used. The main representative of the graph-based alignment algorithms is partial order alignment (POA) [3]. In recent years, iterative alignment algorithms have been increasingly used to solve the multiple sequence alignment problem. This approach starts from an algorithm that can produce an alignment and improves that alignment through a series of iterations until the result no longer gets better. Many algorithms follow this approach, such as simulated annealing (SA) [4], the genetic algorithm (GA) [5], the hidden Markov model (HMM), and so on. The most influential software package, SAGA [6] (sequence alignment by genetic algorithm), is built on a genetic algorithm. SAGA is designed with twenty-two
Z. Cai et al. (Eds.): ISICA 2009, CCIS 51, pp. 351–360, 2009. © Springer-Verlag Berlin Heidelberg 2009
different kinds of genetic operators and applies a dynamic scheduling strategy to control their use. The genetic algorithm and its application to multiple sequence alignment are relatively mature, but the design of the genetic operators and the choice of the parameters are rather complex. In contrast, the particle swarm optimization (PSO) algorithm [7], originally presented by Kennedy and Eberhart in 1995, has become very popular because it is simple to implement, has few parameters to adjust, and converges quickly to a reasonably good solution. However, it still suffers from premature convergence and tends to become trapped in local optima. In view of these disadvantages, improving the PSO algorithm for the multiple sequence alignment problem is currently an active research topic. Building on the study of the standard particle swarm optimization algorithm's application to multiple sequence alignment [8], this paper proposes a novel improved PSO for multiple sequence alignment. The improved PSO introduces the idea of chaos optimization, which effectively counters the premature convergence problem. Based on the SP optimization model of multiple sequence alignment, Chaotic Particle Swarm Optimization (CPSO) [9] is applied to multiple sequence alignment. The approach is examined on a set of standard instances taken from the benchmark alignment database BAliBase. The results show that the proposed algorithm improves alignment precision.
2 Description of the Problem

2.1 Multiple Sequence Alignment (MSA)
Multiple sequence alignment reflects the evolutionary relationship among the given sequences. Its abstract mathematical model is to add different numbers of gaps to each sequence of the sequence group so that all sequences have the same length and, ultimately, a good similarity. MSA can be formulated mathematically as follows.
A biological sequence is a string composed of $l$ characters. The characters are taken from a finite alphabet $\Sigma$. For DNA sequences, $\Sigma$ includes 4 letters (A, C, G, T), which express the 4 different nucleotides. For protein sequences, $\Sigma$ includes 20 different letters, which express the 20 different amino acids. These letters are collectively referred to as residues. Given a sequence group consisting of $n$ ($n \ge 2$) sequences $S = (s_1, s_2, \ldots, s_n)$, where $s_i = s_{i1} s_{i2} \ldots s_{il_i}$ $(1 \le i \le n)$, $s_{ij} \in \Sigma$ $(1 \le j \le l_i)$, and $l_i$ is the length of the $i$-th sequence, the multiple sequence alignment of $S$ can be expressed as a matrix $S' = (s'_{ij})$, in which $1 \le i \le n$, $1 \le j \le l$, and $\max(l_i) \le l \le \sum_{i=1}^{n} l_i$. The matrix needs to meet the following characteristics.
(1) Each sequence $s'_i$ is an extension of $s_i$, with $s'_{ij} \in \Sigma \cup \{-\}$. The symbol $-$ denotes a gap. Deleting the gaps from $s'_i$ leaves $s_i$.
(2) For all $i$, $j$: length($s'_i$) = length($s'_j$).
(3) There is no column in $S'$ composed only of $-$.
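The three characteristics of the alignment matrix can be checked mechanically. The following is a minimal sketch, assuming sequences are plain Python strings and the gap is written as `-`; the function name `is_valid_alignment` is hypothetical and not part of the paper.

```python
GAP = "-"  # assumed gap character


def is_valid_alignment(original, aligned):
    """Return True if `aligned` is a valid alignment matrix of `original`.

    `original` is a list of ungapped sequences, `aligned` the list of
    gapped rows in the same order.
    """
    if not aligned or len(original) != len(aligned):
        return False
    width = len(aligned[0])
    # Characteristic 2: every row has the same length.
    if any(len(row) != width for row in aligned):
        return False
    # Characteristic 1: deleting the gaps from a row restores the input sequence.
    if any(row.replace(GAP, "") != seq for seq, row in zip(original, aligned)):
        return False
    # Characteristic 3: no column consists of gaps only.
    return all(any(row[j] != GAP for row in aligned) for j in range(width))
```

A check of this kind is useful after every particle update, since an update that produces an all-gap column or rows of unequal length yields an invalid candidate.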
2.2 The Standard for Judging Multiple Sequence Alignment
For multiple sequence alignment, there are several different objective functions, such as the sum-of-pairs (SP) function [10], the hidden Markov model (HMM) and the COFFEE function. In this paper, the SP function is used as the objective function. Suppose the length of each aligned sequence is $L$ and the number of sequences is $N$. The $j$-th character of the $i$-th sequence is denoted as $c_{ij}$ $(1 \le j \le L)$. Then the sum-of-pairs score of all the $j$-th characters, taken over all pairs of sequences, is defined as SP-Score($j$):

$$\text{SP-Score}(j) = \sum_{i=1}^{N-1} \sum_{k=i+1}^{N} p(c_{ij}, c_{kj}) \qquad (1)$$

In the above equation, $p(c_{ij}, c_{kj})$ denotes the pairwise score of the characters $c_{ij}$ and $c_{kj}$. Its formal definition is given below.

$$p(c_{ij}, c_{kj}) = \begin{cases} +2 & (c_{ij} = c_{kj} \text{ and } c_{ij}, c_{kj} \in \Sigma) \\ -1 & (c_{ij} \ne c_{kj} \text{ and } c_{ij}, c_{kj} \in \Sigma) \\ -2 & (c_{ij} = \text{'-'} \text{ or } c_{kj} = \text{'-'}) \\ 0 & (c_{ij} = \text{'-'} \text{ and } c_{kj} = \text{'-'}) \end{cases} \qquad (2)$$

Then the score of all characters in the sequence group is:

$$SUM(S')_{align} = \sum_{j=1}^{L} \text{SP-Score}(j) \qquad (3)$$

If the input data is taken from the benchmark alignment database BAliBase, there is a standard alignment result, and we can calculate a relative SP score, called SPS:

$$SPS = SUM(S')_{align} / SUM(S^*)_{align} \qquad (4)$$

If there is no benchmark alignment, SPS is defined as

$$SPS = SUM(S')_{align} / (L \times N(N-1)/2) \qquad (5)$$
In equations (4) and (5), $SUM(S')$ is the score of the alignment produced by the proposed algorithm, while $SUM(S^*)$ is the score of the reference alignment from the benchmark database. SPS therefore reflects the proportion of correctly aligned residues. Normally, the higher the value of SPS, the more accurate the alignment and the better the algorithm reflects the biological characteristics of the sequences.
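The scoring scheme of equations (1) to (5) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Matlab code: it assumes that a mismatch scores -1 and a residue paired with a gap scores -2 (penalties), that a gap paired with a gap scores 0, and that alignments are lists of equal-length strings with `-` as the gap. The names `pair_score`, `sp_sum` and `sps` are hypothetical.

```python
GAP = "-"  # assumed gap character


def pair_score(a, b):
    """Pairwise score p(a, b) as in equation (2), under the assumed signs."""
    if a == GAP and b == GAP:
        return 0          # gap against gap
    if a == GAP or b == GAP:
        return -2         # residue against gap
    return 2 if a == b else -1


def sp_sum(alignment):
    """SUM(S')_align: sum of SP-Score(j) over all columns, equations (1) and (3)."""
    n = len(alignment)
    length = len(alignment[0])
    total = 0
    for j in range(length):
        for i in range(n - 1):
            for k in range(i + 1, n):
                total += pair_score(alignment[i][j], alignment[k][j])
    return total


def sps(alignment, reference=None):
    """Relative score: equation (4) with a reference alignment, else equation (5)."""
    score = sp_sum(alignment)
    if reference is not None:
        return score / sp_sum(reference)  # undefined if the reference sums to 0
    n = len(alignment)
    length = len(alignment[0])
    return score / (length * n * (n - 1) / 2)
```

For the toy alignment `["AC-", "A-G", "ACG"]`, the three columns score 6, -2 and -2, so `sp_sum` returns 2 and the reference-free SPS is 2/9.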
3 Chaotic Particle Swarm Optimization

3.1 Particle Swarm Optimization and the Judgment of Premature Convergence
Particle swarm optimization [7] is a stochastic algorithm. It is initialized with a population of individuals placed randomly in the search space and searches for the optimal solution by updating the individuals iteratively. The movement of each particle is influenced by two factors: the best solution pbest found by the particle itself, and the best solution gbest found by all particles. Particles update themselves by tracking these two extremes. The velocity of a particle and its new position are then assigned according to the following equations.

$$v(t+1) = w\,v(t) + c_1 r_1 (pbest(t) - x(t)) + c_2 r_2 (gbest(t) - x(t)) \qquad (6)$$

$$x(t+1) = x(t) + v(t+1) \qquad (7)$$

Here $w$ is the inertia weight, which controls the memory of the PSO. $c_1$, $c_2$ are acceleration constants, which determine the relative influence of pbest and gbest. $r_1$, $r_2$ are generated randomly between 0 and 1. $x(t)$ and $v(t)$ denote the location and velocity at the $t$-th iteration, respectively.
Analysis of equations (6) and (7) shows that when some particles get close to gbest, the velocity update is determined mainly by $w\,v(t)$. Since $w < 1$, the velocity of these particles becomes smaller and smaller, even close to 0. As the iterations proceed, other particles congregate around these inert particles, so that the algorithm stagnates and converges prematurely.
The basis for judging premature convergence [11] is as follows. First, two threshold values are predefined, one for the average distance of the particles and one for the variance of the fitness. If the average distance Dis and the fitness variance $\sigma^2$ both fall below their respective thresholds, the population is judged premature. The average distance of the particles, Dis, represents the degree of dispersion of the population.
$$Dis = \frac{1}{PopSize \cdot L} \sum_{i=1}^{PopSize} \sqrt{\sum_{d=1}^{D} (p_{id} - \bar{p}_d)^2} \qquad (8)$$
$L$ is the maximal diagonal length of the search space, $D$ is the dimension of the solution space, $p_{id}$ denotes the $i$-th particle's coordinate in the $d$-th dimension, and $\bar{p}_d$ denotes the average of all particles' coordinates in the $d$-th dimension. The smaller Dis is, the more concentrated the population is; the larger Dis is, the more scattered the population is. The variance of the population's fitness, $\sigma^2$, is defined as
$$\sigma^2 = \sum_{i=1}^{PopSize} \left( \frac{f_i - f_{avg}}{f} \right)^2 \qquad (9)$$
$f_i$ denotes the fitness of the $i$-th particle, and $f_{avg}$ is the average fitness of all particles. $f$ is a normalizing calibration factor, defined as:
$$f = \begin{cases} \max\limits_{1 \le i \le PopSize} |f_i - f_{avg}|, & \max |f_i - f_{avg}| > 1 \\ 1, & \text{otherwise} \end{cases} \qquad (10)$$
The variance of all particles' fitness, $\sigma^2$, reflects the degree of aggregation of the population. The smaller $\sigma^2$ is, the more tightly clustered the population is. As the iterations proceed, the fitness values of the particles become closer and closer. When $\sigma^2$ falls below its threshold, the algorithm easily falls into a local optimum and converges prematurely.
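The premature-convergence test of equations (8) to (10) can be sketched as follows. The sketch assumes real-valued particle positions stored as lists of floats, a Euclidean distance inside equation (8) and a squared normalized deviation in equation (9); the function names and the parameter names `dist_threshold` and `var_threshold` are hypothetical stand-ins for the two predefined thresholds.

```python
import math


def average_distance(positions, diag_length):
    """Dis of equation (8): mean distance to the swarm centroid, scaled by L."""
    pop = len(positions)
    dim = len(positions[0])
    centroid = [sum(p[d] for p in positions) / pop for d in range(dim)]
    total = sum(
        math.sqrt(sum((p[d] - centroid[d]) ** 2 for d in range(dim)))
        for p in positions
    )
    return total / (pop * diag_length)


def fitness_variance(fitness):
    """sigma^2 of equation (9), normalized by the factor f of equation (10)."""
    avg = sum(fitness) / len(fitness)
    max_dev = max(abs(f - avg) for f in fitness)
    norm = max_dev if max_dev > 1 else 1.0
    return sum(((f - avg) / norm) ** 2 for f in fitness)


def is_premature(positions, fitness, diag_length, dist_threshold, var_threshold):
    """Judge the swarm premature when both measures fall below their thresholds."""
    return (
        average_distance(positions, diag_length) < dist_threshold
        and fitness_variance(fitness) < var_threshold
    )
```

Because the normalizing factor never drops below 1, the variance stays small whenever all fitness values lie within one unit of the mean, which is the situation the test is meant to detect.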
3.2 Chaos and Chaotic Particle Swarm Optimization
Chaos is a universal nonlinear phenomenon whose behavior is complex and appears random, yet follows strict rules. Because of the ergodicity of chaos, searching with chaotic variables is superior to searching blindly and without order, and it can help avoid the tendency of evolutionary algorithms to fall into local optima. The characteristic properties of chaos are as follows. Randomness: chaos behaves irregularly, like a random variable. Ergodicity: it can pass through all states in a range without repetition. Regularity: chaos is generated by a deterministic function. Sensitivity: small changes in the initial value can cause a great change in the output after a period of time.
In view of the ergodicity of chaos and its sensitivity to the initial value, chaotic initialization can be used to overcome the problem that the particles of standard PSO are distributed non-uniformly in the solution space because of the random selection of initial values. To enrich the search behavior, chaotic dynamics are incorporated into PSO (CPSO). The logistic map is usually employed for constructing CPSO. The logistic mapping

$$z_{n+1} = \mu z_n (1 - z_n) \qquad (11)$$

is a typical chaotic system; when $\mu = 4$, it is completely in a chaotic state. A minute difference in the initial value of the chaotic variable results in a considerable difference in its long-term behavior. The track of the chaotic variable can travel ergodically over the whole search space. The chaotic sequence is then mapped from the chaotic interval $[0, 1]$ to the variable interval $[a, b]$ by equation (12), where $p_{i,n}$ is the chaotic variable.

$$P_{i,n} = a + (b - a)\,p_{i,n} \qquad (12)$$

In this paper, when PSO converges prematurely, a chaotic search is used to update the particles of the current population as follows. When a particle is trapped in a local optimum, the initial chaotic variables are first generated between 0 and 1. A new chaotic sequence is then generated by the logistic mapping of equation (11), and the chaotic variables are mapped onto the range of the optimization variables by equation (12). The fitness of each sequence is calculated and the best fitness is recorded until the maximum number of iterations is reached. Finally, the best particle of the chaotic population is compared with the best particle of the current population. If the former is better than the latter, the latter is replaced with the former; otherwise, a particle is chosen randomly from the current population and replaced with the best chaotic sequence.
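The chaotic search can be illustrated on a one-dimensional toy problem. The sketch below assumes a fitness function that is to be maximized (as the SP score is) and a single real-valued variable; `logistic_sequence`, `to_interval` and `chaotic_search` are hypothetical names, and the replacement of a swarm particle by the returned candidate is left out.

```python
def logistic_sequence(z0, length, mu=4.0):
    """Iterate the logistic map z <- mu * z * (1 - z), equation (11)."""
    values = []
    z = z0
    for _ in range(length):
        z = mu * z * (1.0 - z)
        values.append(z)
    return values


def to_interval(values, a, b):
    """Map chaotic values from [0, 1] to [a, b], equation (12)."""
    return [a + (b - a) * z for z in values]


def chaotic_search(fitness, z0, a, b, max_iter):
    """Evaluate a chaotic sequence of candidates and keep the best one."""
    best_x, best_f = None, float("-inf")
    for x in to_interval(logistic_sequence(z0, max_iter), a, b):
        f = fitness(x)
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f
```

The starting value `z0` should avoid 0, 0.25, 0.5, 0.75 and 1, because with $\mu = 4$ these values lead to fixed points of the map (for example, 0.25 maps to 0.75, which maps to itself) and the sequence stops exploring.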
4 Multiple Sequence Alignment Based on Chaotic PSO

4.1 Relevant Definition
In order to use PSO to solve multiple sequence alignment, several notions are redefined as follows.
Definition 1: A residue sequence $s = (a_1, a_2, \ldots, a_n)$ is given. $sk(i_k)$ $(1 \le k \le n)$ denotes the collection of locations at which gaps are inserted into the sequence $s$. Then $s' = s + sk(i_k)$ is the new sequence obtained by applying the $sk(i_k)$ operator. Here, + is given a different meaning.
E.g.1: s' = svynpgnygpylq + sk(0,3,7) = _svy_npgn_ygpylq, where _ is the inserted gap.
Definition 2: $\ominus$ is defined as the subtraction operator. It removes from the first collection the elements that are also contained in the second collection.
E.g.2: A = sk(2,3,6,8,10) $\ominus$ sk(0,2,5,6,7,9) = sk(3,8,10).
Definition 3: $\oplus$ is defined as the union operator. It unites the collections and removes duplicate elements.
E.g.3: A = sk(2,3,6,8,10) $\oplus$ sk(0,2,5,6,7,9) = sk(0,2,3,5,6,7,8,9,10).
With subtraction and addition redefined in this way, equation (6) becomes applicable to multiple sequence alignment.

$$v(t+1) = w\,v(t) \oplus \alpha\,(pbest(t) \ominus x(t)) \oplus \beta\,(gbest(t) \ominus x(t)) \qquad (13)$$

$$x(t+1) = x(t) + v(t+1) \qquad (14)$$

$w$, $\alpha$, $\beta$ are generated randomly between 0 and 1. For multiple sequence alignment, $\alpha$ and $\beta$ need to be larger than $w$ to obtain a better result, because the collections $(pbest(t) \ominus x(t))$ and $(gbest(t) \ominus x(t))$ contain the gaps inserted into pbest and gbest, and this information should be preserved for the next generation.
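The three operators of Definitions 1 to 3 can be sketched with Python sets and reproduce E.g.1 to E.g.3. The sketch assumes that each location refers to an index of the ungapped sequence and that a gap is placed before the residue at that index (which is what E.g.1 implies); the function names are hypothetical, and how the coefficients of equation (13) weight each collection is not modeled here.

```python
def insert_gaps(seq, positions, gap="_"):
    """s + sk(...): place one gap before each listed index of the ungapped sequence."""
    marks = set(positions)
    out = []
    for i, residue in enumerate(seq):
        if i in marks:
            out.append(gap)
        out.append(residue)
    if len(seq) in marks:  # a location equal to len(seq) appends a trailing gap
        out.append(gap)
    return "".join(out)


def subtract(first, second):
    """Subtraction operator: drop from `first` every location also in `second`."""
    return sorted(set(first) - set(second))


def unite(first, second):
    """Union operator: merge two collections and remove duplicates."""
    return sorted(set(first) | set(second))


print(insert_gaps("svynpgnygpylq", [0, 3, 7]))               # _svy_npgn_ygpylq
print(subtract([2, 3, 6, 8, 10], [0, 2, 5, 6, 7, 9]))        # [3, 8, 10]
print(unite([2, 3, 6, 8, 10], [0, 2, 5, 6, 7, 9]))           # [0, 2, 3, 5, 6, 7, 8, 9, 10]
```

One consequence of the set representation is that a collection cannot hold two gaps at the same location; runs of consecutive gaps would need a different encoding.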
4.2 Several Problems to Be Solved
Problem 1: Initialization. Suppose there are $k$ sequences to be aligned, with lengths $l_1$ to $l_k$. Parent alignments are represented as a matrix in which each sequence is encoded as a row over the considered alphabet. The length of each row in the initialized matrix lies between $l_{\max}$ and $1.2\,l_{\max}$, where $l_{\max} = \max(l_1, l_2, \ldots, l_k)$. The factor 1.2 is chosen according to the analysis of the simulation results in [12]. The number of gaps is always less than 20% of the current sequence's length.
Problem 2: Individual encoding. In view of the characteristics of multiple sequence alignment, a two-dimensional encoding method is used. It is simple, intuitive and easy to operate on. However, this method takes up a lot of memory. E.g.4: There are three sequences s1 = ydgeilyqskrf, s2 = adesvynpgn and s3 = ydetpikqser. The encoded result is shown in Figure 1.
Fig. 1. Encoded result
Problem 3: Integer gap locations. Because a chaotic search mechanism is introduced, the chaotic sequences are produced by the logistic mapping, and the mapped results are non-integer. However, the chaotic sequences in this multiple sequence alignment algorithm express the locations of the inserted gaps, which must be integers. The solution is as follows. First, truncate the values of the sequence to integers after the logistic mapping. Then check whether the location where the gap is to be inserted is already occupied. If it is not, the gap is inserted at this location. Otherwise, the gap is inserted at a location chosen randomly from the remaining locations.
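The integer conversion described in Problem 3 might look as follows. This is a sketch under stated assumptions: the chaotic value is scaled to `n_slots` candidate locations before truncation, and a collision is resolved by a uniform random choice among the unused locations. The function `chaotic_gap_positions` and its parameters are hypothetical.

```python
import random


def chaotic_gap_positions(z0, count, n_slots, rng=random, mu=4.0):
    """Pick `count` distinct integer gap locations in 0..n_slots-1 from a logistic sequence."""
    if count > n_slots:
        raise ValueError("cannot place more gaps than available locations")
    chosen = set()
    z = z0
    for _ in range(count):
        z = mu * z * (1.0 - z)                      # logistic map step
        pos = min(int(z * n_slots), n_slots - 1)    # truncate to an integer location
        if pos in chosen:
            # Location already used: fall back to a random unused location.
            free = [p for p in range(n_slots) if p not in chosen]
            pos = rng.choice(free)
        chosen.add(pos)
    return sorted(chosen)
```

Passing a seeded `random.Random` instance as `rng` makes the fallback reproducible, which helps when comparing runs.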
4.3 The Specific Steps of the Algorithm
The steps of the algorithm are described in detail as follows.
Step 1: Initialize the locations of the population using chaotic variables. The row lengths are between $l_{\max}$ and $1.2\,l_{\max}$, where $l_{\max}$ is the length of the longest input sequence. The initial velocity is generated randomly, and its length is also between $l_{\max}$ and $1.2\,l_{\max}$. Generate the locations of the gaps randomly and make sure that no column is composed only of gaps;
Step 2: Calculate each particle's fitness according to the objective function, that is, equation (2);
Step 3: Update pbest and gbest;
Step 4: Calculate the variance of all particles' fitness. Then judge whether PSO has converged prematurely according to the variance and the predefined threshold value. If it has, go to Step 5; otherwise, go to Step 6;
Step 5: Perform a chaotic search for the population in accordance with the idea of Section 3.2. Specifically, generate a chaotic population based on the best particle of the current population. Then replace a randomly chosen particle with the best particle of the chaotic population;
Step 6: Update each particle's location and velocity according to equations (13) and (14);
Step 7: If the termination condition is met, leave the loop and output the best particle and its fitness; otherwise, go to Step 2 and begin the next iteration.
5 Simulation and Results

The algorithm is implemented in Matlab 7.0. The machine used for this research is a personal computer with a 1.86 GHz processor and 1 GB of main memory. The parameters are set as follows: $popsize = 30$, $\alpha = 0.8$, $\beta = 0.8$, $w_{end} = 0.4$, $w_{start} = 0.9$, $Maxiter = 500$, and the threshold value of the variance is 10. $w$ declines linearly from $w_{start}$ to $w_{end}$. The test data sets used in the experiments are from the benchmark alignment database BAliBase. In this experiment, each test data set is run 30 times, and the average, maximum and minimum SPS are then calculated. The results obtained by the proposed CPSO algorithm are shown in Table 1.
Table 1. The results of CPSO for solving MSA problems with different identities

Identity     Sequence       Average   Maximum   Minimum
< 25%        SH3            0.5971    0.9461    0.4559
< 25%        twitchin       0.4721    0.5248    0.4215
20% ~ 40%    SH2            0.6235    0.6807    0.5723
20% ~ 40%    Cytochrome c   0.5346    0.5731    0.4822
> 35%        Ribonuclease   0.8256    0.8761    0.7778
> 35%        immunophilin   0.5146    0.5739    0.4611
Table 1 shows that for the sequence Ribonuclease, the result of chaotic PSO is closest to the benchmark result (average SPS 0.8256), and the difference between the maximum and the minimum is small, which indicates that the algorithm is robust on this data set. For the sequence SH3, the average SPS of our algorithm, 0.5971, is larger than the value of 0.537 reported in reference [13], which supports the validity of the new algorithm. For the other sequences, the results of chaotic PSO are still some distance from the benchmark alignment. The reason may lie in the parameter settings and in the sequences themselves.
s1: saeytCGSTCYWSSDVSAA-KAKgyslyesgd-tiddYPHEYHDYEGFDFpvsG
s2: attCGSTNYSASQVRAAA--NAacqyyqnddtasstY-PHTYNNYEGFDFpvdG
s3: acmyiCGSVCYSSSAISAALNKgysyyedgatagsssYPHRYNNYEGF-DFpta
s4: tCGKV-FYSASAVSAASNAacn-y-vrag-staggstYPHVYNNYEGFRFkgls
s5: -CGGTYYSSTQVNR-AINNaks-gqy-sstgY--PHTY-NNYEGFDFsdycd-G

s1: -TYYEYPIMSDYDVYTGGSPGAD-RVIFNGDDELAGVITHTGASGDDFVAC
s2: PY-QEFPIKSG-GVYTGGSPGADRVVINTNBE-YAGAITHTGASGNNFVGA
s3: KPWYEFPILSSGRVYTGGSPGAD-RVIFDSHGNLDMLITHNGASGNNFVAC
s4: KPFYEFPILSSGKTYTGGSPGADRVVINGQCS-IAGIITHTGASGNAFVAC
s5: PYKEYPLKTS-SSGYTGGSPGADRVVYDSNDGTFCGAITHTGASGNNFVQC
Fig. 2. The alignment result for Ribonuclease
Fig. 3. The score of all characters for each sequence group with iteration
Alignment results were obtained for all six problems. Here, the alignment of Ribonuclease is shown in Figure 2. Many columns are clearly matched precisely, which supports the validity of the improved algorithm. Figure 3 shows the evolution of the score of all characters for each sequence group.
6 Conclusions
In this paper, we present an improved algorithm, chaotic particle swarm optimization (CPSO), for multiple sequence alignment. In the proposed CPSO algorithm, chaos is introduced into PSO. The simulation results show that the proposed algorithm is effective and can improve search performance for some sequences. The new method offers a different viewpoint for related research. In future work, we will improve the algorithm so that it depends less on the sequences themselves and so that the parameters can be adjusted adaptively.
Acknowledgements. The authors would like to thank all scholars for their previous research and every member of our team for their hard work.
References
1. Needleman, S.B., Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48(3), 443–453 (1970)
2. Hogeweg, P., Hesper, B.: The alignment of sets of sequences and the construction of phylogenetic trees: An integrated method. Journal of Molecular Evolution 20(2), 175–186 (1984)
3. Lee, C., Grasso, C., Sharlow, M.F.: Multiple sequence alignment using partial order graphs. Bioinformatics 18(3), 452–464 (2002)
4. Hernández-Guía, M., Mulet, R., Rodríguez-Pérez, S.: A new simulated annealing algorithm for the multiple sequence alignment problem: The approach of polymers in a random media. Physical Review E 72(3), 1–7 (2005)
5. Horng, J.-T., Wu, L.-C., Lin, C.-M., Yang, B.-H.: A genetic algorithm for multiple sequence alignment. LNCS, vol. 9, pp. 407–420. Springer, Heidelberg (2005)
6. Notredame, C., Higgins, D.G.: SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Research 24(8), 1515–1524 (1996)
7. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. of IEEE Intl. Conf. on Neural Networks, IV, pp. 1942–1948. IEEE Press, Piscataway (1995)
8. Lei, C., Ruan, J.: A particle swarm optimization algorithm for finding DNA sequence motifs. In: Proc. IEEE Conf., pp. 166–173 (2008)
9. Liu, B., Wang, L., Jin, Y.H.: Improved particle swarm optimization combined with chaos. Chaos, Solitons & Fractals 25(5), 1261–1271 (2005)
10. Thompson, J.D., Plewniak, F., Poch, O.: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 27(13), 2682–2690 (1999)
11. Jun-min, L., Yue-lin, G.: Chaos particle swarm optimization algorithm. Computer Applications 28(2), 322–325 (2008)
12. Lee, Z.-J., Su, S.-F., Chuang, C.-C., Liu, K.-H.: Genetic algorithm with ant colony optimization (GA-ACO) for multiple sequence alignment. Applied Soft Computing, 55–78 (2008)
13. Wei-li, X., Zhen-cing, W., Bao-guo, X.: Application of the PSO algorithm with mutation operator to multiple sequence alignment. Control Engineering of China 15(4), 357–368 (2008)
