The Prediction of Disulphide Bonding in HIV and Other Lenti-Viruses by Machine Learning Techniques

International Journal of Scientific Research in Knowledge, 2(2), pp. 57-66, 2014. Available online at http://www.ijsrpub.com/ijsrk
ISSN: 2322-4541; 2014 IJSRPUB http://dx.doi.org/10.12983/ijsrk-2014-p0057-0066
Full Length Research Paper The Prediction of Disulphide Bonding in HIV and other lenti-viruses by Machine Learning Techniques
Anubha Dubey
Research Scholar, Department of Bioinformatics, MANIT BHOPAL (M.P), INDIA Email: anubhadubey@rediffmail.com
Received 01 November 2013; Accepted 28 December 2013
Abstract. The introduction of disulphide bonds into proteins is an important mechanism by which proteins have evolved and are still evolving. Most protein disulphide bonds are motifs that stabilize the tertiary and quaternary protein structure. These bonds are also thought to assist protein folding by decreasing the entropy of the unfolded form. The amino acid cysteine plays a fundamental role in the formation of disulphide bonds. In the present study, the proteomics of disulphide bonding in HIV is studied through a machine learning model developed to classify disulphide bonds from different species of lentiviruses: bovine immunodeficiency virus (BIV), simian immunodeficiency virus (SIV), feline immunodeficiency virus (FIV), murine infectious virus (MIV), equine infectious anaemia virus (EIV) and human immunodeficiency virus (HIV). The phylogenetic relationship among these viruses is also studied alongside the prediction of disulphide bonding. Among the WEKA classification algorithms, J48 gives the best classification, with an accuracy of 89.6104%.
Keywords: Disulphide bond, motifs, lentiviruses, Phylogenetic, WEKA.
1. INTRODUCTION
Disulfide bonds play an important role in the folding and stability of some proteins, usually proteins secreted to the extracellular medium (Sevier and Kaiser, 2002). Since most cellular compartments are reducing environments, disulfide bonds are in general unstable in the cytosol, with some exceptions, unless a sulfhydryl oxidase is present (Hatahet et al., 2010).
Fig. 1: Cystine is composed of two cysteines linked by a disulfide bond (shown here in its neutral form)
Disulfide bonds in proteins are formed between the thiol groups of cysteine residues. The other sulphur-containing amino acid, methionine, cannot form disulfide bonds. A disulfide bond is typically denoted by hyphenating the abbreviations for cysteine, e.g., when referring to Ribonuclease A, the "Cys26-Cys84 disulfide bond", or the "26-84 disulfide bond", or most simply "C26-C84" (Ruoppolo et al., 2000). The structure of a disulfide bond can be described by the dihedral angle about the S-S bond (between the Cβ-Sγ-Sγ-Cβ atoms), which is usually close to 90°. The disulfide bond stabilizes the folded form of a protein in several ways:
1) It holds two portions of the protein together, biasing the protein towards the folded topology. That is, the disulfide bond destabilizes the unfolded form of the protein by lowering its entropy.
2) The disulfide bond may form the nucleus of a hydrophobic core of the folded protein, i.e., local hydrophobic residues may condense around the disulfide bond and onto each other through hydrophobic interactions.
3) Related to #1 and #2, because the disulfide bond links two segments of the protein chain, it increases the effective local concentration of protein residues and lowers the effective local concentration of water molecules. Since water molecules attack amide-amide hydrogen bonds and break up secondary structure, a disulfide bond stabilizes secondary structure in its vicinity (Thorton, 1981; Wetzel, 1987).
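The dihedral angle mentioned above can be computed from the coordinates of the four atoms that span the bond. The following is only a minimal sketch in plain Python: the function name `dihedral_deg` and the idealised coordinates are illustrative assumptions, not values taken from any structure in this study, and the sign of the result depends on the atom ordering.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle in degrees for four points, e.g. Cb-Sg-Sg-Cb."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    b2_len = math.sqrt(_dot(b2, b2))
    y = b2_len * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Hypothetical, idealised coordinates (in angstroms), for illustration only.
cb1 = (1.0, 0.0, 0.0)
sg1 = (0.0, 0.0, 0.0)
sg2 = (0.0, 0.0, 2.04)
cb2 = (0.0, 1.0, 2.04)
angle = dihedral_deg(cb1, sg1, sg2, cb2)
print(f"|dihedral| = {abs(angle):.1f} degrees")
```

In practice the four points would be read from a structure file or from a homology model; the toy geometry here is chosen only so that the two planes are perpendicular.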
For protein folding prediction, a correct prediction of disulfide bridges can greatly reduce the search space (Skolnick et al., 1997; Huang, 1999). The prediction of the disulfide bonding pattern helps, to a certain degree, to predict the 3D structure of a protein and hence its function, because disulfide bonds impose geometrical constraints on the protein backbone. Recent work has shown a close relation between disulfide bonding patterns and protein structures (Chuang, 2003; Vlijmen, 2004). By stabilizing protein structure, disulphide bonds can protect proteins from damage and increase their half-life. Once disulphide bonds are formed, they remain unchanged for the life of the protein.
In the realm of disulfide bond prediction, four problems are addressed. The first is protein chain classification: to classify whether the protein contains disulfide bridge(s) or not. The second is residue classification: to predict the bonding state of cysteines. The third is bridge classification, and the last is the prediction of the disulfide bonding pattern. Over the past years, significant progress has been made on the prediction of disulfide bonding states (Fariselli, 1999; Fiser and Simon, 2000; Martelli, 2002; Chen, 2004) and of the disulfide bonding pattern (Vullo, 2004; Ceroni, 2006; Song, 2007; Rubinstein, 2008). For disulfide bonding pattern prediction, apart from the methods proposed by Ferre and Clote (2005, 2006) and Chen et al. (2006), other methods are also used, with or without the bonding state known. A method for predicting from genomic data which organisms are rich in disulfide bonds has been described (Mallick, 2002; O'Connor, 2004). In the present study, a similar strategy was used: proteomic sequences are first used to generate the phylogenetic relation between HIV and other species of lentivirus, and disulphide bond prediction is then carried out among the species to examine disulphide richness across them.
HIV (Human Immunodeficiency Virus) is a member of the genus lentivirus, part of the family retroviridae. Lentiviruses have many common morphologies and biological properties. Many species are infected by lentiviruses, which are characteristically responsible for long-duration illnesses with a long incubation period. Lentiviruses are transmitted as single-stranded, positive-sense, enveloped RNA viruses. In this paper we examine the disulphide bonding relationship of HIV with other species of lentivirus: bovine immunodeficiency virus (BIV), simian immunodeficiency virus (SIV), feline immunodeficiency virus (FIV), murine infectious virus (MIV) and equine infectious anaemia virus (EIV), together with the two types of HIV, HIV1 and HIV2.
Table 1: The comparative features of HIV with other related viruses

S.No. 1. Features Occur FIV Cats BIV Cattle MLV Cancer in mouse 90 nm in diameter EIAV Horse SIV African green monkey HIV Human
2.
Genome size
3.
Enzymes Structural genes Open reading frames Accessory genes Envelope and core
80-100 nm and pleomorphic, diploid genome RTase, integrase, protease Gag, pol, env
Mature virus, 110-130 nm with 8.4 kb RTase, integrase, protease Gag, pol, env
120 nm
4.
RTase, integrase, protease Gag, pol, env
RTase, integrase, protease, ribonuclease Gag, pol, env
5.
absent
6.
Vif, vpr, rev
Regions between pol & env Nif, tat, rev
absent
absent
present
Present
absent
Tat, vif
7.
8.
Conserved RNA
Env codes for surface glycoprotein and transmembrane glycoprotein absent
Envelope present and core contains gag, gag-pol polyprotein absent
Gag, gag-pol polyprotein
Gag polyprotein
Gag, pol polyprotein
Vif, vpr, nef, vpu, vpx (HIV2), tat, rev, tev (fusion of tat, rev, and env) Gag-pol polyprotein
Present, called core encapsidation signal
absent
Present, called core encapsidation signal
Present in SR proteins, RNA interface etc.
One of the most important contributions of biological sequences to evolutionary analysis is that the sequences of different organisms are often related. Hence the role of the disulphide bond is studied among lentivirus species.
2. MATERIALS and METHODS
2.1. Data Preparation
The analysis is based on protein sequence data of BIV, FIV, EIAV, MLV, SIV and HIV, obtained from UNIPROT [30].
2.1.1 Phylogenetic methods
To study evolutionary relationships, it is important to perform multiple sequence alignment (MSA). MSA is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. Of the various MSA programs, CLUSTAL W2 (Chenna, 2003) was found to be suitable. The neighbour-joining method is used here. Neighbour-joining (NJ) is a bottom-up clustering method used for the construction of phylogenetic trees. Usually used for trees based on DNA or protein sequence data, the algorithm requires knowledge of the distance between each pair of taxa (e.g., species or sequences) in the tree. Phylogenetic methods play an important role in evolutionary analysis and are used here to obtain the evolutionary relationship of HIV. The sequences S1, S2, S3, S4, S5, S6, S7, S8 represent HIV1, HIV2, MLV, BIV, Lentivirus, Murine virus, Feline immunodeficiency virus, Equine infectious anaemia virus, Simian immunodeficiency virus. The following figures were obtained with CLUSTAL W2.
Fig. 2: (a) Neighbour Joining Unrooted tree
Fig. 2: (b) Neighbour Joining Rooted tree
Fig. 3: (a) Dendrogram Unrooted tree
Fig. 3: (b) Dendrogram Rooted tree
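To make the neighbour-joining step behind these trees more concrete, the sketch below shows how a single NJ iteration selects the pair of taxa to join from a pairwise distance matrix, using the standard criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). This is not the CLUSTAL W2 implementation; the function names, the taxon labels A to E and the distances are hypothetical and chosen only for illustration.

```python
def nj_q_matrix(dist):
    """Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)."""
    n = len(dist)
    totals = [sum(row) for row in dist]
    return [[(n - 2) * dist[i][j] - totals[i] - totals[j] if i != j else 0
             for j in range(n)]
            for i in range(n)]

def nj_pick_pair(names, dist):
    """Return the pair of taxa with the lowest Q value and that value."""
    q = nj_q_matrix(dist)
    n = len(names)
    i, j = min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: q[ij[0]][ij[1]])
    return names[i], names[j], q[i][j]

# Hypothetical symmetric distance matrix for five taxa (not data from this study).
names = ["A", "B", "C", "D", "E"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(names, dist))  # -> ('A', 'B', -50)
```

A full NJ run would then replace the chosen pair with a new internal node, recompute the distances to that node and repeat until the tree is resolved; only the pair-selection step is sketched here.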
Here the cladogram and NJ tree show that HIV 1 and 2 are related to Simian Immunodeficiency Virus. The alignment of a query sequence to a homologous sequence infers a likely three-dimensional mapping of the protein sequence in question, yielding homology-based structural predictions for many proteins. Considering all such protein sequences from a given genome as a group, the tendency of each amino acid type to appear in spatial proximity to every other type was then analyzed, taking into account the overall abundances of the 20 amino acid types. Enrichment in cysteine-cysteine proximity above the expected value was taken to indicate an enrichment of disulfide bonding. Since cysteine-cysteine proximity can also indicate metal-binding motifs, proteins were first filtered to remove those with metal-binding sites that would otherwise produce false-positive results.
2.1.2. Disulphide Bond Prediction
Of the various programs available for disulphide bond prediction, the Disulphide Bonding Connectivity Pattern (DBCP) server (Hsuan-Hung and Lin-Yu, 2010) is used, as it predicts disulphide bonding positions together with cysteine positions and can also relate disulphide bonding to metal binding sites. The server works as follows:
(1) Run the Basic Local Alignment Search Tool (BLAST) to get the template sequence of the input sequence. The parameters of BLAST are set as follows: the Expectation value (E) threshold for saving hits is set to a very large value, 10 000, and the database is set to PDB, which contains sequences derived from the 3D structure records of the Protein Data Bank. If the E-value of the template sequence is >10 or the template sequence shares <25% identity with the input sequence, then instead of going to Step 2, the method previously developed by Lin and Tseng (2009) to predict the disulfide bonding pattern is used.
(2) Align the input sequence and the template sequence.
(3) Feed the alignment file into MODELLER and run the procedure to evaluate the model of the input sequence using the template sequence.
(4) Get the coordinates (X, Y, Z) of the Cα (alpha carbon) of each residue.
(5) Code each cysteine pair as the NPD (normalized pair distance); this will be the input to the SVM (Hsuan-Hung and Lin-Yu, 2010).
(6) Feed the coding file into the Support Vector Machine (SVM) to predict the bonding probability of each cysteine pair with the trained model. The multiple trajectory search (Tseng and Chen, 2008) is tightly integrated with the SVM training. For more details, please refer to the Supplementary Data on the DBCP web server.
(7) Code the input file with the probabilities from the SVM output and use the modified weighted perfect matching algorithm to get the first level disulfide bonding connectivity.
(8) Justify the first level disulfide bonding connectivity with the thresholds to get the final result.
(9) Display the result on the web page or send the result to the user.
In Step 1, if the E-value of the template sequence is >10 or the template sequence shares <25% identity with the input sequence, a previously proposed method (Lin and Tseng, 2009) is used for prediction. In this method, the position specific scoring matrix, the normalized bond lengths, the predicted secondary structure of the protein and the physicochemical properties index of the amino acids were used as features. The multiple trajectory search and the SVM training were tightly integrated to train the predictor. More details can be obtained from Lin and Tseng (2009). The DBCP server is free and open to all users.
2.1.2.1. Evaluation
After considering four web servers for disulphide bond connectivity pattern prediction without prior knowledge of the bonding state of cysteines (Ferre et al., 2006; Song, 2007), we tested our prediction by 10-fold cross validation on the data set of FIV, BIV, MLV, EIAV, SIV and HIV, jointly named VIRUS. Disulphide bonds were observed between cysteine residues, and some of these residues also show metal binding sites. The result was then evaluated and classified with the J48 algorithm of WEKA 3.7. J48 is a decision tree classifier (Pfahringer IHW, 1999). The number of ways of forming p disulfide bonds from n cysteine residues is given by the formula
$$\frac{n!}{p!\,(n-2p)!\,2^{p}}$$
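As a rough numerical companion to this count, the sketch below evaluates the standard combinatorial expression n!/(p!(n-2p)!2^p) and, for a small case, enumerates every possible pairing of cysteines to pick the one with the highest summed pair probability. The enumeration is only a brute-force stand-in for the weighted perfect matching of step (7), not the DBCP implementation; the function names, cysteine positions and probabilities are hypothetical.

```python
from math import factorial

def count_bond_patterns(n, p):
    """Ways to form p disulfide bonds from n cysteines: n! / (p! (n-2p)! 2^p)."""
    if p < 0 or 2 * p > n:
        return 0
    return factorial(n) // (factorial(p) * factorial(n - 2 * p) * 2 ** p)

def perfect_matchings(items):
    """Yield every way of pairing up an even-length list of positions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def best_pattern(cysteines, pair_prob):
    """Brute force: the pairing with the largest summed pair probability."""
    return max(perfect_matchings(sorted(cysteines)),
               key=lambda m: sum(pair_prob[pair] for pair in m))

# Hypothetical cysteine positions and SVM-style pair probabilities.
cysteines = [10, 20, 30, 40]
pair_prob = {(10, 20): 0.1, (10, 30): 0.8, (10, 40): 0.2,
             (20, 30): 0.3, (20, 40): 0.9, (30, 40): 0.1}
print(count_bond_patterns(4, 2))           # -> 3
print(best_pattern(cysteines, pair_prob))  # -> [(10, 30), (20, 40)]
```

When every cysteine is bonded (n = 2p) the count reduces to (2p-1)!!, which grows quickly with p, so exhaustive enumeration is only practical for a handful of cysteines; this is why a matching algorithm is used in practice.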
2.1.2.2. Measurement of accuracy
A necessary step towards the prediction of disulphide connectivity is the prediction of the disulphide bonding state of cysteines in proteins. In order to evaluate the accuracy of the prediction, two indexes can be used: Qp and Qc. For a protein P, Qp is defined as:
$$Q_p = \delta(\text{correct pattern}, \text{predicted pattern}) \tag{1}$$
where δ(x, y) is 1 if and only if the predicted pattern coincides with the correct pattern, and 0 otherwise. Alternatively, Qc is defined as:
$$Q_c = \frac{\text{number of correctly predicted pairs}}{\text{number of possible pairs}} \tag{2}$$
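The two indexes can be illustrated with a short sketch. It assumes that a connectivity pattern is given as a list of residue-number pairs and that the "number of possible pairs" in Qc is the number of bonds in the correct pattern; both the representation and the function names are assumptions made for this example, and the patterns shown are illustrative.

```python
def _normalise(pattern):
    """Make a pattern order-independent: a set of unordered residue pairs."""
    return frozenset(frozenset(pair) for pair in pattern)

def q_p(correct, predicted):
    """1 if the whole predicted pattern matches the correct one, else 0."""
    return 1 if _normalise(correct) == _normalise(predicted) else 0

def q_c(correct, predicted):
    """Fraction of the correct bonds that appear in the prediction."""
    true_pairs = _normalise(correct)
    hits = len(true_pairs & _normalise(predicted))
    return hits / len(true_pairs)

# Illustrative four-bond pattern and a prediction that gets two bonds right.
correct = [(26, 84), (40, 95), (58, 110), (65, 72)]
predicted = [(84, 26), (40, 95), (58, 72), (65, 110)]
print(q_p(correct, predicted))  # -> 0
print(q_c(correct, predicted))  # -> 0.5
```

Averaging `q_p` over all proteins in a test set gives the global pattern accuracy, while `q_c` is usually accumulated over all bonds in the set.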
The two indexes are equally suited and complementary for measuring the accuracy of the prediction: Qp is a measure of the predictive performance on each protein (either 1 or 0) and can be averaged over a number of predicted proteins to give a global measure of the accuracy of the method. Qc quantifies the accuracy of the method based on the number of pairs correctly predicted with respect to the total number of possible pairs. In order to score the method, its performance was also compared with that of a random predictor. The probability of a predictor randomly performing (Rp) on the prediction of the connectivity pattern can be computed. In general, given 2B cysteines, the number of possible connectivity patterns is:
$$N_p = (2B-1)!! = \prod_{i \le B} (2i-1) \tag{3}$$
The corresponding probability for Rp is:
$$Q_p(R_p) = \frac{1}{N_p} \tag{4}$$
For the random predictor (Rp), Qc is:
$$Q_c(R_p) = \frac{1}{2B-1} \tag{5}$$
Evaluation of predictive accuracy: the prediction accuracy was calculated following the standard conventions. Accuracy of the prediction:
$$Q_2 = \frac{N_c}{N_o} \tag{6}$$
where N_c is the total number of correctly predicted cysteines and N_o is the total number of cysteines. Specificity of the prediction:
$$\text{Specificity} = \frac{TN_x}{TN_x + FP_x} \tag{7}$$
where x denotes the bonded cysteines or non-bonded cysteines, FP_x is the number of false positives in the prediction and TP_x is the number of true positive predictions for bonding state x. Sensitivity of the prediction was calculated as:
$$\text{Sensitivity} = \frac{TP_x}{TP_x + FN_x} \tag{8}$$
where FN_x is the number of false negatives for bonding state x. The Matthews correlation coefficient (MCC) is calculated as:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TP+FP)(TN+FP)(TN+FN)}}$$
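The measures in this subsection are simple functions of the four confusion counts and of the number of bonds B. The sketch below is one possible way to compute them; the function names and the example counts are hypothetical and are not results from this study.

```python
from math import sqrt

def sensitivity(tp, fn):
    """TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP)."""
    return tn / (tn + fp)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if any marginal total is zero."""
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def random_baseline(b):
    """Random-predictor scores for 2B cysteines: (1 / (2B-1)!!, 1 / (2B-1))."""
    n_patterns = 1
    for i in range(1, b + 1):
        n_patterns *= 2 * i - 1
    return 1 / n_patterns, 1 / (2 * b - 1)

# Hypothetical counts for one bonding state.
tp, tn, fp, fn = 40, 30, 10, 20
print(round(sensitivity(tp, fn), 3))   # -> 0.667
print(round(specificity(tn, fp), 3))   # -> 0.75
print(round(mcc(tp, tn, fp, fn), 3))   # -> 0.408
print(random_baseline(3))              # Qp(Rp) = 1/15, Qc(Rp) = 1/5
```

The random baseline falls off very quickly with B, which is why pattern accuracy on chains with many bridges should be read against it rather than in isolation.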

Here TN_x is the number of true negatives for bonding state x. The value of MCC indicates how good the prediction is: the closer the MCC is to 1, the closer the prediction is to a perfect prediction.
2.1.3. Prediction of oxidation state of cysteines
Knowledge of the oxidation state of cysteines provides a lot of information about the protein, such as the local sequence environment, the possible 3D structure of the protein and, in some cases, the function and working mechanisms of the protein. In this paper the position of oxidized cysteines was observed with the DBCP software (Hsuan-Hung and Lin-Yu, 2010).
2.1.4. Prediction of connectivity pattern of cysteines
Connectivity pattern prediction is a challenging and yet very biologically meaningful task. It is challenging because there are many possible disulphide bonding patterns for a given protein, and many factors influence the final connection pattern. Correct prediction of disulphide bonds helps in obtaining a stabilized three-dimensional protein structure. Research on the prediction of the connectivity pattern of cysteines is ongoing (Hsuan-Hung and Lin-Yu, 2010).
2.1.5. Prediction of number of disulphide bridges
Analysis of the prediction results shows that there is a relationship between the sum S(p) of all the probabilities of cysteines and the total number of bonded cysteines (as predicted by the DBCP server). The
total number of bonded cysteines, estimated using a linear regression approach, is even and does not exceed the total number of cysteines in the sequence. When the number of disulphide bridges in a chain increases, the performance decreases in general. The overall specificity and sensitivity using four different input schemes are around 51% to 55%. The variation in performance for chains with many disulphide bridges (K>6) is large because there are few such chains in the dataset. Thus, for proteins with a large number of disulphide bridges (K>6), the prediction must be used with caution. It is very difficult to correctly predict the entire disulphide connectivity pattern because the number of connectivity patterns increases exponentially with K.
2.2. J48 algorithm
The disulphide bond prediction is classified by a machine learning algorithm. J48 proves better for small biological data. J48 is a decision tree classifier. Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviours based on empirical data, such as sensor data or databases. A learner can take advantage of examples (data) to capture characteristics of interest of their unknown underlying probability distribution. Data can be seen as examples that illustrate relations between observed variables. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data; the difficulty lies in the fact that the set of all possible behaviours given all
possible inputs is too large to be covered by the set of observed examples (training data).
3. RESULTS AND DISCUSSIONS
Phylogenetic methods play an important role in evolutionary analysis and are used to obtain the evolutionary relationship of HIV with other species of lentivirus, i.e. MLV, BIV, Lentivirus, Murine virus, Feline immunodeficiency virus, Equine infectious anaemia virus and Simian immunodeficiency virus. The CLUSTAL W2 result shows the cladogram and NJ tree of the species, which indicate that HIV 1 and 2 are related to Simian Immunodeficiency Virus. Disulphide bonds are predicted for these sequences to find the similarity among lentivirus species, and disulphide bond based classification is then studied with J48, a machine learning technique.
3.1. Analysis of DBCP server
A web-based application system called DBCP is provided for the prediction of the disulfide bonding connectivity pattern without prior knowledge of the bonding state of cysteines. To the best of our knowledge, the best accuracy of disulfide connectivity pattern prediction (Qp) and that of disulfide bridge prediction (Qc) are 81% and 82%, respectively, on the data set of HIV and other HIV-related molecular sequences with 10-fold cross validation. The env gene plays a significant role in disulphide bond prediction. Disulphide bonds are correctly predicted for env and gag in FIV, env and pol in MLV, pol in EIA, and env in SIV and HIV. Table 2 shows the species with the positions of their disulphide bonds, which shows that the disulphide bond is conserved among species.
Table 2: Species with position of disulphide bonds

Species   Gene        Position of disulphide bonds
FIV       Env & gag   328-348
EIA       POL         322-342
MLV       ENV & POL   81-95, 112-129, 121-134, 165-184: 536-538, 561-576
SIV       ENV         99-207, 106-198, 180-193, 230-242, 300-333, 382-457, 389-430, 412-422
HIV       ENV         118-200, 125-191, 213-242, 223-234, 291-328, 374-435, 381-408
Stabilization of the native Env complex by disulfide bond linkage is likely to impose constraints on Env function, because a certain degree of flexibility is probably essential for Env to undergo the conformational changes that eventually lead to fusion of the viral and cellular membranes. The gp120-gp41 interface is considered to be structurally flexible, so constraining its motion might have adverse effects.
3.2. J48 based classification
The disulphide bond based classification of HIV and other related viruses is then performed with the J48 machine learning algorithm, which gives an accuracy of 89.6104%. The result obtained after 10-fold cross validation on the training data (VIRUS) is shown in Tables 3 to 5. In the field of machine learning, a confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
Table 3: Statistics of J48 algorithm

Correctly Classified Instances       69     89.6104 %
Incorrectly Classified Instances      8     10.3896 %
Kappa statistic                       0.8359
Mean absolute error                   0.0528
Root mean squared error               0.1845
Relative absolute error              56.0893 %
Root relative squared error          56.0893 %
Total Number of Instances            77
Table 4: Detailed accuracy by class

CLASS              TP RATE   FP RATE   PRECISION   RECALL   F-MEASURE   ROC
EIA                0         0         0           0        0           0.474
BIV                0         0         0           0        0           0.48
FIV                0.833     0.042     0.625       0.833    0.714       0.893
MLV                1         0.032     0.875       1        0.933       0.99
SIV                0.714     0         1           0.714    0.833       0.814
HIV                0.976     0.083     0.93        0.976    0.952       0.918
Weighted average   0.896     0.053     0.885       0.896    0.884       0.899

Table 5: Confusion Matrix between predicted and actual class
Predicted class (column)
 a   b   c   d   e   f    Classified as
 0   0   1   0   0   0    a = EIA
 0   0   1   0   0   0    b = BIV
 0   0   5   1   0   0    c = FIV
 0   0   0  14   0   0    d = MLV
 0   0   0   1  10   3    e = SIV
 0   0   1   0   0  40    f = HIV
Table 4 shows precision and recall, two widely used metrics for evaluating the correctness of a pattern recognition algorithm. They can be seen as extended versions of accuracy, a simple metric that computes the fraction of instances for which the correct result is returned. In a classification task, the precision for a class is the number of true positives (i.e. the number of items correctly labeled as belonging to the positive class) divided by the total number of elements labeled as belonging to the positive class (i.e. the sum of true positives and false positives, the latter being items incorrectly labeled as belonging to the class). Recall in this context is defined as the number of true positives divided by the total number of elements that actually belong to the positive class (i.e. the sum of true positives and false negatives, the latter being items that were not labeled as belonging to the positive class but should have been).
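These definitions can be checked against the confusion matrix of Table 5. The sketch below is not WEKA output; it recomputes the overall accuracy, the kappa statistic and per-class precision and recall from the matrix, read with rows as actual classes and columns as predicted classes. The function names are assumptions introduced for this example.

```python
LABELS = ["EIA", "BIV", "FIV", "MLV", "SIV", "HIV"]

# Rows = actual class, columns = predicted class (counts from Table 5).
CONFUSION = [
    [0, 0, 1, 0, 0, 0],    # EIA
    [0, 0, 1, 0, 0, 0],    # BIV
    [0, 0, 5, 1, 0, 0],    # FIV
    [0, 0, 0, 14, 0, 0],   # MLV
    [0, 0, 0, 1, 10, 3],   # SIV
    [0, 0, 1, 0, 0, 40],   # HIV
]

def accuracy(m):
    """Fraction of instances on the diagonal."""
    total = sum(map(sum, m))
    return sum(m[k][k] for k in range(len(m))) / total

def precision_recall(m, k):
    """Precision and recall for class index k (0.0 when undefined)."""
    tp = m[k][k]
    predicted = sum(row[k] for row in m)   # column total
    actual = sum(m[k])                     # row total
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

def kappa(m):
    """Cohen's kappa: agreement beyond what the marginals give by chance."""
    total = sum(map(sum, m))
    p_obs = accuracy(m)
    p_exp = sum(sum(m[k]) * sum(row[k] for row in m)
                for k in range(len(m))) / total ** 2
    return (p_obs - p_exp) / (1 - p_exp)

print(f"accuracy = {100 * accuracy(CONFUSION):.4f} %")
print(f"kappa    = {kappa(CONFUSION):.4f}")
for k, label in enumerate(LABELS):
    p, r = precision_recall(CONFUSION, k)
    print(f"{label}: precision = {p:.3f}, recall = {r:.3f}")
```

The recomputed values agree with Tables 3 and 4: 69 of 77 instances lie on the diagonal, and the single EIA and BIV instances are both assigned to FIV, which is why those two classes have zero precision and recall.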
Fig. 4: ROC for J48 Classifier
3.3. Receiver Operating Characteristic (ROC) Curve
The ROC curve is a graphical technique for evaluating data mining schemes. It is used when the learner is trying to select samples of test instances that have a high proportion of positives; the term characterizes the trade-off between hit rate and false alarm rate. ROC curves depict the performance of a classifier without regard to class distribution or error costs. They plot the number of positives included in the sample on the vertical axis, expressed as a percentage of the total number of positives, against the number of negatives included in the sample, expressed as a percentage of the total number of negatives, on the horizontal axis. For each fold of a 10-fold cross validation, weight the instances for a selection of different cost ratios, train the scheme on each weighted set, count the true positives and false positives in the test set, and plot the resulting point on the ROC axes.
The correlation between MLV, SIV and HIV is computed because their data are sufficient for such an analysis, whereas the data for EIV, FIV and BIV are too few for any analysis. Let MLV be X3, SIV be X2 and HIV be X1; the correlation coefficients are then r12 = 0.98551, r13 = 0.30182 and r23 = 0.47188. This shows that HIV is highly related to SIV. The Pearson correlation coefficient has been widely used for the analysis of proteomic data. Its popularity is likely due to its simplicity and interpretability: it essentially computes the strength of the linear relationship between two quantities (here, species).
4. CONCLUSION
Here a framework for disulphide bond prediction and classification is presented, with an accuracy of 89.6104%. Furthermore, DBCP is well suited to predicting disulphide bonds together with cysteine positions, and this web server is also able to find metal binding sites. Other methods that can predict both the disulfide bonds and the metal binding sites will be more suitable for prediction. A high metal binding site score (e.g. >0.5) indicates that there may be cysteines involved in the metal binding sites. From the protein sequence analysis it was found that the Env envelope glycoprotein shows disulphide bond conservation among all the species of retroviruses. The correlation between HIV and SIV was found to be 0.98551. Hence disulphide bonds are evolutionarily conserved throughout the species. The knowledge of disulfide richness in certain organisms suggests practical applications, including engineering enhanced protein stability and facilitating protein-fold recognition. Disulfide-rich organisms should allow the development of novel tools and approaches for attacking such problems of current
interest. This work depends upon the availability of sequenced proteomes, and the availability of additional proteomes has enabled the identification of an enigmatic protein family as a potential player in the biochemistry of cytoplasmic disulfide bonds. We hope this study will promote continued interest in sequencing more proteins from diverse organisms so as to further enhance the scope and resolution of comparative proteomics techniques. As more proteomes become available, we anticipate that the ease of discovery of specific proteomic adaptations to the environment will improve and yield further insights into molecular evolution and cell biology.
REFERENCES
Sevier CS, Kaiser CA (2002). Formation and transfer of disulphide bonds in living cells. Nature Reviews Molecular and Cellular Biology, 3(11): 836-847.
Hatahet F, Nguyen VD, Salo KEH, Ruddock LW (2010). Disruption of reducing pathways is not essential for efficient disulfide bond formation in the cytoplasm of E. coli. MCF, 9(67): 67.
Ruoppolo M, Vinci F, Klink TA, Raines RT, Marino G (2000). Contribution of individual disulfide bonds to the oxidative folding of ribonuclease A. Biochemistry, 39(39): 12033-42.
Thorton JM (1981). Disulphide bridges in globular proteins. J. Mol. Biol., 151: 261-287.
Wetzel R (1987). Harnessing disulphide bonds using protein engineering. Trends Biochem Sci., 12: 478-482.
Skolnick J, Kolinski A, Ortiz AR (1997). MONSSTER: a method for folding globular proteins with a small number of distance restraints. J. Mol. Biol., 265: 217-241.
Huang ES, Samudrala R, Ponder JW (1999). Ab initio fold prediction of small helical proteins using distance geometry and knowledge-based scoring functions. J. Mol. Biol., 290: 267-281.
Chuang CC, Chen CY, Yang JM, Lyu PC, Hwang JK (2003). Relationship between protein structures and disulfide-bonding patterns. Proteins, 55: 1-5.
Van Vlijmen HWT, Gupta A, Narasimhan LS, Singh J (2004). A novel database of disulfide patterns and its application to the discovery of distantly related homologs. J. Mol. Biol., 335: 1083-1092.
Fariselli P, Riccobelli P, Casadio R (1999). Role of evolutionary information in predicting the disulfide-bonding state of cysteine in proteins. Proteins, 36: 340-346.
Fiser A, Simon I (2000). Predicting the oxidation state of cysteines by multiple sequence alignment. Bioinformatics, 16: 251-256.
Martelli PL, Fariselli P, Malaguti L, Casadio R (2002). Prediction of the disulfide-bonding state of cysteines in proteins with hidden neural networks. Protein Eng., 15: 951-953.
Chen YC, Lin YS, Lin CJ, Hwang JK (2004). Prediction of the bonding states of cysteines using the support vector machines based on multiple feature vectors and cysteine state sequences. Proteins, 55: 1036-1042.
Fariselli P, Casadio R (2001). Prediction of disulfide connectivity in proteins. Bioinformatics, 17: 957-964.
Vullo A, Frasconi P (2004). Disulfide connectivity prediction using recursive neural networks and evolutionary information. Bioinformatics, 20: 653-659.
Ferrè F, Clote P (2005). Disulfide connectivity prediction using secondary structure information and diresidue frequencies. Bioinformatics, 21: 2336-2346.
Ferrè F, Clote P (2006). DiANNA 1.1: An extension of the DiANNA web server for ternary cysteine classification. Nucleic Acids Res., 34: W182-W185.
Chen BJ, Tsai CH, Chan CK, Kao CY (2006). Disulfide connectivity prediction with 70% accuracy using two-level models. Proteins, 64: 246-252.
Ceroni A, Passerini A, Vullo A, Frasconi P (2006). DISULFIND: a Disulfide Bonding State and Cysteine Connectivity Prediction Server. Nucleic Acids Res., 34: W177-W181.
Cheng J, Saigo H, Baldi P (2006). Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching. Proteins, 62: 617-629.
Song J, Yuan Z, Tan H, Huber T, Burrage K (2007). Predicting disulfide connectivity from protein sequence using multiple sequence feature vectors and secondary structure. Bioinformatics, 23: 3147-3154.
Rubinstein R, Fiser A (2008). Predicting disulfide bond connectivity in proteins by correlated mutations analysis. Bioinformatics, 24: 498-504.
Mallick P, Boutz DR, Eisenberg D, Yeates TO (2002). Genomic evidence that the intracellular proteins of archaeal microbes contain disulfide bonds. Proc Natl Acad Sci USA, 99: 9679-9684.
O'Connor BD, Yeates TO (2004). GDAP: A web tool for genome-wide protein disulfide bond prediction. Nucleic Acids Res., 32: W360-W364.
Poumbourios P, Maerz AL, Drummer HE (2003). Functional evolution of the HIV-1 envelope glycoprotein gp120-association site of gp41. J Biol Chem., 278: 42149-42160.
Hsuan-Hung L, Lin-Yu T (2010). DBCP: A web server for disulfide bonding connectivity pattern prediction without the prior knowledge of the bonding state of cysteines. Nucleic Acids Research, 38 (Web Server issue): W503-W507.
Pfahringer IHW (1999). WEKA: A Machine Learning Workbench for Data, www.cs.waikato.ac.nz.
Lin HH, Tseng LY (2009). Predicting of disulphide bonding pattern based on support vector Machines with parameters tuned by multiple trajectory search. WSEAS Trans. Compu., 9: 1429-1439.
Tseng LY, Chen C (2008). Multiple trajectories search for large scale global optimization. Proceedings of 2008 IEEE Congress on Evolutionary Computation, CEC08, Hong-Kong, 3052-3059.
Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (2003). Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res., 31: 3497-3500.
Anubha Dubey has submitted her PhD in Bioinformatics at Maulana Azad National Institute of Technology, Bhopal. She received her first degree, a Bachelor of Science in Biotechnology, from Rani Durgawati University, Jabalpur, in 2005. She obtained a Master of Science in Biotechnology from Barkatullah University, Bhopal, in 2007 with the dissertation "An Approach to Investigate the Phenomenon of Genomic Instability in Cultured Human Foetal Lung Fibroblast cells by modern Technologies". Her current research focuses on extracting information from HIV molecular sequences by machine learning techniques. To date, she has published several scientific articles related to the machine learning field.