Algorithm For Analysis of Amino Acid Bond Lengths of Proteins

Final Project Report
CS 487 Introduction to Cluster Computing
Old Dominion University
Tim Dugan, Computer Engineering Department
Gordon Bland, Computer Science Department
Tyler Wood, Computer Science Department
Adrian Ostolski, Computer Science Department
Research Advisor: Professor Jay Morris
ABSTRACT
Bioinformatics is a growing field of study that supplies a steady stream of computing problems
to solve. Although the definition of the term is still debated, the generally accepted idea is that
bioinformatics uses computers to solve biological problems or answer biological questions. One such
problem is to develop an algorithm that compares bond lengths within proteins in order to search for
protein keys. A protein key is a protein which sends signals to other cells by means of a chemical
reaction where the binding occurs. Our team has chosen to develop an algorithm for analysis of amino
acid bond lengths of proteins because this analysis will assist in identifying protein keys.
Table of Contents
1 Problem Description
1.1 Introduction
1.2 Background
1.3 Need for Solution
2 Solution Design
2.1 Module Design
2.2 Classes
2.3 Functions
3 Results
3.1 Generated Protein Files
3.2 Output
4 Conclusions and Recommended Further Research
5 Acknowledgements
6 Appendices
A References
B Source Code and Documentation
C Runtime Instructions
List of Figures
Figure 1. Amino Acid Structure
Figure 2. Peptide Bond
Figure 3. Function Call in Main
Figure 4. Modify Function
Figure 5. Load Function
Figure 6. Algorithm Function
Figure 7. Generate Function in Main
Figure 8. Generate Function
Figure 9. Protein1_clean
Figure 10. Protein2_clean
Figure 11. Protein1_raw
Figure 12. Protein2_raw
Figure 13. Protein_matchs
List of Tables
Table 1. Amino Acids by Name
1 PROBLEM DESCRIPTION
Bioinformatics is a growing field of study that supplies a steady stream of
computing problems to solve. Although the definition of the term is still debated,
the generally accepted idea is that bioinformatics uses computers to solve
biological problems or answer biological questions. One such problem is to develop an
algorithm that compares bond lengths within proteins in order to search for protein keys.
A protein key is a protein which sends signals to other cells by means of a chemical
reaction where the binding occurs.
For example, Dutch scientists found a protein, produced by glia cells in the central
nervous system, that transmits messages between brain cells which control the release of
chemicals that affect memory, attention, and addiction. Acetylcholine affects memory
and dopamine affects addiction, to name two. Scientists anticipate using this
protein key to develop drugs which will influence some neuronal functions without
affecting others. Many more protein keys, however, remain to be identified.
1.1 INTRODUCTION
The purpose of this Beowulf class is to demonstrate how cluster computing can be
used to solve large problems which could not otherwise be solved. Our team has
chosen to develop an algorithm for analysis of amino acid bond lengths of proteins
because this analysis will assist in identifying protein keys.
1.2 BACKGROUND
A protein is made up of amino acids, and therefore amino acids lie at the heart of
bioinformatics. There are approximately 20 amino acids found in the human body.
Each amino acid has unique properties and can be represented by a full name or a
three-letter or one-letter code, as shown in Table 1.
Name            3-letter  1-letter  Hydropathy   Charge
Alanine         Ala       A         Hydrophobic  Neutral
Cysteine        Cys       C         Hydrophobic  Neutral
Aspartic acid   Asp       D         Hydrophilic  Negative
Glutamic acid   Glu       E         Hydrophilic  Negative
Phenylalanine   Phe       F         Hydrophobic  Neutral
Glycine         Gly       G         Hydrophobic  Neutral
Histidine       His       H         Hydrophilic  Neutral/Positive/Negative
Isoleucine      Ile       I         Hydrophobic  Neutral
Lysine          Lys       K         Hydrophilic  Positive
Leucine         Leu       L         Hydrophobic  Neutral
Methionine      Met       M         Hydrophobic  Neutral
Asparagine      Asn       N         Hydrophilic  Neutral
Proline         Pro       P         Hydrophobic  Neutral
Glutamine       Gln       Q         Hydrophilic  Neutral
Arginine        Arg       R         Hydrophilic  Positive
Serine          Ser       S         Hydrophilic  Neutral
Threonine       Thr       T         Hydrophilic  Neutral
Valine          Val       V         Hydrophobic  Neutral
Tryptophan      Trp       W         Hydrophobic  Neutral
Tyrosine        Tyr       Y         Hydrophobic  Neutral
Table 1. Amino Acids by Name.
All amino acids share the same basic structure, built around a central carbon
atom known as the C-alpha. This carbon atom is bonded to a hydrogen atom, an amino
group, a carboxylic acid group, and a fourth group known as the variable sidechain.
Sidechains are what differentiate one amino acid from another. Amino acids are
connected by a peptide bond between the carboxyl group of the first amino acid and the
amino group of the second amino acid. Figure 1 shows the general structure of an
amino acid, and Figure 2 shows a peptide bond.
Figure 1. Amino Acid Structure.
Figure 2. Peptide Bond.
1.3 NEED FOR SOLUTION
This solution has the potential to improve our ability to overcome diseases and
other conditions. With each discovery of a protein key, scientists are able to make
progress toward curing or treating many causes of human suffering. In the example
described above, that particular protein key could be used to develop drugs against
Alzheimer’s disease and schizophrenia, or to help people quit smoking or overcome
other drug addictions. This solution could help save millions of lives.
2 SOLUTION DESIGN
Our team has designed a solution at Old Dominion University’s Beowulf Laboratory,
where our supercomputer is housed. This solution accepts input data in two ways: the
user either supplies an actual protein file or first generates protein files. If the
latter is chosen, the user is prompted for the number of protein nodes and the maximum
number of possible connections between nodes. The user also inputs the sigma value,
that is, the maximum difference allowed between two bond lengths for them to count as
a match. The code then generates random proteins based on these criteria and outputs
the comparison to a text file.
2.1 MODULE DESIGN
It is important to note that the modules are independent of one another.
The intent is that each function can be run as a stand-alone function, so that
the program designer can implement full user input and feed it into the
functions.
2.2 CLASSES
There are two classes used in this solution: the Point class and the Edge class.
The Point class contains the node id and the ‘x’ and ‘y’ coordinate values for the node
object. The Edge class contains two points and the distance between them.
These two classes can be found in Appendix B.
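The relationship between the two classes can be sketched as follows. This is a minimal illustration, not the code in Appendix B: the names SketchPoint, SketchEdge, and makeEdge are hypothetical, and the sketch assumes the edge length is the Euclidean distance between the two end points, as the load function computes it.

```cpp
#include <cmath>

// Hypothetical stand-in for the Point class: a node id plus coordinates.
struct SketchPoint {
    int id;
    double x;
    double y;
};

// Hypothetical stand-in for the Edge class: two points and their distance.
struct SketchEdge {
    SketchPoint a;
    SketchPoint b;
    double length;  // Euclidean distance between a and b
};

// Build an edge and compute its length from the two end points.
SketchEdge makeEdge(const SketchPoint& a, const SketchPoint& b) {
    return SketchEdge{a, b, std::hypot(b.x - a.x, b.y - a.y)};
}
```

Storing the length inside the edge means it is computed once at load time rather than on every comparison.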
2.3 FUNCTIONS
There are four major functions in our code: the generate, modify, load, and
algorithm functions. The generate function is discussed in Section 3.1. Figure
3 shows each of these functions being called in Main.cpp and the parameters each
takes.
Figure 3. Function Call in Main.cpp.
The modify function, shown in Figure 4, requires an input file name (s) and an
output file name (d). It reads the data from file s and outputs it to file d in a format
that the load function accepts.
Figure 4. Modify Function.
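The filtering rule used by the modify function can be sketched on an in-memory string. This is an illustrative sketch only: the function name cleanRawText is hypothetical, and it assumes the raw format written by the generate function in Appendix B, for example "A(1), 5, 4". The markup characters 'A', '(', ')' and ',' are dropped and each space becomes a line break, so every value ends up on its own line.

```cpp
#include <string>

// Hypothetical helper: apply the modify function's character filter to a string.
std::string cleanRawText(const std::string& raw) {
    std::string clean;
    for (char ch : raw) {
        // Drop the markup characters that surround the numbers.
        if (ch == 'A' || ch == '(' || ch == ')' || ch == ',') {
            continue;
        }
        // A space separates two values, so it becomes a line break.
        clean += (ch == ' ') ? '\n' : ch;
    }
    return clean;
}
```

Under these assumptions, the two raw lines "A(1), 5, 4" and "A(4), 10, 11" become the six values 1, 5, 4, 4, 10, 11, one per line.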

The load function, shown in Figure 5, requires an input file name (s) and a list
of edges (d). Its purpose is to load the data from file ‘s’ into the edge list ‘d’,
which is passed to the function by reference.
Figure 5. Load Function.
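A minimal sketch of the loading step is given below. It is not the code in Appendix B: the names LoadedEdge and loadEdges are hypothetical, the sketch reads from any input stream rather than a named file, and it assumes the clean format of six whitespace-separated values per edge (id, x, y of the head node, then id, x, y of the linked node).

```cpp
#include <cmath>
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical record for one loaded edge.
struct LoadedEdge {
    int fromId;
    int toId;
    double length;  // Euclidean distance between the two nodes
};

// Read six values per edge until the stream runs out or a record is incomplete.
std::vector<LoadedEdge> loadEdges(std::istream& in) {
    std::vector<LoadedEdge> edges;
    int id1, id2;
    double x1, y1, x2, y2;
    while (in >> id1 >> x1 >> y1 >> id2 >> x2 >> y2) {
        edges.push_back({id1, id2, std::hypot(x2 - x1, y2 - y1)});
    }
    return edges;
}
```

Testing the whole read in the loop condition means a truncated final record is skipped instead of producing an edge with stale values.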
The algorithm function, shown in Figure 6, is at the heart of our solution. This
function requires an output file name (s), two lists of edges (d and f), and a
delta (x). It compares every edge length in list ‘d’ with every edge length in list ‘f’;
if the difference between two lengths is within the delta ‘x’, the program outputs the
matching edges from ‘d’ and ‘f’ to the file ‘s’.
Figure 6. Algorithm Function.
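The comparison itself can be sketched on plain lists of lengths. This is an illustration under simplifying assumptions rather than the Appendix B code: the function name matchLengths is hypothetical, each edge is reduced to its length, and matches are returned as index pairs instead of being written to a file. The tolerance parameter is named sigma here, as in Main.cpp.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Return every (i, j) such that first[i] and second[j] differ by at most sigma.
std::vector<std::pair<std::size_t, std::size_t>>
matchLengths(const std::vector<double>& first,
             const std::vector<double>& second,
             double sigma) {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < first.size(); ++i) {
        for (std::size_t j = 0; j < second.size(); ++j) {
            if (std::fabs(first[i] - second[j]) <= sigma) {
                matches.push_back({i, j});
            }
        }
    }
    return matches;
}
```

Every edge of the first protein is compared with every edge of the second, so the work grows with the product of the two list sizes.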
3 RESULTS
Our team has successfully generated two separate protein files (protein1 and
protein2) which mimic actual protein files. We then used these protein files to
compare bond lengths. Our current solution can perform this comparison on any protein
file supplied to it, with no alterations.
3.1 GENERATED PROTEIN FILES
The generated protein files are created by the generate function. It requires an
output file name (s), an amino acid length (x), and a maximum number of bonds (y), and
generates a random protein of amino acid length x. Each amino acid is connected to
anywhere from 1 to y nodes, and the function outputs the data in a specified format to a
file named s.
Figure 7. Generate Function in Main.
Figure 8. Generate Function.
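The generation step can be sketched in memory as follows. This is a hedged illustration, not the Appendix B code: the names SketchNode, SketchProtein, and generateProtein are hypothetical, the sketch swaps rand() for a seeded std::mt19937 so that runs are repeatable, and it assumes maxLinks is at least 1. As in the report, coordinates grow by a random step of 1 to 10 per node, and each node makes 1 to maxLinks link attempts to later nodes.

```cpp
#include <random>
#include <utility>
#include <vector>

struct SketchNode {
    int label;
    int x;
    int y;
};

struct SketchProtein {
    std::vector<SketchNode> nodes;
    std::vector<std::pair<int, int>> links;  // (head node, linked node)
};

// Hypothetical generator; assumes length >= 0 and maxLinks >= 1.
SketchProtein generateProtein(int length, int maxLinks, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> step(1, 10);
    std::uniform_int_distribution<int> linkCount(1, maxLinks);

    SketchProtein protein;
    int x = 0;
    int y = 0;
    // Place the nodes with strictly increasing coordinates.
    for (int i = 0; i < length; ++i) {
        x += step(rng);
        y += step(rng);
        protein.nodes.push_back({i, x, y});
    }
    // Link each node to later nodes, skipping targets past the end.
    for (int i = 0; i < length; ++i) {
        int target = i;
        const int attempts = linkCount(rng);
        for (int j = 0; j < attempts; ++j) {
            target += step(rng);
            if (target < length) {
                protein.links.push_back({i, target});
            }
        }
    }
    return protein;
}
```

Because a link attempt that lands past the last node is dropped, nodes near the end of the chain may have fewer links than requested, or none.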
When the protein files are generated by the generate function, the files list the
node, its coordinates, the connecting node number, and the connecting node’s
coordinates. For example, the beginnings of our two generated protein files, Protein1
and Protein2, are shown in Figures 9 and 10.
Figure 9. Protein1_clean.
Figure 10. Protein2_clean.
3.2 OUTPUT
The output for this project is the comparison of bond lengths performed on our
generated protein test data. The raw and clean output files list the node data in pairs.
In the raw file, the first line of each pair is the head node and the second line is
the node it is connected to.
A(1), 5, 4
A(4), 10, 11
A(1), 5, 4
A(5), 13, 15
This means that node 1 is connected to node 4 and node 5.
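As a worked example, the load function in Appendix B takes the length of each edge to be the Euclidean distance between its two nodes. Applying that formula to the two edges listed above, with the symbol d_{i,j} introduced here only for illustration, gives:

```latex
\begin{align*}
d_{1,4} &= \sqrt{(10-5)^2 + (11-4)^2} = \sqrt{74} \approx 8.60 \\
d_{1,5} &= \sqrt{(13-5)^2 + (15-4)^2} = \sqrt{185} \approx 13.60
\end{align*}
```

With a sigma of 0.5, for instance, an edge from another protein would match the first of these only if its length lay between roughly 8.10 and 9.10.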
Figure 11. Protein1_raw.
Figure 12. Protein2_raw.
The match file lists all matching edges, and is shown in Figure 13.
Figure 13. Protein_matchs.
4 CONCLUSIONS AND RECOMMENDED FURTHER RESEARCH
Our current solution finds all matching bond lengths between two given protein
files, but it does not yet actively search for protein keys. Further development would
therefore implement an algorithm that enables this module to identify protein keys.
This would, however, require substantial additional research to produce a method
precise enough to make such identifications reliably.
5 ACKNOWLEDGEMENTS
Our team would like to thank Professor Jay Morris for the concept of this project
and his invaluable assistance throughout the development of this module. We would
also like to thank the Computer Science Department for providing the supercomputer for
us to work with during this course. Special thanks also go to Tihomir Hristov for the
training and initial setup which he provided.
6 APPENDICES
A REFERENCES
[1] Carter, J. S. (2004, November 02). Amino acids and proteins. Retrieved from
http://biology.clc.uc.edu/courses/bio104/protein.htm
[2] CMBI. (2010, February 12). Amino acid. Retrieved from
http://wiki.cmbi.ru.nl/index.php/Amino_acid
[3] Vriend, G., & Gelder, C. V. (n.d.). Intro bioinformatics. Retrieved from
http://swift.cmbi.ru.nl/teach/B1M/
[4] Yahoo Stories. (2001, May 16). Protein key to new smoking, alzheimer's
drugs. Retrieved from http://cmbi.bjmu.edu.cn/news/0105/97.htm
B SOURCE CODE AND DOCUMENTATION
Main.cpp
#include <cctype>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <list>
using namespace std;
#include "point.h"
#include "Edge.h"
#include "function.h"

int main()
{
    list<Edge> protein1;
    list<Edge> protein2;
    int a,b,c,d;
    double sigma;
    char str;

    // Seed rand(), which GenerateProteinFile uses to build the random proteins.
    srand(time(NULL));

    cout<<"Enter First Protein length and max connections. <length connection>: ";
    cin>>a>>b;
    cout<<"\nEnter Second Protein length and max connections <length connection>: ";
    cin>>c>>d;
    cout<<"\nEnter sigma value for protein comparison: ";
    cin>>sigma;
    cout<<endl<<endl;

    // Generate each protein, clean the raw file, and load it into an edge list.
    GenerateProteinFile("protein1_raw.txt",a,b);
    ModifyProteinFile("protein1_raw.txt","protein1_clean.txt");
    LoadProteinFile("protein1_clean.txt",protein1);
    GenerateProteinFile("protein2_raw.txt",c,d);
    ModifyProteinFile("protein2_raw.txt","protein2_clean.txt");
    LoadProteinFile("protein2_clean.txt",protein2);

    // Write every pair of edges whose lengths differ by at most sigma.
    algorithm("protein_matchs.txt",protein1,protein2,sigma);

    cout<< "Display Protein file data? <y/n>: ";
    cin >> str;
    str = toupper(str);
    if(str=='Y'){
        cout << "=========================================" << endl;
        cout << "Protein 1 created" << endl;
        cout << "=========================================" << endl;
        DisplayProtein(protein1);
        cout << "=========================================" << endl;
        cout << "Protein 2 created" << endl;
        cout << "=========================================" << endl;
        DisplayProtein(protein2);
    }
    return 0;
}
Function.h
struct node{
    int label,nlabel;
    int x,nx;
    int y,ny;
};
//////////////////////////////
//functions
//////////////////////////////
void GenerateProteinFile(const char *str,int array_size,int node_connection){
    int a=0; // running x coordinate
    int b=0; // running y coordinate
    int c=0; // number of link attempts for the current node
    int d=0; // index of the node being linked to
    fstream fout(str,ios::out);
    // The protein has array_size+1 nodes, so grow the count before allocating.
    array_size++;
    node *A = new node[array_size];
    //generate array of nodes with strictly increasing coordinates
    for(int z=0; z<array_size; z++){
        // rand()%10 avoids the signed overflow that rand()+1 risks at RAND_MAX
        a=a+1+(rand()%10);
        b=b+1+(rand()%10);
        A[z].label=z;
        A[z].x=a;
        A[z].y=b;
    }
    //print out and link nodes randomly
    for(int i=0; i<array_size; i++){
        c=1+rand()%node_connection;
        d=i;
        for(int j=0;j<c;j++){
            d=d+1+(rand()%10);
            if(d<(array_size-1)){
                //to file
                //head node
                fout<<"A("<<A[i].label<<"), "<<A[i].x<<", "<<A[i].y<<endl;
                //linked node
                fout<<"A("<<A[d].label<<"), "<<A[d].x<<", "<<A[d].y;
                if(i<(array_size-2)){fout<<endl;}
            }
        }
    }
    delete[] A;
    fout.close();
}
////////////////////////////////////////////////////////////////////////////
void ModifyProteinFile(const char *infileName, const char *outfileName){
    ifstream fin;
    ofstream fout;
    char b;
    //USER INPUT
    fin.open(infileName);
    //USER INPUT
    fout.open(outfileName);
    // get() fails at end of file, which ends the loop
    while (fin.get(b)){
        // drop the markup characters A ( ) ,
        if(b!='A' && b!='(' && b!=')' && b!=','){
            if(b==' ')
                fout<<"\n"; // a space separates two values
            else
                fout<<b;
        }
        //else do nothing and ignore char
    }
    fin.close();
    fout.close();
}
/////////////////////////////////////////////////////////////////////
void LoadProteinFile(const char *infileName, list<Edge> &protein){
    //variable declaration
    ifstream fin;
    //USER INPUT
    fin.open(infileName);
    int i1,i2,index=0;
    double x1,y1,x2,y2,distance;
    // Each edge is six values: id, x, y of the head node, then id, x, y of
    // the linked node. The loop stops at the first incomplete record.
    while(fin>>i1>>x1>>y1>>i2>>x2>>y2){
        Point a(i1,x1,y1);
        Point b(i2,x2,y2);
        //calculate distance
        distance = sqrt(pow((x2-x1),2)+pow((y2-y1),2));
        // Edge copies both points, so no heap allocation is needed
        protein.push_back(Edge(index,distance,&a,&b));
        index++;
    }
    fin.close();
}
//////////////////////////////////////////////////////
void DisplayProtein(list<Edge> &protein){
    list<Edge>::iterator pitr;
    // print every edge of the protein to the screen
    for(pitr=protein.begin(); pitr!=protein.end();pitr++){
        pitr->display();
    }
}
////////////////////////////////////////////////////////
void algorithm(const char *outfileName,list<Edge> &protein1,list<Edge> &protein2, double sigma){
    double delta;
    fstream fout(outfileName,ios::out);
    list<Edge>::iterator protein1_itr;
    list<Edge>::iterator protein2_itr;
    // compare every edge of protein 1 with every edge of protein 2
    for(protein1_itr=protein1.begin(); protein1_itr!=protein1.end();protein1_itr++) {
        for(protein2_itr=protein2.begin(); protein2_itr!=protein2.end();protein2_itr++) {
            delta=fabs(protein1_itr->getDistance()-protein2_itr->getDistance());
            // two edges match when their lengths differ by at most sigma
            if(sigma>=delta){
                fout<<"These Edges match: "<<endl;
                fout<<"Protein 1 : ";
                protein1_itr->fdisplay(fout);
                fout<<"to"<<endl;
                fout<<"Protein 2 : ";
                protein2_itr->fdisplay(fout);
                fout<<endl;
            }
        }
    }
    fout.close();
}
///////////////////////////////////////////////////////
Point.h
class Point {
public:
Point (){name=0; x=0; y=0;}
Point(int s, double i, double j){name=s; x=i; y=j;}
double getX(){return x;}
double getY(){return y;}
int getName(){return name;}
void setName(int n){name=n;}
void setX(double i){x=i;}
void setY(double i){y=i;}
void display(){cout<<name<<"("<<x<<", "<<y<<") ";}
void fdisplay(fstream &fout){fout<<name<<"("<<x<<", "<<y<<") ";}
//Operator=
private:
int name;
double x;
double y;
};
Edge.h
class Edge {
public:
    Edge(){index=0; distance=0;}
    // copies both points, so the caller keeps ownership of x and y
    Edge(int i,double dist, Point *x, Point *y){
        index = i;
        distance = dist;
        a.setX(x->getX());
        a.setY(x->getY());
        a.setName(x->getName());
        b.setX(y->getX());
        b.setY(y->getY());
        b.setName(y->getName());
    }
    int getAname(){return a.getName();}
    double getAx(){return a.getX();}
    double getAy(){return a.getY();}
    int getBname(){return b.getName();}
    double getBx(){return b.getX();}
    double getBy(){return b.getY();}
    double getDistance(){return distance;}
    void display(){cout<<"Edge# "<<index<<" ";
        a.display();
        b.display();
        cout<<" has a Distance of: "<<distance<<" "<<endl;}
    void fdisplay(fstream &fout){fout<<"Edge# "<<index<<" ";
        a.fdisplay(fout);
        b.fdisplay(fout);
        fout<<" has a Distance of: "<<distance<<" "<<endl;}
    virtual ~Edge(){}
private:
    int index;
    double distance;
    Point a;
    Point b;
};
C RUNTIME INSTRUCTIONS
The user will be prompted for two values for protein1 (its length and maximum
number of connections), two more values for protein2, and a sigma value representing
the allowed deviation. After running, the user will be asked whether they would like the
protein data printed to the screen. In either case, the matching edges may be found in
the text file protein_matchs.txt.