
Algorithm for Analysis of Amino Acid Bond Lengths of Proteins

Final Project Report

CS 487 Introduction to Cluster Computing

Old Dominion University


Tim Dugan, Computer Engineering Department

Gordon Bland, Computer Science Department

Tyler Wood, Computer Science Department

Adrian Ostolski, Computer Science Department

Research Advisor: Professor Jay Morris

ABSTRACT

Bioinformatics is a growing field of study that supplies a steady stream of computing problems.
Although the definition of the term is debated, the generally accepted idea is that bioinformatics
uses computers to answer biological questions. One such problem is to develop an algorithm that
compares amino acid bond lengths between proteins in order to search for protein keys. A protein
key is a protein that sends signals to other cells by means of a chemical reaction at the site where
binding occurs. Our team chose to develop an algorithm for the analysis of amino acid bond lengths
of proteins because this analysis can assist in identifying protein keys.

Table of Contents

1 Problem Description
1.1 Introduction
1.2 Background
1.3 Need for Solution
2 Solution Design
2.1 Module Design
2.2 Classes
2.3 Functions
3 Results
3.1 Generated Protein Files
3.2 Output
4 Conclusions and Recommended Further Research
5 Acknowledgements
6 Appendices
A References
B Source Code and Documentation
C Runtime Instructions


List of Figures

Figure 1. Amino Acid Structure

Figure 2. Peptide Bond

Figure 3. Function Call in Main

Figure 4. Modify Function

Figure 5. Load Function

Figure 6. Algorithm Function

Figure 7. Generate Function in Main

Figure 8. Generate Function

Figure 9. Protein1_clean

Figure 10. Protein2_clean

Figure 11. Protein1_raw

Figure 12. Protein2_raw

Figure 13. Protein_matchs

List of Tables

Table 1. Amino Acids by Name


1 PROBLEM DESCRIPTION

Bioinformatics is a growing field of study that supplies a steady stream of computing problems. Although the definition of the term is debated, the generally accepted idea is that bioinformatics uses computers to answer biological questions. One such problem is to develop an algorithm that compares amino acid bond lengths between proteins in order to search for protein keys. A protein key is a protein that sends signals to other cells by means of a chemical reaction at the site where binding occurs.

For example, Dutch scientists found a protein, produced by glia cells in the central nervous system, that transmits messages between the brain cells that control the release of chemicals affecting memory, attention, and addiction. Acetylcholine affects memory and dopamine affects addiction, to name two. Scientists anticipate using this protein key to develop drugs that influence certain neuronal functions while leaving others unaffected. Many more protein keys remain to be identified, however.

1.1 INTRODUCTION

The purpose of this Beowulf class is to demonstrate how cluster computing can be

used to solve large problems which could not otherwise be solved. Our team has

chosen to develop an algorithm for analysis of amino acid bond lengths of proteins

because this analysis will assist in identifying protein keys.


1.2 BACKGROUND

A protein is made up of amino acids, and therefore amino acids lie at the heart of bioinformatics. There are 20 standard amino acids found in the human body. Each amino acid has unique properties and can be represented by a full name, a three-letter code, or a one-letter code, as shown in Table 1.

Name            3-letter  1-letter  Hydropathy   Charge
Alanine         Ala       A         Hydrophobic  Neutral
Cysteine        Cys       C         Hydrophobic  Neutral
Aspartic acid   Asp       D         Hydrophilic  Negative
Glutamic acid   Glu       E         Hydrophilic  Negative
Phenylalanine   Phe       F         Hydrophobic  Neutral
Glycine         Gly       G         Hydrophobic  Neutral
Histidine       His       H         Hydrophilic  Neutral/Positive/Negative
Isoleucine      Ile       I         Hydrophobic  Neutral
Lysine          Lys       K         Hydrophilic  Positive
Leucine         Leu       L         Hydrophobic  Neutral
Methionine      Met       M         Hydrophobic  Neutral
Asparagine      Asn       N         Hydrophilic  Neutral
Proline         Pro       P         Hydrophobic  Neutral
Glutamine       Gln       Q         Hydrophilic  Neutral
Arginine        Arg       R         Hydrophilic  Positive
Serine          Ser       S         Hydrophilic  Neutral
Threonine       Thr       T         Hydrophilic  Neutral
Valine          Val       V         Hydrophobic  Neutral
Tryptophan      Trp       W         Hydrophobic  Neutral
Tyrosine        Tyr       Y         Hydrophobic  Neutral

Table 1. Amino Acids by Name.

All amino acids share a basic structure built from a few atoms of the same types, with a central carbon atom, or C-alpha, at its center. This carbon atom is bonded to a hydrogen atom, an amino group, a carboxylic acid group, and a fourth group known as the variable sidechain. Sidechains are what differentiate one amino acid from another. Amino acids are connected by a peptide bond between the carboxyl group of the first amino acid and the amino group of the second amino acid. Figure 1 shows the general structure of an amino acid, and Figure 2 shows a peptide bond.


Figure 1. Amino Acid Structure.


Figure 2. Peptide Bond.


1.3 NEED FOR SOLUTION

This solution has the potential to improve our ability to overcome diseases and other medical conditions. With each protein key discovered, scientists can make progress toward curing or treating causes of human suffering. The protein key in the example from the Problem Description could be used in drugs that help prevent Alzheimer’s disease or schizophrenia, or that help people quit smoking or overcome other drug addictions. A solution of this kind could save millions of lives.

2 SOLUTION DESIGN

Our team designed this solution at Old Dominion University’s Beowulf Laboratory, where our supercomputer is housed. The solution accepts input data in two ways: the user either gives the program an actual protein file or first generates protein files. If the latter is chosen, the user is prompted for the number of protein nodes and the maximum number of possible connections between nodes. The user also enters the sigma value, that is, the maximum allowed difference between two edge lengths for them to be considered a match. The program then generates random proteins based on these criteria and writes the comparison to a text file.

2.1 MODULE DESIGN


It is important to note that the modules are independent of one another. The intent is that each function can be run as a stand-alone function, so that a program designer can collect user input in whatever way they choose and pass it to the functions.

2.2 CLASSES

There are two classes used in this solution: the point class and the edge class. The point class contains the node id and the ‘x’ and ‘y’ coordinate values of the node. The edge class contains an index, two points, and the distance between them. These two classes can be found in Appendix B.
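As an illustration of the distance an edge stores, the following is a minimal sketch, assuming two-dimensional coordinates as in the point class. The function name edgeLength is hypothetical and does not appear in the project code; Appendix B computes the same quantity with sqrt and pow inside the load function.

```cpp
#include <cmath>

// Hypothetical helper (not part of the project code): Euclidean distance
// between the points (x1, y1) and (x2, y2), i.e. the length of the edge
// that joins them.
double edgeLength(double x1, double y1, double x2, double y2) {
    return std::hypot(x2 - x1, y2 - y1);
}
```

For the points (0, 0) and (3, 4), this gives a length of 5.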

2.3 FUNCTIONS

There are four major functions in our code: the generate, modify, load, and algorithm functions. The generate function is discussed in Section 3.1. Figure 3 shows each of these functions being called in Main.cpp and the parameters each takes.


Figure 3. Function Call in Main.cpp.

The modify function, shown in Figure 4, requires an input file name (s) and an output file name (d). It reads the data from file s and writes it to file d in a format that the load function accepts.
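The cleaning step can be sketched on an in-memory string rather than on files. This is only an illustration under the assumption, taken from the source code in Appendix B, that the characters 'A', '(', ')' and ',' are dropped and each space becomes a newline. The name cleanRawText is hypothetical.

```cpp
#include <string>

// Hypothetical in-memory version of the cleaning rule: drop 'A', '(', ')'
// and ',', turn every space into a newline, and copy everything else.
std::string cleanRawText(const std::string& raw) {
    std::string clean;
    for (char ch : raw) {
        if (ch == 'A' || ch == '(' || ch == ')' || ch == ',') {
            continue;  // formatting character, not data
        }
        clean += (ch == ' ') ? '\n' : ch;
    }
    return clean;
}
```

With this rule, a raw line such as "A(1), 5, 4" becomes three lines holding 1, 5, and 4, which can then be read back as plain numbers.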

Figure 4. Modify Function.



The load function requires an input file name (s) and an edge list (d). It is shown in Figure 5; its purpose is to load the data from file ‘s’ into the edge list ‘d’, which is passed to the function by reference.

Figure 5. Load Function.


The algorithm function, shown in Figure 6, is at the heart of our solution. It requires an output file name (s), two edge lists (d and f), and a delta value (x), which is the sigma value entered by the user. It compares the length of every edge in list ‘d’ with the length of every edge in list ‘f’; if the difference between two lengths is within the delta ‘x’, the program writes the matching edges from ‘d’ and ‘f’ to the file ‘s’.
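The comparison described above can be sketched without any file output. This is a simplified illustration, not the project's implementation: SimpleEdge and matchEdges are hypothetical names, and the sketch returns index pairs instead of writing the matching edges to a file.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical stand-in for the edge class: only the index and length are kept.
struct SimpleEdge {
    int index;        // position of the edge within its protein
    double distance;  // precomputed length of the edge
};

// Compares every edge of the first protein with every edge of the second and
// returns the (first index, second index) pairs whose lengths differ by at
// most sigma.
std::vector<std::pair<int, int>> matchEdges(const std::vector<SimpleEdge>& first,
                                            const std::vector<SimpleEdge>& second,
                                            double sigma) {
    std::vector<std::pair<int, int>> matches;
    for (const SimpleEdge& e1 : first) {
        for (const SimpleEdge& e2 : second) {
            if (std::fabs(e1.distance - e2.distance) <= sigma) {
                matches.emplace_back(e1.index, e2.index);
            }
        }
    }
    return matches;
}
```

Because every edge of one protein is compared with every edge of the other, the number of comparisons is the product of the two edge counts.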

Figure 6. Algorithm Function.

3 RESULTS


Our team has successfully generated two separate protein files (protein1 and protein2) which mimic actual protein files. We then used these protein files to compare edge lengths. Our current solution can perform this comparison on any protein file supplied to it in the same format, with no alterations.

3.1 GENERATED PROTEIN FILES

The generated protein files are created by the generate function. It requires an output file name (s), an amino acid length (x), and a maximum number of bonds (y), and it generates a random protein with amino acid length x. Each amino acid is connected to anywhere from 1 to y nodes, and the function writes the data in a specified format to the file named s.
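The node-generation part of this step can be sketched as follows. This is an illustration under stated assumptions: as in the source code in Appendix B, each coordinate grows by a random step between 1 and 10 from one node to the next, but the sketch uses the <random> header with an explicit seed instead of rand(), and the names RawNode and generateNodes are hypothetical.

```cpp
#include <random>
#include <vector>

// Hypothetical node record: a label and two integer coordinates.
struct RawNode {
    int label;
    int x;
    int y;
};

// Generates `count` nodes whose x and y coordinates each increase by a
// random step in [1, 10] from one node to the next.
std::vector<RawNode> generateNodes(int count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> step(1, 10);
    std::vector<RawNode> nodes;
    int x = 0;
    int y = 0;
    for (int label = 0; label < count; ++label) {
        x += step(rng);
        y += step(rng);
        nodes.push_back({label, x, y});
    }
    return nodes;
}
```

Passing the seed explicitly makes a generated protein reproducible, which may help when comparing runs; the project code seeds rand() from the clock instead.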

Figure 7. Generate Function in Main.

Figure 8. Generate Function.


When the protein files are generated by the generate function, the files list the

node, its coordinates, the connecting node number, and the connecting node’s

coordinates. For example, the beginnings of our two generated protein files, Protein1

and Protein2, are shown in Figures 9 and 10.

Figure 9. Protein1_clean.


Figure 10. Protein2_raw.

3.2 OUTPUT

The output of this project is the comparison of edge lengths performed on our generated protein test data. Both the raw and the clean files list nodes in pairs. In the raw file, the first entry of each pair is the head node and the second entry is the node it is connected to. For example:

A(1) 5, 4
A(4) 10, 11
A(1) 5, 4
A(5) 13, 15

This means that node 1 is connected to node 4 and to node 5.

Figure 11. Protein1_raw.


Figure 12. Protein2_clean.


The matching file lists all matching edges, and is shown in Figure 13.

Figure 13. Protein_matchs.

4 CONCLUSIONS AND RECOMMENDED FURTHER RESEARCH

Our current solution finds all matching edge lengths between two given protein files, but it does not yet actively search for protein keys. Further development would therefore implement an algorithm that enables this module to identify protein keys. This would, however, require considerable additional research in order to arrive at a precise method that makes such identifications reliably.

5 ACKNOWLEDGEMENTS

Our team would like to thank Professor Jay Morris for the concept of this project and his invaluable assistance throughout the development of this module. We would also like to thank the Computer Science Department for providing the supercomputer for us to work with during this course. A special thank you also goes to Tihomir Hristov for the training and initial setup he provided.


6 APPENDICES

A REFERENCES

[1] Carter, J. S. (2004, November 02). Amino acids and proteins. Retrieved from http://biology.clc.uc.edu/courses/bio104/protein.htm
[2] CMBI. (2010, February 12). Amino acid. Retrieved from http://wiki.cmbi.ru.nl/index.php/Amino_acid
[3] Vriend, G., & Gelder, C. V. (n.d.). Intro bioinformatics. Retrieved from http://swift.cmbi.ru.nl/teach/B1M/
[4] Yahoo Stories. (2001, May 16). Protein key to new smoking, Alzheimer's drugs. Retrieved from http://cmbi.bjmu.edu.cn/news/0105/97.htm

B SOURCE CODE AND DOCUMENTATION

Main.cpp

#include <cctype>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <list>

using namespace std;

// The project headers rely on the standard headers and the using-directive above.
#include "point.h"
#include "Edge.h"
#include "function.h"

int main()
{
    list<Edge> protein1;
    list<Edge> protein2;
    int a,b,c,d;    // length and max connections for protein 1 (a, b) and protein 2 (c, d)
    double sigma;   // maximum allowed difference between two matching edge lengths
    char str;

    srand(time(NULL));

    cout<<"Enter First Protein length and max connections. <length connection>: ";
    cin>>a>>b;
    cout<<"\nEnter Second Protein length and max connections <length connection>: ";
    cin>>c>>d;
    cout<<"\nEnter sigma value for protein comparison: ";
    cin>>sigma;
    cout<<endl<<endl;

    // Generate each protein, clean the raw file, and load it into an edge list.
    GenerateProteinFile("protein1_raw.txt",a,b);
    ModifyProteinFile("protein1_raw.txt","protein1_clean.txt");
    LoadProteinFile("protein1_clean.txt",protein1);

    GenerateProteinFile("protein2_raw.txt",c,d);
    ModifyProteinFile("protein2_raw.txt","protein2_clean.txt");
    LoadProteinFile("protein2_clean.txt",protein2);

    // Write every pair of edges whose lengths differ by at most sigma.
    algorithm("protein_matchs.txt",protein1,protein2,sigma);

    cout<< "Display Protein file data? <y/n>: ";
    cin >> str;
    str = toupper(str);

    if(str=='Y'){
        cout << "=========================================" << endl;
        cout << "Protein 1 created" << endl;
        cout << "=========================================" << endl;
        DisplayProtein(protein1);
        cout << "=========================================" << endl;
        cout << "Protein 2 created" << endl;
        cout << "=========================================" << endl;
        DisplayProtein(protein2);
    }

    return 0;
}

Function.h

#include <cstdlib>
#include <vector>

struct node{
    int label,nlabel;
    int x,nx;
    int y,ny;
};

//////////////////////////////
//functions
//////////////////////////////

// Writes a random protein to the file str as pairs of lines: head node, then linked node.
void GenerateProteinFile(const char *str,int array_size,int node_connection){
    int a=0;   // running x coordinate
    int b=0;   // running y coordinate
    int c=0;   // number of link attempts for the current node
    int d=0;   // label of the node being linked to

    // Guard against a modulo by zero below.
    if(node_connection<1){node_connection=1;}

    array_size++;
    // Allocate after the increment so that every index below array_size is valid.
    std::vector<node> A(array_size);

    //USER INPUT
    fstream fout(str,ios::out);

    //generate array of nodes
    for(int z=0; z<array_size; z++){
        a=a+1+(rand()%10);
        b=b+1+(rand()%10);
        A[z].label=z;
        A[z].x=a;
        A[z].y=b;
    }

    //print out and link nodes randomly
    for(int i=0; i<array_size; i++){
        c=1+rand()%node_connection;
        d=i;
        for(int j=0;j<c;j++){
            d=d+1+(rand()%10);
            if(d<(array_size-1)){
                //to file
                //head node
                fout<<"A("<<A[i].label<<"), "<<A[i].x<<", "<<A[i].y<<endl;
                //linked node
                fout<<"A("<<A[d].label<<"), "<<A[d].x<<", "<<A[d].y;
                if(i<(array_size-2)){fout<<endl;}
            }
        }
    }
    fout.close();
}

////////////////////////////////////////////////////////////////////////////

// Copies infileName to outfileName, dropping the characters 'A', '(', ')' and ','
// and turning each space into a newline, so the result holds one number per line.
void ModifyProteinFile(const char *infileName, const char *outfileName){
    //USER INPUT
    ifstream fin(infileName);
    //USER INPUT
    ofstream fout(outfileName);
    char b;

    while (fin.get(b)){
        if(b!='A' && b!='(' && b!=')' && b!=','){
            if(b==' ')
                fout<<"\n";
            else
                fout<<b;
        }
        //else do nothing and ignore char
    }

    fin.close();
    fout.close();
}

/////////////////////////////////////////////////////////////////////

// Reads a clean protein file and appends one Edge per pair of nodes to protein.
void LoadProteinFile(const char *infileName, list<Edge> &protein){
    //USER INPUT
    ifstream fin(infileName);
    int i1,i2,index=0;
    double x1,y1,x2,y2,distance;

    // Each edge is stored as six numbers: label, x, y of the head node,
    // then label, x, y of the linked node. Stop at the first incomplete record.
    while(fin>>i1>>x1>>y1>>i2>>x2>>y2){
        Point a(i1,x1,y1);
        Point b(i2,x2,y2);

        //calculate distance
        distance = sqrt(pow((x2-x1),2)+pow((y2-y1),2));

        // Edge copies both points, so local objects suffice and nothing is leaked.
        protein.push_back(Edge(index,distance,&a,&b));
        index++;
    }
}

//////////////////////////////////////////////////////

// Prints every edge of the protein to the screen.
void DisplayProtein(list<Edge> &protein){
    list<Edge>::iterator pitr;
    for(pitr=protein.begin(); pitr!=protein.end();pitr++){
        pitr->display();
    }
}

////////////////////////////////////////////////////////

// Compares every edge of protein1 with every edge of protein2 and writes each
// pair whose lengths differ by at most sigma to outfileName.
void algorithm(const char *outfileName,list<Edge> &protein1,list<Edge> &protein2, double sigma){
    double delta;
    fstream fout(outfileName,ios::out);
    list<Edge>::iterator protein1_itr;
    list<Edge>::iterator protein2_itr;

    for(protein1_itr=protein1.begin(); protein1_itr!=protein1.end();protein1_itr++) {
        for(protein2_itr=protein2.begin(); protein2_itr!=protein2.end();protein2_itr++) {
            delta=fabs(protein1_itr->getDistance()-protein2_itr->getDistance());
            if(sigma>=delta){
                fout<<"These Edges match: "<<endl;
                fout<<"Protein 1 : ";
                protein1_itr->fdisplay(fout);
                fout<<"to"<<endl;
                fout<<"Protein 2 : ";
                protein2_itr->fdisplay(fout);
                fout<<endl;
            }
        }
    }
    fout.close();
}

///////////////////////////////////////////////////////


Point.h
// A labelled node with two-dimensional coordinates.
class Point {
public:
    Point (){name=0; x=0; y=0;}
    Point(int s, double i, double j){name=s; x=i; y=j;}

    double getX(){return x;}
    double getY(){return y;}
    int getName(){return name;}

    void setName(int n){name=n;}
    void setX(double i){x=i;}
    void setY(double i){y=i;}

    // Print the point as name(x, y) to the screen or to a file stream.
    void display(){cout<<name<<"("<<x<<", "<<y<<") ";}
    void fdisplay(fstream &fout){fout<<name<<"("<<x<<", "<<y<<") ";}

    //Operator=

private:
    int name;
    double x;
    double y;
};

Edge.h
// A connection between two points, together with its precomputed length.
class Edge {
public:
    Edge() : index(0), distance(0) {}

    // Copies the two points, so the caller keeps ownership of x and y.
    Edge(int i,double dist, Point *x, Point *y){
        index = i;
        distance = dist;
        a.setX(x->getX());
        a.setY(x->getY());
        a.setName(x->getName());
        b.setX(y->getX());
        b.setY(y->getY());
        b.setName(y->getName());
    }

    int getAname(){return a.getName();}
    double getAx(){return a.getX();}
    double getAy(){return a.getY();}
    int getBname(){return b.getName();}
    double getBx(){return b.getX();}
    double getBy(){return b.getY();}
    double getDistance(){return distance;}

    void display(){cout<<"Edge# "<<index<<" ";
        a.display();
        b.display();
        cout<<" has a Distance of: "<<distance<<" "<<endl;}

    void fdisplay(fstream &fout){fout<<"Edge# "<<index<<" ";
        a.fdisplay(fout);
        b.fdisplay(fout);
        fout<<" has a Distance of: "<<distance<<" "<<endl;}

    virtual ~Edge(){}

private:
    int index;
    double distance;
    Point a;
    Point b;
};

C RUNTIME INSTRUCTIONS

The user will be prompted for two values for protein1 (length and maximum connections), two more values for protein2, and a sigma value representing the allowed deviation. After running, the user will be asked whether they would like the protein data printed to the screen. Whether or not it is displayed, the generated protein data is saved in the raw and clean text files, and the matching edges are written to protein_matchs.txt.
