
2016

Take Home Exam 3

OPTIMIZATION
NAFEES HAIDER

PROFESSOR IZIDOR GERTNER | CSC 34200


11/8/16
Table of Contents
Objective
Section 1: Microsoft Visual Studio Environment
    Clearing the Array using Index
    Procedure for generating the .asm file in Visual Studio.
    Clear Array Using Pointers
    Comparison and Result Part
Section 2: Clearing of Array in Linux
    Clear using Index
    Clear Using Pointers
    Running Time Analysis
Section 3: Dot Product Table
    Comparison (Optimized and Unoptimized Version)
    Running Time Analysis
Section 4: Dot Product Intrinsic Challenging Part
    Running Time Analysis
Conclusion

Objective
The main goal of this assignment is to optimize the assembly code generated by the compiler. Optimizing here means hand-editing the generated assembly into a custom build that performs better. We then measure the execution time of a function in Microsoft Visual Studio (32-bit) and in Linux (64-bit) and compare the running times of the optimized and unoptimized code. We repeat this process for array sizes ranging from 10 all the way to 10000000 (10, 100, 1000, 10000, 100000, 1000000, 10000000). These time measurements confirm whether the optimized code actually delivers the promised improvement.

Section 1: Microsoft Visual Studio Environment


We start by creating a project in MS Visual Studio. The project can be organized in two ways: we can create a main file, a function file, and a header file, or we can create just two source files (a main file and a function file) and declare the prototype of the function in the main file. Out of curiosity, we use both ways in this assignment.

Clearing the Array using Index


As explained above, we create a main file, a header file, and a function file. The main file defines the parameters and calls the function, the header file declares the prototype of the function, and the function file defines the function itself. The files are named Array.cpp (main file), function.cpp (function file), and Header.h (header file), as shown in figure 1.

Figure 1 Clear using Index
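
Since the figure itself is not reproduced here, the following is a minimal sketch of what a clear-using-index function of this kind typically looks like. The function name clear_using_index and its parameter names are assumptions chosen for illustration; they are not taken from the original function.cpp.

```c
/* Hypothetical stand-in for the function in function.cpp.
   Sets every element of array[0..size-1] to zero by indexing. */
void clear_using_index(int array[], int size)
{
    for (int i = 0; i < size; i++) {
        array[i] = 0;   /* element address = array + i * sizeof(int) */
    }
}
```

A main file would declare an array of 10 integers, call clear_using_index(array, 10), and print the elements to confirm that they are all zero.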
Once we have the entire setup, we can compile and run it to see if it works. This project compiled successfully. Now we look into its disassembly to see what is going on.

Figure 2 The program starts by creating the stack frame, reserving space in memory

Figure 3 The size of the array is moved into register eax so that it can be used for further calculations

Figure 4 The value is pushed into memory, which is unnecessary, as will be shown later when we optimize the code

After this, the function is called, which clears the values stored in the array. The next step is to generate the .asm file (assembly code file) of the function file, following the procedure described next.

Procedure for generating the .asm file in Visual Studio.


Click Project in the top menu bar and, at the very bottom of the drop-down menu, click <Project Name> Properties. A new window pops up. On the left side, expand the C/C++ option and select Output Files, which shows the Assembler Output setting. Click it and choose “Assembly-Only Listing (/FA)”.

Figure 5 Procedure for generating the .asm file

After that, click Apply and close the window. Then compile the program. Make sure that the function.cpp file is open so that the function.asm file is generated for it. Once compilation finishes, the function.asm file is available, and we have to import it into the current project. To do this, right-click Source Files in the Solution Explorer on the right side, click Add, then click Existing Item. A window pops up; add the function.asm file there. Open the file to see the assembly code. The assembly code for clearing the array using the index is shown in figure 6.


Figure 6 UnOptimized Codes for Clear Using Index

The code above is the unoptimized version of the function that clears the array using an index. The part of interest is from line 42 to line 60. Here the immediate value zero is stored in the memory location, addressed relative to the base pointer, that holds the variable _i$. Execution then jumps to the label $LN4@ClearUsing:. There, the zero that was just stored in memory is loaded back into a register and compared with the value stored in the size variable. If they are not equal, execution continues with the next statement, where line 54 is redundant because it does the same thing as line 50. After that, the address of the array is stored in register ecx and the address of the next memory location is calculated. The idea is simple: once the code reaches the next memory location, it replaces the content there with zero, which clears it. These steps repeat 10 times, until the values in register eax and variable _size$[ebp] are the same. Once they are the same, execution goes to the label $LN1@ClearUsing: and the function finally exits.
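
The calculation of the next memory location can be made concrete with a small sketch. It assumes 4-byte-aligned int elements as in the listing, and the helper name element_address is hypothetical; it only illustrates that the address of element i is the base address plus i times the element size.

```c
#include <stdint.h>

/* Byte address of array[i], computed the way the loop body does it:
   base address + index * element size (4 bytes for a 32-bit int). */
uintptr_t element_address(const int *array, int i)
{
    return (uintptr_t)array + (uintptr_t)i * sizeof(int);
}
```

For any valid index i, element_address(array, i) equals the address that &array[i] yields, which is the location the unoptimized code overwrites with zero on each iteration.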

This is the part we will actually optimize. We can edit this file directly in Visual Studio and save the optimized version in a different location under a different name. The optimized code is given in figure 7.


Figure 7 Optimized Codes for Clear Using Index

In the unoptimized version we can see intuitively that many things can be eliminated, such as redundant lines and the transfer of a variable from memory to a register. Now let us look at the optimized code. We assign the immediate value zero directly to register eax. Then we load the value of the variable size into register ebx and the address of the array into register ecx. After that we jump to the label $LN4@ClearUsing:, where we compare the contents of registers eax and ebx. If they are not the same, execution moves to the next line, which replaces the content of the current array location with zero. This repeats for every array location. Once the values of registers eax and ebx are equal, execution jumps to the label $LN1@ClearUsing:, which is the exit of the function. Notice how few steps remain and how much redundancy has been avoided. This saves a lot of time.
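
The difference between the two listings can be imitated in C. This is only an analogy under stated assumptions, not the author's code: declaring the counter volatile forces it to be read from and written to memory on every use, much like the _i$ slot in the unoptimized listing, while plain locals give the compiler the freedom to keep the counter, the size, and the base address in registers, as the hand-optimized listing does with eax, ebx, and ecx. Both function names are hypothetical.

```c
/* Analogy for the unoptimized listing: 'volatile' forces the counter to be
   re-read from memory and written back to memory on every use. */
void clear_counter_in_memory(int array[], int size)
{
    volatile int i;
    for (i = 0; i < size; i++) {
        array[i] = 0;
    }
}

/* Analogy for the hand-optimized listing: counter, limit, and base address
   are plain locals, so the compiler may keep all three in registers. */
void clear_counter_in_registers(int array[], int size)
{
    int *base = array;
    int limit = size;
    for (int i = 0; i < limit; i++) {
        base[i] = 0;
    }
}
```

Both functions produce the same result; only the amount of memory traffic per iteration differs, which is the effect the optimization targets.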

Next, create a new project that has no function file, because the optimized function.asm file will provide the function. Import the optimized .asm file into this new project. Note that this file is not assembled and linked automatically; we have to set that up with some additional steps. Right-click the function.asm file and open its properties. Under General, choose Custom Build Tool for Item Type. Click Apply; the window disappears and reappears within a second. This time there is another section under Configuration Properties. Under it, click General, which shows empty Command Line and Outputs fields. We have to fill in both in order to link the file and make it work. For Command Line we type ml -c "-Fl$(IntDir)%(FileName).lst" "-Fo$(IntDir)%(FileName).obj" "%(FullPath)" and for Outputs we type $(IntDir)%(FileName).obj;%(Outputs), as shown in figure 8.

Figure 8 Link the .asm file for execution

Click Apply and then click OK. Make sure that the project contains no function file, so that the function is executed entirely from the .asm file. Test the result. Once the test confirms that there are no errors, we can start the next phase, which is the running time analysis.

Clear Array Using Pointers


In this part, we perform the same operation as above, but this time through pointers. Pointers are often faster than indexes here. In C/C++ an array passed to a function is passed as the address of its first element, so neither version copies the array. The difference is that the index version recomputes the element address from the base address and the index on every iteration, while the pointer version holds the address itself and simply advances it. When we write zero through the pointer, the content at the address it points to is replaced with zero, so the redundant address calculation is avoided.

We start by writing the code for clearing the array using pointers. The procedure is the same: we create a new project and use the code shown in figure 9.

Figure 9 Clear using Pointers

As we can see, we declare the size, which is going to change later in the running time analysis. The array is declared globally. We could have declared it inside main, but because everything is processed through pointers, declaring it globally makes no difference at this point. We also declare the prototype before main and define the void function below the main function. To confirm that the array really is cleared, I added a for loop that displays all the values of the array after the clear function has run. The result should display all zeros.

So what happens in this file is that we declare the size in main and then call the clear using pointers function. It takes the address of the globally defined array and runs a loop of 10 iterations. On each iteration it replaces the content at the current address with zero and advances the pointer by one element. After 10 iterations the entire array has been cleared.
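
As a minimal sketch of the loop just described (the figure is not reproduced here, so the function name clear_using_pointers and the parameter names are assumptions), the pointer version might look like this:

```c
/* Hypothetical stand-in for the clear using pointers function.
   Walks a pointer from the first element to one past the last element
   and writes zero through it at every step. */
void clear_using_pointers(int *array, int size)
{
    for (int *p = array; p < array + size; p++) {
        *p = 0;   /* no index arithmetic: the pointer already is the address */
    }
}
```

Compared with the index version, the loop variable is the address itself, so each iteration only needs one addition (advance by one element) and one store.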

Let us gain a deeper understanding of it using the disassembly of this code.

Figure 10 We start by reserving space in memory, as shown in the memory window; the next step is to store 10 in the variable named size

Figure 11 Storing the value 10 in the memory location in little-endian fashion

Figure 12 Jump into the clear using pointers function. Here we set eax to four, which later helps us move to the next memory location

Figure 13 Moving the content of the variable size to register eax and the address of the array to register ecx, then loading the memory location of the first element of the array, which is accessed through register edx

Figure 14 Moving the address of the first element of the array to the pointer and from the pointer to register eax, then moving the immediate value zero to the content of the address stored in register eax

Figure 15 Repeating these steps over and over to make all the content equal to zero, which erases all the content stored in these memory locations

Figure 16 Exiting the function

Now we have to optimize the code and make the running time as short as possible. We first generate its .asm file. Figure 17 below shows the .asm code for the unoptimized clear using pointers function.

Figure 17 Unoptimized asm codes for Clear Using Pointer function

First of all, before optimizing, we have to comment out two lines; otherwise the program gives us compilation errors. These two commented lines are shown in figure 18.


Figure 18 Commands to be commented out

We start by initializing the stack frame. Then we store the immediate value 4. Register ecx is going to hold the address of the array at offset zero; temporarily it holds only the offset, which is zero. The next step is to load the address of the array into register ecx and jump to the label $LN4@ClearUsing:, where we move the content of the variable size into register eax and the address of the array into register ecx. We load the memory location using the address calculation described above and compare the address in pointer p with the address stored in register edx. If they are not equal, execution moves on to the next statement. There, the address stored in the pointer, which is the address of the current array element, is moved to register eax, and the content at that address is replaced with zero. Execution then goes back to the label $LN2@ClearUsing: and repeats these steps until all the contents of the array have been replaced with zero. After that, the function goes to the label $LN1@ClearUsing:, which is the exit of the function.

The code in figure 19 below shows the optimized version of the clear using pointers function.
Figure 19 Optimized function: Clear using Pointers

Comparison and Result Part


In this part, we compare each method of clearing the array using arrays of different sizes. The size n is going to be 10, 100, 1000, 10000, 100000, 1000000, and 10000000.

As the size increases, the time increases, which is expected. However, we can see the difference between the running time of the optimized code and the running time of the unoptimized code for the same function. The unoptimized code takes longer than the optimized code, and as we increase the size of the array, the running time of the optimized code approaches half of the running time of the unoptimized code.

The measurements shown in the tables below were taken as follows. Each method was run five times at the same size, and the five times were averaged to make the result more reliable. This step was repeated for all the array sizes mentioned above. The same method was also used for clear using pointers and its optimized version. Based on intuition, we can say that clearing an array using a pointer should be faster than clearing an array using an index. Likewise, the optimized version of clearing the array using a pointer should be faster than the optimized version of clearing the array using an index, since its original code is faster too.
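
The averaging step can be sketched as follows. The helper name average_us is hypothetical; the sample values in the usage note are the five Clear Using Index times from Table 1.

```c
#include <stddef.h>

/* Arithmetic mean of 'runs' timing samples given in microseconds.
   Returns 0.0 for an empty sample set to avoid dividing by zero. */
double average_us(const double times_us[], size_t runs)
{
    double sum = 0.0;
    for (size_t k = 0; k < runs; k++) {
        sum += times_us[k];
    }
    return runs ? sum / (double)runs : 0.0;
}
```

Called with the samples 125.16, 124.87, 135.42, 124.30, and 134.28, the function returns about 128.806, which matches the reported average of 128.80.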

Table 1 For Size = 10

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         125.16              79.82               111.47              90.37
2         124.87              92.65               136.84              81.82
3         135.42              89.82               115.18              75.55
4         124.30              82.68               120.31              78.68
5         134.28              92.37               119.74              83.53
Average   128.80              87.47               120.71              81.99

Table 2 For Size = 100

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         127.16              119.74              128.01              91.23
2         183.89              138.56              134.56              92.37
3         191.58              94.08               141.41              132.57
4         143.59              118.03              141.70              90.38
5         141.69              99.50               145.12              116.61
Average   157.58              113.98              138.16              104.63

Table 3 For Size = 1000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         153.67              90.38               162.22              91.80
2         159.37              129.15              139.42              93.51
3         147.68              99.22               144.55              114.90
4         178.75              144.83              205.84              93.23
5         196.72              98.93               143.12              88.95
Average   167.24              112.50              159.03              96.49

Table 4 For Size = 10000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         165.07              143.12              155.09              113.47
2         151.10              101.21              150.37              106.06
3         187.11              107.76              164.21              100.07
4         194.83              118.03              177.61              106.63
5         189.30              127.72              160.79              107.20
Average   177.48              119.56              161.61              106.69

Table 5 For Size = 100000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         373.20              264.86              442.76              212.97
2         463.29              296.50              423.37              208.41
3         477.76              229.79              403.42              199.86
4         492.30              224.66              452.81              197.86
5         523.16              341.55              451.60              290.23
Average   465.94              271.47              434.79              221.87

Table 6 For Size = 1000000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         2733.56             1605.13             2735.84             1186.59
2         2691.37             1239.34             2781.18             1567.49
3         2709.33             2511.18             2816.53             2416.53
4         2729.86             1284.10             2803.42             1240.76
5         2872.69             1246.47             2889.80             1317.74
Average   2747.36             1577.24             2805.35             1545.82

Table 7 For Size = 10000000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         27633.61            14200.126           26352.07            12553.66
2         27842.87            16541.67            28963.90            15752.51
3         29685.21            14793.14            29621.35            12394.85
4         29681.10            11753.94            28553.35            13122.44
5         30755.77            12678.25            28202.39            12380.03
Average   29119.71            13993.43            28338.61            13240.70


Graph 1 Microsoft Visual Studio Running Time Analysis Summary
(Chart "Running Time Summary": time in μs versus size of n, from 0 to 10000000, for Clear Using Index, Optimized Clear Using Index, Clear Using Pointers, and Optimized Clear Using Pointers.)

As we can see, the running time grows linearly with the size of the array. However, there is a significant difference between the running times of the unoptimized and optimized code. The result in the graph is expected, as mentioned above. The unoptimized code is a lot slower, and even there clear using pointers tends to be faster than clear using index, while the optimized versions are faster than the unoptimized ones.
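
To put a number on the difference, the reduction in running time can be computed from the averages in the tables. The helper name percent_reduction is hypothetical; the values in the usage note are the Table 7 averages for Clear Using Index (29119.71 μs) and Optimized Clear Using Index (13993.43 μs).

```c
/* Percentage by which the optimized running time is lower than the
   unoptimized running time. Both arguments are in microseconds and
   unoptimized_us must be nonzero. */
double percent_reduction(double unoptimized_us, double optimized_us)
{
    return 100.0 * (unoptimized_us - optimized_us) / unoptimized_us;
}
```

For the Table 7 averages, percent_reduction(29119.71, 13993.43) evaluates to roughly 51.9, so at the largest array size the optimized index version takes a little less than half the time of the unoptimized one.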

Section 2: Clearing of Array in Linux


In the Linux environment, we can measure the performance of array clearing using both index and pointers with the gcc compiler. Since we are optimizing the code, we use some special commands in the terminal to generate the assembly file, which we then modify into the optimized version and link with the main file to perform the operation. The following commands generate the assembly file and then compile and link it.

1. gcc -O0 -S function.c

2. gcc main.c function.s


The first command generates the assembly file function.s from the function.c file. This file can be opened in gtext, where we edit it into the optimized version. Save the file and use the second command from the list above to link this assembly file with the main file. Notice that this time the main.c file does not contain the function, just as in Visual Studio, but the program still works because the assembly file provides the function that clears the array.

Clear using Index


The C code for clearing the array using index or pointers is the same as the code we used in Microsoft Visual Studio. The generated assembly code of the unoptimized clear using index function is given below.

Figure 20 Unoptimized Version of Clear using Index


Figure 21 Optimized Version of Clear using Index

Here we compare the unoptimized and optimized code. In the unoptimized version, the local variable -4(%rbp) holds the counter that is incremented to move on to the next memory location. There are many memory accesses that need to be eliminated.

We tackled this problem by noticing that the unoptimized code makes many transfers between registers and memory. We optimized it by reducing the number of transfers and keeping the calculation within registers as much as possible. So here we can see that the local variables -28(%rbp) and -24(%rbp) are assigned to registers. The first one (-28(%rbp)) holds the size of the array and the second one (-24(%rbp)) holds the location of the first element of the array.

Clear Using Pointers


Again, clear using pointers uses the same C/C++ code as in Microsoft Visual Studio. Both the unoptimized and optimized code for clear using pointers are given below.

Figure 22 Unoptimized Version of Clear using Pointers

In the code above, we can see all the unnecessary memory accesses. Transfers between registers and memory increase the time a program takes to finish executing, so we need to keep them to a minimum.


Figure 23 Optimized Version of Clear using Pointers

We reduced the number of memory accesses and kept most of the calculations between the stack pointer, the base pointer, register rax, and register rdx. This is similar to clear using index, except that this time we use addresses instead of an index variable, which further reduces the run time.

Running Time Analysis


Just like in Visual Studio, we perform the running time analysis for both the clear using index and clear using pointers functions, as well as their optimized versions. The times shown in the following tables are in microseconds. For accuracy, we record the running time of each version five times and then take the average. The analysis shows the expected results. Again, based on intuition we can say that clear using index is slower than the others. Clear using pointers is faster because it avoids recomputing the element address from the base address and the offset on every iteration.
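
The report does not show how the times were captured, so the following is only a sketch of one possible way to time a single call in microseconds. It assumes the C11 function timespec_get from <time.h>; the names time_clear_us and clear_by_pointer are hypothetical and stand in for whichever clear function is being measured.

```c
#include <time.h>

/* Hypothetical function under test: clears the array through a pointer. */
void clear_by_pointer(int *array, int size)
{
    for (int *p = array; p < array + size; p++) {
        *p = 0;
    }
}

/* Returns the wall-clock time, in microseconds, taken by one call to
   clear(array, size). Assumes timespec_get is available (C11). */
double time_clear_us(void (*clear)(int *, int), int *array, int size)
{
    struct timespec start, end;
    timespec_get(&start, TIME_UTC);
    clear(array, size);
    timespec_get(&end, TIME_UTC);
    return (double)(end.tv_sec - start.tv_sec) * 1e6
         + (double)(end.tv_nsec - start.tv_nsec) / 1e3;
}
```

Calling time_clear_us five times for the same size and averaging the results reproduces the measurement pattern used in the tables. For very small arrays the elapsed time is close to the timer resolution, which would be consistent with the values of 0 and 1 μs recorded for sizes 10 and 100.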
Table 1 For Size = 10

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         1                   1                   0                   1
2         1                   0                   1                   0
3         1                   1                   1                   1
4         1                   1                   1                   0
5         0                   0                   0                   0
Average   0.8                 0.6                 0.6                 0.4

Table 2 For Size = 100

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         1                   0                   1                   1
2         1                   1                   1                   0
3         1                   1                   1                   0
4         1                   1                   1                   1
5         1                   0                   0                   0
Average   1                   0.6                 0.8                 0.4


Table 3 For Size = 1000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         3                   1                   2                   0
2         2                   1                   2                   1
3         2                   1                   2                   1
4         2                   1                   1                   1
5         2                   1                   2                   1
Average   2.2                 1                   1.8                 0.8

Table 4 For Size = 10000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         51                  14                  28                  10
2         23                  13                  26                  9
3         23                  13                  26                  9
4         24                  14                  27                  11
5         24                  13                  26                  10
Average   29                  13.4                26.6                9.8


Table 5 For Size = 100000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         265                 135                 237                 98
2         263                 133                 236                 100
3         262                 131                 263                 97
4         264                 133                 261                 110
5         299                 132                 236                 97
Average   270.6               132.8               246.6               100.4

Table 6 For Size = 1000000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         3033                1423                2448                1053
2         2757                1381                2443                994
3         2702                1373                2454                1030
4         2710                1372                2403                1024
5         2779                1372                2456                1024
Average   2796.2              1384.2              2440.8              1025


Table 7 For Size = 10000000

Run       Clear Using         Optimized Clear     Clear Using         Optimized Clear
          Index Time (μs)     Using Index         Pointer Time (μs)   Using Pointer
                              Time (μs)                               Time (μs)
1         27035               9136                20227               5645
2         26952               9042                19409               9448
3         26945               9002                18883               5759
4         39665               8956                25086               5549
5         27058               9011                20258               5654
Average   29531               9029.4              20772.6             6411

Graph 2 Linux Running Time Analysis Summary
(Chart: time in μs versus size of N, from 0 to 10000000, for Clear Using Index, Optimized Clear Using Index, Clear Using Pointers, and Optimized Clear Using Pointer.)


The results shown above match what theory predicts. Note that all the time measurements are in microseconds, which is why the numbers appear huge even though the actual times are small.

Section 3: Dot Product Table


Now that we have seen the difference in timing that optimization makes, we can use it in different applications to make calculations faster. There are various computations where it can be applied. The one we are going to optimize is the dot product. The dot product plays a vital role in many calculations, for example in calculating the magnitude of a vector. The code shown in figures 24, 25, and 26 is the main file, the header file, and the function file for the dot product, respectively.

Figure 24 Main file of dot product

In the code above we include all the libraries, followed by the header file “Header.h”. After that we declare two global arrays. Both of them are of size 10 at this point; later the size will change for the running time analysis. After that, we define main, where the custom size of the arrays is 10 for now. On the next line I call the dot product function and display the output at the same time to verify the result. It is a good habit to verify the result in order to make sure that the code is right.

Figure 25 Header File of dot product

This is the header file in which we declared the prototype of dot product.

Figure 26 Function file of dot product

This is the function file that performs the dot product. The function takes the two arrays and their size as arguments. It declares the variable “sum”, in which the result is stored. Then it runs a for loop in which each element of array1 is multiplied by the corresponding element of array2, and the product is added to the previous result and stored in sum.
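
Because the figure is not reproduced here, the following is a minimal sketch of such a dot product function. The name dot_product, the parameter names, and the float element type are assumptions; the original may use a different element type.

```c
/* Hypothetical stand-in for the dot product function file.
   Multiplies the arrays element by element and accumulates the products. */
float dot_product(const float array1[], const float array2[], int size)
{
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        sum += array1[i] * array2[i];
    }
    return sum;
}
```

For example, the arrays {1, 2, 3} and {4, 5, 6} give 1*4 + 2*5 + 3*6 = 32.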


Comparison (Optimized and Unoptimized Version)
The code below shows the comparison between the optimized and unoptimized versions of the dot product code. Later, in the running time analysis part, we are going to see the difference between the time each one takes to finish.


Running Time Analysis
The time analysis of the dot product above was done in Microsoft Visual Studio. The following tables follow the same pattern we used for clearing the arrays above. We take the average of five time measurements, and this step is repeated for all sizes of n. The data and the graph are given below.

Table 1 For Size = 10

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         189.88                120.60
2         176.48                95.22
3         131.43                95.79
4         139.99                97.79
5         129.44                102.92
Average   153.44                102.46

Table 2 For Size = 100

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         126.30                100.36
2         188.45                107.48
3         137.42                128.01
4         165.07                92.37
5         175.23                95.22
Average   158.49                104.69

Table 3 For Size = 1000

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         208.98                142.27
2         137.70                128.01
3         235.50                166.79
4         258.30                96.94
5         182.18                137.42
Average   200.53                134.29

Table 4 For Size = 10000

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         223.81                127.44
2         177.05                167.07
3         223.52                123.16
4         209.27                148.87
5         169.64                124.88
Average   209.66                138.28

Table 5 For Size = 100000

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         734.71                435.92
2         520.60                424.80
3         817.11                324.73
4         560.51                310.48
5         509.19                351.53
Average   628.42                369.49

Table 6 For Size = 1000000

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         4453.88               3125.58
2         4463.00               2388.31
3         4022.80               2222.95
4         4023.94               2995.29
5         3777.04               2480.11
Average   4148.13               2642.45

Table 7 For Size = 10000000

Run       Dot Product           Optimized Dot Product
          Time (μs)             Time (μs)
1         45006.36              24689.64
2         42639.44              24944.80
3         45186.55              24165.90
4         40640.30              25527.27
5         40752.63              24554.78
Average   42845.06              24776.48

Graph 3 Microsoft Visual Studio Running Time Analysis Summary
(Chart "Running Time Analysis of Dot Product": running time in μs versus size of n, from 0 to 10000000, for Dot Product and Optimized Dot Product.)


Section 4: Dot Product Intrinsic Challenging Part
This part uses intrinsic functions instead of regular C/C++ code. The beauty of intrinsic code is that it can perform calculations much faster. In our case we are going to see that intrinsic functions may be faster in themselves, but due to certain limitations the program as a whole might get slower. How this affects our time is discussed later on.

Figure 27 Intrinsic Dot Product

The code in figure 27 above shows how the intrinsic function works. We start by declaring a single global size variable, so that in the future we do not have to change the value of size in several places, which could take a lot of time. After that I declared two full-size float arrays. We also create two additional arrays that hold only eight floats each, because a 256-bit variable holds eight 32-bit floats. The intrinsic function _mm256_dp_ps calculates the dot product separately within each 128-bit half, which means that each partial dot product covers only four elements of the array. So we first need a for loop that copies 8 floats into each of the arrays of size eight. After that we load these floats into two 256-bit variables, which we can think of as arrays at this point. Next we create another 256-bit intrinsic variable named result. In result we store the dot product of intrinsic array one and intrinsic array two, using the intrinsic called _mm256_dp_ps (dp for dot product and ps for packed single-precision floats). Now we create a float pointer through which we read the content of result. The last step is to calculate the actual value, which we do with a float variable named value. It stores the sum of the previous values and the sum of the two 128-bit dot products. Finally, to verify the answer, we display what is stored in the variable value.
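
The behaviour of the intrinsic can be illustrated without AVX hardware. The sketch below is a plain C scalar model of _mm256_dp_ps when called with the mask 0xFF, not the intrinsic itself, and it is not the code from figure 27; the function names dp_ps_256_model and dot8_from_lanes are hypothetical. With that mask, each 128-bit half (four floats) gets its own dot product, and the sum is written to every element of that half.

```c
/* Scalar model of _mm256_dp_ps(a, b, 0xFF): for each 128-bit lane
   (elements 0..3 and 4..7), sum the four products and store that sum
   in all four result elements of the same lane. */
void dp_ps_256_model(const float a[8], const float b[8], float result[8])
{
    for (int lane = 0; lane < 2; lane++) {
        float lane_sum = 0.0f;
        for (int k = 0; k < 4; k++) {
            lane_sum += a[4 * lane + k] * b[4 * lane + k];
        }
        for (int k = 0; k < 4; k++) {
            result[4 * lane + k] = lane_sum;
        }
    }
}

/* The dot product of all eight elements is the sum of the two lane results. */
float dot8_from_lanes(const float result[8])
{
    return result[0] + result[4];
}
```

This is why the variable value has to add two partial results per block of eight floats: the instruction never combines the lower and upper halves on its own. The real intrinsic may add the four products in a different order, so floating-point rounding can differ slightly from this model.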

Running Time Analysis


The running time analysis follows exactly the same pattern: we take the average of five recorded running times.

Dot Product Intrinsic Time (μs) for each array size

Run       10          100         1000        10000       100000      1000000     10000000
1         11.11       11.97       35.64       148.25      1267.56     12063.51    125351.85
2         10.55       11.97       41.62       173.63      1432.64     12204.35    127867.02
3         10.26       13.97       37.63       140.27      1226.51     14257.09    125586.49
4         10.54       16.82       32.22       142.55      1224.51     12322.10    125348.72
5         11.12       12.26       43.62       139.41      1467.13     12275.06    130307.21
Average   10.72       13.40       38.15       148.82      1323.67     12624.42    126892.26

Graph 4 Microsoft Visual Studio Running Time Analysis Summary
(Chart "Dot Product Intrinsic Run Time Analysis": time in μs versus size, from 0 to 10000000.)

Conclusion
This assignment was by far the most interesting assignment of all. I have finally learned how to reduce the running time of the same algorithm by working directly on its assembly code. The more we optimize the assembly and reduce the memory accesses, the faster the code runs. The running time analysis suggests that as the size of the array increases, the optimized code takes roughly 50% less time than the unoptimized code. Another thing I learned is that C/C++ has something I had never heard of before: intrinsic functions. For calculations, intrinsic functions can be much faster than regular functions because they are designed for that purpose, although, as noted in Section 4, certain limitations can make the program as a whole slower. Most of their functionality is based on performing mathematical calculations. Most of all, I also got the chance to write and test code with an intrinsic function to see its actual performance.
