OPTIMIZATION
NAFEES HAIDER
This optimization task consists of creating custom-built assembly code that performs better than the compiler's output. We then measure the execution time of a function in Microsoft Visual Studio (32-bit) and on Linux (64-bit). In the time-measurement part, we compare the time difference between optimized and unoptimized code. We then repeat this process over different array sizes, ranging from 10 all the way to 10000000.
We can approach this task in two ways: we can either create a main file, a function file, and a header file, or we can create just two files and declare the function prototype in the main file. Here, the files are Array.cpp for the main file and function.cpp for the function file. The program compiles and runs successfully. Now we look into its disassembly and see what is going on.
Figure 2 The program starts by creating the stack frame, reserving space in memory
Figure 3 The size of the array is moved into register eax so that it can be used for further calculations
Figure 4 The value is pushed into memory, which was unnecessary; we will see why later when we perform the optimization
After this, the function is called, which clears out the values stored in the data segment of the stack frame. The next step is to generate the .asm file (assembly code file) for the function file. To generate it, we follow the procedure described next.
Right-click the project and open <Project Name> Properties; a new window will pop up. On the left side, directly under the C/C++ option in the menu, we see Output Files, which contains the Assembler Output setting. After that, click Apply and close the window, then start compiling the program. Make sure that the function.cpp file is open so that the compiler produces the function.asm file for it. Once the program finishes compiling, we can see the function.asm file. After that, we have to import that file into our current project. We can do this by right-clicking Source Files in the Solution Explorer menu, then clicking Add, then Existing Item. A window will pop up; add the function.asm file there, then click to open it and see the assembly code.
The code above is the unoptimized version of the function that clears the array using an index. The part of interest runs from line 42 to line 60. Here we can see that the immediate value is stored to the memory address at the base pointer, which is the variable _i$. After that, we jump to the label named $LN4@ClearUsing:. Here, we move the value zero that was stored in memory back into a register, which is then compared with the value stored in the size variable. If they are not equal, execution keeps moving to the next statement. When it moves on, line 54 is redundant and does the same thing as line 50. After that, it stores the address of the array into register ecx and performs a calculation to move to the next memory location. The idea is simple: once it moves to the next memory location, it replaces the content there with zero, which in effect clears it out. The steps above repeat 10 times, as long as the values in register eax and the variable _size$[ebp] are not the same. Once they are the same, it jumps to the exit of the function.
This is the part we will actually optimize. We can edit this file directly from Visual Studio and save the optimized version in a different location under a different name. The optimized code is shown next.
In the unoptimized version, we can intuitively see that many things can be eliminated, such as redundant lines and the transfer of a variable between memory and a register. Now let us take a look at our optimized code. We directly assign the immediate value zero to register eax. Then we assign the value of the size variable to register ebx and initialize the address of the array in register ecx. After that, we jump to the label $LN4@ClearUsing:, where we compare the contents of registers eax and ebx. If they are not the same, execution moves to the next line, which replaces the content of the current array location with zero. This process repeats for every array location. After that, the values of registers eax and ebx are equal; at this point, we jump to the label $LN1@ClearUsing:, which is basically the exit of the function. Notice the number of steps, and how much redundancy was avoided: this saves a lot of time.
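The optimized loop just described can be sketched in MASM-style pseudocode (an illustrative sketch only: the label and variable names follow the ones described above, but the exact instructions in the report's figure may differ):

```asm
        mov   eax, 0                      ; index starts at zero
        mov   ebx, DWORD PTR _size$[ebp]  ; size kept in a register
        mov   ecx, OFFSET _array          ; base address of the array
$LN4@ClearUsing:
        cmp   eax, ebx                    ; done when index == size
        je    $LN1@ClearUsing
        mov   DWORD PTR [ecx+eax*4], 0    ; clear the current element
        inc   eax                         ; next index, no memory traffic
        jmp   $LN4@ClearUsing
$LN1@ClearUsing:                          ; exit of the function
```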
Next, create a new project that does not contain the function file, because we are going to use the optimized function.asm file to perform the operation. After the optimization has been performed, the next step is to import the optimized .asm file into the new project we just created. Note that this file is not going to link and compile all by itself; we have to link it by following some additional steps. Right-click the function.asm file and go to its Properties. Under General, choose Custom Build Tool for Item Type. Click Apply; the window will disappear and reappear within a second, and this time there is another section under Configuration Properties. Underneath it, click General; there we see an empty Command Line and Outputs. We have to type two commands in order to link the file and make it work. For Command Line we type ml -c "-
Click Apply and then click OK. Now make sure that there is no function file, so that the entire execution of the function comes from the .asm file. Test the result. Once the test verifies that there are no errors, we can start implementing the next phase: the running-time analysis.
Next, we are going to do it through pointers. Pointers are generally a lot faster than indexes for multiple reasons. First, we are not copying the entire array into the function and then returning it to main again; the pointer acquires the address of the array. This way, when we set the value the pointer refers to equal to zero, the content at the address pointed to by the pointer is replaced with zero. Basically, there is no redundancy, because the extra transfers are avoided.
We start writing the code for clearing the array using pointers. The procedure is the same: we create a new project and use the code shown in figure 9.
As we can see, we declare the size, which is going to change later in the running-time analysis. The global array is declared; although we could have declared it inside main, since everything is processed through pointers, global scope does not matter at this point. We also declare the prototype before main, and the void function is defined below the main function. And just to reassure ourselves that the array is being cleared, I added a for loop to display all the values of the array after the clear function has run. The result should be all zeros.
So what is happening in this file is that we declare the size in the main file and then call the clear-using-pointers function. It takes the address of the globally defined array, and inside the function a loop runs 10 times; on each iteration it adds one to the offset and replaces the content with zero. The loop keeps running until it has performed 10 iterations, hence clearing out the entire array.
Figure 10 We start by reserving space in memory, as shown in the memory window; the next step is to store 10 in the variable named size
Figure 11 Storing the value 10 in the memory location in little-endian fashion
Figure 12 Jumping into the clear-using-pointer function; here we set eax to four, which later helps us jump to the next memory location
Figure 13 Moving the content of the variable size to register eax, as well as the address of the array to register ecx, then loading the memory location of the first element of the array, accessed through register edx
Figure 14 Moving the address of the first element of the array to the pointer, and from the pointer to register eax; after that, moving the immediate value zero to the content of the address stored in register eax
Figure 15 Repeating these steps over and over to set all the contents to zero, basically erasing everything stored in those memory locations
Figure 16 Exiting the function
Now we have to optimize this code and make the running time as fast as possible. First we generate its .asm file; figure 17 below shows the .asm code for the unoptimized function. Before optimizing, we have to comment out two lines, otherwise the program will give an error.
We start by initializing the stack frame. Then we store the immediate value 4 into memory. Register ecx is going to hold the zero-offset address of the array; temporarily, register ecx has nothing inside, so it will be zero. The next step is to load the address of the array into register ecx and jump to the label $LN2@ClearUsing:, where we move the content of the variable size into register eax and move the address of the array into register ecx. We load the memory location using the formula mentioned above and compare the address in pointer p with the address stored in register edx. If they are not equal, we move on to the next statement. There, we move the address stored in the pointer, which is the address of the array, into register eax, and then replace the content at that address with zero. We go back to the label $LN2@ClearUsing: and repeat these steps over and over until we have gone through all the contents of the array and replaced them with zero. After that, the function goes to the label $LN1@ClearUsing:, which is basically the exit of the function.
The code below in figure 19 shows the optimized version of the clear-using-pointers function.
Figure 19 Optimized function: Clear using Pointers
The size n is going to be 10, 100, 1000, 10000, 100000, 1000000, and 10000000.
As the size increases, the time increases, which is expected. However, we can see the difference between the running time of the optimized and unoptimized versions of the same function. The running time of the unoptimized code is larger, and as we increase the size of the array, the running time of the optimized code is nearly half that of the unoptimized code.
The measurements shown in the tables below were done as follows. Each method was run five times at the same size and the results averaged, in order to be precise. This step was repeated for all the array sizes mentioned above. The same method was also used for clear-using-pointers and its optimized version. Based on intuition, we can say that clearing an array using a pointer should be faster than clearing it using an index. Likewise, the optimized version of clearing the array using a pointer should be faster than the optimized version using an index, since its original code is faster too.
[Tables: running times in μs for the unoptimized and optimized clear-using-index code at each size; only one pair of values survives extraction: 128.80 μs (unoptimized) vs 120.71 μs (optimized).]
[Figure: time in μs (0 to 25000) versus size of n (0 to 10000000).]
As we can see, the running-time complexity is linear. However, the graph shows a significant difference between the running times of the unoptimized and optimized code. The result in the graph is expected, as mentioned above: the unoptimized code is a lot slower, and even among the unoptimized versions the clear-using-pointer code is much faster, while the optimized code is faster than the unoptimized in every case.
Next, we repeat the process on Linux, clearing the array using indexes and pointers with the gcc compiler. Since we are working on optimizing the code, we have to use some special commands in the terminal to generate the .asm file, which we will later modify into the optimized version and then link with the main file to perform the operation. The following commands help us generate, link, and compile the assembly file.
The generated assembly file can be read in the gtext software, and we can edit it there to make it optimized. Save the file and use the second command from the table above to link this assembly file with the main file. Notice that this time we removed the function from the main.c file, just like we did in Visual Studio, but it still works because the assembly file contains the function that clears the array.
The procedure is analogous to the one in Microsoft Visual Studio. The generated assembly code of the unoptimized clear-using-index function is compared here with the optimized version. In the unoptimized version of the code we have the variable -4(%rbp), which handles the addition for moving on to the next memory location. There are a lot of memory accesses that need to be eliminated.
The way we tackled this problem is that we noticed many transitions between registers and memory in the unoptimized version of the code. We optimized it by reducing the number of transfers and keeping the calculations within registers as much as possible. Here we can see that the local variables -28(%rbp) and -24(%rbp) are assigned to registers; the first one (-28(%rbp)) holds the size of the array, and the second one (-24(%rbp)) holds the other value used by the loop. The unoptimized and optimized code for clear-using-pointers is given below.
In the code above, we can see all the unnecessary memory accesses. The transfers between registers and memory increase the amount of time it takes the program to finish executing, so we need to eliminate them. We reduced the number of memory accesses and kept most of the calculations among the stack pointer, the base pointer, register rax, and register rdx. This is similar to clear-using-index, except this time we are using addresses instead of an actual variable, which further reduces the run time.
The tables below show the running-time analysis for the clear-using-index and clear-using-pointers functions, as well as their optimized versions. The times shown are in microseconds. For accuracy, we take the running time of each one five times, followed by their average. The analysis shows the expected results: based on intuition, clear-using-index is the slowest, and clear-using-pointers is faster because it eliminates the steps of copying the array into memory and then setting up the offset based on the next address.
Table 2 For Size = 10
Index      Index (opt.)   Pointer    Pointer (opt.)
Time(μs)   Time(μs)       Time(μs)   Time(μs)
1          1              0          1
1          0              1          0
1          1              1          1
1          1              1          0
0          0              0          0
For Size = 100
Index      Index (opt.)   Pointer    Pointer (opt.)
Time(μs)   Time(μs)       Time(μs)   Time(μs)
1          0              1          1
1          1              1          0
1          1              1          0
1          1              1          1
1          0              0          0
For Size = 1000
Index      Index (opt.)   Pointer    Pointer (opt.)
Time(μs)   Time(μs)       Time(μs)   Time(μs)
3          1              2          0
2          1              2          1
2          1              2          1
2          1              1          1
2          1              2          1
For Size = 10000
Index      Index (opt.)   Pointer    Pointer (opt.)
Time(μs)   Time(μs)       Time(μs)   Time(μs)
51         14             28         10
23         13             26         9
23         13             26         9
24         14             27         11
24         13             26         10
[Tables for sizes 100000, 1000000, and 10000000: the measured values did not survive extraction.]
[Figure: time in μs (0 to 30000) versus size of N (0 to 10000000) for all four methods.]
All measurements are written in microseconds; that is why the numbers appear to be huge, but they really are not.
Optimization like this is used in many applications in order to make calculations faster. There are various settings in which we can apply it; the one we are going to optimize here is the dot product. The dot product plays a vital role, especially in calculating the magnitude of a vector. The code is shown in figures 24, 25, and 26.
In the code above, we include all the libraries, followed by the header file "Header.h". After that we declare two global arrays, both of size 10 at this point; later, the size will change for the running-time analysis. Then we declare main, where the custom size of the array is going to be 10 for now. On the next line I call the dot product function and display the output at the same time to verify the result. It is a good habit to verify the result in order to make sure that I have the right code.
This is the header file in which we declared the prototype of dot product.
This is the function file that performs the dot product. The function takes the two arrays and their size as arguments. It then declares the variable "sum", in which the result is going to be stored. We run a for loop in which each element of array1 is multiplied by the corresponding element of array2, and the product is added to the previous result stored in sum. Next we perform the running-time analysis of the dot product code; later we will see the difference in the time it takes each version to finish. The analysis follows the same pattern we used for clearing the arrays above: we take the average of five timed runs, and this step is repeated for all sizes of n.
For Size = 10
Dot Product Time(μs)   Optimized Time(μs)
189.88                 120.60
176.48                 95.22
131.43                 95.79
139.99                 97.79
129.44                 102.92
Average = 153.44       Average = 102.46

For Size = 100
Dot Product Time(μs)   Optimized Time(μs)
126.30                 100.36
188.45                 107.48
137.42                 128.01
165.07                 92.37
175.23                 95.22
Average = 158.49       Average = 104.69

For Size = 1000
Dot Product Time(μs)   Optimized Time(μs)
208.98                 142.27
137.70                 128.01
235.50                 166.79
258.30                 96.94
182.18                 137.42
Average = 200.53       Average = 134.29

For Size = 10000
Dot Product Time(μs)   Optimized Time(μs)
223.81                 127.44
177.05                 167.07
223.52                 123.16
209.27                 148.87
169.64                 124.88
Average = 209.66       Average = 138.28

For Size = 100000
Dot Product Time(μs)   Optimized Time(μs)
734.71                 435.92
520.60                 424.80
817.11                 324.73
560.51                 310.48
509.19                 351.53
Average = 628.42       Average = 369.49

For Size = 1000000
Dot Product Time(μs)   Optimized Time(μs)
4453.88                3125.58
4463.00                2388.31
4022.80                2222.95
4023.94                2995.29
3777.04                2480.11
Average = 4148.13      Average = 2642.45

Table 7 For Size = 10000000
Dot Product Time(μs)   Optimized Time(μs)
45006.36               24689.64
42639.44               24944.80
45186.55               24165.90
40640.30               25527.27
40752.63               24554.78
Average = 42845.06     Average = 24776.48
[Figure: dot product running time in μs (0 to 30000) versus size of n (0 to 10000000), unoptimized vs optimized.]
The advantage of intrinsic code is that it performs calculations much faster. In our case, we are going to see that intrinsic functions may be faster, but due to certain limitations they might end up slower overall.
In the code shown in figure 27 above, we can see how the intrinsic function works. We start by declaring a universal size variable so that in the future we do not have to change the value of size everywhere, which could consume a lot of time. After that I declared two full-size floating-point arrays. We also create two additional arrays that store only eight floating-point numbers each, because the intrinsic function _mm256_dp_ps can only compute a dot product within a 128-bit lane, which means it computes the dot product of only four elements at a time. So we first declare a for loop that copies eight floating-point numbers into the arrays of size eight. After that we load these floating-point values into two 256-bit variables, which we can think of as arrays at this point. Next we create another 256-bit intrinsic variable named result. Into result we store the dot product of intrinsic array one and intrinsic array two, using the command _mm256_dp_ps (dp for dot product and ps for packed single-precision floating point). Then we create a floating-point pointer to which we pass the contents of result. The last step is to compute the actual value, and we do it by creating a floating-point variable named value. Value stores the sum of the previous values plus the sum of the dot products of the two 128-bit lanes. And finally, to verify the answer, we display it.
[Figure: intrinsic dot product running time in μs (0 to 120000) versus size (0 to 10000000).]
Conclusion
This assignment was by far the most interesting one of all. I have finally learned how to reduce the time consumption of the same algorithm by directly editing its assembly code. The more we optimize the assembly and reduce the memory accesses, the faster the program runs. The run-time analysis suggests that as the size of the array increases, the optimized code becomes more efficient, by up to roughly 50%. There was another thing to learn about: in C/C++ there is something I had never heard of before, intrinsic functions. For performing calculations, intrinsic functions can be much faster than regular functions because they are designed to work that way; most of their functionality is based on performing mathematical calculations. Most of all, I also got the chance to create and test an intrinsic function to see its actual performance.