10660168914 cycles
3.505056003 GHz
4002106745 instructions
0.375 IPC
3.508743022 seconds time elapsed
temporal@cmult-25-67-217:~/profilers/task1$
perf stat -e <event> [-e <event> ...] ./task1
10657835906 cycles
4002106780 instructions
0.376 IPC
10652615063 cycles
4002106781 instructions
0.376 IPC
400036820 L1-dcache-loads
10644786361 cycles
4002106769 instructions
0.376 IPC
400036808 L1-dcache-loads
935423583 L1-dcache-load-misses
10651824488 cycles
4002106765 instructions
0.376 IPC
400036804 L1-dcache-loads
935870744 L1-dcache-load-misses
800018046 L1-dcache-stores
10651317588 cycles
4002106782 instructions
0.376 IPC
400036821 L1-dcache-loads
936246092 L1-dcache-load-misses
800018063 L1-dcache-stores
0 L1-dcache-store-misses
10673775578 cycles
4003634110 instructions
0.375 IPC
398972589 L1-dcache-loads
466453707 L1-dcache-load-misses
799469226 L1-dcache-stores
196830114 L1-dcache-store-misses
543983919 LLC-loads
10646911534 cycles
4003947805 instructions
0.376 IPC
400485285 L1-dcache-loads
467534052 L1-dcache-load-misses
799711711 L1-dcache-stores
198216756 L1-dcache-store-misses
533935204 LLC-loads
CPU = 34.900000 ms
10666020714 cycles
4004466241 instructions
0.375 IPC
398378214 L1-dcache-loads
465165709 L1-dcache-load-misses
799150369 L1-dcache-stores
197478094 L1-dcache-store-misses
535311019 LLC-loads
56811441 LLC-load-misses
10682090958 cycles
4003403900 instructions
0.375 IPC
396680731 L1-dcache-loads
471412419 L1-dcache-load-misses
799389996 L1-dcache-stores
202155031 L1-dcache-store-misses
549729695 LLC-loads
68255348 LLC-load-misses
383485849 LLC-stores
-e, --event <event> event selector. use 'perf list' to list available events
-i, --inherit
-p, --pid <n>
-a, --all-cpus
-c, --scale
-v, --verbose
-n, --null
10659650396 cycles
3998752355 instructions
0.375 IPC
394546020 L1-dcache-loads
932447914 L1-dcache-load-misses
799272109 L1-dcache-stores
0 L1-dcache-store-misses
546259120 LLC-loads
68158849 LLC-load-misses
389598397 LLC-stores
10877361 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
5. Exchange the indices in the source code of task1.c, such that the array is now traversed
by rows instead of by columns, hence taking advantage of the row-major order used in C:
array[i][j]
6. Compile task1.c and execute it through perf, reporting the same events as before.
7. Analyze the new results and compare them with the previous results.
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u
-e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.700000 ms
6034968746 cycles
4110499159 instructions
419891997 L1-dcache-loads
420610030 L1-dcache-load-misses
781834156 L1-dcache-stores
0 L1-dcache-store-misses #
18247292 LLC-loads
236547 LLC-load-misses
397726674 LLC-stores
9624784 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
8. Study the performance of the program without optimization (remove O2 from the
Makefile) and different types of optimization (-O1, -O2, -O3).
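The course Makefile is not reproduced here; a minimal sketch, assuming the optimization level is set through CFLAGS (rule and file names taken from the make output shown later in this report):

```make
CFLAGS = -O2          # change to -O1 or -O3, or leave empty for no optimization

task1: task1.o
	gcc -o task1 task1.o

task1.o: task1.c
	gcc $(CFLAGS) -c task1.c
```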
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.900000 ms
6067503773 cycles
4118954929 instructions
411692296 L1-dcache-loads
424837064 L1-dcache-load-misses
793457905 L1-dcache-stores
0 L1-dcache-store-misses #
18391820 LLC-loads
401793291 LLC-stores
199376 LLC-load-misses
7865504 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms
6038718173 cycles
4037654570 instructions
395531227 L1-dcache-loads
418401802 L1-dcache-load-misses
801772715 L1-dcache-stores
0 L1-dcache-store-misses #
18696232 LLC-loads
186117 LLC-load-misses
402380246 LLC-stores
7854278 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms
6053984600 cycles
3987057427 instructions
400200988 L1-dcache-loads
415827870 L1-dcache-load-misses
803989205 L1-dcache-stores
0 L1-dcache-store-misses #
18570290 LLC-loads
184409 LLC-load-misses
404119786 LLC-stores
8084878 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.600000 ms
6037123792 cycles
4116785774 instructions
435708410 L1-dcache-loads
420659108 L1-dcache-load-misses
777460939 L1-dcache-stores
0 L1-dcache-store-misses #
18025137 LLC-loads
395535681 LLC-stores
251399 LLC-load-misses
8632229 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.800000 ms
6035870937 cycles
4117896047 instructions
429597588 L1-dcache-loads
419548640 L1-dcache-load-misses
777307875 L1-dcache-stores
0 L1-dcache-store-misses #
18108552 LLC-loads
214429 LLC-load-misses
396972530 LLC-stores
8410576 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
9. Write conclusions to lab report (homework).
10.
Without optimization
temporal@cmult-25-67-217:~/profilers/task1$ make
gcc -c task1.c
gcc -o task1 task1.o
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms
16007060492 cycles
9205605320 instructions
0.575 IPC
3598395908 L1-dcache-loads
987594656 L1-dcache-load-misses
1598375817 L1-dcache-stores
0 L1-dcache-store-misses #
588372537 LLC-loads
92496263 LLC-load-misses
401075795 LLC-stores
7658290 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
Optimizing with -O1
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms
16023716367 cycles
9200493948 instructions
0.574 IPC
3597648558 L1-dcache-loads
988629682 L1-dcache-load-misses
1599266301 L1-dcache-stores
0 L1-dcache-store-misses #
587307597 LLC-loads
403045072 LLC-stores
7594972 LLC-store-misses
62861608 LLC-load-misses
temporal@cmult-25-67-217:~/profilers/task1$
Optimizing with -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
16020307101 cycles
9209170086 instructions
0.575 IPC
3599616932 L1-dcache-loads
987723501 L1-dcache-load-misses
1599231431 L1-dcache-stores
0 L1-dcache-store-misses #
588535120 LLC-loads
86883798 LLC-load-misses
8350090 LLC-store-misses
401122256 LLC-stores
temporal@cmult-25-67-217:~/profilers/task1$
Optimizing with -O3
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.900000 ms
16137204247 cycles
9209130186 instructions
0.571 IPC
3600613819 L1-dcache-loads
986946125 L1-dcache-load-misses
1597655842 L1-dcache-stores
0 L1-dcache-store-misses
585647246 LLC-loads
97276744 LLC-load-misses
406561279 LLC-stores
8890869 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
Exchanging indices, -O3
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms
1740572548 cycles
1443499389 instructions
0.829 IPC
97700052 L1-dcache-loads
109374222 L1-dcache-load-misses
197872525 L1-dcache-stores
0 L1-dcache-store-misses #
9433803 LLC-loads
216797 LLC-load-misses
102765566 LLC-stores
8736361 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
Exchanging indices, -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms
1724945432 cycles
1405792427 instructions
0.815 IPC
96543406 L1-dcache-loads
108272949 L1-dcache-load-misses
199183731 L1-dcache-stores
0 L1-dcache-store-misses #
9583820 LLC-loads
102350319 LLC-stores
8069378 LLC-store-misses
199921 LLC-load-misses
temporal@cmult-25-67-217:~/profilers/task1$
Exchanging indices, -O1
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.700000 ms
1739406471 cycles
1419222785 instructions
0.816 IPC
99982320 L1-dcache-loads
109200483 L1-dcache-load-misses
197092356 L1-dcache-stores
0 L1-dcache-store-misses #
9261294 LLC-loads
102250007 LLC-stores
194582 LLC-load-misses
8148738 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
Exchanging both indices, -O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 5.400000 ms
1661158018 cycles
4032625761 instructions
2.428 IPC
400393144 L1-dcache-loads
12301881 L1-dcache-load-misses
787831854 L1-dcache-stores
0 L1-dcache-store-misses #
6146244 LLC-loads
126402 LLC-load-misses
6186194 LLC-stores
123017 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
1639985549 cycles
3993195021 instructions
2.435 IPC
393478492 L1-dcache-loads
12291323 L1-dcache-load-misses
792372646 L1-dcache-stores
0 L1-dcache-store-misses #
6303632 LLC-loads
127549 LLC-load-misses
6224012 LLC-stores
117652 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
1648017829 cycles
4010909417 instructions
2.434 IPC
395102231 L1-dcache-loads
12261564 L1-dcache-load-misses
790829417 L1-dcache-stores
0 L1-dcache-store-misses #
6264775 LLC-loads
129520 LLC-load-misses
6212471 LLC-stores
119045 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task1$
Before parallelizing, the program has to be optimized so that it runs as fast as possible,
even if compilation is slower.
How to optimize the program before parallelizing it.
How the source code is compiled into machine code.
Three levels of optimization: the compiler transforms the program so that it runs faster.
Changes made by the compiler, and checking whether the program actually runs faster.
How optimization affects the result.
PERF (II)
9611683298 cycles
12225779862 instructions
5087796167 L1-dcache-loads
1015809083 L1-dcache-load-misses
1056402996 L1-dcache-stores
0 L1-dcache-store-misses
1018152124 LLC-loads
14834768 LLC-load-misses
62788 LLC-stores
59957 LLC-store-misses
temporal@cmult-25-67-217:~/profilers/task2$
5.
Implement a new version of the matrix multiplication function (Mult2) that takes
advantage of both the row-major order used in C and the cache hierarchy.
Standard matrix multiplication (Mult1)
6. Run task2 and verify that both functions (Mult1, Mult2) yield the same result.
7. Comment call to Mult1 in task2.c and run task2 with the performance analysis tool
(perf).
8. Compare the performance of both functions and write conclusions to lab report,
including the implementation of Mult2 (homework).
GPROF
1. Go to directory profilers/task3 (image processing algorithm).
2. Edit and understand structure of Makefile. Option -pg at compile time forces the
compiler to generate profile data suitable for gprof.
3. Compile program:
make
4. Run program (it generates binary file gmon.out with profile data). Input and output
bitmap images can be viewed with any image visualization program (gimp, xv, etc.)
./algi channel1.bmp channel2.bmp
5. Run gprof generating profile file:
gprof algi > profile
6. Edit and understand the self-explained profile file.
7. Identify functions that consume a significant percentage of running time (bottlenecks):
the best candidates to be optimized / parallelized. Any improvement on them will have
a significant impact on the overall running time:
a. Large functions (big self ms/call value) with a big percentage of running time
(big % time value).
b. Small functions (low self ms/call value) that run very frequently (big %
time value).
8. Write conclusions to lab report, including description of bottleneck functions
(homework).
KPROF (graphical front-end to gprof)
1. Execute kprof.
2. Open profile file generated by gprof:
File > Open
3. Examine tabs Flat Profile, Hierarchical Profile and Graph View.
model name : Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz
3. Identify the number of CPUs in cpuinfo.txt. The number of CPUs is equal to the number
of different physical identifiers of the available logical processors. Example for a single
CPU Intel Core i5:
processor : 0
physical id : 0
processor : 1
physical id : 0
processor : 2
physical id : 0
processor : 3
physical id : 0
4. Identify number of cores per CPU in cpuinfo.txt. Example for Intel Core i5 with 2 cores:
cpu cores : 2
5. Identify number of hardware supported threads (i.e.: logical processors) per CPU in
cpuinfo.txt. If the number of supported threads is N times the number of cores, the CPU
supports hyper-threading and each core will be able to concurrently execute N of those
threads by sharing its internal resources (ALU, FPU, etc.). Example for Intel Core i5 with 2
cores and hyper-threading:
siblings : 4
6. Identify what logical processors correspond to each CPU and core. Example for a single
Intel Core i5 in which the first two logical processors are mapped to the first core and the
last two logical processors to the second core, with a single CPU (physical processor):
processor : 0
physical id : 0
core id : 0
processor : 1
physical id : 0
core id : 0
processor : 2
physical id : 0
core id : 2
processor : 3
physical id : 0
core id : 2
7. Using a web browser, verify the number of cores and threads per core on the Internet based
on the CPUs model information. Write conclusions to lab report (homework).
8. Download associated material (openmp1.tar.gz) from the Moodle course page into a personal
working directory.
9. Uncompress and untar associated material:
gunzip openmp1.tar.gz
tar xvf openmp1.tar
10. Go to directory openmp1/task1.
11. Edit and understand structure of Makefile. Option -fopenmp at compile time forces the
compiler to understand OpenMP directives. Option -lgomp at link time forces the linker to
include the OpenMP library for Linux (GOMP).
12. Edit and understand example task1.c.
13. Execute in a new terminal the run-time CPU monitor mpstat (if not available, execute
gnome-system-monitor instead):
xterm &
mpstat -P ALL 1
mpstat shows statistical information about each available logical processor, including
percentage of CPU load at the user level (%usr) and the system level (%sys).
14. In task1.c, set the number of OpenMP threads (constant NUM_THREADS) to 1. From
the initial terminal, compile the program and execute it, writing down the wall time (real
execution time). The latter is the minimum sequential time (Ts) of the algorithm. See how
mpstat shows what logical processor is executing the program.
15. Set the number of threads to 2 in task1.c, recompile the program and run it, checking with
mpstat what logical processors are executing both threads. Execute the program several
times. The operating system automatically maps every thread to a different core. The logical
processor within the core may vary from an execution to the next. Write down the average
wall time of all executions, which corresponds to the parallel time for two cores (Tp).
16. Compute speedup and efficiency for two cores.
17. Force the mapping of threads to logical processors, such that both OpenMP threads are
mapped to the first two logical processors. If the CPU supports hyper-threading, the first
two logical processors are executed by the same core. Example:
export GOMP_CPU_AFFINITY="0 1"
18. Run the program several times and compute speedup and efficiency for two threads
executed by the same core. This measures the performance of hyper-threading for this
particular CPU-intensive application.
19. Force the mapping of threads to specific logical processors of different cores. Example:
export GOMP_CPU_AFFINITY="1 3"
20. Run the program several times and compute speedup and efficiency for two threads
executed in specific logical processors belonging to different cores. Compare that
performance with the one obtained through the automatic mapping of threads to logical
processors provided by the operating system.
21. In task1.c, set the number of OpenMP threads to 4 and force that all threads are run by the
same logical processor. In case of several threads assigned to the same logical processor, the
latter executes them with time-sharing. Example:
export GOMP_CPU_AFFINITY=3
22. Run the program several times and compute speedup and efficiency for four threads
executed by the same logical processor.
23. In task1.c, set the number of OpenMP threads to 4 and force that all threads are run by the
logical processors belonging to the same core. Example:
export GOMP_CPU_AFFINITY="2 3"
24. Run the program several times and compute speedup and efficiency for four threads
executed by the logical processors within the same core.
25. Release the explicit mapping of threads to specific logical processors, such that this
mapping be left to the operating system again:
export GOMP_CPU_AFFINITY=
26. Write down the results and conclusions to lab report (homework).
27. Go to directory openmp1/task2.
28. Edit and understand example task2.c.
29. Compile and run task2 several times. Realize that the PID is always different at the
beginning of the parallel body, and the same at its end. Analyze and interpret this behavior.
30. Declare variable pid as private. Compile the program and run it again several times,
realizing that the PID now is always different. Analyze and interpret this behavior.
31. Write conclusions to lab report (homework).
32. Run task2 again and realize that private variable limit, which is initialized to -1 in its
program declaration, is reset to zero at the beginning of the parallel body, whereas it is set
back to -1 when the master thread resumes its execution right after the parallel body.
Analyze this behavior by considering that every thread within a parallel region has a local
copy of all its private variables.
33. Change the private clause for firstprivate, which initializes the local copies of private
variables to their original value. Compile and run the program again realizing the difference.
34. Write conclusions to lab report (homework).
35. Go to directory openmp1/task3.
36. Edit and understand example task3.c.
37. Compile and run task3 several times. Analyze why all threads run alternately.
38. Comment both the omp_set_lock and the omp_unset_lock function calls. Compile and
run again several times. Analyze why the threads do not run alternately.
39. Include the critical region into a critical directive. Compile and run again several times.
The result is the same as when locks are utilized. Example:
#pragma omp critical
{
// Critical region: One thread at a time
}
...
40. Remove the critical directive. Insert a barrier synchronization right above the workload.
Compile and run again several times. Analyze why the threads run alternately again. Since
there is no critical region, the workload of both threads is running with global
synchronization but without mutual exclusion. Example:
// Critical region: One thread at a time
#pragma omp barrier
// Workload
41. Go back to the original task3.c with the wait and signal semaphore calls. Comment the
omp_unset_lock (wait but no signal). Compile and run it again. Analyze why the first
thread runs only once and the program halts. Press Ctrl-C to stop the program.
42. Uncomment the omp_unset_lock and insert a barrier synchronization right above the
workload. Compile and run again. Analyze why the program halts right from the beginning.
Press Ctrl-C to stop the program.
43. Write conclusions to lab report (homework).