P1COMP1

UNIVERSIDAD AUTNOMA DE MADRID
MASTERS PROGRAM IN RESEARCH AND INNOVATION IN INFORMATION

2
AND COMMUNICATION TECHNOLOGIES (I -ICT)
Numerical and Data-Intensive Computing (Course 2012/2013)
Laboratory 1: PROFILERS (perf and gprof)

Proceed carefully through the following steps, completing the lab report as requested. The
report must be delivered as a PDF document through the Moodle portal by the next lab class.
The material within this report can be used for the final term report, which is to be delivered by
the end of the ordinary evaluation period.
1. Download associated material (profilers.tar.gz) from Moodles course page into personal
working directory.
2. Uncompress and untar associated material:
gunzip profilers.tar.gz
tar xvf profilers.tar
PERF (I)
1. Go to directory task1:
cd profilers/task1
2. Edit and understand example task1.c:
gedit task1.c &
3. Compile task1.c:
make
4. Run task1 with the performance analysis tool (perf), reporting:
a. CPU cycles at user level: -e cycles:u
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e
cycles:u ./task1
CPU = 34.900000 ms
Performance counter stats for './task1':
10660168914
M/sec
3.505056003
cycles
seconds time elapsed
0.000
Estadisticas de cuantas veces se hizo lago

b. Machine code instructions at user level: e instructions:u
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e
instructions:u ./task1
CPU = 34.900000 ms
4002106745
instructions
0.000
IPC
3.508743022
seconds time elapsed
temporal@cmult-25-67-217:~/profilers/task1$
c. First-level data cache loads at user level: e l1-dcache-loads:u

d. First-level data cache load misses at user level: e l1-dcache-loadmisses:u
e. First-level data cache stores at user level: e l1-dcache-stores:u
f. First-level data cache store misses at user level: e l1-dcache-storemisses:u
g. Last-level cache loads at user level: e llc-loads:u
h. Last-level cache load misses at user level: e llc-load-misses:u
i. Last-level cache stores at user level: e llc-stores:u
j. Last-level cache store misses at user level: e llc-store-misses:u
perf
e
-e
-e
-e
stat e cycles:u e instructions:u \

l1-dcache-loads:u e l1-dcache-load-misses:u \
l1-dcache-stores:u e l1-dcache-store-misses:u \
llc-loads:u e llc-load-misses:u \
llc-stores:u e llc-store-misses:u ./task1
numero de fallos a partir de nivel1

cuantas veces almaceno en la cache 1
fallo de cahe de nivel1
Last level Cache3 LLC
fallos y lecturas en L1 y en LLC
L! memoria cache 1
l2 memoria cache 2
l3 memoria cache 3 last nivel cache
luego va a la memoria ram
preguntar a la CPU cuantos fallos, almacenamientos
perf provee informacion estadistica de CPU
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u ./task1
CPU = 34.800000 ms
10657835906 cycles
4002106780 instructions
0.000 M/sec
0.376 IPC
3.502425458 seconds time elapsed
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u

./task1
CPU = 34.900000 ms
10652615063 cycles
0.000 M/sec
400036820 L1-dcache-loads
0.376 IPC
#
0.000 M/sec
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

l1-dcache-load-misses:u ./task1
CPU = 34.800000 ms
10644786361 cycles
0.000 M/sec
0.376 IPC
#
935423583 L1-dcache-load-misses
0.000 M/sec
#
0.000 M/sec

l1-dcache-load-misses:u -e l1-dcache-stores:u ./task1
CPU = 34.800000 ms
10651824488 cycles
0.000 M/sec
0.376 IPC
#
800018046 L1-dcache-stores
0.000 M/sec
#
0.000 M/sec
0.000 M/sec

l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u ./task1
CPU = 34.800000 ms
10651317588 cycles
0.000 M/sec
0.376 IPC
#
0 L1-dcache-store-misses #
0.000 M/sec
#
0.000 M/sec
0.000 M/sec
0.000 M/sec

l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u ./task1
CPU = 34.900000 ms
10673775578 cycles
0.000 M/sec (scaled from 71.43%)
0.375 IPC
#

#

543983919 LLC-loads
(scaled from 85.73%)

l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u ./task1
CPU = 34.800000 ms
10646911534 cycles
0.376 IPC
#

#

533935204 LLC-loads

l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-loadmisses:u ./task1
CPU = 34.900000 ms
10666020714 cycles
0.375 IPC
#

#
56811441 LLC-load-misses

535311019 LLC-loads

#

l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u -e llc-load-misses:u
-e llc-stores:u ./task1
CPU = 34.900000 ms
10682090958 cycles
0.375 IPC
#

#
383485849 LLC-stores

#

549729695 LLC-loads


-e llc-stores:u -e llc-store-misses:u./task1
usage: perf stat [<options>] <command>
-e, --event <event> event selector. use 'perf list' to list available events
-i, --inherit
-p, --pid <n>
-a, --all-cpus
-c, --scale
child tasks inherit counters

stat events on existing pid
system-wide collection from all CPUs
scale/normalize counters
-v, --verbose
be more verbose (show counter open errors, etc)
-r, --repeat <n>
repeat command and print average + stddev (max: 100)
-n, --null
null run - dont start any counters

-e llc-stores:u -e llc-store-misses:u ./task1
CPU = 34.900000 ms
10659650396 cycles
0.375 IPC
#
546259120 LLC-loads


#
10877361 LLC-store-misses

5. Exchange indices in source code of task1.c, such that array is now traversed by rows
instead of by columns, hence taking advantage of both the row-major order used in C
and the cache hierarchy:

array[j][i]
array[i][j]
6. Compile task1.c and execute it through perf, reporting the same events as before.
7. Analyze the new results and compare them with the previous results.
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u
-e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.700000 ms
6034968746 cycles

#
0.681 IPC (scaled from 59.67%)
18247292 LLC-loads

#


#

8. Study the performance of the program without optimization (remove O2 from the
Makefile) and different types of optimization (-O1, -O2, -O3).
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcacheloads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u -e llc-loads:u
-e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 19.900000 ms
6067503773 cycles

#

#
18391820 LLC-loads

#


#
CPU = 19.800000 ms
6038718173 cycles

#

#

#

18696232 LLC-loads

#

#
CPU = 19.800000 ms
6053984600 cycles

#
18570290 LLC-loads

#


#

1
0
CPU = 19.600000 ms
6037123792 cycles

#

#
18025137 LLC-loads

#


#
CPU = 19.800000 ms
6035870937 cycles

#

#

#

1
1
18108552 LLC-loads

#

#
9. Write conclusions to lab report (homework).
10.
Sin optimizar
temporal@cmult-25-67-217:~/profilers/task1$ make
gcc -c task1.c
gcc -o task1 task1.o
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-storemisses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1
CPU = 52.400000 ms
16007060492 cycles
0.575 IPC
588372537 LLC-loads

#


#

1
2

#
Optimizando O1
emporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-storemisses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1CPU
= 52.400000 ms
16023716367 cycles
0.574 IPC
587307597 LLC-loads

#

#

Opt O2
temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l11
3
dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-storemisses:u -e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task1

CPU = 52.400000 ms
16020307101 cycles
0.575 IPC
588535120 LLC-loads

#

#
Opt O3
CPU = 52.900000 ms
16137204247 cycles
0.571 IPC
#

1
4
#
#
585647246 LLC-loads


cambiando indices O3
CPU = 5.700000 ms
1740572548 cycles

#
0.829 IPC
9433803 LLC-loads

#


#


1
5
Cambio indices Opt O2
CPU = 5.700000 ms
1724945432 cycles

#
0.815 IPC
9583820 LLC-loads


#

Cambio de Indices Opt O1
CPU = 5.700000 ms
1
6
1739406471 cycles

#
0.816 IPC
9261294 LLC-loads

#

#
cambio los dos indices opt O2
CPU = 5.400000 ms
1661158018 cycles

#
2.428 IPC
#

#


1
7
6146244 LLC-loads
6186194 LLC-stores

#
cambio los dos indices opt O1

CPU = 5.400000 ms
1639985549 cycles

#
2.435 IPC
6303632 LLC-loads


6224012 LLC-stores

#
#
#

1
8
Cambio los dos indices opt O3

CPU = 5.400000 ms
1648017829 cycles

#
2.434 IPC
6264775 LLC-loads


6212471 LLC-stores

#
#
#

antes de paralelizar hay que optimizar vaya lo mas rapido posible aun cuando sea mas
lento
Como optimizar el programa para paralelizar
Cuando se compila el codigo de maquina como se compila
tres niveles de optimizacion
el compilador ve el programa para que vaya mas rapido
Cambios que hace el compilador
para ver si va mas rapido
Como afecta la optimizacion con el resultado
PERF (II)
1
9
1. Go to directory profilers/task2 (matrix multiplication).

2. Edit and understand example task2.c.
3. Compile task2.c.
4. Run task2 with the performance analysis tool (perf).
temporal@cmult-25-67-217:~/profilers/task2$ perf stat -e cycles:u -e instructions:u -e l1dcache-loads:u -e l1-dcache-load-misses:u -e l1-dcache-stores:u -e l1-dcache-store-misses:u
-e llc-loads:u -e llc-load-misses:u -e llc-stores:u -e llc-store-misses:u ./task2
Equal 0
CPU = 3110.000000 ms
9611683298 cycles

#
1018152124 LLC-loads

#
62788 LLC-stores

#
5.
Implement a new version of the matrix multiplication function (Mult2) that takes
advantage of both the row-major order used in C and the cache hierarchy.
Standard matrix multiplication (Mult1)
+
2
0
Optimized matrix multiplication (Mult2)
6. Run task2 and verify that both functions (Mult1, Mult2) yield the same result.
2
1
7. Comment call to Mult1 in task2.c and run task2 with the performance analysis tool
(perf).
8. Compare the performance of both functions and write conclusions to lab report,
including the implementation of Mult2 (homework).
GPROF
1. Go to directory profilers/task3 (image processing algorithm).
2. Edit and understand structure of Makefile. Option -pg at compile time forces
compiler to generate profile data suitable for gprof.
3. Compile program:
make
4. Run program (it generates binary file gmon.out with profile data). Input and output
bitmap images can be viewed with any image visualization program (gimp, xv, )
./algi channel1.bmp channel2.bmp
5. Run gprof generating profile file:
gprof algi > profile
6. Edit and understand the self-explained profile file.
7. Identify functions that consume a significant percentage of running time (bottlenecks)
best candidates to be optimized / parallelized. Any improvement on them will have
a significant impact on the overall running time:
a. Large functions (big self ms/call value) with a big percentage of running time
(big % time value).
b. Small functions (low self ms/call value) that run very frequently (big %
time value).
8. Write conclusions to lab report, including description of bottleneck functions
(homework).
KPROF (graphical front-end to gprof)
1. Execute kprof.
2. Open profile file generated by gprof:
File Open
3. Examine tabs Flat Profile, Hierarchical Profile and Graph View.
2
2
Laboratory 1: OpenMP (Compilation and basic directives)

Proceed through the following steps, completing the lab report as requested.
1. Obtain CPU information from the Linux kernel:
cat /proc/cpuinfo > cpuinfo.txt
gedit cpuinfo.txt
2. Identify the CPUs model in cpuinfo.txt. Example:
model name : Intel(R) Core(TM) i5 CPU
650
@ 3.20GHz
3. Identify the number of CPUs in cpuinfo.txt. The number of CPUs is equal to the number
of different physical identifiers of the available logical processors. Example for a single
CPU Intel Core i5:
processor : 0
physical
processor : 1
physical
processor : 2
physical
processor : 3
physical
(first logical processor)

id: 0
id: 0
id: 0
id: 0
4. Identify number of cores per CPU in cpuinfo.txt. Example for Intel Core i5 with 2 cores:
cpu cores
: 2
5. Identify number of hardware supported threads (i.e.: logical processors) per CPU in
cpuinfo.txt. If the number of supported threads is N times the number of cores, the CPU
supports hyper-threading and each core will be able to concurrently execute N of those
threads by sharing its internal resources (ALU, FPU, etc.). Example for Intel Core i5 with 2
cores and hyper-threading:
siblings
: 4
6. Identify what logical processors correspond to each CPU and core. Example for a single
Intel Core i5 in which the first two logical processors are mapped to the first core and the
last two logical processors to the second core, with a single CPU (physical processor):
processor : 0
physical id
core id
processor : 1
physical id
core id
processor : 2
physical id
core id
processor : 3
physical id
core id
: 0
: 0
: 0
: 0
: 0
: 2
: 0
: 2
7. Using a web browser, verify the number of cores and threads per core on the Internet based
on the CPUs model information. Write conclusions to lab report (homework).
8. Download associated material (openmp1.tar.gz) from Moodles course page into personal
working directory.
9. Uncompress and untar associated material:
2
3
gunzip openmp1.tar.gz
tar xvf openmp1.tar
10. Go to directory openmp1/task1.
11. Edit and understand structure of Makefile. Option -fopenmp at compile time forces
compiler to understand OpenMP directives. Option -lgomp at link time forces linker to
include the OpenMP library for Linux (GOMP).
13. Execute in a new terminal the run-time CPU monitor mpstat (if not available, execute
gnome_system_monitor instead):
xterm &
mpstat P ALL 1
(create new terminal)

(run from the new terminal)
mpstat shows statistical information about each available logical processor, including
percentage of CPU load at the user level (%usr) and the system level (%sys).
14. In task1.c, set the number of OpenMP threads (constant NUM_THREADS) to 1. From
the initial terminal, compile the program and execute it, writing down the wall time (real
execution time). The latter is the minimum sequential time (Ts) of the algorithm. See how
mpstat shows what logical processor is executing the program.
15. Set the number of threads to 2 in task1.c, recompile the program and run it, checking with
mpstat what logical processors are executing both threads. Execute the program several
times. The operating system automatically maps every thread to a different core. The logical
processor within the core may vary from an execution to the next. Write down the average
wall time of all executions, which corresponds to the parallel time for two cores (Tp).
16. Compute speedup and efficiency for two cores.
17. Force the mapping of threads to logical processors, such that both OpenMP threads are
mapped to the first two logical processors. If the CPU supports hyper-threading, the first
two logical processors are executed by the same core. Example:
export GOMP_CPU_AFFINITY=0 1
18. Run the program several times and compute speedup and efficiency for two threads
executed by the same core. This measures the performance of hyper-threading for this
particular CPU-intensive application.
19. Force the mapping of threads to specific logical processors of different cores. Example:
20. Run the program several times and compute speedup and efficiency for two threads
executed in specific logical processors belonging to different cores. Compare that
performance with the one obtained through the automatic mapping of threads to logical
processors provided by the operating system.
21. In task1.c, set the number of OpenMP threads to 4 and force that all threads are run by the
same logical processor. In case of several threads assigned to the same logical processor, the
latter executes them with time-sharing. Example:
export GOMP_CPU_AFFINITY=3
22. Run the program several times and compute speedup and efficiency for four threads
executed by the same logical processor.
23. In task1.c, set the number of OpenMP threads to 4 and force that all threads are run by the
logical processors belonging to the same core. Example:
24. Run the program several times and compute speedup and efficiency for four threads
executed by the logical processors within the same core.
25. Release the explicit mapping of threads to specific logical processors, such that this
mapping be left to the operating system again:
export GOMP_CPU_AFFINITY=
26. Write down the results and conclusions to lab report (homework).
29. Compile and run task2 several times. Realize that the PID is always different at the
beginning of the parallel body, and the same at its end. Analyze and interpret this behavior.
30. Declare variable pid as private. Compile the program and run it again several times,
realizing that the PID now is always different. Analyze and interpret this behavior.
32. Run task2 again and realize that private variable limit, which is initialized to -1 in its
program declaration, is reset to zero at the begging of the parallel body, whereas it is set
back to -1 when the master thread resumes its execution right after the parallel body.
Analyze this behavior by considering that every thread within a parallel region has a local
copy of all its private variables.
33. Change the private clause for firstprivate, which initializes the local copies of private
variables to their original value. Compile and run the program again realizing the difference.
37. Compile and run task3 several times. Analyze why all threads run alternately.
38. Comment both the omp_set_lock and the omp_unset_lock function calls. Compile and
run again several times. Analyze why the threads do not run alternately.
39. Include the critical region into a critical directive. Compile and run again several times.
The result is the same as when locks are utilized. Example:
#pragma omp critical
{
// Critical region: One thread at a time
}
...
40. Remove the critical directive. Insert a barrier synchronization right above the workload.
Compile and run again several times. Analyze why the threads run alternately again. Since
there is no critical region, the workload of both threads is running with global
synchronization but without mutual exclusion. Example:
// Critical region: One thread at a time
#pragma omp barrier
// Workload
41. Go back to the original task3.c with the wait and signal semaphore calls. Comment the
omp_unset_lock (wait but no signal). Compile and run it again. Analyze why the first
thread runs only once and the program halts. Press Ctrl-C to stop the program.
42. Uncomment the omp_unset_lock and insert a barrier synchronization right above the
workload. Compile and run again. Analyze why the program halts right from the beginning.
Press Ctrl-C to stop the program.

P1COMP1

Hochgeladen von

Dokumentinformationen

Originalbeschreibung:

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

P1COMP1

Hochgeladen von

Copyright:

Verfügbare Formate

UNIVERSIDAD AUTNOMA DE MADRID

MASTERS PROGRAM IN RESEARCH AND INNOVATION IN INFORMATION

Laboratory 1: PROFILERS (perf and gprof)

Performance counter stats for './task1':

seconds time elapsed

Estadisticas de cuantas veces se hizo lago

Performance counter stats for './task1':

seconds time elapsed

c. First-level data cache loads at user level: e l1-dcache-loads:u

stat e cycles:u e instructions:u \

numero de fallos a partir de nivel1

Performance counter stats for './task1':

3.502425458 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u

Performance counter stats for './task1':

3.506406766 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

3.497828986 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

3.502611901 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

3.506921655 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

0.000 M/sec (scaled from 71.43%)

0.000 M/sec (scaled from 85.73%)

0.000 M/sec (scaled from 85.74%)

(scaled from 85.73%)

0.000 M/sec (scaled from 85.73%)

0.000 M/sec (scaled from 57.06%)

3.505318415 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

0.000 M/sec (scaled from 71.39%)

0.000 M/sec (scaled from 85.70%)

0.000 M/sec (scaled from 85.70%)

(scaled from 85.69%)

0.000 M/sec (scaled from 85.81%)

0.000 M/sec (scaled from 57.10%)

3.495701801 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

0.000 M/sec (scaled from 62.35%)

0.000 M/sec (scaled from 74.90%)

0.000 M/sec (scaled from 75.00%)

(scaled from 74.90%)

0.000 M/sec (scaled from 75.13%)

0.000 M/sec (scaled from 49.98%)

0.000 M/sec (scaled from 49.86%)

3.506729036 seconds time elapsed

temporal@cmult-25-67-217:~/profilers/task1$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e

Performance counter stats for './task1':

0.000 M/sec (scaled from 55.38%)

0.000 M/sec (scaled from 66.66%)

0.000 M/sec (scaled from 66.84%)

0.000 M/sec (scaled from 44.40%)

0.000 M/sec (scaled from 66.77%)

(scaled from 66.55%)

0.000 M/sec (scaled from 44.27%)

3.509918495 seconds time elapsed