Beruflich Dokumente
Kultur Dokumente
Performance
10,000
1,000
Processor
Processor-Memory
Performance Gap
Growing
100
10
Memory
1
1980
1985
1990
1995
Year
2000
2005
2010
4.
5.
6.
2.50
1-way
2.00
2-way
4-way
8-way
1.50
1.00
0.50
16 KB
32 KB
64 KB
128 KB
256 KB
512 KB
1 MB
Cache size
Miss Penalty
Problems:
-
Data
Merging Arrays: improve spatial locality by single array of compound
elements vs. 2 arrays
Loop Interchange: change nesting of loops to access data in order
stored in memory
Loop Fusion: Combine 2 independent loops that have same looping
and some variables overlap
Blocking: Improve temporal locality by accessing blocks of data
repeatedly vs. going down whole columns or rows
Adapted from Patterson and Hennessey
(Morgan Kauffman Pubs)
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k+1){
r = r + y[i][k]*z[k][j];};
x[i][j] = r;
};
Best: 4x2
Mflop/s
Reference
Adapted from Patterson and Hennessey
(Morgan Kauffman Pubs)
Sun Ultra 2,
Sun Ultra 3,
AMD Opteron
Intel
Pentium M
IBM Power 4,
Intel/HP Itanium
Intel/HP
Itanium 2
IBM
Power 3
4
2
1
1
2
4
column block size (c)
All possible column block sizes selected for 8 computers; How could
compiler know?
Adapted from Patterson and Hennessey
(Morgan Kauffman Pubs)
Technique
Hit
Time
Bandwidth
Mi
ss
pe
nal
ty
Miss
rate
HW cost/
complexity
Comment
Way-predicting caches
Used in Pentium 4
Trace caches
Used in Pentium 4
Widely used
Widely used
Widely used
Nonblocking caches
Banked caches
Compiler-controlled
(Morgan Kauffman Pubs)
prefetching
2 instr., 3
data