Effectiveness of Remote Cache in a NUMA System

Don DeSota, Sequent Computers


Ruth Forester, Sequent Computers
In NUMA systems, remote accesses will degrade performance
A level 3 cache for remote accesses can be used to decrease the number of actual remote accesses
Effectiveness of the remote cache will be evaluated three ways:
    Decrease in the utilization of system interconnect BW
    Increase in work done
    Ability to compensate for higher remote latencies
Factors that contribute to the effectiveness of the remote cache are:
    Remote reference rates
    L2 and remote cache hit rates
    Remote latency
System Overview

[Figure: block diagram of four quads. Each quad contains four processors, each with its own L2 cache, plus an I/O bridge, a memory controller with local memory, and a link with the remote cache.]

The platform is built of 4-processor SMP blocks, called quads, each with memory, I/O, and a remote cache
The remote cache services 4 processors
The remote cache is 4-way set associative
In the PentiumPro systems the remote cache is 32M; in the Xeon systems it is 128M
The PentiumPro L2 is 4-way 1M; the Xeon L2 is 4-way 2M
The L2/SMP bus line size is 32 bytes, while the remote cache line size is 64 bytes
The interconnect is an SCI ring
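
The cache geometries listed above imply a set count for each cache. The following is a minimal sketch of that arithmetic, assuming "M" means 2**20 bytes and that every cache is organized as plain sets of equal-size lines; the helper name `cache_sets` is illustrative and does not come from the slides.

```python
MIB = 1 << 20  # assumption: "M" in the slides means 2**20 bytes


def cache_sets(size_bytes, ways, line_bytes):
    """Return the number of sets in a set-associative cache."""
    total_lines = size_bytes // line_bytes
    return total_lines // ways


# Remote cache: 4-way, 64-byte lines (32M on PentiumPro, 128M on Xeon)
ppro_rc_sets = cache_sets(32 * MIB, ways=4, line_bytes=64)
xeon_rc_sets = cache_sets(128 * MIB, ways=4, line_bytes=64)

# L2 cache: 4-way, 32-byte lines (1M on PentiumPro, 2M on Xeon)
ppro_l2_sets = cache_sets(1 * MIB, ways=4, line_bytes=32)
xeon_l2_sets = cache_sets(2 * MIB, ways=4, line_bytes=32)

print(ppro_rc_sets, xeon_rc_sets, ppro_l2_sets, xeon_l2_sets)
```

Because a remote cache line (64 bytes) is twice the L2/SMP bus line (32 bytes), each remote cache line covers two bus-sized lines.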
Test Configurations

Workload  OS       Quads  Processor  RC    Database
cust1     NUMA     8      PPro       32M   DB-A
cust2     NUMA     8      PPro       32M   DB-A
cust3-1   NONNUMA  4      PPro       32M   DB-A
cust3-2   NONNUMA  4      PPro       32M   DB-A
cust3-3   NONNUMA  2      PPro       32M   DB-A
tpcc1     NUMA     8      PPro       32M   DB-A
tpcc2     NUMA     8      Xeon       128M  DB-A
tpcc3     NUMA     16     Xeon       128M  DB-A
tpcd Q5   NUMA     8      PPro       32M   DB-A
tpcd Q5   NUMA     8      Xeon       128M  DB-B
tpcd Q9   NUMA     16     Xeon       128M  DB-A
tpcd Q9   NUMA     8      Xeon       128M  DB-B

Benchmarks run include TPC-C, TPC-D, and customer benchmarks


Systems include PentiumPro and Xeon systems
The remote cache is 32M 4-way on the PentiumPro systems and 128M 4-way on the Xeon systems
Measurements include both a NUMA optimized OS and a nonoptimized OS
Measurements include two database systems
Processor Data Miss Rate
[Figure: bar chart of processor data misses as a percentage of all data references (0.00% to 1.20%), one bar per test configuration.]

Processor Data Miss Rates are 1% or less


The miss rates for the 4-quad configurations on the non-NUMA-optimized OS are very low due to spinning
Remote References
[Figure: stacked bar chart of remote references as a percentage of L2 misses (0% to 100%), split into RC Hit and RC Miss, one bar per test configuration. Each bar is labeled with its RC hit rate (53% to 89%) and its system RC bandwidth (58 MB/s to 372 MB/s).]

The remote reference percentage of L2 misses (RC Hit + RC Miss) ranges from about 10% to 80%
The remote cache hit rate (RC Hit/(RC Hit + RC Miss)) ranges from 53% to 89%
The remote cache satisfies 8% to 70% of the L2 misses
The SCI BW is reduced by 58 MB/s to 372 MB/s
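
The share of L2 misses satisfied by the remote cache is the product of the two quantities defined above: the remote reference fraction of L2 misses and the remote cache hit rate. A minimal sketch of that relationship follows. The function name and the two input pairs are illustrative assumptions chosen from within the reported ranges; they are not measured pairings from the slides.

```python
def l2_misses_satisfied_by_rc(remote_fraction, rc_hit_rate):
    """Fraction of all L2 misses that hit in the remote cache.

    remote_fraction: (RC Hit + RC Miss) / L2 misses
    rc_hit_rate:     RC Hit / (RC Hit + RC Miss)
    """
    if not (0.0 <= remote_fraction <= 1.0 and 0.0 <= rc_hit_rate <= 1.0):
        raise ValueError("both inputs must be fractions between 0 and 1")
    return remote_fraction * rc_hit_rate


# Hypothetical pairings taken from within the reported ranges
low = l2_misses_satisfied_by_rc(remote_fraction=0.10, rc_hit_rate=0.83)
high = l2_misses_satisfied_by_rc(remote_fraction=0.80, rc_hit_rate=0.89)

print(f"low remote reference rate:  {low:.1%} of L2 misses satisfied")
print(f"high remote reference rate: {high:.1%} of L2 misses satisfied")
```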
Remote Reference Components
[Figure: stacked bar chart of remote reference components (0% to 100%), one bar per test configuration. Legend: Rd Hit, RdInv Hit, Inv Hit, Wb Hit, Rd Miss, RdInv Miss Data, RdInv Miss Inv, Inv Miss.]

The remote reference rate is broken down into reference types with hits or misses
Reads are the most frequent reference type in most cases
Reads have a good hit rate
Invalidates have a poor hit rate
Simulated Relative Performance

[Figure: Relative Performance with a 10:1 Latency Ratio. Relative performance (0.00 to 1.80) with no RC and with RC for TPCC1 8 PPro, TPCC2 8 Xeon, TPCD Q5 8 PPro, TPCD Q9 16 Xeon, TPCD Q5 8 Xeon DB-B, and Cust3-3 2 PPro NonNUMA.]

The remote cache contributes an 18% to 54% performance increase with a 10:1 remote-to-local latency ratio
Remote cache even helps TPCC 8 Xeon, which has a 10% remote reference rate but a high L2 miss rate (1.3%)
TPCD Q5 8 Xeon DB-B gains the least from a remote cache due to its high communications component
Simulated Relative Performance

[Figure: Relative Performance with a 2:1 Latency Ratio. Relative performance (0.94 to 1.12) with no RC and with RC for TPCC1 8 PPro, TPCC2 8 Xeon, TPCD Q5 8 PPro, TPCD Q9 16 Xeon, TPCD Q5 8 Xeon DB-B, and Cust3-3 2 PPro NonNUMA.]

The remote cache gives a 3% to 10% performance increase with a 2:1 remote to local latency ratio
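
To see why the benefit shrinks as the latency ratio falls from 10:1 to 2:1, a simple average-latency model can help. This is only a sketch and is not the simulator behind the slides: it assumes a remote cache hit costs the same as a local memory access, measures latency in units of local latency, and uses hypothetical inputs (50% remote references, 70% RC hit rate). It models average L2 miss latency only, so overall performance gains would be smaller because only part of execution time is spent waiting on misses.

```python
def avg_miss_latency(remote_fraction, rc_hit_rate, ratio):
    """Average L2 miss latency in units of local memory latency.

    Assumption: an RC hit costs one local latency; an RC miss costs
    `ratio` local latencies. Pass rc_hit_rate=0.0 for "no remote cache".
    """
    local_part = (1.0 - remote_fraction) * 1.0
    remote_part = remote_fraction * (rc_hit_rate * 1.0 + (1.0 - rc_hit_rate) * ratio)
    return local_part + remote_part


def miss_latency_gain(remote_fraction, rc_hit_rate, ratio):
    """How many times lower the average miss latency is with an RC."""
    without_rc = avg_miss_latency(remote_fraction, 0.0, ratio)
    with_rc = avg_miss_latency(remote_fraction, rc_hit_rate, ratio)
    return without_rc / with_rc


# Hypothetical workload: half of L2 misses are remote, RC hit rate 70%
gain_10_to_1 = miss_latency_gain(0.5, 0.7, ratio=10.0)
gain_2_to_1 = miss_latency_gain(0.5, 0.7, ratio=2.0)

print(f"10:1 ratio: miss latency lower by {gain_10_to_1:.2f}x")
print(f" 2:1 ratio: miss latency lower by {gain_2_to_1:.2f}x")
```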
Simulated Relative Performance

[Figure: Remote to Local Ratio for Equal Performance. Remote-to-local ratio (0.00 to 12.00) with RC and with no RC for TPCC1 8 PPro, TPCC2 8 Xeon, TPCD Q5 8 PPro, TPCD Q9 16 Xeon, TPCD Q5 8 Xeon DB-B, and Cust3-3 2 PPro NonNUMA.]

A system with no remote cache would need a remote-to-local latency ratio of 2.5:1 to 5.4:1 to match the performance of a system with a remote cache and a 10:1 ratio
Simulated Relative Performance

[Figure: Remote to Local Ratio for Equal Performance. Remote-to-local ratio (0.00 to 7.00) with no RC and with RC for TPCC1 8 PPro, TPCC2 8 Xeon, TPCD Q5 8 PPro, TPCD Q9 16 Xeon, TPCD Q5 8 Xeon DB-B, and Cust3-3 2 PPro NonNUMA.]

A system with a remote cache can have a remote-to-local latency ratio of 2.7:1 to 6.2:1 and still match the performance of a system with no remote cache and a 2:1 ratio
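
The idea of an equal-performance ratio can be illustrated with a simple model. This sketch is not the simulation used for the slides and is not expected to reproduce their numbers. It assumes performance is equal when average L2 miss latency is equal, and that a remote cache hit costs one local latency. Under those assumptions the remote reference fraction cancels out, leaving only the RC hit rate h: a no-RC ratio r' matches an RC ratio r when r' = h + (1 - h) * r. The function names are illustrative.

```python
def equivalent_no_rc_ratio(rc_hit_rate, rc_ratio):
    """Ratio a system WITHOUT an RC needs to match one WITH an RC."""
    return rc_hit_rate + (1.0 - rc_hit_rate) * rc_ratio


def tolerable_rc_ratio(rc_hit_rate, no_rc_ratio):
    """Ratio a system WITH an RC can tolerate to match one WITHOUT."""
    if rc_hit_rate >= 1.0:
        raise ValueError("with a 100% hit rate any ratio is tolerable")
    return (no_rc_ratio - rc_hit_rate) / (1.0 - rc_hit_rate)


# Hypothetical hit rate at the low end of the measured range
needed = equivalent_no_rc_ratio(rc_hit_rate=0.53, rc_ratio=10.0)
allowed = tolerable_rc_ratio(rc_hit_rate=0.53, no_rc_ratio=2.0)

print(f"no-RC system needs about {needed:.2f}:1 to match RC at 10:1")
print(f"RC system can tolerate about {allowed:.2f}:1 to match no-RC at 2:1")
```

The two functions are inverses of each other, which mirrors how the two equal-performance slides ask the same question from opposite directions.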
Conclusions

Remote cache is effective on many commercial workloads


Remote cache can reduce the bandwidth requirements for the interconnect in a NUMA system
Systems with remote cache can tolerate a higher interconnect latency
Even workloads with a low remote reference rate can benefit from a remote cache
Remote cache can improve the performance of systems with a low remote-to-local latency ratio
Workloads with a high communications rate get less gain from remote cache