Replay Scheme
ISCA-2010
cyj@ict.ac.cn
[Figure: memory operations w5, r5, r6, w6, r7, w7 linked by program-order (PO) edges]
03/07/20 ICT, CAS 5/25
Introduction
Scaling Down of TeraFLOPS Multiprocessor Systems
[Figure: the size of a TeraFLOPS system scaling down over time: 1997, a "room"; 2007, a "refrigerator"; 2008, a "washing machine"; 2009, a "microwave oven"]
Introduction
Global Clock
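With ts, te, and tp denoting the start, end, and performed times of a memory operation on the global clock:
$$t_p(u) \le t_e(u) < t_s(v) \le t_p(v) \implies u \xrightarrow{T} v$$
That is, if u's period ends before v's begins, u is performed before v.
[Figure: timeline with ts(u), ts(v), te(u), and te(v) marked]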
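The ordering rule on this slide can be sketched as a small check. This is a hypothetical illustration: the names MemOp, surely_before, and physical_order are invented here, and we assume (as the inequality implies) that each operation's performed time tp lies somewhere in its interval [ts, te], so only non-overlapping intervals fix the order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """A memory operation observed on a global clock.

    Assumption: its (unobserved) performed time tp satisfies ts <= tp <= te.
    """
    name: str
    ts: int  # start time in global-clock cycles
    te: int  # end time in global-clock cycles


def surely_before(u: MemOp, v: MemOp) -> bool:
    """True when tp(u) <= te(u) < ts(v) <= tp(v) is guaranteed."""
    return u.te < v.ts


def physical_order(u: MemOp, v: MemOp) -> str:
    """Report the order implied by the rule, if any."""
    if surely_before(u, v):
        return f"{u.name} -> {v.name}"
    if surely_before(v, u):
        return f"{v.name} -> {u.name}"
    return "undetermined (intervals overlap)"


u = MemOp("u", ts=10, te=20)
v = MemOp("v", ts=25, te=30)
w = MemOp("w", ts=15, te=40)
print(physical_order(u, v))  # u -> v
print(physical_order(u, w))  # undetermined (intervals overlap)
```

When the intervals overlap, the global clock alone cannot tell which operation was performed first; the rule only gives a sufficient condition.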
Deterministic Replay
Guarantees that the replay run behaves exactly as the production run
Many applications (e.g., debugging parallel programs)
Memory race recording and replaying
Software-based and hardware-assisted schemes
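A toy simulation of why recording memory races enables deterministic replay. Everything here (THREADS, run, record_and_replay) is hypothetical and is not LReplay's mechanism: it logs every scheduling decision, far more than a real recorder needs, but it shows that re-imposing the recorded interleaving reproduces the production result even when the racy program can end in different states.

```python
import random

# Hypothetical program: two threads do read-modify-write on shared x
# through a private register, so their unsynchronized steps race on x.
THREADS = {
    0: [("read", None), ("add", 1), ("write", None)],
    1: [("read", None), ("mul", 2), ("write", None)],
}


def run(pick_next, log=None):
    """Interleave THREADS step by step; pick_next chooses the next thread id.

    If log is given, each chosen thread id is appended to it, so the full
    interleaving (and hence the order of racing accesses) is recorded.
    """
    x = 1                               # shared memory location
    regs = {tid: 0 for tid in THREADS}  # per-thread private register
    pcs = {tid: 0 for tid in THREADS}   # per-thread program counter
    while True:
        ready = [t for t, ops in THREADS.items() if pcs[t] < len(ops)]
        if not ready:
            return x
        tid = pick_next(ready)
        if log is not None:
            log.append(tid)
        op, arg = THREADS[tid][pcs[tid]]
        if op == "read":
            regs[tid] = x
        elif op == "add":
            regs[tid] += arg
        elif op == "mul":
            regs[tid] *= arg
        else:  # "write"
            x = regs[tid]
        pcs[tid] += 1


def record_and_replay(seed):
    """Production run under a random schedule, then a replay driven by its log."""
    rng = random.Random(seed)
    log = []
    production = run(lambda ready: rng.choice(ready), log)
    order = iter(log)
    replayed = run(lambda ready: next(order))
    return production, replayed


for seed in range(5):
    print(seed, record_and_replay(seed))
```

Different seeds can yield x = 2, 3, or 4, but each replay matches its own production run.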
Challenge
Industrial DFD (design-for-debug) guidelines
Minimal impact on performance
Decoupled DFD functionality
Low area consumption
Low log size
For replay to gain industrial acceptance, it should follow these guidelines
Too stringent for current hardware-assisted
deterministic replay schemes
Should be relaxed for replay
Lossy Compression
[Figure: memory instruction count mem_inst_cnt (PC) and its rounded value mem_cnt_rnd (PC_rnd) plotted against time (cycle); sample points are labeled PC 165, 278, 345, 510, and 700, and PC_rnd advances in increments recorded as multiples of 256 (e.g., Increment 1*256, Increment 0*256)]
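One plausible reading of the figure, sketched under that assumption rather than as LReplay's actual encoding: a non-decreasing counter is sampled periodically, each sample is rounded down to a multiple of 256, and only the increment of the rounded value, in units of 256, is logged. SAMPLE, compress, and decompress are hypothetical names, and the inputs reuse the PC labels from the figure purely as example numbers.

```python
SAMPLE = 256  # rounding granularity, taken from the 256 steps in the figure


def compress(samples):
    """Lossy-compress a non-decreasing counter.

    Each value is rounded down to a multiple of SAMPLE and only the change in
    the rounded value, counted in units of SAMPLE, is logged.
    """
    log, prev_rnd = [], 0
    for value in samples:
        rnd = (value // SAMPLE) * SAMPLE
        log.append((rnd - prev_rnd) // SAMPLE)
        prev_rnd = rnd
    return log


def decompress(log):
    """Rebuild the rounded values (not the exact ones) from the increments."""
    values, acc = [], 0
    for inc in log:
        acc += inc * SAMPLE
        values.append(acc)
    return values


samples = [165, 278, 345, 510, 700]
log = compress(samples)
print(log)              # [0, 1, 0, 0, 1]
print(decompress(log))  # [0, 256, 256, 256, 512]
```

The loss is the low 8 bits of each sample; in exchange, each log entry is a small increment instead of a full counter value.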
Deterministic Replay
Overview of LReplay
Record logic: generating the log
JTAG port: transporting the log
Low design cost
Star topology debugging
Low verification cost
[Figure: LPU chip with JTAG pins TDI and TDO; the LPU includes PPL, MCL, and NEL log components, four RAMs (Ram0 to Ram3, each with Addr and Value fields), and four CAMs (CAM0 to CAM3)]
[Figure: proportion per benchmark on a log scale (0.01% to 10.00%) for FFT, barnes, cholesky, radix, water, ocean, lu, and average]
Deterministic Replay
Experimental Results for Relaxed Memory Consistency (MC) (0.85 B/K-Inst)
[Figure: log size of LReplay in B/K-Inst (bytes per thousand instructions), broken down into MCL, NEL, and PPL, for FFT, barnes, cholesky, radix, water, ocean, lu, and average; 8 cores, Godson-3 consistency, 256-cycle sample period]
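To make the B/K-Inst unit concrete, the log rate converts to log bandwidth as bytes per second = (B/K-Inst) × (instructions per second) / 1000. The sketch below applies this to the 0.85 B/K-Inst figure quoted on the slide; the machine throughput (8 cores each retiring 10^9 instructions per second) is a purely hypothetical assumption, not a measured Godson-3 number.

```python
def log_bandwidth(bytes_per_kinst: float, inst_per_second: float) -> float:
    """Bytes of log produced per second for a rate given in B/K-Inst."""
    return bytes_per_kinst * inst_per_second / 1000.0


# Hypothetical throughput: 8 cores x 1e9 instructions per second each.
rate = log_bandwidth(0.85, 8 * 1e9)
print(f"{rate / 1e6:.1f} MB/s")  # 6.8 MB/s
```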
Outline
Introduction
Theoretical Basis
Application: Deterministic Replay
Summary