Chen - LReplay

LReplay: A Pending Period Based Deterministic Replay Scheme
ISCA-2010
Yunji Chen1,2 Weiwu Hu1,2 Tianshi Chen3,1 Ruiyang Wu1,2
cyj@ict.ac.cn
1 Institute of Computing Technology, Chinese Academy of Sciences

2 Loongson Technologies Corporation Limited
3 University of Science and Technology of China
Outline
 Introduction
 Theoretical Basis
 Application: Deterministic Replay
 Summary
03/07/20 ICT, CAS 2/25

Introduction
Logical Clock in Multiprocessor System
 Traditional parallel system
 No global clock in most parallel/distributed investigations
 Physical distance
 Logical clock and logical time order
 Lamport1978
 Logical clock meets many problems
 Observing O(pn)
 Recording O(n^2)
 Inferring O(2^n)

Introduction
Memory Consistency Verification

[Figure: graph over reads r1 to r7 and writes w1 to w7, with edges labeled PO and E; annotated "Cut off" by global clock]

Introduction
Scaling Down of TeraFLOPS Multiprocessor System
 1997: “room”
 2007: “refrigerator”
 2008: “washing machine”
 2009: “microwave oven”

Introduction
Global Clock
 A global clock can avoid many of these problems
 It is now common in CMPs
 Intel, AMD, Godson, ...
 Inaccuracy: a scalability issue
 Less than 50 ns (even for future many-core chips)
 Some technique is still needed to cope with the inaccuracy

Theoretical Basis
Global Clock Relaxation: Pending Period
 Intuitive idea: measure the precise perform time tp with the global clock
 Clock inaccuracy
 Involving many components
 Pending period of u: [ts(u), te(u)]
 ts(u) ≤ tp(u) ≤ te(u)
[Figure: time axis marking ts(u) (start time), tp(u) (perform time), and te(u) (end time)]
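
The pending period above can be modeled as a closed interval around an imprecise clock reading. The following is only a minimal sketch: the names `PendingPeriod` and `pending_period`, the use of cycles as the unit, and the symmetric error bound `max_error` are assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingPeriod:
    """Closed interval [ts, te] assumed to contain the perform time tp."""
    ts: int  # start time, in global-clock cycles (assumed unit)
    te: int  # end time, in global-clock cycles (assumed unit)

    def contains(self, tp: int) -> bool:
        # The defining property from the slide: ts <= tp <= te.
        return self.ts <= tp <= self.te


def pending_period(clock_reading: int, max_error: int) -> PendingPeriod:
    """Widen one clock reading by an assumed symmetric error bound."""
    if max_error < 0:
        raise ValueError("max_error must be non-negative")
    return PendingPeriod(clock_reading - max_error, clock_reading + max_error)
```

For example, `pending_period(1000, 50)` gives the interval [950, 1050]; any perform time within 50 cycles of the reading satisfies `contains`.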

Theoretical Basis
Physical Time Order

 There is a physical time order between two operations with disjoint pending periods
 Disjoint periods, ts(u) ≤ te(u) < ts(v) ≤ te(v): tp(u) ≤ te(u) < ts(v) ≤ tp(v), so u →T v
 Overlapping periods (ts(u), ts(v), te(u), te(v) in that order on the time axis): neither u →T v nor v →T u
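
The rule on this slide reduces to a single comparison of interval endpoints. A minimal sketch follows; the type `Op` and the function names are hypothetical and chosen only for this illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Op(NamedTuple):
    name: str
    ts: int  # start of the pending period
    te: int  # end of the pending period


def precedes(u: Op, v: Op) -> bool:
    """True when u's pending period ends strictly before v's begins."""
    return u.te < v.ts


def physical_time_order(u: Op, v: Op) -> Optional[Tuple[str, str]]:
    """Return (earlier, later) names, or None when the periods overlap."""
    if precedes(u, v):
        return (u.name, v.name)
    if precedes(v, u):
        return (v.name, u.name)
    return None  # overlapping periods: no physical time order either way
```

With `Op("u", 0, 10)` and `Op("v", 11, 20)` the periods are disjoint and u comes first; with `Op("v", 5, 15)` instead, the function returns `None`.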

Application: Deterministic Replay
Deterministic Replay
 Guarantee that the replay run behaves the same as the production run
 Many applications (e.g., debugging parallel programs)
 Memory race recording and replaying
 Software-based and hardware-assisted schemes

Application: Deterministic Replay
Challenge
 Industrial DFD (design for debug) guidelines
 Affect performance as little as possible
 Decouple DFD functionalities
 Low area consumption
 Small log size
 For industrial acceptance of replay, we should follow them
 Too stringent for current hardware-assisted deterministic replay schemes
 Should be relaxed for replay

Global Clock Based Deterministic Replay Scheme
 Record physical time order (pending periods) instead of the execution order itself
 >99% of execution orders are inferable from physical time order
 Low cost of recording global clock information
 E.g., 1 bit per 256 cycles (lossy compression)

Basic Idea of LReplay
[Figure: instruction streams of P1 (u120, u121, ...) and P2 (v207, v208, ..., v220) side by side. The labels "te<200" and "ts>220" mark where physical time order applies; one remaining pair is marked as the only non-inferable execution order.]
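
One way to read this slide: conflicting operations on different cores are ordered by physical time whenever their pending periods are disjoint, and only the remaining pairs need an explicit log entry. The sketch below follows that reading. The `MemOp` fields, the conflict test (same address, different cores, at least one write), and the pairwise O(n^2) scan are assumptions for illustration, not the hardware mechanism.

```python
from typing import List, NamedTuple, Tuple


class MemOp(NamedTuple):
    core: int
    inst: int       # per-core memory instruction number
    addr: int
    is_write: bool
    ts: int         # start of the pending period
    te: int         # end of the pending period


def conflicts(u: MemOp, v: MemOp) -> bool:
    """Assumed race condition: different cores, same address, one write."""
    return u.core != v.core and u.addr == v.addr and (u.is_write or v.is_write)


def split_orders(
    ops: List[MemOp],
) -> Tuple[List[Tuple[MemOp, MemOp]], List[Tuple[MemOp, MemOp]]]:
    """Split conflicting pairs into inferable (earlier, later) and non-inferable."""
    inferable, non_inferable = [], []
    for i, u in enumerate(ops):
        for v in ops[i + 1:]:
            if not conflicts(u, v):
                continue
            if u.te < v.ts:
                inferable.append((u, v))      # u before v by physical time
            elif v.te < u.ts:
                inferable.append((v, u))      # v before u by physical time
            else:
                non_inferable.append((u, v))  # overlap: must be logged
    return inferable, non_inferable
```

Only the pairs in the second list would need to be recorded; everything in the first list can be reconstructed from the pending periods alone.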
Lossy Compression
[Figure: mem_inst_cnt value (y-axis, 0 to 1536) against time in cycles (x-axis, 0 to 2560, ticks every 256). Sampled counts PC 165, 278, 345, 510, and 700 are shown with rounded values (mem_cnt_rnd / PC_rnd) of 0, 256, or 512 and per-sample increments of 0*256 or 1*256.]
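
The labels in this figure suggest that each sampled count is rounded to a multiple of 256 and that only the change in the rounded value, 0*256 or 1*256, is stored as a single bit. The sketch below follows that reading. Rounding down, the function names, and the limit of one 256-step per sample (which holds if at most one memory instruction completes per cycle) are assumptions, not details confirmed by the slides.

```python
from typing import List

SAMPLE_PERIOD = 256  # cycles between samples, as stated on the slides


def encode(counts: List[int], unit: int = SAMPLE_PERIOD) -> List[int]:
    """Turn per-sample memory-instruction counts into one bit per sample."""
    bits = []
    prev_rounded = 0
    for count in counts:
        rounded = (count // unit) * unit  # assumed: round down to a multiple
        step = (rounded - prev_rounded) // unit
        if step not in (0, 1):
            raise ValueError("rounded count must grow by 0 or 1 unit per sample")
        bits.append(step)
        prev_rounded = rounded
    return bits


def decode(bits: List[int], unit: int = SAMPLE_PERIOD) -> List[int]:
    """Recover the rounded counts; the low-order detail is lost."""
    rounded, total = [], 0
    for bit in bits:
        total += bit * unit
        rounded.append(total)
    return rounded
```

With the counts readable in the figure, `encode([165, 278, 345, 510, 700])` yields `[0, 1, 0, 0, 1]`, and decoding gives `[0, 256, 256, 256, 512]`. The exact counts are not recoverable, which is what makes the compression lossy.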
NEL: Log for Non-inferable Execution Order
 When to identify
 When a memory instruction u misses in the L1 cache
 How to identify
 Use a CAM to compare the addresses of recent stores
 How to record
 Record the instruction numbers of the operations
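
The three steps above can be mimicked in software with a small associative table. This is a rough sketch only: the class `StoreCam`, its fixed capacity as a stand-in for "recent", the rule that only stores from other cores match, and the log entry format are all assumptions introduced here, not the hardware design.

```python
from collections import deque
from typing import Deque, List, NamedTuple, Tuple


class Store(NamedTuple):
    core: int
    inst: int  # memory instruction number on that core
    addr: int


class StoreCam:
    """Software stand-in for a CAM holding the most recent stores."""

    def __init__(self, capacity: int = 4) -> None:
        # Oldest entries fall out automatically once capacity is reached.
        self._entries: Deque[Store] = deque(maxlen=capacity)

    def record_store(self, store: Store) -> None:
        self._entries.append(store)

    def lookup(self, addr: int, core: int) -> List[Store]:
        """Associative match on address, ignoring the requesting core."""
        return [s for s in self._entries if s.addr == addr and s.core != core]


NelEntry = Tuple[Tuple[int, int], Tuple[int, int]]


def on_l1_miss(cam: StoreCam, core: int, inst: int, addr: int,
               nel: List[NelEntry]) -> None:
    """On an L1 miss, log (store, missing instruction) number pairs."""
    for store in cam.lookup(addr, core):
        nel.append(((store.core, store.inst), (core, inst)))
```

A real implementation would presumably also check whether the order is already inferable from the pending periods before writing a log entry; that check is left out here to keep the example short.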

Overview of LReplay
 LPU
 Generating log
 JTAG port
 Low design cost
 Transporting log
 Star topology debugging network
 Low verification cost
 No performance cost
 Sending memory instruction information to LPU
 Trivial modification for existing CMP design
 Only processor core modified
[Figure: LPU chip with TDI/TDO pins and record logic producing the PPL, MCL, and NEL; four RAMs (Ram0 to Ram3, Addr/Value) and four CAMs (CAM0 to CAM3) receive memory instruction information (Type/Addr/Cnt/Hit/Value) from Core 0 to Core 3 of the existing CMP design, which has a Switch+Mesh interconnection and L2 banks 0 to 3]

Proportion of Non-inferable Execution Orders (<1%)
[Figure: proportion of non-inferable execution orders; y-axis ticks at 0.01%, 0.10%, 1.00%, and 10.00%]

Experimental Results: 0.55 B/K-Inst for SC
[Figure: log size of LReplay (8 cores, sequential consistency, 256-cycle sample period). Series: NEL, PPL. y-axis: log size (B/K-Inst), 0 to 1. x-axis: FFT, barnes, cholesky, radix, water, ocean, lu, average.]

Experimental Results: 0.85 B/K-Inst for Relaxed MC
[Figure: log size of LReplay (8 cores, Godson-3 consistency, 256-cycle sample period). Series: MCL, NEL, PPL. y-axis: log size (B/K-Inst), 0 to 2.4. x-axis: FFT, barnes, cholesky, radix, water, ocean, lu, average.]


Summary
 Logical clock
 The traditional theoretical basis of parallel investigations
 Global clock
 Adopted in most state-of-the-art CMPs
 A crucial difference between CMPs and traditional parallel systems
 The inaccuracy of the global clock must also be considered for scalability
 A global clock can significantly simplify recording and inferring memory race information in CMPs
 An order-of-magnitude smaller log size is achieved with the global clock for deterministic replay, without performance loss
 More applications of the global clock can be expected

References
 A theory of global clock
 Yunji Chen, Tianshi Chen, and Weiwu Hu. “Global Clock, Physical Time Order and Pending Period Analysis in Multiprocessor Systems”. CoRR abs/0903.4961, 2009. (http://arxiv.org/pdf/0903.4961)
 Implementation of global clock
 Menghao Su, Yunji Chen, Xiang Gao. “A General Method to Make Multi-clock System Deterministic”. In Proceedings of Design, Automation and Test in Europe (DATE’10), 2010.
 Several applications of global clock
 Yunji Chen, Yi Lv, Weiwu Hu, Tianshi Chen, Haihua Shen, Pengyu Wang, and Hong Pan. “Fast Complete Memory Consistency Verification”. In Proceedings of the 15th International Symposium on High-Performance Computer Architecture (HPCA’09), 2009.
 Yunji Chen, Weiwu Hu, Tianshi Chen, and Ruiyang Wu. “LReplay: A Pending Period Based Deterministic Replay Scheme”. In Proceedings of the 37th International Symposium on Computer Architecture (ISCA’10), 2010.
 Yunji Chen, Tianshi Chen, and Ruiyang Wu. “Complete Data Race Detection Under Pending Period Restrictions”. Godson Technique Report.
 Godson-3 processor
 Weiwu Hu, Jian Wang, Xiang Gao, Yunji Chen, Qi Liu, and Guojie Li. “Godson-3: A Scalable Multicore RISC Processor with x86 Emulation”. IEEE Micro, Vol. 29, No. 2, 2009.
 Weiwu Hu and Yunji Chen. “GS464V: A High-Performance Low-Power XPU with 512-Bit Vector Extension”. In Proceedings of the 22nd IEEE Symposium on High-Performance Chips (HotChips’10), 2010.

Thank you very much
Discussion is welcome after the talk
