LReplay: A Pending Period Based Deterministic Replay Scheme

ISCA-2010

Yunji Chen (1,2), Weiwu Hu (1,2), Tianshi Chen (3,1), Ruiyang Wu (1,2)

cyj@ict.ac.cn

1 Institute of Computing Technology, Chinese Academy of Sciences


2 Loongson Technologies Corporation Limited
3 University of Science and Technology of China
Outline
 Introduction
 Theoretical Basis
 Application: Deterministic Replay
 Summary

03/07/20 ICT, CAS 2/25




Introduction

Logical Clock in Multiprocessor System
 Traditional parallel systems
 No global clock in most parallel/distributed investigations
 Physical distance
 Logical clock and logical time order
 Lamport, 1978
 Logical clocks run into many problems
 Observing: O(pn)
 Recording: O(n^2)
 Inferring: O(2^n)


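Memory Consistency Verification
[Figure: read and write operations (r1 to r7, w1 to w7) of two processors, connected by PO (program order) edges within a processor and E edges between processors; orders between operations that are far apart in time are "cut off" by the global clock.]
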
Scaling Down of TeraFLOPS Multiprocessor System
 1997: “room”
 2007: “refrigerator”
 2008: “washing machine”
 2009: “microwave oven”

Global Clock
 A global clock can avoid many of these problems
 Global clocks are now common in CMPs
 Intel, AMD, Godson, ...
 Inaccuracy: a scalability issue
 Less than 50 ns (even for future many-core chips)
 Some technique is still needed to cope with it




Theoretical Basis
Global Clock Relaxation: Pending Period
 Intuitive idea: measure the precise perform time tp with the global clock
 Clock inaccuracy
 Involving many components
 Pending period of u: [ts(u), te(u)]
 ts(u) ≤ tp(u) ≤ te(u)
 ts(u): start time; tp(u): perform time; te(u): end time


Physical Time Order
 There is a physical time order between two operations with disjoint pending periods
 Disjoint periods (timeline order ts(u), te(u), ts(v), te(v)):
tp(u) ≤ te(u) < ts(v) ≤ tp(v), so u →T v (u precedes v in physical time order)
 Overlapping periods (timeline order ts(u), ts(v), te(u), te(v)):
neither u →T v nor v →T u
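
The rule for disjoint and overlapping pending periods can be sketched in a few lines of Python. This is only an illustration: the `PendingPeriod` class and the function names are assumptions made for this sketch, not part of LReplay, and times are taken to be global-clock cycle counts.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PendingPeriod:
    """Interval [ts, te] that is guaranteed to contain the perform time tp."""
    name: str
    ts: int  # start time, in global-clock cycles (assumed unit)
    te: int  # end time, in global-clock cycles (assumed unit)

    def __post_init__(self) -> None:
        if self.ts > self.te:
            raise ValueError("a pending period requires ts <= te")


def precedes(u: PendingPeriod, v: PendingPeriod) -> bool:
    """True if u comes before v in physical time order, i.e. te(u) < ts(v)."""
    return u.te < v.ts


def physical_time_order(u: PendingPeriod,
                        v: PendingPeriod) -> Optional[Tuple[str, str]]:
    """Return (earlier, later), or None when the pending periods overlap."""
    if precedes(u, v):
        return (u.name, v.name)
    if precedes(v, u):
        return (v.name, u.name)
    return None


u = PendingPeriod("u", ts=100, te=180)
v = PendingPeriod("v", ts=230, te=260)
w = PendingPeriod("w", ts=170, te=240)

print(physical_time_order(u, v))  # disjoint periods: an order exists
print(physical_time_order(u, w))  # overlapping periods: no order
```

Only the interval end points are compared, so the exact perform time tp never has to be known.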




Application: Deterministic Replay

Deterministic Replay
 Guarantees that the replay run behaves the same as the production run
 Many applications (e.g., debugging parallel programs)
 Memory race recording and replaying
 Software-based and hardware-assisted schemes


Challenge
 Industrial DFD (design for debug) guidelines
 Affect performance as little as possible
 Decouple DFD functionalities
 Low area consumption
 Small log size
 For industrial acceptance of replay, we should follow them
 Too stringent for current hardware-assisted deterministic replay schemes
 Should be relaxed for replay


Global Clock Based Deterministic Replay Scheme
 Record the physical time order (pending periods) instead of the execution order itself
 >99% of execution orders are inferable from the physical time order
 Low cost of recording global clock information
 E.g., 1 bit per 256 cycles (lossy compression)


Basic Idea of LReplay
[Figure: instruction streams of two processors, P1 (..., u120, u121, ..., u131, ...) and P2 (..., v207, v208, ..., v220, ...). Annotations: "te<200", "ts>220", "Physical time order", and "The only one non-inferable execution order".]

Lossy Compression
[Figure: mem_inst_cnt and its rounded value mem_cnt_rnd (y-axis, 0 to 1536) versus time in cycles (x-axis, 0 to 2560, ticks every 256 cycles). Labeled points: PC 165, 278, 345, 510, 700 with PC_rnd 0, 256, 256, 256, 512; each step is marked "Increment 0*256" or "Increment 1*256".]
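
One reading of the lossy compression figure, sketched in Python: at every 256-cycle sample the memory instruction count is rounded down to a multiple of 256, and a single bit records whether the rounded value grew by 0*256 or 1*256 since the previous sample. The function names are hypothetical, and the sketch assumes the count grows by at most 256 between samples (at most one memory instruction per cycle), which the slides do not state.

```python
PERIOD = 256  # sample period in cycles; also the rounding granularity


def encode(counts):
    """Turn sampled memory-instruction counts into one bit per sample."""
    bits = []
    prev_rnd = 0
    for cnt in counts:
        rnd = (cnt // PERIOD) * PERIOD     # round down to a multiple of 256
        step = (rnd - prev_rnd) // PERIOD  # 0 or 1 under the stated assumption
        if step not in (0, 1):
            raise ValueError("count changed by more than one period per sample")
        bits.append(step)
        prev_rnd = rnd
    return bits


def decode(bits):
    """Rebuild the rounded counts; the exact counts are not recoverable."""
    rounded = []
    rnd = 0
    for bit in bits:
        rnd += bit * PERIOD
        rounded.append(rnd)
    return rounded


samples = [165, 278, 345, 510, 700]  # counts labeled in the figure
bits = encode(samples)
print(bits)          # [0, 1, 0, 0, 1]
print(decode(bits))  # [0, 256, 256, 256, 512]
```

Each decoded value is within 256 instructions of the sampled count, which is the sense in which the compression is lossy.
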
NEL: Log for Non-inferable Execution Order
 When to identify
 When a memory instruction u misses in the L1 cache
 How to identify
 Use a CAM to compare the addresses of recent stores
 How to record
 Record the instruction numbers of the operations
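
The three NEL steps can be modeled in software as a rough sketch. Everything below is an assumption made for illustration: the `MemOp` fields, the `RecentStoreCam` class, its 16-entry size, and the choice to log a pair only when the two pending periods overlap. The actual hardware is not specified at this level in the slides.

```python
from collections import deque
from typing import NamedTuple


class MemOp(NamedTuple):
    core: int
    inst_no: int  # per-core memory instruction number
    addr: int
    ts: int       # pending period start, in cycles
    te: int       # pending period end, in cycles


class RecentStoreCam:
    """Software stand-in for a CAM that holds the most recent stores."""

    def __init__(self, entries: int = 16) -> None:
        self._stores = deque(maxlen=entries)  # oldest entry is evicted first

    def record_store(self, store: MemOp) -> None:
        self._stores.append(store)

    def match(self, addr: int):
        """Return every held store to the same address."""
        return [s for s in self._stores if s.addr == addr]


def on_l1_miss(u: MemOp, cam: RecentStoreCam, nel: list) -> None:
    """Log (store, u) pairs whose order physical time cannot decide."""
    for s in cam.match(u.addr):
        if s.core == u.core:
            continue  # same core: skipped in this sketch
        overlap = not (s.te < u.ts or u.te < s.ts)
        if overlap:
            nel.append(((s.core, s.inst_no), (u.core, u.inst_no)))


cam = RecentStoreCam()
cam.record_store(MemOp(core=1, inst_no=212, addr=0x40, ts=190, te=215))
cam.record_store(MemOp(core=1, inst_no=100, addr=0x40, ts=10, te=30))

nel = []
on_l1_miss(MemOp(core=0, inst_no=125, addr=0x40, ts=205, te=225), cam, nel)
print(nel)  # [((1, 212), (0, 125))]
```

The older store (instruction 100) is not logged because its pending period ends before the missing instruction's period starts, so that order is already implied by physical time order.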


Overview of LReplay
 LPU
 Generating log
 JTAG port
 Transporting log
 Star topology debugging network
 Sending memory instruction information to LPU
 Trivial modification for existing CMP design
 Only processor core modified
 Low design cost
 Low verification cost
 No performance cost
[Figure: LPU chip with TDI/TDO ports and record logic producing the PPL, MCL, and NEL logs; four RAM/CAM pairs (Ram0 to Ram3, CAM0 to CAM3), each receiving memory instruction information (Type/Addr/Cnt/Hit/Value) from one core; below, the existing CMP design with Core 0 to Core 3, a switch+mesh interconnection, and L2 bank 0 to L2 bank 3.]


Proportion of Non-inferable Execution Orders (<1%)
[Figure: proportion of non-inferable execution orders; logarithmic y-axis from 0.01% to 10.00%.]


Experimental Results: 0.55 B/K-Inst for SC
[Figure: log size of LReplay in B/K-Inst (8 cores, sequential consistency, 256-cycle sample period), split into NEL and PPL, for FFT, barnes, cholesky, radix, water, ocean, lu, and the average; y-axis from 0 to 1.]

Experimental Results: 0.85 B/K-Inst for Relaxed MC
[Figure: log size of LReplay in B/K-Inst (8 cores, Godson-3 consistency, 256-cycle sample period), split into MCL, NEL, and PPL, for FFT, barnes, cholesky, radix, water, ocean, lu, and the average; y-axis from 0 to 2.4.]


Summary
 Logical clock
 The traditional theoretical basis of parallel investigations
 Global clock
 Adopted in most state-of-the-art CMPs
 A crucial difference between CMPs and traditional parallel systems
 The inaccuracy of the global clock must also be considered for scalability
 A global clock can significantly simplify recording and inferring memory race information in a CMP
 An order of magnitude smaller log size is achieved with a global clock for deterministic replay, without performance loss
 More applications of the global clock can be expected




Thank you very much
Discussion is welcome after the talk
