
REMO review ---- Binod Kumar

Strong points:
The paper presents an approach to redundant execution for fault tolerance with minimal overhead. Errors are detected by exploiting both spatial and temporal redundancy: spatial redundancy is provided by a simple in-order checker module, and temporal redundancy is achieved by re-computation in the original functional units. Although the paper lacks novelty, since this idea has already been attempted in some form or another, its contribution lies in keeping the area and power overheads to a minimum. Since re-execution of instructions is initiated only after the dependencies between them are resolved, the redundant execution is expected to incur a very low performance penalty. The authors claim that the proposed technique has very low detection latency and a degree of fault coverage comparable to a fully dual-modular-redundant architecture. The key results of the proposed technique are an area increase of only 0.4%, a power overhead of about 9%, and a negligible performance penalty during fault-free runs (when no recovery is needed).
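
As a rough illustration of the detect-then-replay flow described above, the following minimal C sketch compares the main pipeline's result against a simple in-order checker (spatial redundancy) and, on a mismatch, replays the operation in the original functional unit (temporal redundancy). All names and the flow itself are illustrative assumptions, not the paper's actual microarchitecture.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch of the detect-then-replay flow: an in-order checker
 * re-verifies each result (spatial redundancy); on a mismatch the operation
 * is re-executed in the original functional unit (temporal redundancy). */

typedef uint64_t (*func_unit_t)(uint64_t, uint64_t);

static uint64_t adder_fu(uint64_t a, uint64_t b) { return a + b; }

/* Spatial check: the in-order checker recomputes and compares the result. */
static int checker_verify(func_unit_t checker_fu, uint64_t a, uint64_t b,
                          uint64_t main_result) {
    return checker_fu(a, b) == main_result;
}

/* Temporal check: replay on the same functional unit, assuming any
 * transient fault has decayed by replay time. */
static uint64_t replay_in_original_fu(func_unit_t fu, uint64_t a, uint64_t b) {
    return fu(a, b);
}

int main(void) {
    uint64_t a = 40, b = 2;
    uint64_t main_result = adder_fu(a, b) ^ 0x4; /* simulate a single bit flip */

    if (!checker_verify(adder_fu, a, b, main_result)) {
        uint64_t replayed = replay_in_original_fu(adder_fu, a, b);
        printf("mismatch detected; replayed result = %llu\n",
               (unsigned long long)replayed);
    }
    return 0;
}
```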

Weak points:
It is a common issue with many fault-tolerant architecture papers that they do not perform fault-injection experiments to evaluate their proposal, and this paper suffers from the same shortcoming. The authors claim that "the simulation results show almost full soft error coverage..."; without a fault-injection experiment this claim is not very convincing, although the authors do discuss three fault-occurrence cases in Section IV. The assumption that the instruction decoder lies outside the SOR (sphere of replication) leaves open the possibility that errors in the decoder have a severe impact. How does fault recovery proceed in such a scenario? The authors consider fault coverage only for single bit flips. Although the single bit flip is an adequate model for soft errors, the authors should still comment on how the architecture behaves under double or triple bit flips.

Disagreement:
The reuse of same unit for temporal redundancy from performance perspective may not always be
helpful for the case when programs have a large fraction of floating-point instructions. The
assumption that memory state can be recovered after 1 billion instructions is a very optimistic one.
The authors state that, A single bit flip in any of the field of ROB would result in incorrect
result of either the verifier or the main OOO processor but not both. For double-bit flips, incorrect
result may appear in both. How to deal with such cases?
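
To make this concern concrete, the following hypothetical C sketch shows why comparison-based detection can miss the double-flip case: if both the main core's and the verifier's copies of an operand are corrupted in the same bit position, the two results agree on the same wrong value and the check passes. The operand values and flip positions are arbitrary assumptions, not taken from the paper.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t src1 = 100, src2 = 23;           /* golden ROB operand fields */

    /* Single-bit-flip case: only the main pipeline's copy is corrupted,
     * so the comparison against the verifier catches it. */
    uint64_t main_res  = (src1 ^ 0x8) + src2; /* bit 3 flipped */
    uint64_t check_res = src1 + src2;
    printf("single flip: %s\n", main_res == check_res ? "escapes" : "detected");

    /* Double-flip case: the copies read by the main core and by the
     * verifier are each corrupted in the same bit position, so both
     * results are wrong but equal and the error escapes detection. */
    uint64_t main_op  = src1 ^ 0x8;           /* flip in the main core's copy */
    uint64_t check_op = src1 ^ 0x8;           /* flip in the verifier's copy  */
    main_res  = main_op + src2;
    check_res = check_op + src2;
    printf("double flip: %s\n", main_res == check_res ? "escapes" : "detected");

    return 0;
}
```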

Suggestions for improvement:
The authors should have experimented with different replay buffer sizes and reported the impact on the performance penalty. A fault-injection experiment should be performed for an accurate estimation of fault coverage, as sketched below. A quantitative comparison should be made with state-of-the-art schemes. Even a comparison with DIVA (which uses a complete in-order core for the verifier part) would let the authors bring out the contribution of the time-replay part alone to fault tolerance.
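
As a sketch of the kind of fault-injection campaign being asked for, the following C program flips a random bit either in the main pipeline's result (inside the sphere of replication, where the checker's comparison can catch it) or in a shared operand (outside it, where both copies compute the same wrong value and the error is silent), and tallies detected versus silent outcomes. The injection sites, the 90/10 split, and the simple adder model are all illustrative assumptions; a real campaign would inject into an RTL or microarchitectural simulator.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* Flip one random bit of a 64-bit value. */
static uint64_t flip_random_bit(uint64_t v) {
    return v ^ (1ULL << (rand() % 64));
}

int main(void) {
    const int trials = 100000;
    int detected = 0, silent = 0;
    srand(1);

    for (int i = 0; i < trials; i++) {
        uint64_t a = rand(), b = rand();
        uint64_t main_res, check_res;

        if (rand() % 10 != 0) {
            /* ~90% of injections: the fault hits the main pipeline's
             * result, which the checker's comparison can detect. */
            main_res  = flip_random_bit(a + b);
            check_res = a + b;
        } else {
            /* ~10% of injections: the fault hits a shared operand outside
             * the sphere of replication, so both copies agree on the
             * same wrong value and the comparison cannot flag it. */
            uint64_t a_bad = flip_random_bit(a);
            main_res  = a_bad + b;
            check_res = a_bad + b;
        }

        if (main_res != check_res) detected++;
        else silent++;
    }
    printf("detected: %d, silent corruptions: %d\n", detected, silent);
    return 0;
}
```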

Points which are not clear:
How do the authors calculate or estimate that the checkpoint overhead ranges from 30 to 50 cycles? In the case of temporal redundancy against single bit flips in a functional unit, the authors claim that "within this time span even multi-cycle fault is expected to decay..."; what is the reasoning behind that claim? How is performance impacted if the transient fault has not decayed?

Points to be discussed in class:
Are the read ports to the ROB and ARF sufficient for the verifier's accesses? This doubt arises because, during re-execution, the verifier reads its operands from the ROB and ARF.
The paper states that checkpoints are taken every one billion instructions. While it is certainly possible to restore the architectural state that far back, restoring the memory state that far back may not be feasible.
The recovery mechanism should be explained more elaborately.
