Strong points:
The paper presents an approach to redundant execution for fault tolerance with minimal
overhead. Errors are detected by exploiting spatial and temporal redundancy: spatial redundancy is
provided by a simple in-order checker module, and temporal redundancy is achieved by
re-computation in the original functional units. Although the paper lacks novelty, as this idea has
been attempted before in one form or another, its contribution lies in keeping the area, power, and
performance overheads to a minimum. Since re-execution of instructions is initiated only after the
dependencies between them are resolved, the redundant execution incurs a very low performance
penalty. The authors claim that the proposed technique has very low detection latency and a degree
of fault coverage comparable to a fully dual-modular-redundant architecture. The key results of the
proposed technique are an area increase of only 0.4%, a power overhead of about 9%, and a negligible
performance penalty during fault-free runs (i.e., when no recovery is needed).
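The temporal-redundancy scheme summarized above can be sketched in a few lines. This is a hypothetical toy model, not the paper's implementation: a result is computed once, possibly corrupted by a transient fault, then recomputed in the same (toy) functional unit and compared.

```python
def alu(op, a, b):
    # Toy functional unit standing in for the processor's execution unit.
    ops = {"add": lambda x, y: x + y,
           "sub": lambda x, y: x - y}
    return ops[op](a, b)

def execute_with_check(op, a, b, inject_fault=False):
    """Execute once, optionally corrupt the result to model a transient
    fault, then re-execute in the same unit (temporal redundancy) and
    compare the two results."""
    result = alu(op, a, b)
    if inject_fault:
        result ^= 1 << 3  # flip one bit of the first result
    recomputed = alu(op, a, b)  # re-execution after dependencies resolve
    fault_detected = (result != recomputed)
    return result, fault_detected

print(execute_with_check("add", 5, 7))                     # fault-free run
print(execute_with_check("add", 5, 7, inject_fault=True))  # flip is caught
```

The sketch also shows where the low overhead comes from: no second functional unit is needed, only a second (delayed) pass through the existing one plus a comparison.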
Weak points:
It is a common shortcoming of fault-tolerant architecture papers that they do not perform
fault-injection experiments to evaluate their proposals, and this paper suffers from it as well.
The authors claim that "the simulation results show almost full soft error coverage". Without a
fault-injection experiment, this claim is not very convincing, even though the authors discuss
three cases of fault occurrence in Section IV. The assumption that the instruction decoder lies
outside the SOR (sphere of replication) means that errors in the decoder can have a severe impact.
How does fault recovery proceed in such a scenario? The authors consider fault coverage only for
single bit-flips. Although a single bit-flip is a sufficient model for soft errors, the authors
should still comment on how the architecture behaves under double or triple bit-flips.
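To make the multi-bit concern concrete, here is a small illustration (not taken from the paper) of why a fault model built around single bit-flips does not automatically extend: a single even-parity bit, for instance, catches any odd number of flips but silently misses any even number.

```python
def parity(word, width=32):
    """Even-parity bit of a width-bit word (1 if the popcount is odd)."""
    return bin(word & ((1 << width) - 1)).count("1") & 1

word = 0b1011
stored_parity = parity(word)       # recorded when the word was written

one_flip = word ^ 0b0001           # single bit-flip
two_flips = word ^ 0b0011          # double bit-flip

assert parity(one_flip) != stored_parity    # detected: parity changed
assert parity(two_flips) == stored_parity   # undetected: parity unchanged
```

Any protection mechanism whose guarantees are argued only for single flips deserves a similar sanity check against double and triple flips, which is exactly the analysis the review asks the authors to add.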
Disagreement:
From a performance perspective, reusing the same unit for temporal redundancy may not always be
helpful when programs contain a large fraction of floating-point instructions. The assumption that
the memory state can be recovered after 1 billion instructions is very optimistic. The authors
state that "a single bit flip in any of the fields of the ROB would result in an incorrect result
in either the verifier or the main OoO processor, but not both". For double bit-flips, however, an
incorrect result may appear in both. How should such cases be handled?
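The ROB concern can be illustrated with a toy duplicate-and-compare check (an assumed detection mechanism for illustration, not the paper's exact design): a single flip in one copy produces a mismatch, but a double flip that corrupts both copies identically passes the comparison even though both results are wrong.

```python
def detect(main_result, verifier_result):
    """Duplicate-and-compare: flag any mismatch between the two copies."""
    return main_result != verifier_result

golden = 0xDEADBEEF

# Single flip in one copy (the paper's fault model): detected.
assert detect(golden ^ (1 << 5), golden) is True

# Double flip hitting the SAME bit position in BOTH copies: both
# results are wrong, yet they agree, so the comparison passes silently.
assert detect(golden ^ (1 << 5), golden ^ (1 << 5)) is False
```

This is why the quoted single-flip guarantee does not carry over: once both sides of the comparison can be corrupted, agreement no longer implies correctness.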