
Chapter 1

Introduction
High Performance Computing (HPC)
The performance of HPC systems is rated in FLOPS, Floating Point Operations Per Second. This metric measures how fast a system can carry out the floating point arithmetic that dominates scientific calculations, and it allows different systems to be compared directly.
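As a rough illustration, the theoretical peak FLOPS of a system can be estimated from its node count, cores per node, clock rate, and floating point operations per cycle. The sketch below uses made-up hardware figures purely for illustration; the formula, not the numbers, is the point.

# Back-of-envelope estimate of theoretical peak performance.
# All hardware figures below are illustrative assumptions, not real system data.

def peak_flops(nodes, cores_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak = nodes * cores * clock * FLOPs issued per cycle per core."""
    return nodes * cores_per_node * clock_hz * flops_per_cycle

# Hypothetical system: 10,000 nodes, 64 cores each, 2 GHz, 16 FLOPs/cycle (wide SIMD + FMA).
peak = peak_flops(nodes=10_000, cores_per_node=64, clock_hz=2.0e9, flops_per_cycle=16)
print(f"Theoretical peak: {peak / 1e15:.1f} PFLOPS")  # ~20.5 PFLOPS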

HPC primarily refers to integrated computing environments that run their applications as parallel processes. The goals are efficiency, speed, and reliability, together with the ability to solve scientific problems that would be prohibitively time consuming if executed serially. HPC systems allow scientists to carry out research across many disciplines, including chemistry, nuclear physics, astrophysics, nanotechnology, biology, medicine, and materials science.

There is still a need for more advanced machines, the next generation of supercomputers, driven by the upsurge of larger-scale problems and the demand for higher-resolution models. Exaflop performance will be necessary to handle the data volumes expected in the future, and future predictive models and data loads will demand both higher precision and finer resolution. Such systems will need to contain on the order of a million sockets, each with many cores, to deliver exascale performance, and memory modules, storage devices, and communication networks are equally critical to that performance.

System resilience is a major threat to the growth of systems toward exascale performance. Dependability drops as faults arise within these systems, simply because the number of components increases, regardless of how reliable and efficient each individual component is. The reliability of such systems is measured by the Mean Time To Failure (MTTF), which operators of today's large systems report in days or even weeks. For future systems, however, the same calculation is expected to yield an MTTF of minutes or hours. An application running on such a system therefore faces a real risk that its computation fails before it completes, losing data as well as time and human effort. The seriousness of the reliability problem is evident in the fact that manufacturers are already advising developers to incorporate fault tolerance techniques.

Mechanisms have been put in place to mitigate the effects of system failures, based on a handful of dominant resilience techniques. These techniques rely chiefly on redundant execution of code to counter the effects of failures. It is also important to note that as the mean time between failures of future systems decreases, the energy required for resilience keeps increasing, and current techniques will not meet the demands either.
Chapter 2
Preliminaries: The Taxonomy of Resilient Computing
Overview:
Fault-tolerant computing is the art and science of building computing systems that operate reliably even in the presence of faults. Reliability has long been a specific area of interest in computing, driven by applications such as aviation control systems, on-board computers for space missions, and automotive electronics, all of which require systems that withstand faults and keep operating despite whatever issues arise while they are running. The most widely used techniques are based on coordinated checkpointing and rollback recovery. These techniques are, however, not very efficient on modern supercomputers, which are built from NUMA and SMP nodes using off-the-shelf parts. Failure data from large-scale HPC systems shows that errors occur more often as the systems pursue ever higher performance, and the reliability of future operations is further threatened by the reduced dependability of individual system components. Given the trends and projections for the reliability of semiconductor devices and large system architectures, developing fault-resilient support for high performance computing platforms is a necessity.
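To make the checkpoint/rollback idea concrete, here is a minimal, single-process sketch of the pattern: the application periodically saves its state to stable storage and, after a failure, restarts from the latest checkpoint instead of from the beginning. The file name, checkpoint interval, and state layout are illustrative assumptions; real HPC checkpointing is coordinated across many processes (for example via MPI) and handles far more state.

import os
import pickle

CKPT_FILE = "state.ckpt"   # illustrative path
CKPT_EVERY = 1_000         # checkpoint every 1,000 iterations (arbitrary)

def save_checkpoint(step, state):
    # Write to a temporary file first, then rename, so a crash mid-write
    # cannot corrupt the previous checkpoint.
    tmp = CKPT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((step, state), f)
    os.replace(tmp, CKPT_FILE)

def load_checkpoint():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)       # roll back to the last saved step
    return 0, {"sum": 0.0}              # no checkpoint: start from scratch

step, state = load_checkpoint()
for i in range(step, 10_000):
    state["sum"] += i * 1e-6            # stand-in for the real computation
    if i % CKPT_EVERY == 0:
        save_checkpoint(i + 1, state)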

This chapter reviews the background and terminology needed to understand the reliability challenges in Sections 2.1 through 2.6. Section 2.7 gives further information on the reliability challenges and their relevance to future supercomputers.
Dependability
Dependability is defined as the degree of trustworthiness of a system: the extent to which reliance can justifiably be placed on the service it delivers. Its main attributes are safety, availability, reliability, and security.
Reliability is the attribute most relevant to this project. It refers to the ability of a system to keep producing correct results no matter how many times a given task is executed.

Relationship between Fault, Error and Failure


These terms are often used interchangeably. In the context of fault tolerance, however, they have distinct formal meanings.
A fault is an event in which a component deviates from its specified behavior, for instance because of a physical defect. A system failure, in turn, is the inability of a system to perform its designated function, usually because of faulty parts. A fault can remain dormant and have no effect; when it is activated during system operation, it produces an error, and such activation may be triggered by internal or external factors in equal measure.
Errors are the manifestation of activated faults and typically drive the system into an illegal state. A failure occurs when an error reaches the system's service interface, so that the behavior of the system deviates from its specification.
There is thus a causal relationship between fault, error, and failure. In a system built from multiple components, the failure of a single component acts as a fault for the components that depend on it: errors can be generated and propagate systematically from one component to the next. For example, a fault in a computation unit may produce erroneous values, and the resulting chain of errors can be carried from one stage to the next until the program or the whole system crashes.
Fault Tolerance and Resilience
Fault tolerance means avoiding failures in the presence of faults, and it is therefore an aspect of reliability. One has to identify the potential causes of failure and then develop mechanisms that prevent those faults from turning into failures.
Resilience is directly related to fault tolerance in that it concerns the correctness of an application or system and the confidence that it can withstand potential disruptions. Resilience is meant to ensure that applications and systems keep running and produce correct results within the expected time frame, regardless of system failures and degradations. This is achieved by protecting applications and systems from data corruption and from any other errors that may arise, whether caused by faulty hardware or software. Fault tolerance, as a technique, is primarily aimed at managing faults so that applications can keep operating when failures occur.
Fault tolerance has a long history in computing, and recent discussions use the term for the general class of approaches to handling system reliability problems. Fault tolerance and fault resilience are therefore used interchangeably in this work.
Classification of Faults
Faults can be classified by their impact on the outcome of an application, in particular on its address space, or by how long they persist. Both classifications are discussed in detail in the following sections.
Based on the Impact on the Application
• Detected and Corrected Errors (DCE): errors that can be detected and corrected without interfering with the execution of an application. A typical example is a single-bit memory error detected and corrected by built-in mechanisms such as parity and error correcting codes (ECC). These mechanisms notify the operating system, and because the correction is handled in hardware, the application's address space remains unaffected.
• Detected but Unrecoverable Errors (DUE): errors that can be detected but not corrected. They interfere with the running of an application and may even crash the system. A typical example is a double-bit error on a DRAM line protected by ECC: it is detected, but it cannot be masked by the single-bit-correction, double-bit-detection capability of ECC. The operating system is notified through a non-maskable interrupt, and the system eventually shuts down.
• Silent Errors (SE): faults that remain undetected by whatever mechanisms are in place. A silent error may be benign and never be noticed, or it may corrupt part of an application's state without any visible symptom. When its effects do become visible in the results, the situation is referred to as silent data corruption (SDC). These errors usually do not crash the application; in most cases the application runs to completion, but the results are inconsistent with what was expected.
This classification determines how the impact on the application is managed. SDC requires both the ability to detect the error and the capability to correct it. DUE requires identifying which parts of the application are affected, in order to decide whether the error can be masked so that execution can continue. DCE produces notifications that can be used to drive proactive actions in the future.
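The taxonomy above can be mirrored directly in software. The following sketch is an assumption of how an application-level handler might dispatch on the error class reported to it; the names and the policy are illustrative, not taken from any specific system.

from enum import Enum, auto

class ErrorClass(Enum):
    DCE = auto()   # detected and corrected, e.g. by ECC: execution unaffected
    DUE = auto()   # detected but uncorrectable: decide whether to mask or abort
    SE  = auto()   # silent error: only visible later, possibly as SDC

def handle_error(err, can_mask):
    """Illustrative policy for reacting to each error class."""
    if err is ErrorClass.DCE:
        return "log"                    # keep a record for proactive maintenance
    if err is ErrorClass.DUE:
        return "mask" if can_mask else "restart_from_checkpoint"
    return "verify_results"             # SE/SDC: only output checking can catch it

print(handle_error(ErrorClass.DUE, can_mask=False))  # restart_from_checkpoint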
Based on Duration of the Fault
The classification of errors and faults distinguishes them into permanent, transient, and
intermittent.
• Transient: faults that occur occasionally and do not persist. They are also referred to as single-event upsets (SEU) or soft errors.
• Intermittent: faults that occur repeatedly, at seemingly random intervals, in a system that otherwise behaves normally.
• Permanent: faults that persist once they occur. They are also referred to as hard faults and remain until the affected components are repaired or replaced.
How long a fault persists determines the strategy used to recover the system. Permanent faults demand repair or replacement of the faulty components. Transient faults are best simply masked, with no further attention given. Intermittent faults sit between the two and are treated either way, depending on their observed frequency and location.
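A simple retry policy captures how this classification drives recovery: a transient fault is masked by retrying, while a fault that keeps recurring is treated as intermittent or permanent and escalated. The retry threshold and the escalation action below are illustrative assumptions.

MAX_RETRIES = 3   # beyond this, assume the fault is not transient (arbitrary threshold)

def run_with_retries(task):
    """Retrying masks transient faults; repeated failure escalates to repair or replacement."""
    for attempt in range(MAX_RETRIES):
        try:
            return task()
        except RuntimeError:            # stand-in for a detected fault
            continue                    # transient: simply retry (mask)
    raise SystemExit("persistent fault: flag the component for repair or replacement")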
Resilience Capabilities
Designing a resilience strategy requires the following capabilities.
Detection
Detection is the discovery of errors in the state of a system or application, whether in its data or in its instructions. It relies on redundancy: extra information that can be used to check and correct values. Errors are detected by identifying incorrect system states, and failures by observing transitions into incorrect behavior. Detectors indicate both errors and failures, and are themselves subject to unexpected failures.
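A single parity bit is the simplest form of the redundancy that detection relies on: one extra bit per word makes any single-bit flip visible. The sketch below is a toy illustration; real memory uses ECC, which adds enough redundancy to also correct single-bit errors.

def parity(word: int) -> int:
    """Even parity over the bits of an integer word."""
    return bin(word).count("1") % 2

stored_word = 0b1011_0010
stored_parity = parity(stored_word)     # redundant information kept alongside the data

# Later, a single bit flips (simulated fault):
corrupted = stored_word ^ 0b0000_1000

if parity(corrupted) != stored_parity:
    print("error detected")             # parity detects the flip, but cannot correct it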
Containment
Containment limits the impact of errors as they propagate. It is achieved by modularizing the system into components, each of which fails independently and is also the unit of replacement.
Masking
Masking comprises the mitigations that keep operations on course despite errors that threaten the computation. The aim is to ensure that the results obtained are as close as possible to the correct values.
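Triple modular redundancy (TMR) is a classic example of masking: the computation is executed three times and the majority result is taken, so a single faulty replica is outvoted. The sketch below is a minimal software version of that idea; the replica count and the voting rule are the standard ones, everything else is illustrative.

from collections import Counter

def tmr(compute, x):
    """Run the computation three times and mask a single faulty result by majority vote."""
    results = [compute(x) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes >= 2:
        return value                    # at most one replica disagreed: its error is masked
    raise RuntimeError("no majority: error detected but cannot be masked")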

Reliability Metrics
Several metrics are used to quantify the resiliency of HPC systems. Reliability is the probability that a component delivers its assigned function over a given period of time. The frequency with which a component encounters failures is its failure rate. The Mean Time Between Failures (MTBF) is a common metric based on the time elapsed between successive failures; it is the reciprocal of the failure rate and is commonly expressed in years. MTBF equals the sum of the Mean Time To Repair (MTTR) and the Mean Time To Interrupt (MTTI): MTTR is the time taken to recover from a component failure, while MTTI is the mean time a component works correctly before it is interrupted by an error.
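The relationships between these metrics are simple enough to work through numerically. The sketch below uses illustrative figures (a per-node MTBF of 5 years and a 100,000-node system, with failures assumed independent) to show how the system-level MTBF collapses as the component count grows, and how MTBF decomposes into MTTI and MTTR.

HOURS_PER_YEAR = 24 * 365

node_mtbf_h = 5 * HOURS_PER_YEAR        # assumed per-node MTBF: 5 years
nodes = 100_000                          # assumed system size

node_failure_rate = 1 / node_mtbf_h      # failure rate is the reciprocal of MTBF
system_mtbf_h = 1 / (nodes * node_failure_rate)   # failure rates of independent nodes add up
print(f"System MTBF: {system_mtbf_h:.2f} hours")  # ~0.44 h for these figures

mtti_h = 0.40                            # assumed mean time to interrupt
mttr_h = 0.04                            # assumed mean time to repair/recover
print(f"MTBF = MTTI + MTTR = {mtti_h + mttr_h:.2f} hours")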
The problem of resilience at exascale
Distributed systems are approaching an era of extreme-scale computation. Predictions in HPC are that supercomputers will reach exascale by the end of this decade. These machines will deliver 10^18 floating point operations per second, comprise millions of nodes, support up to a billion threads of execution, and provide many petabytes of memory. They will help unravel mysteries in scientific applications across disciplines as varied as economics, physics, biology, chemistry, earth systems, astronomy, social learning, and neuroscience.
Exascale computing will face a wide range of challenges concerning the hardware, how it is managed, and how the software is developed. Techniques dating back many years are likely to be revived and adapted to support the coming generation of extreme-scale parallel computing.
There are, however, roadblocks on the way to this development. One is the high probability of errors that lead to frequent system failures and erroneous results, undermining the progress made in large-scale computation. A system with high nominal performance is still useless if it fails or produces wrong results too often. The error rate of these machines will be high, for several reasons.
Failures of physical components will be more frequent, and some will go undetected by the hardware and silently affect computations. As the hardware develops, the software will also grow more complex and consequently more prone to errors. The sheer scale of operation adds further complexity, since many functions will have to be decentralized, and this too contributes to failures.
With error rates increasing, new and improved techniques for handling errors will be needed, and error detection in particular will have to be strengthened in software.
Four particularly daunting challenges stand out in exascale computing: resilience, memory and storage, energy and power, and concurrency and locality. All of them must be addressed, or they will impede the progress of exascale computation.
The Energy and Power Challenge
This is the most serious of the four. It is rooted in the inability to project technologies that can reliably deliver the required systems within practical power budgets.
The Memory and Storage Challenge
This is more of a current affair. It entails the deficit in technology meant to store high
capacitated data at higher rates and to be able to support applications suited for computations.
These are supposed to work within acceptable power ranges as designed. The challenge of
storing data exists in both the main memory and secondary memory.
The Concurrency and Locality Challenge
The challenge here develops from silicon clock rates and the increased thread performance.
This has resulted in visible programming and parallelism which altogether impact on system
performance. This challenge is specifically common for all the three systems. Projections in
data centers indicate that systems may be forced to be in support of varied threads just to
ensure efficient hardware use.
The Resiliency Challenge
Resiliency is the ability of a system to continue operating correctly in the presence of faults and fluctuations. Concern about it has grown out of the unprecedented number of components in much larger systems and from technology scaling to ever lower voltage levels, which makes circuits and devices more sensitive to their operating conditions.
Fault Tolerance Challenges at Exascale
Exascale makes fault tolerance unavoidable, since faults cannot be ruled out at that scale. Fault tolerance is a requirement in systems with high failure rates, and solutions must coordinate recovery across billions of execution threads while supporting a variety of applications and their interactions. It is therefore uncertain whether existing fault tolerance solutions will remain adequate for the endless demands of exascale applications.
Fault tolerance is implemented in four phases: detection, recovery, containment, and preparedness. These phases are not specific to HPC systems, but they raise distinctive challenges when deployed in exascale systems.
Fault detection is essential to the effectiveness of the whole approach: any deviating behavior must be detected in time. Faults, however, are rarely detected on their own; they are usually detected through the failures they cause, and fault tolerance techniques mainly identify errors that manifest as failures. Failure checking mechanisms must have the following properties: they must be complete, so that every form of failure is detected; they must be independent of the system under scrutiny, so that faults within that system do not affect them; and the checking must be deterministic and independent of the applications. In HPC systems, these properties are ensured through careful attention to reliability, availability, and serviceability.
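A heartbeat monitor is one common way to approximate these properties in practice: it runs separately from the monitored processes (independence) and flags any node whose heartbeat stops, whatever the cause (completeness, within the limits of a timeout). The node names, timeout, and reporting below are illustrative assumptions, not a specific facility's mechanism.

import time

TIMEOUT_S = 10.0                         # assumed detection threshold
last_heartbeat = {}                      # node -> time of last heartbeat received

def record_heartbeat(node):
    last_heartbeat[node] = time.monotonic()

def check_failures():
    """Report every node whose last heartbeat is older than the timeout."""
    now = time.monotonic()
    return [n for n, t in last_heartbeat.items() if now - t > TIMEOUT_S]

record_heartbeat("node-0042")
time.sleep(0.1)                          # in reality the checker runs periodically
print(check_failures())                  # [] while heartbeats are fresh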
