Introduction
High Performance Computing (HPC)
HPC systems are rated by performance in FLOPS, Floating Point Operations Per
Second. This metric measures how quickly a system carries out scientific calculations, and it
allows different systems to be compared on a common basis.
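As a concrete illustration, a machine's theoretical peak FLOPS can be estimated from its hardware parameters. The figures below are hypothetical, not drawn from any specific system:

```python
def peak_flops(sockets, cores_per_socket, clock_hz, flops_per_cycle):
    """Theoretical peak = sockets x cores x clock rate x FLOPs issued per cycle."""
    return sockets * cores_per_socket * clock_hz * flops_per_cycle

# Hypothetical node: 2 sockets, 64 cores each, 2.0 GHz, 16 FLOPs/cycle (wide SIMD FMA)
node_peak = peak_flops(2, 64, 2.0e9, 16)
print(f"{node_peak / 1e12:.1f} TFLOPS per node")  # 4.1 TFLOPS per node
```

Real sustained performance on scientific workloads is well below this peak, which is why measured benchmarks, rather than raw hardware figures, are used to rank systems.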
HPC primarily refers to integrated computing environments that run their applications as
parallel processes. The goals are efficiency, speed, and reliability, while also solving
scientific problems that would be prohibitively time-consuming if run serially. HPC systems
give scientists the freedom to pursue research across many disciplines, including chemistry,
nuclear physics, astrophysics, nanotechnology, biology, medicine, and materials science.
There is still a need to develop more advanced computers, the next generation of supercomputers,
driven by the upsurge of larger-scale problems and the increased demand for higher-resolution
models. The demand for exaflop performance keeps growing, as it is necessary for handling the
volumes of data expected to be a trend in the future. Higher precision and finer resolution are
among the main requirements anticipated for future predictive models and data loads. Systems at
this scale will need to host on the order of a million sockets, each with many cores, in order to
deliver exascale performance, and memory modules, storage devices, and communication networks
are all key parts of the performance package.
System resilience is identified as a major threat to the growth of systems toward exascale
performance. Faults arise more often within such systems simply because the number of
components increases, regardless of how reliable and efficient each individual component is.
The reliability of such systems is measured in terms of Mean Time To Failure (MTTF), which for
today's large systems, as rated by their operators, is on the order of days or even weeks. In
future systems, however, the MTTF of normal computations is projected to shrink to minutes and
hours. Applications therefore face a serious risk of failing mid-computation, and the potential
losses are enormous: data is likely to be lost alongside time and human effort. The reliability
problem is plainly acknowledged, since manufacturers are already advising developers to
incorporate fault tolerance techniques.
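The scaling argument above can be made concrete. If component failures are independent and exponentially distributed, the system failure rate is the sum of the component rates, so the system-level MTTF falls in inverse proportion to the component count. The numbers below are illustrative only:

```python
def system_mttf(component_mttf_hours, n_components):
    """Under independent, exponentially distributed failures, the system
    failure rate is the sum of the component rates, so the system MTTF
    is the component MTTF divided by the number of components."""
    return component_mttf_hours / n_components

# A component MTTF of 10 years (~87,600 hours) sounds comfortable, but
# with one million components the system-level MTTF collapses:
mttf_hours = system_mttf(87_600, 1_000_000)
print(f"System MTTF: {mttf_hours * 60:.1f} minutes")  # System MTTF: 5.3 minutes
```

This is why component counts alone, independent of per-component quality, push large systems from failure intervals of weeks down to minutes.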
Mechanisms have also been put in place to mitigate the full effects of system failures, enabled
by various dominant resilience techniques. The findings point to redundant re-execution of code
as the primary cost of system failures. It is also important to note that as the mean time
between failures decreases in future systems, the energy required continues to increase, and the
resilience of the systems does not meet the demands either.
Chapter 2
Preliminaries: The Taxonomy of
Resilient Computing
Overview:
There is an art and science to building computing systems that operate reliably even in the
presence of faults; this is referred to as fault-tolerant computing. Reliability has long been a
specific area of interest in computing research. The applications that draw the most interest
include aviation control systems, on-board computing for space missions, and automobile
electronics, all of which require systems that withstand faults and keep operating despite
whatever issues arise while they are in service. The most widely used techniques are based on
coordinated checkpointing and rollback recovery. These techniques are, however, not very
efficient on modern supercomputers built from NUMA and SMP nodes using off-the-shelf parts.
Failure data from large-scale HPC systems is telling: errors occur more often as the systems
pursue ever-higher performance. Future reliability is also at stake, as it is likely to be
affected by the reduced dependability of individual system components. There is thus a clear
need to develop fault-resilient systems to back up high performance computing platforms, a need
driven by the trends and projections regarding the reliability of semiconductor devices and
large system architectures.
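The checkpoint/rollback approach mentioned above can be sketched minimally: the application periodically saves its state, and after a failure it resumes from the most recent checkpoint instead of from the start. The file name, interval, and loop body below are illustrative stand-ins, not any particular system's API:

```python
import os
import pickle

CHECKPOINT_FILE = "state.ckpt"  # illustrative path

def save_checkpoint(state):
    """Persist application state so a restart can roll back to it."""
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    """Return the last saved state, or a fresh one if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def run(n_steps, checkpoint_every=100):
    state = load_checkpoint()            # roll back to the last checkpoint
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)       # periodic coordinated save point
    return state["total"]
```

After a crash, rerunning `run` resumes from the last saved step rather than step 0, so only the work since the last checkpoint is lost. On real supercomputers the coordination across many nodes, and the I/O cost of writing the checkpoints, are precisely what makes this approach strain at scale.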
This chapter reviews the background and terminology needed to understand the reliability
challenges, covered in Sections 2.1 through 2.6. Section 2.7 gives further information on the
reliability challenges and their relevance to future supercomputers.
Dependability
Dependability is defined as the degree of trustworthiness of a system: the extent to which a
system permits reliance on the service it offers. The pertinent attributes include safety,
availability, reliability, and security.
Reliability is the attribute most relevant to this project. It refers to a system's ability to
keep delivering correct results no matter how many times a given task is performed.
Reliability Metrics
Various metrics are used when assessing HPC systems and their resiliency. Reliability here
refers to the probability that a component delivers whatever function is assigned to it over
a given period of time. The frequency with which a system component encounters failure
defines its failure rate. Mean Time Between Failures (MTBF) is a common metric based on the
time elapsed between successive failures; it is the reciprocal of the failure rate and is
typically expressed in years. MTBF equals the sum of the mean time to repair (MTTR) and the
mean time to interrupt (MTTI): MTTR is the time taken to recover from a component failure,
while MTTI is the mean time a component works correctly before it is brought down by an
error.
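The relationships among these metrics as stated above (MTBF = MTTR + MTTI, and failure rate as the reciprocal of MTBF) can be expressed directly; the numbers in the example are hypothetical:

```python
def mtbf(mttr_hours, mtti_hours):
    """MTBF is the sum of mean time to repair and mean time to interrupt."""
    return mttr_hours + mtti_hours

def failure_rate(mtbf_hours):
    """Failure rate is the reciprocal of MTBF (failures per hour)."""
    return 1.0 / mtbf_hours

# Hypothetical component: 2 hours to repair, one week (168 h) of useful work
m = mtbf(2.0, 168.0)
print(m, failure_rate(m))  # 170.0 h between failures, ~0.0059 failures/hour
```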
The problem of resilience at exascale
Distributed systems are reportedly approaching an era of extreme-scale computation.
Predictions in HPC hold that supercomputers are likely to reach exascale by the end of this
decade. These machines will be characterized by 10^18 Flop/s and will comprise millions of
nodes; additionally, they will support up to one billion threads of execution and many
petabytes of memory. Such computers will help unravel mysteries in scientific applications
across varied disciplines such as economics, physics, biology, chemistry, earth systems,
astronomy, social learning, and neuroscience.
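A quick calculation shows what these figures imply per node; the node count used here is one illustrative point within the "millions of nodes" projection:

```python
EXAFLOP = 1e18  # 10^18 floating-point operations per second

def per_node_flops(total_flops, n_nodes):
    """Sustained rate each node must deliver to reach the system total."""
    return total_flops / n_nodes

# With one million nodes, every node must still sustain a full teraflop:
rate = per_node_flops(EXAFLOP, 1_000_000)
print(f"{rate / 1e12:.0f} TFLOPS per node")  # 1 TFLOPS per node
```

The point of the arithmetic is that exascale cannot be reached by node count alone; each node must itself be a highly parallel machine, which is where the billion-thread figure comes from.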
Exascale computing is likely to face a varied range of challenges concerning hardware, how it
is managed, and how the computing systems will be developed. Techniques dating back several
years are likely to be revived and adapted to support the coming exascale parallel computing
improvements.
However, there are roadblocks on the way to this new development. One is the probability of
frequent errors that contribute to system failures and consequently undo the progress already
made in a computation; erroneous results may be produced as well. The proposed systems might
offer high nominal performance yet still be useless in practice. The frequency of errors will
be high, as a result of several factors.
Failures of physical components are likely to be more numerous and more frequent, and some
may even go undetected by the hardware and silently corrupt computations. It is expected that
as hardware develops, the software will also become more complex and consequently more prone
to errors. Complexity is also likely to contribute to failures, which can be explained by the
larger scale of operations and the accompanying decentralization of functions.
With the increased error rate, newer and better techniques for handling errors will be
needed, and detection techniques will have to be improved in software.
There are four most pressing challenges in exascale computing: resilience, memory and
storage, energy and power, and lastly concurrency and locality. All of them must be
addressed, or they could impede the progress of exascale computation.
The Energy and Power Challenge
This happens to be the most serious of them all. It is rooted in the inability to project
which technologies will be able to deliver the required systems reliably.
The Memory and Storage Challenge
This is more of a current concern. It reflects the lack of technology to store large volumes
of data at high rates while supporting the applications suited to these computations, all
within the designed power ranges. The challenge of storing data exists in both main memory
and secondary storage.
The Concurrency and Locality Challenge
This challenge stems from the flattening of silicon clock rates and single-thread
performance, which has made parallelism explicitly visible in programming and which together
impact system performance. The challenge is common to all three system classes. Projections
for data centers indicate that systems may be forced to support enormous numbers of threads
just to ensure efficient hardware use.
A Resiliency Challenge
This problem concerns the ability of systems to continue operating effectively in the
presence of faults and/or fluctuations. The concern is reported to have grown out of the
unprecedented expansion in the number of components in much larger systems, and from the push
for technological advances at relatively lower voltage levels, which leaves circuits and
devices more sensitive to variations in operating conditions.
Fault Tolerance Challenges at Exascale
Exascale presents a problem here because faults cannot be ruled out. Fault tolerance is a
requirement in systems with high failure rates. Solutions must coordinate operations across
billions of execution threads, support various applications, and enable their interactions.
On that note, it is uncertain whether the existing fault-tolerance solutions will remain
relevant in the future, especially given the endless demands of exascale applications.
The implementation of fault tolerance proceeds in four phases: detection, recovery,
containment, and preparedness. These phases are not specific to HPC systems, but they pose
distinct challenges when deployed in exascale systems.
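The four phases can be organized as a simple pipeline. The class and method contents below are an illustrative skeleton, not a standard API; only the phase names come from the text:

```python
class FaultTolerancePipeline:
    """Illustrative skeleton of the four fault-tolerance phases."""

    def detect(self, observation):
        """Phase 1: decide whether behaviour has deviated from the norm."""
        return observation.get("healthy", True) is False

    def contain(self, node_id, healthy_nodes):
        """Phase 2: isolate the faulty node so errors do not spread."""
        return [n for n in healthy_nodes if n != node_id]

    def recover(self, checkpoint):
        """Phase 3: roll back to a known-good state (e.g. a checkpoint)."""
        return dict(checkpoint)

    def prepare(self, interval_s):
        """Phase 4: adjust defences, e.g. shorten the checkpoint interval."""
        return max(interval_s // 2, 1)

pipeline = FaultTolerancePipeline()
if pipeline.detect({"healthy": False, "node": 7}):
    survivors = pipeline.contain(7, [5, 6, 7, 8])
    state = pipeline.recover({"step": 4200})
    next_interval = pipeline.prepare(600)
    print(survivors, state, next_interval)  # [5, 6, 8] {'step': 4200} 300
```

At exascale, each of these hooks must operate across billions of threads, which is exactly where the familiar phase structure starts to strain.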
Fault detection is essential in determining the effectiveness of a system: any deviant
behavior should be detected in time. However, faults are rarely detected in isolation; they
are detected alongside the failures they cause. Fault tolerance techniques are specifically
useful for identifying errors that lead to failures. Failure-checking mechanisms must have
the following features: they must be complete, so that any form of failure is detected; they
must be independent of the system under scrutiny, so that faults within that system do not
affect them; and the checking should be deterministic and independent of the applications.
For HPC systems, these properties are ensured through careful attention to availability,
reliability, and serviceability.
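One common way to approach the completeness and independence properties above is an external heartbeat monitor: nodes report liveness periodically, and the checker, running outside the monitored system, declares silent nodes failed. The class, timeout, and simulated clock below are illustrative:

```python
import time

class HeartbeatMonitor:
    """External failure checker: nodes report heartbeats, and any node
    that stays silent longer than `timeout` seconds is declared failed.
    Running the monitor outside the monitored system keeps the check
    independent of faults inside that system."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node_id, now=None):
        """Record a liveness report from a node (now overrides the clock)."""
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def failed_nodes(self, now=None):
        """All nodes whose last heartbeat is older than the timeout."""
        t = time.monotonic() if now is None else now
        return sorted(n for n, seen in self.last_seen.items()
                      if t - seen > self.timeout)

# Simulated clock: node "b" goes silent and is flagged after the timeout.
mon = HeartbeatMonitor(timeout=5.0)
mon.heartbeat("a", now=0.0)
mon.heartbeat("b", now=0.0)
mon.heartbeat("a", now=8.0)
print(mon.failed_nodes(now=8.0))  # ['b']
```

Note that such a checker is complete only for crash-style failures; a node that keeps sending heartbeats while producing wrong results would pass undetected, which is why application-independent result checking is listed as a separate requirement.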