
Chapter 18

TASK SCHEDULING ALGORITHMS FOR FAULT TOLERANCE IN REAL-TIME EMBEDDED SYSTEMS


Nagarajan Kandasamy and John P. Hayes
Advanced Computer Architecture Laboratory, Department of Electrical Engineering and Computer Science, The University of Michigan, 1301 Beal Ave., Ann Arbor, MI 48105, U.S.A.
nkandasa@eecs.umich.edu, jhayes@eecs.umich.edu

Brian T. Murray
Advanced Development, Saginaw Steering Systems, Delphi Automotive Systems, 3900 Holland Road, Saginaw, MI 48601, U.S.A.
brian.t.murray@delphiauto.com

Abstract

We survey scheduling algorithms proposed for tolerating permanent and transient failures in real-time embedded systems. These algorithms attempt to provide low-cost solutions to fault tolerance, graceful performance degradation, and load shedding in such systems by exploiting tradeoffs between space and/or time redundancy, timing accuracy, and quality of service. We place fault-tolerant scheduling algorithms in three broad categories: dynamic scheduling, off-line or static scheduling, and scheduling of imprecise computations. Under dynamic scheduling, we survey fault-tolerance extensions to the widely used rate-monotonic and earliest-deadline-first scheduling policies. We then discuss methods that provide fault tolerance in statically scheduled systems using precomputed alternate schedules or run-time rescheduling. We also discuss imprecise scheduling, which achieves a tradeoff between solution quality and timeliness. We conclude with a brief discussion of scheduling and fault-tolerance issues related to safety-critical embedded systems.

Keywords: fault tolerance, scheduling algorithms, embedded systems

1. INTRODUCTION

The correctness of real-time safety-critical systems depends not only on the results of computations, but also on the time instants at which these results become available. Examples of such systems include fly- and drive-by-wire, industrial process control, nuclear reactor management, and medical electronics. Real-time tasks have to be mapped to processors such that deadlines, response times, and similar performance requirements are met, a process called task scheduling. Furthermore, many real-time systems function in a hostile, unpredictable environment and have to guarantee functional and timing correctness even in the presence of hardware and software faults.

Faults can be classified according to their duration. Permanent faults remain in existence indefinitely if no corrective action is taken; they can be caused by catastrophic system failures such as processor failures, severed communication links, and so on. Intermittent faults appear, disappear, and reappear repeatedly; they are difficult to predict, but their effects are highly correlated, and most are due to marginal design or manufacturing. Transient faults appear and disappear quickly and are not correlated with each other; they are most commonly induced by random environmental disturbances such as electromagnetic interference (EMI).

In real-time systems, fault tolerance is typically provided by physical and/or temporal redundancy. Physical redundancy in the form of replicated hardware and software components is used to tolerate both permanent and transient system failures. Systems such as MARS [16] execute identical tasks on multiple processors; if a fault affects a processor, that processor falls silent and a backup or replica processor provides the result. Alternatively, different versions of the software can be executed on diverse hardware platforms and the results of the versions voted upon, as in the N-version programming [1] and N-self-checking programming [26] approaches. These techniques mask system failures with no degradation in performance and zero recovery latency. To reduce the overhead associated with replicated hardware, some approaches treat the set of processors as a pooled resource: when a processor fails, other members of the pool provide the functionality of the failed processor [40]. Though this approach lowers the hardware overhead needed to tolerate failures, it typically causes some performance degradation and a non-zero recovery latency.

Low-cost embedded systems can use temporal redundancy to tolerate transient task failures via spare processor capacity. A common recovery technique is reexecuting the failed task [17]. Another is the primary/backup approach [21][33], wherein the backup (alternate) version of a task is executed if its primary version produces incorrect results.

Embedded systems such as steer-by-wire (a steering function implemented by computer-controlled actuators interconnected by in-vehicle networks, with no direct mechanical link between the driver input and the road wheels) aim at high reliability using modest hardware redundancy because of packaging and power consumption constraints. Cost-effective fault tolerance can be provided by scheduling algorithms that guarantee the functional and timing correctness of tasks even in the presence of failures. This paper reviews scheduling algorithms that attempt to provide low-cost solutions to fault tolerance, graceful performance degradation, and load shedding in embedded systems by exploiting tradeoffs between space or time redundancy, timing accuracy, and quality of service.

Section 2 provides a brief introduction to the dynamic and static (off-line) scheduling paradigms and discusses their strengths and weaknesses. The subsequent sections review fault-tolerant scheduling algorithms under three broad headings: dynamic, static, and imprecise. In Section 3, we discuss fault-tolerant extensions to widely used dynamic scheduling algorithms, including rate-monotonic (RM) and earliest-deadline-first (EDF) scheduling. Section 4 discusses methods to tolerate faults in statically scheduled real-time systems. Imprecise or approximate computations can improve scheduling flexibility and dependability in certain classes of real-time systems; scheduling algorithms for imprecise computations are surveyed in Section 5. We conclude with a brief discussion of scheduling and fault-tolerance issues expected in distributed embedded systems of the future.

2. SCHEDULING PARADIGMS

A mapping of tasks to processors such that all tasks meet their time constraints is called a feasible schedule. A schedule is optimal if it minimizes a cost function defined for the task set. If no cost function is defined and the only concern is to obtain a feasible schedule, then a scheduling algorithm is optimal if it fails to meet a task deadline only when no other algorithm in its class can meet it. When the scheduling problem is NP-complete, heuristic algorithms are used to find feasible solutions that are not guaranteed to be the best possible.

Depending on the times at which requests for execution are made, tasks can be classified as periodic, sporadic, or aperiodic. Periodic tasks repeat at regular time intervals and their request times are known a priori. Sporadic task request times are not known a priori, but a minimum interval is assumed to exist between two successive requests. Aperiodic tasks have no such constraint on their request times. Tasks can be independent or have precedence, synchronization, and mutual exclusion constraints between them. Tasks can be mapped to processors in a preemptive or non-preemptive fashion. With preemptive mapping, the running task can be interrupted at any time to assign the processor to another ready task, whereas with non-preemptive mapping, a task once started executes to completion before relinquishing the processor.

Finally, tasks can be executed on a single processor or in a distributed environment comprising multiple processors and a communication network.

A dynamic scheduler makes its scheduling decisions at run time based on requests for system services. After the occurrence of a significant event such as a service request, the algorithm determines which of the ready tasks should be executed next based on a task priority that is statically or dynamically assigned. A well-known static priority-driven algorithm for scheduling independent, periodic tasks on a single processor is the RM method, first studied by Liu and Layland [24], which assigns higher priorities to tasks with shorter periods. The authors show that the RM algorithm is optimal among static priority-based scheduling schemes and derive a simple schedulability test based on the resource utilization of the tasks. Liu and Layland also study the EDF algorithm, an optimal preemptive algorithm for single-processor systems that assigns task priorities dynamically based on their deadlines: the closer a task's deadline, the higher its priority. The laxity of a task, the amount of time it can wait and still meet its deadline, can also be used to assign priorities dynamically, as in the least-laxity-first (LLF) scheme. Dynamic scheduling algorithms such as RM, EDF, and LLF are flexible and can be extended to handle aperiodic and sporadic task requests [20]. However, it is difficult to guarantee deadlines with dynamic scheduling techniques for complex tasks with precedence, synchronization, and exclusion constraints executing in a distributed environment. In fact, task scheduling with precedence and synchronization constraints in a distributed environment is an NP-complete problem for which no optimal dynamic scheduling strategy is known [17].

A static or off-line scheduling algorithm considers the resource, precedence, and synchronization requirements of all tasks in the system and attempts to generate a feasible schedule that is guaranteed to meet the timing constraints of all tasks. The schedule is calculated off-line and is fixed for the life of the system. Typically, a scheduling or dispatch table identifies the start and finish times of each task, and tasks are executed on the processor according to this table. Static table-driven scheduling is applicable to periodic tasks or to aperiodic (sporadic) tasks that can be transformed into periodic ones [17]. The problem of scheduling tasks with precedence and synchronization constraints on a set of processors is NP-complete, and heuristics are typically used to obtain a feasible schedule. Most of the proposed algorithms use aspects of the branch-and-bound technique [4] in searching for a feasible schedule.

The methods proposed in [32][42] consider only task scheduling, whereas those in [34] handle task and communication scheduling in an integrated fashion.

Static scheduling is suited to periodic, control-dominated systems such as automotive control because of its predictable behavior and high resource utilization. A scheduler based on a dispatch table is fast and can easily be verified to ensure dependability. However, the resulting system is quite inflexible to environmental changes. For example, a static schedule cannot effectively process aperiodic task requests generated in response to a rare hazardous system condition (an emergency) without poor resource utilization. Mode-change execution is one way to increase the flexibility of static scheduling [11]: all possible operating and emergency modes are identified during system design and a static schedule is calculated for each mode. When a mode change is requested at run time, the appropriate schedule is activated.
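To make table-driven dispatching and mode changes concrete, here is a minimal C sketch with one precomputed dispatch table per mode; the mode names, tick values, and task functions are invented for illustration and do not correspond to any particular system.

#include <stdio.h>

/* One entry of a static dispatch table: a start time (in ticks within the
   major cycle) and the task released at that time. */
typedef struct { int start_tick; void (*task)(void); } dispatch_entry_t;

static void steering_task(void)   { puts("steering control"); }
static void diagnostic_task(void) { puts("diagnostics"); }
static void emergency_task(void)  { puts("emergency handler"); }

/* Precomputed dispatch tables, one per operating mode (invented contents). */
static const dispatch_entry_t normal_mode[] = {
    { 0, steering_task }, { 5, diagnostic_task }, { 10, steering_task }
};
static const dispatch_entry_t emergency_mode[] = {
    { 0, steering_task }, { 3, emergency_task }, { 10, steering_task }
};

/* Run one major cycle of the selected table.  A real dispatcher would wait
   for each start tick; here the ticks are simply printed. */
static void run_cycle(const dispatch_entry_t *table, int n)
{
    for (int i = 0; i < n; i++) {
        printf("tick %2d: ", table[i].start_tick);
        table[i].task();
    }
}

int main(void)
{
    run_cycle(normal_mode, 3);      /* normal operating mode */
    /* A mode-change request at run time simply activates another table. */
    run_cycle(emergency_mode, 3);
    return 0;
}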

3. FAULT-TOLERANT DYNAMIC SCHEDULING

Before reviewing fault-tolerant extensions to dynamic scheduling algorithms, we discuss some important properties of the RM and EDF algorithms. Consider n independent and preemptible tasks T1(p1, c1), ..., Tn(pn, cn) executing on a single processor, where pi and ci are the period and execution time of Ti, respectively. Assume that the deadline of a task is equal to its period. The utilization U of the task set is given by

U = c1/p1 + c2/p2 + ... + cn/pn

If U ≤ n(2^(1/n) - 1), then the RM algorithm can schedule all tasks [24]; this inequality is called the RM schedulability test. As n → ∞, the minimum achievable utilization bound converges to ln 2 ≈ 0.69. Similarly, EDF can schedule the task set if U ≤ 1.

We now review a fault-tolerant extension to the RM scheduling scheme proposed by Oh and Son [29]. The authors assume multiple versions of each task and allocate the versions to different processors while minimizing the number of processors used. The allocation algorithm uses a first-fit bin-packing heuristic such that no two versions of a task Tj are assigned to the same processor. (Bin-packing heuristics pack variable-size items efficiently into fixed-size bins.) Let P1, ..., Pm and T1, ..., Tn denote the processors and the task set, respectively. Each task Tj has qj versions denoted Tj^1, Tj^2, ..., Tj^qj. To schedule version Tj^i, we find the least k such that Tj^i, together with the tasks (versions) previously assigned to Pk, satisfies the RM schedulability test, and assign Tj^i to Pk. Procedure FT_Allocate in Fig. 1 gives the allocation algorithm proposed in [29].

Procedure FT_Allocate(S)   /* S := task set */
/* k := processor index, m := number of processors used so far */
m := k := 1;
for (each task Tj in S) begin
    while (unassigned versions Tj^i of Tj exist) begin
        if (({Tj^i} together with the tasks already assigned to Pk is RM schedulable) and
            (no other version of Tj has been previously assigned to Pk))
            Assign Tj^i to Pk;
        else
            k := k + 1;
        if (k > m) m := k;   /* update the number of processors */
    end;
    k := 1;
end;

Figure 1. The fault-tolerant task allocation procedure for an RM schedulable real-time system [29]
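The RM schedulability check invoked by FT_Allocate can be approximated with the Liu-Layland utilization bound introduced above. The following is a minimal C sketch, assuming an invented task record and a made-up task set; an exact response-time test could be substituted.

#include <math.h>
#include <stdio.h>

typedef struct { double period; double wcet; } task_t;   /* invented task record */

/* Liu-Layland test: n tasks are RM-schedulable if their total utilization
   does not exceed n*(2^(1/n) - 1).  The bound is sufficient, not necessary. */
int rm_schedulable(const task_t *tasks, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += tasks[i].wcet / tasks[i].period;
    return u <= n * (pow(2.0, 1.0 / n) - 1.0);
}

int main(void)
{
    task_t set[] = { {10.0, 2.0}, {20.0, 4.0}, {40.0, 8.0} };   /* U = 0.6 */
    printf("bound = %.3f, schedulable = %s\n",
           3 * (pow(2.0, 1.0 / 3) - 1.0),
           rm_schedulable(set, 3) ? "yes" : "no");
    return 0;
}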

The authors show that the number of processors m required for fault-tolerant RM scheduling has the upper bound 2.33mo + qmax, where mo is the minimum number of processors required by an optimal algorithm to schedule the same task set and qmax = max{qj} is the maximum number of versions of any task.

Ghosh et al. [13] propose an RM scheme that tolerates transient faults by reexecuting failed tasks on the same processor. They introduce a slack time of at least cj between two successive requests for Tj in the schedule. This slack time is treated as a backup task and used for reexecution purposes. To reduce the processor utilization reserved for recovery, the backups are overloaded, that is, the slack time reserved for a single backup can be used to reexecute multiple failed primaries.

Pandya and Malek [31] derive the minimum achievable utilization for RM scheduling when the recovery action is to reexecute all uncompleted tasks after a fault. With this recovery action, they guarantee that no task misses a deadline in the presence of a single transient fault if the processor utilization satisfies U ≤ 0.5. This bound is better than the trivial bound of 0.69/2 = 0.345 obtained if double execution of all tasks is assumed. Burns et al. [5] also provide exact schedulability tests for fault-tolerant task sets using RM analysis.

Methods for scheduling aperiodic tasks in periodic systems can also be used to recover from transient task failures.
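As a rough illustration of the Pandya-Malek condition above, the following sketch merely checks the U ≤ 0.5 utilization bound for a made-up task set; the published analysis is considerably more detailed.

#include <stdio.h>

typedef struct { double period; double wcet; } task_t;   /* invented task record */

/* Simplified reading of the Pandya-Malek result: if the RM utilization is at
   most 0.5, a single transient fault can be tolerated by re-executing the
   affected work before any deadline is missed.  Only the bound is checked. */
int tolerates_single_transient_fault(const task_t *tasks, int n)
{
    double u = 0.0;
    for (int i = 0; i < n; i++)
        u += tasks[i].wcet / tasks[i].period;
    return u <= 0.5;
}

int main(void)
{
    task_t set[] = { {20.0, 3.0}, {50.0, 10.0}, {100.0, 12.0} };   /* U = 0.47 */
    printf("single-fault recovery guaranteed: %s\n",
           tolerates_single_transient_fault(set, 3) ? "yes" : "no");
    return 0;
}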

Figure 2. (a) Valid and (b) invalid primary/backup schedules for aperiodic tasks [14]

When a periodic task fails, the on-line recovery mechanism generates an aperiodic task request corresponding to the failed task or to a simpler backup. This aperiodic request must be serviced within the deadline of the failed periodic task. The slack-stealing algorithm [30] computes the available slack at each priority level in a system of periodic tasks and attempts to use that slack to schedule aperiodic requests with hard deadlines.

Many dynamic real-time systems must also deal with aperiodic task requests with hard deadlines generated by external events. For example, an aperiodic request (or alarm) may be generated if a car's engine temperature exceeds a safe value. Fault-tolerant scheduling of nonpreemptive, independent, and aperiodic real-time tasks is studied in [14] using a primary/backup approach. When a task arrives in the system, the primary is scheduled as early as possible and the backup is scheduled after the primary but before the task's deadline. When a primary completes successfully, its corresponding backup is deallocated. Multiple backups can overlap as long as their respective primaries are not scheduled on the same processor. Let Tj^p and Tj^b represent the primary and backup versions of a task Tj, respectively. Figure 2(a) shows a valid schedule with backup overlap, where the overlapped regions are shaded black. Since T3^b and T4^b overlap in Fig. 2(b), that schedule cannot tolerate a permanent fault affecting P3.

Transient overloads can cause tasks to miss deadlines in dynamically scheduled real-time systems. Buttazzo [6] classifies schemes that handle task overload into best-effort, guarantee, and robust classes. Best-effort schemes always accept new tasks, which are then assigned execution priorities in order of importance. In guarantee schemes, the task set comprising the new task Tj and the previously guaranteed tasks is checked for schedulability; if the resulting set is schedulable, Tj is accepted, else it is rejected.
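A guarantee-style acceptance test can be sketched as a utilization check: a new task is admitted only if the previously guaranteed tasks plus the newcomer remain schedulable. The sketch below assumes EDF on a single processor (so the test is U ≤ 1) and uses invented task records; a deployed scheduler would use a more precise feasibility test.

#include <stdio.h>

#define MAX_TASKS 32

typedef struct { double util; int priority; } admitted_t;   /* invented record */

static admitted_t admitted[MAX_TASKS];
static int n_admitted = 0;

/* Guarantee scheme: accept the new task only if the previously guaranteed
   tasks plus the newcomer remain EDF-schedulable (total utilization <= 1). */
int guarantee_accept(double new_util, int priority)
{
    double u = new_util;
    for (int i = 0; i < n_admitted; i++)
        u += admitted[i].util;
    if (u > 1.0 || n_admitted == MAX_TASKS)
        return 0;                         /* reject: the set would be overloaded */
    admitted[n_admitted].util = new_util;
    admitted[n_admitted].priority = priority;
    n_admitted++;
    return 1;
    /* A robust scheme would instead try to evict previously accepted tasks of
       lower priority until the newcomer fits, rejecting it only if no such
       set of victims exists. */
}

int main(void)
{
    printf("%d\n", guarantee_accept(0.6, 1));   /* accepted */
    printf("%d\n", guarantee_accept(0.3, 3));   /* accepted */
    printf("%d\n", guarantee_accept(0.2, 0));   /* rejected under the guarantee policy */
    return 0;
}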

Robust schemes use different policies for accepting and rejecting tasks: if the schedulability of Tj is verified, Tj is accepted; if the test fails, one or more (possibly different) tasks are rejected based on task priority. Simple guarantee schemes do not consider task priority and always reject new tasks under overload conditions, whereas robust schemes may reject previously accepted low-priority tasks in favor of the new task.

The robust earliest deadline (RED) algorithm deals with aperiodic tasks whose deadlines are flexible, that is, tasks that produce safe results even when they miss their deadlines by a specified tolerance value [7]. A task's (primary) deadline plus the tolerance provides a secondary deadline used as the acceptance criterion under overload conditions. Tasks are scheduled using the primary deadline but accepted based on their secondary deadlines.

If an aperiodic task Tj's execution cannot be guaranteed by a processor in a distributed system, the task can be transferred to a processor estimated to have sufficient resources and time to complete it before its deadline [35]. Tj's transfer can also be based on bids received from lightly loaded processors, with Tj sent to the processor deemed most likely to complete it within the deadline. Other methods proposed in [15] and [19] also perform load balancing in real-time distributed systems. The authors of [2] present an algorithm that attempts to dynamically distribute the workload of a failed processor to other lightly loaded processors in the system. Tridandapani et al. [41] propose low-overhead methods that use the spare capacity of lightly loaded processors to detect and localize faults. When a new task arrives at a processor, the primary version of the task is started on any idle processor. Simultaneously, a backup task is started on other idle processors. Comparison of the task results determines the faulty processor(s). If processors remain idle with no new task arrivals, dummy tasks are run on those processors to detect hardware faults.

The Spring kernel is a research operating system that provides fault tolerance and graceful degradation under dynamic scheduling using some of the methods surveyed in this section [12]. The Spring scheduler chooses a suitable fault-tolerance technique from a set of alternatives for each arriving task. When a new task arrives in the system, the scheduler uses the task's deadline, redundancy level, and similar attributes to guarantee its execution. The scheduler attempts to build a feasible schedule using the fault-tolerance scheme specified for that task. If a feasible schedule is not found, an alternate scheme is used. If no execution guarantees can be given, the task is rejected. Fault-tolerant options for arriving tasks include triple modular redundancy, primary/backup, and primary/exception. Under triple modular redundancy, fault tolerance is provided by scheduling three copies of the task on different processors and comparing their results. Under primary/backup, two copies of the task are scheduled on different processors, and the backup is activated only if the primary provides incorrect results. In the primary/exception method, an exception handler is invoked if the primary fails.
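The voting step in the triple-modular-redundancy option can be sketched as a simple two-out-of-three majority over the replica results; this generic voter is illustrative only and is not taken from the Spring implementation.

#include <stdio.h>

/* Two-out-of-three majority voter over replica results: the value returned by
   at least two replicas is taken as correct; with three distinct values no
   majority exists and an error is signalled. */
int tmr_vote(int r0, int r1, int r2, int *voted)
{
    if (r0 == r1 || r0 == r2) { *voted = r0; return 1; }
    if (r1 == r2)             { *voted = r1; return 1; }
    return 0;
}

int main(void)
{
    int v;
    if (tmr_vote(42, 42, 17, &v))     /* a single faulty replica is masked */
        printf("voted result: %d\n", v);
    if (!tmr_vote(1, 2, 3, &v))
        printf("no majority\n");
    return 0;
}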

Figure 3. Example schedules for processor Pk: (a) Backup schedule SB, (b) primary schedule SP, (c) schedule S1 when b1 is activated, (d) schedule S2 when b1 and b2 are activated; (e) backup schedule when b4 and b5 do not overlap, and (f) backup schedule when b4 and b5 overlap


4. FAULT-TOLERANT STATIC SCHEDULING

Static systems reserve sufficient slack in the schedule and use on-line mechanisms to reconfigure the system so that task deadlines are met despite permanent and/or transient failures. Methods to tolerate failures in statically scheduled systems use three main approaches: primary/contingency scheduling, masking, and on-line rescheduling.

In the primary/contingency scheduling method, the primary schedule is executed in the fault-free case. Contingency schedules are precomputed for potential failures, and the appropriate contingency schedule is activated when failures actually occur. The contingency schedules ensure that tasks continue to meet their specified deadlines. Since contingency schedules are precomputed, the activation overhead is small. The methods proposed in [18] and [22] fall in this category.

In masking schemes [28], n versions of a task are scheduled on a set of processors and are always active, that is, all task versions execute irrespective of system failures. Since computing resources are preallocated to all versions, masking schemes may incur substantial hardware overhead.

On-line rescheduling is a hybrid approach in which a static schedule is first obtained using an off-line scheme. To process aperiodic requests, an on-line scheduler shifts the execution of the previously scheduled periodic tasks appropriately without violating their deadlines [10].

This task shifting assumes that there is sufficient spare (unused) capacity on the processor. A task failure on a processor generates an aperiodic request to execute the corresponding backup. The processor's on-line algorithm tries to process this request by reordering certain periodic tasks so that the backup executes within its deadline.

Krishna and Shin [18] tolerate permanent processor failures by means of primary/contingency schedules. Ghosts, or backup copies of tasks, are embedded in the schedule and activated when the processors allocated to their corresponding primaries fail. To tolerate nfail permanent failures, task copies, both primary and backup, are scheduled on nfail processors with no two copies of the same task allocated to the same processor. Backups of two tasks can overlap in the schedule of a processor Pk if no other processor is allocated a copy of both tasks. Primaries can overlap backups in Pk's schedule only if there is sufficient slack to ensure that all activated backups and primaries allocated to Pk meet their deadlines. Assume primaries p1, p2 and backups b1, b2 are allocated to Pk. Figures 3(a) and 3(b) show the backup and primary schedules SB and SP, respectively. Figure 3(c) gives the resulting schedule S1 when b1 is activated; note that p1 is right-shifted to accommodate b1. Similarly, Fig. 3(d) shows the schedule S2 when both b1 and b2 are activated. The backup b2 preempts p1 if it has an earlier deadline.

A simple fault-tolerant scheduling approach is to schedule the entire task set, that is, both primaries and backups, to obtain S2 in Fig. 3(d). The primaries can always be executed in the positions specified by S2. However, this needlessly delays primaries when backups are not activated. To execute primaries as soon as possible, priorities are first assigned to primaries in the order in which they finish executing in S2. The prioritized primaries are then scheduled by assigning the processor to the highest-priority primary that has been released but not yet completed, yielding SP in Fig. 3(b). Thus, SP is the primary schedule, whereas S1 and S2 are contingency schedules activated when processor(s) fail. Backups b4 and b5 can overlap if their corresponding primaries are not allocated to the same processor (Fig. 3(f)); otherwise they cannot (Fig. 3(e)).

In the fault-tolerant scheduling approach proposed in [22], tasks are assigned levels based on their periods as follows. Let all tasks with period p be assigned level i. Then tasks in level i + 1 have period m·p for some integer m ≥ 2. In Fig. 4(a), tasks with period 15 ms belong to level 1 and tasks with periods 30 ms and 60 ms belong to levels 2 and 3, respectively. First, the backups of all level-1 tasks are scheduled. Then we schedule the maximum number of level-1 primaries that fit in the remaining time, ensuring that a backup is not scheduled earlier than its corresponding primary. The resulting schedule for the level-1 tasks is S1.

Figure 4. (a) Level assignment to tasks based on their periods; (b)-(e) illustration of the fault-tolerant scheduling method proposed in [22]

Two S1 schedules are then concatenated to get a provisional schedule S2, which is modified by removing the minimum number of level-1 primaries such that all level-2 backups can be scheduled. If S2 has enough idle time, the level-2 primaries with the smallest execution times are also scheduled. If any unscheduled level-2 primary has a lower execution time than some scheduled level-1 primary in S2, the level-1 primary with the largest execution time is dropped and replaced in S2 by the level-2 primary. Once S2 is constructed, two S2 schedules are concatenated to get S3, and so on. This algorithm schedules either a primary and a backup, or only a backup, for each periodic task in the system.

We illustrate the above procedure using a system of two tasks T1 and T2 with periods 6 ms and 12 ms, respectively. Let the primary and backup versions of T1 be T1^p(6, 4) and T1^b(6, 2), respectively. Similarly, T2 has primary and backup versions T2^p(12, 3) and T2^b(12, 2). T1 is placed in level 1 while T2 is placed in level 2. As the first step, T1^p and T1^b of the level-1 task are scheduled to obtain S1 in Fig. 4(b). The provisional schedule in Fig. 4(c) is then obtained by concatenating two copies of S1. In order to schedule T2^b, one copy of T1^p is dropped from the schedule in Fig. 4(c) to get Fig. 4(d). However, T2^p is still not schedulable due to insufficient idle time in this schedule. The remaining T1^p is therefore dropped from the schedule in Fig. 4(d) and replaced by T2^p, since T1^p has a longer execution time than T2^p. The final schedule S2 shown in Fig. 4(e) executes T1's backup version every 6 ms, while both versions of T2 are executed every 12 ms. We also note that if T2^p executes correctly, the space reserved for T2^b can be deallocated.

In the case of S2, this space can be used to execute T1^p and T1^b during the time interval [6, 12] ms, thereby increasing the level of fault tolerance. The authors of [22] propose an on-line algorithm that reallocates the space reserved for a backup whose primary has executed successfully, thereby increasing the number of primaries executed during the remainder of the schedule.

In masking schemes, tasks are duplicated and scheduled on multiple processors. The problem of scheduling non-preemptive independent tasks with duplication on m ≥ 3 processors to tolerate one failure is NP-complete [28]. The authors of [28] present a static scheduling heuristic for this problem using the primary/backup approach. The primaries of the tasks are first scheduled on the m processors to obtain the primary schedules, which are sorted in order of nonincreasing schedule length. The primary schedules are then duplicated to form m backup schedules. Each backup schedule is appended to a primary schedule on a processor using the following rules: a backup task must not be scheduled on the same processor as its primary, and the execution of a primary task and its backup on different processors must not overlap in time. This generates a schedule that tolerates one processor failure.

The mode-change method used to adapt to changes in the mission profile of a real-time system can also be used to reconfigure the system after permanent processor failures. Real-time systems may have to operate in different modes during their mission lifetime. For example, an aircraft control system performs different tasks during the takeoff, cruise, and landing phases. Such mode changes in statically scheduled systems are handled by precomputing schedules for each mode of operation and switching between the appropriate schedules at run time [11]. Transitions between modes can have timing constraints, and a decision has to be made when to effect the transition. For example, when a mode change is requested, it can be effected immediately, that is, the current system schedule is discarded, or it may be delayed until all tasks in the current schedule finish.
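The two placement rules used by the heuristic of [28] can be checked mechanically for a candidate assignment. The following sketch assumes an invented record giving the processor and time window of each task's primary and backup; it validates a schedule rather than constructing one.

#include <stdio.h>

/* Invented placement record: processor and time window of a task's primary
   and backup copies (times in arbitrary units). */
typedef struct {
    int    prim_proc, back_proc;
    double prim_start, prim_finish;
    double back_start, back_finish;
} placement_t;

/* Rules from [28]: (1) a backup must not share a processor with its primary;
   (2) a primary and its backup must not overlap in time. */
int placement_valid(const placement_t *p, int n)
{
    for (int i = 0; i < n; i++) {
        if (p[i].prim_proc == p[i].back_proc)
            return 0;                               /* rule 1 violated */
        if (p[i].prim_finish > p[i].back_start &&
            p[i].back_finish > p[i].prim_start)
            return 0;                               /* rule 2: time overlap */
    }
    return 1;
}

int main(void)
{
    placement_t sched[] = {
        { 0, 1, 0.0, 4.0, 5.0, 9.0 },   /* fine: different processors, disjoint times */
        { 1, 0, 2.0, 6.0, 3.0, 7.0 },   /* violates rule 2: the copies overlap */
    };
    printf("schedule valid: %s\n", placement_valid(sched, 2) ? "yes" : "no");
    return 0;
}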

5. IMPRECISE SCHEDULING

Intermediate or partial results from task computations can be used instead of more precise final results when a real-time system suffers failures or transient overloads. The concept of using partial results when exact results cannot be produced within the deadline was studied by Lin et al. [23] using an imprecise computation model. Real-time application areas for imprecise computations include signal processing, machine vision, and linear control systems. The authors of [23] define two approaches to imprecise scheduling.

In the milestone approach, partial results are obtained at different execution points in a computation; if the deadline is reached, the last recorded values form the task output. This method assumes that the precision of the results increases monotonically with time, that is, the longer a computation executes, the more precise its results become. Milestones are specified using programming primitives, thereby allowing system designers to explicitly save partial results of selected program variables.

The sieve approach is based on iterative functions where each iteration computes a closer approximation of the final answer; a well-known example is the Newton-Raphson method for finding the roots of a polynomial. If the deadline is close, such functions can skip one or more iterations to produce a result within the time limit, and the reduced precision of the final result is not catastrophic to system correctness. The sieve approach implicitly specifies the imprecise results that can be obtained. A real-time task can integrate both the milestone and sieve approaches as follows: a task Tj's mandatory portion Tj^m can use the milestone approach to produce acceptable results, while its optional portion Tj^o can use the sieve approach to improve them. The execution of Tj^o depends on the availability of computing resources and on time constraints.

Liu et al. [25] propose algorithms for scheduling imprecise computations in a real-time system. We briefly discuss their method of scheduling imprecise periodic tasks on a single processor. The scheduling algorithm is preemptive and priority-driven, and the optional portions of tasks are executed only after all ready mandatory portions have completed. Let J = {T1(p1, c1), ..., Tn(pn, cn)} be a set of imprecise periodic tasks. The execution time of task Ti is given by ci = ci^m + ci^o, where ci^m and ci^o are the execution times of Ti's mandatory and optional portions, respectively. The task set J is split into a mandatory set M = {T1^m(p1, c1^m), ..., Tn^m(pn, cn^m)} and an optional set O = {T1^o(p1, c1^o), ..., Tn^o(pn, cn^o)}, where all tasks in M have higher priorities than the tasks in O. The utilization of the mandatory set M of n tasks is

UM = c1^m/p1 + c2^m/p2 + ... + cn^m/pn

If UM ≤ n(2^(1/n) - 1), then M is schedulable by the RM algorithm regardless of the tasks in O. The remaining fraction 1 - UM of the processing power can be used to execute tasks in O. The EDF algorithm can also be used to schedule the tasks in O by assigning priorities to optional task portions dynamically, depending on their deadlines.
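As a toy illustration of the sieve idea, the sketch below refines a square-root estimate by Newton-Raphson iterations until it either converges (a precise result) or exhausts an iteration budget standing in for the time left before the deadline, in which case the last milestone is returned as the imprecise result. The budget and tolerance values are invented.

#include <stdio.h>
#include <math.h>

/* Sieve-style computation: each Newton-Raphson iteration refines the answer;
   if the iteration budget (a stand-in for the time left before the deadline)
   runs out first, the latest milestone is returned as the imprecise result. */
double sqrt_by_deadline(double x, int budget, int *precise)
{
    double milestone = x > 1.0 ? x : 1.0;        /* initial guess, first milestone */
    *precise = 0;
    for (int i = 0; i < budget; i++) {
        double next = 0.5 * (milestone + x / milestone);
        if (fabs(next - milestone) < 1e-12) {    /* converged: precise result */
            *precise = 1;
            return next;
        }
        milestone = next;                        /* record the partial result */
    }
    return milestone;                            /* budget exhausted: imprecise */
}

int main(void)
{
    int precise;
    double r1 = sqrt_by_deadline(2.0, 50, &precise);   /* ample budget: precise */
    printf("%.12f precise=%d\n", r1, precise);
    double r2 = sqrt_by_deadline(2.0, 2, &precise);    /* tight budget: imprecise */
    printf("%.12f precise=%d\n", r2, precise);
    return 0;
}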

Tasks structured into mandatory and optional portions belong to a class called increased reward with increased service (IRIS) tasks [19]. The reward function of an IRIS task increases with its execution time, and tasks are scheduled to maximize the reward while ensuring that the mandatory parts of all tasks are completed.

Figure 5. Fixing computational levels of tasks (a) when task enters queue and (b) before task enters CPU [27]

An example of a linear reward function is one where executing a unit of optional work generates a reward of one unit. It has been shown that scheduling IRIS tasks is NP-complete when request times, deadlines, and reward functions are arbitrary. However, for simple reward functions such as the linear reward function, optimal polynomial-time scheduling algorithms have been proposed [37].

Imprecise scheduling can also handle transient overload in dynamic real-time systems via a queueing-theoretic approach [27]. When the load is normal, the system computes precise results, that is, both the mandatory and optional portions of all tasks are executed. An unexpected external event can generate sporadic and aperiodic task requests, thereby increasing the system load. In such cases, some tasks generate imprecise results by executing only their mandatory portions to ensure the timing correctness of the system. For each task Tj, an on-line algorithm decides its computational level, that is, the algorithm directs Tj to produce precise or imprecise results depending on the current system load. After deciding Tj's computational level, the system is notified about the precision of the computation.

The system is modeled as a queueing system (Fig. 5) consisting of a queue buffer of infinite size and a processor. Periods of normal and high task request rates simulate a normal and an overloaded system, respectively. The execution times of the mandatory and optional task portions are assumed to be random. The scheduling algorithms proposed in [27] attempt to bound task response times under heavy system load and to notify clients about the precision of the computation as soon as possible.

Figure 5(a) shows an approach where every task Tj executes only its mandatory portion if the system load (computed from the task arrival rate) is heavy. Thus, the computational level of Tj is fixed at ta(Tj), where ta(Tj) is the time at which Tj enters the queue.
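A minimal sketch of this arrival-time policy of Fig. 5(a) follows: the computational level is fixed when the task is enqueued, using a smoothed arrival-rate estimate as the load measure. The estimator, threshold, and arrival trace are invented for illustration.

#include <stdio.h>

typedef enum { PRECISE, IMPRECISE } level_t;

/* Invented load estimator: exponentially smoothed arrival rate. */
typedef struct { double rate; double last_arrival; } load_est_t;

static void observe_arrival(load_est_t *e, double now)
{
    if (e->last_arrival >= 0.0) {
        double inst = 1.0 / (now - e->last_arrival);   /* instantaneous rate */
        e->rate = 0.9 * e->rate + 0.1 * inst;          /* smooth it */
    }
    e->last_arrival = now;
}

/* Fig. 5(a)-style policy: fix the computational level at enqueue time; under
   heavy load only the task's mandatory portion will be executed. */
static level_t level_at_enqueue(const load_est_t *e, double heavy_rate)
{
    return (e->rate > heavy_rate) ? IMPRECISE : PRECISE;
}

int main(void)
{
    load_est_t est = { 0.0, -1.0 };
    double heavy_rate = 5.0;                           /* arrivals/ms deemed heavy */
    double arrivals[] = { 0.0, 0.05, 0.10, 0.15, 0.20 };
    for (int i = 0; i < 5; i++) {
        observe_arrival(&est, arrivals[i]);
        printf("t=%.2f  rate=%.1f  level=%s\n", arrivals[i], est.rate,
               level_at_enqueue(&est, heavy_rate) == PRECISE ? "precise" : "imprecise");
    }
    return 0;
}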

This approach sends early precision information to the external environment but is pessimistic in the sense that it may unnecessarily cause tasks to be computed imprecisely. The approach shown in Fig. 5(b) fixes the computational level of Tj at te(Tj), the time immediately before Tj enters the CPU. The number of tasks in the queue, N, is used as a measure of the response time, and H is a threshold parameter for N. If N ≤ H at te(Tj), then Tj is executed precisely; otherwise the task is executed imprecisely. A disadvantage of this approach is that the precision level of a task is decided very late. To overcome the shortcomings of the previous two approaches, a third approach assigns Tj's computational level at any time during its queued life (ta(Tj), te(Tj)): if, during this interval, the number of tasks queued behind Tj becomes greater than or equal to H, then Tj is executed imprecisely.

The milestone and sieve approaches have been implemented in a research execution environment called Concord [23], which provides programming primitives and system support for tasks to return imprecise results. The SIFT [38] computer for aircraft control treats each task as a sequence of iterations and executes each iteration on multiple processors. The partial result of each iteration is obtained by majority voting on the results produced by all processors; thus SIFT uses partial results to check the correctness of the ongoing computation. Anytime algorithms, which are similar in concept to imprecise scheduling, are used for time-constrained planning in artificial intelligence [8]. Researchers have also applied imprecise computations to databases to obtain approximate answers to queries [39].

In conclusion, we note that although researchers have studied methods to formulate computations in iterative form [3], neither the milestone nor the sieve method may be feasible for certain real-time applications. In such cases, the primary/backup approaches discussed in Section 3 can be used to trade off result quality against meeting task deadlines.

6. CONCLUSIONS

Many real-time embedded systems have to provide a high degree of fault tolerance using modest hardware resources due to cost and power consumption restrictions. Scheduling algorithms can provide low-cost solutions to fault tolerance, graceful performance degradation, and load shedding in such systems by exploiting tradeoffs between space or time redundancy, timing accuracy, and quality of service. We have divided fault-tolerant scheduling algorithms into three broad groups: dynamic, static, and imprecise. A dynamic scheduler is flexible and adapts quickly to environmental changes since it makes its decisions at run time.

Since dynamic algorithms react to unexpected system events, they are susceptible to transient overloads. Such algorithms are also inadequate for scheduling tasks with precedence, synchronization, and mutual exclusion constraints executing in a distributed environment; schedules for such task sets are calculated off-line and fixed for the life of the system. Static systems are very inflexible and do not adapt well to environmental changes. Fault tolerance is typically provided in static systems by primary/contingency scheduling and fault masking. Primary/contingency scheduling provides limited flexibility in a static system by switching between precomputed alternate schedules in case of failures. The masking technique schedules multiple versions of a task on different processors to tolerate failures; since computing resources are preallocated to all task versions, the hardware overhead in masking schemes can be substantial. The imprecise methods use partial results when exact results cannot be produced within the deadline. This approach is feasible only if the corresponding computation can be formulated as a sequence of iterations.

The price paid for the flexibility and fault tolerance of the scheduling methods discussed in this paper is unpredictability. Though these methods ensure that task deadlines are met, there can be substantial variations in task response times, the precision of results, and the like under different failure assumptions. Such variations can cause instabilities in the physical system, leading to performance degradation or even safety hazards. Dynamic scheduling methods that make most of their decisions on-line are clearly the most flexible, but they are also costly to implement and difficult to analyze with respect to predictability. The amount of flexibility permitted in a dynamically scheduled safety-critical system must be tightly controlled and must lend itself to thorough off-line analysis of the system's performance under failure conditions. The resulting tradeoffs between flexibility, fault tolerance, and system predictability need further research before such dynamic scheduling methods find their way into many real-time embedded applications.

7. REFERENCES

[1] A. Avizienis, The N-version approach to fault-tolerant systems, IEEE Trans. Software Eng., vol. SE-11, pp. 1491-1501, Dec. 1985.
[2] S. Balaji et al., Workload redistribution for fault-tolerance in a hard real-time distributed computing system, Proc. Fault-Tolerant Comput. Symp., pp. 366-383, 1989.
[3] S. K. Basu, On development of iterative programs from function specifications, IEEE Trans. Software Eng., vol. SE-6, pp. 170-182, Mar. 1980.
[4] P. Brucker, Scheduling Algorithms, Springer-Verlag, New York, 1995.
[5] A. Burns et al., Feasibility analysis for fault-tolerant real-time task sets, Proc. Euromicro Workshop on Real-Time Systems, 1996.
[6] G. C. Buttazzo, Predictable Scheduling Algorithms and Applications, Kluwer, Boston, 1997.
[7] G. C. Buttazzo and J. Stankovic, RED: A robust earliest deadline scheduling algorithm, Proc. Third Int. Workshop on Responsive Computing Syst., 1993.
[8] T. L. Dean and M. Boddy, An analysis of time-dependent planning, Proc. Nat. Conf. Artificial Intelligence, pp. 49-54, 1988.
[9] H. El-Rewini et al., Task Scheduling in Parallel and Distributed Systems, Prentice Hall, Englewood Cliffs, NJ, 1994.
[10] G. Fohler, Joint scheduling of distributed complex periodic and hard aperiodic tasks in statically scheduled systems, Proc. Real-Time Systems Symp., pp. 152-161, 1995.
[11] G. Fohler, Changing operational modes in the context of pre run-time scheduling, IEICE Trans. Inf. & Syst., vol. E76-D, pp. 1333-1340, Nov. 1993.
[12] O. Gonzalez et al., Adaptive fault tolerance and graceful degradation under dynamic hard real-time scheduling, Proc. Real-Time Systems Symp., pp. 79-89, 1997.
[13] S. Ghosh et al., Fault-tolerant rate-monotonic scheduling, Real-Time Systems, vol. 15, no. 2, pp. 149-181, Sep. 1998.
[14] S. Ghosh et al., Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems, IEEE Trans. Parallel and Dist. Syst., vol. 8, no. 3, pp. 272-284, Mar. 1997.
[15] C.-J. Hou and K. G. Shin, Load sharing with considerations of future arrivals in heterogeneous distributed real-time systems, Proc. Real-Time Systems Symp., pp. 94-103, Dec. 1991.
[16] H. Kopetz et al., Distributed fault-tolerant real-time systems: The MARS approach, IEEE Micro, vol. 9, no. 1, pp. 25-40, Feb. 1989.
[17] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, Kluwer, Boston, 1997.
[18] C. M. Krishna and K. G. Shin, On scheduling tasks with a quick recovery from failure, Proc. Fault-Tolerant Comput. Symp., pp. 234-239, 1985.
[19] C. M. Krishna and K. G. Shin, Real-Time Systems, McGraw-Hill, New York, 1997.
[20] J. P. Lehoczky, L. Sha, and Y. Ding, Enhancing aperiodic responsiveness in a hard real-time environment, Proc. Real-Time Systems Symp., pp. 261-270, 1987.
[21] S. Levi and A. K. Agrawala, Fault-Tolerant System Design, McGraw-Hill, New York, 1994.
[22] A. L. Liestman and R. H. Campbell, A fault-tolerant scheduling problem, IEEE Trans. Software Eng., vol. SE-12, no. 11, pp. 1089-1095, Nov. 1986.
[23] K. J. Lin et al., Imprecise results: Utilizing partial computations in real-time systems, Proc. Real-Time Systems Symp., pp. 210-217, Dec. 1987.
[24] C. L. Liu and J. Layland, Scheduling algorithms for multiprogramming in a hard real-time environment, J. Assoc. Comput. Mach., vol. 20, pp. 46-61, 1973.
[25] J. W. S. Liu et al., Algorithms for scheduling imprecise computations, IEEE Computer, vol. 24, no. 5, pp. 58-68, May 1991.
[26] M. Lyu (Ed.), Software Fault Tolerance, John Wiley, New York, 1995.
[27] S. Natarajan (Ed.), Imprecise and Approximate Computation, Kluwer, Boston, 1995.
[28] Y. Oh and S. Son, Scheduling hard real-time tasks with tolerance of multiple processor failures, Microprocessing and Microprogramming, vol. 40, pp. 193-206, Apr. 1994.
[29] Y. Oh and S. Son, Enhancing fault-tolerance in rate-monotonic scheduling, Real-Time Systems, pp. 315-330, Nov. 1994.
[30] S. H. Son (Ed.), Advances in Real-Time Systems, Prentice Hall, Englewood Cliffs, NJ, 1995.
[31] M. Pandya and M. Malek, Minimum achievable utilization for fault-tolerant processing of periodic tasks, IEEE Trans. Comput., vol. 47, pp. 1102-1113, Oct. 1998.
[32] D. T. Peng and K. G. Shin, Static allocation of periodic tasks with precedence constraints in distributed real-time systems, Proc. Int. Conf. on Distributed Computing Systems, pp. 190-198, June 1989.
[33] D. K. Pradhan, Fault-Tolerant Computer System Design, Prentice Hall, Englewood Cliffs, NJ, 1996.
[34] K. Ramamritham, Allocation and scheduling of complex periodic tasks, Proc. Int. Conf. on Distributed Computing Systems, pp. 108-115, June 1990.
[35] K. Ramamritham et al., Distributed scheduling of tasks with deadlines and resource requirements, IEEE Trans. Comput., vol. 38, pp. 1110-1123, Aug. 1989.
[36] S. Ramos-Thuel and J. K. Strosnider, The transient server approach to scheduling time-critical recovery operations, Proc. Real-Time Systems Symp., pp. 286-295, 1991.
[37] W. K. Shih et al., Algorithms for scheduling tasks to minimize total error, SIAM J. of Computing, vol. 20, pp. 537-552, 1989.
[38] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation, A K Peters, Natick, MA, 1998.
[39] K. P. Smith and J. W. S. Liu, Monotonically improving approximate answers to relational algebra queries, Proc. COMPSAC, pp. 234-241, 1989.
[40] J. A. Stankovic, Decentralized decision making for task reallocation in a hard real-time system, IEEE Trans. Comput., vol. 38, no. 3, pp. 341-355, Mar. 1989.
[41] S. Tridandapani et al., Low overhead multiprocessor allocation strategies exploiting system spare capacity for fault detection and location, IEEE Trans. Comput., vol. 44, no. 7, pp. 865-877, Jul. 1995.
[42] J. Xu and D. L. Parnas, Scheduling processes with release times, deadlines, precedence, and exclusion relations, IEEE Trans. Software Eng., vol. 16, pp. 360-369, Mar. 1990.
