Standard Article

Concurrent Engineering: Research and Applications
1–11
© The Author(s) 2016
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1063293X16679001
cer.sagepub.com

Energy conscious scheduling with controlled threshold for precedence-constrained tasks on heterogeneous clusters

Nirmal Kaur1,2, Savina Bansal3,4 and Rakesh Kumar Bansal3,4

Abstract
Efficient scheduling of concurrent tasks is one of the primary requirements for high-performance computing platforms. Recent advances in high-performance computing have resulted in widespread performance improvement, though at the cost of increased consumption of energy and other system resources. In this article, an energy conscious scheduling algorithm with controlled threshold has been developed for precedence-constrained tasks on heterogeneous clusters, which aims at lower makespan along with reduced energy consumption. The energy conscious scheduling with controlled threshold algorithm combines the benefits of dynamic voltage scaling with a controlled threshold-based duplication strategy to achieve its objectives. The effectiveness of the proposed algorithm is analyzed in comparison with available duplication- and non-duplication-based scheduling algorithms (with and without the dynamic voltage scaling approach) to ascertain its performance and energy consumption. Exhaustive simulation results on random and real-world graphs demonstrate that the energy conscious scheduling algorithm with controlled threshold has the potential to reduce energy consumption and makespan.

Keywords
heterogeneous computing, energy conscious scheduling, dynamic voltage scaling, duplication-based scheduling, power consumption

1 I.K. Gujral Punjab Technical University, Kapurthala, India
2 Department of CSE–UIET, Panjab University, Chandigarh, India
3 Department of ECE, Giani Zail Singh Campus College of Engineering & Technology, Bathinda, India
4 Maharaja Ranjit Singh Punjab Technical University, Bathinda, India

Corresponding author:
Nirmal Kaur, Department of CSE–UIET, Panjab University, Chandigarh 160014, India.
Email: nirmaljul19@gmail.com

Introduction

Traditionally, the main priority of high-performance computing (HPC) systems has been to enhance the system's computational speed. However, in recent times, the huge power consumption of multiprocessors, storage disks, interconnection networks, and memory devices has prompted researchers to design energy-aware solutions to reduce energy consumption. Current consumption, around 0.5% of the world's total power usage, is expected to increase four times by 2020 (Forrest, 2008). The power requirements of clusters typically range from 75 to 150–200 W/ft² and will probably increase to 200–300 W/ft² in the near future. Furthermore, the integration of nearly a billion transistors onto contemporary processors (e.g. Intel Itanium) will raise temperature levels, which may affect system reliability as well (Koch, 2005). The current dynamic energy consumption is about 50%, and it is predicted to rise to 70%–80% in coming years (Rahaman and Chowdhury, 2009). Apart from processor energy, network energy consumption is also quite significant and can be severe for communication intensive applications (Rahaman and Chowdhury, 2009; Zong et al., 2011). Excessive energy consumption results in increased greenhouse gas emissions and higher cooling costs to ensure system reliability, besides increasing the electricity bills. Task scheduling on parallel
computing system relates to mapping the tasks of an application on available processors to reduce overall completion time (makespan or schedule length). A non-judicious mapping heuristic may lead to redundant energy consumption. Duplication-based scheduling, as proposed by many researchers, is a viable solution for reducing the communication overhead by replicating predecessors to reduce makespan of an application (Bansal et al., 2003, 2005; Lai and Yang, 2008; Shin et al., 2008) and hence could be useful to reduce communication energy in interconnects. However, reduction in makespan is achieved at the cost of increased computation energy (due to duplication of predecessors).

This article attempts an energy conscious duplication scheduling under a flexible threshold to conserve energy and improve performance in a heterogeneous cluster (HC) system. The threshold is dynamic and is gradually adapted to the next best level till the desired makespan and energy reduction are achieved. The strength of the algorithm lies in duplicating immediate predecessors under a controlled threshold, keeping an eye on reducing makespan and energy consumption. The energy conscious scheduling algorithm with controlled threshold (ECSCT) uses energy control (EC) as a decisive parameter; if it is within the limits of a threshold, duplication is carried out, otherwise it is discarded. As a result, ECSCT can strike a good balance between performance (makespan) and energy consumption. Next, dynamic voltage scaling (DVS) is exploited for the idle time slots to further reduce processor energy consumption.

The article is organized as follows: section "Related work" presents the related work that underlies the existing energy-aware scheduling techniques. Section "Models" formulates the cluster, task, and energy model. Section "Proposed ECSCT algorithm" presents the proposed ECSCT algorithm. The performance metrics and task graph generation are discussed in section "Performance metrics and task graph generation." The subsequent section "Simulation setup and performance comparison" gives the simulation setup and comparison of ECSCT with other related work. Section "Conclusion" summarizes the conclusions and future scope of work.

Related work

Scheduling is an NP-complete problem, and as a consequence, scheduling heuristics are applied to get sub-optimal but practical schedules (Bala and Chana, 2015; Bansal et al., 2003, 2005; Lai and Yang, 2008; Shin et al., 2008; Topcuoglu et al., 2002). Duplication-based scheduling helps in reducing makespan of communication intensive applications but with increased computation energy. DVS has been primarily used to conserve energy of independent tasks on uniprocessor systems (Manzak and Chakrabarti, 2001; Yao et al., 1995), and later it became popular in homogeneous and heterogeneous multiprocessor systems too. This technique has also been exploited for conserving energy of dependent tasks on homogeneous and heterogeneous multiprocessors (Baskiyar and Abdel-Kader, 2010; Baskiyar and Palli, 2006; Kaur et al., 2015, 2016; Lee and Zomaya, 2009; Mori et al., 2009; Wang et al., 2013). Comito et al. (2013) proposed an energy-aware task allocation strategy to prolong battery life of mobiles. However, only a few energy-aware duplication strategies have been proposed so far, and those only for homogeneous clusters (Liu et al., 2014; Zong et al., 2011). These algorithms create groups of heavily communicating tasks and then allocate each group of tasks on a different processor to conserve energy. But these techniques are not always adaptable to schedule multiple task groups on a bounded number of heterogeneous processors. If fewer processors are available than are required to schedule multiple task groups, then some dummy processors are assumed, to which the leftover groups are assigned. Inclusion of dummy processors in HC raises concern about the competitiveness of the algorithm owing to the different capacities of processors.

As collected from the available works, most of the energy-aware scheduling techniques have focused on independent tasks, or on uniprocessor systems, or on homogeneous multiprocessors. Furthermore, energy reduction is handled mainly through managing processor energy for active states, overlooking processor idle energy. In addition, the communication energy of the interconnect has also been largely neglected in most of the prior works. Therefore, in this article, an energy conscious controlled duplication-based scheduling has been developed for precedence-constrained tasks on HC to improve performance and reduce total energy consumption, considering computation energy of the processor in active and idle states and communication energy of the interconnects as well.

Models

HC model

A cluster model is a set $P = \{p_k : k = 0, 1, 2, \ldots, p-1\}$ of "p" DVS-enabled processors/nodes connected in a fully connected topology. In a DVS cluster, each $p_k$ has discrete states, and each state has a voltage and frequency set. The voltage–frequency sets of all states are sorted in non-increasing order, such that $(v_1, f_1)$ and $(v_n, f_n)$ correspond to the highest and lowest voltage–frequency set among the discrete states. For each $p_k$, the set $\{p_k^{(v_d, f_d)}, p_k^{\mathrm{active}(v_d, f_d)}, p_k^{\mathrm{idle}(v_d, f_d)}\}$ represents the operating voltage–frequency, active and idle power consumption,

for $1 \le d \le n$. Communication links among processors are assumed to be without any contention, and computation and communication can take place concurrently. This assumption is justifiable as each processor in a modern cluster has a separate coprocessor that can be utilized to relieve the processor from communication chores. The processors are heterogeneous, so each task has a different execution time on each $p_k$.

Task model

A parallel application is assumed to be decomposed into multiple tasks with precedence constraints and is modeled by a directed acyclic graph (DAG), $G(T, E)$, where the vertex set $T = \{t_i : i = 0, 1, 2, \ldots, u-1\}$ is a set of tasks with a precedence relation such that $t_j$ cannot start until $t_i$ finishes its execution, and $E$, the set of edges, is a message set. Each task $t_i \in T$ has a computation cost, $w(t_i, p_k)$, which represents its execution time on $p_k$. Task executions are assumed to be non-preemptive. $c_{ij}$ is the communication cost to transfer data between tasks $t_i$ and $t_j$ scheduled on different processors. Local communication cost is assumed to be negligible in comparison with inter-processor communication cost. A task without any predecessor is taken as an entry task, and a task without any successor is taken as an exit task in the DAG.

Energy model

The energy model of homogeneous cluster (Zong et al., 2011) has been extended to cope with the different processing capabilities of processors in an HC system.

Computation energy model. Let $p_k^{\mathrm{active}(v_1, f_1)}$ be the active power consumption on $p_k$ for the highest voltage and frequency $(v_1, f_1)$. Processor active energy consumed to execute $t_i$ on $p_k$, denoted as $ep_{\mathrm{active}}(t_i, p_k)$, is given as follows

$$ep_{\mathrm{active}}(t_i, p_k) = p_k^{\mathrm{active}(v_1, f_1)} \times w(t_i, p_k) \tag{1}$$

Processor active energy consumed to execute all tasks in $T$ on $P$ is given by equation (2)

$$EP_{\mathrm{active}} = \sum_{i=1}^{|T|} \sum_{k=1}^{|P|} ep_{\mathrm{active}}(t_i, p_k) = \sum_{i=1}^{u} \sum_{k=1}^{p} \left( p_k^{\mathrm{active}(v_1, f_1)} \times x_{ik} \times w(t_i, p_k) \right) \tag{2}$$

Let $p_k^{\mathrm{idle}(v_1, f_1)}$ be the idle power consumption on $p_k$ under the highest voltage and frequency $(v_1, f_1)$. Idle energy consumed on $p_k$ corresponding to its idle state is given by equation (3)

$$ep_{\mathrm{idle}}^{k} = p_k^{\mathrm{idle}(v_1, f_1)} \times \left( \mathit{makespan} - \sum_{i=1}^{u} x_{ik} \times w(t_i, p_k) \right) \tag{3}$$

$x_{ik}$ is 1 if $t_i$ is scheduled on $p_k$, else 0. Total idle energy consumed on all processors is given as follows

$$EP_{\mathrm{idle}} = \sum_{k=1}^{p} ep_{\mathrm{idle}}^{k} = \sum_{k=1}^{p} p_k^{\mathrm{idle}(v_1, f_1)} \times \left( \mathit{makespan} - \sum_{i=1}^{u} x_{ik} \times w(t_i, p_k) \right) \tag{4}$$

Therefore, total processor energy dissipation of a parallel application is given as follows

$$EP = EP_{\mathrm{active}} + EP_{\mathrm{idle}} \tag{5}$$

This energy model is compatible with the DVS technique. In a DVS system, it is possible to reduce processor power consumption by lowering its maximum operating voltage–frequency $p_k^{(v_1, f_1)}$ to the best fit scaled down level $p_k^{(v_d, f_d)}$, for $2 \le d \le n$, without affecting the makespan of the application. In that case, we can replace $p_k^{\mathrm{active}(v_1, f_1)}$ and $p_k^{\mathrm{idle}(v_1, f_1)}$ with $p_k^{\mathrm{active}(v_d, f_d)}$ and $p_k^{\mathrm{idle}(v_d, f_d)}$. The new execution time $w_{\mathrm{new}}(t_i, p_k)$ of $t_i$ on $p_k$ at the scaled down voltage level $v_d$ is given as follows

$$w_{\mathrm{new}}(t_i, p_k) = \frac{v_1^2 \times w(t_i, p_k)}{v_d^2} \quad \text{for } 2 \le d \le n \tag{6}$$

Communication energy model. The network interconnect is assumed to be homogeneous, so the bandwidth to transfer data over the links interconnecting any pair of processors is the same, with the same consumption of communication power. Let $PL$ be the power consumption of a link interconnecting two processors. Communication energy $el_{ij}$ consumed to transfer data over the link from $t_i$ to $t_j$ scheduled on different nodes is given as follows

$$el_{ij} = PL \times c_{ij} \tag{7}$$

Communication energy consumption of the whole network interconnect is given by equation (8)

$$EL = \sum_{i=1}^{u} \sum_{\substack{j=1 \\ j \ne i}}^{u} \sum_{k=1}^{p} \sum_{\substack{m=1 \\ m \ne k}}^{p} \left( x_{ik} \times x_{jm} \times PL \times c_{ij} \right) \tag{8}$$

Consequently, total energy consumption (computation and communication) of a cluster is given as follows

$$ET = EP + EL \tag{9}$$
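The energy model of equations (1) to (9) can be illustrated with a minimal Python sketch. It assumes a schedule without duplicated tasks (each task is mapped to exactly one processor) and all processors running at $(v_1, f_1)$. The class `Processor`, the function names, and the three-task example are hypothetical and are not part of the original work; times are taken to be in seconds so that energies come out in joules.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    active_power: float  # active power at (v1, f1), in watts
    idle_power: float    # idle power at (v1, f1), in watts


def active_energy(schedule, exec_time, procs):
    """Equation (2): active power x execution time, summed over scheduled tasks."""
    return sum(procs[k].active_power * exec_time[t][k] for t, k in schedule.items())


def idle_energy(schedule, exec_time, procs, makespan):
    """Equation (4): idle power x (makespan - busy time), summed over processors."""
    total = 0.0
    for k, proc in enumerate(procs):
        busy = sum(exec_time[t][k] for t, kk in schedule.items() if kk == k)
        total += proc.idle_power * (makespan - busy)
    return total


def communication_energy(schedule, comm_cost, link_power):
    """Equation (8): link power x c_ij for edges whose endpoints sit on different processors."""
    return sum(link_power * c for (i, j), c in comm_cost.items()
               if schedule[i] != schedule[j])


def total_energy(schedule, exec_time, comm_cost, procs, link_power, makespan):
    """Equation (9): ET = EP + EL, with EP = EPactive + EPidle (equation (5))."""
    ep = active_energy(schedule, exec_time, procs) + idle_energy(
        schedule, exec_time, procs, makespan)
    return ep + communication_energy(schedule, comm_cost, link_power)


# Hypothetical example: three tasks on two identical-power processors.
procs = [Processor(100.0, 87.0), Processor(100.0, 87.0)]
exec_time = {0: [2.0, 3.0], 1: [4.0, 2.0], 2: [3.0, 3.0]}  # w(t_i, p_k)
schedule = {0: 0, 1: 1, 2: 0}                               # x_ik as task -> processor
comm_cost = {(0, 1): 1.5, (0, 2): 2.0}                      # c_ij per edge
et = total_energy(schedule, exec_time, comm_cost, procs, link_power=54.0, makespan=5.5)
print(et)
```

In this example the active energy is 700 J, the idle energy is 348 J, and only edge (0, 1) crosses processors, contributing 81 J, for a total of 1129 J.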

Table 1. Notations used in ECSCT algorithm.
$T$, $E$: Set of tasks and set of edges in the DAG
$t_i$, $p_k$: $i$th task $t_i \in T$ and $k$th node/processor $p_k \in P$
$c_{ij}$: Communication cost to transfer message from $t_i$ to $t_j$ scheduled on different processors
$w(t_i, p_k)$: Execution time of $i$th task on $k$th processor
$(v_1, f_1)$, $(v_n, f_n)$: Highest and lowest voltage–frequency among discrete states
$p_k^{\mathrm{active}(v_d, f_d)}$, $p_k^{\mathrm{idle}(v_d, f_d)}$: Active and idle power consumption on $p_k$ for voltage–frequency $(v_d, f_d)$, for $1 \le d \le n$
$PL$: Power consumption of interconnect
$ST(t_i, p_k)$, $FT(t_i, p_k)$: Start and finish time of $i$th task on $k$th processor without duplication
$ST\_d(t_i, p_k)$, $FT\_d(t_i, p_k)$: Start and finish time of $i$th task on $k$th processor with duplication of its immediate predecessors
$AST(t_i, p_k)$, $AFT(t_i, p_k)$: Actual start and finish time of $i$th task on $k$th processor
$EC(t_i, p_k)$: Energy control to be computed if duplication of predecessors of $t_i$ on $p_k$ reduces schedule length
$ep_{\mathrm{active}}(t_i, p_k)$, $EP_{\mathrm{active}}$: Active energy consumed to execute $i$th task on $k$th processor, total active energy consumed to execute all tasks
$ep_{\mathrm{idle}}^{k}$, $EP_{\mathrm{idle}}$: Idle energy consumed on $k$th processor, total idle energy consumed on all processors
$el_{ij}$: Communication energy consumed over a link to transfer data from $i$th to $j$th task running on different processors
$EP$, $EL$: Total energy consumption of processor (active and idle) and interconnect
ECSCT: energy conscious scheduling algorithm with controlled threshold; DAG: directed acyclic graph.
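
Equation (6) and the idea of a "best fit scaled down level" can be sketched as follows. This is only an illustration under stated assumptions: the helper names `scaled_execution_time` and `best_fit_state`, the notion of a time `slot` (execution time plus available slack), and the example voltages are hypothetical. State indices in the code are 0-based, so index 0 corresponds to $(v_1, f_1)$.

```python
def scaled_execution_time(w, v1, vd):
    """Equation (6): execution time of a task at the scaled-down voltage vd."""
    return (v1 ** 2) * w / (vd ** 2)


def best_fit_state(w, slot, voltages):
    """Pick the lowest-voltage state whose scaled execution time still fits in `slot`.

    w        -- execution time at the highest voltage voltages[0]
    slot     -- time available to the task without delaying the makespan (assumed given)
    voltages -- discrete voltages sorted in non-increasing order
    Returns (state index, scaled execution time); falls back to state 0.
    """
    v1 = voltages[0]
    best = (0, w)
    for d, vd in enumerate(voltages):
        t = scaled_execution_time(w, v1, vd)
        if t <= slot:
            best = (d, t)  # later states have lower voltage, so keep the last fit
    return best


# Illustrative voltages only; a task of 10 time units with 2 units of slack.
state, new_time = best_fit_state(10.0, 12.0, [1.50, 1.40, 1.10])
print(state, round(new_time, 3))
```

With these illustrative numbers, the 1.40 V state stretches the task to about 11.48 time units and still fits the slot, whereas the 1.10 V state would need about 18.6 time units and is rejected.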
Proposed ECSCT algorithm

In this article, an energy conscious duplication scheduling exploited under a controlled threshold is proposed to reduce energy and makespan in an HC system. The algorithm comprises two major phases. In the first phase, a priority-based task sequence is generated, while in the second phase, energy conscious duplication scheduling is done within a threshold, and this phase terminates once a feasible threshold is reached. ECSCT adopts the limited duplication strategy as done in heterogeneous limited duplication (HLD) (Bansal et al., 2005). HLD reduces the inter-task communication by replicating the immediate predecessors on multiple processors to reduce makespan. However, accommodating a duplicated task may increase the processor's active energy consumption, which has not been considered in Bansal et al. (2005). ECSCT includes an additional EC parameter which takes into account the impact of duplication on makespan and energy consumption and thus balances the makespan–energy trade-off. The threshold is upgraded dynamically to the next level till a feasible threshold is reached that corresponds to the desired makespan and energy reduction. Table 1 lists the notations used in ECSCT, and its pseudocode is given in Algorithm 1.

Generation of priority-based task sequence

In the first phase, priorities are assigned to every task of a DAG based on their bottom level $b(t)$ (Bansal et al., 2005), which is calculated recursively based on the mean costs, as given in equation (10)

$$b(t_i) = \bar{w}_i + \max_{t_j \in succ(t_i)} \left\{ b(t_j) + \bar{c}_{ij} \right\} \tag{10}$$

where $\bar{w}_i$ and $\bar{c}_{ij}$ are mean computation and mean communication cost, respectively. Tasks are then arranged in decreasing order of their bottom level, generating a task sequence satisfying precedence constraints.

Energy conscious duplication scheduling

In the second phase, energy conscious duplication scheduling is employed under a controlled threshold till a feasible threshold is reached. Initially, temp_threshold and upper_threshold are set to 5,000,000 and 0, respectively. ECSCT selects an unscheduled task $t_i$ from the ordered task sequence and computes its start and finish time on each $p_k \in P$ with and without duplication of its immediate predecessors (step 4). If duplication is useful (step 5), then $EC(t_i, p_k)$ is computed to analyze whether the impact of duplication on makespan and energy consumption is within the limits of temp_threshold. If yes, duplication is allowed, else it is discarded on $p_k$. Task $t_i$ is scheduled on the processor $p_k$ that gives the minimum earliest finish time, and the EC value (due to duplication of predecessors) on the scheduled $p_k$ is maintained in a threshold array (i.e. Th-list). The min_sch_len is the temp_sch_len generated under the initial temp_threshold. After all tasks are scheduled, temp and upper thresholds are revised from the threshold array (step 7). If the computed upper threshold is zero, ECSCT returns the upper threshold as the feasible threshold and terminates. Otherwise, an additional step r2 is used to compute the feasible threshold. In this step, temp_sch_len is computed (steps 2–6) under the revised temp_threshold, and it is checked whether the generated temp_sch_len is less than or equal to min_sch_len and the total energy consumption at temp_sch_len, that is, E_temp_sch_len is less than or equal to total energy

consumption of HLD_DVS, that is, E_hld_dvs. If yes, the feasible threshold is attained. If no, temp_threshold is updated with next_threshold and temp_sch_len is computed under the updated temp_threshold until the desired objective is met. Furthermore, DVS is applied to the scheduled tasks as well as to the task replicas created on different processors to further reduce processor energy consumption while maintaining temp_sch_len.

Algorithm 1. Pseudocode of ECSCT algorithm

Input: G (T, E), processor and network power, temp_threshold = 5000000, upper_threshold = 0, x = 0, Th-list [x] = {0}, y = 0, temp_Th-list [y] = {0};
Output: feasible_threshold
Step 1: Priority based task sequence generation
Step 2: ti = first unscheduled task from ordered task sequence;
Step 3: for all pk ∈ P
Step 4:   find ST (ti, pk), FT (ti, pk) and ST_d (ti, pk), FT_d (ti, pk) // start and finish time of ti on pk without and with duplication
Step 5:   if (FT_d (ti, pk) < FT (ti, pk)) // duplication reduces schedule length
          {
            reduced_sch_len (ti, pk) = FT (ti, pk) − FT_d (ti, pk)
            EC (ti, pk) = (epactive (tn, pk) − elni) / reduced_sch_len (ti, pk), where tn = immediate predecessors of ti duplicated on pk
            temp_Th-list [y++] = EC (ti, pk)
            if (EC (ti, pk) <= temp_threshold) // duplication allowed on pk
              AST (ti, pk) = ST_d (ti, pk) & AFT (ti, pk) = FT_d (ti, pk)
            else // ignore duplications on pk
              AST (ti, pk) = ST (ti, pk) & AFT (ti, pk) = FT (ti, pk)
          }
          else // duplication does not reduce schedule length
            AST (ti, pk) = ST (ti, pk) & AFT (ti, pk) = FT (ti, pk)
Step 6: schedule ti on pk with minimum actual finish time and undo duplications (if any) on any other pk. Record Th-list [x++] = EC (ti, pk)
        while (unscheduled tasks exist in task sequence)
        record temp_sch_len
Step 7: revise temp_threshold and upper_threshold with minimum and maximum threshold value from Th-list [x] respectively
Step 8: Call Get_feasible_threshold (temp_threshold, upper_threshold)
        return feasible_threshold

Get_feasible_threshold (temp_threshold, upper_threshold)
Step r1: if (upper_threshold == 0)
           return feasible_threshold = upper_threshold;
Step r2: else
           while (temp_threshold <= upper_threshold)
           { y = 0; temp_Th-list [y] = {0};
             repeat steps 2 to 6 to get temp_sch_len under temp_threshold
             if (temp_sch_len <= min_sch_len && E_temp_sch_len <= E_hld_dvs)
               return feasible_threshold = temp_threshold
               break
             else
             { // compute next threshold
               next_threshold = 5000000
               for (z = 0; z < y; z++)
                 if (temp_Th-list [z] < next_threshold && temp_Th-list [z] > temp_threshold)
                   next_threshold = temp_Th-list [z];
               temp_threshold = next_threshold
             }
           } // end while

Performance metrics and task graph generation

The performance of the algorithm is analyzed in terms of makespan and energy consumption comprising computation (active and idle) energy and communication energy of the interconnect. As a large set of random and real-world task graphs is simulated, makespan is normalized with respect to an absolute lower bound value, termed the normalized schedule length (NSL). The energy reduction effectiveness of ECSCT over others is ascertained based on the parameter energy saving (ES), given as follows

$$ES = \frac{E_{REF} - E_{ECSCT}}{E_{REF}} \tag{11}$$

where $E_{REF}$ represents the energy consumption of any reference algorithm under consideration, and $E_{ECSCT}$ is the energy consumption of the proposed algorithm. The algorithms used for comparison are Heterogeneous Earliest Finish Time (HEFT) (Topcuoglu et al., 2002), HLD (Bansal et al., 2005), Low Power Heterogeneous Makespan (LPHM) (Baskiyar and Palli, 2006), and HLD_DVS (Kaur et al., 2016). HEFT is a list scheduling algorithm, and HLD is a duplication-based scheduling algorithm that exploits the available scheduling holes to duplicate the immediate predecessors to generate shorter makespan than non-duplication algorithms. HLD is an effective algorithm as it avoids redundant

duplications and incorporates only limited duplications, as against other duplication scheduling algorithms in the literature. LPHM is a list-based energy-aware scheduling algorithm that blends DVS with the well-known HEFT algorithm. HLD_DVS is an energy-aware duplication algorithm, in which the scheduled tasks (original and duplicated) on multiple processors exploit DVS in their slack times. All algorithms are coded in the cross-platform integrated development environment (IDE) Code::Blocks 13.12. For an unbiased evaluation, random and real-world task graphs are generated by modifying the benchmark task graph suite (Kwok and Ahmad, 1999) to be used for heterogeneous systems. Task graphs are generated by varying five communication to computation cost ratios (CCRs) {0.1, 0.5, 1.0, 2.0, 10.0}, four high-speed interconnects {Ethernet, QsNetII, Infiniband, Myrinet}, 10 task graph sizes ($v$) {50 to 500 with an increment of 50}, five different average parallelism values ($n\sqrt{v}$) {$1 \le n \le 5$}, and five computation ranges ($\beta$) {0.1, 0.25, 0.5, 0.75, 1.0}. CCR is the ratio of average communication cost to average computation cost of a graph. High and low CCRs imply communication intensive and computation intensive graphs, respectively. Average parallelism ($n\sqrt{v}$) is the maximum number of non-precedence tasks in a graph, which indicates the parallelism available in an application. Computation range ($\beta$) is the processor heterogeneity, that is, the variation in the computation costs on processors. A high $\beta$ (1.0) causes a significant difference in the computation cost of a task among processors, and a low $\beta$ (0.1) means that the execution time of a task on any processor is almost equal. The expected execution time $w(t_i, p_k)$ of $t_i$ on $p_k$ is calculated as follows

$$\bar{w}_i \times \left( 1 - \frac{\beta}{2} \right) \le w(t_i, p_k) < \bar{w}_i \times \left( 1 + \frac{\beta}{2} \right) \tag{12}$$

where $\bar{w}_i$ is the mean computation cost of task $t_i$ taken from the computation cost of the homogeneous benchmark suite (Kwok and Ahmad, 1999). Real-world parallel numerical applications are simulated: mean value analysis (MVA) and Gaussian elimination (Gauss). Task graph size of these applications is roughly $O(N^2)$; $N$ is {19, 22, 25, 28, 31, 34, 37, 40} for MVA and {19, 24, 29, 34, 39, 44, 49} for Gauss.

Simulation setup and performance comparison

Configuration details

The performance comparisons on random and real-world graphs are simulated with the AMD Athlon 64 3000+ processor (Mori et al., 2009) and four high-speed interconnection networks (Zong et al., 2011) widely used in clusters. Tables 2 and 3 give the configuration details of the processor and network types used for simulation.

Table 2. Configuration detail of AMD Athlon 64 3000+ processor.
S-state   Frequency (GHz)   Voltage (V)   Active power (W)   Idle power (W)
1         2.0               1.50          100                87
2         1.8               1.40          87                 78
3         1.0               1.10          61                 59

Table 3. Configuration profile of networks.
Network type   Power (W)   Bandwidth (MB/s)   Message delay
Ethernet       80          50                 13
QsNetII        54          340                1.73
Infiniband     35.6        822                0.9
Myrinet        64.5        243                1

Performance comparison for random task graphs

Impact of varying CCRs. Based on the parameters (5 CCRs, 5 $n\sqrt{v}$, 5 $\beta$, 10 $v$, and one network = QsNetII), 1250 random task graphs were generated and simulated on the cluster. Figure 1 shows performance (average NSL) and total energy (CPU + communication) consumption. From average NSL (Figure 1(a)), it is seen that for communication intensive graphs (high CCRs), HLD gives significantly improved performance over HEFT as it exploits the available scheduling holes to duplicate the immediate predecessors to reduce makespan, whereas for computation intensive graphs (low CCRs), the performance of HLD is more or less comparable to HEFT as duplication may not be desirable under such cases. Performance of ECSCT is comparable to or better than HLD at high CCRs because ECSCT employs duplication scheduling under a threshold to balance makespan and energy consumption. Therefore, the processor's available scheduling holes left unexploited due to discarded duplication can be utilized in future to accommodate the unscheduled and duplicated tasks,

and as a result, ECSCT gives a shorter makespan than the remaining algorithms. At CCR 0.1, average NSL of ECSCT is better than that of HEFT and HLD by 0.1% and 0.2%, but at CCR 10, it is improved by 24.7% and 1%, respectively. For total energy (Figure 1(b)), we can find that for compute intensive graphs, HLD is slightly less energy efficient than HEFT because duplication sometimes tends to increase processor active energy without boosting performance. But for high CCRs, HLD yields significant energy saving over HEFT, as the duplication solution on such graphs tends to reduce communication delay, which substantially reduces communication and processor idle energy consumption. In contrast, ECSCT performs better than the other algorithms as it retains the strengths of duplication under a controlled threshold to satisfy the desired makespan and energy reduction. Total energy saving in LPHM is better than in HLD and HLD_DVS for compute intensive graphs, thus it is better suited for these graphs. But for large CCR, HLD_DVS performs better than LPHM because the combined effect of duplication and DVS causes communication and processor idle energy saving to dominate total energy. The superiority of the ECSCT algorithm is reflected in these results. At low CCR, ECSCT yields reduced makespan and better energy savings than other duplication algorithms and is quite comparable to list-based algorithms. At large CCRs, ECSCT exhibits both lower energy consumption and lower makespan than the other algorithms. At CCR 10, average total energy saving of ECSCT is improved by 27.2%, 6.8%, 12%, and 2.3% over HEFT, HLD, LPHM, and HLD_DVS, respectively.

Figure 1. (a) Average performance and (b) energy consumption for varying CCRs.
Figure 2. (a) Average performance and (b) energy consumption for varying interconnects.

Impact of varying network interconnects. Here, 1000 random graphs were generated with four interconnects ({Ethernet, QsNetII, Infiniband, Myrinet}, 5 $n\sqrt{v}$, 5 $\beta$, 10 $v$, and one CCR = 10) to reflect the effect of varying interconnects. It is seen that Ethernet results in higher total energy consumption than Infiniband for all scheduling algorithms. Ethernet has higher latency and Infiniband has lower latency; therefore, communication time in Ethernet is longer, resulting in increased makespan and energy consumption over Infiniband. Myrinet and QsNetII show almost comparable performance, but QsNetII gives somewhat more energy savings than Myrinet because the power of QsNetII is lower than that of Myrinet. However, the strengths of ECSCT are exhibited as compared to the other algorithms regardless of interconnect type. Energy saving of ECSCT over HEFT, HLD, LPHM, and HLD_DVS is 36.8%, 3.4%, 31.2%, and 0.8% when the Ethernet interconnect is deployed in the cluster, and for Infiniband it is 25.9%, 7.8%, 8.8%, and 3.2%,

respectively. Therefore, average performance and energy results (Figure 2) of scheduling algorithms vary for different interconnects, and the difference among their latency and power affects overall makespan and energy. Scheduling algorithms can benefit from low latency and high bandwidth interconnects that can reduce communication times and yield improved performance and energy.

Impact of varying number of tasks. In this section, we generated 250 random graphs based on (10 $v$, 5 $n\sqrt{v}$, 5 $\beta$, one CCR = 10, and one network = Infiniband). From the results (Figure 3), it is gathered that makespan deteriorates and energy consumption increases as the number of tasks increases for all algorithms, because data dependencies among tasks increase and parallelism needs to be taken care of in the presence of high inter-process communication. Here, it may be observed that duplication scheduling algorithms (HLD, HLD_DVS, and ECSCT) are quite effective at preserving parallelism despite high communication costs and thus perform well for communication intensive graphs. From the average NSL of graphs of size 50, ECSCT performs noticeably better than HEFT and HLD, by 38.3% and 2.2%, whereas for 500 tasks, the improvement reduces to 20.3% and 0.9%. For total energy consumption of 50 tasks, ECSCT is better than HEFT, HLD, LPHM, and HLD_DVS by 40.5%, 12%, 22%, and 3.8%, respectively.

Figure 3. (a) Average performance and (b) energy consumption for varying task sizes.
Figure 4. (a) Average performance and (b) energy consumption for varying heterogeneities.

Impact of heterogeneity. Unlike homogeneous systems, in a heterogeneous computing system, the capacities of processors are different. For the 250 random task graphs as generated in section "Impact of varying number of tasks," performance and energy reduction (Figure 4) improve as heterogeneity increases regardless of scheduling algorithm. At each heterogeneity parameter $\beta$, energy saving of HLD is found to be more effective than HEFT because the duplication scenario better adapts to the HC platform and thus brings more benefit from heterogeneity. Furthermore, it is seen that the DVS technique gives better energy savings as heterogeneity increases because different processors execute similar

Figure 5. (a) Average performance and (b) energy consumption of MVA for varying CCRs.
Figure 6. (a) Average performance and (b) energy consumption of GE for varying CCRs.
Gauss: Gaussian elimination.
tasks at different speeds resulting in more idle time both the real applications. For small CCR, average per-
to be exploited by DVS. For instance, at heterogene- formance of ECSCT is more or less comparable to
ity 1, HLD_DVS saves 24%, 5.2%, and 6.4% energy HEFT, but for large CCR, ECSCT is significantly
than HEFT, HLD, and LPHM. For total energy, at improved over HEFT and HLD. For example, at CCR
b = 0.1, ECSCT is better than HEFT, HLD, LPHM, 10, ECSCT provides better performance than HEFT
and HLD_DVS by 25.3%, 7.5%, 8.3%, and 3%, and HLD by 18.6% and 3.4% for MVA and it is 8.1%
respectively, and for b = 1, it is increased to 27%, and 4.8% for Gauss. In terms of total energy, for small
8.5%, 9.7%, and 3.5%. From results, ECSCT is found CCR, ECSCT gives good energy gain against HEFT,
to be quite energy efficient over other algorithms HLD, and HLD_DVS, and it is only marginally energy
regardless of heterogeneity. efficient than LPHM. For large CCR, performance
and energy improvement of ECSCT are significantly
improved than all other algorithms. Justification of
Performance comparison for regular task graphs for
these results is consistent with random task graphs
varying CCRs results. Compared with HEFT, HLD, LPHM, and
Two parallel applications were simulated: MVA and HLD_DVS at CCR 10, energy saving of ECSCT is
Gaussian elimination (Gauss). For MVA, 200 graphs 10.7%, 9.9%, 3.3%, and 7.5% for Gauss and it
are generated based on (5 CCRs, 5 b, 8 N, and one net- increases to 33.2%, 14.8%, 15.8%, and 6.8% when
work = Myrinet). For Gauss application, 175 graphs MVA is simulated on same cluster. Thus, from results
were created based on (5 CCRs, 5 b, 7 N, and one net- of Figures 5 and 6, it is clear that communication inten-
work = Myrinet). ECSCT yields highest performance sive graphs can take more benefit from ECSCT to
and energy saving as compared to other algorithms for improve performance and reduce energy consumption.
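The graph counts quoted above follow from a full factorial sweep over the experiment parameters: 5 × 5 × 8 × 1 = 200 for MVA and 5 × 5 × 7 × 1 = 175 for Gauss. A minimal Python sketch of such a sweep is given here. The individual CCR, b, and N values are placeholders chosen only so that each list has the stated length (this section itself mentions only CCR 10, b = 0.1, and b = 1), and the function name is hypothetical.

```python
from itertools import product

# Placeholder parameter values: only the list lengths (5, 5, 8 or 7, 1)
# come from the text; the individual values are assumptions.
CCRS = [0.1, 0.5, 1, 5, 10]
B_VALUES = [0.1, 0.25, 0.5, 0.75, 1]
NETWORKS = ["Myrinet"]
MVA_SIZES = list(range(1, 9))    # 8 hypothetical values of N
GAUSS_SIZES = list(range(1, 8))  # 7 hypothetical values of N


def experiment_grid(ccrs, b_values, sizes, networks):
    """Return one configuration per (CCR, b, N, network) combination."""
    return [
        {"ccr": c, "b": b, "n": n, "network": net}
        for c, b, n, net in product(ccrs, b_values, sizes, networks)
    ]


mva_grid = experiment_grid(CCRS, B_VALUES, MVA_SIZES, NETWORKS)
gauss_grid = experiment_grid(CCRS, B_VALUES, GAUSS_SIZES, NETWORKS)
print(len(mva_grid), len(gauss_grid))  # 200 175
```

Each dictionary would correspond to one generated task graph, so the list lengths reproduce the 200 and 175 graphs reported for the two applications.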

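The heterogeneity results above attribute the DVS gain to idle time that a task can absorb by running more slowly. A minimal Python sketch of that slack-reclamation idea follows. It assumes the common CMOS model in which the dynamic energy for a fixed number of cycles scales with the square of the supply voltage, and it ignores idle and static power. The voltage/speed levels, the function name, and the example numbers are illustrative assumptions, not the processor configuration or the procedure used in the article.

```python
# Hypothetical (voltage in V, relative speed) pairs, fastest first.
LEVELS = [(1.5, 1.0), (1.4, 0.9), (1.3, 0.8), (1.2, 0.7), (1.1, 0.6)]


def reclaim_slack(exec_time, slack, levels=LEVELS):
    """Pick the lowest-energy level whose stretched run time fits the window.

    exec_time: run time at full speed; slack: idle time after the task.
    Returns (voltage, speed, stretched_time, energy_ratio), where
    energy_ratio is the dynamic energy relative to the top level.
    """
    v_max, _ = levels[0]
    window = exec_time + slack
    best = None
    for voltage, speed in levels:
        stretched = exec_time / speed
        if stretched <= window + 1e-12:
            # Energy for a fixed cycle count is taken as proportional to V^2.
            ratio = (voltage / v_max) ** 2
            if best is None or ratio < best[3]:
                best = (voltage, speed, stretched, ratio)
    return best


voltage, speed, stretched, ratio = reclaim_slack(10.0, 3.0)
print(voltage, speed, round(stretched, 2), round(ratio, 3))
```

With these assumed levels, a task of 10 time units followed by 3 units of idle time is stretched to 12.5 units at the 1.3 V level, for roughly 75% of the full-speed dynamic energy. A task with no slack stays at the top level, which is why more idle time leaves more room for DVS.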
Conclusion
Judicious scheduling of concurrent tasks is crucial for exploiting the full potential of parallel computing platforms. However, for solving increasingly complex engineering and scientific applications, the excessive power consumption of the underlying computing architectures is becoming a major concern. In this work, an energy conscious duplication-based scheduling algorithm with controlled threshold (ECSCT) is proposed to address makespan and energy consumption concurrently when scheduling DAGs on heterogeneous computing platforms. The threshold is selected based on the computation and communication energy and the makespan, and it is updated dynamically until a feasible threshold is reached. After a feasible threshold is attained, DVS is applied to further conserve computation energy, wherever feasible. The exhaustive simulation-based analysis on random and real-world graphs justifies the logic behind the proposed strategy. For compute-intensive applications (with CCR ≤ 1), the average makespan of ECSCT is only slightly reduced (within 1%), with marginal energy reduction (between −0.3% and +2.6%), in comparison with other prominent list and duplication-based scheduling algorithms with and without DVS. However, for communication-intensive applications (with CCR > 1), the average makespan reduction achieved by ECSCT is appreciable (minimum 3.4% and maximum 10%), together with significantly reduced energy consumption (minimum 5.7% and maximum 15%), in comparison with other algorithms. These results imply that ECSCT is a potential substitute for contemporary algorithms to reduce energy consumption while still retaining or improving makespan on HC. The present research focused on energy conscious duplication scheduling under the classical model to preserve makespan and reduce total energy consumption. Future work is being extended toward a realistic model that takes into account the contention for communication resources. Energy-aware duplication scheduling under a contention model has scope to reduce communication delay and thus communication energy consumption.

Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

References
Bala A and Chana I (2015) Autonomic fault tolerant scheduling approach for scientific workflows in cloud computing. Concurrent Engineering: Research and Applications 23(1): 27–39.
Bansal S, Kumar P and Singh K (2003) An improved duplication strategy for scheduling precedence constrained graphs in multiprocessor systems. IEEE Transactions on Parallel and Distributed Systems 14(6): 533–544.
Bansal S, Kumar P and Singh K (2005) Dealing with heterogeneity through limited duplication for scheduling precedence constrained task graphs. Journal of Parallel and Distributed Computing 65(4): 479–491.
Baskiyar S and Abdel-Kader R (2010) Energy aware DAG scheduling on heterogeneous systems. Cluster Computing 13(4): 373–383.
Baskiyar S and Palli KK (2006) Low power scheduling of DAGs to minimize finish times. In: 13th international conference on high performance computing (HiPC), Bangalore, India, 18–21 December, pp. 353–362. Berlin: Springer.
Comito C, Falcon D, Talia D, et al. (2013) Efficient allocation of data mining tasks in mobile environments. Concurrent Engineering: Research and Applications 21(3): 197–207.
Forrest W (2008) How to cut data centre carbon emissions? Available at: http://www.computerweekly.com/Articles/2008/12/05/233748/how-tocut-data-centre-carbon emissions. htm
Kaur N, Bansal S and Bansal RK (2015) Towards energy efficient scheduling with DVFS for precedence constrained tasks on heterogeneous cluster system. In: 2nd international conference on recent advances in engineering and computational sciences, Chandigarh, India, 21–22 December, pp. 1–6. New York: IEEE.
Kaur N, Bansal S and Bansal RK (2016) Energy efficient duplication-based scheduling for precedence constrained tasks on heterogeneous computing cluster. Multiagent and Grid Systems 12(3): 239–252.
Koch G (2005) Discovering Multi-Core: Extending the Benefits of Moore’s Law. Technology@Intel Magazine. Available at: http://www.intel.com/technology/magazine/computing/multi-core-0705.pdf
Kwok YK and Ahmad I (1999) Benchmarking and comparison of the task graph scheduling algorithms. Journal of Parallel and Distributed Computing 59(3): 381–422.
Lai KC and Yang CT (2008) A dominant predecessor duplication scheduling algorithm for heterogeneous systems. Journal of Supercomputing 44(2): 126–145.
Lee YC and Zomaya AY (2009) On effective slack reclamation in task scheduling for energy reduction. Information Processing Systems 5(4): 175–186.
Liu W, Du W, Chen J, et al. (2014) Adaptive energy-efficient scheduling algorithm for parallel tasks on homogeneous clusters. Journal of Network and Computer Applications 41: 101–113.
Manzak A and Chakrabarti C (2001) Variable voltage task scheduling algorithms for minimizing energy. In: International symposium on low power electronics and design (ISLPED’01), Huntington Beach, CA, 6–7 August, pp. 279–282. New York: ACM.

Mori Y, Asakura K and Watanabe T (2009) A task selection based power-aware scheduling algorithm for applying DVS. In: International conference on parallel and distributed computing, applications and technologies, Higashihiroshima, Japan, 8–11 December, pp. 518–523. New York: IEEE.
Rahaman MS and Chowdhury MH (2009) Crosstalk avoidance and error-correction coding for coupled RLC interconnects. In: International symposium on circuits and systems, Taipei, Taiwan, 24–27 May, pp. 141–144. New York: IEEE.
Shin KS, Cha M, Jang M, et al. (2008) Task scheduling algorithm using minimized duplications in homogeneous systems. Journal of Parallel and Distributed Computing 68(8): 1146–1156.
Topcuoglu H, Hariri S and Wu MY (2002) Performance effective and low complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13(3): 260–274.
Wang L, Khan SU, Chan D, et al. (2013) Energy-aware parallel task scheduling in a cluster. Future Generation Computer Systems 29(7): 1661–1670.
Yao F, Demers A and Shenker S (1995) A scheduling model for reduced CPU energy. In: 36th annual symposium on foundations of computer science (FOCS’95), Milwaukee, WI, 23–25 October, pp. 374–382. New York: IEEE.
Zong Z, Manzanares A, Ruan X, et al. (2011) EAD and PEBD: two energy-aware duplication scheduling algorithms for parallel tasks on homogeneous clusters. IEEE Transactions on Computers 60(3): 360–374.
Author biographies
Nirmal Kaur received her Masters in CSE (2009) from IK-Gujral Punjab Technical University, Kapurthala, and her Bachelors in CSE (2003) from Punjabi University, Patiala. She is pursuing a PhD in CSE at IK-Gujral Punjab Technical University, Kapurthala. She is an assistant professor at UIET, Panjab University, Chandigarh, India. Her areas of interest include energy-aware multiprocessor scheduling and digital image watermarking techniques.

Savina Bansal earned her PhD from IIT, Roorkee (2004), Masters in CSE (1994) from Thapar University, Patiala, and Bachelors in ECE (1988) from Punjab Engg College, Chandigarh, India. She is a professor at GZS Campus College of Engg & Tech, Bathinda (A Punjab Govt Established Institute) and Dean (R&D), Maharaja Ranjit Singh Punjab Technical University, Bathinda, India. Her current areas of interest include energy-efficient and fault-tolerant scheduling on parallel computing systems, wireless sensor networks, image processing, and wireless communication systems. Six PhD scholars, including the first author, are working under her supervision. She is available at savina.bansal@gmail.com.

Rakesh Kumar Bansal earned his PhD (2009) from Pbi University and his Masters in Computer Engineering (1992) and Bachelors in EIC (1986), both from Thapar University, Patiala, India. He is a professor at GZS Campus College of Engg & Tech, Bathinda (A Punjab Govt Established Institute) and Dean (Students), Maharaja Ranjit Singh Punjab Technical University, Bathinda, India. His areas of interest include real-time and energy-aware fault-tolerant scheduling and wireless sensor networks. Five PhD scholars, including the first author, are working under his supervision. He is available at drrakeshkbansal@gmail.com.
