
Principles of Scalable Performance

EENG-630 Chapter 3
Principles of Scalable Performance

Performance measures
Speedup laws
Scalability principles
Scaling up vs. scaling down

Performance metrics and measures

Parallelism profiles
Asymptotic speedup factor
System efficiency, utilization and quality
Standard performance measures
Degree of parallelism
Reflects the matching of software and hardware
parallelism
A discrete time function that measures, for each time
period, the # of processors used
Parallelism profile is a plot of the DOP as a function
of time
Ideally have unlimited resources


Degree of Parallelism
The number of processors used at any instant to
execute a program is called the degree of
parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are
available; this is not achievable in real machines, so
some parallel program segments must be executed
sequentially as smaller parallel segments. Other
resources may impose limiting conditions.
A plot of DOP vs. time is called a parallelism profile.

Factors affecting parallelism profiles
Algorithm structure
Program optimization
Resource utilization
Run-time conditions
Realistically limited by # of available processors,
memory, and other nonprocessor resources
Average Parallelism - 1
Assume the following:
n homogeneous processors
maximum parallelism in a profile is m
Ideally, n >> m
\Delta, the computing capacity of a single processor, is something
like MIPS or Mflops without regard for memory latency, etc.
i is the number of processors busy in an observation
period (e.g. DOP = i )
W is the total work (instructions or computations)
performed by a program
A is the average parallelism in the program
Average Parallelism - 2
W = \Delta \int_{t_1}^{t_2} DOP(t)\,dt

W = \Delta \sum_{i=1}^{m} i \cdot t_i , where t_i = total time that DOP = i, and

\sum_{i=1}^{m} t_i = t_2 - t_1
Total amount of work performed is proportional to the area
under the profile curve
Average Parallelism - 3
A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} DOP(t)\,dt

A = \left( \sum_{i=1}^{m} i \cdot t_i \right) \Big/ \left( \sum_{i=1}^{m} t_i \right)
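In discrete form these quantities are easy to compute directly from a
measured profile. Below is a minimal Python sketch (not from the original
slides); the profile values and the assumption \Delta = 1 are purely
illustrative.

# Average parallelism from a discrete parallelism profile (Delta = 1).
# profile[i] = total time during which DOP == i; values are made up.
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}

total_time = sum(profile.values())               # t2 - t1
work = sum(i * t for i, t in profile.items())    # W (area under the profile)
A = work / total_time                            # average parallelism

print(f"total time = {total_time}, work = {work}, A = {A:.2f}")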
Example: parallelism profile and average parallelism
Available Parallelism
Various studies have shown that the potential
parallelism in scientific and engineering
calculations can be very high (e.g. hundreds or
thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much
smaller (e.g. 10 or 20).
Basic Blocks
A basic block is a sequence or block of instructions
with one entry and one exit.
Basic blocks are frequently used as the focus of
optimizers in compilers (since it's easier to manage
the use of registers utilized in the block).
Limiting optimization to basic blocks limits the
instruction level parallelism that can be obtained
(to about 2 to 5 in typical code).
Asymptotic Speedup - 1

\sum_{i=1}^{m} W_i = W    (relates the sum of the W_i terms to W)

W_i = \Delta \, i \, t_i    (work done when DOP = i)

t_i(k) = \frac{W_i}{k\Delta}    (execution time of W_i with k processors)

t_i(\infty) = \frac{W_i}{i\Delta}    (for 1 \le i \le m)
Asymptotic Speedup - 2

T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}    (response time with 1 processor)

T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}    (response time with unlimited processors)

S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i} = A    (in the ideal case)
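Continuing the illustrative profile above (W_i = i \cdot t_i with \Delta = 1),
a short Python sketch confirms that the asymptotic speedup equals the average
parallelism A; the numbers are made up.

# Asymptotic speedup from per-DOP work amounts W_i (Delta = 1).
W = {1: 4.0, 2: 6.0, 4: 8.0, 8: 8.0}        # DOP i -> work W_i (illustrative)

T1 = sum(W.values())                         # response time on one processor
T_inf = sum(Wi / i for i, Wi in W.items())   # response time, unlimited processors
S_inf = T1 / T_inf                           # asymptotic speedup = A

print(f"T(1) = {T1}, T(inf) = {T_inf}, S = {S_inf:.2f}")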
Performance measures
Consider n processors executing m programs in
various modes
Want to define the mean performance of these
multimode computers:
Arithmetic mean performance
Geometric mean performance
Harmonic mean performance
Mean Performance Calculation
We seek to obtain a measure that characterizes
the mean, or average, performance of a set of
benchmark programs with potentially many
different execution modes (e.g. scalar, vector,
sequential, parallel).
We may also wish to associate weights with these
programs to emphasize these different modes and
yield a more meaningful performance measure.
Arithmetic mean performance

R_a = \frac{1}{m} \sum_{i=1}^{m} R_i        Arithmetic mean execution rate (assumes equal weighting)

R_a^* = \sum_{i=1}^{m} f_i R_i              Weighted arithmetic mean execution rate

- proportional to the sum of the inverses of execution times
Geometric mean performance
R_g = \prod_{i=1}^{m} R_i^{1/m}             Geometric mean execution rate

R_g^* = \prod_{i=1}^{m} R_i^{f_i}           Weighted geometric mean execution rate

- does not summarize the real performance since it does not have the
inverse relation with the total time
Harmonic mean performance
T_i = 1 / R_i        Mean execution time per instruction for program i

T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i}        Arithmetic mean execution time per instruction
Harmonic mean performance

R_h = \frac{1}{T_a} = \frac{m}{\sum_{i=1}^{m} (1/R_i)}        Harmonic mean execution rate

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)}                  Weighted harmonic mean execution rate

- corresponds to total # of operations divided by the total time
(closest to the real performance)
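The three means can summarize the same data very differently. A small Python
sketch comparing them on illustrative, equally weighted execution rates (the
numbers are made up):

import math

rates = [4.0, 8.0, 16.0, 64.0]     # R_i in, say, Mflops (illustrative)
m = len(rates)

R_a = sum(rates) / m                        # arithmetic mean rate
R_g = math.prod(rates) ** (1.0 / m)         # geometric mean rate
R_h = m / sum(1.0 / r for r in rates)       # harmonic mean rate

# Only R_h corresponds to total work divided by total time when each
# program performs the same number of operations.
print(f"R_a = {R_a:.2f}, R_g = {R_g:.2f}, R_h = {R_h:.2f}")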
Geometric Mean
The geometric mean of n terms is the nth root of the
product of the n terms.
Like the arithmetic mean, the geometric mean of a
set of execution rates does not have an inverse
relationship with the total execution time of the
programs.
(Geometric mean has been advocated for use with
normalized performance numbers for comparison
with a reference machine.)
Harmonic Mean
Instead of using arithmetic or geometric mean, we
use the harmonic mean execution rate, which is
just the inverse of the arithmetic mean of the
execution time (thus guaranteeing the inverse
relation not exhibited by the other means).
R_h = \frac{m}{\sum_{i=1}^{m} (1/R_i)}
Weighted Harmonic Mean
If we associate weights f_i with the benchmarks, then we can
compute the weighted harmonic mean:

R_h^* = \frac{1}{\sum_{i=1}^{m} (f_i / R_i)}
Weighted Harmonic Mean Speedup
T_1 = 1/R_1 = 1 is the sequential execution time on a single
processor with rate R_1 = 1.

T_i = 1/R_i = 1/i is the execution time using i processors with a
combined execution rate of R_i = i.

Now suppose a program has n execution modes with associated weights
f_1, ..., f_n. The weighted harmonic mean speedup is defined as:

S = T_1 / T^* = \frac{1}{\sum_{i=1}^{n} (f_i / R_i)}

where T^* = 1/R_h^* is the weighted arithmetic mean execution time.
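A short sketch of this speedup, assuming R_i = i as above; the mode weights
are illustrative and sum to 1.

# Weighted harmonic mean speedup S = 1 / sum(f_i / R_i), with R_i = i.
weights = {1: 0.05, 4: 0.15, 16: 0.30, 64: 0.50}   # mode i -> f_i (illustrative)

S = 1.0 / sum(f / i for i, f in weights.items())
print(f"weighted harmonic mean speedup = {S:.2f}")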
Harmonic Mean Speedup Performance
Amdahl's Law

Assume R_i = i, and that the weights are (\alpha, 0, ..., 0, 1-\alpha).
Basically this means the system is used sequentially (with probability
\alpha) or all n processors are used (with probability 1-\alpha).

This yields the speedup equation known as Amdahl's law:

S_n = \frac{n}{1 + (n - 1)\alpha}

The implication is that the best speedup possible is 1/\alpha,
regardless of n, the number of processors.
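A minimal Python sketch of this form of Amdahl's law; the function name and
the values of \alpha and n are chosen only for illustration.

def amdahl_speedup(alpha: float, n: int) -> float:
    """S_n = n / (1 + (n - 1) * alpha), where alpha is the sequential fraction."""
    return n / (1.0 + (n - 1) * alpha)

alpha = 0.05
for n in (8, 100, 10_000):
    print(n, round(amdahl_speedup(alpha, n), 2))
# As n grows, the speedup approaches 1/alpha = 20 (cf. Examples 1 and 2 below).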
Illustration of Amdahl Effect
[Plot: speedup vs. number of processors for problem sizes n = 100,
n = 1,000, and n = 10,000.]
Example 1
95% of a program's execution time occurs inside a loop that can be
executed in parallel. What is the maximum speedup we should expect
from a parallel version of the program executing on 8 CPUs?

Speedup \le \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9
System Efficiency 1
Assume the following definitions:
O (n) = total number of unit operations performed by an n-
processor system in completing a program P.
T (n) = execution time required to execute the program P on an n-
processor system.
O (n) can be considered similar to the total number of
instructions executed by the n processors, perhaps scaled
by a constant factor.
If we define O (1) = T (1), then it is logical to expect that
T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
Example 2
5% of a parallel program's execution time is spent within
inherently sequential code.

The maximum speedup achievable by this program, regardless of how
many PEs are used, is

\lim_{p \to \infty} \frac{1}{0.05 + (1 - 0.05)/p} = \frac{1}{0.05} = 20
Pop Quiz
An oceanographer gives you a serial program
and asks you how much faster it might run on 8
processors. You can only find one function
amenable to a parallel solution. Benchmarking
on a single processor reveals 80% of the
execution time is spent inside this function.
What is the best speedup a parallel version is
likely to achieve on 8 processors?
Answer: 1/(0.2 + (1 - 0.2)/8) \approx 3.3
System Efficiency 2
Clearly, the speedup factor (how much faster the program runs with
n processors) can now be expressed as

S(n) = T(1) / T(n)

Recall that we expect T(n) < T(1), so S(n) > 1.

System efficiency is defined as

E(n) = S(n) / n = T(1) / ( n T(n) )

It indicates the actual degree of speedup achieved in a system as
compared with the maximum possible speedup. Thus 1/n \le E(n) \le 1.
The value is 1/n when only one processor is used (regardless of n),
and the value is 1 when all processors are fully utilized.
Redundancy
The redundancy in a parallel computation is defined as
R (n) = O (n) / O (1)
What values can R (n) obtain?
R (n) = 1 when O (n) = O (1), or when the number of operations
performed is independent of the number of processors, n. This is
the ideal case.
R (n) = n when each processor performs the same number of
operations as when only a single processor is used; this implies that
n completely redundant computations are performed!
The R (n) figure indicates to what extent the software
parallelism is carried over to the hardware implementation
without having extra operations performed.
System Utilization
System utilization is defined as

U(n) = R(n) E(n) = O(n) / ( n T(n) )

It indicates the degree to which the system resources were kept busy
during execution of the program. Since 1 \le R(n) \le n and
1/n \le E(n) \le 1, the best possible value for U(n) is 1, and the
worst is 1/n.

1/n \le E(n) \le U(n) \le 1
1 \le R(n) \le 1/E(n) \le n
Quality of Parallelism
The quality of a parallel computation is defined as

Q(n) = S(n) E(n) / R(n) = T^3(1) / ( n T^2(n) O(n) )

This measure is directly related to speedup (S) and efficiency (E),
and inversely related to redundancy (R).

The quality measure is bounded by the speedup (that is, Q(n) \le S(n)).
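The five metrics S, E, R, U, and Q can be computed together from measured
T(1), T(n), O(1), and O(n). A Python sketch with made-up measurements for a
hypothetical 8-processor run:

def parallel_metrics(T1, Tn, O1, On, n):
    """Speedup, efficiency, redundancy, utilization, and quality of parallelism."""
    S = T1 / Tn        # speedup
    E = S / n          # efficiency
    R = On / O1        # redundancy
    U = R * E          # utilization
    Q = S * E / R      # quality of parallelism (always <= S)
    return {"S": S, "E": E, "R": R, "U": U, "Q": Q}

# Illustrative measurements only.
print(parallel_metrics(T1=100.0, Tn=16.0, O1=100.0, On=120.0, n=8))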
Standard Industry Performance Measures
MIPS and Mflops, while easily understood, are poor
measures of system performance, since their interpretation
depends on machine clock cycles and instruction sets. For
example, which of these machines is faster?
a 10 MIPS CISC computer
a 20 MIPS RISC computer
It is impossible to tell without knowing more details about
the instruction sets on the machines. Even the question
"which machine is faster?" is suspect, since we really need
to say "faster at doing what?"
Doing What?
To answer the "doing what?" question, several standard
programs are frequently used.
The Dhrystone benchmark uses no floating point instructions,
system calls, or library functions. It uses exclusively integer data
items. Each execution of the entire set of high-level language
statements is a Dhrystone, and a machine is rated as having a
performance of some number of Dhrystones per second (sometimes
reported as KDhrystones/sec).
The Whetstone benchmark uses a more complex program involving
floating point and integer data, arrays, subroutines with
parameters, conditional branching, and library functions. It does
not, however, contain any obviously vectorizable code.
The performance of a machine on these benchmarks
depends in large measure on the compiler used to generate
the machine language. [Some companies have, in the
past, actually tweaked their compilers to specifically deal
with the benchmark programs!]
What's VAX Got To Do With It?
The Digital Equipment VAX-11/780 computer for
many years has been commonly agreed to be a 1-
MIPS machine (whatever that means).
Since the VAX-11/780 also has a rating of about
1.7 KDhrystones, this gives a method whereby a
relative MIPS rating for any other machine can be
derived: just run the Dhrystone benchmark on the
other machine, divide by 1.7K, and you then obtain
the relative MIPS rating for that machine
(sometimes also called VUPs, or VAX units of
performance).
Other Measures
Transactions per second (TPS) is a measure that is
appropriate for online systems like those used to support
ATMs, reservation systems, and point of sale terminals.
The measure may include communication overhead,
database search and update, and logging operations. The
benchmark is also useful for rating relational database
performance.
KLIPS is the measure of the number of logical inferences
per second that can be performed by a system, presumably
to relate how well that system will perform at certain AI
applications. Since one inference requires about 100
instructions (in the benchmark), a rating of 400 KLIPS is
roughly equivalent to 40 MIPS.
Parallel Processing Applications
Drug design
High-speed civil transport
Ocean modeling
Ozone depletion research
Air pollution
Digital anatomy
Application Models for Parallel Computers
Fixed-load model
Constant workload
Fixed-time model
Demands constant program execution time
Fixed-memory model
Limited by the memory bound
Algorithm Characteristics
Deterministic vs. nondeterministic
Computational granularity
Parallelism profile
Communication patterns and synchronization
requirements
Uniformity of operations
Memory requirement and data structures
Isoefficiency Concept
Relates workload to machine size n needed to
maintain a fixed efficiency



E = \frac{w(s)}{w(s) + h(s, n)}        (w(s) = workload, h(s, n) = overhead)

The smaller the power of n, the more scalable the system
Isoefficiency Function
To maintain a constant E, w(s) should grow in
proportion to h(s,n)



w(s) = \frac{E}{1 - E} \, h(s, n)

C = E/(1-E) is constant for fixed E

f_E(n) = C \, h(s, n)
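A small Python sketch of this relation: for a target efficiency E, the
workload must grow at least as fast as C * h(s,n) with C = E/(1-E). The
overhead model h(s,n) = s * log2(n) below is hypothetical, chosen only to
show the growth.

import math

def required_workload(E, h):
    """Workload w(s) needed to sustain efficiency E against overhead h(s, n)."""
    C = E / (1.0 - E)
    return C * h

E = 0.8
for n in (4, 16, 64):
    h = 1_000 * math.log2(n)     # hypothetical overhead h(s, n)
    print(n, required_workload(E, h))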
The Isoefficiency Metric (Terminology)

Parallel system - a parallel program executing on a parallel
computer
Scalability of a parallel system - a measure of its
ability to increase performance as number of
processors increases
A scalable system maintains efficiency as
processors are added
Isoefficiency - a way to measure scalability
Notation Needed for the Isoefficiency Relation

n - data size
p - number of processors
T(n,p) - execution time, using p processors
\psi(n,p) - speedup
\sigma(n) - inherently sequential computations
\phi(n) - potentially parallel computations
\kappa(n,p) - communication operations
\varepsilon(n,p) - efficiency

Note: At least in some printings, there appears to be a misprint on
page 170 in Quinn's textbook, with \phi(n) sometimes replaced with
\beta(n). To correct, simply replace each \beta with \phi.
Isoefficiency Concepts
T_0(n,p) is the total time spent by processes doing work not done
by the sequential algorithm.

T_0(n,p) = (p - 1)\sigma(n) + p\kappa(n,p)

We want the algorithm to maintain a constant level of efficiency as
the data size n increases. Hence, \varepsilon(n,p) is required to be
a constant.
Recall that T(n,1) represents the sequential
execution time.
The Isoefficiency Relation
Suppose a parallel system exhibits efficiency \varepsilon(n,p). Define

C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}

T_0(n,p) = (p - 1)\sigma(n) + p\kappa(n,p)

In order to maintain the same level of efficiency as the number of
processors increases, n must be increased so that the following
inequality is satisfied:

T(n,1) \ge C \, T_0(n,p)
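As a worked sketch, consider a hypothetical parallel reduction of n values:
\sigma(n) is a small constant, \kappa(n,p) is roughly \lceil \log_2 p \rceil
communication steps, and T(n,1) is proportional to n. The constants below are
illustrative, not taken from the slides.

import math

def isoefficiency_holds(n, p, C):
    """Check T(n,1) >= C * T0(n,p) for a hypothetical parallel reduction."""
    sigma = 1                            # sigma(n): inherently sequential work
    kappa = math.ceil(math.log2(p))      # kappa(n,p): communication steps
    T0 = (p - 1) * sigma + p * kappa     # overhead not present sequentially
    return n >= C * T0                   # T(n,1) ~ n

print(isoefficiency_holds(n=10_000, p=64, C=4.0))   # True: n is large enough
print(isoefficiency_holds(n=100,    p=64, C=4.0))   # False: n must grow with p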
Speedup Performance Laws
Amdahl's law
for fixed workload or fixed problem size
Gustafson's law
for scaled problems (problem size increases with
increased machine size)
Speedup model
for scaled problems bounded by memory capacity
Amdahl's Law

As the # of processors increases, the fixed load is distributed to
more processors

Minimal turnaround time is the primary goal

Speedup factor is upper-bounded by a sequential bottleneck

Two cases:
DOP \ge n
DOP < n
Fixed Load Speedup Factor
Case 1: DOP \ge n

t_i(n) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil

T(n) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil

Case 2: DOP < n

t_i(n) = t_i(\infty) = \frac{W_i}{i\Delta}

S_n = \frac{T(1)}{T(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \frac{W_i}{i} \left\lceil \frac{i}{n} \right\rceil}
Gustafson's Law

With Amdahl's Law, the workload cannot scale to match the available
computing power as n increases

Gustafson's Law fixes the time, allowing the problem size to
increase with higher n

Not saving time, but increasing accuracy
Fixed-time Speedup
As the machine size increases, we have an increased workload and a
new parallelism profile

In general, W_i' > W_i for 2 \le i \le m, and W_1' = W_1

Assume T(1) = T'(n)
Gustafson's Scaled Speedup

Fixed-time condition (Q(n) is the parallel system overhead):

\sum_{i=1}^{m} W_i = \sum_{i=1}^{m} \frac{W_i'}{i} \left\lceil \frac{i}{n} \right\rceil + Q(n)

S_n' = \frac{\sum_{i=1}^{m} W_i'}{\sum_{i=1}^{m} W_i} = \frac{W_1 + nW_n}{W_1 + W_n}
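A sketch of the two-mode form of Gustafson's scaled speedup, with W_1 the
sequential work and W_n the perfectly parallel work; the 5%/95% split is
illustrative.

def gustafson_speedup(W1, Wn, n):
    """S'_n = (W1 + n * Wn) / (W1 + Wn) for the scaled (fixed-time) workload."""
    return (W1 + n * Wn) / (W1 + Wn)

for n in (8, 100, 10_000):
    print(n, round(gustafson_speedup(0.05, 0.95, n), 2))
# Unlike the fixed-load case, the speedup grows almost linearly with n.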
Memory Bounded Speedup Model
Idea is to solve largest problem, limited by
memory space
Results in a scaled workload and higher accuracy
With distributed memory, each node can handle only a small
subproblem

Using a large # of nodes collectively increases the memory
capacity proportionally

Fixed-Memory Speedup
Let M be the memory requirement and W the computational
workload: W = g(M)

For n nodes the scaled workload is
g^*(nM) = G(n) g(M) = G(n) W_n

S_n^* = \frac{\sum_{i=1}^{m} W_i^*}{\sum_{i=1}^{m} \frac{W_i^*}{i} \left\lceil \frac{i}{n} \right\rceil + Q(n)} = \frac{W_1 + G(n) W_n}{W_1 + G(n) W_n / n}
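A sketch of the simplified two-mode form, parameterized by G(n); the three
choices of G mirror the next slide (G(n) = 1, G(n) = n, G(n) > n). All
numbers are illustrative.

def memory_bounded_speedup(W1, Wn, n, G):
    """S*_n = (W1 + G(n) * Wn) / (W1 + G(n) * Wn / n)."""
    g = G(n)
    return (W1 + g * Wn) / (W1 + g * Wn / n)

W1, Wn, n = 0.05, 0.95, 64
print(round(memory_bounded_speedup(W1, Wn, n, lambda n: 1), 2))         # Amdahl
print(round(memory_bounded_speedup(W1, Wn, n, lambda n: n), 2))         # Gustafson
print(round(memory_bounded_speedup(W1, Wn, n, lambda n: n ** 1.5), 2))  # G(n) > n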
Relating Speedup Models
G(n) reflects the increase in workload as memory
increases n times
G(n) = 1 : Fixed problem size (Amdahl)
G(n) = n : Workload increases n times when
memory increased n times (Gustafson)
G(n) > n : Workload increases faster than the memory
requirement
Scalability Metrics
Machine size (n) : # of processors
Clock rate (f) : determines basic m/c cycle
Problem size (s) : amount of computational
workload. Directly proportional to T(s,1).
CPU time (T(s,n)) : actual CPU time for execution
I/O demand (d) : demand in moving the program,
data, and results for a given run
Interpreting Scalability Function
[Plot: memory needed per processor vs. number of processors, showing
isoefficiency curves C, C log p, C p, and C p log p against the
per-processor memory size. Curves that stay at or below the memory
size can maintain efficiency; curves that grow above it cannot.]
Scalability Metrics
Memory capacity (m) : max # of memory words
demanded
Communication overhead (h(s,n)) : amount of time
for interprocessor communication, synchronization,
etc.
Computer cost (c) : total cost of h/w and s/w
resources required
Programming overhead (p) : development
overhead associated with an application program
Speedup and Efficiency
The problem size is the independent parameter

E(s, n) = \frac{S(s, n)}{n}

S(s, n) = \frac{T(s, 1)}{T(s, n) + h(s, n)}
Scalable Systems
Ideally, if E(s,n)=1 for all algorithms and any s and
n, system is scalable
Practically, consider the scalability of a machine:

\Phi(s, n) = \frac{S(s, n)}{S_I(s, n)} = \frac{T_I(s, n)}{T(s, n)}

where the subscript I denotes the corresponding quantities on an
ideal machine
Summary (2)
Some factors preventing linear speedup?
Serial operations
Communication operations
Process start-up
Imbalanced workloads
Architectural limitations