
Arsitektur dan Organisasi Komputer (Computer Architecture and Organization)

Computer Performance
Lecture 03 (22 Feb 2016)

Henry Novianus Palit


hnpalit@petra.ac.id

Designing for Performance (1)


Computer trends:
- The cost of computer systems continues to drop
- The performance and capacity of computer systems continue to rise; e.g., today's laptops have the computing power of an IBM mainframe from 10-15 years ago

The speed of a computer in executing a program is affected by:
- the design of its instruction set
- the design of its hardware
- the design of its software (including the OS and the compiler)
- the technology in which the hardware is implemented
Arsitektur & Organisasi Komputer

Designing for Performance (2)


The speed of switching between 0 and 1 states in logic circuits is largely determined by the size of the transistors that implement the circuits; i.e., smaller transistors switch faster

Reducing transistor sizes has two advantages:
- Instructions can be executed faster
- More transistors can be placed on a chip, leading to more logic functionality and more memory storage capacity

Gordon Moore (Intel co-founder): "The number of transistors incorporated in a chip will approximately double every 24 months"

The raw speed of a microprocessor will not achieve its potential unless it is fed by a constant stream of work (i.e., computer instructions)

Designing for Performance (3)


Techniques to exploit the raw speed of a processor:
- Pipelining (a kind of instruction-level parallelism): the processor works on multiple instructions at once by moving data or instructions through a conceptual pipe, with all stages of the pipe processing simultaneously; e.g., while one instruction is being executed, the next instruction is being fetched and decoded
- Branch prediction: the processor looks ahead in the instruction code fetched from memory, predicts which branches or groups of instructions are likely to be processed next, and prefetches those instructions (possibly multiple branches ahead); this increases the amount of work available for the processor to execute
- Data flow analysis: the processor analyzes which instructions depend on each other's results to create an optimized schedule of instructions; this prevents unnecessary delay
- Speculative execution: using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program, holding the results in temporary locations; this keeps the execution engines busy
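The pipelining idea can be made concrete with a toy timing model. This is a minimal sketch; the 5-stage depth, the instruction count, and the no-hazard assumption are illustrative only, not a real processor:

```python
# Toy model: cycles to run n instructions on a k-stage pipeline,
# assuming one stage per clock cycle and no hazards or stalls.
def unpipelined_cycles(n, k):
    return n * k          # each instruction passes through all k stages serially

def pipelined_cycles(n, k):
    return k + (n - 1)    # fill the pipe once, then one instruction completes per cycle

n, k = 1000, 5
print(unpipelined_cycles(n, k))   # 5000
print(pipelined_cycles(n, k))     # 1004
print(round(unpipelined_cycles(n, k) / pipelined_cycles(n, k), 2))  # 4.98
```

For long instruction streams the speedup approaches the pipeline depth k, which is why deeper pipelines were pursued for so long.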

Five-Stage Pipeline


Parallelism (1)
Instruction-level parallelism
- Pipelining: see the previous slides
  - Pipelining allows a trade-off between latency (how long it takes to execute an instruction) and processor bandwidth (how many instructions/sec the CPU can complete)
- Superscalar architectures
  - A dual pipeline, or a single pipeline with multiple functional units
  - Two instructions can be issued together only if they neither conflict over resource usage (e.g., registers) nor depend on each other's results; this is either guaranteed by the compiler or detected & eliminated during execution by extra hardware
  - Most of the functional units in stage 4 take appreciably longer than one clock cycle to execute, certainly the ones that access memory or do floating-point arithmetic

Superscalar Architectures


Parallelism (2)
Processor-level parallelism
- Multicore processors
  - Fabricating multiple processing units on a single chip, e.g., dual-core, quad-core, hex-core, etc.
- Data parallel processors
  - SIMD (Single Instruction-stream Multiple Data-stream) processors consist of a large number of identical processors that perform the same sequence of instructions on different sets of data (e.g., GPUs / Graphics Processing Units)
  - Vector processors: very efficient at executing a sequence of operations on pairs of data elements, but all of the operations are performed in a single, heavily pipelined functional unit (e.g., SSE / Streaming SIMD Extensions from Intel)


SIMD Processor
Processing steps per cycle:
- The scheduler selects two threads to execute on the processor
- The next instruction from each thread then executes on up to 16 SIMD cores

The left picture shows the SIMD cores of an NVIDIA Fermi GPU. If each thread is able to use all 16 SIMD cores, a fully loaded GPU with 32 SMs (streaming multiprocessors) can perform 512 ops/cycle; a similar-sized general-purpose quad-core CPU would struggle to achieve 1/32 as much processing

Parallelism (3)
Processor-level parallelism (cont'd)
- Multiprocessors (SMP / symmetric multiprocessing)
  - Computer systems that contain many processors, each possibly containing multiple cores
  - Used either for executing a number of different application tasks concurrently or for executing subtasks of a single large task in parallel
  - All processors usually have access to all of the memory: a shared-memory multiprocessor
- Multicomputers (distributed or cluster computing)
  - Using an interconnected group of computers to achieve high total computational power
  - Computers normally have access only to their own memory units
  - Sharing data is done by exchanging messages over a communication network: a message-passing multicomputer

Multiprocessor

A single-bus multiprocessor

A multiprocessor with local memories (NUMA / Non-Uniform Memory Access)


Tianhe-2
World's Fastest Computer (as of November 2015)
source: http://www.top500.org

Processors          : Intel Xeon E5-2692 (12C) + Intel Xeon Phi 31S1P
Total cores         : 3,120,000
Memory              : 1 PB
Interconnect        : TH Express-2
Linpack performance : 33,862.7 TFlop/s (peak = 54,902.4 TFlop/s)
Power               : 17.8 MW
OS                  : Kylin Linux
MPI                 : MPICH2

Parallelism (4)


Performance Assessment
Performance is a key parameter in evaluating a computer system, along with cost, size, security, reliability, and power consumption
Raw speed is far less important than how a processor performs when executing a given application
Some measures of computer performance:
- Clock speed
- Instruction execution rate
- Benchmarks
- Amdahl's Law
- Little's Law

Clock Speed
Clock speed (or clock rate) is measured in cycles/second, or Hertz (Hz)
Clock signals typically are generated by a quartz crystal, which generates a constant signal wave while power is applied; the wave is in turn converted into a digital voltage pulse stream
Since the execution of an instruction involves a number of steps (fetching the instruction from memory, decoding the instruction, loading & storing data, and performing arithmetic & logical operations), most instructions require multiple clock cycles to complete
A straight comparison of clock speeds on different processors therefore does not tell the whole story about performance (e.g., when pipelining is used)

Instruction Execution Rate (1)


For a given processor, the number of clock cycles required
varies for different types of instructions
The average Cycles Per Instruction (CPI) for a given program is

    $\mathrm{CPI} = \dfrac{\sum_{i=1}^{n} (\mathrm{CPI}_i \times I_i)}{I_c}$

where CPI_i = number of cycles required for instruction type i
      I_i = number of executed instructions of type i
      I_c = instruction count (total number of instructions)
      n = number of instruction types

The processor time (T) needed to execute a given program is

    $T = I_c \times \mathrm{CPI} \times \tau$

where τ = the constant cycle time = 1/f (f = the constant clock frequency)
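As a sketch of these two formulas in code (the instruction mix and the 400-MHz clock below are illustrative values, not data from the slide):

```python
# CPI = sum(CPI_i * I_i) / Ic, and T = Ic * CPI * tau, with tau = 1/f.
def average_cpi(mix):
    """mix: list of (CPI_i, I_i) pairs, one per instruction type."""
    ic = sum(count for _, count in mix)
    return sum(cpi * count for cpi, count in mix) / ic

def exec_time(mix, f):
    """Processor time T for the program at clock frequency f (Hz)."""
    ic = sum(count for _, count in mix)
    return ic * average_cpi(mix) * (1.0 / f)

mix = [(1, 60), (2, 18), (4, 12), (8, 10)]   # hypothetical mix of 100 instructions
print(round(average_cpi(mix), 2))   # 2.24
print(exec_time(mix, 400e6))        # ~5.6e-07 seconds
```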

Instruction Execution Rate (2)


To differentiate memory and processor cycle times, the preceding equation can be rewritten as

    $T = I_c \times [p + (m \times k)] \times \tau$

where p = number of processor cycles needed to decode & execute the instruction
      m = number of memory references needed (on average)
      k = ratio between memory & processor cycle times

System attributes that influence the performance factors (I_c, p, m, k, τ)


Instruction Execution Rate (3)


A common measure of performance for a processor is the rate at which instructions are executed, expressed as millions of instructions per second (MIPS):

    $\mathrm{MIPS\ rate} = \dfrac{I_c}{T \times 10^6} = \dfrac{f}{\mathrm{CPI} \times 10^6}$

Another common performance measure, which deals only with floating-point instructions (common in many scientific and game applications), is expressed as millions of floating-point operations per second (MFLOPS):

    $\mathrm{MFLOPS\ rate} = \dfrac{\text{number of executed floating-point operations in a program}}{\text{execution time} \times 10^6}$


Benchmarks (1)
MIPS and MFLOPS are often inadequate for evaluating a processor's performance (e.g., a CISC machine and a RISC machine may have different MIPS rates even though both take about the same amount of time)
In the early 1990s, measuring the performance of systems shifted to using a set of benchmark programs
Desirable characteristics of a benchmark program:
Written in a high-level language, making it portable across machines
Representative of a particular kind of programming style such as
systems programming, numerical programming, or commercial
programming
Can be measured easily
Has wide distribution

Benchmarks (2)
SPEC (System Performance Evaluation Corporation) benchmarks are defined and maintained by an industry consortium (e.g., SPEC CPU2006, SPECjvm98, SPECjbb2000, SPECweb99, SPECmail2001)
Averaging results: run a number of different benchmark programs on each machine and then average the results

    simple arithmetic mean: $R_A = \dfrac{1}{m} \sum_{i=1}^{m} R_i$

    harmonic mean: $R_H = \dfrac{m}{\sum_{i=1}^{m} \frac{1}{R_i}}$

where R_i = the high-level language instruction execution rate for benchmark i
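The two averages above can be sketched as follows (the benchmark rates are made-up values):

```python
# R_A = (1/m) * sum(R_i); R_H = m / sum(1/R_i).
def arithmetic_mean(rates):
    return sum(rates) / len(rates)

def harmonic_mean(rates):
    return len(rates) / sum(1.0 / r for r in rates)

rates = [100.0, 200.0, 400.0]            # hypothetical benchmark execution rates
print(round(arithmetic_mean(rates), 1))  # 233.3
print(round(harmonic_mean(rates), 1))    # 171.4
```

Note that the harmonic mean is dominated by the slowest rates, which is why it is often preferred over the arithmetic mean when averaging rates rather than times.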


Benchmarks (3)
SPEC benchmarks are concerned with a speed metric and a rate metric
- Speed metric: measures the ability of a machine to complete a task
  - Results are reported as the ratio of the reference run time to the run time on the system under test: $r_i = \dfrac{T_{\mathrm{ref},i}}{T_{\mathrm{sut},i}}$
  - The overall performance measure for the system under test is calculated by averaging the ratio values with a geometric mean: $r_G = \left( \prod_{i=1}^{n} r_i \right)^{1/n}$
- Rate metric: measures the throughput, or rate, of a machine carrying out a number of tasks
  - Multiple copies (i.e., as many as the number of processors) of the benchmarks are run simultaneously, and a ratio is reported: $r_i = \dfrac{N \times T_{\mathrm{ref},i}}{T_{\mathrm{sut},i}}$
  - Here, $T_{\mathrm{sut},i}$ is the elapsed time from the start of the execution of the program on all N processors until the completion of all copies
  - A geometric mean is used to determine the overall performance measure

Amdahls Law (1)


Proposed by Gene Amdahl, Amdahl's Law deals with the potential speedup of a program using multiple processors compared to a single processor
- Consider a program running on a single processor such that a fraction (1 − f) of the execution time involves code that is inherently serial, and a fraction f involves code that is inherently parallelizable with no scheduling overhead
- Let T be the total execution time of the program using a single processor
- The speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is

    $\mathrm{Speedup} = \dfrac{\text{time to execute program on a single processor}}{\text{time to execute program on } N \text{ parallel processors}} = \dfrac{T}{T(1-f) + \frac{Tf}{N}} = \dfrac{1}{(1-f) + \frac{f}{N}}$
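The formula can be sketched directly in code (the f = 0.9 workload below is an illustrative assumption):

```python
# Amdahl's Law: Speedup = 1 / ((1 - f) + f/N) for parallel fraction f
# on N processors, assuming no scheduling overhead.
def amdahl_speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

print(round(amdahl_speedup(0.9, 8), 3))      # 4.706
print(round(amdahl_speedup(0.9, 10**6), 2))  # ~10: bounded by 1/(1-f)
```

Even with 90% of the work parallelizable, a million processors cannot push the speedup past 10, which previews the conclusions on the next slide.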


Amdahls Law (2)


Two conclusions can be drawn:
- When f is small, the use of parallel processors has little effect
- As N approaches infinity, speedup is bounded by 1/(1 − f), so there are diminishing returns for using more processors

These conclusions are too pessimistic, as a server can execute multiple threads or multiple tasks in parallel and exploit data parallelism

Speedup in general can be expressed as

    $\mathrm{Speedup} = \dfrac{\text{performance after enhancement}}{\text{performance before enhancement}} = \dfrac{\text{exec time before enhancement}}{\text{exec time after enhancement}}$

If a targeted enhancement is applied to fraction f, and the fraction's speedup after enhancement is $SU_f$, the overall speedup is

    $\mathrm{Speedup} = \dfrac{1}{(1-f) + \frac{f}{SU_f}}$


Littles Law
Based on queuing theory, Little's Law can be applied to any system that is statistically in steady state and in which there is no leakage
General setup:
- Suppose there is a steady-state system where items arrive at an average rate of λ items per unit time
- Items stay in the system an average of W units of time
- There is an average of L items in the system at any one time

Little's Law relates these three variables as L = λW
Under steady-state conditions, the average number of items in a queuing system equals the average rate at which items arrive multiplied by the average time that an item spends in the system
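A minimal sketch of the law (the arrival rate and time-in-system below are made-up values):

```python
# Little's Law: L = lambda * W
# (items in system = arrival rate * average time in system).
def items_in_system(arrival_rate, avg_time_in_system):
    return arrival_rate * avg_time_in_system

# e.g., requests arriving at 50 req/s, each spending 0.2 s in the server:
print(items_in_system(50.0, 0.2))   # 10.0 requests in the system on average
```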

Example: MIPS rate


Consider a program that is executed on a 400-MHz processor. The instruction mix and the CPI for each instruction type are given in the table below

Instruction type                  CPI   Instruction mix
Arithmetic and logic               1        60%
Load/store with cache hit          2        18%
Branch                             4        12%
Memory reference with cache miss   8        10%

Calculate the MIPS rate!

Solution:
Average CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24
MIPS rate = (400 × 10^6) / (2.24 × 10^6) ≈ 178.6
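The calculation above, checked in code:

```python
# Worked example: 400-MHz processor; instruction types with CPIs 1, 2, 4, 8
# making up 60%, 18%, 12%, 10% of the executed instructions.
mix = [(1, 0.60), (2, 0.18), (4, 0.12), (8, 0.10)]
avg_cpi = sum(cpi * frac for cpi, frac in mix)
mips = 400e6 / (avg_cpi * 1e6)
print(round(avg_cpi, 2))  # 2.24
print(round(mips, 1))     # 178.6
```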

Example: Speed Metric


The twelve SPEC integer speed ratios measured for benchmark programs on the Sun Blade 6250 are:

17.5, 14.0, 13.7, 17.6, 14.7, 18.6, 17.0, 31.3, 23.7, 9.23, 10.9, 14.7

Calculate the speed metric!

Solution:
Speed metric = (17.5 × 14.0 × 13.7 × 17.6 × 14.7 × 18.6 × 17.0 × 31.3 × 23.7 × 9.23 × 10.9 × 14.7)^(1/12) ≈ 16.09
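The geometric mean above, checked in code:

```python
import math

# SPEC speed metric: geometric mean of the twelve reference/run-time ratios.
ratios = [17.5, 14.0, 13.7, 17.6, 14.7, 18.6,
          17.0, 31.3, 23.7, 9.23, 10.9, 14.7]
speed_metric = math.prod(ratios) ** (1 / len(ratios))
print(round(speed_metric, 2))   # 16.09
```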

Example: Speedup (Amdahls Law)


Problem:
- Suppose that a task makes extensive use of floating-point operations, with 40% of its execution time consumed by floating-point operations
- With a new hardware design, the floating-point module is sped up by a factor of K
- Calculate the maximum overall speedup!

Solution:

    $\mathrm{Speedup} = \dfrac{1}{(1-0.4) + \frac{0.4}{K}} = \dfrac{1}{0.6 + \frac{0.4}{K}}$

Thus, independent of K, the maximum speedup is bounded by 1/0.6 ≈ 1.67
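Checking the bound in code:

```python
# Amdahl's Law for this example: Speedup = 1 / (0.6 + 0.4/K).
# As K grows, the 0.4/K term vanishes, so the speedup is capped at 1/0.6.
def speedup(k):
    return 1.0 / (0.6 + 0.4 / k)

print(round(speedup(2), 3))      # 1.25
print(round(speedup(100), 3))    # 1.656
print(round(1 / 0.6, 2))         # 1.67: the bound, independent of K
```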


