
Computer Architecture

Principal developments
on the basis of technology
and software. Performance
Measurement
Influence of new technologies, Microprocessor
economics, Trends in technology, Importance of
measuring performance, Quantitative performance
measurement, Performance metrics, Amdahl's
Law

Computer Architecture
Computer Architecture (CA) = Instruction Set
Architecture (ISA) + Machine organization
ISA = programmer's view of a machine
Factors that influence CA
Technology
Software
Application
IBM coined the term computer architecture in the early
1960s. Amdahl, Blaauw, and Brooks [1964] used the term to
refer to the programmer-visible portion of the IBM 360
instruction set

Influence of new technologies

New technology provides:


- greater speed
- smaller size
- higher reliability
- lower cost
- allows designers and engineers to
consider new opportunities

CA - Technology
Technology is the dominating factor in the
design of computers and, correspondingly, the
organization of CA
The development of technology (transistors, IC,
VLSI, Flash memory, Laser disk, CDs)
influenced the development of computers
The development of computers (core memories,
magnetic tapes, disks) influenced the
development of technology
VLSI (Very-Large-Scale Integration) is the process of creating integrated
circuits by combining thousands of transistors into a single chip. VLSI began in the 1970s
when complex semiconductor and communication technologies were being developed.
Nowadays there are millions, even billions of transistors on a chip

CA - Technology
The development of both computers and technology
influenced each other (ROMs, RAMs, VLSI,
packaging, low power, etc.)
Fast new processors required faster/new peripheral
chips, new memory controllers, new I/O; fast and
powerful computers facilitate, improve and speed up
design, simulation and manufacturing.
Transistors use power, which means that they
generate heat that must be removed. The heat
makes the design less reliable. In 20 years voltages
have gone from 5 V to 1.5 V, significantly reducing
power

Relative performance per unit cost of technologies
used in computers over time (source: H&P)

Year | Technology used in computers | Relative performance/unit cost
1951 | Vacuum tube                  | 1
1965 | Transistor                   | 35
1975 | Integrated circuit (IC)      | 900
1995 | Very large scale IC          | 2,400,000
2005 | Ultra large scale IC         | 6,200,000,000

Manufacturing chips
Silicon (a natural semiconductor element) crystal ingot -
a rod composed of a silicon crystal (6-12 inches in diameter
and 12-24 inches long) - is sliced into wafers less than 0.1 inch thick
Wafers are processed (patterns of chemicals are placed on
each wafer, creating transistors - 1 layer, conductors -
2 to 8 levels, and insulators separating the
conductors)
Processed wafers are tested for defects and then
chopped up (diced) into dies (chips). Bad dies are
discarded (yield = % of good dies from the total number of
dies)
Good dies are connected to the I/O pins of a package by
bonding. Packaged parts are tested, as mistakes can
occur in packaging. Then chips are shipped

The chip manufacturing process (flow)

Silicon ingot -> Slicer -> Blank wafers -> 20 to 40 processing steps ->
Patterned wafers -> Wafer tester -> Tested wafers -> Dicer -> Tested dies ->
Bond die to package -> Packaged dies -> Part tester -> Tested packaged dies ->
Ship to customers

Microprocessor economics
Designing a state-of-the-art processor requires:
- Pentium: 300 engineers
- PentiumPro: 500 engineers
Huge investments in fabrication lines
The manufacturer needs to sell in the range of 2 to 4
million units to be profitable
The design cost of a high-end CPU is on the order of
US $100 million (http://www.wordiq.com/definition/CPU_design)
A microprocessor plant might cost 1.6 billion $
(http://www.gamespot.com/news/6025378.html)

Microprocessor economics
To stay competitive a company has to fund at
least two large design teams to release products
at the rate of 2.5 years per product generation.
Continuous improvements are needed to
improve yields and clock speed.
Prices drop to about one tenth in 2-3 years.
Only the mass computer market (production rates in
the hundreds of millions and billions of dollars in
revenue) can support such economics (personal
computers, car computers, cell phones, etc.)

A task
1. Let us assume that you are in a company
marketing a certain IC chip. Costs (fixed),
including R&D, fabrication and equipment,
etc., add up to $500,000. The cost per wafer is
$6000. Each wafer can be diced into 1500
dies. The die yield is 50%. The dies are
packaged and tested (at the end), with a cost
of $10 per chip. The test yield is 90%. Only
those that pass the test will be sold to
customers. If the retail price is 40% more than
the cost, at least how many chips have to be
sold to break even?
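One way to work the numbers, sketched in Python. This assumes the 40% markup applies to the per-chip variable cost and that the fixed $500,000 must be recovered from the per-chip margin; the task statement leaves both points open, so treat this as one possible interpretation:

```python
import math

# Break-even estimate for the IC chip task. Assumptions: "cost" in the 40%
# markup means the per-chip variable cost, and the fixed costs are recovered
# from the per-chip margin.
fixed_cost = 500_000        # R&D, fabrication, equipment, etc. ($)
wafer_cost = 6_000          # $ per wafer
dies_per_wafer = 1_500
die_yield = 0.50            # fraction of good dies per wafer
package_test_cost = 10      # packaging + final test, $ per chip
test_yield = 0.90           # fraction of packaged chips that pass the test

# Wafer cost is spread over the good dies only.
cost_per_good_die = wafer_cost / (dies_per_wafer * die_yield)              # $8

# The 10% of chips that fail the final test are also paid for, so spread
# their cost over the chips that can actually be sold.
cost_per_sold_chip = (cost_per_good_die + package_test_cost) / test_yield  # $20

retail_price = 1.4 * cost_per_sold_chip        # 40% more than cost -> $28
margin_per_chip = retail_price - cost_per_sold_chip

break_even_chips = math.ceil(fixed_cost / margin_per_chip)
print(cost_per_sold_chip, retail_price, break_even_chips)  # 20.0 28.0 62500
```

So under these assumptions roughly 62,500 chips must be sold to break even.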

Technology driven views of CA

Implement an (old) ISA, using new technology
An iterative process:
- Select new features
- Design datapaths and control
- Estimate cost
- Measure performance with simulators
After expertise:
- Hardware and circuit design
- Simulation, verification and testing
- Back-end compilers and performance evaluation

Trends in Technology
Today's designers must be aware of rapid
changes in implementation technology; some
of them are critical to modern implementations:
- IC logic technology - transistor density
increases by about 35% per year
- Semiconductor DRAM - capacity increases by
about 40% per year
- Magnetic disk technology - before the '90s a
30% increase per year, 60% thereafter, 100% in
1996, and since 2004 30%. It is still 50-100 times
cheaper than DRAM

Trends in Technology
For magnetic disks a new recording technique has evolved -
vertical (perpendicular) recording
Its areal density ranges somewhere between 100 and
150 gigabits per square inch
The magnetization of the bits stands them on end,
perpendicular to the plane of the disk, giving the name
vertical recording
Future seek time reductions are expected to be minimal;
most of the performance improvement for disk
drives will most likely come through faster rpm
Disk technology roadmaps indicate that disk drive
capacity will approach 5 terabytes by 2015

Trends in Technology
Network technology depends mainly on the
performance of switches and of transmission systems
Networking technology trends go for flexibility and remote
access to network resources
WAN optimization reduces traffic to remote locations on
the network, consolidating data and caching large and
frequently used files; this improves application
performance for branch locations and remote workers, and
also leads to reduced costs for bandwidth
Another modern trend is the cloud, which increases
collaboration capability

Scaling transistor performance and wires


IC processes are characterized by the feature size - the minimum
size of a transistor. It was 10 microns in 1971 and
nowadays it is 45 nanometers (0.045 microns) !!!
Reduced size allowed moving from 4-bit to 8-, 16-, 32- and
recently to 64-bit microprocessors
In general transistors improve in performance with
decreased feature size; however, wires in an IC do not.
The signal delay for a wire increases in proportion to the product
of its resistance and capacitance. Shrinking feature size
makes resistance and capacitance worse!
Wire delay scales poorly compared to transistor
performance - a major design limitation in recent years

Trends in power in IC
Power provides challenges as devices grow in
size and number of transistors
Chips have hundreds of pins and multiple
interconnect layers, so power and ground must
be provided for all parts of the chip.
For CMOS chips, energy consumption is in
switching transistors (i.e. dynamic power).
Dynamic power is proportional to the capacitive load,
the square of the voltage, and the frequency of switching:
Power = 1/2 x Capacitive load x Voltage^2 x Frequency switched
Lowering the voltage reduces dynamic power;
it is already just over 1 V (down from 5 V)
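The relation above can be illustrated in a few lines of Python; the capacitance and frequency values below are made-up placeholders, since only the voltage ratio matters for the comparison:

```python
# Dynamic power of a CMOS chip: P = 1/2 * C * V^2 * f.
def dynamic_power(cap_load_farads, voltage, freq_hz):
    """Dynamic power in watts."""
    return 0.5 * cap_load_farads * voltage ** 2 * freq_hz

# Same (hypothetical) chip, with the supply voltage lowered from 5 V to 1.5 V.
p_old = dynamic_power(1e-9, 5.0, 1e9)   # ~12.5 W
p_new = dynamic_power(1e-9, 1.5, 1e9)   # ~1.125 W

# Power scales with V^2: (5/1.5)^2 ~ 11x reduction, independent of C and f.
print(round(p_old / p_new, 2))  # 11.11
```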

Trends in power in IC
Power is now the major limitation to using transistors;
in the past it was raw silicon area
Most microprocessors today turn off the clock of inactive
modules to save energy and dynamic power (if no FP
instructions are executing, the clock of the FPU is disabled)
Static power is also becoming an issue because leakage
current flows even when a transistor is off.
Leakage current increases with smaller transistor sizes.
In 2006, the goal for leakage was 25% of the total power
consumption !!!
One way to overcome this was placing multiple
processors on a chip running at lower voltages and clock
rates

Software
Software is another important factor
influencing CA
Before the mid-fifties, software played almost
no role in defining architecture
As people write programs and use
computers, our understanding of
programming and program behavior
improves.
This has a profound, though slower, impact on
computer architecture
Modern architects cannot avoid paying
attention to software and compilation issues

Von Neumann Machines


The stored-program concept - a key architectural
concept that paved the way for modern processors
Program and data are stored in the same
memory
The program counter (PC) points to the current
instruction in memory; updated on every
instruction, it addresses the next instruction
Program words are fetched from sequential
memory locations

Situation in the mid '50s

- Expensive hardware
- Small memory size (1000 words)
- No resident system software!
- Memory access time - 10 to 50 times slower than the
processor cycle
- Instruction execution time - totally dominated by the memory
reference time

The ability to design complex control circuits to execute
an instruction was the central design concern, as
opposed to the speed of decoding or an ALU operation
The programmer's view of the machine was inseparable from
the actual hardware implementation

Compatibility Problem at IBM


By the early '60s, IBM had 4 incompatible lines of
computers!
701, 650, 702, 1401
Each system had its own:
- instruction set
- I/O system and secondary storage: magnetic tapes, drums
and disks
- assemblers, compilers, libraries, ...
- market niche: business, scientific, real time, ...
This caused problems and led to the creation of the IBM 360

Programmer's view of the machine - IBM 650

A drum machine with 44 instructions
Instruction: 60 1234 1009
Load the contents of location 1234 into the
distributor; put it also into the upper
accumulator; set the lower accumulator to zero; and
then go to location 1009 for the next instruction.
Good programmers optimized the placement of
instructions on the drum to reduce latency!
What is an Instruction set?

Instruction set
An instruction set is the set of basic
operations that a processor can accomplish. A
processor's instruction set is a determining
factor in its architecture, even though the same
architecture can lead to different
implementations by different manufacturers.
The processor works efficiently thanks to a
limited number of instructions, hardwired into the
electronic circuits. Most operations can be
performed using basic functions. Some
architectures include advanced processor
functions

Instruction set
Independent of machine organization
Machine families
Machines with very different
Organizations
Capabilities

Running the same software

The IBM 360 instruction set architecture completely
hid the underlying technological differences
between the various models
From model 30 up to model 70, the same
instruction set runs

Machine organization
Machine Organization
Number of functional blocks
Interconnect pattern
Transparent to software (affects performance
though!)
Microcode (a layer of hardware-level
instructions and/or data structures involved in
the implementation of higher-level machine code
instructions; it usually does not reside in the main
memory, but in a special high-speed memory)

The Earliest Instruction Sets


Single Accumulator - a carry-over from the calculators.
LOAD, STORE, ADD, SUB, MUL, DIV, SHIFT LEFT, SHIFT RIGHT,
JUMP, JGE, LOAD, HLT
Typically less than 2 dozen instructions!

Processor-Memory Bottleneck: Early Solutions


Fast local storage in the processor
8-16 registers as opposed to one accumulator

Indexing capability
to reduce bookkeeping instructions (modification for the next
iteration - otherwise one would have to remember a lot of variables/traces)

Complex instructions
to reduce instruction fetches

Compact instructions
implicit address bits for operands, to reduce instruction fetches

Processor State
The information held in the processor at the end
of an instruction to provide the processing
context for the next instruction
Programmer visible state of the processor (and
memory) plays a central role in computer
organization for both hardware and software:
Software must make efficient use of it
If the processing of an instruction can be interrupted
then the hardware must save and restore the state in
a transparent manner

The programmer's machine model is a contract
between the hardware and the software

Classifying Instruction Set Architectures

Two classes of register computers:
- can access memory as part of any
instruction (register-memory)
- can access memory only with load and
store instructions (load-store)
Most early computers used stack or
accumulator architectures
Since 1980, almost all use the load-store
architecture

Operand locations for instruction set architecture classes
(figure): for each class - stack, accumulator, register-memory,
register-register/load-store - the figure shows which ALU
operands live in the processor (stack, accumulator, registers)
and which in memory

C = A + B

Stack:
Push A
Push B
Add
Pop C

Accumulator:
Load A
Add B
Store C

Register (register-memory):
Load R1, A
Add R3, R1, B
Store R3, C

Register (load-store):
Load R1, A
Load R2, B
Add R3, R1, R2
Store R3, C

Classifying Instruction Set Architectures

Reasons for the emergence of general-purpose
register (GPR) computers:
- registers are faster than memory
- registers are more efficient for a compiler to
use than other forms of internal storage
(e.g. (A*B) - (B*C) - (A*D) can be evaluated by doing
the multiplications in any order - more efficient with
pipelining; on a stack, evaluation proceeds in only
one order - operands are hidden in the stack and
they are loaded multiple times)

Performance
What is performance?
Is it response time or execution time?
Response time is the time a system or
functional unit takes to react to a given input
Execution time is the time to execute the
program - between its start and finish
The two are identical - the time to complete one task
Measured in seconds, milliseconds, microseconds,
nanoseconds, picoseconds

Performance, latency and bandwidth

Another term used is latency
Latency (response time) is typically measured in
nanoseconds for processors and RAM, microseconds for
LANs and milliseconds for hard disk access
Performance is the primary differentiator for
microprocessors and networks - they have improved
1000-2000 times in bandwidth but only 20-40 times in
latency
Bandwidth is used for the amount of information that
can flow through a network/bus in a given period of time
- a data transmission rate; the maximum amount of
information (bits/second) that can be transmitted along a
channel

Capacity, bandwidth, latency

Capacity is generally more important than
performance for memory and disks, so capacity
has improved most - their bandwidth advances
were 120-140 times, while their gains in latency
were 4-8 times. Bandwidth has outpaced
latency across these technologies and will likely
continue to do so
Figure 1.9 from H&P (page 16, 4th edition) gives the
performance milestones over 20 to 25 years for
microprocessors, memory, networks, and disks
(1978-2003)
A task - try to give some figures for 2014!

Importance of measuring
performance
Real systems:
Within the free market/during procurement - to
decide which system to purchase
For system maintenance and capacity planning
to predict and plan when an upgrade is
needed either for parts of a system, or for the
entire system
For the applications (i.e. tuning) to be able to
find bottlenecks/hotspots in the application and
record them/take actions

Importance of measuring
performance
During dynamic compilation - to perform
heavy optimizations on application
hotspots
As a feedback for architects to find out
what are the performance bottlenecks in a
particular design
Paper design:
To be able to compare design alternatives

Development of Quantitative
Performance Measures
Initially designers set performance goals -
ENIAC was to be 1000 times faster than the
Harvard Mark-I, and the IBM Stretch (7030) was
to be 100 times faster than the fastest machine
then in existence
It wasn't clear how this performance was
measured
The original measure of performance was the time
to perform an operation (an addition, for example,
as most instructions had the same execution time)

Development of Quantitative
Performance Measures
Execution times of instructions in a machine
became more diverse, hence the time for one
operation was no longer good for comparisons
An instruction mix was used, weighted according to
the relative frequency of instructions across many
programs (an early popular example - the Gibson
mix, 1970)
Average instruction execution time =
Σ (instruction time x weight in the mix)

Development of Quantitative
Performance Measures
Measured in clock cycles, the average
instruction execution time is the same as the
average CPI (clock cycles per instruction):

CPI = CPU clock cycles for a program / Instruction count

CPU time = Instruction count x CPI x Clock cycle time
         = (Instruction count x CPI) / Clock rate

Logical and easy to understand - MIPS,
millions of instructions per second (for a fixed
clock rate, inversely proportional to CPI):

MIPS = Instruction count / (Execution time x 10^6)

Example
Comp A - clock cycle time = 250 ps, CPI for
ProgramX = 2.0
Comp B - clock cycle time = 500 ps, CPI for
ProgramX = 1.2; number of instructions = I
Which is faster (same ISA)? By how much?
CPU clock cycles_A = I x 2.0; CPU time_A = cycles x
time = I x 2.0 x 250 ps = 500 x I ps
CPU clock cycles_B = I x 1.2; CPU time_B = cycles x
time = I x 1.2 x 500 ps = 600 x I ps
Performance_A / Performance_B = CPU time_B / CPU time_A
= 600 / 500 = 1.2, so A is 1.2 times faster
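The formulas above are easy to check in code; the instruction count I cancels out of the comparison, so the value used below is an arbitrary placeholder:

```python
def cpu_time_ps(instr_count, cpi, cycle_time_ps):
    """CPU time = instruction count x CPI x clock cycle time (in picoseconds)."""
    return instr_count * cpi * cycle_time_ps

I = 1_000_000  # arbitrary instruction count; it cancels out of the ratio

time_a = cpu_time_ps(I, cpi=2.0, cycle_time_ps=250)  # 500 * I ps
time_b = cpu_time_ps(I, cpi=1.2, cycle_time_ps=500)  # 600 * I ps

# Performance is the inverse of execution time, so A is faster by 600/500:
print(time_b / time_a)  # 1.2

# MIPS = instruction count / (execution time in seconds x 10^6)
mips_a = I / (time_a * 1e-12 * 1e6)  # ~2000 MIPS
mips_b = I / (time_b * 1e-12 * 1e6)  # ~1667 MIPS
```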

Example

ProgramX runs on computer A for 15 seconds
A new compiler requires 0.6 as many instructions,
but CPI is increased by a factor of 1.1
How fast will ProgramX run with the new
compiler?

Ex time_old = Instruction count x CPI x clock cycle time
Ex time_new = 0.6 x Instruction count x 1.1 x CPI x clock cycle time
Ex time_new = 0.6 x 1.1 x (Instruction count x CPI x clock cycle time)
Ex time_new = 0.6 x 1.1 x 15 = 9.9 seconds

Development of Quantitative
Performance Measures
CPUs became more complex and sophisticated;
they relied on pipelining and memory hierarchies
As a CISC machine requires fewer instructions,
one with a lower MIPS rating might be equivalent
to a RISC one with a higher rating
No longer a single execution time per
instruction, hence MIPS could not be calculated
from the instruction mix
This was how benchmarking emerged - using
kernels and synthetic programs for measuring
performance

Development of Quantitative
Performance Measures
Relative MIPS for a machine M was defined
based on some reference machine as:

MIPS_M = (Performance_M / Performance_reference) x MIPS_reference

The popularity of the VAX-11/780 made it a
popular reference machine for relative MIPS - it
was easy to calculate, so during the early 1980s
the term MIPS was almost universally used to
mean relative MIPS

Development of Quantitative
Performance Measures
In the '70s and '80s the growth of the
supercomputer industry and the use of
floating-point-intensive programs led to the
introduction of MFLOPS (millions of
floating-point operations per second) - the
inverse of execution time for a benchmark -
so marketing people started quoting peak
MFLOPS

Development of Quantitative
Performance Measures
During the late 1980s, SPEC (System
Performance and Evaluation Cooperative) was
founded - to improve the state of benchmarking,
to have better basis for comparisons
Initially focused on workstations and servers in
the UNIX marketplace
The first release of SPEC benchmarks (called
SPEC89) - a substantial improvement in the use
of more realistic benchmarks. SPEC2006 still
dominates processor benchmarks almost two
decades later

How to measure performance


Real systems:

- wall-clock time, response time, or elapsed time


- operating system timer functions
- interrupt-driven profiling (gprof)
- compiler or executable editing to insert software
counters
- external hardware (logic analyzers)
- integrated performance monitoring hardware
(event counters)
- benchmarks

How to measure performance


Paper Designs:
- analytical techniques (queuing theory,
performance models)
- hand simulation (pencil and paper)
- software simulation (write program to
model machine)
- hardware emulation (program FPGAs to
mimic machine)

Performance metrics
MIPS (millions of instructions per second) - it is
only meaningful when running the same
executable code on the same inputs
MFLOPS (millions of floating-point operations
per second) - problems: how many FLOPs in a
divide? A sqrt? A sine? (1 flop for add, sub, mul; 4
for div, sqrt; 8 for exp, sine. For example, a
kernel with one add, one divide, and one sine
would be credited with 13 normalized floating-point operations)
Some inefficient algorithms have high MFLOPS,
hence - only meaningful for the same algorithm
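The normalized-FLOP bookkeeping described above can be sketched as a small lookup table; the weights are the ones quoted in the text:

```python
# Normalized floating-point operation weights, as quoted in the text:
# 1 for add/sub/mul, 4 for div/sqrt, 8 for exp/sin.
FLOP_WEIGHTS = {
    "add": 1, "sub": 1, "mul": 1,
    "div": 4, "sqrt": 4,
    "exp": 8, "sin": 8,
}

def normalized_flops(ops):
    """Total normalized FP operations for a sequence of operation names."""
    return sum(FLOP_WEIGHTS[op] for op in ops)

# The kernel from the text: one add, one divide and one sine -> 13.
print(normalized_flops(["add", "div", "sin"]))  # 13

def normalized_mflops(ops, exec_time_seconds):
    """Normalized MFLOPS for a run taking exec_time_seconds."""
    return normalized_flops(ops) / (exec_time_seconds * 1e6)
```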

Performance metrics
TPS (transactions per second) - uses TP (transaction-processing)
benchmarks (TPS also means Transaction Processing System)

Many other similar measures


- graphics millions of triangles per second
- neural networks millions of connections per second
All rate measures are 1/time
Execution time is the primary performance metric.
Others: availability, channel capacity, scalability,
performance per watt (when the cost of powering the computer
outweighs the cost of the computer itself), compression
ratio !!!

Benchmarks
The best way of measuring performance is real
applications used as benchmarks (e.g. a compiler)
Benchmark suites - collections of benchmark
applications - are a popular measure of the performance of
processors
The goal of a benchmark suite is to characterize the
relative performance of two computers
One of the most successful attempts at standardized
benchmark applications is SPEC (Standard
Performance Evaluation Corporation)
The evolution of the computer industry led to the need
for different benchmark suites, so nowadays there are
SPEC benchmarks to cover different application classes
(only 3 integer programs and 3 floating-point programs
survived three or more generations)

Amdahls Law
Describes the performance gain that can
be obtained by improving some portion of
a computer defines speedup
It states that the performance
improvement to be gained from using
some faster mode of execution is limited
by the fraction of the time the faster mode
can be used

Speedup

Speedup = Performance for the entire task using the enhancement when possible
        / Performance for the entire task without using the enhancement

Speedup = Execution time for the entire task without the enhancement
        / Execution time for the entire task with the enhancement when possible

Speedup from an enhancement depends on 2 factors:
- the fraction of the computation time in the original computer
that can be converted to take advantage of the
enhancement - Fraction_enhanced (always less than or equal to 1)
- the improvement gained by the enhanced mode - how
much faster the task would run - Speedup_enhanced (always
greater than 1)

Amdahl's Law

Execution time_new = Execution time_old x
  ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = Execution time_old / Execution time_new
  = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
We introduce a 10 times faster processor for Web serving
The processor is busy with computation 40% of the time and
waiting for I/O 60% of the time
What is the overall speedup?

Fraction_enhanced = 0.4; Speedup_enhanced = 10
Speedup_overall = 1 / (0.6 + (0.4/10)) = 1/0.64 = 1.56
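Amdahl's Law drops straight into a one-line function; the second call below probes the limiting case of an (almost) infinitely fast enhancement:

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup: fraction_enhanced <= 1, speedup_enhanced > 1."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Web-serving example from the text: a 10x faster processor, busy 40% of the time.
print(round(amdahl_speedup(0.4, 10), 4))    # 1.5625 (~1.56)

# Even an (almost) infinitely fast enhancement is capped at 1 / (1 - fraction):
print(round(amdahl_speedup(0.4, 1e12), 4))  # 1.6667
```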

Amdahls Law - discussions


The incremental improvement in speedup
gained by an improvement of just a portion of
the computation diminishes as improvements
are added
If an enhancement is used for only a fraction of a task,
the task cannot be sped up by more
than the reciprocal of (1 - fraction)
The law is a guide to how much an
enhancement will improve performance and how
to distribute resources to improve
cost-performance

Amdahls Law
The goal is to spend resources
proportional to where time is spent
Useful for comparing the overall system
performance of two alternatives and for
comparing two processor design
alternatives

Task
A common transformation required in graphics
processors is square root. Implementations of FP square
root (FPSQR) vary significantly in performance. Suppose FPSQR
is responsible for 20% of the execution time of a
critical graphics benchmark. One proposal is to enhance
the FPSQR hardware and speed this operation up by a factor of
10. The alternative is just to make all FP instructions run
faster; FP instructions are responsible for … of the execution time. By
how much do the FP instructions have to be accelerated to
achieve the same performance as achieved by inserting
the specialized hardware?
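A sketch of the comparison using Amdahl's Law. The FP fraction is missing from the task statement above, so one half is ASSUMED here purely for illustration; substitute the real figure if you have it:

```python
def amdahl(fraction, speedup):
    """Overall speedup from Amdahl's Law."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

# Option 1: speed up FPSQR (20% of execution time) by a factor of 10.
speedup_fpsqr = amdahl(0.20, 10)   # 1 / (0.8 + 0.02) ~ 1.2195

# Option 2: speed up all FP instructions by some factor x.
# ASSUMED: FP instructions account for half of the execution time.
fp_fraction = 0.5
# Solve amdahl(fp_fraction, x) == speedup_fpsqr for x:
#   1/speedup_fpsqr = (1 - f) + f/x  =>  x = f / (1/speedup_fpsqr - (1 - f))
required_fp_speedup = fp_fraction / (1 / speedup_fpsqr - (1 - fp_fraction))

print(round(speedup_fpsqr, 4), round(required_fp_speedup, 4))  # 1.2195 1.5625
```

Under the assumed fraction, accelerating all FP instructions by about 1.56x matches the specialized FPSQR hardware.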
