Principal developments on the basis of technology
and software. Performance
Measurement
Influence of new technologies, microprocessor
economics, trends in technology, importance of
measuring performance, quantitative performance
measurement, performance metrics, Amdahl's
Law
Computer Architecture
Computer Architecture (CA) = Instruction Set
Architecture (ISA) + Machine organization
ISA = the programmer's view of a machine
Factors that influence CA
Technology
Software
Application
IBM coined the term computer architecture in the early
1960s. Amdahl, Blaauw, and Brooks [1964] used the term to
refer to the programmer-visible portion of the IBM 360
instruction set
CA - Technology
Technology is the dominant factor in the
design of computers, and in particular in the
organization of CA
The development of technology (transistors, IC,
VLSI, Flash memory, Laser disk, CDs)
influenced the development of computers
The development of computers (core memories,
magnetic tapes, disks) influenced the
development of technology
VLSI (Very-Large-Scale Integration)
CA - Technology
The development of computers and technology
influenced each other (ROMs, RAMs, VLSI,
packaging, low power, etc.)
Fast new processors required faster/new peripheral
chips, new memory controllers, new I/O; fast and
powerful computers facilitate, improve and speed up
design, simulation, and manufacturing.
Transistors use power, which means that they
generate heat that must be removed. The heat
makes the design less reliable. In 20 years, supply
voltages have gone from 5 V to 1.5 V, significantly
reducing power
Technology used in computers            Relative performance/unit cost
1951  Vacuum tube                                                    1
1965  Transistor                                                    35
1975  Integrated circuit                                           900
1995  Very-large-scale integrated circuit (VLSI)             2,400,000
2005  Ultra-large-scale integrated circuit               6,200,000,000
Manufacturing chips
Silicon (a natural semiconductor element) crystal ingot:
a rod composed of a silicon crystal (6-12 inches in diameter
and 12-24 inches long) is sliced into wafers less than 0.1 inch thick
Wafers are processed (patterns of chemicals are placed on
each wafer, creating transistors (1 layer), conductors
(2 to 8 levels) and insulators separating the
conductors)
Processed wafers are tested for defects and then
chopped up (diced) into dies (chips). Bad dies are
discarded (yield = the percentage of good dies out of
the total number of dies)
Good dies are connected to the I/O pins of a package by
bonding. Packaged parts are tested, as mistakes can
occur in packaging. Then the chips are shipped
[Figure: chip manufacturing flow: silicon ingot → slicer → blank wafers → 20 to 40 processing steps → patterned wafers → wafer tester → tested wafer → dicer → tested dies → bond die to package → packaged dies → part tester → ship to customers]
Microprocessor economics
Designing a state-of-the-art processor requires:
Pentium: 300 engineers
Pentium Pro: 500 engineers
Microprocessor economics
To stay competitive, a company has to fund at
least two large design teams to release products
at the rate of one product generation every 2.5 years.
Continuous improvements are needed to
improve yields and clock speed.
Prices drop to about one tenth in 2-3 years.
Only mass markets (production rates in
the hundreds of millions of units and billions of dollars in
revenue) can support such economics (personal
computers, car computers, cell phones, etc.)
A task
1. Let us assume that you are in a company
marketing a certain IC chip. Costs (fixed),
including R&D, fabrication and equipment,
etc., add up to $500,000. The cost per wafer is
$6000. Each wafer can be diced into 1500
dies. The die yield is 50%. The dies are
packaged and tested (at the end), with a cost
of $10 per chip. The test yield is 90%. Only
those that pass the test will be sold to
customers. If the retail price is 40% more than
the cost, at least how many chips have to be
sold to break even?
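A minimal Python sketch of the break-even arithmetic (variable names are illustrative; it assumes the packaging/test cost of chips that fail the final test is amortized over the chips that pass):

import math

fixed_cost = 500_000        # R&D, fabrication, equipment ($)
wafer_cost = 6_000          # cost per wafer ($)
dies_per_wafer = 1_500
die_yield = 0.50            # fraction of good dies per wafer
package_test_cost = 10      # packaging and testing, per chip ($)
test_yield = 0.90           # fraction of packaged chips that pass the test

good_dies_per_wafer = dies_per_wafer * die_yield                  # 750
die_cost = wafer_cost / good_dies_per_wafer                       # $8 per good die
cost_per_sold_chip = (die_cost + package_test_cost) / test_yield  # $20
price = 1.40 * cost_per_sold_chip                                 # $28 retail
margin = price - cost_per_sold_chip                               # $8 per chip sold

print(math.ceil(fixed_cost / margin))                             # 62500 chips to break even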
Required areas of expertise:
Hardware and circuit design
Simulation, verification and testing
Back-end compilers and performance evaluation
Trends in Technology
Today's designers must be aware of rapid
changes in implementation technology; some
of them are critical to modern implementations
(doubling times for these rates are sketched below):
- IC logic technology: transistor density
increases by about 35% per year
- Semiconductor DRAM: capacity increases by
about 40% per year
- Magnetic disk technology: about 30% increase
per year before the 1990s, 60% thereafter, 100% in
1996, and back to about 30% since 2004. Disk storage
is still 50-100 times cheaper per bit than DRAM
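A quick sanity check on what these growth rates mean, as a minimal Python sketch (compound growth of r per year doubles capacity every ln 2 / ln(1 + r) years):

import math

def doubling_time(annual_growth):
    # Years needed to double at a given compound annual growth rate.
    return math.log(2) / math.log(1 + annual_growth)

print(round(doubling_time(0.35), 1))  # ~2.3 years for 35%/year transistor density
print(round(doubling_time(0.40), 1))  # ~2.1 years for 40%/year DRAM capacity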
Trends in Technology
For magnetic disks, a new recording technique has evolved:
vertical (perpendicular) recording
Its areal density ranges somewhere between 100 and
150 gigabits per square inch
The magnetization of the bits stands them on end,
perpendicular to the plane of the disk, giving the name
vertical recording
Future seek-time reductions are expected to be minimal;
most of the performance improvements for disk
drives will most likely come through faster rotation (rpm)
Disk technology roadmaps indicate that disk drive
capacity will approach 5 terabytes by 2015
Trends in Technology
Network technology depends mainly on the
performance of switches and of transmission systems
Networking technology trends favor flexibility and remote
access to network resources
WAN optimization reduces traffic to remote locations on
the network by consolidating data and caching large,
frequently used files; this improves application
performance for branch locations and remote workers, and
also reduces bandwidth costs
Another modern trend is the cloud, which increases
collaboration capability
Trends in power in IC
Power provides challenges as devices grow in
size and number of transistors
Chips have hundreds of pins and multiple
interconnect layers, so power and ground must
be provided to all parts of the chip.
For CMOS chips, energy consumption is in
switching transistors (i.e. dynamic power).
Power is proportional to the capacitive load, the
square of the voltage, and the switching frequency:
Power_dynamic = 1/2 × Capacitive load × Voltage² × Switching frequency
Lowering the voltage reduces dynamic power; the
supply voltage is already just over 1 V (down from 5 V)
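A small sketch of the dynamic-power relation above; the load and frequency values are made up for illustration, only the voltage ratio matters here:

def dynamic_power(cap_load, voltage, freq):
    # Dynamic power of CMOS switching: 1/2 x C x V^2 x f
    return 0.5 * cap_load * voltage**2 * freq

p_5v = dynamic_power(1e-9, 5.0, 1e9)  # hypothetical chip at a 5 V supply
p_1v = dynamic_power(1e-9, 1.0, 1e9)  # same chip at a 1 V supply
print(p_5v / p_1v)                    # 25.0: power falls with the square of voltage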
Trends in power in IC
Power is now the major limitation on using transistors;
in the past it was raw silicon area
Most microprocessors today turn off the clock of inactive
modules to save energy and dynamic power (if no
floating-point instructions are executing, the clock of the
FPU is disabled)
Static power is also becoming an issue because
leakage current flows even when a transistor is off.
Leakage current increases with smaller transistor sizes.
In 2006, the goal for leakage was 25% of the total power
consumption!
One way to overcome this is to place multiple
processors on a chip, running at lower voltages and clock
rates
Software
Software is another important factor
influencing CA
Before the mid-fifties, software played almost
no role in defining architecture
As people write programs and use
computers, our understanding of
programming and program behavior
improves.
This has a profound, though slower, impact on
computer architecture
Modern architects cannot avoid paying
attention to software and compilation issues
Instruction set
An instruction set is the set of basic
operations that a processor can perform. A
processor's instruction set is a determining
factor in its architecture, even though the same
architecture can lead to different
implementations by different manufacturers.
The processor works efficiently thanks to a
limited number of instructions, hardwired into the
electronic circuits. Most operations can be
performed using these basic functions. Some
architectures include advanced processor
functions
Instruction set
Independent of machine organization
Machine families: machines with very different
organizations and capabilities can share the
same instruction set
Machine Organization
Number of functional blocks
Interconnect pattern
Transparent to software (affects performance,
though!)
Microcode (a layer of hardware-level
instructions and/or data structures involved in
the implementation of higher-level machine code
instructions; it usually does not reside in the main
memory, but in a special high-speed memory)
Indexing capability
to reduce bookkeeping instructions (modification for the next
iteration; otherwise one has to remember a lot of variables/traces)
Complex instructions
to reduce instruction fetches
Compact instructions
implicit address bits for operands, to reduce instruction fetches
Processor State
The information held in the processor at the end
of an instruction to provide the processing
context for the next instruction
The programmer-visible state of the processor (and
memory) plays a central role in computer
organization, for both hardware and software:
Software must make efficient use of it
If the processing of an instruction can be interrupted
then the hardware must save and restore the state in
a transparent manner
[Figure: the classes of instruction set architectures, showing where the ALU takes its operands: stack, accumulator, registers, memory]
Stack
Accumulator
Register-memory
Register-register / load-store
C=A+B
Stack:
Push A
Push B
Add
Pop C
Accumulator:
Load A
Add B
Store C
Register (reg-memory):
Load R1, A
Add R3, R1, B
Store R3, C
Register (load-store):
Load R1, A
Load R2, B
Add R3, R1, R2
Store R3, C
Performance
What is performance?
Is it response time or execution time?
Response time is the time a system or
functional unit takes to react to a given input
Execution time is the time to execute the
program, between its start and finish
They are identical: the time to complete one task
Measured in seconds, milliseconds, microseconds,
nanoseconds, picoseconds
Importance of measuring
performance
Real systems:
Within the free market / during procurement: to
decide which system to purchase
For system maintenance and capacity planning:
to predict and plan when an upgrade is
needed, either for parts of a system or for the
entire system
For the applications (i.e. tuning): to be able to
find bottlenecks/hotspots in the application and
record them / take actions
Importance of measuring
performance
During dynamic compilation - to perform
heavy optimizations on application
hotspots
As feedback for architects: to find out
what the performance bottlenecks in a
particular design are
Paper design:
To be able to compare design alternatives
Development of Quantitative
Performance Measures
Initially, designers set performance goals:
ENIAC was to be 1000 times faster than the
Harvard Mark-I, and the IBM Stretch (7030) was
to be 100 times faster than the fastest machine
in existence
It wasn't clear how this performance was
measured
The original measure of performance was the time
to perform an operation (an addition, for example,
as most instructions had the same execution time)
Development of Quantitative
Performance Measures
Execution times of instructions in a machine
became more diverse, hence the time for one
operation was no longer good for comparisons
An instruction mix weights instructions according
to their relative frequency across many programs
(an early popular example: the Gibson mix,
1970)
Average instruction execution time =
Σ (instruction time × weight in the mix),
summed over the instruction classes
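A minimal sketch of this weighted average over a hypothetical instruction mix (the classes, times and weights below are invented for illustration):

# (instruction class, execution time in clock cycles, weight in the mix)
mix = [
    ("load/store", 5, 0.30),
    ("ALU",        1, 0.50),
    ("branch",     2, 0.20),
]

avg_time = sum(cycles * weight for _, cycles, weight in mix)
print(avg_time)  # 2.4 cycles: the average CPI for this mix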
Development of Quantitative
Performance Measures
Measured in clock cycles, the average
instruction execution time is the same as
average CPI (clock cycles per instruction)
CPI = CPU clock cycles for a program / Instruction count
CPU time = Instruction count × CPI × Clock cycle time
Example
Comp A: clock cycle time = 250 ps, CPI for
Program X = 2.0
Comp B: clock cycle time = 500 ps, CPI for
Program X = 1.2; number of instructions = I
Which is faster (same ISA)? By how much?
CPU clock cycles_A = I × 2.0; CPU time_A = cycles ×
cycle time = I × 2.0 × 250 ps = 500 × I ps
CPU clock cycles_B = I × 1.2; CPU time_B = cycles ×
cycle time = I × 1.2 × 500 ps = 600 × I ps
Performance_A / Performance_B = Execution time_B / Execution time_A
= 600 / 500 = 1.2, so computer A is 1.2 times faster than
computer B for this program
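The same comparison as a runnable check:

def cpu_time(instruction_count, cpi, cycle_time_ps):
    # CPU time = instruction count x CPI x clock cycle time
    return instruction_count * cpi * cycle_time_ps

I = 1.0                          # the instruction count cancels in the ratio
time_a = cpu_time(I, 2.0, 250)   # 500 x I ps
time_b = cpu_time(I, 1.2, 500)   # 600 x I ps
print(time_b / time_a)           # 1.2: computer A is 1.2 times faster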
Development of Quantitative
Performance Measures
CPUs became more complex and sophisticated;
they relied on pipelining and memory hierarchies
As a CISC machine requires fewer instructions,
one with a lower MIPS rating might be equivalent
to a RISC one with a higher rating
There was no longer a single execution time per
instruction, hence MIPS could not be calculated
from the instruction mix
This was how benchmarking emerged: using
kernels and synthetic programs for measuring
performance
Development of Quantitative
Performance Measures
Relative MIPS for a machine M was defined
relative to some reference machine as:
MIPS_M = (Performance_M / Performance_reference) × MIPS_reference
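As an illustration, a sketch with hypothetical numbers (the VAX-11/780 was the customary 1-MIPS reference machine):

def relative_mips(time_reference, time_m, mips_reference=1.0):
    # Performance is the inverse of execution time.
    return (time_reference / time_m) * mips_reference

# Hypothetical: a program takes 10 s on the 1-MIPS reference and 2 s on M.
print(relative_mips(10.0, 2.0))  # 5.0 relative MIPS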
Development of Quantitative
Performance Measures
In the 70s and 80s, the growth of the
supercomputer industry and the use of
floating-point-intensive programs led to the
introduction of MFLOPS (millions of
floating-point operations per second), the
inverse of execution time for a benchmark,
so marketing people started quoting peak
MFLOPS
Development of Quantitative
Performance Measures
During the late 1980s, SPEC (System
Performance and Evaluation Cooperative) was
founded to improve the state of benchmarking
and to provide a better basis for comparisons
It initially focused on workstations and servers in
the UNIX marketplace
The first release of SPEC benchmarks (called
SPEC89) was a substantial improvement in the use
of more realistic benchmarks. SPEC2006 still
dominates processor benchmarks almost two
decades later
Performance metrics
MIPS (millions of instructions per second): only
meaningful when running the same
executable code on the same inputs
MFLOPS (millions of floating-point operations
per second): problems: how many FLOPs in a
divide? A square root? A sine? (1 normalized FLOP for
add, sub, mul; 4 for div and sqrt; 8 for exp and sine.
For example, a kernel with one add, one divide, and one
sine would be credited with 13 normalized floating-point
operations; see the sketch below)
Some inefficient algorithms have high MFLOPS,
hence it is only meaningful for the same algorithm
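A sketch of the normalized-operation counting described above:

# Normalized FP operation weights from the slide.
weights = {"add": 1, "sub": 1, "mul": 1, "div": 4, "sqrt": 4, "exp": 8, "sin": 8}

kernel = ["add", "div", "sin"]            # the example kernel above
print(sum(weights[op] for op in kernel))  # 13 normalized FP operations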
Performance metrics
TPS (transactions per second): uses TP (transaction-processing)
benchmarks (TPS also stands for Transaction
Processing System)
Benchmarks
The best way of measuring performance is with real
applications as benchmarks (e.g. a compiler)
Benchmark suites, collections of benchmark
applications, are a popular measure of the performance of
processors
The goal of a benchmark suite is to characterize the
relative performance of two computers
One of the most successful attempts at standardized
benchmark applications is SPEC (Standard
Performance Evaluation Corporation)
The evolution of the computer industry led to the need
for different benchmark suites, so nowadays there are
SPEC benchmarks covering different application classes
(only 3 integer programs and 3 floating-point programs
survived three or more generations)
Amdahl's Law
Describes the performance gain that can
be obtained by improving some portion of
a computer; it defines speedup
It states that the performance
improvement to be gained from using
some faster mode of execution is limited
by the fraction of the time the faster mode
can be used
Speedup
Speedup = Execution time_old / Execution time_new
Amdahl's Law
Speedup_overall = Execution time_old / Execution time_new
= 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
Example: we introduce a 10 times faster processor for Web serving
The processor is busy with computation 40% of the time,
and waiting for I/O 60% of the time
What is the overall speedup? (worked below)
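Working the example with the formula above:

def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Overall speedup from Amdahl's Law.
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 40% of time is computation (enhanced 10x), 60% is waiting for I/O (unaffected).
print(amdahl_speedup(0.4, 10))  # 1.5625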
Amdahl's Law
The goal is to spend resources
proportional to where time is spent
Useful for comparing the overall system
performance of two alternatives and for
comparing two processor design
alternatives
Task
A common transformation required in graphics
processors is the square root. Implementations of FP
square root (FPSQR) vary significantly in performance.
Suppose FPSQR is responsible for 20% of the execution
time of a critical graphics benchmark. One proposal is to
enhance the FPSQR hardware and speed this operation up
by a factor of 10. The alternative is just to make all FP
instructions run faster; FP instructions are responsible for
half (50%) of the execution time. By how much do the FP
instructions have to be accelerated to achieve the same
performance as achieved by inserting the specialized
hardware?
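A sketch of the solution, assuming FP instructions account for half (50%) of execution time, as in the well-known Hennessy & Patterson version of this task:

def amdahl_speedup(fraction, speedup):
    return 1.0 / ((1.0 - fraction) + fraction / speedup)

# Option 1: speed up FPSQR (20% of execution time) by a factor of 10.
target = amdahl_speedup(0.20, 10)   # ~1.2195

# Option 2: speed up all FP instructions (50% of execution time) by s.
# Solve 1 / ((1 - 0.5) + 0.5 / s) = target for s:
s = 0.5 / (1.0 / target - 0.5)
print(s)  # ~1.5625: all FP instructions must get ~1.6x faster to match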