CPU organization
Register Organization
User-visible Registers
General categories based on function
– General purpose
  » Can be assigned a variety of functions
  » Ideally, they are defined orthogonally to the operations within the instructions
– Data
  » These registers only hold data
– Address
  » These registers only hold address information
  » Examples: general purpose address registers, segment pointers, stack pointers, index registers
– Condition codes
  » Visible to the user but values set by the CPU as the result of performing operations
  » Example code bits: zero, positive, overflow
  » Bit values are used as the basis for conditional jump instructions
Design trade off between general purpose and specialized registers
– General purpose registers maximize flexibility in instruction design
– Special purpose registers permit implicit specification in instructions -- reduces register field size in an instruction
– No clear “best” design approach
How many registers are enough
– More registers permit more operands to be held within the CPU -- reducing memory bandwidth requirements to some extent
– More registers cause an increase in the field sizes needed to specify registers in an instruction word (see the sketch after this list)
– Locality of reference may not support too many registers
– Most machines use 8-32 registers (does not include RISC machines with register windowing -- will get to that later!)
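To see how the register-field cost scales, a minimal Python sketch; the three-operand instruction format is an illustrative assumption, not from the slides:

    import math

    def register_field_bits(num_registers: int) -> int:
        """Bits needed in an instruction to name one register."""
        return math.ceil(math.log2(num_registers))

    # A three-operand ALU instruction (dest, src1, src2) spends
    # 3 * field bits on register specifiers alone.
    for n in (8, 16, 32, 64):
        bits = register_field_bits(n)
        print(f"{n:3d} registers -> {bits} bits/field, "
              f"{3 * bits} bits for a 3-operand instruction")

Doubling the register count costs one more bit per operand field, which is why most machines settle in the 8-32 range.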
Control and status registers
Figure 11.3 Example register organizations
Figure 11.4 Extensions to 32-bit microprocessors
Instruction Cycle
Instruction pipelining
The instruction cycle state diagram clearly shows the sequence of operations that take place in order to execute a single instruction
A “good” design goal of any system is to have all of its components performing useful work all of the time -- high efficiency
Following the instruction cycle in a sequential fashion does not permit this level of efficiency
Compare the instruction cycle to an automobile assembly line
– Perform all tasks concurrently, but on different (sequential) instructions
– The result is temporal parallelism
– Result is the instruction pipeline
An ideal pipeline divides a task into k independent sequential subtasks
– Each subtask requires 1 time unit to complete
– The task itself then requires k time units to complete
For n iterations of the task, the execution times will be:
– With no pipelining: nk time units
– With pipelining: k + (n-1) time units
Speedup of a k-stage pipeline is thus
S = nk / [k + (n-1)] ==> k (for large n)
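As a quick numeric check of the speedup formula, a minimal Python sketch:

    def pipeline_speedup(k: int, n: int) -> float:
        """Speedup of a k-stage pipeline over sequential execution
        of n tasks: S = nk / (k + (n - 1))."""
        return (n * k) / (k + (n - 1))

    # Speedup approaches k as n grows large.
    for n in (1, 9, 100, 10_000):
        print(f"k=6, n={n:6d}: S = {pipeline_speedup(6, n):.3f}")

With k = 6 and n = 9 this gives 54/14 ≈ 3.86, matching the 14-versus-54 time units of Figure 11.12 below.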
First step: instruction (pre)fetch
– Divide the instruction cycle into two (equal??) “parts”
  » I-fetch
  » Everything else (execution phase)
– While one instruction is in “execution,” overlap the prefetching of the next instruction
  » Assumes the memory bus will be idle at some point during the execution phase
  » Reduces the time to fetch an instruction to zero (ideal situation)
– Problems
  » The two parts are not equal in size
  » Branching can negate the prefetching
    As a result of the branch instruction, you have prefetched the “wrong” instruction
– Alternative approaches
  » Finer division of the instruction cycle: use a 6-stage pipeline (a schedule sketch follows the figures below)
    Instruction fetch
    Decode opcode
    Calculate operand address(es)
    Fetch operands
    Perform execution
    Write (store) result
  » Use multiple execution “functional units” to parallelize the actual execution phase of several instructions
  » Use branching strategies to minimize branch impact
Figure 11.12 Pipelined execution of 9 instructions in 14 time units vs. 54
Figure 11.13 Impact of a branch after instruction 3 (to instruction 15)
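A minimal sketch of the ideal schedule behind Figure 11.12, using the six stages listed earlier; the stage abbreviations are assumed, and no stalls or hazards are modeled:

    STAGES = ["FI", "DI", "CO", "FO", "EX", "WO"]  # the 6 stages above

    def schedule(n_instructions: int, k: int = len(STAGES)) -> int:
        """Print the ideal pipeline timetable; return total time units."""
        total = k + (n_instructions - 1)
        for t in range(1, total + 1):
            # Instruction i occupies stage index (t - i) at time unit t.
            row = [STAGES[t - i] if 0 <= t - i < k else "  "
                   for i in range(1, n_instructions + 1)]
            print(f"t={t:2d}: " + " ".join(row))
        return total

    print("total time units:", schedule(9))   # 6 + (9 - 1) = 14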
Pipeline Limitations
Branching
– For the pipeline to have the desired operational speedup, we must “feed it” with long strings of instructions
  » However, 15-20% of instructions in an assembly-level stream are (conditional) branches
  » Of these, 60-70% take the branch to a target address
  » Impact of the branch is that the pipeline never really operates at its full capacity -- limiting the performance improvement that is derived from the pipeline
– The average time to complete a pipelined instruction becomes 1 + pe·b time units, where b is the branch penalty and pe = pb·pt is the effective probability of a taken branch (pb = fraction of instructions that are branches, pt = fraction of branches taken) [Lil88]
– A number of techniques can be used to minimize the impact of the branch instruction (the branch penalty)
Figure: Loss of performance resulting from conditional branches [Lil88]
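Plugging in representative numbers from the bullets above (20% branches, 65% of them taken); the 5-stage penalty b is an illustrative assumption:

    def avg_time_per_instruction(p_b: float, p_t: float, b: float) -> float:
        """Average time units per pipelined instruction with branch
        stalls: 1 + pe * b, where pe = p_b * p_t."""
        return 1.0 + (p_b * p_t) * b

    # p_b = 0.20 and p_t = 0.65 follow the 15-20% / 60-70% figures above;
    # a branch penalty of b = 5 stages is assumed for illustration.
    print(avg_time_per_instruction(p_b=0.20, p_t=0.65, b=5))  # 1.65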
– Multiple streams
  » Replicate the initial portions of the pipeline and fetch both possible next instructions
  » Increases chance of memory contention
  » Must support multiple streams for each instruction in the pipeline
– Prefetch branch target
  » When the branch instruction is decoded, begin to fetch the branch target instruction and place it in a second prefetch buffer
  » If the branch is not taken, the sequential instructions are already in the pipe -- no loss of performance
  » If the branch is taken, the next instruction has been prefetched and results in minimal branch penalty (don’t have to incur a memory read operation at the end of the branch to fetch the instruction)
– Look ahead, look behind buffer (loop buffer)
  » Many conditional branch operations are used for loop control
  » Expand the prefetch buffer so as to buffer the last few instructions executed in addition to the ones that are waiting to be executed
  » If the buffer is big enough, an entire loop can be held in it -- reducing the branch penalty
Figure: loop buffer holding previous (already executed) and pending instructions around the PC
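A minimal sketch of the loop-buffer hit test, assuming the buffer spans a small window of instruction addresses behind and ahead of the PC; the window sizes are illustrative:

    class LoopBuffer:
        """Holds the last `behind` executed instructions plus the next
        `ahead` prefetched ones; a branch whose target falls in this
        window can be served without a memory read."""
        def __init__(self, behind: int = 4, ahead: int = 4):
            self.behind = behind
            self.ahead = ahead

        def hit(self, pc: int, target: int) -> bool:
            return pc - self.behind <= target <= pc + self.ahead

    buf = LoopBuffer()
    print(buf.hit(pc=100, target=97))  # True: short backward loop branch
    print(buf.hit(pc=100, target=60))  # False: must fetch from memory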
– Branch prediction
  » Make a good guess as to which instruction will be executed next and start that one down the pipeline
  » If the guess turns out to be right, no loss of performance in the pipeline
  » If the guess was wrong, empty the pipeline and restart with the correct instruction -- suffering the full branch penalty
  » Static guesses: make the guess without considering the runtime history of the program
    Branch never taken
    Branch always taken
    Predict based on the opcode
  » Dynamic guesses: track the history of conditional branches in the program
    Taken / not taken switch
    History table
Figure 11.16 Branch prediction using 2 history bits
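A minimal sketch of the two-history-bit scheme of Figure 11.16 as a saturating counter; the particular state encoding is an assumption:

    class TwoBitPredictor:
        """2-bit saturating counter: states 0,1 predict not-taken; 2,3
        predict taken. Two consecutive mispredictions are needed to
        flip the prediction -- one odd outcome (e.g. a loop exit)
        does not retrain the predictor."""
        def __init__(self, state: int = 1):
            self.state = state  # 0..3

        def predict(self) -> bool:
            return self.state >= 2

        def update(self, taken: bool) -> None:
            self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

    p = TwoBitPredictor()
    for taken in [True, True, True, False, True, True]:  # typical loop branch
        guess = p.predict()
        p.update(taken)
        print(f"predicted {'T' if guess else 'N'}, actual {'T' if taken else 'N'}")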
Superscalar and Superpipelined Processors
Superscalar
– Implement the CPU such that more than one instruction can be performed (completed) at a time
– Involves replication of some or all parts of the CPU/ALU
– Examples:
  » Fetch multiple instructions at the same time
  » Decode multiple instructions at the same time
  » Perform add and multiply at the same time
  » Perform load/stores while performing ALU operation
– Degree of parallelism, and hence the speedup of the machine, goes up as more instructions are executed in parallel
Superscalar design limitations
Data dependencies: must ensure computed results are the same as would be computed on a strictly sequential machine
– Two instructions cannot be executed in parallel if the (data) output of one is the input of the other or if they both write to the same output location
– Consider (the hazards here are classified in the sketch after this list):
  S1: A = B + C
  S2: D = A + 1
  S3: B = E + F
  S4: A = E + 3
Resource dependencies
– In the above sequence of instructions, the adder unit gets a real workout!
– Parallelism is limited by the number of adders in the ALU
Instruction issue policy: in what order are instructions issued to the execution unit and in what order do they finish?
– In-order issue, in-order completion
  » Simplest method, but severely limits performance
  » Strict ordering of instructions: data and procedural dependencies or resource conflicts delay all subsequent instructions
  » “Slow” execution of some instructions delays all subsequent instructions
– In-order issue, out-of-order completion
  » Any number of instructions can be executed at a time
  » Instruction issue is still limited by resource conflicts or data and procedural dependencies
  » Output dependencies resulting from out-of-order completion must be resolved
  » “Instruction” interrupts can be tricky
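To make the dependency classes concrete, a minimal sketch that classifies the hazards in the S1-S4 sequence above; representing each instruction as a destination plus a set of sources is an assumption:

    # Each instruction: (destination, set of source operands), per S1-S4.
    prog = [("A", {"B", "C"}),   # S1: A = B + C
            ("D", {"A"}),        # S2: D = A + 1
            ("B", {"E", "F"}),   # S3: B = E + F
            ("A", {"E"})]        # S4: A = E + 3

    for i, (di, si) in enumerate(prog):
        for j in range(i + 1, len(prog)):
            dj, sj = prog[j]
            if di in sj:
                print(f"S{j+1} reads {di} written by S{i+1} (true dependence)")
            if di == dj:
                print(f"S{j+1} rewrites {di} after S{i+1} (output dependence)")
            if dj in si:
                print(f"S{j+1} overwrites {dj} read by S{i+1} (antidependence)")

Running it flags S1→S2 as a true dependence, S1→S4 as an output dependence, and S1→S3 and S2→S4 as antidependences -- exactly the cases the issue policies above must respect.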
– Out-of-order issue, out-of-order completion
  » Decode and execute stages are decoupled via an instruction buffer “window”
  » Decoded instructions are “stored” in the window awaiting execution
  » Functional units will take instructions from the window in an attempt to stay busy
    This can result in out-of-order execution
    S1: A = B + C
    S2: D = E + 1
    S3: G = E + F
    S4: H = E * 3
  » “Antidependence” class of data dependencies must be dealt with
Register renaming
– Output dependencies and antidependencies are eliminated by the use of a register “pool” as follows
  » For each instruction that writes to a register X, a “new” register X is instantiated
  » Multiple “register Xs” can co-exist
– Consider
  S1: R3 = R3 + R5
  S2: R4 = R3 + 1
  S3: R3 = R5 + 1
  S4: R7 = R3 + R4
  becomes
  S1: R3b = R3a + R5a
  S2: R4b = R3b + 1
  S3: R3c = R5a + 1
  S4: R7b = R3c + R4b
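A minimal sketch of a renaming pass that reproduces the R3a/R3b/R3c rewriting above; the suffix-letter naming mirrors the slide, while the three-operand tuple format is an assumption:

    import string

    def rename(prog):
        """prog: list of (dest, src1, src2) operand names. Every write
        to register X allocates a fresh instance (Xa, Xb, ...); reads
        refer to the most recent instance."""
        version = {}                          # register -> suffix index

        def cur(r):                           # live instance of operand r
            if not r.startswith("R"):         # immediate, not a register
                return r
            return r + string.ascii_lowercase[version.setdefault(r, 0)]

        renamed = []
        for dest, s1, s2 in prog:
            s1n, s2n = cur(s1), cur(s2)       # read before the new write
            version[dest] = version.get(dest, 0) + 1   # fresh instance
            renamed.append((cur(dest), s1n, s2n))
        return renamed

    prog = [("R3", "R3", "R5"),   # S1: R3 = R3 + R5
            ("R4", "R3", "1"),    # S2: R4 = R3 + 1
            ("R3", "R5", "1"),    # S3: R3 = R5 + 1
            ("R7", "R3", "R4")]   # S4: R7 = R3 + R4
    for i, (d, a, b) in enumerate(rename(prog), 1):
        print(f"S{i}: {d} = {a} + {b}")

The output matches the renamed sequence above; with R3b, R3c, etc. as distinct physical registers, S3 no longer conflicts with S1 or S2 and can issue out of order.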
Summary