
Advanced Computer Architecture

Flynn’s classification

• In 1966, M. J. Flynn proposed a classification of computer architectures based on the number of
instruction streams and data streams (Flynn's Taxonomy).
• Flynn uses the stream concept to describe a machine's structure.
• A stream simply means a sequence of items (data or instructions).
• The classification of computer architectures based on the number of instruction and data
streams is called Flynn's classification or Flynn's Taxonomy.

Flynn’s classification
1. Single-instruction, single-data (SISD) systems
An SISD computing system is a uniprocessor machine which is capable of executing a single
instruction, operating on a single data stream. In SISD, machine instructions are processed in a
sequential manner and computers adopting this model are popularly called sequential
computers.

• The speed of the processing element in the SISD model is limited by the rate at
which the computer can transfer information internally.
• Example: all conventional uniprocessor systems, from PCs to mainframes, are SISD.

2. Single-instruction, multiple-data (SIMD) systems –


An SIMD system is a multiprocessor machine capable of executing the same instruction on all
the CPUs but operating on different data streams.

• This group is dedicated to array processing machines. Sometimes, vector processors can
also be seen as a part of this group.
• Example: STARAN, Cray's vector processing machine.

3. Multiple-instruction, single-data (MISD) systems –


An MISD computing system is a multiprocessor machine capable of executing different
instructions on different PEs but all of them operating on the same data stream.

• In practice, this kind of organization has rarely been used; few, if any, commercial machines
have been built on this model.
• Example: systolic array processors are sometimes cited as an approximation of the MISD model.

4. Multiple-instruction, multiple-data (MIMD) systems –


An MIMD system is a multiprocessor machine which is capable of executing multiple
instructions on multiple data sets. Each PE in the MIMD model has separate instruction and
data streams.
Example: All distributed systems (networks of processors or networks of computers) are MIMD.

Difference between Multi-processor and Multi-computer

Multi-processor

• A multiprocessor system is simply a computer that has more than one processor on its
motherboard.
• Multiprocessors have a single physical address space (shared memory) that is shared by all
the processors, and every processor has access to it.
• Generally, in a multiprocessor system the processors communicate with each other through
shared memory, which allows communication through variables stored in
the shared address space in order to cooperatively complete a task.
• Compared with a multicomputer, a multiprocessor may run slower, because all processors
reside in one computer and contend for the shared memory and bus.
• Synchronization between tasks is the system's responsibility.
• The concept of cache coherency does apply.
• The three most common shared-memory multiprocessor models are:
i. Uniform Memory Access (UMA)
ii. Non-Uniform Memory Access (NUMA)
iii. Cache-Only Memory Architecture (COMA)
Multi-computer

• A multicomputer is a computer made up of several computers. The term generally refers to an
architecture in which each processor has its own memory, called local memory, rather than
multiple processors sharing one memory. Each processor, along with its local memory and
input/output port, forms an individual processing unit (node). That is, each processor can
compute in a self-sufficient manner using the data stored in its local memory.
• In a multicomputer, a processor has direct access only to its local memory and not to the
remote memories.
• In a multicomputer system, if a processor has to access or modify a piece of data that does
not exist in its local memory, a message-passing mechanism is used. In a message-passing
mechanism, a processor can send (or receive) a block of information to (or from) every other
processor via communication channels. The communication channels are physical (electrical)
connections between processors and are arranged according to an interconnection network
topology.
• Compared with a multiprocessor, a multicomputer may run faster, because the nodes do not
contend for a single shared memory.
• Synchronization between tasks is the programmer's responsibility.
• The concept of cache coherency does not apply.

Multiprocessor Models

i. Uniform Memory Access (UMA)

In this model, all the processors share the physical memory uniformly. All the processors
have equal access time to all the memory words. Each processor may have a private cache
memory. Tightly coupled systems use a common bus, crossbar, or multistage network to
connect processors, peripherals, and memories.

ii. Non-uniform Memory Access (NUMA)

• In Non-Uniform Memory Access (NUMA), the shared memory is physically distributed among
all the processors as local memories, but each of these memories is still accessible by every processor.
• Memory access time depends on where the data is located: access is fastest from the locally
connected processor and slower from remote processors, because the interconnection network
adds delay.
iii. Cache Only Memory Architecture (COMA)

• The COMA model is a special case of the NUMA model. Here, all the distributed main
memories are converted into cache memories.
• In the COMA model, processors only have cache memories; the caches, taken together,
form a global address space.
• In COMA, data have no specific "permanent" home location (no fixed memory address) where
they stay and from which they are read (copied into local caches) or to which modifications are
written back; data simply migrate among the caches that are using them.

iv. No Remote Memory Access (NORMA)

In a NORMA architecture there is no global address space, and memory is not globally
accessible by the processors. Remote memory modules can be accessed only indirectly, by
sending messages through the interconnection network to other processors, which in turn
may deliver the desired data in a reply message. The entire storage configuration is
partitioned statically among the processors.

MULTIVECTOR AND SIMD COMPUTERS

VECTOR SUPERCOMPUTER-

• A vector operand contains an ordered set of n elements, where n is called the length of
the vector. Each element in a vector is a scalar quantity, which may be a floating-point
number, an integer, a logical value or a character. A vector computer consists of a scalar
processor and a vector unit, which can be thought of as an independent functional
unit capable of efficient vector operations. Vector computers have hardware to
perform the vector operations efficiently.
• A vector computer is often built on top of a scalar processor, as shown in the following
figure. The vector processor is attached to the scalar processor as an optional feature.
• First, the host computer loads the program and data into main memory.
• Then the scalar control unit decodes all the instructions.
• If the decoded instructions are scalar operations or program control operations, the scalar
processor executes them using the scalar functional pipelines.
• If the decoded instructions are vector operations, they are sent to the vector control unit,
which supervises the flow of vector data between the main memory and the vector
functional pipelines. The vector data flow is synchronized by the control unit. A number of
vector functional pipelines may be built into a vector processor.

SIMD COMPUTER-

In SIMD computers, N processing elements are connected to a single control unit, and each
processing element has its own individual memory unit. All the processing elements are connected
by an interconnection network.

An operational model of SIMD computers is presented in the following figure.


Data Dependencies

Instructions often depend on each other in such a way that a particular instruction cannot be
executed until a preceding instruction, or even two or three preceding instructions, have been
executed.

Resource Dependencies

An instruction is resource-dependent on a previously issued instruction if it requires a hardware
resource which is still being used by that previously issued instruction.
Data and control dependencies are based on the independence of the work to be done, whereas
resource dependence is concerned with conflicts in using shared resources, such as registers,
integer and floating-point ALUs, etc. ALU conflicts are called ALU dependence. Memory
(storage) conflicts are called storage dependence.

Control Dependencies

This refers to the situation where the order of execution of statements cannot be
determined before run time, for example in conditional statements, where the flow of control
depends on a run-time result. The different paths taken after a conditional branch may depend on
the data, hence we need to eliminate this dependence among the instructions.

This dependence also exists between operations performed in successive iterations of looping
procedure. Control dependence often prohibits parallelism from being exploited.

Control-dependent example:

for (i = 1; i < n; i++) {
    if (a[i-1] < 0)
        a[i] = 1;
}
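
In the loop above, whether a[i] is written in iteration i depends on a[i-1], which may itself have
been written in the previous iteration, so successive iterations cannot safely be executed in
parallel. For contrast, a minimal sketch of a loop with no loop-carried control dependence (the
array b is illustrative and not part of the original example):

for (i = 0; i < n; i++) {
    /* each iteration tests only b[i], which no iteration writes, */
    /* so all iterations are independent and can run in parallel  */
    if (b[i] < 0)
        a[i] = 1;
}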
Control dependence also prevents parallelism from being exploited. Compilers are used to eliminate
this control dependence and exploit the parallelism.
Hardware and software parallelism

Hardware parallelism

• Hardware parallelism refers to the type of parallelism defined by the machine
architecture and hardware multiplicity, i.e. the functional parallelism times the processor
parallelism.
• Hardware parallelism is a function of cost and performance tradeoffs. It displays the
resource utilization patterns of simultaneously executable operations. It can also
indicate the peak performance of the processors.
• It can be characterized by the number of instructions that can be issued per machine
cycle. If a processor issues k instructions per machine cycle, it is called a k-issue
processor.
• In a modern processor, two or more instructions can be issued per machine cycle.
• A conventional processor takes one or more machine cycles to issue a single
instruction. These types of processors are called one-issue machines, with a single
instruction pipeline in the processor.
• A multiprocessor system built with n k-issue processors should be able to handle a
maximum of nk instruction threads simultaneously; for example, a system of four 2-issue
processors can issue at most 4 × 2 = 8 instructions per machine cycle.
• E.g.: the number of processors, i.e. the hardware-supported threads of execution.

Software parallelism

• It is defined by the control and data dependence of programs.
• The degree of parallelism is revealed in the program profile or in the program's flow
graph.
• Software parallelism is a function of algorithm, programming style, and compiler
optimization.
• The program flow graph displays the patterns of simultaneously executable operations.
• Parallelism in a program varies during the execution period. It limits the sustained
performance of the processor.
• Eg: Degree of Parallelism (DOP) or number of parallel tasks at selected task or grain size.

Grain size and latency

Grain size : Grain size or granularity is a measure of the amount of computation involved in a
software process. The simplest measure is to count the number of instructions in a
grain (program segment). Grain size determines the basic program segment chosen for
parallel processing. Grain sizes are commonly described as fine, medium, or coarse,
depending on the processing levels involved.
Latency
Latency is the time required for communication between different subsystems in a
computer. Memory latency, for example, is the time required by a processor to access
memory. Synchronization latency is the time required for two processes to synchronize
their execution.
Various level of parallelism
1. Instruction-level parallelism (ILP)

– ILP means how many instructions from the same instruction stream can be executed
concurrently.
– It is an example of fine-grain parallelism.
– It occurs at the instruction or statement level.
– Grain size at this level is 20 instructions or less.
– Compilers can usually do a reasonable job of finding this parallelism.

2. Loop Level parallelism (Fine grain parallelism)

Loop-level parallelism is a form of parallelism that is concerned with extracting parallel tasks
from loops.

– This corresponds to iterative loop operations.
– Typically, grain size at this level is 500 instructions or less per iteration.
– Loop-level parallelism is often the most optimized program construct to execute on a
parallel or vector computer; a short partitioning sketch follows this list.
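
As a minimal sketch of how loop-level parallelism can be exploited, the iterations of a loop with
independent iterations can be divided among processors. The partitioning below is illustrative;
nproc and my_id are assumed to be supplied by the parallel runtime, and the arrays are assumed
not to overlap:

/* block-partition the n independent iterations across nproc processors */
int chunk = (n + nproc - 1) / nproc;          /* iterations per processor           */
int start = my_id * chunk;                    /* first iteration for this processor */
int end   = (start + chunk < n) ? start + chunk : n;
for (int i = start; i < end; i++)
    c[i] = a[i] + b[i];                       /* each processor handles its own block */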

3. Procedure level parallelism (Medium grain parallelism)

– This level corresponds to medium-grain parallelism at the task, procedural, subroutine,
and coroutine levels.
– A typical grain at this level contains less than 2000 instructions.
– Detection of parallelism at this level is much more difficult than at the finer-grain levels.
– Multitasking also belongs in this category. Significant efforts by programmers may be
needed to restructure a program at this level, and some compiler assistance is also
needed.

4. Subprogram Level Parallelism (coarse grain parallelism)


– This corresponds to the level of job steps and related subprograms.
– The grain size may typically contain tens or hundreds of thousands of instructions.
– Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level.
– Parallelism at this level has been exploited by algorithm designers or programmers,
rather than by compilers.
5. Job level parallelism (coarse grain parallelism)
– This corresponds to the parallel execution of essentially independent jobs (programs) on
a parallel computer.
– The grain size can be as high as millions of instructions in a single program.
– Job-level parallelism is handled by the program loader and by the operating system in
general. Time-sharing or space-sharing multiprocessors exploit this level of parallelism.

Control Flow Computers


– Conventional computers are based on control driven mechanism by which the order of
program execution is explicitly stated in the user programs.
– Control flow machines use shared memory for instructions and data. Since variables
are updated by many instructions, there may be side effects on other instructions.
These side effects frequently prevent parallel processing. Single-processor systems are
inherently sequential.
– Advantages & Disadvantages: Control flow machines give complete control, but are less
efficient than other approaches.

Data Flow Computers

– Dataflow computers are based on a data driven mechanism which allows the execution
of any instruction to be driven by data (operand) availability.
– Instructions in dataflow machines are unordered and can be executed as soon as their
operands are available; data is held in the instructions themselves. Data tokens are
passed from an instruction to its dependents to trigger execution.
– Advantages & Disadvantages: Data flow (eager evaluation) machines have high
potential for parallelism and throughput and freedom from side effects, but have high
control overhead, lose time waiting for unneeded arguments, and difficulty in
manipulating data structures.

Reduction Flow Computers

– Reduction computers are based on a demand-driven mechanism which initiates an
operation based on the demand for its results by other computations.
– Advantages & Disadvantages: Reduction (lazy evaluation) machines have high
parallelism potential, easy manipulation of data structures, and only execute required
instructions. However, they do not share objects with changing local state, and they require
time to propagate tokens.

Static interconnection networks

Static networks use direct links which are fixed once built. This type of network is more suitable
for building computers where the communication patterns are predictable or implementable
with static connections.
Dynamic interconnection networks
Dynamic interconnection networks between processors enable changing (reconfiguring) of the
connection structure in a system. It can be done before or during parallel program execution.

Bus System

A bus system is essentially a collection of wires and connectors for data transactions among
processors, memory modules, and peripheral devices attached to the bus. The bus is used for
only one transaction at a time between source and destination.
Crossbar switches
In a crossbar switch, there is a dedicated path from each processor to every other unit. Thus, if
there are n inputs and m outputs, we need n×m crosspoint switches to realize the crossbar (for
example, a 16×16 crossbar needs 256 crosspoint switches).

A crossbar switch is a circuit that enables many simultaneous interconnections between elements
of a parallel system. A crossbar switch has a number of input and output data pins and a
number of control pins. The crossbar switch is the most expensive network to build, due to the fact
that its hardware complexity increases as n^2. However, the crossbar has the highest bandwidth
and routing capability. For a small network size, it is the desired choice.

Multistage interconnection networks

A multistage interconnect network is formed by cascading multiple single stage switches. The
switches can then use their own routing algorithm or controlled by a centralized router, to form
a completely interconnected network.

MINs have been used in both MIMD and SIMD computers.

Multistage networks provide a compromise between the two extremes. The major advantage
of MINs lies in their scalability with modular construction. However, the latency increases with
log n, the number of stages in the network. Also, costs due to increased wiring and switching
complexity are another constraint.
Combining Network
A combining network is a special case of a multistage network, used for automatically resolving
conflicts through the network by combining requests.
The combining network was developed for the NYU Ultracomputer.
The advantage of using a combining network to implement the Fetch&Add operation is
achieved at a significant increase in network cost.

Multiport Memory
– Because building a crossbar network into a large system is cost prohibitive, some
mainframe multiprocessors used a multiport memory organization. The idea is to move
all crosspoint arbitration and switching functions associated with each memory module
into the memory controller.

– The multiport memory organization is a compromise solution between a low-cost, low-
performance bus system and a high-cost, high-bandwidth crossbar system. The
contention bus is time-shared by all processors and device modules attached. The
multiport memory must resolve conflicts among processors.
– Disadvantages: A multiport memory multiprocessor is not scalable because once the
ports are fixed, no more processors can be added without redesigning the memory
controller.
– There is a need for a large number of interconnection cables and connectors when the
configuration becomes large.
Unit-2
Difference between RISC and CISC architecture.
RISC vs. CISC

1. RISC stands for Reduced Instruction Set Computer; CISC stands for Complex Instruction Set
Computer.
2. In RISC the instruction set is reduced, i.e. it has only a few instructions, many of them very
primitive; in CISC the instruction set has a variety of different instructions that can be used for
complex operations.
3. RISC processors have simple instructions taking about one clock cycle; the average clock
cycles per instruction (CPI) of a RISC processor is about 1.5 and the clock rate is 50-150 MHz.
CISC processors have complex instructions that take multiple clock cycles to execute; the
average CPI of a CISC processor is between 2 and 15 and the clock rate is 33-50 MHz.
4. In RISC the CPU control mechanism is hardwired, without control memory; in CISC it is
microcoded, using control memory (ROM).
5. RISC addressing modes are limited to 3-5; CISC has 12-24 addressing modes.
6. RISC uses 32-192 general-purpose registers; CISC uses 8-24 general-purpose registers.
7. Cache design: RISC uses a split data cache and instruction cache; CISC uses a unified cache
for instructions and data.
8. Instruction formats: RISC uses a fixed (32-bit) format; CISC uses varying formats (16-64 bits
per instruction).
9. Memory references: RISC is register-to-register; CISC is memory-to-memory.
10. The most common RISC microprocessors are Alpha, ARC, ARM, AVR, MIPS, PA-RISC, PIC,
Power Architecture, and SPARC. Examples of CISC processors are the System/360, VAX, PDP-11,
the Motorola 68000 family, and AMD and Intel x86 CPUs.
VLIW Architecture
– The VLIW architecture is generalized from two well-established concepts: horizontal
microcoding and superscalar processing.
– A typical VLIW (very long instruction word) machine has instruction words hundreds of
bits in length.
– Very long instruction word (VLIW) describes a computer processing architecture in
which a language compiler or pre-processor breaks program instructions down into basic
operations that can be performed by the processor in parallel (that is, at the same time).
These operations are put into a very long instruction word which the processor can then
take apart without further analysis, handing each operation to an appropriate functional
unit.
– Multiple functional units are used concurrently in a VLIW processor. All functional units
share the use of a common large register file. The operations to be simultaneously
executed by the functional units are synchronized in a VLIW instruction.
– The main advantage of VLIW architecture is its simplicity in hardware structure and
instruction set. The VLIW processor can potentially perform well in scientific
applications where the program behavior is more predictable.
– Limitation: The challenge is to design a compiler or pre-processor that is intelligent
enough to decide how to build the very long instruction words. If dynamic pre-
processing is done as the program is run, performance may be a concern.
Difference between VLIW and SuperScalar processor
• Dynamic issue: Superscalar machines are able to dynamically issue multiple instructions
each clock cycle from a conventional linear instruction stream.
• Static issue: VLIW processors use a long instruction word that contains a usually fixed
number of instructions that are fetched, decoded, issued, and executed synchronously.
• A superscalar processor receives conventional instructions conceived for sequential
processors.
• A VLIW processor receives long instruction words, each comprising a field (or opcode) for each
execution unit.

Memory Hierarchy
Storage devices such as registers, caches, main memory, disk devices and backup storage are
often organized as a hierarchy as depicted in Fig.

Memory devices at a lower level are faster to access, smaller in size, and more expensive per
byte, having a higher bandwidth and using a smaller unit of transfer as compared with those at
a higher level.
Memory Interleaving
• Interleaved memory is a design made to compensate for the relatively slow speed of
dynamic random-access memory (DRAM).
• Memory interleaving divides the memory system into a number of modules and arranges them
so that successive words in the address space are placed in different modules. If
memory access requests are made for consecutive addresses, then the accesses will be
made to different modules. Since parallel access to these modules is possible, the
average rate of fetching words from the main memory can be increased.
• The idea of interleaved memory is shown in Figure 9 below.
• As shown in Figure 9, the lower-order k bits of the address are used to select the module
(memory bank) and the higher-order m bits give a unique memory location within the
bank selected by the lower-order k bits. In this way, consecutive memory
locations are stored in different memory banks.
• Whenever requests to access consecutive memory locations are made, several
memory banks are kept busy at any point in time. This results in faster access to a block
of data in memory and also in higher overall utilization of the memory
system as a whole. If k bits are allotted for selecting the bank as shown in the diagram,
there have to be 2^k banks in total; this ensures that there are no gaps of nonexistent
memory locations. A small address-decoding sketch follows.
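
As a minimal sketch of low-order interleaving (the bit widths are illustrative: k = 2 bank-select
bits, giving 2^2 = 4 banks), the bank number and the word offset within the bank can be extracted
from an address as follows:

#include <stdio.h>

#define K 2                                        /* k bank-select bits -> 2^K = 4 banks */

int main(void) {
    for (unsigned addr = 0; addr < 8; addr++) {
        unsigned bank   = addr & ((1u << K) - 1);  /* lower-order k bits select the bank */
        unsigned offset = addr >> K;               /* higher-order bits locate the word  */
        printf("address %u -> bank %u, offset %u\n", addr, bank, offset);
    }
    /* consecutive addresses 0,1,2,3,4,... fall in banks 0,1,2,3,0,..., so a block of
       consecutive words can be fetched from different banks in parallel */
    return 0;
}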

Q.6. Explain Backplane Bus System?


Ans: A backplane bus interconnects processors, data storage and peripheral devices in a
tightly coupled hardware arrangement. The system bus must be designed to permit
communication among devices on the bus without disturbing the internal activities of all
devices attached to the bus. Timing protocols must be established to arbitrate among
multiple requests. Operational rules must be set to guarantee orderly data transfers on the
bus. Signal lines on the backplane are often functionally grouped into several buses, as
shown in the figure. The four groups shown here are very similar to those proposed in the
64-bit VME bus specification.

Various functional boards are plugged into slots on the backplane. Each slot is provided with
one or more connectors for plugging in the boards, as shown by the vertical arrows. For example,
one or two 96-pin connectors are used per slot on the VME backplane.

Addressing and Timing Protocols


• Two types of printed circuit boards connected to a bus: active and passive
• The master can initiate a bus cycle
– Only one can be in control at a time
• The slaves respond to requests by a master
– Multiple slaves can respond

Bus Addressing: The design should minimize overhead time, so that most bus cycles are used for
useful operations.
• Each board is identified with a slot number.
• When the slot number matches the contents of the high-order address lines, the board is
selected as a slave (slot addressing).
Broadcall and Broadcast: Most bus transactions have one slave/master

• Broadcall: a read operation in which multiple slaves place data on the bus
– used to detect multiple interrupt sources
• Broadcast: a write operation involving multiple slaves
– implements multicache coherence on the bus

Synchronous Timing: All bus transaction steps take place at fixed clock edges

• The clock cycle time is determined by the slowest device on the bus
• A data-ready pulse (from the master) initiates the transfer
• A data-accept signal (from the slave) indicates completion
• Simple, with less circuitry; best suited to devices of similar speed

Asynchronous Timing: Based on handshaking or interlocking

• Provides freedom of variable length clock signals for different speed devices
• No fixed clock cycle
• No response time restrictions
• More complex and costly, but more flexible

Arbitration: The process of selecting the next bus master is called arbitration. The duration of a
master's control of the bus is called bus tenure. The arbitration process is designed to restrict
tenure of the bus to one master at a time. Competing requests must be arbitrated on a fairness or
priority basis.

• Types:
1. Central arbitration
2. Distributed arbitration

Interrupt: An interrupt is a request from I/O or other devices to a processor for service or
attention. A priority interrupt bus is used to pass the interrupt signals. The interrupter must
provide status and identification information. A functional module can be used to serve as an
interrupt handler.
Priority interrupts are handled at many levels. For example, the VME bus uses seven interrupt-
request lines. Up to seven interrupt handlers can be used to handle multiple interrupts.
Interrupts can also be handled by message-passing using the data bus lines on a time-sharing
basis. The saving of dedicated interrupt lines is obtained at the expense of requiring some bus
cycles for handling message-based interrupts.
The use of time-shared data bus lines to implement interrupts is called virtual interrupt.
Futurebus+ was proposed without dedicated interrupt lines because virtual interrupts can be
effectively implemented with the data transfer bus.

Transaction: Data transfers and priority interrupt handling are two classes of operations regularly
performed on a bus. A bus transaction consists of a request followed by a response. A
connected transaction is used to carry out a master's request and a slave's response in a single
bus transaction.
A split transaction splits the request and response into separate bus transactions. Split
transactions allow devices with a long data latency or access time to use the bus resources in a
more efficient way. A complete split transaction may require two or more connected bus
transactions. Split transactions across multiple bus sequences are performed to achieve cache
coherence in a large multiprocessor system.
Transaction modes: Three data transfer modes are specified below. An address-only transfer
consists of an address transfer followed by no data. A compelled data transfer consists of an
address transfer followed by a block of one or more data transfers to one or more contiguous
addresses. A packet data transfer consists of an address transfer followed by a fixed-length block
of data transfers (packet) from a set of contiguous addresses.
Unit-3
Pipelining: In computers, a pipeline is the continuous and somewhat overlapped movement
of instructions to the processor, or of the arithmetic steps taken by the processor to perform an
instruction. Pipelining is the use of a pipeline. Without a pipeline, a computer processor gets
the first instruction from memory, performs the operation it calls for, and then goes to get the
next instruction from memory, and so forth. While fetching (getting) the instruction, the
arithmetic part of the processor is idle. It must wait until it gets the next instruction. With
pipelining, the computer architecture allows the next instructions to be fetched while the
processor is performing arithmetic operations, holding them in a buffer close to the processor
until each instruction operation can be performed. The staging of instruction fetching is
continuous. The result is an increase in the number of instructions that can be performed
during a given time period.

Linear pipeline processor


A linear pipeline processor is a cascade of processing stages which are linearly connected to
perform a fixed function over a stream of data flowing from one end to the other. In modern
computers, linear pipelines are applied for instruction execution, arithmetic computation, and
memory access operations.
A linear pipeline processor is built with k processing stages. External inputs are fed into
the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1
for all i = 1, 2, ..., k-1. The final result emerges from the pipeline at the last stage Sk. A small
timing sketch follows.
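
As a minimal sketch of the usual timing model for a k-stage linear pipeline (assuming unit-time
stages and no stalls, an assumption not stated explicitly in these notes), the first result takes k
cycles and each further result takes one more cycle:

#include <stdio.h>

int main(void) {
    int k = 4;                                /* number of pipeline stages        */
    int n = 100;                              /* number of tasks or instructions  */
    int pipelined    = k + (n - 1);           /* cycles needed with pipelining    */
    int nonpipelined = n * k;                 /* cycles needed without pipelining */
    printf("pipelined: %d cycles, non-pipelined: %d cycles, speedup: %.2f\n",
           pipelined, nonpipelined, (double)nonpipelined / pipelined);
    return 0;                                 /* speedup approaches k as n grows  */
}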

NonLinear pipeline processor

A dynamic pipeline can be reconfigured to perform variable functions at different times. The
traditional linear pipelines are static pipelines, because they are used to carry out fixed
functions. A dynamic pipeline permits feedforward and feedback connections in addition to the
streamline connections. For this reason, some authors call such a structure a non-linear
pipeline.

This pipeline has three stages. Besides the streamline connections from S1 to S2 and from S2 to
S3, there is a feedforward connection from S1 to S3 and two feedback connections from S3 to S2
and from S3 to S1.
These feedforward and feedback connections make the scheduling of successive events into
the pipeline a non-trivial task. With these connections, the output of the pipeline is not
necessarily taken from the last stage. In fact, by following different data flow patterns, one can use
the same pipeline to evaluate different functions.

Pipeline hazards

Pipelining, as used in superscalar and other processors, is an implementation
technique in which multiple instructions are overlapped in execution. The computer pipeline is
divided into stages (Fetch, Decode, Execute, Store/Write back). The stages are connected one to
the next to form a pipe (Prabhu, n.d.).

Pipeline hazards are situations that prevent the next instruction in the instruction stream from
executing during its designated clock cycles.
There are primarily three types of hazards:

i. Data Hazards: A data hazard is any condition in which either the source or the destination
operands of an instruction are not available at the time expected in the pipeline. As a result of
which some operation has to be delayed and the pipeline stalls.

A data hazard occurs whenever there are two instructions, one of which depends on the data
produced by the other, as in the following sequence:

A=3+A

B=A*4

For the above sequence, the second instruction needs the value of ‘A’ computed in the first
instruction.

Thus the second instruction is said to depend on the first.

If the execution is done in a pipelined processor, it is highly likely that the interleaving of these
two instructions can lead to incorrect results due to data dependency between the instructions.
Thus the pipeline needs to be stalled as and when necessary to avoid errors.

To avoid data hazards, the following techniques are used (a short scheduling sketch follows this list):

i) Forwarding: forwarding involves providing the inputs to a stage of one instruction before the
completion of another instruction.
ii) Putting no-op instructions after each instruction that may cause a hazard.
iii) Avoiding the hazard by introducing stalls in the decode stage until the required operand is available.
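
A compiler can also hide such a hazard by static scheduling, i.e. by moving an independent
instruction into the gap between the dependent pair shown above (the statement C = X + Y is
purely illustrative and not from the original example):

A = 3 + A;    /* produces the new value of A                             */
C = X + Y;    /* independent work scheduled here fills the hazard window */
B = A * 4;    /* consumes A only after its value is available            */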

ii. Structural Hazards:

This situation arises mainly when two instructions require a given hardware resource at the
same time and hence for one of the instructions the pipeline needs to be stalled.

The most common case is when memory is accessed at the same time by two instructions. One
instruction may need to access the memory as part of the Execute or Write-back phase while
another instruction is being fetched. If both the instructions and the data reside in the
same memory, the two instructions cannot proceed together, and one of them must be
stalled until the other is done with its memory access. In general, sufficient hardware
resources are needed to avoid structural hazards.

To avoid structural hazards, the following solutions are used:

• Arbitration with interlocking: hardware performs resource-conflict arbitration and
interlocks one of the competing instructions.
• Resource replication.

iii. Control hazards:

Control hazards arise from the pipelining of branches and other instructions that change the PC.
The problem occurs when one of the instructions is a branch to some other
memory location: all the instructions already fetched into the pipeline from consecutive memory
locations are now invalid and need to be removed (also called flushing the pipeline). This induces
a stall until new instructions are fetched from the memory address specified in the branch
instruction.

The time lost because of this is called the branch penalty. Dedicated hardware is often
incorporated in the fetch unit to identify branch instructions and compute branch target addresses
as early as possible, thereby reducing the resulting delay. (Ques10, 2015)

To avoid control hazards, the following techniques are used:

Pipeline stall cycles: freeze the pipeline until the branch outcome and target are known, then
proceed with the fetch.

Branch prediction: branch prediction involves guessing whether the branch is taken or not, and
acting on that guess:
- if correct, proceed with normal pipeline execution;
- if incorrect, stall (flush) the pipeline and restart from the correct path.

Branch delay slots: delayed branching involves executing the next sequential instruction, with the
branch taking effect after that delayed branch slot.

Dynamic Instruction Scheduling

Data hazards in a program cause a processor to stall.

With static scheduling the compiler tries to reorder these instructions during compile time to
reduce pipeline stalls.
– Uses less hardware
– Can use more powerful algorithms

With dynamic scheduling the hardware tries to rearrange the instructions during run-time to
reduce pipeline stalls.
Dynamic scheduling offers several advantages:

i. Handles dependencies not known at compile time.
ii. It simplifies the compiler.
iii. It allows code that was compiled with one pipeline in mind to run efficiently on a different
pipeline.

Limitation: These advantages are gained at a cost of a significant increase in hardware
complexity.

– Scoreboarding is a technique for allowing instructions to execute out of order when there
are sufficient resources and no data dependencies.
– First implemented in 1964 by the CDC 6600.
– The goal of a scoreboard is to maintain an execution rate of one instruction per clock cycle
(when there are no structural hazards) by executing an instruction as early as possible.
Thus, when the next instruction to execute is stalled, other instructions can be issued and
executed if they do not depend on any active or stalled instruction. The scoreboard takes
full responsibility for instruction issue and execution, including all hazard detection.

Limitations
– No forwarding logic
– Stalls for WAW hazards
– Waits for WAR hazards before WB

The Tomasulo approach is another scheme that allows execution to proceed in the presence of
hazards; it was developed for the IBM 360/91 floating-point unit. This scheme combines key
elements of the scoreboarding scheme, such as out-of-order execution of instructions, with the
introduction of register renaming.

– In Tomasulo's algorithm, WAW and WAR hazards are avoided by register renaming; this
functionality is provided by the reservation stations, which buffer the operands of
instructions waiting to issue, and by the issue logic.
– The Tomasulo algorithm is designed to handle name dependences (WAW and WAR hazards)
efficiently.

3 stages of the Tomasulo algorithm (a sketch of a reservation-station entry follows):
1. Issue — get the instruction from the head of the operation queue (FIFO).
If a reservation station is free (no structural hazard), the control logic issues the instruction and
sends it the operands (renaming registers).
2. Execute — operate on the operands (EX).
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
3. Write result — finish execution (WB).
Write the result on the Common Data Bus to all awaiting units; mark the reservation station available.
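
As a minimal sketch of the bookkeeping held in one reservation station (the field names Vj/Vk for
operand values and Qj/Qk for the producing stations follow common textbook usage; the exact
layout is an illustrative assumption, not taken from these notes):

/* one reservation-station entry in a Tomasulo-style machine */
typedef struct {
    int    busy;      /* is this entry in use?                                */
    int    op;        /* operation to perform (e.g. ADD, MUL)                 */
    double vj, vk;    /* operand values, valid only when qj/qk are 0          */
    int    qj, qk;    /* tags of the stations that will produce the operands; */
                      /* 0 means the value is already present in vj/vk        */
    int    dest;      /* tag broadcast on the Common Data Bus with the result */
} ReservationStation;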

Advantages
– Prevents register from being the bottleneck
– Eliminates WAR, WAW hazards
– Allows loop unrolling in HW

Limitations
-Performance limited by Common Data Bus
-Non‐precise interrupts!

Branch handling
• A branch is a flow-altering instruction that must be handled in a special manner in
pipelined processors.
• A branch is an instruction in a computer program that can cause the computer to begin
executing a different instruction sequence and thus deviate from its default behavior of
executing instructions in order.
• Branching refers to the act of switching execution to a different instruction sequence as
a result of executing a branch instruction.
• If the branch is taken, control is transferred to the branch target instruction.
• If the branch is not taken, the instructions already available in the pipeline are used.
• When the branch is taken, every instruction available in the pipeline, at its various stages,
is removed, and fetching of instructions begins at the target address. Because of this, the
pipeline works inefficiently for three clock cycles; this delay is called the branch penalty.

Branch handling Techniques

Branch Prediction: In this technique the outcome of a branch decision is predicted before the
branch is actually executed.
A branch can be predicted either statically, based on the branch instruction type, or dynamically,
based on the branch history during program execution.
i. Static branch prediction: always predicts the same direction for the same branch during the
whole program execution. It comprises hardware-fixed prediction and compiler-directed
prediction.
– Such a static branch strategy may not always be accurate.
– The static prediction direction (taken or not taken) is usually wired into the processor.
– The wired-in static prediction cannot be changed once committed to the hardware.
ii. Dynamic branch prediction: the hardware adjusts the prediction while execution
proceeds.
– The prediction is decided from the computation history of the program.
– In general, dynamic branch prediction gives better results than static branch prediction,
but at the cost of increased hardware complexity. A small predictor sketch follows.
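
As a minimal sketch of one common dynamic prediction scheme (a table of 2-bit saturating
counters indexed by the branch address; the table size and indexing are illustrative assumptions,
not from these notes):

/* 2-bit saturating counters: values 0,1 predict not-taken; 2,3 predict taken */
#define TABLE_SIZE 1024
static unsigned char counters[TABLE_SIZE];        /* all start at 0 (strongly not-taken) */

int predict_taken(unsigned int pc) {
    return counters[pc % TABLE_SIZE] >= 2;        /* look up the counter for this branch */
}

void update_predictor(unsigned int pc, int was_taken) {
    unsigned char *c = &counters[pc % TABLE_SIZE];
    if (was_taken  && *c < 3) (*c)++;             /* move toward "taken"     */
    if (!was_taken && *c > 0) (*c)--;             /* move toward "not taken" */
}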

2. Delayed Branching: Delayed branch simply means that some number of instructions that
appear after the branch in the instruction stream will be executed regardless of which way the
branch ultimately goes. In many cases, a compiler can put instructions in those slots that don't
actually depend on the branch itself, but if it can't, it must fill them with NOPs, which kills the
performance anyway. This approach keeps the hardware simple, but puts a burden on the
compiler technology.

Arithmetic Pipeline: The complex arithmetic operations like multiplication, and floating point
operations consume much of the time of the ALU. These operations can also be pipelined by
segmenting the operations of the ALU and as a consequence, high speed performance may be
achieved. Thus, the pipelines used for arithmetic operations are known as arithmetic pipelines.

Classification according to pipeline configuration: According to the configuration of a pipeline,
the following types are identified under this classification:
• Static Arithmetic pipeline:
– Static arithmetic pipelines are designed to perform a fixed and dedicated function and
are thus called unifunctional.

• Multifunction Arithmetic Pipeline:
– When a pipeline can perform more than one function, it is called multifunctional.
– A multifunctional pipeline can be either static or dynamic. Static pipelines perform one
function at a time, but different functions can be performed at different times. A dynamic
pipeline allows several functions to be performed simultaneously through the pipeline,
as long as there are no conflicts in the shared usage of pipeline stages.
A superscalar processor executes more than one instruction during a clock cycle by
simultaneously dispatching multiple instructions to different FUs on the processor. Each FU is
not a separate CPU core but an execution resource within a single CPU such as an ALU, a bit
shifter, or a multiplier.

In contrast to a superscalar processor, a superpipelined one has split the main computational
pipeline into more stages. Each stage is simpler (does less work) and thus the clock speed can
be increased thereby possibly increasing the number of instructions running in parallel at each
cycle.

Superscalar machines can issue several instructions per cycle.

Superpipelined machines can issue only one instruction per cycle, but they have cycle times
shorter than the time required for any operation.

Superscalar processors are more difficult to design than superpipelined processors;
superpipelined processors are easier to design than superscalar ones.

A superscalar machine performs one pipeline stage per clock cycle in each of its parallel pipelines.

A superpipelined system is capable of performing two (or more) pipeline stages per clock cycle.
Unit-4

Cache coherence

In a memory hierarchy for a multiprocessor system, data inconsistency may occur between
adjacent levels or within the same level. For example, the cache and main memory may contain
inconsistent copies of the same data object. Multiple caches may possess different copies of
the same memory block because multiple processors operate asynchronously and
independently.

Caches in a multiprocessing environment introduce the cache coherence problem. When multiple
processors maintain locally cached copies of a unique shared-memory location, any local
modification of the location can result in a globally inconsistent view of memory. Cache
coherence schemes prevent this problem by maintaining a uniform state for each cached block
of data.

Snoopy Bus Protocols

In using private caches associated with processors tied to a common bus, two approaches have
been practiced for maintaining cache consistency:

i. Write-Invalidate

ii. Write-update

Essentially, the write-invalidate policy will invalidate all remote copies when a local cache block
is updated.

The write-update policy will broadcast the new data block to all caches containing a copy of the
block.

Snoopy protocols achieve data consistency among the caches and shared memory through a
bus-watching mechanism. As illustrated in Fig. 7.14, the two snoopy bus protocols produce different
results. Consider three processors (P1, P2, and Pn) maintaining consistent copies of block X in
their local caches (Fig. 7.14a) and in the shared-memory module marked X.

Using a write-invalidate protocol, the processor P1 modifies (writes) its cached copy of X to X',
and all other copies are invalidated via the bus (denoted I in Fig. 7.14b). Invalidated blocks are
sometimes called dirty, meaning they should not be used.

The write-update protocol (Fig. 7.14c) demands that the new block content X' be broadcast to all
cache copies via the bus. The memory copy is also updated if write-through caches are used. With
write-back caches, the memory copy is updated later, at block replacement time.
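
A minimal sketch of how the two write policies behave, written as they might appear in a simple
bus-snooping simulator (the data structures here are illustrative assumptions, not a real protocol
implementation):

enum state { INVALID, VALID };
struct cache_line { enum state st; int data; };

/* write-invalidate: the writer keeps the new value, every other copy is invalidated */
void write_invalidate(struct cache_line caches[], int ncaches, int writer, int new_val) {
    for (int i = 0; i < ncaches; i++)
        if (i != writer)
            caches[i].st = INVALID;
    caches[writer].data = new_val;
    caches[writer].st   = VALID;
}

/* write-update: the new value is broadcast to every cache currently holding a copy */
void write_update(struct cache_line caches[], int ncaches, int new_val) {
    for (int i = 0; i < ncaches; i++)
        if (caches[i].st == VALID)
            caches[i].data = new_val;
}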
Snoopy schemes do not scale because they rely on broadcast.

Directory-Based protocol

(also Limitations of snoopy protocol)

A write-invalidate protocol may lead to heavy bus traffic caused by read-misses, resulting from
the processor updating a variable and other processors trying to read the same variable.

On the other hand, the write-update protocol may update data items in remote caches that
will never be used by other processors. In fact, these problems pose additional limitations on
using buses to build large multiprocessors.

When a multistage or packet switched network is used to build a large multiprocessor with
hundreds of processors, the snoopy cache protocols must be modified to suit the network
capabilities. Since broadcasting is expensive to perform in such a network, consistency
commands will be sent only to those caches that keep a copy of the block. This leads to
Directory-Based protocol for network-connected multiprocessors.

• A directory is a data structure that maintains information on the processors that share a
memory block and on its state.
• The information maintained in the directory could be either centralized or distributed.
• A central directory maintains information about all blocks in a central data structure.
• The same information can be handled in a distributed fashion by allowing each memory
module to maintain a separate directory.
Protocol Categorization:

— Full Map Directories
• Each directory entry contains N pointers, where N is the number of processors.
• There could be N cached copies of a particular block, shared by all processors.
• For every memory block, an N-bit vector is maintained, where N equals the number of
processors in the shared memory system. Each bit in the vector corresponds to one
processor. (A small sketch of such an entry follows this categorization.)

— Limited Directories-

• Fixed number of pointers per directory entry regardless of the number of processors.
• Restricting the number of simultaneously cached copies of any block should solve the
directory size problem that might exist in full-map directories.

— Chained Directories.

• Chained directories emulate full-map by distributing the directory among the caches.
• Solving the directory size problem without restricting the number of shared block
copies.
• Chained directories keep track of shared copies of a particular block by maintaining a
chain of directory pointers.
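
As a minimal sketch of a full-map directory entry that keeps an N-bit presence vector, one bit per
processor (the field names and the single dirty flag are illustrative assumptions, not taken from
these notes):

#include <stdint.h>

#define N_PROCS 32                       /* one presence bit per processor             */

/* full-map directory entry for one memory block */
typedef struct {
    uint32_t presence;                   /* bit i set => processor i caches this block */
    int      dirty;                      /* set while one cache holds a modified copy  */
} dir_entry;

/* record that processor p has obtained a copy of the block */
static inline void add_sharer(dir_entry *e, int p) { e->presence |= (1u << p); }

/* on a write by processor p: invalidation commands are sent only to the other
   sharers recorded in the presence vector, then only the writer remains       */
static inline void on_write(dir_entry *e, int p) {
    e->presence = (1u << p);
    e->dirty    = 1;
}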

Message routing schemes

1. Store-and-forward routing: In a store-and-forward network, packets are the basic unit of
information flow. Each node is required to have a packet buffer. A packet is transmitted from a
source node to a destination node through a sequence of intermediate nodes. When a packet
reaches an intermediate node, it is first stored in the buffer. It is then forwarded to the next
node if the desired output channel and a packet buffer in the receiving node are both available.

The latency in store-and-forward networks is directly proportional to the distance (the number
of hops) between the source and the destination. This routing scheme was implemented in the
first generation of multicomputers.

If an error occurs when forwarding, the switch can re-send the packet.

2. Wormhole routing: In wormhole routing, packets are subdivided into flits (a few bytes each).

– All flits of the same packet are transmitted in order, as inseparable companions, in
a pipelined fashion. The first flit contains the header with the destination address. A switch
receives the header flit, decides where to forward it, and the other flits follow the header flit.
– The packet appears to worm its way through the network.
– If an error occurs along the way, the sender must re-send the whole packet, because no
single switch holds the entire packet and can re-send it. A small latency comparison follows.
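
As a rough first-order comparison (a common textbook latency model, stated here as an
assumption: packet length L, flit length F, channel bandwidth W, and D hops between source and
destination), store-and-forward latency grows with the product of distance and packet length,
whereas in wormhole routing only the header pays the per-hop delay:

#include <stdio.h>

int main(void) {
    double L = 1024.0;                       /* packet length in bytes (illustrative)  */
    double F = 8.0;                          /* flit length in bytes (illustrative)    */
    double W = 1.0;                          /* channel bandwidth, bytes per time unit */
    int    D = 10;                           /* number of hops on the route            */

    double t_sf = (L / W) * (D + 1);         /* whole packet is stored at every node   */
    double t_wh = (L / W) + (F / W) * D;     /* header flit alone pays the hop delay   */
    printf("store-and-forward: %.0f  wormhole: %.0f  (time units)\n", t_sf, t_wh);
    return 0;
}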

Vector Processing Definitions: A vector is an ordered set of scalar data items, all of the same
type, stored in memory. Usually, the vector elements are ordered to have a fixed addressing
increment between successive elements, called the stride.

A vector processor is an ensemble of hardware resources, including vector registers,
functional pipelines, processing elements, and register counters, for performing vector
operations. Vector processing occurs when arithmetic or logical operations are applied to
vectors. It is distinguished from scalar processing, which operates on one datum or one pair
of data. The conversion from scalar code to vector code is called vectorization.

In general, vector processing is faster and more efficient than scalar processing. Both
pipelined processors and SIMD computers can perform vector operations. Vector
processing reduces software overhead incurred in the maintenance of looping control,
reduces memory-access conflicts, and above all matches nicely with the pipelining and
segmentation concepts to generate one result per clock cycle continuously.
Depending on the speed ratio between vector and scalar operations (including startup
delays and other overheads) and on the vectorization ratio in user programs, a vector
processor executing a well-vectorized code can easily achieve a speedup of 10 to 20 times,
as compared with scalar processing on conventional machines.

Of course, the enhanced performance comes with increased hardware and compiler costs,
as expected. A compiler capable of vectorization is called a vectorizing compiler or simply a
vectorizer. For successful vector processing, one needs to make improvements in vector
hardware, vectorizing compilers, and programming skills specially targeted at vector
machines.
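
As a minimal sketch of what vectorization means in practice (the loop is illustrative, and the
vectorized form is shown as a comment because the actual vector instructions depend on the
machine):

/* scalar code: one element is processed per trip through the loop body */
for (int i = 0; i < n; i++)
    c[i] = a[i] + b[i];

/* after vectorization, a vectorizing compiler replaces the loop with (conceptually)
   a single vector instruction,  c[0..n-1] = a[0..n-1] + b[0..n-1],  which a vector
   functional pipeline streams through to produce roughly one result per clock cycle */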

Vector Instruction Types

Vector Access memory schemes

i. C-Access Memory Organization

This is a vector access scheme using interleaved memory modules.

• m-way low-order interleaved memory structure
• Allows m memory words to be accessed concurrently
• This is called C-access (concurrent access)

ii. S-Access Memory Organization: The low-order interleaved memory can be rearranged to
allow simultaneous access, or S-access. All memory modules are accessed simultaneously in a
synchronized manner.
• Similar to low-order interleaved memory:
– the high-order bits select the same word offset within every module;
– the words from all modules are latched at the same time;
– the low-order bits then select the words from the data latches,
through a multiplexer operating at higher speed (minor cycles).
• This is called S-access (simultaneous access)

iii. C/S Access

• A memory organization in which C-access and S-access are combined is called C/S-access.
• In this scheme, n access buses are used with m interleaved memory modules attached to each
bus. The m modules on each bus are m-way interleaved to allow C-access. The n buses operate
in parallel to allow S-access. In each memory cycle, at most m·n words are fetched if the n
buses are fully used with pipelined memory accesses (for example, n = 4 buses with m = 8
modules each can deliver up to 32 words per memory cycle).
