
Chapter 5 Parallel Processing

Multiple Processor Organization


Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream - MIMD

SISD
Single processor executes a single instruction stream to operate on data stored in a single memory
Uni-processor

SIMD
Single machine instruction controls simultaneous execution of a number of processing elements
Each instruction is executed on a different set of data by different processors
Vector and array processors
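A minimal sketch of the SIMD idea (pure Python standing in for hardware lanes; the function name is illustrative): one operation applied across several data elements in lockstep.

```python
# Illustrative sketch of SIMD: one "instruction" (a single operation)
# applied across multiple data lanes, as an array processor would.
def simd_apply(op, lanes_a, lanes_b):
    # One operation, many data elements - every lane gets the same instruction.
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

result = simd_apply(lambda a, b: a + b, [1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # -> [11, 22, 33, 44]
```

In real SIMD hardware the lanes execute in the same clock cycle; the list comprehension here only models the single-instruction, multiple-data relationship, not the parallel timing.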

MISD
Sequence of data transmitted to set of processors Each processor executes different instruction sequence Never been implemented

MIMD
Set of processors simultaneously execute different instruction sequences on different sets of data SMPs, clusters and NUMA systems

MIMD - Overview
General purpose processors Each can process all instructions necessary Further classified by method of processor communication

Taxonomy of Parallel Processor Architectures

Tightly Coupled - SMP


Processors share memory and communicate via that shared memory
Symmetric Multiprocessor (SMP):
Share a single memory or pool of memory
Shared bus to access memory
Memory access time to a given area of memory is approximately the same for each processor

NUMA
Non-uniform memory access
Access times to different regions of memory may differ
7

Loosely Coupled - Clusters


Collection of independent uniprocessors or SMPs interconnected to form a cluster Communication via fixed path or network connections

Parallel Organizations
SISD

SIMD

MIMD (Shared Memory)

10

MIMD (Distributed Memory)

11

Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
Two or more similar processors of comparable capacity
Processors share the same memory and are connected by a bus or other internal connection such that memory access time is approximately the same for each processor
All processors share access to I/O
All processors can perform the same functions (symmetric)
System controlled by an integrated operating system that provides interaction between processors and their programs
12

Multiprogramming and Multiprocessing

13

SMP Advantages
Performance
If some work can be done in parallel

Availability
Since all processors can perform the same functions, failure of a single processor does not halt the system

Incremental growth
Increase performance by adding additional processors

Scaling
Vendors can offer range of products based on number of processors
14

Block Diagram of Tightly Coupled Multiprocessor

15

Time Shared Bus


Most common organization; it is simple
Structure and interface similar to a single-processor system
Features provided:
Addressing - distinguish modules on the bus to determine source and destination of data
Arbitration - any module can be a temporary bus master
Time sharing - if one module has the bus, others must wait and may have to suspend

16

Symmetric Multiprocessor Organization

17

Time Share Bus - Advantages


Simplicity
Simplest approach for multiprocessor organization

Flexibility
Easy to expand the system by attaching more processors to the bus.

Reliability
Bus is a passive medium, and the failure of any attached device should not cause failure of the whole system

18

Time Shared Bus - Disadvantage


Performance
Limited by bus cycle time because all references pass through the bus

Each processor should have local cache


Reduce number of bus accesses

Leads to problems with cache coherence


When the cache in one processor is altered, the caches of the other processors must be informed of the change so they do not use stale data

19
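The coherence problem can be sketched with a toy write-invalidate scheme (a simplification: write-through is assumed for brevity, real protocols such as MESI track more line states, and the class and function names are illustrative).

```python
# Toy write-invalidate coherence sketch: when one processor writes a line,
# copies of that line in the other caches are invalidated, so no processor
# can read a stale value.
class Cache:
    def __init__(self):
        self.lines = {}                       # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:
            self.lines[addr] = memory[addr]   # miss: fetch over the bus
        return self.lines[addr]

def write(writer, caches, memory, addr, value):
    memory[addr] = value                      # write-through to memory
    writer.lines[addr] = value
    for c in caches:                          # invalidate every other copy
        if c is not writer and addr in c.lines:
            del c.lines[addr]

memory = {0x10: 5}
c0, c1 = Cache(), Cache()
c0.read(0x10, memory)
c1.read(0x10, memory)                 # both caches now hold the line
write(c0, [c0, c1], memory, 0x10, 7)  # c0 writes; c1's copy is invalidated
print(c1.read(0x10, memory))          # -> 7 (re-fetched, not the stale 5)
```

The extra bus traffic needed to broadcast the invalidations is exactly the coherence cost the slide alludes to.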

Mathematical problems involving physical processes are difficult for computation


Aerodynamics, seismology, meteorology, atomic, nuclear Continuous field simulation

Vector Computation

High-precision repeated floating point calculations on large arrays of numbers
Supercomputers handle these types of problems
Hundreds of millions of floating point operations per second
$10-15 million
Optimised for calculation
Limited market: research, government agencies, meteorology
20

Another system designed for vector computation - the array processor


Alternative to supercomputer Configured as peripherals to mainframe & mini computers Just run vector portion of problems

21

Vector Addition Example

22

Processor Designs
Pipelined ALU
Decomposition of floating point operations into stages
Different stages can operate on different sets of data in parallel

Can be further enhanced if the vector elements are available in registers rather than from main memory Within operations Across operations
23
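The throughput gain of a pipelined ALU can be sketched numerically (the stage names follow the usual decomposition of a floating-point add; the exact stages vary by machine).

```python
# Pipelined ALU sketch: a floating-point add decomposed into stages, with a
# new pair of vector elements entering the pipeline every cycle. For n element
# pairs and s stages, the pipeline finishes in n + s - 1 cycles instead of n*s.
ALU_STAGES = ["compare exponents", "align significands",
              "add significands", "normalize result"]

def pipelined_cycles(n_elements, n_stages=len(ALU_STAGES)):
    # First result appears after n_stages cycles; one more per cycle after that.
    return n_elements + n_stages - 1

def unpipelined_cycles(n_elements, n_stages=len(ALU_STAGES)):
    return n_elements * n_stages

print(pipelined_cycles(100), unpipelined_cycles(100))  # -> 103 400
```

The advantage grows with vector length, which is why this organization suits long arrays of operands.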

Approaches to Vector Computation

24

Chaining
Used in Cray supercomputers
A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
The result from one functional unit is fed immediately into another
If vector registers are used, intermediate results do not have to be stored in memory

25

Parallel ALUs

Parallel processors
Break the task up into multiple processes to be executed in parallel
Effective only if the software and hardware support efficient coordination of the parallel processors
26

Operating System Support

27

OS
The OS is a program that controls the execution of application programs and acts as an interface between the user and the hardware. It manages the computer's resources, provides services for programmers, and schedules the execution of other programs.

28

Objectives and Functions


Convenience
Making the computer easier to use

Efficiency
Allowing better use of computer resources

29

Layers and Views of a Computer System

30

Operating System Services


Program creation Program execution Access to I/O devices Controlled access to files System access Error detection and response Accounting

31

O/S as a Resource Manager

32

Types of Operating System


Interactive Batch Single program (Uni-programming) Multi-programming (Multi-tasking)

33

Early Systems
Late 1940s to mid 1950s No Operating System Programs interact directly with hardware Two main problems:
Scheduling Setup time

34

Simple Batch Systems


Resident Monitor program Users submit jobs to operator who batches jobs Monitor controls sequence of events to process batch When one job is finished, control returns to Monitor which reads next job Monitor handles scheduling

35

Memory Layout for Resident Monitor

36

Desirable Hardware Features


Memory protection
To protect the Monitor

Timer
To prevent a job monopolizing the system

Privileged instructions
Only executed by Monitor e.g. I/O

Interrupts
Allows regaining control from user program
37

Multi-programmed Batch Systems


I/O devices very slow When one program is waiting for I/O, another can use the CPU

38
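The benefit can be illustrated with a back-of-envelope calculation (the numbers are assumed for illustration): if a job computes for 1 ms and then waits 9 ms for I/O, a uniprogrammed CPU idles 90% of the time, while interleaving more such jobs fills the idle time.

```python
# Idealized multiprogramming sketch: n identical jobs, each needing
# compute_ms of CPU per (compute_ms + io_ms) cycle. The CPU is busy for
# at most the whole cycle, so utilization saturates at 1.0.
def utilization(compute_ms, io_ms, n_jobs):
    busy = min(n_jobs * compute_ms, compute_ms + io_ms)
    return busy / (compute_ms + io_ms)

print(utilization(1, 9, 1))   # -> 0.1  (uniprogramming: CPU busy 10%)
print(utilization(1, 9, 2))   # -> 0.2  (two jobs overlap compute with I/O)
print(utilization(1, 9, 10))  # -> 1.0  (enough jobs keep the CPU saturated)
```

Real workloads do not overlap this perfectly, but the direction of the effect is the point of the slide.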

Single Program

39

Multi-Programming with Two Programs

40

Multi-Programming with Three Programs

41

Time Sharing Systems


Allow users to interact directly with the computer
i.e. Interactive

Multi-programming allows a number of users to interact with the computer

42

Scheduling
Key to multi-programming Types
Long term Medium term Short term I/O

43

Long Term Scheduling


Determines which programs are submitted for processing i.e. controls the degree of multi-programming Once submitted, a job becomes a process for the short term scheduler

44

Medium Term Scheduling


Part of the swapping function Usually based on the need to manage multiprogramming If no virtual memory, memory management is also an issue

45

Short Term Scheduling


Also known as Dispatcher Fine grained decisions of which job to execute next Which job actually gets to use the processor in the next time slot

46
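One common short-term scheduling policy, round-robin, can be sketched briefly (this is one illustrative policy, not the only dispatcher design): each ready process gets the CPU for one time quantum, then rejoins the back of the ready queue.

```python
# Round-robin dispatcher sketch: processes take turns on the CPU, one
# quantum at a time, until their CPU burst is exhausted.
from collections import deque

def round_robin(bursts, quantum):
    """bursts: {pid: cpu_time_needed}; returns pids in completion order."""
    queue = deque(bursts)          # ready queue, in submission order
    remaining = dict(bursts)
    finished = []
    while queue:
        pid = queue.popleft()      # dispatch the process at the head
        remaining[pid] -= quantum
        if remaining[pid] > 0:
            queue.append(pid)      # quantum expired: back of the queue
        else:
            finished.append(pid)   # burst complete: process exits
    return finished

print(round_robin({"A": 3, "B": 1, "C": 2}, quantum=1))  # -> ['B', 'C', 'A']
```

Short bursts finish early under round-robin, which is why it suits interactive time-sharing systems.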

Five State Process Model

47
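The five-state model can be captured as a transition table (the states and transitions follow the standard model; the dictionary encoding and event names are illustrative).

```python
# Five-state process model as a transition table: New, Ready, Running,
# Blocked, Exit. Any (state, event) pair not listed is an illegal transition.
TRANSITIONS = {
    ("new", "admit"): "ready",
    ("ready", "dispatch"): "running",
    ("running", "timeout"): "ready",
    ("running", "event wait"): "blocked",
    ("blocked", "event occurs"): "ready",
    ("running", "release"): "exit",
}

def step(state, event):
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}")
    return TRANSITIONS[key]

s = step("new", "admit")        # new -> ready
s = step(s, "dispatch")         # ready -> running
s = step(s, "event wait")       # running -> blocked (e.g. waiting for I/O)
print(s)  # -> blocked
```

Note that a blocked process cannot be dispatched directly; its event must occur first, returning it to the ready queue.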

PCB Diagram

48

Scheduling Example

49

Key Elements involved in scheduling

50

Process Scheduling

51

Reduced Instruction Set Computers

52

Major Advances in Computers (1)

The family concept
IBM System/360 in 1964
DEC PDP-8

Microprogrammed control unit
Idea by Wilkes in 1951
Produced in IBM S/360 in 1964

Cache memory
IBM S/360 model 85 in 1969

Pipelining
Introduces parallelism into fetch execute cycle

Multiple processors
53

The Next Step - RISC


Reduced Instruction Set Computer Key features
Large number of general purpose registers Limited and simple instruction set Emphasis on optimising the instruction pipeline

54

Comparison of processors

55

Driving force for CISC


Software costs far exceed hardware costs Increasingly complex high level languages Semantic gap
difference between the operations provided in HLLs and those provided in computer architecture.

Leads to:
Large instruction sets
More addressing modes
Hardware implementation of HLL statements, e.g. the CASE (switch) machine instruction on the VAX
56

Execution Characteristics
Studies have been done to determine the characteristics of execution of machine instructions generated from HLL programs
These motivate a different approach: make the architecture that supports the HLL simpler, rather than more complex
Aspects studied:
Operations performed
Operands used
Execution sequencing

57

Operations
Assignments - predominate
Movement of data is of high importance

Conditional statements (IF, LOOP)


Sequence control Implemented in machine language

Procedure call-return is very time consuming

58

59

Operands
Mainly local scalar variables Optimisation should concentrate on accessing local variables

60

Procedure Calls
Very time consuming Depends on number of parameters passed Depends on level of nesting Most programs do not do a lot of calls followed by lots of returns

61

Implications
Attempting to make the instruction set architecture close to HLLs is not the most effective approach
Best support is given by optimising the most used and most time consuming features
Large number of registers
Careful design of pipelines
Simplified (reduced) instruction set

62

Why CISC (1)?


Compiler simplification?
Complex machine instructions harder to exploit Optimization more difficult

Smaller programs?
Program takes up less memory, but memory is now cheap
May not occupy fewer bits, just look shorter in symbolic form

63

Why CISC (2)?


Faster instruction execution?
More complex control unit
Microprogram control store larger
So even simple instructions may take longer to execute

It is far from clear that a trend to increasingly complex instruction sets is appropriate

64

RISC Characteristics
One instruction per cycle Register to register operations Few, simple addressing modes Few, simple instruction formats

65

RISC v CISC
Not clear cut Many designs borrow from both philosophies e.g. PowerPC and Pentium II

66

RISC Pipelining
Most instructions are register to register
Two phases of execution:
I: Instruction fetch
E: Execute (ALU operation with register input and output)

For load and store, three phases:
I: Instruction fetch
E: Execute (calculate memory address)
D: Memory (register-to-memory or memory-to-register operation)
67
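The phase structure above can be sketched directly (the instruction names are illustrative; any register-to-register operation takes the two-phase path).

```python
# Sketch of the simple RISC pipeline phases: register-to-register
# instructions use two phases (I, E); loads and stores add a memory
# phase (D) after the address calculation.
def phases(instr):
    if instr in ("load", "store"):
        return ["I", "E", "D"]   # fetch, address calculation, memory access
    return ["I", "E"]            # fetch, ALU operation on registers

for instr in ["add", "load", "sub", "store"]:
    print(instr, phases(instr))
```

Keeping most instructions to the same short two-phase pattern is what makes the pipeline easy to keep full.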

Effects of Pipelining

68

Optimization of Pipelining

Delayed branch
Makes use of a branch that does not take effect until after execution of the following instruction

Delayed load
Register that is the target of the load is locked by the processor
Execution of the instruction stream continues until the register is required, then idles until the load completes
Re-arranging instructions can allow useful work during the wait

Loop unrolling
Replicate the body of a loop a number of times
Reduces loop overhead
Increases instruction parallelism
Improves register, data cache or TLB locality

69
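Loop unrolling can be shown as a source-level transformation (a sketch with an unroll factor of 2; the function names and data are illustrative): the unrolled version does the same work with half as many loop-control steps and two independent operations per iteration.

```python
# Loop unrolling sketch: same computation, fewer loop iterations.
def scaled_sum(a, k):
    total = 0
    for i in range(len(a)):            # original: one element per iteration
        total += k * a[i]
    return total

def scaled_sum_unrolled(a, k):
    total = 0
    for i in range(0, len(a) - 1, 2):  # unrolled by 2: two elements per pass
        total += k * a[i]
        total += k * a[i + 1]
    if len(a) % 2:                     # clean-up for an odd trip count
        total += k * a[-1]
    return total

data = [1, 2, 3, 4, 5]
print(scaled_sum(data, 3), scaled_sum_unrolled(data, 3))  # -> 45 45
```

In Python the benefit is only conceptual; on pipelined hardware the two additions in each unrolled pass can be scheduled to overlap, which is the point of the optimization.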

Delayed branch

70

Use of Delayed Branch

71

72

Controversy
Quantitative
compare program sizes and execution speeds

Qualitative
examine issues of high level language support

Problems
No pair of RISC and CISC machines is directly comparable
No definitive set of test programs
Most comparisons done on toy machines rather than production machines
Most commercial devices are a mixture of both

73

Control Unit Operation

74

Micro-Operations
A computer executes a program using the fetch/execute cycle
Each cycle has a number of steps (see pipelining)
These steps are called micro-operations
Each micro-operation does very little

75

Constituent Elements of Program Execution

76

Fetch - 4 Registers
Memory Address Register (MAR)
Connected to address bus Specifies address for read or write op

Memory Buffer Register (MBR)


Connected to data bus Holds data to write or last data read

Program Counter (PC)


Holds address of next instruction to be fetched

Instruction Register (IR)


Holds last instruction fetched
77

Fetch Sequence
Address of next instruction is in PC; it is moved to MAR
Control unit issues READ command
Result (data from memory) appears on data bus
Data from data bus copied into MBR
PC incremented by 1 (in parallel with data fetch from memory)
Data (instruction) moved from MBR to IR

78
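The register traffic above can be sketched with plain variables standing in for MAR, MBR, PC and IR, and a list standing in for memory (the instruction words are assumed values for illustration).

```python
# Fetch sequence sketch: Python variables stand in for the four registers
# and a list stands in for memory.
memory = [0x1940, 0x5941, 0x2941]   # assumed instruction words
PC = 0

MAR = PC                  # t1: MAR <- (PC)
MBR = memory[MAR]         # t2: MBR <- (memory)   (READ command issued)
PC = PC + 1               #     PC  <- (PC) + 1   (in parallel with the read)
IR = MBR                  # t3: IR  <- (MBR)

print(hex(IR), PC)  # -> 0x1940 1
```

Note the second and third assignments model micro-operations that the hardware performs in the same time unit; Python necessarily runs them one after the other.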

Fetch Sequence (symbolic)

t1: MAR <- (PC)
t2: MBR <- (memory)
    PC <- (PC) + 1
t3: IR <- (MBR)

The sequence consists of three steps and four micro-operations; the second and third micro-operations both take place during the second time unit

79

Rules for groupings of micro-operations


Proper sequence must be followed
MAR <- (PC) must precede MBR <- (memory)

Conflicts must be avoided


Must not read & write same register at same time MBR <- (memory) & IR <- (MBR) must not be in same cycle

Also: PC <- (PC) +1 involves addition


Use ALU May need additional micro-operations

80

Indirect Cycle
MBR contains the address of the operand, which must now be fetched:
MAR <- (IR(address))
MBR <- (memory)
IR(address) <- (MBR(address))

IR is now in the same state as if direct addressing had been used

81

Interrupt Cycle

This is a minimum
May be additional micro-ops to get addresses Saving context is done by interrupt handler routine, not micro-ops

82

Execute Cycle (ADD)


Different for each instruction
e.g. ADD R1,X - add the contents of location X to register R1, result in R1:
t1: MAR <- (IR(address))
t2: MBR <- (memory)
t3: R1 <- (R1) + (MBR)

83

Instruction Cycle
Each phase decomposed into sequence of elementary micro-operations E.g. fetch, indirect, and interrupt cycles Assume new 2-bit register
Instruction cycle code (ICC) designates which part of cycle processor is in
00: Fetch 01: Indirect 10: Execute 11: Interrupt

84
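The ICC-driven sequencing can be sketched as a small state machine (the transition logic mirrors the cycle order described above: fetch, optional indirect, execute, optional interrupt, back to fetch).

```python
# 2-bit instruction cycle code (ICC) sketch: the ICC designates which part
# of the cycle the processor is in, and is updated at the end of each part.
FETCH, INDIRECT, EXECUTE, INTERRUPT = 0b00, 0b01, 0b10, 0b11

def next_icc(icc, indirect_needed=False, interrupt_pending=False):
    if icc == FETCH:
        return INDIRECT if indirect_needed else EXECUTE
    if icc == INDIRECT:
        return EXECUTE            # operand address resolved; now execute
    if icc == EXECUTE:
        return INTERRUPT if interrupt_pending else FETCH
    return FETCH                  # interrupt cycle always returns to fetch

icc = FETCH
icc = next_icc(icc, indirect_needed=True)   # -> INDIRECT (0b01)
icc = next_icc(icc)                         # -> EXECUTE  (0b10)
icc = next_icc(icc)                         # -> FETCH    (0b00)
print(icc)  # -> 0
```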

Flowchart for Instruction Cycle

85

Functional Requirements
Define basic elements of processor Describe micro-operations processor performs Determine the functions that the control unit must perform to cause the micro-operations to be performed

86

Basic Elements of Processor


ALU
Registers
Internal data paths
External data paths
Control Unit

87

Types of Micro-operation
Transfer data between registers Transfer data from register to external interface Transfer data from external interface to register Perform arithmetic or logical operations

88

Functions of Control Unit


Sequencing
Causing the CPU to step through a series of microoperations

Execution
Causing the performance of each micro-op

This is done using Control Signals

89

Control Signals
Clock
This is how the control unit keeps time.

Instruction register
Op-code for current instruction Determines which micro-instructions are performed

Flags
Status of CPU Results of previous ALU operations

Control signals from control bus


Interrupts Acknowledgements
90

Model of Control Unit

91

Control Signals - output


Within CPU
Cause data movement Activate specific ALU functions

To control bus
To memory To I/O modules

92

Implementation
Two categories:
Hardwired implementation
Microprogrammed implementation

In a hardwired implementation, the control unit is essentially a combinational circuit: input logic signals are transformed into a set of output logic signals, which are the control signals

93

CPU Structure and Function

94

CPU Structure
CPU must:
Fetch instructions Interpret instructions Fetch data Process data Write data

95

CPU With Systems Bus

96

CPU Internal Structure

97

Registers
The CPU must have some working space (temporary storage), called registers
Number and function vary between processor designs
Registers form the top level of the memory hierarchy
They perform two roles:
User-visible registers
Control and status registers

98

User Visible Registers


General Purpose
Data
Address
Condition Codes (Flags)

99

User Visible Registers


May be true general purpose
May be restricted
May be used for data or addressing
Data: e.g. accumulator
Addressing: e.g. segment registers, index registers, stack pointer
100

How Many GP Registers?


Between 8 and 32
Fewer means more memory references
More does not noticeably reduce memory references

How big?
Large enough to hold full address Large enough to hold full word Often possible to combine two data registers

101

Condition Code Registers


Sets of individual bits
e.g. result of last operation was zero

Can be read (implicitly) by programs


e.g. Jump if zero

Cannot be set by programs

102

Control & Status Registers


Program Counter (PC)
Instruction Register (IR)
Memory Address Register (MAR)
Memory Buffer Register (MBR)

103

Program Status Word


A set of bits including condition codes:
Sign
Zero
Carry
Equal
Overflow
Interrupt enable/disable
Supervisor
104

Other Registers
May have registers pointing to:
Process control blocks Interrupt Vectors

105

Example Register Organizations

106

Indirect Cycle
May require memory access to fetch operands Indirect addressing requires more memory accesses Can be thought of as additional instruction subcycle

107

Data Flow (Instruction Fetch)


Depends on CPU design Fetch
PC contains address of next instruction
Address moved to MAR
Address placed on address bus
Control unit requests memory read
Result placed on data bus, copied to MBR, then to IR
Meanwhile PC incremented by 1

108

Data Flow (Data Fetch)


IR is examined If indirect addressing, indirect cycle is performed
Rightmost N bits of MBR transferred to MAR
Control unit requests memory read
Result (address of operand) moved to MBR

109

Data Flow (Fetch Diagram)

110

Data Flow (Indirect Diagram)

111

Data Flow (Interrupt Diagram)

112

Pipelining
Fetch instruction Decode instruction Calculate operands (i.e. EAs) Fetch operands Execute instructions Write result

Overlap these operations

113
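The gain from overlapping the six stages can be sketched numerically (stage abbreviations follow the list above: fetch instruction, decode, calculate operands, fetch operands, execute, write result).

```python
# Six-stage instruction pipeline sketch: with full overlap, instruction n
# enters its first stage in cycle n, so k instructions through s stages
# take k + s - 1 cycles instead of k * s.
PIPE_STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipeline_cycles(k, s=len(PIPE_STAGES)):
    return k + s - 1

def sequential_cycles(k, s=len(PIPE_STAGES)):
    return k * s

print(pipeline_cycles(9), sequential_cycles(9))  # -> 14 54
```

This idealized count ignores branches and data dependencies, which stall the pipeline and reduce the speedup in practice.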

Two Stage Instruction Pipeline

114

Timing Diagram

115

Effect of a Conditional Branch Instruction

116

Alternative Pipeline Depiction

117
