
Chapter 5 Parallel Processing

Multiple Processor Organization


Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream - MIMD

SISD
Single processor executes a single instruction stream to operate on data stored in a single memory
Uni-processor

SIMD
Single machine instruction controls simultaneous execution of a number of processing elements
Each instruction is executed on a different set of data by different processors
Vector and array processors
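A minimal sketch of the SIMD idea (pure Python standing in for hardware lanes; the function name is illustrative): one operation applied across several data elements in lockstep.

```python
# Illustrative sketch of SIMD: one "instruction" (a single operation)
# applied across multiple data lanes, as an array processor would.
def simd_apply(op, lanes_a, lanes_b):
    # One operation, many data elements - every lane gets the same instruction.
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]

result = simd_apply(lambda a, b: a + b, [1, 2, 3, 4], [10, 20, 30, 40])
print(result)  # -> [11, 22, 33, 44]
```

In real SIMD hardware the lanes execute in the same clock cycle; the list comprehension here only models the single-instruction, multiple-data relationship, not the parallel timing.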

MISD
Sequence of data transmitted to set of processors Each processor executes different instruction sequence Never been implemented

MIMD
Set of processors simultaneously execute different instruction sequences on different sets of data SMPs, clusters and NUMA systems

MIMD - Overview
General purpose processors Each can process all instructions necessary Further classified by method of processor communication

Taxonomy of Parallel Processor Architectures

Tightly Coupled - SMP


Processors share memory and communicate via that shared memory
Symmetric Multiprocessor (SMP):
Share a single memory or pool of memory
Shared bus to access memory
Memory access time to a given area of memory is approximately the same for each processor

NUMA
Non-uniform memory access
Access times to different regions of memory may differ
7

Loosely Coupled - Clusters


Collection of independent uniprocessors or SMPs interconnected to form a cluster Communication via fixed path or network connections

Parallel Organizations
SISD

SIMD

MIMD (Shared Memory)

10

MIMD (Distributed Memory)

11

Symmetric Multiprocessors
A stand-alone computer with the following characteristics:
Two or more similar processors of comparable capacity
Processors share the same memory and are connected by a bus or other internal connection such that memory access time is approximately the same for each processor
All processors share access to I/O
All processors can perform the same functions (symmetric)
System controlled by an integrated operating system that provides interaction between processors and their programs
12

Multiprogramming and Multiprocessing

13

SMP Advantages
Performance
If some work can be done in parallel

Availability
Since all processors can perform the same functions, failure of a single processor does not halt the system

Incremental growth
Increase performance by adding additional processors

Scaling
Vendors can offer range of products based on number of processors
14

Block Diagram of Tightly Coupled Multiprocessor

15

Time Shared Bus


Most common organization; it is simple
Structure and interface similar to a single-processor system
Features provided:
Addressing - distinguish modules on the bus to determine source and destination of data
Arbitration - any module can be a temporary bus master
Time sharing - if one module has the bus, others must wait and may have to suspend

16

Symmetric Multiprocessor Organization

17

Time Share Bus - Advantages


Simplicity
Simplest approach for multiprocessor organization

Flexibility
Easy to expand the system by attaching more processors to the bus.

Reliability
Bus is a passive medium, and the failure of any attached device should not cause failure of the whole system

18

Time Shared Bus - Disadvantage


Performance
Limited by bus cycle time because all references pass through the bus

Each processor should have local cache


Reduce number of bus accesses

Leads to problems with cache coherence


When the cache in one processor is altered, the caches of the other processors must be informed of the change so they do not use stale data

19
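The coherence problem can be sketched with a toy write-invalidate scheme (a simplification: write-through is assumed for brevity, real protocols such as MESI track more line states, and the class and function names are illustrative).

```python
# Toy write-invalidate coherence sketch: when one processor writes a line,
# copies of that line in the other caches are invalidated, so no processor
# can read a stale value.
class Cache:
    def __init__(self):
        self.lines = {}                       # address -> cached value

    def read(self, addr, memory):
        if addr not in self.lines:
            self.lines[addr] = memory[addr]   # miss: fetch over the bus
        return self.lines[addr]

def write(writer, caches, memory, addr, value):
    memory[addr] = value                      # write-through to memory
    writer.lines[addr] = value
    for c in caches:                          # invalidate every other copy
        if c is not writer and addr in c.lines:
            del c.lines[addr]

memory = {0x10: 5}
c0, c1 = Cache(), Cache()
c0.read(0x10, memory)
c1.read(0x10, memory)                 # both caches now hold the line
write(c0, [c0, c1], memory, 0x10, 7)  # c0 writes; c1's copy is invalidated
print(c1.read(0x10, memory))          # -> 7 (re-fetched, not the stale 5)
```

The extra bus traffic needed to broadcast the invalidations is exactly the coherence cost the slide alludes to.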

Mathematical problems involving physical processes are difficult for computation


Aerodynamics, seismology, meteorology, atomic, nuclear Continuous field simulation

Vector Computation

High-precision repeated floating point calculations on large arrays of numbers
Supercomputers handle these types of problems
Hundreds of millions of floating point operations per second
$10-15 million
Optimised for calculation
Limited market: research, government agencies, meteorology
20

Another system designed for vector computation - the array processor


Alternative to supercomputer Configured as peripherals to mainframe & mini computers Just run vector portion of problems

21

Vector Addition Example

22

Processor Designs
Pipelined ALU
Decomposition of floating point operations into stages
Different stages can operate on different sets of data in parallel

Can be further enhanced if the vector elements are available in registers rather than from main memory Within operations Across operations
23
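The throughput gain of a pipelined ALU can be sketched numerically (the stage names follow the usual decomposition of a floating-point add; the exact stages vary by machine).

```python
# Pipelined ALU sketch: a floating-point add decomposed into stages, with a
# new pair of vector elements entering the pipeline every cycle. For n element
# pairs and s stages, the pipeline finishes in n + s - 1 cycles instead of n*s.
ALU_STAGES = ["compare exponents", "align significands",
              "add significands", "normalize result"]

def pipelined_cycles(n_elements, n_stages=len(ALU_STAGES)):
    # First result appears after n_stages cycles; one more per cycle after that.
    return n_elements + n_stages - 1

def unpipelined_cycles(n_elements, n_stages=len(ALU_STAGES)):
    return n_elements * n_stages

print(pipelined_cycles(100), unpipelined_cycles(100))  # -> 103 400
```

The advantage grows with vector length, which is why this organization suits long arrays of operands.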

Approaches to Vector Computation

24

Chaining
Used in Cray supercomputers
A vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
The result from one functional unit is fed immediately into another
If vector registers are used, intermediate results do not have to be stored in memory

25

Parallel ALUs

Parallel processors
Break the task up into multiple processes to be executed in parallel
Effective only if the software and hardware support efficient coordination of the parallel processors
26

Operating System Support

27

OS
The OS is a program that controls the execution of application programs and acts as an interface between the user and the hardware. It manages the computer's resources, provides services for programmers, and schedules the execution of other programs.

28

Objectives and Functions


Convenience
Making the computer easier to use

Efficiency
Allowing better use of computer resources

29

Layers and Views of a Computer System

30

Operating System Services


Program creation Program execution Access to I/O devices Controlled access to files System access Error detection and response Accounting

31

O/S as a Resource Manager

32

Types of Operating System


Interactive Batch Single program (Uni-programming) Multi-programming (Multi-tasking)

33

Early Systems
Late 1940s to mid 1950s No Operating System Programs interact directly with hardware Two main problems:
Scheduling Setup time

34

Simple Batch Systems


Resident Monitor program Users submit jobs to operator who batches jobs Monitor controls sequence of events to process batch When one job is finished, control returns to Monitor which reads next job Monitor handles scheduling

35

Memory Layout for Resident Monitor

36

Desirable Hardware Features


Memory protection
To protect the Monitor

Timer
To prevent a job monopolizing the system

Privileged instructions
Only executed by Monitor e.g. I/O

Interrupts
Allows regaining control from user program
37

Multi-programmed Batch Systems


I/O devices very slow When one program is waiting for I/O, another can use the CPU

38
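The benefit can be illustrated with a back-of-envelope calculation (the numbers are assumed for illustration): if a job computes for 1 ms and then waits 9 ms for I/O, a uniprogrammed CPU idles 90% of the time, while interleaving more such jobs fills the idle time.

```python
# Idealized multiprogramming sketch: n identical jobs, each needing
# compute_ms of CPU per (compute_ms + io_ms) cycle. The CPU is busy for
# at most the whole cycle, so utilization saturates at 1.0.
def utilization(compute_ms, io_ms, n_jobs):
    busy = min(n_jobs * compute_ms, compute_ms + io_ms)
    return busy / (compute_ms + io_ms)

print(utilization(1, 9, 1))   # -> 0.1  (uniprogramming: CPU busy 10%)
print(utilization(1, 9, 2))   # -> 0.2  (two jobs overlap compute with I/O)
print(utilization(1, 9, 10))  # -> 1.0  (enough jobs keep the CPU saturated)
```

Real workloads do not overlap this perfectly, but the direction of the effect is the point of the slide.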

Single Program

39

Multi-Programming with Two Programs

40

Multi-Programming with Three Programs

41

Time Sharing Systems


Allow users to interact directly with the computer
i.e. Interactive

Multi-programming allows a number of users to interact with the computer

42

Scheduling
Key to multi-programming Types
Long term Medium term Short term I/O

43

Long Term Scheduling


Determines which programs are submitted for processing i.e. controls the degree of multi-programming Once submitted, a job becomes a process for the short term scheduler

44

Medium Term Scheduling


Part of the swapping function Usually based on the need to manage multiprogramming If no virtual memory, memory management is also an issue

45

Short Term Scheduling


Also known as Dispatcher Fine grained decisions of which job to execute next Which job actually gets to use the processor in the next time slot

46
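One common short-term scheduling policy, round-robin, can be sketched briefly (this is one illustrative policy, not the only dispatcher design): each ready process gets the CPU for one time quantum, then rejoins the back of the ready queue.

```python
# Round-robin dispatcher sketch: processes take turns on the CPU, one
# quantum at a time, until their CPU burst is exhausted.
from collections import deque

def round_robin(bursts, quantum):
    """bursts: {pid: cpu_time_needed}; returns pids in completion order."""
    queue = deque(bursts)          # ready queue, in submission order
    remaining = dict(bursts)
    finished = []
    while queue:
        pid = queue.popleft()      # dispatch the process at the head
        remaining[pid] -= quantum
        if remaining[pid] > 0:
            queue.append(pid)      # quantum expired: back of the queue
        else:
            finished.append(pid)   # burst complete: process exits
    return finished

print(round_robin({"A": 3, "B": 1, "C": 2}, quantum=1))  # -> ['B', 'C', 'A']
```

Short bursts finish early under round-robin, which is why it suits interactive time-sharing systems.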

Five State Process Model

47
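The five-state model can be captured as a transition table (the states and transitions follow the standard model; the dictionary encoding and event names are illustrative).

```python
# Five-state process model as a transition table: New, Ready, Running,
# Blocked, Exit. Any (state, event) pair not listed is an illegal transition.
TRANSITIONS = {
    ("new", "admit"): "ready",
    ("ready", "dispatch"): "running",
    ("running", "timeout"): "ready",
    ("running", "event wait"): "blocked",
    ("blocked", "event occurs"): "ready",
    ("running", "release"): "exit",
}

def step(state, event):
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {event!r} in state {state!r}")
    return TRANSITIONS[key]

s = step("new", "admit")        # new -> ready
s = step(s, "dispatch")         # ready -> running
s = step(s, "event wait")       # running -> blocked (e.g. waiting for I/O)
print(s)  # -> blocked
```

Note that a blocked process cannot be dispatched directly; its event must occur first, returning it to the ready queue.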

PCB Diagram

48

Scheduling Example

49

Key Elements involved in scheduling

50

Process Scheduling

51

Reduced Instruction Set Computers

52

Major Advances in Computers (1)

The family concept
IBM System/360 in 1964
DEC PDP-8

Microprogrammed control unit
Idea by Wilkes in 1951
Produced in IBM S/360 in 1964

Cache memory
IBM S/360 model 85 in 1969

Pipelining
Introduces parallelism into fetch execute cycle

Multiple processors
53

The Next Step - RISC


Reduced Instruction Set Computer Key features
Large number of general purpose registers Limited and simple instruction set Emphasis on optimising the instruction pipeline

54

Comparison of processors

55

Driving force for CISC


Software costs far exceed hardware costs Increasingly complex high level languages Semantic gap
difference between the operations provided in HLLs and those provided in computer architecture.

Leads to:
Large instruction sets
More addressing modes
Hardware implementation of HLL statements, e.g. the CASE (switch) machine instruction on the VAX
56

Execution Characteristics
Studies have been done to determine the characteristics of execution of machine instructions generated from HLL programs
These motivate a different approach: make the architecture that supports the HLL simpler, rather than more complex
Aspects studied:
Operations performed
Operands used
Execution sequencing

57

Operations
Assignments - predominate
Movement of data is of high importance

Conditional statements (IF, LOOP)


Sequence control Implemented in machine language

Procedure call-return is very time consuming

58

59

Operands
Mainly local scalar variables Optimisation should concentrate on accessing local variables

60

Procedure Calls
Very time consuming Depends on number of parameters passed Depends on level of nesting Most programs do not do a lot of calls followed by lots of returns

61

Implications
Attempting to make the instruction set architecture close to HLLs is not the most effective approach
Best support is given by optimising the most used and most time consuming features
Large number of registers
Careful design of pipelines
Simplified (reduced) instruction set

62

Why CISC (1)?


Compiler simplification?
Complex machine instructions harder to exploit Optimization more difficult

Smaller programs?
Program takes up less memory, but memory is now cheap
May not occupy fewer bits, just look shorter in symbolic form

63

Why CISC (2)?


Faster instruction execution?
More complex control unit
Microprogram control store larger
So even simple instructions may take longer to execute

It is far from clear that a trend to increasingly complex instruction sets is appropriate

64

RISC Characteristics
One instruction per cycle Register to register operations Few, simple addressing modes Few, simple instruction formats

65

RISC v CISC
Not clear cut Many designs borrow from both philosophies e.g. PowerPC and Pentium II

66

RISC Pipelining
Most instructions are register to register
Two phases of execution:
I: Instruction fetch
E: Execute (ALU operation with register input and output)

For load and store, three phases:
I: Instruction fetch
E: Execute (calculate memory address)
D: Memory (register-to-memory or memory-to-register operation)
67
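The phase structure above can be sketched directly (the instruction names are illustrative; any register-to-register operation takes the two-phase path).

```python
# Sketch of the simple RISC pipeline phases: register-to-register
# instructions use two phases (I, E); loads and stores add a memory
# phase (D) after the address calculation.
def phases(instr):
    if instr in ("load", "store"):
        return ["I", "E", "D"]   # fetch, address calculation, memory access
    return ["I", "E"]            # fetch, ALU operation on registers

for instr in ["add", "load", "sub", "store"]:
    print(instr, phases(instr))
```

Keeping most instructions to the same short two-phase pattern is what makes the pipeline easy to keep full.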

Effects of Pipelining

68

Optimization of Pipelining

Delayed branch
Makes use of a branch that does not take effect until after execution of the following instruction

Delayed load
Register that is the target of the load is locked by the processor
Execution of the instruction stream continues until the register is required, then idles until the load completes
Re-arranging instructions can allow useful work during the wait

Loop unrolling
Replicate the body of a loop a number of times
Reduces loop overhead
Increases instruction parallelism
Improves register, data cache or TLB locality

69
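Loop unrolling can be shown as a source-level transformation (a sketch with an unroll factor of 2; the function names and data are illustrative): the unrolled version does the same work with half as many loop-control steps and two independent operations per iteration.

```python
# Loop unrolling sketch: same computation, fewer loop iterations.
def scaled_sum(a, k):
    total = 0
    for i in range(len(a)):            # original: one element per iteration
        total += k * a[i]
    return total

def scaled_sum_unrolled(a, k):
    total = 0
    for i in range(0, len(a) - 1, 2):  # unrolled by 2: two elements per pass
        total += k * a[i]
        total += k * a[i + 1]
    if len(a) % 2:                     # clean-up for an odd trip count
        total += k * a[-1]
    return total

data = [1, 2, 3, 4, 5]
print(scaled_sum(data, 3), scaled_sum_unrolled(data, 3))  # -> 45 45
```

In Python the benefit is only conceptual; on pipelined hardware the two additions in each unrolled pass can be scheduled to overlap, which is the point of the optimization.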

Delayed branch

70

Use of Delayed Branch

71

72

Controversy
Quantitative
compare program sizes and execution speeds

Qualitative
examine issues of high level language support

Problems
No pair of RISC and CISC machines is directly comparable
No definitive set of test programs
Most comparisons done on toy machines rather than production machines
Most commercial devices are a mixture of both

73

Control Unit Operation

74

Micro-Operations
A computer executes a program using the fetch/execute cycle
Each cycle has a number of steps (see pipelining)
These steps are called micro-operations
Each micro-operation does very little

75

Constituent Elements of Program Execution

76

Fetch - 4 Registers
Memory Address Register (MAR)
Connected to address bus Specifies address for read or write op

Memory Buffer Register (MBR)


Connected to data bus Holds data to write or last data read

Program Counter (PC)


Holds address of next instruction to be fetched

Instruction Register (IR)


Holds last instruction fetched
77

Fetch Sequence
Address of next instruction is in PC; it is moved to MAR
Control unit issues READ command
Result (data from memory) appears on data bus
Data from data bus copied into MBR
PC incremented by 1 (in parallel with data fetch from memory)
Data (instruction) moved from MBR to IR

78
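The register traffic above can be sketched with plain variables standing in for MAR, MBR, PC and IR, and a list standing in for memory (the instruction words are assumed values for illustration).

```python
# Fetch sequence sketch: Python variables stand in for the four registers
# and a list stands in for memory.
memory = [0x1940, 0x5941, 0x2941]   # assumed instruction words
PC = 0

MAR = PC                  # t1: MAR <- (PC)
MBR = memory[MAR]         # t2: MBR <- (memory)   (READ command issued)
PC = PC + 1               #     PC  <- (PC) + 1   (in parallel with the read)
IR = MBR                  # t3: IR  <- (MBR)

print(hex(IR), PC)  # -> 0x1940 1
```

Note the second and third assignments model micro-operations that the hardware performs in the same time unit; Python necessarily runs them one after the other.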

Fetch Sequence (symbolic)

t1: MAR <- (PC)
t2: MBR <- (memory)
    PC <- (PC) + 1
t3: IR <- (MBR)

The sequence consists of three steps and four micro-operations; the second and third micro-operations both take place during the second time unit

79

Rules for groupings of micro-operations


Proper sequence must be followed
MAR <- (PC) must precede MBR <- (memory)

Conflicts must be avoided


Must not read & write same register at same time MBR <- (memory) & IR <- (MBR) must not be in same cycle

Also: PC <- (PC) +1 involves addition


Use ALU May need additional micro-operations

80

Indirect Cycle
MBR contains the address of the operand, which must now be fetched:
MAR <- (IR(address))
MBR <- (memory)
IR(address) <- (MBR(address))

IR is now in the same state as if direct addressing had been used

81

Interrupt Cycle

This is a minimum
May be additional micro-ops to get addresses Saving context is done by interrupt handler routine, not micro-ops

82

Execute Cycle (ADD)


Different for each instruction
e.g. ADD R1,X - add the contents of location X to register R1, result in R1:
t1: MAR <- (IR(address))
t2: MBR <- (memory)
t3: R1 <- (R1) + (MBR)

83

Instruction Cycle
Each phase decomposed into sequence of elementary micro-operations E.g. fetch, indirect, and interrupt cycles Assume new 2-bit register
Instruction cycle code (ICC) designates which part of cycle processor is in
00: Fetch 01: Indirect 10: Execute 11: Interrupt

84
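The ICC-driven sequencing can be sketched as a small state machine (the transition logic mirrors the cycle order described above: fetch, optional indirect, execute, optional interrupt, back to fetch).

```python
# 2-bit instruction cycle code (ICC) sketch: the ICC designates which part
# of the cycle the processor is in, and is updated at the end of each part.
FETCH, INDIRECT, EXECUTE, INTERRUPT = 0b00, 0b01, 0b10, 0b11

def next_icc(icc, indirect_needed=False, interrupt_pending=False):
    if icc == FETCH:
        return INDIRECT if indirect_needed else EXECUTE
    if icc == INDIRECT:
        return EXECUTE            # operand address resolved; now execute
    if icc == EXECUTE:
        return INTERRUPT if interrupt_pending else FETCH
    return FETCH                  # interrupt cycle always returns to fetch

icc = FETCH
icc = next_icc(icc, indirect_needed=True)   # -> INDIRECT (0b01)
icc = next_icc(icc)                         # -> EXECUTE  (0b10)
icc = next_icc(icc)                         # -> FETCH    (0b00)
print(icc)  # -> 0
```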

Flowchart for Instruction Cycle

85

Functional Requirements
Define basic elements of processor Describe micro-operations processor performs Determine the functions that the control unit must perform to cause the micro-operations to be performed

86

Basic Elements of Processor


ALU
Registers
Internal data paths
External data paths
Control Unit

87

Types of Micro-operation
Transfer data between registers Transfer data from register to external interface Transfer data from external interface to register Perform arithmetic or logical operations

88

Functions of Control Unit


Sequencing
Causing the CPU to step through a series of microoperations

Execution
Causing the performance of each micro-op

This is done using Control Signals

89

Control Signals
Clock
This is how the control unit keeps time.

Instruction register
Op-code for current instruction Determines which micro-instructions are performed

Flags
Status of CPU Results of previous ALU operations

Control signals from control bus


Interrupts Acknowledgements
90

Model of Control Unit

91

Control Signals - output


Within CPU
Cause data movement Activate specific ALU functions

To control bus
To memory To I/O modules

92

Implementation
Two categories:
Hardwired implementation
Microprogrammed implementation

In a hardwired implementation, the control unit is essentially a combinational circuit: input logic signals are transformed into a set of output logic signals, which are the control signals

93

CPU Structure and Function

94

CPU Structure
CPU must:
Fetch instructions Interpret instructions Fetch data Process data Write data

95

CPU With Systems Bus

96

CPU Internal Structure

97

Registers
The CPU must have some working space (temporary storage), called registers
Number and function vary between processor designs
Registers form the top level of the memory hierarchy
They perform two roles:
User-visible registers
Control and status registers

98

User Visible Registers


General Purpose
Data
Address
Condition Codes (Flags)

99

User Visible Registers


May be true general purpose
May be restricted
May be used for data or addressing
Data: e.g. accumulator
Addressing: e.g. segment registers, index registers, stack pointer
100

How Many GP Registers?


Between 8 and 32
Fewer means more memory references
More does not noticeably reduce memory references

How big?
Large enough to hold full address Large enough to hold full word Often possible to combine two data registers

101

Condition Code Registers


Sets of individual bits
e.g. result of last operation was zero

Can be read (implicitly) by programs


e.g. Jump if zero

Cannot be set by programs

102

Control & Status Registers


Program Counter (PC)
Instruction Register (IR)
Memory Address Register (MAR)
Memory Buffer Register (MBR)

103

Program Status Word


A set of bits including condition codes:
Sign
Zero
Carry
Equal
Overflow
Interrupt enable/disable
Supervisor
104

Other Registers
May have registers pointing to:
Process control blocks Interrupt Vectors

105

Example Register Organizations

106

Indirect Cycle
May require memory access to fetch operands Indirect addressing requires more memory accesses Can be thought of as additional instruction subcycle

107

Data Flow (Instruction Fetch)


Depends on CPU design Fetch
PC contains address of next instruction
Address moved to MAR
Address placed on address bus
Control unit requests memory read
Result placed on data bus, copied to MBR, then to IR
Meanwhile PC incremented by 1

108

Data Flow (Data Fetch)


IR is examined If indirect addressing, indirect cycle is performed
Rightmost N bits of MBR transferred to MAR
Control unit requests memory read
Result (address of operand) moved to MBR

109

Data Flow (Fetch Diagram)

110

Data Flow (Indirect Diagram)

111

Data Flow (Interrupt Diagram)

112

Pipelining
Fetch instruction Decode instruction Calculate operands (i.e. EAs) Fetch operands Execute instructions Write result

Overlap these operations

113
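The gain from overlapping the six stages can be sketched numerically (stage abbreviations follow the list above: fetch instruction, decode, calculate operands, fetch operands, execute, write result).

```python
# Six-stage instruction pipeline sketch: with full overlap, instruction n
# enters its first stage in cycle n, so k instructions through s stages
# take k + s - 1 cycles instead of k * s.
PIPE_STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipeline_cycles(k, s=len(PIPE_STAGES)):
    return k + s - 1

def sequential_cycles(k, s=len(PIPE_STAGES)):
    return k * s

print(pipeline_cycles(9), sequential_cycles(9))  # -> 14 54
```

This idealized count ignores branches and data dependencies, which stall the pipeline and reduce the speedup in practice.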

Two Stage Instruction Pipeline

114

Timing Diagram

115

Effect of a Conditional Branch Instruction

116

Alternative Pipeline Depiction

117
