
MODULE-6

1
MULTITHREADED AND
DATAFLOW ARCHITECTURE

2
INTRODUCTION
 In computer architecture, multithreading is the
ability of a central processing unit (CPU) (or a
single core in a multi-core processor) to execute
multiple processes or threads concurrently,
supported by the operating system

3
LATENCY HIDING TECHNIQUES
 In distributed shared memory machines, access to
remote memory is likely to be slow
 So architectures must rely on techniques to
reduce/hide remote-memory-access latencies.
 Latency hiding is a technique that overlaps computation with communication and remote memory access, so that the time spent waiting is hidden from the processor
 4 approaches for latency hiding are:-

 1. Using pre-fetching techniques
 2. Using coherent caching techniques
 3. Using multiple contexts
 4. Using relaxed memory consistency
4
1. PRE-FETCHING TECHNIQUES

 This technique reduces latency by bringing instructions or data closer to the processor before they are actually needed
 It uses knowledge about the expected misses in the program
 Using this knowledge, it moves the corresponding data closer to the processor before it is actually needed

5
TECHNIQUES OF PRE-FETCHING

 Hardware controlled pre-fetching


 Software controlled pre-fetching

6
HARDWARE CONTROLLED PRE-FETCHING
 Hardware-controlled pre-fetching is done using 2
schemes
 Using long cache lines
 Using instruction look-ahead

 Long cache lines introduce the problem of "false sharing"
 Instruction look-ahead is limited by
 branch-prediction accuracy
 Finite look-ahead buffer size

7
HARDWARE CONTROLLED PRE-FETCHING [2]

8
FALSE SHARING : EXAMPLE

9
SOFTWARE CONTROLLED PRE-FETCHING
 In this approach, explicit "pre-fetch" instructions are issued for data that is "known" to be remote (sketched below)
 Pre-fetching is done selectively

 Disadvantages
 Extra instruction overhead
 Need of sophisticated software intervention
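A minimal sketch of the idea in Python (the names remote_read, process and the pre-fetch distance are hypothetical stand-ins, not any real machine's mechanism): pre-fetch requests are issued a few iterations ahead of their use, so the remote-access latency overlaps with useful work. The extra submit() calls correspond to the instruction overhead noted above.

import time
from concurrent.futures import ThreadPoolExecutor

def remote_read(addr):
    time.sleep(0.01)          # stand-in for remote-memory latency L
    return addr * 2           # dummy data

def process(value):
    return value + 1          # stand-in for the real computation

def compute_with_prefetch(addresses, distance=4):
    results = []
    with ThreadPoolExecutor(max_workers=distance) as pool:
        # Issue the first `distance` pre-fetches before any use.
        pending = [pool.submit(remote_read, a) for a in addresses[:distance]]
        for i, _ in enumerate(addresses):
            # Pre-fetch the block that will be needed `distance` steps later.
            if i + distance < len(addresses):
                pending.append(pool.submit(remote_read, addresses[i + distance]))
            value = pending[i].result()   # usually ready: the latency was hidden
            results.append(process(value))
    return results

print(compute_with_prefetch(list(range(16))))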

10
ISSUE OF COHERENCE IN PRE-FETCHING
 We have to make decisions regarding what should
be done
 if a block is updated after it has been pre-fetched, but
before it has been used
 To address these issues of coherence while
performing pre-fetching, 2 approaches are used
 Binding pre-fetching
 Non-binding pre-fetching

11
BINDING PRE-FETCH POLICY
 Fetch is assumed to have happened when the pre-fetch instruction is issued

 It is the responsibility of the "fetching process" to ensure that no other processor will update the pre-fetched value before it is actually used

12
NON-BINDING PRE-FETCH POLICY
 The cache coherence protocol will make sure to
invalidate a pre-fetched value if it is updated
prior to its use.
 The pre-fetched data remains visible to the cache
coherence protocol
 Data is kept consistent until the processor
actually reads the value

13
BENEFITS OF PRE-FETCHING
 Improves the performance

 Pre-fetching a cache block directly with ownership reduces write latencies and network traffic

14
2. USING COHERENT CACHING TECHNIQUES
 Cache coherence is maintained using snoopy bus protocols for bus-based systems
 Coherence is maintained using a directory-based protocol for n/w-based systems

15
BENEFITS OF CACHING
 Improves the performance
 Reduces the no: of cycles wasted due to read
misses
 Cycles wasted due to write misses are also
reduced

16
3. MULTIPLE CONTEXT PROCESSORS
 A conventional single-threaded processor will wait during a remote reference
 So the processor remains idle for a period of time L
 A multithreaded processor will suspend the current context and switch to another context
 So for a fixed no: of cycles it will again be busy doing useful work, even though the remote reference is outstanding
 The processor will be idle only if all contexts are blocked

17
OBJECTIVE OF CONTEXT SWITCHING
 Maximize the fraction of time that the processor
is busy
 Efficiency of the processor is defined as:-

 Efficiency = busy / (busy + switching + idle)

Busy, switching and idle represent the amount of time the processor remains in the corresponding state
18
 The basic idea of a multithreaded machine is to
 Interleave the execution of several contexts in order to reduce the value of idle without increasing the magnitude of switching

19
STATES OF CONTEXT CYCLE
 A context cycles through the following states
 Ready
 Running
 Leaving
 Blocked

 A processor will be busy if there is a context in the running state
 If the processor is switching from one context to another, then the context is in the leaving state
 If the processor is in idle state, then it implies that
all contexts are blocked
20
PRINCIPLES OF MULTITHREADING

21
MULTITHREADING ISSUES AND SOLUTIONS
 Multithreading requires a processor to handle multiple contexts concurrently by switching among them
 Architecture environment
 Consider a multithreaded system modeled as a n/w of processor & memory nodes
 Processor P
 Memory M

 Distributed memory forms a global address space

22
23
PARAMETERS USED TO ANALYZE PERFORMANCE
 Latency (L)
 It involves the communication latency on a remote
memory access
 The value of L includes
 n/w delays
 Cache miss penalty

 Delays caused by contentions in split transactions

 No: of threads (N)
 It is the no: of threads that can be interleaved in each processor
 A thread is represented by a context
 A context consists of
 PC
 Register set
 Required context status words
24


 Context switching overhead (C)
 This refers to the cycles lost in performing context switching in a processor
 This depends on
 Switching mechanism
 Amount of processor state devoted to maintaining active threads
 Interval between switches (R)
 This refers to the cycles between switches triggered by remote references
 The inverse p = 1/R is called the rate of requests for remote accesses
 This reflects the combination of
 Program behavior
 Memory s/m design
25
PROBLEMS OF ASYNCHRONY

 It is a problem found commonly in parallel processors
 Parallel processors operate asynchronously in a n/w environment
 This leads to 2 latency problems
 Remote loads
 Synchronizing load

26
REMOTE LOADS
 Suppose that variables A and B are located on nodes N2 and N3
 They need to be brought to node N1 in order to compute A-B into variable C
 This computation demands 2 remote loads and a subtraction
 Remote load instruction: rload
 pA and pB are pointers to A and B
 CTXT is the context of the computation on N1
 Remote loads keep node N1 idle until it receives the variables A and B from N2 and N3
 Latency caused by a remote load depends on the architectural properties
27
28
SYNCHRONIZING LOADS
 In this case idling is caused because A and B are computed by concurrent processes
 It is not certain when they will be ready for node N1 to read
 The ready signal may reach node N1 asynchronously
 This may result in busy-waiting
 The process repeatedly checks whether A and B are ready
 Latency caused by a synchronizing load depends on
 Scheduling
 Time taken to compute A & B
29
SOLUTIONS TO ASYNCHRONY PROBLEM
 2 solutions
 Multithreading solution
 Distributed caching

30
MULTITHREADING SOLUTION
 The solution is to multiplex among many threads
 When one thread issues a remote load request, the processor begins work on another thread, and so on
 The cost of thread switching is smaller than the latency of a remote load
 Concerns of this approach
 Responses of remote loads may not return in the proper order
 i.e. after issuing a remote load from thread T1, we switch to T2, which also issues a remote load
 Responses may not return in the same order because
 Requests may travel different distances
 Varying degrees of congestion
 Loads of destination nodes may differ greatly
31
32
 Solution to this concern
 Make sure that messages carry continuations
 Each remote load and response is associated with an identifier for the appropriate thread
 Thread identifiers are referred to as continuations on messages (sketched below)
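A rough sketch of this idea (the names are hypothetical; a real machine carries the tag in the message header rather than in a Python tuple): each request and its response carry the issuing thread's identifier, so responses that arrive out of order still resume the correct thread.

import queue, threading, time, random

network = queue.Queue()        # responses arrive here, possibly out of order

def remote_load(thread_id, address):
    """Send a request tagged with the issuing thread's continuation."""
    def serve():
        time.sleep(random.uniform(0.01, 0.05))            # variable network latency
        network.put((thread_id, address, address * 10))   # response carries the tag
    threading.Thread(target=serve).start()

waiting = {}                   # continuation -> outstanding address
for tid, addr in [("T1", 7), ("T2", 3)]:
    remote_load(tid, addr)
    waiting[tid] = addr

while waiting:
    tid, addr, value = network.get()   # may arrive in either order
    print(f"resume thread {tid}: value for address {addr} is {value}")
    del waiting[tid]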

33
DISTRIBUTED CACHING

 This is a solution for the remote load problem
 Every memory location has an owner node
 N1 owns B
 N2 owns A
 Directories contain
 Import-export lists
 These state whether data is shared or exclusive
 Shared: many caches may hold copies (used for reads)
 Exclusive: one cache holds the current value (used for writes), as sketched below

34
35
MULTIPLE CONTEXT PROCESSORS
 A conventional single-threaded processor will wait during a remote reference
 So the processor remains idle for a period of time L
 A multithreaded processor will suspend the current context and switch to another context
 So for a fixed no: of cycles it will again be busy doing useful work, even though the remote reference is outstanding
 The processor will be idle only if all contexts are blocked

36
OBJECTIVE OF CONTEXT SWITCHING
 Maximize the fraction of time that the processor
is busy
 Efficiency of the processor is defined as:-

 Efficiency = busy / (busy + switching + idle)

Busy, switching and idle represent the amount of time the processor remains in the corresponding state
37
 The basic idea of a multithreaded machine is to
 Interleave the execution of several contexts in order to reduce the value of idle without increasing the magnitude of switching

38
STATES OF CONTEXT CYCLE
 A context cycles through the following states
 Ready
 Running
 Leaving
 Blocked

 A processor will be busy if there is a context in the running state
 If the processor is switching from one context to another, then the context is in the leaving state
 If the processor is in idle state, then it implies
that all contexts are blocked
39
CONTEXT SWITCHING POLICIES
 Switch on cache miss
 Switch on every load

 Switch on every instruction

 Switch on block of instruction

40
SWITCH ON CACHE MISS
 In this case a context is switched when a cache miss occurs
 R is the average interval b/w the misses
 L is the time required to satisfy the miss
 The processor switches the context only when it is certain that the current one will be delayed for a significant no: of cycles

41
SWITCH ON EVERY LOAD
 This policy allows switching on every load,
independent of whether it will cause a miss or
not
 R represents average interval between the loads

42
SWITCH ON EVERY INSTRUCTION
 This policy allows switching on every instruction,
independent of whether it is load or not
 It interleaves the instructions from different
threads on a cycle by cycle basis
 If successive instructions are independent, then
it will benefit pipelined execution

43
SWITCH ON BLOCK OF INSTRUCTION
 Blocks of instructions from different threads are
interleaved
 Benefits single-context performance (the four policies are sketched below)
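The four policies differ only in the event that triggers a switch. A small sketch (the event names are hypothetical labels, not any particular machine's signals):

def should_switch(policy, event, instructions_since_switch, block_size=4):
    if policy == "switch_on_cache_miss":
        return event == "cache_miss"            # switch only on a known long delay
    if policy == "switch_on_every_load":
        return event == "load"                  # switch whether or not it misses
    if policy == "switch_on_every_instruction":
        return True                             # cycle-by-cycle interleaving
    if policy == "switch_on_block":
        return instructions_since_switch >= block_size
    raise ValueError(policy)

print(should_switch("switch_on_cache_miss", "load", 1))   # False: a load that hit
print(should_switch("switch_on_every_load", "load", 1))   # True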

44
PROCESSOR EFFICIENCIES
 A single-threaded processor executes a context until a remote reference is issued
 It then remains idle until the reference completes (L cycles)
 There is no context switch, hence no switch overhead
 Efficiency of a single-threaded machine is:-
 E1 = R / (R + L) = 1 / (1 + L/R)
R is the amount of time during a cycle that the processor is busy
L is the amount of time during a cycle that the processor is idle
45
EFFICIENCY OF MULTITHREADED PROCESSOR
 Memory latency is hidden due to context switching
 But there is a switching overhead of C cycles
 Processor efficiency is analyzed under 2 situations
 Efficiency in the saturation region
 Efficiency in the linear region

46
SATURATION REGION
 In the saturation region, the processor operates with maximum utilization
 Esat = R / (R + C) = 1 / (1 + C/R)
 Efficiency in saturation is independent of latency
 It does not change with a further increase in the no: of contexts

47
48
LINEAR REGION
 When the no: of contexts is below the saturation point, there may be no ready contexts after a context switch
 So the processor will experience idle cycles
 The time required to switch to a ready context, execute it until a remote reference is issued, and process the reference is equal to:-
 R + C + L
 Elin = NR / (R + C + L)
 N is below the saturation point
 Efficiency increases linearly with the no: of contexts until the saturation point is reached
 Beyond the saturation point it remains constant (all three formulas are sketched below)
49
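A small sketch that evaluates these formulas with the parameters defined earlier (R, L, C, N). The numeric values are illustrative only, and saturation is modeled simply by capping Elin at Esat:

def single_threaded_efficiency(R, L):
    return R / (R + L)                  # E1 = 1 / (1 + L/R)

def saturation_efficiency(R, C):
    return R / (R + C)                  # Esat = 1 / (1 + C/R)

def linear_region_efficiency(N, R, C, L):
    return N * R / (R + C + L)          # Elin, valid while N is below saturation

def multithreaded_efficiency(N, R, C, L):
    # Saturation is reached once N*R covers the whole R + C + L cycle.
    return min(linear_region_efficiency(N, R, C, L), saturation_efficiency(R, C))

R, L, C = 40, 200, 10
print(single_threaded_efficiency(R, L))              # ~0.17
for n in (1, 2, 4, 6, 8):
    print(n, round(multithreaded_efficiency(n, R, C, L), 3))   # rises linearly, then flattens at 0.8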
FINE GRAIN MULTICOMPUTERS

50
INTRODUCTION
 Parallelism can be classified into three categories
 1. Fine-grained
 2. Medium-grained
 3. Coarse-grained parallelism

 Classification is based on the amount of work which is performed by a parallel task
 Granularity of a task is the amount of computation or work performed by the task
 Granularity can also be specified in terms of the execution time of a program
 It is obtained by combining the computation time and communication time
51
1. FINE GRAINED PARALLELISM
 In fine-grained parallelism, a program is broken down into a set of small tasks.
 These tasks are assigned individually to many
processors.
 The amount of work associated with a parallel task is
low
 Each task processes less data
 No: of processors required to perform the complete
processing is high.
 This increases the communication and synchronization
overhead.
 Fine grained parallelism is utilized in following
parallel computers
 CM-2
 Message-driven J-Machine
 Mosaic C
52
2. MEDIUM GRAINED PARALLELISM
 Medium-grained parallelism is a compromise
between fine-grained and coarse-grained
parallelism
 Task size and communication time are greater than in fine-grained parallelism and lower than in coarse-grained parallelism.
 Eg: Intel iPSC/1

53
3. COARSE GRAINED PARALLELISM
 In coarse-grained parallelism, a program is split
into large tasks.
 Large amount of computation takes place in
processors.
 This results in load imbalance
 Certain tasks process the bulk of the data while others might be idle.
 Coarse-grained parallelism fails to fully exploit the parallelism in the program
 The advantage of this type of parallelism is low communication and synchronization overhead.
 Eg: Cray Y-MP
54
CHARACTERISTICS OF PARALLEL MACHINES
 Latency analysis
 Communication latency (Tc)
 This measures the data or message transfer time on a system interconnect
 In the Cray Y-MP, latency implies the shared-memory access time
 In the CM-2 machine, latency implies the time required to send a 32-bit value across a hypercube n/w
 In the iPSC/1 and J-Machine, latency implies the latency in the n/w

55
 Synchronization overhead (Ts)
 It is the processing time required on a processor, PE, or on a processing node of a multicomputer for the purpose of synchronization
 Total time for IPC (interprocessor communication)
 It is the sum of Tc and Ts
 TIPC = Tc + Ts

56
COMPARISON OF CHARACTERISTICS

 Cray Y-MP has short Tc but long Ts
 CM-2 has short Ts but long Tc
 iPSC/1 has longer delays
 Major improvements were made in the J-Machine
57
GRAIN SIZE IN FINE GRAINED PARALLELISM
Grain size (Tg)
 It is measured by the execution time of a typical program grain (task)
 It includes both computing time and communication time
 Fine grain leads to a higher degree of parallelism and higher communication overhead

58
DATAFLOW AND HYBRID
ARCHITECTURE

68
INTRODUCTION

 Computers follow 2 types of program execution mechanisms
 Control flow mechanism
 Dataflow mechanism

69
CONTROL FLOW MECHANISM
 In the control flow mechanism, the order of program execution is explicitly stated in the user program
 i.e. the programs are executed in sequential order
 Computers use a program counter to sequence the execution of instructions in the program
 This style of program execution is also called control driven
 Because program flow is explicitly controlled by the programmer
 Eg: Conventional Von Neumann computers follow this type of mechanism
70
DATAFLOW MECHANISM
 This mechanism allows execution of any instruction based on data (operand) availability
 Does not follow program order
 Computers that follow the dataflow mechanism are called dataflow computers
 This style of execution is also called data-driven execution
 Does not require a PC or a control sequencer

71
DATAFLOW COMPUTERS
 The execution of an instruction is driven by data availability instead of the guidance of a program counter
 Instructions will be ready for execution whenever their operands are available
 Instructions in a data-driven program are not ordered
 Data are held directly inside the instructions; they are not stored in memory
 Dataflow computers provide a high degree of parallelism

72
 Computational results (tokens) are passed directly
between the instructions
 Data generated by one instruction will be duplicated into
many copies & they are forwarded directly to all needy
instructions
 Tokens once consumed by the instruction will not be
available for other instructions

73
REQUIREMENTS FOR DATAFLOW COMPUTERS
 It requires special mechanisms to:-
 Detect data availability
 Match data tokens with needy instructions
 Enable a chain reaction of asynchronous instruction execution (this firing rule is sketched below)
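A minimal sketch of the data-driven firing rule (the program representation here is hypothetical): an instruction fires as soon as all of its input tokens have arrived, and its result token is forwarded to every consumer instruction, producing the chain reaction.

from collections import deque

# Each instruction: an operation, its arrived input tokens, the number it needs,
# and the consumer instructions that receive its result token.
program = {
    "mul": {"op": lambda a, b: a * b, "inputs": {}, "needs": 2, "consumers": ["add"]},
    "add": {"op": lambda a, b: a + b, "inputs": {}, "needs": 2, "consumers": []},
}

def send_token(name, slot, value, ready):
    instr = program[name]
    instr["inputs"][slot] = value
    if len(instr["inputs"]) == instr["needs"]:   # all operands available
        ready.append(name)                       # instruction becomes fireable

ready = deque()
send_token("mul", 0, 3, ready)
send_token("mul", 1, 4, ready)       # mul is now fireable: no program counter involved
send_token("add", 0, 10, ready)

while ready:
    name = ready.popleft()
    instr = program[name]
    result = instr["op"](*[instr["inputs"][i] for i in sorted(instr["inputs"])])
    print(name, "fired ->", result)
    for consumer in instr["consumers"]:          # chain reaction of execution
        send_token(consumer, 1, result, ready)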

74
DATAFLOW ARCHITECTURE

 The global architecture consists of n processing elements interconnected by an n x n routing n/w
 The entire s/m supports pipelined dataflow operations in all n PE's
 Inter-PE communications are done through the pipelined routing n/w

75
76
INTERIOR DESIGN OF PE
 Within each PE, there are following units
 Token match
 Program memory
 Compute tag
 Local path
 Routing n/w
 ALU
 I-structure

77
78
 Token match unit
 This unit provides a token matching mechanism
 It dispatches to the ALU only those instructions whose input data (tokens) are available
 Compute tag
 This unit provides a tagging mechanism
 Each data value is tagged with the address of the instruction to which it belongs
 Program memory
 Instructions are stored in the program memory
 Local path
 Tagged tokens enter into the PE through the local
path
 Routing network
 Tokens are passed to other PE's through the routing network
79
 I-structure
 It is a tagged memory unit for the overlapped usage of a data structure by both the producer and consumer processes (a small sketch follows)
 Each word of I-structure uses a 2 bit tag
 This indicates whether the word is:-
 Empty

 Full

 Has pending read requests
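A small sketch of one I-structure word, assuming a simplified software model of the hardware tags: reads that arrive before the write are deferred instead of busy-waiting, and the word is written only once.

class IStructureWord:
    def __init__(self):
        self.tag = "empty"            # "empty" | "pending" | "full"
        self.value = None
        self.deferred_reads = []      # continuations waiting for the value

    def read(self, continuation):
        if self.tag == "full":
            continuation(self.value)
        else:                         # defer the read until the producer writes
            self.deferred_reads.append(continuation)
            self.tag = "pending"

    def write(self, value):
        assert self.tag != "full", "I-structure words are written only once"
        self.value, self.tag = value, "full"
        for resume in self.deferred_reads:   # wake up all deferred readers
            resume(value)
        self.deferred_reads.clear()

word = IStructureWord()
word.read(lambda v: print("consumer got", v))   # read arrives before the producer
word.write(42)                                  # producer fills the slot -> prints 42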

80
DATAFLOW & HYBRID ARCHITECTURE
 Multithreaded architectures can be designed with a pure dataflow approach or with a hybrid approach
 Hybrid approach
 Combines the Von Neumann and data-driven mechanisms
 Dataflow graphs
 These are used as the machine language for dataflow architectures
 They specify only a partial order for the execution of instructions

81
DATAFLOW GRAPH FOR CALCULATION OF COS X
 Cos x = 1 - x^2/2! + x^4/4! - x^6/6!

 = 1 - x^2/2 + x^4/24 - x^6/720

The data flow graph contains 9 operators/nodes
Edges in the graph interconnect the nodes
Successive powers of x are obtained by repeated multiplications (one possible grouping is sketched below)
82
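A small sketch that evaluates the same truncated series; each arithmetic step corresponds to one operator node. This is one possible grouping into 9 nodes — the graph in the figure may partition the work slightly differently.

import math

def cos_approx(x):
    x2 = x * x            # node 1: x * x
    x4 = x2 * x2          # node 2: repeated multiplication for x^4
    x6 = x4 * x2          # node 3: repeated multiplication for x^6
    t1 = x2 / 2           # node 4
    t2 = x4 / 24          # node 5
    t3 = x6 / 720         # node 6
    s1 = 1 - t1           # node 7
    s2 = s1 + t2          # node 8
    return s2 - t3        # node 9

print(cos_approx(0.5), math.cos(0.5))   # both ~0.87758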
83
EVOLUTION OF DATAFLOW COMPUTERS
 Pure dataflow computers
 MIT tagged-token dataflow architecture
 Manchester dataflow machine
 ETL Sigma-1

 Explicit token store machines
 They are the successors of pure dataflow machines
 They eliminate associative token matching
 The waiting token memory is directly addressed with the use of full/empty bits
 Eg: Motorola Monsoon
 ETL EM-4

84
85
 Hybrid and unified architecture
 This combines the positive features from Von
Neumann and dataflow architectures
 Eg: MIT/Motorola *T
 MIT P-RISC

86
ETL/EM4 MACHINE
 Each node consisted of 2 elements
 EMC-R processor
 Memory

 Each node played the role of I-structure memory
 Each node had 1.31 MB of static RAM
 An Omega n/w was used to provide interconnections among the nodes

87
88
NODE ARCHITECTURE
 The processor chip communicated with the n/w through a 3 x 3 crossbar switch unit
 The processor and the memory are interfaced using a memory control unit
 Memory holds programs as well as tokens waiting to be fetched
 The processor contains 6 units
 Input buffers
 Fetch match unit
 Execution unit
 Instruction fetch
 Execute & emit tokens
 Register file
89
90
 Input buffer
 It is used as a token store
 It has a capacity of 32 words

 Fetch match unit
 It fetches tokens from the memory
 It performs tag matching operations among the fetched tokens
 Memory controller
 It fetches instructions directly from the memory

91
 Execution unit
 It is the heart of the processor
 It fetches instructions until the end of the thread
 A stop flag is raised to indicate the end of a thread
 Instructions with matching tokens are executed
 Instructions will emit tokens
 These emitted tokens are written into the register file

92
MIT/MOTOROLA *T PROTOTYPE
 *T project is a hybrid of dynamic dataflow
architecture and Von Neumann architecture
 Prototype architecture
 A brick of 16 nodes was packaged in a 9-in cube
 Local n/w consisted of 8x8 crossbar switches
 Memory was distributed among the nodes
 1 GB RAM was used per brick

93
94
*T NODE DESIGN
 Each node has 4 components
 MC88110 data processor (dP)
 Synchronization coprocessor
 Memory controller
 n/w interface unit

95
 Data processor (dP)
 It was optimized for long threads
 Concurrent integer & floating point operations were performed within each dP
 Synchronization coprocessor (sP)
 It was implemented as a special function unit
 It was optimized for short threads
 Both dP and sP could handle fast loads
 dP handled incoming continuations
 sP handled
 Incoming messages
 Rload/rstore responses

 Synchronization

96
97
 Memory controller
 It handled the requests for remote memory loads or stores
 It managed the node memory

 Network interface unit
 It transmitted and received messages to and from the n/w

98
