MULTITHREADED AND
DATAFLOW ARCHITECTURE
INTRODUCTION
In computer architecture, multithreading is the
ability of a central processing unit (CPU) (or a
single core in a multi-core processor) to execute
multiple processes or threads concurrently,
supported by the operating system
LATENCY HIDING TECHNIQUES
In distributed shared memory machines, access to
remote memory is likely to be slow
So architectures must rely on techniques to
reduce/hide remote-memory-access latencies.
Latency hiding is a technique that overlaps the time spent on communication and remote memory access with useful computation
3 approaches for latency hiding are:
1. Pre-fetching techniques
2. Coherent caching techniques
3. Multiple-context processors
TECHNIQUES OF PRE-FETCHING
HARDWARE CONTROLLED PRE-FETCHING
Hardware-controlled pre-fetching is done using 2 schemes:
Using long cache lines
Using instruction look-ahead
FALSE SHARING : EXAMPLE
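The example figure from the original slide is not reproduced here. As a hedged illustration of the idea (the pthreads library, the variable names, and the 64-byte line size are assumptions, not from the slides): with long cache lines, two processors that update different variables lying in the same line keep invalidating each other's copies, even though no data is truly shared.

#include <pthread.h>
#include <stdio.h>

/* Two independent counters that will typically fall in the same cache
   line, so the two threads ping-pong that line between cores. Padding
   each field out to the line size (commonly 64 bytes) would place them
   in separate lines and remove the false sharing. */
struct { long a; long b; } shared;

static void *bump_a(void *arg) {
    for (long i = 0; i < 50000000; i++) shared.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    for (long i = 0; i < 50000000; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}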
SOFTWARE CONTROLLED PRE-FETCHING
In this approach, explicit "pre-fetch" instructions
are issued for data that is "known" to be remote.
Pre-fetching is done selectively
Disadvantages
Extra instruction overhead
Need for sophisticated software intervention
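A minimal sketch of such selective, explicit pre-fetching, assuming a GCC/Clang toolchain with the __builtin_prefetch intrinsic (the array, the loop, and the look-ahead distance of 16 elements are illustrative assumptions):

#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue an explicit pre-fetch a fixed distance ahead, and only
           while data remains; these extra instructions are exactly the
           overhead listed above. */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}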
ISSUE OF COHERENCE IN PRE-FETCHING
We have to decide what should be done if a block is updated after it has been pre-fetched, but before it has been used
To address these issues of coherence while
performing pre-fetching, 2 approaches are used
Binding pre-fetching
Non-binding pre-fetching
BINDING PRE-FETCH POLICY
Fetch is assumed to have happened when the pre-fetch instruction is issued
So the pre-fetched value is bound at issue time; a later update to the block before its use will not be seen
NON - BINDING PRE-FETCH POLICY
The cache coherence protocol will make sure to
invalidate a pre-fetched value if it is updated
prior to its use.
The pre-fetched data remains visible to the cache
coherence protocol
Data is kept consistent until the processor
actually reads the value
BENEFITS OF PRE-FETCHING
Improves the performance
2. USING COHERENT CACHING TECHNIQUES
Cache coherence is maintained using snoopy bus protocols for bus-based systems
Coherence is maintained using a directory-based protocol for network-based systems
BENEFITS OF CACHING
Improves the performance
Reduces the number of cycles wasted due to read misses
Cycles wasted due to write misses are also reduced
3. MULTIPLE CONTEXT PROCESSORS
A conventional single-threaded processor will wait during a remote reference
So the processor remains idle for a period of time L
A multithreaded processor will suspend the current context and switch to another context
So for a fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding
The processor will be idle only if all contexts are blocked
OBJECTIVE OF CONTEXT SWITCHING
Maximize the fraction of time that the processor
is busy
Efficiency of the processor is defined as:-
Efficiency = busy / (busy + switching + idle)
STATES OF CONTEXT CYCLE
A context cycles through the following states
Ready
Running
Leaving
Blocked
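As an illustrative aside (the enum and comments below are not from the slides, just a restatement of the cycle in C):

enum context_state {
    READY,    /* eligible to be switched in */
    RUNNING,  /* currently executing on the processor */
    LEAVING,  /* being switched out; incurs the switch overhead */
    BLOCKED   /* waiting on an outstanding remote reference */
};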
MULTITHREADING ISSUES AND SOLUTIONS
Multithreading demands that a processor handle multiple contexts simultaneously on a context-switching basis
Architecture environment
Consider a multithreaded system modeled using a network of processor & memory nodes
Processor P
Memory M
PARAMETERS USED TO ANALYZE PERFORMANCE
Latency (L)
It involves the communication latency on a remote
memory access
The value of L includes
Network delays
Cache miss penalty
Number of threads (N)
Interval between switches (R)
This refers to the cycles between switches triggered by a remote reference
The inverse p = 1/R is called the rate of requests for remote accesses
This reflects the combination of
Program behavior
PROBLEMS OF ASYNCHRONY
REMOTE LOADS
Suppose that variables A and B are located on nodes N2 and N3
They need to be brought to node N1 in order to
compute A-B in variable C
This computation demands 2 remote loads and a
subtraction
rload: remote load
pA, pB: pointers to A and B
CTXT: context of computation on N1
Remote loads keep node N1 idle until it receives the variables A and B from N2 and N3
The latency caused by a remote load depends on the architectural properties of the machine
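A sketch of the scenario in C, with a hypothetical blocking rload call standing in for the remote-load mechanism (the signature and node numbering are assumptions for illustration):

/* rload blocks the issuing node until the value arrives. */
extern long rload(int node, const long *remote_addr);

extern const long *pA;  /* pointer to A, which lives on node N2 */
extern const long *pB;  /* pointer to B, which lives on node N3 */
long C;

void compute_on_N1(void) {
    long A = rload(2, pA);  /* N1 sits idle until A arrives from N2 */
    long B = rload(3, pB);  /* N1 sits idle again until B arrives from N3 */
    C = A - B;              /* the 2 remote loads and the subtraction */
}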
SYNCHRONIZING LOADS
In this case idling is caused because A and B are
computed by concurrent processes
It is not known when they will be ready for node N1 to read
The ready signal may reach node N1
asynchronously
This may result in Busy-waiting
The process repeatedly checks to see whether A and B are ready
Latency caused by a synchronizing load depends on
Scheduling
Time taken to compute A & B
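A busy-waiting sketch of the same computation (the ready flags and their producers are hypothetical; in a real system they would be set asynchronously by the processes computing A and B):

extern volatile int A_ready, B_ready;  /* set by the producer processes */
extern long A, B;
long C;

void sync_compute_on_N1(void) {
    /* Busy-waiting: N1 burns cycles re-checking the flags until both
       producers have signalled that their values are ready. */
    while (!A_ready || !B_ready)
        ;
    C = A - B;
}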
SOLUTIONS TO ASYNCHRONY PROBLEM
2 solutions
Multithreading solution
Distributed caching
MULTITHREADING SOLUTION
The solution is to multiplex among many threads
When one thread issues a remote load request,
processor begins work on another thread and so on
The cost of thread switching is smaller than the latency of a remote load
Concerns of this approach
Responses of remote load may not return in proper
order
i.e. after issuing a remote load from thread T1, we switch to T2, which also issues a remote load
Responses may not return in the same order, because:
Requests may be travelling different distances
There may be varying degrees of congestion
The load on the destination nodes may differ greatly
Solution to this concern
Make sure that messages carry continuations
Each remote load and response is associated with an
identifier for the appropriate thread
Thread identifiers are referred to as continuations on messages
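A rough sketch of a continuation-carrying message (the structure and function names are invented for illustration):

/* Every request and its response carry the identifier of the issuing
   thread, so out-of-order responses can still be matched correctly. */
struct rload_msg {
    int  thread_id;   /* the continuation */
    long remote_addr; /* address being loaded */
    long value;       /* filled in by the responding node */
};

extern void deliver_value(int thread_id, long value);  /* hypothetical */
extern void mark_ready(int thread_id);                 /* scheduler hook */

void on_response(const struct rload_msg *m) {
    deliver_value(m->thread_id, m->value);  /* route by continuation */
    mark_ready(m->thread_id);               /* the thread may resume */
}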
DISTRIBUTED CACHING
Remote data is cached at the requesting node, so repeated accesses can be satisfied locally instead of crossing the network
SWITCH ON CACHE MISS
In this case a context is switched when a cache
miss occurs
R is the average interval between the misses
SWITCH ON EVERY LOAD
This policy allows switching on every load,
independent of whether it will cause a miss or
not
R represents the average interval between loads
SWITCH ON EVERY INSTRUCTION
This policy allows switching on every instruction, independent of whether it is a load or not
It interleaves the instructions from different threads on a cycle-by-cycle basis
If successive instructions are independent, then it will benefit pipelined execution
SWITCH ON BLOCK OF INSTRUCTION
Blocks of instructions from different threads are
interleaved
This benefits single-context performance
PROCESSOR EFFICIENCIES
A single threaded processor executes a context
until a remote reference is issued
It remains idle until the reference completes (L cycles)
There is no context switch hence no switch
overhead
Efficiency of a single-threaded machine is:
E1 = R / (R + L) = 1 / (1 + L/R)
R: amount of time during a cycle that the processor is busy
L: amount of time during a cycle that the processor is idle
EFFICIENCY OF MULTITHREADED PROCESSOR
Memory latency is hidden due to context
switching
But there is switching overhead of C cycles
SATURATION REGION
In saturation region, processor operates with
maximum utilization
Esat = R / (R + C) = 1 / (1 + C/R)
Efficiency in saturation is independent of latency
LINEAR REGION
When the number of contexts is below the saturation point, there may be no ready contexts after a context switch
So the processor will experience idle cycles
The time required to switch to a ready context,
execute it until a remote reference is issued and
process the reference is equal to:-
R + C + L
Elin = NR / (R + C + L)
N is below the saturation point
Efficiency increases linearly with the number of contexts until the saturation point is reached
Beyond the saturation point it remains constant
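The C program below evaluates the three efficiency formulas; the parameter values are made up for illustration. Setting Elin = Esat gives the saturation point N = (R + C + L) / (R + C).

#include <stdio.h>

int main(void) {
    double R = 50.0;   /* cycles of useful work between remote references */
    double L = 200.0;  /* latency of a remote reference, in cycles */
    double C = 10.0;   /* context-switch overhead, in cycles */

    double E1   = R / (R + L);  /* single-threaded: idles for all of L */
    double Esat = R / (R + C);  /* saturation: always a ready context */

    /* Linear region: efficiency grows with the number of contexts N
       until it reaches Esat at N = (R + C + L) / (R + C), about 4.3 here. */
    for (int N = 1; N <= 4; N++)
        printf("N=%d  Elin=%.3f\n", N, N * R / (R + C + L));

    printf("E1=%.3f  Esat=%.3f\n", E1, Esat);
    return 0;
}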
FINE GRAIN MULTICOMPUTERS
INTRODUCTION
Parallelism can be classified into three categories
1. Fine-grained
2. Medium-grained
3. Coarse-grained parallelism
3. COARSE GRAINED PARALLELISM
In coarse-grained parallelism, a program is split
into large tasks.
A large amount of computation takes place in the processors.
This results in load imbalance
Certain tasks process the bulk of the data while others might be idle.
Coarse-grained parallelism fails to exploit the
parallelism in the program
The advantage of this type of parallelism is low
communication and synchronization overhead.
Eg: Cray Y-MP
CHARACTERISTICS OF PARALLEL MACHINES
Latency analysis
Communication latency (Tc)
This measures the data or message transfer time on a
system interconnect
In Cray Y-MP, latency implies the shared memory access
time
In the CM-2 machine, latency implies the time required to communicate over the network
Synchronization overhead (Ts)
It is the processing time required on a processor, PE, or
on a processing node of a multicomputer for the purpose
of synchronization
Total time for IPC (inter-processor communication)
It is the sum of Tc and Ts
TIPC = Tc + Ts
COMPARISON OF CHARACTERISTICS
DATAFLOW AND HYBRID
ARCHITECTURE
INTRODUCTION
CONTROL FLOW MECHANISM
In control flow mechanism, the order of program
execution is explicitly stated in the user program
i.e. the programs are executed in sequential order
Computers use a program counter to sequence the execution of instructions in the program
This style of program execution is also called control driven
because program flow is explicitly controlled by the programmer
Eg: Conventional Von Neumann computers follow this type of mechanism
DATAFLOW MECHANISM
This mechanism allows execution of any
instruction, based on the data (operand)
availability
Does not follow program order
Computers that follow the dataflow mechanism are called dataflow computers
This style of execution is also called data-driven execution
Does not require a PC or a control sequencer
DATAFLOW COMPUTERS
The execution of instruction is driven by data
availability instead of the guidance of program
counter
Instructions will be ready for execution whenever the
operands are available
Instructions in a data-driven program are not ordered
Computational results (tokens) are passed directly
between the instructions
Data generated by one instruction will be duplicated into
many copies & they are forwarded directly to all needy
instructions
Tokens once consumed by the instruction will not be
available for other instructions
REQUIREMENTS FOR DATAFLOW COMPUTERS
It requires special mechanisms to:
Detect data availability
Match data tokens with needy instructions
Enable chain reaction of asynchronous instruction
execution
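A toy software sketch of these mechanisms (the structures and the two-operand firing rule are illustrative assumptions): an instruction fires as soon as all of its operand tokens have been matched, and firing consumes the tokens.

#include <stdio.h>
#include <stdbool.h>

struct instruction {
    char   op;         /* '+' or '-' */
    double operand[2];
    bool   present[2]; /* token-matching state: which operands arrived */
};

/* Deliver a token to one operand slot; fire when the match completes. */
static void deliver_token(struct instruction *i, int slot, double value) {
    i->operand[slot] = value;
    i->present[slot] = true;
    if (i->present[0] && i->present[1]) {      /* data availability detected */
        double r = (i->op == '+') ? i->operand[0] + i->operand[1]
                                  : i->operand[0] - i->operand[1];
        i->present[0] = i->present[1] = false; /* tokens are consumed */
        /* A real machine would now forward r as a token to all needy
           instructions, continuing the asynchronous chain reaction. */
        printf("fired %c -> %g\n", i->op, r);
    }
}

int main(void) {
    struct instruction sub = { '-', {0, 0}, {false, false} };
    deliver_token(&sub, 0, 5.0);  /* only one token present: nothing fires */
    deliver_token(&sub, 1, 2.0);  /* match complete: fires and prints 3 */
    return 0;
}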
DATAFLOW ARCHITECTURE
INTERIOR DESIGN OF PE
Within each PE, there are following units
Token match
Program memory
Compute tag
Local path
Routing n/w
ALU
I-structure
Token match unit
This unit provides a token matching mechanism
It dispatches to the ALU only those instructions whose input data (tokens) are available
Compute tag
This unit provides a tagging mechanism
Each data item is tagged with the address of the instruction to which it belongs
Program memory
Instructions are stored in the program memory
Local path
Tagged tokens enter into the PE through the local
path
Routing network
Tokens are passed to other PEs through the routing network
I – structure
It is a tagged memory unit for the overlapped usage
of a data structure by both the producer and
consumer processes
Each word of the I-structure uses a 2-bit tag
This indicates whether the word is:
Empty
Full
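A toy sketch of one I-structure word (names are illustrative; a real I-structure also queues deferred reads, which this sketch merely signals):

#include <stdio.h>

enum tag { EMPTY, FULL };  /* the per-word presence tag */

struct i_word {
    enum tag tag;
    double   value;
};

/* Producer: writing a word flips its tag from EMPTY to FULL. */
static void i_write(struct i_word *w, double v) {
    w->value = v;
    w->tag = FULL;
}

/* Consumer: a read of an EMPTY word must be deferred until the write. */
static int i_read(const struct i_word *w, double *out) {
    if (w->tag == EMPTY)
        return 0;  /* not ready: the reader would be suspended */
    *out = w->value;
    return 1;
}

int main(void) {
    struct i_word w = { EMPTY, 0.0 };
    double v;
    printf("ready? %d\n", i_read(&w, &v));           /* 0: must defer */
    i_write(&w, 3.14);
    printf("ready? %d  v=%g\n", i_read(&w, &v), v);  /* 1: v=3.14 */
    return 0;
}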
DATAFLOW & HYBRID ARCHITECTURE
Multithreaded architectures can be designed with
pure dataflow approach or with a hybrid approach
Hybrid approach
Combining the Von Neumann and data-driven mechanisms
Dataflow graph
These are used as the machine language for dataflow
architectures
They specify only a partial order for the execution of instructions
DATAFLOW GRAPH FOR CALCULATION OF COS X
cos x = 1 - x^2/2! + x^4/4! - x^6/6!
      = 1 - x^2/2 + x^4/24 - x^6/720
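For comparison, the truncated series written as sequential C code; the dataflow graph performs exactly these operations but fires each node as soon as its operands are available (x^2 is a shared subexpression feeding several nodes):

#include <stdio.h>

static double cos_series(double x) {
    double x2 = x * x;   /* one node's result feeds three consumers */
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    return 1.0 - x2 / 2.0 + x4 / 24.0 - x6 / 720.0;
}

int main(void) {
    printf("cos(0.5) ~ %f\n", cos_series(0.5));  /* ~0.877582 */
    return 0;
}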
Hybrid and unified architecture
This combines the positive features from Von
Neumann and dataflow architectures
Eg: MIT/Motorola *T
MIT P-RISC
ETL/EM4 MACHINE
Each node consisted of 2 elements:
EMC-R processor
Memory
NODE ARCHITECTURE
The processor chip communicated with the network through a 3x3 crossbar switch unit
The processor and the memory were interfaced using a memory control unit
Memory holds programs as well as tokens
waiting to be fetched
The processor contains 6 units:
Input buffers
Fetch match unit
Execution unit
Instruction fetch
Execute & emit tokens
Register file
Input buffer
It is used as a token store
It has a capacity of 32 words
Execution unit
It is the heart of the processor
It fetches instructions until the end of the thread
Stop flag is raised to indicate the end of a thread
Instructions with matching tokens are executed
Instructions will emit tokens
These emitted tokens are written into the register file
MIT/MOTOROLA *T PROTOTYPE
The *T project is a hybrid of dynamic dataflow architecture and Von Neumann architecture
Prototype architecture
A brick of 16 nodes was packaged in a 9-in cube
The local network consisted of 8x8 crossbar switches
Memory was distributed among the nodes
1 GB RAM was used per brick
*T NODE DESIGN
Each node has 4 components
MC88110 data processor (dP)
Synchronization coprocessor
Memory controller
Network interface unit
Data processor (dP)
It was optimized for long threads
Concurrent integer & floating-point operations were performed within each dP
Synchronization coprocessor (sP)
It was implemented as a special function unit
It was optimized for short threads
Both dP and sP could handle fast loads
dP handled incoming continuations
sP handled
Incoming messages
Rload/rstore responses
Synchronization
Memory controller
Handled the requests for remote memory load or
store
It managed the node memory