MULTITHREADED AND
DATAFLOW ARCHITECTURE
INTRODUCTION
In computer architecture, multithreading is the
ability of a central processing unit (CPU) (or a
single core in a multi-core processor) to execute
multiple processes or threads concurrently,
supported by the operating system
LATENCY HIDING TECHNIQUES
In distributed shared memory machines, access to
remote memory is likely to be slow
So architectures must rely on techniques to
reduce/hide remote-memory-access latencies.
Latency hiding is a technique that overlaps the time spent on communication and remote memory access with useful computation
3 approaches for latency hiding are:
1. Pre-fetching techniques
2. Coherent caching techniques
3. Multiple-context processors
TECHNIQUES OF PRE-FETCHING
HARDWARE CONTROLLED PRE-FETCHING
Hardware-controlled pre-fetching is done using 2 schemes:
Using long cache lines
Using instruction look-ahead
FALSE SHARING : EXAMPLE
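The example figure from the original slide is not reproduced here. As a hedged illustration of the idea (the pthreads library, the variable names, and the 64-byte line size are assumptions, not from the slides): with long cache lines, two processors that update different variables lying in the same line keep invalidating each other's copies, even though no data is truly shared.

#include <pthread.h>
#include <stdio.h>

/* Two independent counters that will typically fall in the same cache
   line, so the two threads ping-pong that line between cores. Padding
   each field out to the line size (commonly 64 bytes) would place them
   in separate lines and remove the false sharing. */
struct { long a; long b; } shared;

static void *bump_a(void *arg) {
    for (long i = 0; i < 50000000; i++) shared.a++;
    return NULL;
}

static void *bump_b(void *arg) {
    for (long i = 0; i < 50000000; i++) shared.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared.a, shared.b);
    return 0;
}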
SOFTWARE CONTROLLED PRE-FETCHING
In this approach, explicit "pre-fetch" instructions
are issued for data that is "known" to be remote.
Pre-fetching is done selectively
Disadvantages
Extra instruction overhead
Need for sophisticated software intervention
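A minimal sketch of such selective, explicit pre-fetching, assuming a GCC/Clang toolchain with the __builtin_prefetch intrinsic (the array, the loop, and the look-ahead distance of 16 elements are illustrative assumptions):

#include <stddef.h>

double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue an explicit pre-fetch a fixed distance ahead, and only
           while data remains; these extra instructions are exactly the
           overhead listed above. */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0 /* read */, 1 /* low temporal locality */);
        s += a[i];
    }
    return s;
}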
ISSUE OF COHERENCE IN PRE-FETCHING
We have to decide what should be done if a block is updated after it has been pre-fetched, but before it has been used
To address these issues of coherence while
performing pre-fetching, 2 approaches are used
Binding pre-fetching
Non-binding pre-fetching
BINDING PRE-FETCH POLICY
Fetch is assumed to have happened when the pre-fetch instruction is issued
So the pre-fetched value is bound at issue time; a later update to the block before its use will not be seen
NON - BINDING PRE-FETCH POLICY
The cache coherence protocol will make sure to
invalidate a pre-fetched value if it is updated
prior to its use.
The pre-fetched data remains visible to the cache
coherence protocol
Data is kept consistent until the processor
actually reads the value
BENEFITS OF PRE-FETCHING
Improves the performance
2. USING COHERENT CACHING TECHNIQUES
Cache coherence is maintained using snoopy bus protocols for bus-based systems
Coherence is maintained using a directory-based protocol for network-based systems
BENEFITS OF CACHING
Improves the performance
Reduces the number of cycles wasted due to read misses
Cycles wasted due to write misses are also reduced
3. MULTIPLE CONTEXT PROCESSORS
A conventional single-threaded processor will wait during a remote reference
So the processor remains idle for a period of time L
A multithreaded processor will suspend the current context and switch to another context
So for a fixed number of cycles it will again be busy doing useful work, even though the remote reference is outstanding
The processor will be idle only if all contexts are blocked
OBJECTIVE OF CONTEXT SWITCHING
Maximize the fraction of time that the processor
is busy
Efficiency of the processor is defined as:-
Efficiency = busy / (busy + switching + idle)
STATES OF CONTEXT CYCLE
A context cycles through the following states
Ready
Running
Leaving
Blocked
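As an illustrative aside (the enum and comments below are not from the slides, just a restatement of the cycle in C):

enum context_state {
    READY,    /* eligible to be switched in */
    RUNNING,  /* currently executing on the processor */
    LEAVING,  /* being switched out; incurs the switch overhead */
    BLOCKED   /* waiting on an outstanding remote reference */
};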
MULTITHREADING ISSUES AND SOLUTIONS
Multithreading demands that a processor handle multiple contexts simultaneously on a context-switching basis
Architecture environment
Consider a multithreaded system modeled using a network of processor & memory nodes
Processor P
Memory M
PARAMETERS USED TO ANALYZE PERFORMANCE
Latency (L)
It involves the communication latency on a remote
memory access
The value of L includes
Network delays
Cache miss penalty
Number of threads (N)
Interval between switches (R)
This refers to the cycles between switches triggered by a remote reference
The inverse p = 1/R is called the rate of requests for remote accesses
This reflects the combination of
Program behavior
PROBLEMS OF ASYNCHRONY
REMOTE LOADS
Suppose that variables A and B are located on nodes N2 and N3
They need to be brought to node N1 in order to
compute A-B in variable C
This computation demands 2 remote loads and a
subtraction
rload: remote load
pA, pB: pointers to A and B
CTXT: context of computation on N1
Remote loads keep node N1 idle until it receives the variables A and B from N2 and N3
The latency caused by a remote load depends on the architectural properties of the machine
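A sketch of the scenario in C, with a hypothetical blocking rload call standing in for the remote-load mechanism (the signature and node numbering are assumptions for illustration):

/* rload blocks the issuing node until the value arrives. */
extern long rload(int node, const long *remote_addr);

extern const long *pA;  /* pointer to A, which lives on node N2 */
extern const long *pB;  /* pointer to B, which lives on node N3 */
long C;

void compute_on_N1(void) {
    long A = rload(2, pA);  /* N1 sits idle until A arrives from N2 */
    long B = rload(3, pB);  /* N1 sits idle again until B arrives from N3 */
    C = A - B;              /* the 2 remote loads and the subtraction */
}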
SYNCHRONIZING LOADS
In this case idling is caused because A and B are
computed by concurrent processes
It is not known when they will be ready for node N1 to read
The ready signal may reach node N1
asynchronously
This may result in Busy-waiting
The process repeatedly checks to see whether A and B are ready
Latency caused by a synchronizing load depends on
Scheduling
Time taken to compute A & B
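A busy-waiting sketch of the same computation (the ready flags and their producers are hypothetical; in a real system they would be set asynchronously by the processes computing A and B):

extern volatile int A_ready, B_ready;  /* set by the producer processes */
extern long A, B;
long C;

void sync_compute_on_N1(void) {
    /* Busy-waiting: N1 burns cycles re-checking the flags until both
       producers have signalled that their values are ready. */
    while (!A_ready || !B_ready)
        ;
    C = A - B;
}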
SOLUTIONS TO ASYNCHRONY PROBLEM
2 solutions
Multithreading solution
Distributed caching
MULTITHREADING SOLUTION
The solution is to multiplex among many threads
When one thread issues a remote load request,
processor begins work on another thread and so on
The cost of thread switching is smaller than the latency of a remote load
Concerns of this approach
Responses of remote load may not return in proper
order
i.e. after issuing a remote load from thread T1, we switch to T2, which also issues a remote load
Responses may not return in the same order, because:
Requests may be travelling different distances
There may be varying degrees of congestion
The load on the destination nodes may differ greatly
Solution to this concern
Make sure that messages carry continuations
Each remote load and response is associated with an
identifier for the appropriate thread
Thread identifiers are referred to as continuations on messages
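A rough sketch of a continuation-carrying message (the structure and function names are invented for illustration):

/* Every request and its response carry the identifier of the issuing
   thread, so out-of-order responses can still be matched correctly. */
struct rload_msg {
    int  thread_id;   /* the continuation */
    long remote_addr; /* address being loaded */
    long value;       /* filled in by the responding node */
};

extern void deliver_value(int thread_id, long value);  /* hypothetical */
extern void mark_ready(int thread_id);                 /* scheduler hook */

void on_response(const struct rload_msg *m) {
    deliver_value(m->thread_id, m->value);  /* route by continuation */
    mark_ready(m->thread_id);               /* the thread may resume */
}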
DISTRIBUTED CACHING
Remote data is cached at the requesting node, so repeated accesses can be satisfied locally instead of crossing the network
SWITCH ON CACHE MISS
In this case a context is switched when a cache
miss occurs
R is the average interval between the misses
SWITCH ON EVERY LOAD
This policy allows switching on every load,
independent of whether it will cause a miss or
not
R represents the average interval between loads
SWITCH ON EVERY INSTRUCTION
This policy allows switching on every instruction, independent of whether it is a load or not
It interleaves the instructions from different threads on a cycle-by-cycle basis
If successive instructions are independent, then it will benefit pipelined execution
SWITCH ON BLOCK OF INSTRUCTION
Blocks of instructions from different threads are
interleaved
This benefits single-context performance
PROCESSOR EFFICIENCIES
A single threaded processor executes a context
until a remote reference is issued
It remains idle until the reference completes (L cycles)
There is no context switch hence no switch
overhead
Efficiency of a single-threaded machine is:
E1 = R / (R + L) = 1 / (1 + L/R)
R: amount of time during a cycle that the processor is busy
L: amount of time during a cycle that the processor is idle
EFFICIENCY OF MULTITHREADED PROCESSOR
Memory latency is hidden due to context
switching
But there is switching overhead of C cycles
SATURATION REGION
In saturation region, processor operates with
maximum utilization
Esat = R / (R + C) = 1 / (1 + C/R)
Efficiency in saturation is independent of latency
LINEAR REGION
When the number of contexts is below the saturation point, there may be no ready contexts after a context switch
So the processor will experience idle cycles
The time required to switch to a ready context,
execute it until a remote reference is issued and
process the reference is equal to:-
R + C + L
Elin = NR / (R + C + L)
N is below the saturation point
Efficiency increases linearly with the number of contexts until the saturation point is reached
Beyond the saturation point it remains constant
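The C program below evaluates the three efficiency formulas; the parameter values are made up for illustration. Setting Elin = Esat gives the saturation point N = (R + C + L) / (R + C).

#include <stdio.h>

int main(void) {
    double R = 50.0;   /* cycles of useful work between remote references */
    double L = 200.0;  /* latency of a remote reference, in cycles */
    double C = 10.0;   /* context-switch overhead, in cycles */

    double E1   = R / (R + L);  /* single-threaded: idles for all of L */
    double Esat = R / (R + C);  /* saturation: always a ready context */

    /* Linear region: efficiency grows with the number of contexts N
       until it reaches Esat at N = (R + C + L) / (R + C), about 4.3 here. */
    for (int N = 1; N <= 4; N++)
        printf("N=%d  Elin=%.3f\n", N, N * R / (R + C + L));

    printf("E1=%.3f  Esat=%.3f\n", E1, Esat);
    return 0;
}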
FINE GRAIN MULTICOMPUTERS
INTRODUCTION
Parallelism can be classified into three categories
1. Fine-grained
2. Medium-grained
3. Coarse-grained parallelism
3. COARSE GRAINED PARALLELISM
In coarse-grained parallelism, a program is split
into large tasks.
A large amount of computation takes place in the processors.
This results in load imbalance
Certain tasks process the bulk of the data while others might be idle.
Coarse-grained parallelism fails to exploit the
parallelism in the program
The advantage of this type of parallelism is low
communication and synchronization overhead.
Eg: Cray Y-MP
CHARACTERISTICS OF PARALLEL MACHINES
Latency analysis
Communication latency (Tc)
This measures the data or message transfer time on a
system interconnect
In Cray Y-MP, latency implies the shared memory access
time
In the CM-2 machine, latency implies the time required to communicate over the network
Synchronization overhead (Ts)
It is the processing time required on a processor, PE, or
on a processing node of a multicomputer for the purpose
of synchronization
Total time for IPC (inter-processor communication)
It is the sum of Tc and Ts
TIPC = Tc + Ts
COMPARISON OF CHARACTERISTICS
DATAFLOW AND HYBRID
ARCHITECTURE
INTRODUCTION
CONTROL FLOW MECHANISM
In control flow mechanism, the order of program
execution is explicitly stated in the user program
i.e. the programs are executed in sequential order
Computers use a program counter to sequence the execution of instructions in the program
This style of program execution is also called control driven
because program flow is explicitly controlled by the programmer
Eg: Conventional Von Neumann computers follow this type of mechanism
DATAFLOW MECHANISM
This mechanism allows execution of any
instruction, based on the data (operand)
availability
Does not follow program order
Computers that follow the dataflow mechanism are called dataflow computers
This style of execution is also called data-driven execution
Does not require a PC or a control sequencer
DATAFLOW COMPUTERS
The execution of instruction is driven by data
availability instead of the guidance of program
counter
Instructions will be ready for execution whenever the
operands are available
Instructions in a data-driven program are not ordered
Computational results (tokens) are passed directly
between the instructions
Data generated by one instruction will be duplicated into
many copies & they are forwarded directly to all needy
instructions
Tokens once consumed by the instruction will not be
available for other instructions
REQUIREMENTS FOR DATAFLOW COMPUTERS
It requires special mechanisms to:
Detect data availability
Match data tokens with needy instructions
Enable chain reaction of asynchronous instruction
execution
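A toy software sketch of these mechanisms (the structures and the two-operand firing rule are illustrative assumptions): an instruction fires as soon as all of its operand tokens have been matched, and firing consumes the tokens.

#include <stdio.h>
#include <stdbool.h>

struct instruction {
    char   op;         /* '+' or '-' */
    double operand[2];
    bool   present[2]; /* token-matching state: which operands arrived */
};

/* Deliver a token to one operand slot; fire when the match completes. */
static void deliver_token(struct instruction *i, int slot, double value) {
    i->operand[slot] = value;
    i->present[slot] = true;
    if (i->present[0] && i->present[1]) {      /* data availability detected */
        double r = (i->op == '+') ? i->operand[0] + i->operand[1]
                                  : i->operand[0] - i->operand[1];
        i->present[0] = i->present[1] = false; /* tokens are consumed */
        /* A real machine would now forward r as a token to all needy
           instructions, continuing the asynchronous chain reaction. */
        printf("fired %c -> %g\n", i->op, r);
    }
}

int main(void) {
    struct instruction sub = { '-', {0, 0}, {false, false} };
    deliver_token(&sub, 0, 5.0);  /* only one token present: nothing fires */
    deliver_token(&sub, 1, 2.0);  /* match complete: fires and prints 3 */
    return 0;
}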
DATAFLOW ARCHITECTURE
INTERIOR DESIGN OF PE
Within each PE, there are following units
Token match
Program memory
Compute tag
Local path
Routing n/w
ALU
I-structure
Token match unit
This unit provides a token matching mechanism
It dispatches to the ALU only those instructions whose input data (tokens) are available
Compute tag
This unit provides a tagging mechanism
Each data item is tagged with the address of the instruction to which it belongs
Program memory
Instructions are stored in the program memory
Local path
Tagged tokens enter into the PE through the local
path
Routing network
Tokens are passed to other PEs through the routing network
I – structure
It is a tagged memory unit for the overlapped usage
of a data structure by both the producer and
consumer processes
Each word of the I-structure uses a 2-bit tag
This indicates whether the word is:
Empty
Full
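A toy sketch of one I-structure word (names are illustrative; a real I-structure also queues deferred reads, which this sketch merely signals):

#include <stdio.h>

enum tag { EMPTY, FULL };  /* the per-word presence tag */

struct i_word {
    enum tag tag;
    double   value;
};

/* Producer: writing a word flips its tag from EMPTY to FULL. */
static void i_write(struct i_word *w, double v) {
    w->value = v;
    w->tag = FULL;
}

/* Consumer: a read of an EMPTY word must be deferred until the write. */
static int i_read(const struct i_word *w, double *out) {
    if (w->tag == EMPTY)
        return 0;  /* not ready: the reader would be suspended */
    *out = w->value;
    return 1;
}

int main(void) {
    struct i_word w = { EMPTY, 0.0 };
    double v;
    printf("ready? %d\n", i_read(&w, &v));           /* 0: must defer */
    i_write(&w, 3.14);
    printf("ready? %d  v=%g\n", i_read(&w, &v), v);  /* 1: v=3.14 */
    return 0;
}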
DATAFLOW & HYBRID ARCHITECTURE
Multithreaded architectures can be designed with
pure dataflow approach or with a hybrid approach
Hybrid approach
Combining the Von Neumann and data-driven mechanisms
Dataflow graph
These are used as the machine language for dataflow
architectures
They specify only a partial order for the execution of instructions
DATAFLOW GRAPH FOR CALCULATION OF COS X
cos x = 1 - x^2/2! + x^4/4! - x^6/6!
      = 1 - x^2/2 + x^4/24 - x^6/720
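For comparison, the truncated series written as sequential C code; the dataflow graph performs exactly these operations but fires each node as soon as its operands are available (x^2 is a shared subexpression feeding several nodes):

#include <stdio.h>

static double cos_series(double x) {
    double x2 = x * x;   /* one node's result feeds three consumers */
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    return 1.0 - x2 / 2.0 + x4 / 24.0 - x6 / 720.0;
}

int main(void) {
    printf("cos(0.5) ~ %f\n", cos_series(0.5));  /* ~0.877582 */
    return 0;
}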
Hybrid and unified architecture
This combines the positive features from Von
Neumann and dataflow architectures
Eg: MIT/Motorola *T
MIT P-RISC
ETL/EM4 MACHINE
Each node consisted of 2 elements:
EMC-R processor
Memory
NODE ARCHITECTURE
The processor chip communicated with the network through a 3x3 crossbar switch unit
The processor and the memory were interfaced using a memory control unit
Memory holds programs as well as tokens
waiting to be fetched
The processor contains 6 units:
Input buffers
Fetch match unit
Execution unit
Instruction fetch
Execute & emit tokens
Register file
Input buffer
It is used as a token store
It has a capacity of 32 words
Execution unit
It is the heart of the processor
It fetches instructions until the end of the thread
Stop flag is raised to indicate the end of a thread
Instructions with matching tokens are executed
Instructions will emit tokens
These emitted tokens are written into the register file
MIT/MOTOROLA *T PROTOTYPE
The *T project is a hybrid of dynamic dataflow architecture and Von Neumann architecture
Prototype architecture
A brick of 16 nodes was packaged in a 9-in cube
The local network consisted of 8x8 crossbar switches
Memory was distributed among the nodes
1 GB RAM was used per brick
*T NODE DESIGN
Each node has 4 components
MC88110 data processor (dP)
Synchronization coprocessor
Memory controller
Network interface unit
Data processor (dP)
It was optimized for long threads
Concurrent integer & floating-point operations were performed within each dP
Synchronization coprocessor (sP)
It was implemented as a special function unit
It was optimized for short threads
Both dP and sP could handle fast loads
dP handled incoming continuations
sP handled
Incoming messages
Rload/rstore responses
Synchronization
Memory controller
Handled the requests for remote memory load or
store
It managed the node memory