
Introduction to MIMD Architectures

A MIMD (Multiple Instruction stream, Multiple Data stream) computer system consists of a number of
independent processors that operate on separate data concurrently. Each processor therefore has its
own program memory, or access to program memory, and likewise has its own data memory or access
to data memory. Clearly there must be a mechanism to load the program and data memories, and a
mechanism for passing information between processors as they work on a problem. MIMD has clearly
emerged as the architecture of choice for general-purpose multiprocessors. MIMD machines offer
flexibility: with the correct hardware and software support, they can function as single-user machines
focusing on high performance for one application, as multiprogrammed machines running many tasks
simultaneously, or as some combination of these. There are two types of MIMD architectures: distributed
memory MIMD architecture and shared memory MIMD architecture.


Distributed Memory MIMD Architectures

I. MIMD Architecture’s Goal

Unlike shared memory architectures, a multicomputer offers no direct access to a remote memory
block. Rather than reallocating a processor to a different process, or letting it stall while waiting for
remote data, distributed memory MIMD architectures take another approach. The main point of this
architectural design is to develop a message passing parallel computer system organized such that
the processor time spent in communication within the network is reduced to a minimum.
II. Components of the MultiComputer and Their Tasks

A multicomputer consists of a large number of nodes and a communication network linking
these nodes together. Inside each node, there are three important elements with tasks
related to message passing:

a. Computation Processor and Private Memory

This component executes the application program; its code and data reside in the
node's private memory.

b. Communication Processor

This component is responsible for organizing communication among the multicomputer
nodes: "packetizing" a message from a chunk of memory on the sending end, and
"depacketizing" the same message for the receiving node.

c. Router, commonly referred to as a switch unit

The router's main task is to transmit the message from one node to the next and to
assist the communication processor in steering the message through the network of
nodes.

These three elements appeared progressively, as hardware became more capable.
First came the Generation I multicomputers, in which messages were passed through
direct links between nodes, but there were no communication processors or routers.
Then Generation II multicomputers came along with independent switch units that were
separate from the processor, and finally in the Generation III multicomputers all three
components exist.

III. MultiComputer Classification Schemes

One method of classifying a multicomputer's scheme of message transmission is its
interconnection network topology, which in plain English means how the nodes are
geometrically arranged as a network. There is great value in this information: much as in
the shortest path problem from computer science, the shortest route from node A to node B
is directly determined by the shape the nodes generate when linked together.
Interconnection network topology therefore has a great influence on message transmission.

A. Interconnection Network Topology

Background Information On Interconnection Topologies

For a geometrically arranged network (e.g. star, hypercube), there are several
"tradeoffs" associated with the given design's ability to transmit messages as
quickly as possible:
1. The Network's Size. The more nodes you have, the longer it may take for node X
to pass a message to node Y, for obvious reasons.
2. The Degree of a Node. This term means the number of input and output links to
a given node. The lower the degree of a node, the better the design for message
transmission, since each node needs less switching hardware.
3. The Network's Diameter. This is the longest of the shortest paths between any
pair of nodes in the network. When the diameter is smaller, the worst-case latency
for a message to transmit is reduced as well.
4. The Network's Bisection Width. This is the minimum number of links that need to
be removed so that the entire network splits into two halves.
5. The Arc Connectivity. This is the minimum number of arcs that need to be removed
in order to separate the network into two disconnected networks.
6. Finally, the Cost. This is the number of communication links required for the network.
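The metrics above can be computed mechanically for any topology. Below is a minimal sketch (stdlib Python, breadth-first search) that checks the degree, diameter, and cost of a 3-dimensional hypercube; the function and variable names are illustrative, not from the text:

```python
from collections import deque

def hypercube_edges(dim):
    """Edges of a dim-dimensional hypercube: two nodes are linked
    iff their binary labels differ in exactly one bit."""
    return [(u, u ^ (1 << b))
            for u in range(2 ** dim)
            for b in range(dim)
            if u < (u ^ (1 << b))]          # count each link once

def adjacency(n, edges):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def diameter(adj):
    """Longest of the shortest paths over all node pairs (BFS per node)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        worst = max(worst, max(dist.values()))
    return worst

dim = 3
edges = hypercube_edges(dim)
adj = adjacency(2 ** dim, edges)
degree = max(len(nbrs) for nbrs in adj.values())
print(degree, diameter(adj), len(edges))  # -> 3 3 12
```

For a d-cube, degree and diameter are both d and the cost (link count) is d * 2^(d-1), which is what the run confirms for d = 3.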

Interconnection Topology Types

Below is the picture of the three categories of interconnection topologies (one dimensional, two
dimensional, and three dimensional) and the designs that fall under each one. Along with this,
there is a table comparing each topology based on the tradeoffs listed above.
B. Switching

A second method of comparing multicomputer schemes of message transmission is switching:
the method of moving messages from input buffers to output buffers. A "buffer" is the location
where a message, or part of a message, is stored.

Four main types of switching need to be discussed:

1. Packet Switching, sometimes referred to as Store and Forward

This mechanism of message storage and movement can be thought of as working like
the mail service, and is probably just as slow. A message is split into "packets"
that can be stored in the buffer contained within each intermediate node (a node on
the path between the source node sending the message and the intended receiver of
the message). A "packet" usually contains a header and some data. The header works
like a tag, telling the switching unit which node to forward the data to in order
to reach the destination node.

Network Latency = (P/B) * D

where P is the packet length, B is the channel bandwidth, and D is the number of
nodes along the path. Each hop must receive the entire packet before forwarding
it, so the P/B transfer time is paid D times.

Some drawbacks of packet switching include a higher network latency in message
delivery, due to the above equation's dependence on the packet length at every hop,
and heavy memory consumption for the buffering of the "packets".
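As a check on the relationship just described, here is a minimal sketch of store-and-forward latency, taking it as (P/B)*D since each of the D hops must receive the whole packet before forwarding it (variable names follow the text; the numeric values are arbitrary examples):

```python
def store_and_forward_latency(P, B, D):
    """Each of D hops stores the whole P-bit packet before forwarding,
    so the P/B transfer time is paid once per hop."""
    return (P / B) * D

# Doubling the packet length doubles the latency at *every* hop,
# which is exactly the drawback noted above:
slow = store_and_forward_latency(2048, 1e6, 4)
fast = store_and_forward_latency(1024, 1e6, 4)
print(slow == 2 * fast)  # -> True
```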

2. Circuit Switching

The circuit switching method is like a telephone system, in which a path from source node to
destination node is established before the message travels along it. This type of switching has
three phases: circuit establishment, transmission, and termination. The circuit establishment
phase is exactly what it sounds like: a circuit is set up from source node to destination node as
a special probe moves from node to node, and once the circuit exists, the entire message can be
sent at once. One rule applies whenever more than one message is being transmitted at a given
time: a collision may result between two messages if there is an intermediate node that both
messages need to pass through to reach their respective destinations. Circuit switching does
not allow more than one message to pass through a channel at a time, so the message that
arrived last must wait until the channel is released.

Here the Network Latency (NL) equation is

NL = (P/B)*D + M/B

where P is the probe (packet) length, B is the channel bandwidth, D is the number of nodes
along the path, and M is the message length.

The transmission phase streams the message along the established circuit, and during the
termination phase the circuit is torn down once the destination is reached.

The major advantage of circuit switching is that the elimination of buffering reduces memory requirements.

3. Virtual Cut Through --- Introduction to Flits

In this scheme, the message is initially divided into small subunits called flow control digits,
or flits. If the message is traveling and reaches a busy channel, the flits, or portions of the
message, are buffered at the intermediate node where the channel was unavailable.

Here, the Network Latency is computed by the following formula:

NL = (HF/B)*D + M/B

where B is the channel bandwidth, D is the number of nodes along the path, M is the message
length, and HF is the length of the header flit.
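The payoff of cut-through is visible when the two formulas above are compared side by side: only the header flit (or set-up probe) pays the per-hop cost, so a small HF beats a larger P for long messages. A minimal sketch, with arbitrary example numbers:

```python
def circuit_switching_latency(P, B, D, M):
    """A probe of length P sets up the circuit hop by hop ((P/B)*D),
    then the whole message streams through at once (M/B)."""
    return (P / B) * D + M / B

def virtual_cut_through_latency(HF, B, D, M):
    """Only the header flit pays the per-hop cost; the rest of the
    message pipelines along behind it."""
    return (HF / B) * D + M / B

# With a long message, the per-hop term is dominated by the header/probe
# length, so cut-through with a 64-bit header flit beats circuit set-up
# with a 1024-bit probe over the same 10-hop path:
B, D, M = 1e6, 10, 1e6
print(virtual_cut_through_latency(64, B, D, M)
      < circuit_switching_latency(1024, B, D, M))  # -> True
```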

4. Wormhole Routing

This method of switching is similar to the Virtual Cut Through mechanism, the difference
being that the buffer at each intermediate node holds exactly one flit. Also, packet
replication is now possible, meaning that copies of the flits can be sent out through several
output channels (multi-directional). Lastly, wormhole routing introduces the concept of
virtual channels, whereby multiple messages can use the same physical channel to transmit
their respective messages.
These virtual channels have several advantages: they increase network throughput by
reducing the time messages spend blocking the physical channels they pass through, they
can guarantee communication bandwidth to system-related functions such as debugging,
and they are used to avoid deadlock, a concept explained in the next paragraph.

In a deadlock, a set of messages are all blocked at once, each awaiting a free buffer to be
released by a message ahead of it. This problem can be resolved by rerouting the message at
the node where it is blocked, as occurs in the wormhole routing solution; by backtracking the
flits all the way to the source node and taking another path toward the destination; or by the
last solution, virtual channels.


Routing is the determination of a path, whether the shortest, the longest, or somewhere in
between, from the source to the destination node. Two broad categories of routing exist:
deterministic and adaptive. In deterministic routing, the path is determined by the source and
destination nodes alone; in adaptive routing, the intermediate nodes can "see" a blocked channel
on the way to the destination and can reroute onto a new path. This works much the way people
alter their travel routes when hearing about an accident on I-495.

Deterministic routing has two sub-categories. One of them is source routing, a method in
which the source node determines the entire routing path to the destination. In the other
method, distributed routing, each intermediate node participates along with the source node
in determining the path toward the destination.


Source Routing Methods----

Street Sign Routing

Street sign routing requires that the message header carry the entire path information:
at each node, the turn toward the destination has been predetermined. However, each
message header also has the capability to choose a default direction in case an
intermediate node's channel is in use. It does this by comparing the node's address to
detect whether a miss occurred.

Distributed Routing Schemes-----

Dimension Ordered Routing

Dimension ordered routing is best explained through the diagram provided in Figure 17.14. It is
a routing method in which a message moves along one "dimension" until it reaches a certain
coordinate, then moves along the next dimension, correcting one coordinate at a time. If the
source and destination nodes differ in only one dimension, the route is simply a straight line
along that dimension.
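The scheme above can be sketched in a few lines: correct all of dimension 0 first, then dimension 1, and so on. This is a generic illustration (the function name and mesh coordinates are my own, not from the text):

```python
def dimension_ordered_route(src, dst):
    """Dimension ordered routing on a mesh: finish every hop in
    dimension 0 before touching dimension 1, and so on."""
    path = [src]
    cur = list(src)
    for d in range(len(src)):
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step              # one hop along dimension d
            path.append(tuple(cur))
    return path

print(dimension_ordered_route((0, 0), (2, 2)))
# -> [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  (all X hops first, then Y)
```

Because every message crosses dimensions in the same fixed order, no cyclic channel dependency can form, which is why this deterministic scheme is deadlock-free on a mesh.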
Table Lookup Routing

In this type of routing, a table is built so that any given node knows where to forward its
message. Implemented in hardware, this style of routing is expensive, because a very large
lookup table requires a very large chip area; it is therefore better suited to software
implementation.
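A toy illustration of table lookup routing, using a hypothetical 4-node ring 0-1-2-3-0 (the table contents are mine, chosen to give shortest paths on that ring; they are not from the text):

```python
# Each node's table maps a destination address to the next-hop node.
# The table grows with the network size, which is the chip-area cost
# mentioned above.
routing_table = {
    0: {1: 1, 2: 1, 3: 3},
    1: {0: 0, 2: 2, 3: 2},
    2: {0: 3, 1: 1, 3: 3},
    3: {0: 0, 1: 0, 2: 2},
}

def table_route(src, dst):
    """Follow the per-node tables hop by hop until the destination."""
    path = [src]
    while path[-1] != dst:
        path.append(routing_table[path[-1]][dst])
    return path

print(table_route(0, 2))  # -> [0, 1, 2]
```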


There are two general categories of adaptive routing. One is referred to as profitable routing,
in which the only channels taken as the message travels from source node to destination node
are ones that move it closer to the destination. Profitable routing ensures a minimum path
length and tends to minimize the possibility of deadlock, discussed earlier. The other category
of adaptive routing, misrouting protocols, is more flexible and allows usage of the most
cost-efficient means of getting the message to its destination whenever the network "situation"
makes that appropriate. Adaptive routing types may also vary on whether the algorithm they
utilize allows backtracking the message to get out of a bad situation. If backtracking is
permitted, the header of the message must carry a tag so that it will not keep looping over the
same path; if the header becomes very large because of this, time is lost during the search.
Lastly, any given "protocol" in adaptive routing can be either partially adaptive, meaning it
can choose from a subset of the channels to move through, or completely adaptive, in which
case it can move to any channel it wants.

FINE GRAIN SYSTEMS described through the following example

Message Driven Processor and the J-Machine

The creation of the J-Machine at MIT fulfilled two important goals: it supported many parallel
programming models, such as the data flow model and the actor-based object-oriented model,
through the MDP (Message Driven Processor) chip, and it showed that combining many MDP chips
into a large number of nodes can produce what is called a fine grain system.
The MDP combines a processor, memory unit, and communication unit on a single chip.
The node connection method chosen is a 3D mesh, using a deterministic, wormhole
routing scheme.

The above diagram is the design layout of a typical MDP chip with six bi-directional bit-parallel
channels. There are six network ports corresponding to the six directions possible in a 3-D
grid (+X, -X, +Y, -Y, +Z, -Z). Here, message addresses are divided into flits corresponding to
the X, Y, and Z coordinates. When a message arrives at a node in dimension D, the node's
address is compared with the address flit. If there is a match, all the flits behind it "move one
over" in the direction of the destination node. (Recall that wormhole routing is in use.) It is
the router unit that makes this determination, and in the above diagram there are three such
router units, one for each of the three dimensions.

If a message is blocked, it remains compressed within the router unit, using a small flit buffer.
Once the path clears, the message is uncompressed and can move forward. Higher priority
messages are transferred more quickly, using separate virtual channels to reach their
destinations faster.

The MDP system also contains network input and output units. The output unit uses an
eight-word buffer each time a flit of the message is moved from one node to the next. The
input unit collects the message portion of arriving flits.
Architecture of Message Driven Processor

The MDP's value in the fine grain system has three aspects: it reduces the overhead of
receiving a message, it reduces context switch time, and it provides hardware support for
object-oriented programming.

There are three "levels" of execution, ordered by how urgently a message must be handled,
from least to most urgent: background priority, priority 1, and priority 2.

Type checking of messages is also possible in the MDP system through a 4-bit tag. Tags
named Cfut and Fut are used for synchronization checks, ensuring flits are passed correctly.

The network instructions here include SEND, SENDE, SEND2, and SEND2E, and are used to
send messages between processors.

Some typical commands include:

SEND R0,1 ; send destination address

SEND2 R1,R2,1 ; header and receiver

SEND2E R3[2,A3],1 ; two words sent and end of message

The first instruction sends the X, Y, and Z coordinates of the destination processor held in
R0. The second instruction transmits R1 (register 1) and R2 (register 2). Finally, the third
instruction combines a word from memory with the end of the message, letting the J-Machine
know the message has finished being sent.

Together, the message unit (MU) and instruction unit (IU) determine which message
has the highest priority and allow that one to be passed first, by suspending the
instruction currently in execution. Messages are stored in queue fashion, thus achieving the
well known FIFO property.

The J-Machine has several flaws. It has too few register sets (R0-R3) and therefore makes
many memory references, and some messages are locked out for a long time if they end up at
the back of the queue. The external bandwidth of the MDP structure is three times smaller than
that found in many network designs. Along with this, the inability to use floating point in the
J-Machine is a big downside, so we turn to medium grain systems.

Medium Grain Systems

The medium grain system is based on the Transputer, a complete microcomputer that combines
on a single chip a central processor, a floating point processor, static RAM, an interface for
external memory, and several communication links. Transputers are used to implement a
synchronous parallel programming model.
Two generations of Transputers have been developed over the past few decades. The first
generation was mainly used for applications involving signal processing: small systems
with quick communication links. Its main advantage was that it eliminated the use of a multi-bus
system, yet it was not much quicker in terms of computation. The newer generation incorporates
a C104 router chip and a better processing node structure.


The major features of the Transputer T9000 include a 32-bit integer processor, a 64-bit floating
point unit, 16 KB of CPU cache, four serial communication links, a virtual channel processor, and
a programmable memory interface (PMI). An internal crossbar connects the CPU cache, the PMI,
and the virtual channel processor to the four banks of the main cache.

Process management in Transputers

This particular Transputer has a small register set (six registers) yet still maintains fast
context switching. Register A is the stack top. The fourth register, W, points to the workspace
area. The fifth register acts as the PC (program counter), and the sixth register is responsible
for "elaborating" operands.

Processes can be in one of three states: Active, Ready, and Inactive. In the active state, the
program counter is loaded into register PC and register W points to the process's workspace. In
the ready state, the PC is saved in the process's workspace, and the processor maintains two
Ready lists (one for high priority and one for low priority processes). A Ready list, like a
normal queue, has a front and a back "pointer" for its head and its tail.

The last state of a Transputer process is the inactive state, in which the process awaits one of
the following: completion of a channel operation, arrival of a specified time, or access to an
open semaphore.

In order to switch from one process to another, the following steps are required:

1. Save the PC of the active process in its workspace
2. Add the workspace of the active process to the waiting queue
3. Restore the register PC from the workspace of the first process in the Ready list
4. Load register W with the workspace at the front of the queue, and set the front
to point to the next item on the waiting list
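The four steps above amount to rotating a ready queue of workspaces. A minimal sketch — the dict-based "workspace" representation is my own illustration, not the Transputer's actual memory layout:

```python
from collections import deque

def context_switch(active, ready):
    """Sketch of the four steps above: save the active process's PC in
    its workspace, append the workspace to the ready queue, then restore
    PC and W from the workspace at the front of the queue."""
    active["workspace"]["saved_pc"] = active["PC"]   # step 1: save PC
    ready.append(active["workspace"])                # step 2: enqueue workspace
    workspace = ready.popleft()                      # step 4: take queue front
    return {"PC": workspace["saved_pc"],             # step 3: restore PC
            "W": workspace,                          # W points at the workspace
            "workspace": workspace}

ready = deque([{"saved_pc": 200}])                   # one process already waiting
active = {"PC": 100, "W": None, "workspace": {}}
nxt = context_switch(active, ready)
print(nxt["PC"])  # -> 200
```

Because the state to save is so small (essentially one register spilled to the workspace), this is why the Transputer's context switch is fast despite its tiny register set.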

Channel Communication in Transputers

Communication between the receiver and the sender of a message is always a synchronized
process: only after the "partner process" is ready to use the channel can a sender transmit
its message. Before any message uses it, each channel in the network is set to 'Not
Process'. When a channel is to be used, the data length of the message, a pointer to it, and
the channel location itself are all stored in the register stack: register A holds the
length of the message, register C holds the memory address where the message is actually
stored, and register B tells which channel will be in use next. By looking in register B,
the identity of the next process becomes easy to obtain as control is transferred from
process P(i) to P(j).

Communication Between Neighboring Transputers

Each Transputer contains an OS-Link, used to chain together many Transputers. Data between
two Transputers is sent byte by byte. The OS-Link has a relatively slow speed of operation
compared to other hardware available on the market, but it is still cheaper to buy if cost is
a concern. However, problems may result with the OS-Link if the process mapping strategy for
linking the Transputers becomes complex.

Because of this, a new concept, known as virtual links, was developed. Virtual links are
built on a layered set of protocols: bit level, token level, packet level, and message
passing.

Bit level protocols use DS-Links rather than OS-Links and have four wires: a data line and a
strobe line in each of the input and output directions. In the bit level protocol, transmission
of messages is faster than before, but the receiving rate of messages is still somewhat slow.

Token level protocols are used to prevent the process sending the message from overrunning
the receiver's input buffer. Here, the receiver of a message lets the sender know that it is
'ok' to send more of the message toward the destination. Some of the token codes associated
with this include:
00 Flow control token(FCT)

01 End of packet(EOP)

10 End of message(EOM)

11 Escape Token(ESC)
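The FCT mechanism in the table above is essentially credit-based flow control: the sender may only transmit while it holds credit the receiver has granted. A minimal sketch — the class name and the assumption that one FCT grants credit for 8 tokens are mine, for illustration only:

```python
class CreditLink:
    """Sketch of FCT-style flow control: the sender spends one unit of
    credit per data token; the receiver grants credit by sending a flow
    control token (FCT) whenever it has freed buffer space.
    Assumption: one FCT grants credit for 8 tokens."""
    def __init__(self):
        self.credit = 0

    def receive_fct(self):
        self.credit += 8          # 00 FCT: receiver grants more credit

    def send_token(self):
        if self.credit == 0:
            return False          # blocked until the next FCT arrives
        self.credit -= 1
        return True

link = CreditLink()
print(link.send_token())          # -> False  (no credit yet: sender waits)
link.receive_fct()
print(link.send_token())          # -> True
```

This is why the sender can never overrun the input buffer: it physically runs out of credit before the receiver runs out of space.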

Packet Level Protocol

Here messages are packetized rather than handled with the flit method, but the idea is
essentially the same as in the token level protocol.

Message Passing Protocol

Here, a "virtual link" list is used that is similar to the idea of a ready list. Whenever a
message is sent or received, the VCP unit of the T9000 "deschedules" the message currently
being worked on; the identifier of the new process is saved and its execution begins. Once
this is complete, the waitlisted process continues at the command of the VCP unit. In this
way, the VCP lets messages of higher priority through.

Communication Between Remote Transputers

In a remote system, message packets no longer use the header in the same way. Instead, a C104
chip is used to implement a different routing switch scheme. The C104 chip, Figure 17.25, is
shown below.
The C104 chip uses a deterministic, distributed, minimal routing protocol called interval
labeling. The intervals on a given router are non-overlapping, and a packet's destination
address determines the interval, and hence the output link, on which the packet will be sent.
The routing decision unit for each port is named the interval selector unit; it contains
thirty-five limit and base comparators, and more than one interval can be assigned to it.
When a packet arrives at the C104 router, its header is sent to the interval selector unit
and the crossbar switch is set according to the message header. After this, all the tokens
of the packet pass through until the terminating token reaches its destination.
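Interval labeling reduces each routing decision to "which interval contains the destination address?". A minimal sketch — the link names and the interval boundaries are hypothetical examples, not the C104's actual configuration:

```python
def select_link(intervals, dest):
    """Interval labeling: each output link owns a half-open,
    non-overlapping address interval [base, limit); a packet leaves on
    the link whose interval contains its destination address."""
    for link, (base, limit) in intervals.items():
        if base <= dest < limit:
            return link
    raise ValueError("destination address not covered by any interval")

# Hypothetical 3-link router whose intervals cover addresses 0..63:
intervals = {"left": (0, 16), "down": (16, 40), "right": (40, 64)}
print(select_link(intervals, 25))  # -> down
```

Because the intervals are non-overlapping, the decision is unambiguous, and because every router applies the same fixed rule, the protocol is deterministic and distributed, exactly as described above.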

T9000 based machines

The Parasytec GC


1. 3D interconnection topology

2. Connects up to 16,384 nodes

3. Contains three network connections:

a. Data network
b. Control network
c. I/O network

4. Each cluster contains 17 Transputers:

a. Only 16 of them are visible to the user

b. Four links to each Transputer, each connected to C104
router chips
c. Each router has two links to neighbouring clusters
I. Coarse Grain Multicomputers

One group of coarse grain multicomputers is referred to as homogeneous multicomputers.
These still contain the three components necessary for parallel message passing: a
computation processor, a communication processor, and a router. They feature a RISC
superscalar architecture and deliver many MFLOPS per node, while the communication
processor supports the computation processor by keeping the cost of communication low.
In a second coarse grain multicomputer type, the router is replaced with a custom designed
communication switch; examples include the Paragon, the CS-2, and the Manna. In a third
architectural design of coarse grain multicomputers, custom designed communication
processors are used to reach a high performance level; this characteristic is also attributed
to the CS-2, which additionally contains vector processing units in each node. The last type
of coarse grain multicomputer is known as the hybrid multicomputer, which uses different
processor types for computation and communication. Examples here include the revised
Parasytec GC, the PowerPlus, and the IBM SP1 and SP2.

The major topology designs mainly used on coarse grain systems are the mesh, cube, and fat
tree, due to their lower implementation costs and better than average performance.

A. Homogeneous Architectures

Intel Paragon

The Intel Paragon is based on a 2D mesh topology. The Paragon contains three node types:
compute nodes, service nodes, and input/output nodes. The compute nodes' sole responsibility
is computation, and they are multiprocessor nodes, in contrast to the other two node types,
which are general purpose nodes. Each general purpose node contains an expansion port for
the addition of an input/output interface. The input/output nodes themselves provide
input/output connectivity, and the service nodes provide interactive use for many users.

The Multi Processor (MP) node in this architecture is a small shared memory multiprocessor.
The node's memory is shared among four processors, each with a 25 KB second level cache.
The MP node also has an i860 chip used as a message processor.

The message passing process is started on the application processor but is performed mostly
by the message processor. The software used in message passing executes out of the internal
cache just discussed.

The architecture also contains a message routing unit known as the Mesh Router
Controller (MRC), organized as a 2D mesh. The MRC allows message passing at the high
speed of 200 Mbyte/second. Each MRC is built from a 5x5 crossbar switch with flit buffers
and input controllers. Along with this, two block transfer engines are included to support
communication within a node. The Network Interface Controller (NIC) provides a pipelined
interface between the node's MRC and the node's memory.

B. Hybrid Architectures

Hybrid architectures use high-performance commodity microprocessors for computation, while
the communication processor is typically a Transputer whose internal hardware supports
communication-intensive work. In this case, first generation Transputers are combined with a
RISC superscalar architecture to build the multicomputer nodes: the Transputer handles the
communication network while the superscalar part of the architecture plays the role of the
computational unit. To distribute the operating system in this architecture, either the
operating system is located on the computation processor, or it runs on the Transputer, in
which case the computation unit is used as a coprocessor.
Parasytec GC/PowerPlus

The node in this system is comparable to the one found in the Transputer T9000 system,
though the new CPU 601 is much more powerful than the CPU found in the T9000. Instead of
the T9000's virtual channels, four T805 Transputers are used, giving sixteen bi-directional
links on the system instead of the four found on the T9000. Also, the number of communication
processors, the number of CPUs, and the amount of node memory provided are based on
customer need. The CPU and VCP in this system signal each other to reduce the overhead of
excess communication. The architecture retains the multi-chip concept along with the 3D mesh
interconnection topology. A single cube on the GC/PowerPlus machine has four sets of 5 nodes
connected through 16 C004 switch units. This system also uses a wormhole routing scheme for
message distribution.

Essentially, the coarse grain character of the system derives from its multi-threaded design,
and it borrows the virtual channel concept from the T9000. The application threads communicate
through the communication processor by placing a channel operation command and its parameters
in shared memory, which raises an interrupt to the VCP.