2.0 Introduction
2.1 Objectives
2.2 Pipelining
    2.2.1 General Considerations
    2.2.2 Arithmetic Pipelining
    2.2.3 Instruction Pipelining
2.3 Vector Processing
    2.3.1 Vector Operations
    2.3.2 Matrix Multiplication
    2.3.3 Memory Interleaving
    2.3.4 Supercomputers
2.4 Array Processors
    2.4.1 Attached Array Processors
    2.4.2 SIMD Array Processors
    2.4.3 Associative Array Processing
2.5
2.6
PIPELINING
The basic idea behind pipeline design is quite natural; it is not specific to computer technology. In fact, the name pipeline stems from the analogy with petroleum pipelines, in which a sequence of hydrocarbon products is pumped through a pipeline. The last product may well be entering the pipeline before the first product has been removed from the terminus. The key contribution of pipelining is that it provides a way to start a new task before an old one has been completed; the concept can be understood by analogy with an industrial assembly line. To achieve pipelining, one must subdivide the input task into a sequence of subtasks, each of which can be executed by a specialized hardware stage that operates concurrently with the other stages. Successive tasks are streamed into the pipe and executed in an overlapped fashion at the subtask level. Hence the completion rate is not a function of the total processing time, but of how soon a new task can be introduced. Consider figure 1, which shows a sequential process being done step by step over a period of time.
The total time required to complete this process is N units, assuming that each step takes one unit of time. In the figure, each box denotes the execution of a single step, and the label in the box denotes the number of the step being executed. To perform the same process using the pipeline technique, consider figure 2, which shows a continuous stream of jobs going through the N sequential steps of the process. In this case, each horizontal row of boxes represents the time history of one job, and each vertical column of boxes represents the activity at a specific time. Note that up to N different jobs may be active at any time in this example, assuming that we have N independent stations to perform the sequence of steps in the process. The pipeline timing of figure 2 is characteristic of assembly lines and maintenance depots as well as oil pipelines. The total time to perform one process does not change between figure 1 and figure 2, and it may actually be longer in figure 2 if the pipeline structure forces some processing overhead in moving from station to station. But the completion rate of tasks in figure 2 is one per cycle, as opposed to one task every N cycles in figure 1.
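The completion-rate argument above can be checked with a short calculation. The following sketch (illustrative Python; the step and job counts are arbitrary) compares the two timings, assuming one unit of time per step and ignoring any per-station overhead:

```python
def sequential_time(n_steps, n_jobs):
    """Figure 1 style: each job runs all n_steps before the next job begins."""
    return n_steps * n_jobs

def pipelined_time(n_steps, n_jobs):
    """Figure 2 style: the first job takes n_steps cycles to fill the pipe;
    after that, one job completes every cycle."""
    return n_steps + (n_jobs - 1)

# With 4 steps and 10 jobs: 40 cycles sequentially, but only 13 pipelined.
```

A single job gains nothing (`pipelined_time(4, 1)` is still 4 cycles); the benefit appears only when a stream of jobs keeps the stations busy.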
Arithmetic Pipelining
Pipeline arithmetic units are usually found in very high speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. We briefly examine floating-point multiplication and addition in order to construct a pipelined arithmetic unit for these operations. The simpler of the two operations to describe is multiplication, so we shall concentrate mainly on it.
1. The input values are assumed to be two normalized floating-point numbers represented by the tuples (mantissa-1, exponent-1) and (mantissa-2, exponent-2) respectively. The first step is to add the exponents to form exponent-out.
2. Multiply the two mantissas to form a double-length mantissa. This step may be overlapped with the addition of the exponents, and it may take several time units if the design breaks the multiplication into chunks that use common multiplier units over successive periods of time.
3. Examine the product mantissa; if it is not normalized, normalize it and adjust the exponent accordingly.
4. Round the product mantissa to a single-length mantissa, and if the rounding causes the mantissa to overflow, adjust the exponent.
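The steps above can be sketched as a chain of stage functions, one per pipeline segment. This is an illustrative Python model, not a hardware description: mantissas are represented as normalized decimal fractions in [0.1, 1), matching the decimal worked examples later in this section, and the 4-digit single-length mantissa is an assumption.

```python
def multiply_stage1(x, y):
    # Stage 1: add the exponents to form exponent-out.
    (m1, e1), (m2, e2) = x, y
    return m1, m2, e1 + e2

def multiply_stage2(m1, m2, e):
    # Stage 2: multiply the mantissas to form a double-length product.
    return m1 * m2, e

def multiply_stage3(m, e):
    # Stage 3: normalize. A product of two mantissas in [0.1, 1) lies in
    # [0.01, 1), so at most one left shift (scale by 10) is needed.
    while 0 < m < 0.1:
        m, e = m * 10, e - 1
    return m, e

def multiply_stage4(m, e, digits=4):
    # Stage 4: round to a single-length mantissa; adjust on overflow.
    m = round(m, digits)
    if m >= 1.0:
        m, e = m / 10, e + 1
    return m, e

def fp_multiply(x, y):
    """Pass the operands through the four stages in order."""
    return multiply_stage4(*multiply_stage3(*multiply_stage2(*multiply_stage1(x, y))))
```

For instance, 0.9504 E3 times 0.8200 E2 gives 0.7793 E5 after rounding. In a real pipeline, of course, the four stages operate concurrently on four different operand pairs.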
Based on these points we can formulate a linear pipeline in the manner shown in figure 7.
Figure 7: A linear pipeline for floating-point multiplication

Now let us consider floating-point addition. This is slightly more intricate than multiplication of floating-point numbers.
1. The input values are assumed to be two floating-point numbers represented by the tuples (mantissa-1, exponent-1) and (mantissa-2, exponent-2), exactly as for floating-point multiplication.
2. Subtract the exponents to find the difference exp-diff and to determine which exponent is larger. If exponent-1 is smaller than exponent-2, swap the operands so that exponent-1 is the larger.
3. Shift mantissa-2 to the right by the number of digit positions given by exp-diff.
4. Add the mantissas and initialize the exponent of the sum, exponent-sum, to exponent-1.
5. Renormalize the sum mantissa, adjusting exponent-sum to reflect the number of digit positions of adjustment required.
6. Round the sum to a single-length mantissa and, if the rounding causes the mantissa to overflow, renormalize and adjust the exponent.
The following example clarifies the operations. For simplicity we use decimal numbers. Consider two normalized floating-point numbers: X = 0.9504 E3 and Y = 0.8200 E2.
The two exponents are subtracted in the first segment to obtain 3 - 2 = 1. The larger exponent, 3, is chosen as the exponent of the result. The next segment shifts the mantissa of Y to the right to obtain: X = 0.9504 E3, Y = 0.0820 E3. This aligns the two mantissas under the same exponent. The addition of the two mantissas in segment 3 produces the sum: Z = 1.0324 E3. The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum: Z = 0.10324 E4. Figure 8 gives the pipeline for floating-point addition and subtraction.
The comparator, shifter, adder-subtractor, incrementer, and decrementer in the floating point pipeline are implemented with combinational circuits.
Instruction Pipelining
Pipeline processing can occur not only in the data stream but in the instruction stream as well. An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. One case that may degrade the performance of the system is the execution of a branch instruction: all the instructions already fetched have to be flushed out, and a new set of instructions fetched. That is, of course, a very primitive type of pipelining. A typical instruction-execution sequence in a pipelined system would be:
1. Instruction fetch: obtain a copy of the instruction from memory.
2. Instruction decode: examine the instruction and prepare to initialize the control signals required to execute it in subsequent steps.
3. Address generation: compute the effective address of the operands by performing indirection or indexing as specified by the instruction.
4. Operand fetch: for READ operations, obtain the operand from central memory.
5. Instruction execution: execute the instruction in the processor.
6. Operand store: for WRITE operations, return the resulting data to central memory.
7. Update program counter: generate the address of the next instruction.
The design of the pipeline is thus most efficient if the segments are divided such that each segment takes almost equal time to complete the task given to it.

Example: Four-Segment Instruction Pipeline

Assume that the decoding of the instruction can be combined with the calculation of the effective address into one segment. Assume further that most of the instructions place the result into a processor register, so that instruction execution and storing of the result can be combined into one segment. This reduces the instruction pipeline to four segments. While an instruction is being executed in segment 4, the next instruction in sequence fetches its operands in segment 3. The effective address may be calculated for a third instruction in segment 2, and yet another instruction may be fetched in segment 1. Thus, up to four suboperations of an instruction can overlap, and up to four instructions can be processed in the pipeline. Figure 10 shows the operation of the instruction pipeline. Time in the horizontal direction is divided into steps of equal duration. The four segments are indexed as follows:
1. FI: instruction fetch segment
2. DA: instruction decode and effective address calculation segment
3. FO: operand fetch segment
4. EX: execute segment
It is assumed that the processor has separate instruction and data memories, so that the FI and FO segments can proceed together. When the instruction and data memories are the same, the operand fetch is given higher precedence than the instruction fetch segment.
Figure 10: Timing of instruction pipeline

In the absence of branch instructions, all four segments operate on different instructions. In the case of a branch instruction, as soon as it is decoded in DA in step 4, the transfer from FI to DA of the next instruction is halted until the execution of the branch instruction is completed in step 6. If the branch is taken, a new instruction is fetched in step 7. If the branch is not taken, the instruction fetched previously in step 4 can be used. The pipeline then continues until a new branch instruction is encountered. Another delay may occur in the pipeline if the EX segment needs to store the result of an operation in the data memory while the FO segment needs to fetch an operand; in that case the FO segment must wait until the EX segment has completed its task. In general, there are three major difficulties that cause the instruction pipeline to deviate from its normal operation:
1. Resource conflicts, caused by access to memory by two segments at the same time. Most of these conflicts can be resolved by using separate instruction and data memories.
2. Data dependency conflicts, which arise when an instruction depends on the result of a previous instruction, but that result is not yet available.
3. Branch difficulties, which arise from branch and other instructions that change the value of the PC, such as Branch, Interrupt, Call and Return.
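The overlap shown in figure 10 can be written out as a small timing calculation. The sketch below is illustrative only: it assumes the ideal case with no branches and no memory conflicts, and simply records the step at which each instruction occupies each segment.

```python
SEGMENTS = ["FI", "DA", "FO", "EX"]

def pipeline_schedule(n_instructions):
    """Step (1-based) at which instruction i occupies each of the four segments.
    Instruction i enters FI at step i and advances one segment per step."""
    return {
        i: {seg: i + offset for offset, seg in enumerate(SEGMENTS)}
        for i in range(1, n_instructions + 1)
    }

sched = pipeline_schedule(4)
```

Instruction 1 completes EX at step 4; after the pipe is full, one instruction completes per step, so instruction 4 finishes at step 7 rather than at step 16 as it would without overlap.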
VECTOR PROCESSING
There is a class of computational problems that is beyond the capabilities of a conventional computer. These problems require a vast number of computations that may take a conventional computer days or even weeks to complete. In scientific and engineering applications, many such problems can be formulated in terms of vectors and matrices that lend themselves to vector processing. Vector computers have emerged as the most important high-performance architecture for numerical problems, combining the two key qualities of efficiency and wide applicability. Some of the applications where vector processing capabilities are in greatest demand are:
1. Long-range weather forecasting
2. Petroleum exploration
3. Analysis of seismic data
4. Medical diagnosis
5. Aerodynamics and flight simulations 6. Artificial intelligence and expert systems 7. Image processing applications
Supercomputers
A commercial computer with vector instructions and pipelined floating-point arithmetic operations is referred to as a supercomputer. Supercomputers are very powerful, high-performance machines used mostly for scientific computations. To speed up operation, the components are packed tightly together to minimize the distance that the electronic signals have to travel. A supercomputer is best known for its high computational speed, its fast and large memory system, and its extensive use of parallel processing. It is equipped with multiple functional units, each with its own pipeline configuration. Although the supercomputer is capable of the general-purpose applications found in all other computers, it is specifically optimized for the type of numerical calculations involving vectors and matrices of floating-point numbers.
ARRAY PROCESSORS
An array processor is a processor that performs computations on large arrays of data. The term is used to refer to two different types of processors. An attached array processor is an auxiliary processor attached to a general-purpose computer; it is intended to improve the performance of the host computer in specific numerical computation tasks. An SIMD array processor is a processor that has a single-instruction multiple-data organization; it manipulates vector instructions by means of multiple functional units responding to a common instruction. Although both types of array processors manipulate vectors, their internal organizations are slightly different. We shall now consider each of them in brief.
Attached Array Processor

The host computer is any general-purpose computer, and the array processor is a back-end machine driven by the host. The array processor is connected to the host computer through an input-output controller and interacts with it just like any other external device. The general-purpose computer performs the ordinary arithmetic and logical operations of the instruction stream, while the array processor satisfies the need for complex arithmetic operations.
An SIMD organization consists of multiple processing elements supervised by the same control unit. All processing elements receive the same broadcast instruction but work on different data streams. A very common example is the execution of a for loop, in which the same set of instructions is executed on different sets of data. An MISD organization consists of n processor units, each working on a different instruction stream but on the same set of data; the output of one unit becomes the input of the next. An MIMD organization implies interactions among the n processors, because all memory streams are derived from the same data space shared by all processors. If the interaction between the processors is high, the system is called tightly coupled; otherwise it is called loosely coupled. Most multiprocessors fit into this category.
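The SIMD idea of one broadcast instruction driving many data streams can be mimicked in a few lines. In this sketch the class names and the lane model are invented for illustration: the control unit broadcasts a single operation, and every processing element applies it to its own private data item.

```python
class ProcessingElement:
    """One SIMD lane: holds private data and executes whatever is broadcast."""
    def __init__(self, value):
        self.value = value

    def execute(self, operation):
        self.value = operation(self.value)

class ControlUnit:
    """Broadcasts a single instruction to every processing element."""
    def __init__(self, data):
        self.pes = [ProcessingElement(v) for v in data]

    def broadcast(self, operation):
        for pe in self.pes:          # same instruction, different data streams
            pe.execute(operation)

    def gather(self):
        return [pe.value for pe in self.pes]

cu = ControlUnit([1, 2, 3, 4])
cu.broadcast(lambda x: x * 10)       # every lane performs the same multiply
```

Here the four elements together perform what a for loop would do one element at a time; in real SIMD hardware the lanes operate simultaneously rather than in the sequential loop shown inside `broadcast`.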
CHARACTERISTICS OF MULTIPROCESSORS
A multiprocessor system is an interconnection of two or more CPUs with memory and input-output equipment. Multiprocessors fall under the MIMD category. The term multiprocessor is sometimes confused with the term multicomputer. Though both support concurrent operations, there is an important difference between a system with multiple computers and a system with multiple processors. In a multicomputer system there are multiple computers, each with its own operating system, which communicate with each other, if needed, through communication links. A multiprocessor system, on the other hand, is controlled by a single operating system, which coordinates the activities of the various processors, either through shared memory or through interprocessor messages. The advantages of multiprocessor systems are:
1. Increased reliability because of redundancy in processors.
2. Increased throughput because of execution of multiple jobs in parallel, or of portions of the same job in parallel.
A single job can be divided into independent tasks, either manually by the programmer or by the compiler, which finds the portions of the program that are data-independent and can be executed in parallel. Multiprocessors are further classified into two groups depending on the way their memory is organized. Processors with shared memory are called tightly coupled or shared-memory multiprocessors; information in these systems is shared through the common memory, though each processor can also have its own local memory. The other class is loosely coupled or distributed-memory multiprocessors, in which each processor has its own private memory, and information is shared through an interconnection switching scheme or by message passing.
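The idea of splitting one job into data-independent tasks can be illustrated with Python's standard `concurrent.futures` module. This is only a sketch: the threads here stand in for processors, and the job (a sum of squares) is chosen arbitrarily because its portions are fully data-independent.

```python
from concurrent.futures import ThreadPoolExecutor

def square(n):
    # Each task depends only on its own input -- no shared state.
    return n * n

def sum_of_squares_parallel(numbers, workers=4):
    """Divide a data-independent job among several workers, then combine."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(square, numbers))

def sum_of_squares_serial(numbers):
    return sum(n * n for n in numbers)
```

Both versions produce the same result; the parallel one simply spreads the independent portions across workers, exactly the decomposition a parallelizing compiler would look for.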
INTERCONNECTION STRUCTURES
The principal characteristic of a multiprocessor is its ability to share a set of main memory modules and some I/O devices. This sharing is made possible by physical connections between them, called interconnection structures. Some of these schemes are:
1. Time-shared common bus
2. Multiport memory
3. Crossbar switch
4. Multistage switching network
5. Hypercube interconnection
1. Time-Shared Common Bus

Figure 1: Time-shared common bus organization (CPU: Central Processing Unit; IOP: Input-Output Processor)

As all the processors share the same bus, only one processor can communicate with the memory at a time. The processor that is communicating with the memory at any given time has control of the bus. Any other processor requesting the services of the bus must wait for the bus to become free. As only one transfer can take place on the bus at any one time, the overall performance of the system is limited by the bandwidth of the bus. While one processor is communicating with the memory through the bus, the other processors can only work on local tasks.
2. Multiport Memory
This configuration is suitable for, and used in, both uniprocessor and multiprocessor systems. In this configuration there are separate buses connecting each CPU to each memory module. Each processor bus consists of the address, data and control lines needed for the CPU to communicate with the memory module. At any time there may be more than one processor requesting the services of a specific memory module; such conflicts are resolved by assigning fixed priorities to each memory port. The system can then be configured as necessary at each installation.
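The fixed-priority resolution at a memory port can be stated in a few lines. In this sketch the CPU names and the priority order are invented for illustration; each memory module's port logic grants the module to the highest-priority requester.

```python
def resolve_port(requests, priority=("CPU1", "CPU2", "CPU3", "CPU4")):
    """Grant a memory module to one of the simultaneously requesting CPUs.
    The earlier a CPU appears in the fixed priority order, the higher its
    priority; return None if no CPU is requesting this module."""
    for cpu in priority:
        if cpu in requests:
            return cpu
    return None
```

With requests from CPU2 and CPU3 arriving together, the port grants CPU2 and CPU3 simply retries on a later cycle; because the priorities are fixed in hardware, the outcome is deterministic.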
3. Crossbar Switch
Functionally, the crossbar switch interconnection is very similar to the multiport memory organization. The memory modules and the processors can be visualized as being placed in the form of a matrix (refer to figure 4(a)), with the memory modules representing the columns and the processors representing the rows. At each intersection point a switch is placed, whose control logic determines the path from the processor to the memory module. The processor places the address of the memory location to access on the bus. This address is captured by the switch, which checks for conflicts. In case of a conflict in the form of multiple requests, the switch resolves it according to the priority assigned to each processor. The path is then set up for the processor to communicate with the required memory module.
Figure 5: A 2 x 2 crossbar switch

Using multiple 2 x 2 switches of this type as building blocks, it is possible to build a multistage network to control the communication between multiple sources and destinations. These can be connected to each other either in the form of a binary tree (figure 7) or an omega network (figure 8). To understand the working of such systems, consider the binary tree of figure 7. The two processors P1 and P2 are connected through switches to eight memory modules, numbered in binary from 000 to 111. The path from a source to a destination is determined by the binary bits of the destination: the first bit of the destination number determines the switch output in the first level, the second bit specifies the output of the switch in the second level, and so on. For example, to connect P1 to memory 101, it is necessary to form a path from P1 to output 1 in the first-level switch, output 0 in the second-level switch, and output 1 in the third-level switch. P1 and P2 can be connected to any of the eight memory modules, but certain request patterns cannot be satisfied simultaneously. For example, if P1 is connected to one of the destinations 000 through 011, then P2 can only be connected to one of the destinations 100 through 111.
Figure 7: Binary tree with 2 x 2 switches

The processors and the memory can be interconnected in a number of ways. One such topology is the omega switching network (figure 8). In this network there is exactly one path from any processor to any memory module; however, two sources cannot be connected simultaneously to destinations that share the same switch in the final stage. For example, two sources cannot be connected simultaneously to destinations 000 and 001, or to 010 and 011, and so on.
Figure 8: 8 x 8 omega switching network

The configuration can be used for both loosely coupled and tightly coupled systems. For tightly coupled systems, the sources are the processors and the destinations are the memory modules. For loosely coupled systems, both the sources and the destinations are processing elements.
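The bit-by-bit path selection described for these networks is easy to express in code. The sketch below is illustrative and assumes 2 x 2 switches and three levels, as in figure 7; destination bit k selects the output of the level-k switch.

```python
def switch_settings(destination, levels=3):
    """Output selected (0 = upper, 1 = lower) at each level of 2 x 2 switches:
    bit k of the destination drives the switch at level k."""
    return [int(b) for b in format(destination, f"0{levels}b")]

def tree_conflict(dest_a, dest_b, levels=3):
    """In the binary tree of figure 7, both processors enter through the same
    first-level switch, so two requests collide exactly when they need the
    same first-level output (i.e., the same leading destination bit)."""
    return switch_settings(dest_a, levels)[0] == switch_settings(dest_b, levels)[0]
```

For destination 101 this yields outputs 1, 0, 1 at the three levels, matching the P1-to-memory-101 path in the text, and two destinations in the same half of the tree (say 001 and 011) conflict at the first level.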
5. Hypercube Interconnection
The hypercube, or binary n-cube, multiprocessor structure is a loosely coupled system composed of N = 2^n processors connected in an n-dimensional binary cube. Each processor forms a node of the cube; the nodes may be I/O processors, CPUs or even memory modules. In an n-dimensional hypercube, each node has a direct link to n other nodes. For example, consider figures 9(a), (b) and (c). Figure 9(a) represents a one-dimensional hypercube: it has 2^1 = 2 nodes, each connected directly to 1 neighbouring node, and the nodes can be numbered using 1 bit. Similarly, figure 9(b) shows a two-dimensional hypercube, which has 2^2 = 4 nodes, each directly connected to 2 other nodes; each node can be represented using 2 bits (00, 01, 10, 11). The same description extends to the three-dimensional hypercube and so on. Note that in this numbering of the nodes, a direct neighbour of a node is obtained by changing exactly one bit of its number. For example, the direct neighbours of node 001 are 000, 011 and 101. If a node has to communicate with another node to which it is not directly connected, it routes the message through directly connected nodes, forming a path. The following example illustrates this for a 3-dimensional hypercube (figure 9(c)). Node 010 can communicate directly with node 011, but to communicate with node 101 it has to traverse at least three links. The path could be 010 => 110 => 111 => 101, or alternatively 010 => 000 => 100 => 101. In a similar fashion a number of other possible routes can be derived.
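The one-bit-difference rule for neighbours, and the routing it implies, can be sketched directly with bitwise operations. This is illustrative code, not a real routing protocol; it corrects the differing bits one at a time (lowest bit first, which is an arbitrary choice), so each hop moves to a directly connected node.

```python
def neighbours(node, n):
    """Direct neighbours of a node in an n-dimensional hypercube:
    flip exactly one of the n address bits."""
    return sorted(node ^ (1 << bit) for bit in range(n))

def hypercube_route(src, dst, n):
    """One shortest path from src to dst: at each step, flip one bit in
    which the current node still differs from the destination."""
    path = [src]
    for bit in range(n):
        if (path[-1] ^ dst) & (1 << bit):
            path.append(path[-1] ^ (1 << bit))
    return path
```

For node 001 this gives the neighbours 000, 011 and 101 listed in the text, and routing from 010 to 101 takes three links (here via 011 and 001, one of the several equally short alternatives to the paths shown above).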
INTERPROCESSOR ARBITRATION
There are buses that transfer data between the CPUs and memory; these are called memory buses. An I/O bus is used to transfer data to and from input and output devices. A bus that connects the major components of a multiprocessor system, such as CPUs, IOPs and memory, is called a system bus. A processor in a multiprocessor system requests access to a component through the system bus. If no processor is accessing the bus at that time, it is given control of the bus immediately. If a second processor is already utilizing the bus, the requesting processor has to wait for the bus to be freed. If at any time more than one processor requests the services of the bus, arbitration is performed to resolve the conflict. A bus controller placed between the local bus and the system bus handles this.
The Least Recently Used priority scheme gives control to the device that has not used the bus for the longest time. A tag is maintained for each device and updated every time the device uses the bus. After a fixed time interval these tags are checked, and the device that has not used the bus for the longest interval gets the highest priority. In the First-In First-Out scheme, a queue of processors is maintained. After a processor is serviced, it goes to the tail of the queue, and the processor next in the queue gets the highest priority. The rotating daisy-chain is a combination of the static daisy-chain and first-in first-out schemes. The arrangement of the processors is the same as in the static daisy-chain, with the difference that the PO line of the last arbiter is connected to the PI line of the first arbiter, thus forming a circular queue. The arbiter currently being serviced has the highest priority, and the next in the queue has the second highest priority. After a device has been serviced, it gets the lowest priority, and the next in the queue gets the highest (figure 13).
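The rotating daisy-chain policy can be simulated in a few lines. This sketch models only the priority rotation, not the PI/PO signal lines; the device names are invented for illustration.

```python
from collections import deque

class RotatingDaisyChain:
    """After each grant, the serviced device drops to lowest priority and
    the device after it in the circular queue becomes the highest."""
    def __init__(self, devices):
        self.ring = deque(devices)   # ring[0] currently has highest priority

    def grant(self, requests):
        # Scan from the highest-priority device around the circular queue.
        for dev in list(self.ring):
            if dev in requests:
                # Rotate so the device after the winner leads the ring,
                # leaving the winner at the tail (lowest priority).
                while self.ring[0] != dev:
                    self.ring.rotate(-1)
                self.ring.rotate(-1)
                return dev
        return None                  # no device is requesting the bus

arb = RotatingDaisyChain(["P0", "P1", "P2", "P3"])
```

If P2 and P3 request together, P2 wins first (it is nearer the head of the ring); on the next round P3 outranks the freshly serviced P2, which matches the fairness property described above.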
A very common problem can occur when two or more processors try to access a resource that can be modified. For example, suppose processors 1 and 2 simultaneously access memory location 100, with processor 1 writing to the location while processor 2 is reading it. The chances are that processor 2 will end up reading erroneous data. Resources that need to be protected from simultaneous access by more than one processor are called critical sections. The following assumptions are made regarding critical sections:
Mutual exclusion: At most one processor can be in a critical section at a time.
Termination: The critical section is executed in a finite time.
Fair scheduling: A process attempting to enter the critical section will eventually do so in a finite time.
A binary value called a semaphore is usually used to indicate whether a processor is currently executing the critical section. A semaphore is a software flag with binary values 0 and 1, corresponding to the functional values free and busy. Whenever a processor wants to access the critical section, it checks the semaphore. If the status of the semaphore indicates that the section is free, the processor sets the semaphore to busy and starts executing the critical section. If the status of the semaphore is busy, the processor waits for the critical section to be freed before using it. This ensures that at any time only one processor is accessing the critical section.

CACHE COHERENCE

Cache memories are high-speed buffers inserted between a processor and the main memory to capture those portions of the contents of main memory that are currently in use. These memories are five to ten times faster than main memory, and therefore reduce the overall access time. In a multiprocessor system with shared memory, each processor has its own private cache, and whenever it accesses the shared memory it also updates this private cache. This introduces the problem of cache coherence, which may result in data inconsistency: several copies of the same data may exist in different caches at any given time.

Figure: Cache coherence. The data item X = 15 is stored in main memory; CM1 and CM2 are the local cache memories of processors P1 and P2. After P1 updates X, CM1 holds X = 150 while CM2 still holds X = 15.

For example, let us assume there are two processors P1 and P2, both holding the same copy of X (X = 15) in their caches. Processor P1 updates the value of X in its own private cache. As it has no access to the private cache of processor P2, processor P2 continues to use X with the old value unless it is informed of the change. Thus, if the system is to perform correctly, every update to a cache must be communicated to all the processors, so that they can make the necessary changes in their private copies. A number of solutions have been suggested to overcome this cache coherence problem. One of the simplest schemes is to disallow private caches for the processors and instead have a single shared cache memory associated with the main memory. This would, however, increase the access time of the processors. Let us
assume processors x and y want to access the memory at the same time. As the cache lies in the shared area, only one processor will be able to access it at a time, while the other has to wait for the cache to become free. A second scheme is to store only non-shared and read-only data in the cache. As such data cannot be updated, the copies held by each processor are always exact replicas; data of this type are called cachable, while data that are sharable and writable are noncachable. A third scheme is to keep writable data in only one cache. The status of each memory block is stored in a centralized global table maintained by the compiler: each block is categorized as read-only or read-write. All caches may hold copies of a read-only block, while only one cache keeps a copy of a read-write block. Thus, even if updates are made to the read-write block, the other caches are unaffected, because they do not contain that data.

Comparison of the three multiprocessor hardware organizations

Multiprocessors with time-shared bus:
1. Least complex and most inexpensive hardware.
2. Easily modifiable.
3. System capacity limited by the bus transfer rate.
4. Lowest system efficiency.
5. Suitable for small systems only.
Multiprocessors with crossbar switch:
1. Most complex and expensive hardware.
2. Functional units are cheapest, as they are in the form of a switch.
3. The organization is cost-effective for multiprocessors only.
4. System expansion usually improves the overall performance.
5. Expansion is limited by the size of the switch.
Multiprocessors with multiport memory:
1. Requires the most expensive memory units, as the switching circuitry is included in the memory.
2. The characteristics of the functional units permit a relatively low-cost uniprocessor to be assembled from them.
3. Very high potential transfer rate.
4. Size and configuration of the system are limited by the number of memory ports available.
5. A large number of cables and connectors is required.
Difference between synchronous and asynchronous transmission:

Synchronous transmission:
1. Data transfer is done during a time slice.
2. Source and destination are tied to the same clock frequency.
3. Overheads for data transmission are small.

Asynchronous transmission:
1. Data transfer can be done at any time.
2. No such clocking requirement.
3. Requires handshake signals, and thus has more overhead than synchronous transmission.
Difference between serial and parallel arbitration:

Serial arbitration:
1. The bus grant/acknowledge line is connected serially.
2. Priority is defined by hardware.
3. If one of the arbiters fails, the bus can be blocked.
4. Priorities cannot be changed easily.

Parallel arbitration:
1. The bus grant/acknowledge lines are connected in parallel.
2. Software-defined priority can be implemented.
3. A failed arbiter need not be used at all.
4. Priorities can be changed.