
Ryan DiBiasio and Chris Harris

Advanced Pipelining Techniques

Pipelining is a foundational concept in computing, born of the need to process data faster than a single-stage, bus-style architecture allows. The simplest pipeline is the 5-stage RISC pipeline, and it captures the idea behind every modern microprocessor: stages in hardware, as opposed to a single stage through which each instruction must propagate over multiple clock cycles. The promise of completing one instruction with every clock cycle led to pipelines becoming the only modern approach to computer architecture, and the variations on pipelining are correspondingly numerous. Many different advanced pipelining techniques have been developed in the time pipelining has been around, and many more will continue to appear. The following gives some insight into what the different techniques are, as well as how knowledge of past failures and successes can be used to predict what could be next for computing.

EPIC:
One notable design philosophy for microprocessor systems was created by a joint HP and Intel partnership and labeled EPIC. Standing for Explicitly Parallel Instruction Computing, the EPIC philosophy of programming a microprocessor had its roots in the Very Long Instruction Word (VLIW) approach to computing. The issue with VLIW was that it initially could not outpace the superscalar processors of its time: the VLIW approach wasted too much memory and had trouble keeping the pipeline full without inserting excessive bubbles. The engineers at HP developed a method in between the brute-force hardware parallelism of superscalar computing and the brute-force program-level parallelism of VLIW. The idea was that superscalar technology, while long considered the far better alternative to VLIW, would become too complex for practical use in the near future and would as a result hurt clock speeds (1). This idea, along with the expectation that a single-die microprocessor would soon be available (1), led HP engineers to develop EPIC as a design philosophy more than as a specific architecture.

EPIC has its roots in VLIW in that the processor takes advantage of instruction-level parallelism (ILP) by running multiple instructions through the pipeline at the same time. However, in order to surpass superscalar designs, which had been hailed as far superior to VLIW, EPIC had to encompass the best aspects of superscalar while fixing the pitfalls of VLIW. The initial concept was to simplify the hardware of the microprocessor as much as possible and to avoid out-of-order execution, a very complex mechanism that was common in superscalar architectures. The other key goal was to let the same program run on different implementations by tailoring it to operate efficiently on a given set of hardware. The VLIW approach did this by arranging instructions at compile time into fixed groups and filling the voids with blank data, while superscalar approaches did all of this adaptation in hardware, with complex circuits. The EPIC design used a refined spinoff of the VLIW approach that organized the code in a superscalar-like manner before the instructions were ever issued to the processor; a toy sketch of the underlying VLIW bundling idea follows below.
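As a minimal illustration of the VLIW side of this idea, the sketch below greedily packs independent operations into fixed-width instruction words, padding unused slots with NOPs. This is a hypothetical toy, not HP's actual compiler: the three-slot word width, the instruction format, and the register-overlap dependency test are all assumptions made for the example.

```python
# Toy VLIW-style bundler: pack independent instructions into fixed-width
# words, padding empty slots with NOPs.  A real EPIC/VLIW compiler does far
# more (latency-aware scheduling, predication, speculation); the three-slot
# word and the simple register-overlap dependency test are assumptions.
from dataclasses import dataclass

WORD_WIDTH = 3  # assumed number of issue slots per instruction word

@dataclass
class Instr:
    op: str
    dst: str
    srcs: tuple

NOP = Instr("nop", "", ())

def depends_on(a: Instr, b: Instr) -> bool:
    """True if b must wait for a (RAW, WAR, or WAW register overlap)."""
    return (a.dst in b.srcs                  # read-after-write
            or b.dst in a.srcs               # write-after-read
            or (a.dst and a.dst == b.dst))   # write-after-write

def bundle(instrs):
    """Greedily pack instructions into words of WORD_WIDTH independent ops."""
    words, current = [], []
    for ins in instrs:
        if len(current) == WORD_WIDTH or any(depends_on(p, ins) for p in current):
            words.append(current + [NOP] * (WORD_WIDTH - len(current)))
            current = []
        current.append(ins)
    if current:
        words.append(current + [NOP] * (WORD_WIDTH - len(current)))
    return words

prog = [Instr("add", "r1", ("r2", "r3")),
        Instr("mul", "r4", ("r5", "r6")),
        Instr("sub", "r7", ("r1", "r4")),   # depends on both ops above
        Instr("ld",  "r8", ("r9",))]

for i, word in enumerate(bundle(prog)):
    print(i, [w.op for w in word])
# 0 ['add', 'mul', 'nop']
# 1 ['sub', 'ld', 'nop']
```

The printed NOP slots are exactly the memory waste discussed next: padding that keeps the word format fixed but stores nothing useful.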
In EPIC proper, this arrangement was made by allowing the compiler to use knowledge of the processor design to arrange the code so it could be completed as efficiently as possible. There was one key constraint: the processor worked best when it was not an out-of-order processor, since the possibility of results arriving out of order would complicate the compiler's job.

However, with such lofty goals, starting from a VLIW approach brought many problems to overcome. The main issues with VLIW stemmed from how the compiler had to set up the code in order for it to be usable. VLIW segments code into blocks, which leaves holes and spaces not only in the pipeline but also in the memory where the code itself is stored. This way of encoding operations led to very large memory usage for a program's code, much of it padding. In addition, VLIW worked best when the code was compiled for a specific VLIW processor, which is not acceptable for most users, who do not want to buy new software for every architecture (and even worse for the programmers, who do not want to rewrite software for every architecture). EPIC addresses each of these individually, reducing the bubbles caused by interruptions with additional instructions for load speculation and extra methods of branch prediction, since stalls in the pipeline of a VLIW-based processor are significantly worse than stalls in other architectures. The other issue for EPIC to address was the large code size, which the engineers at HP handed to the compiler. To generate code for the processor, the compiler first optimizes the code using its knowledge of the hardware. By assuming how long each instruction will take to complete, the compiler schedules instructions with minimal holes so that they finish at roughly the same time, avoiding a large number of data hazards.

The EPIC architecture deals with many of the issues present in all architectures, but its more interesting solutions concern branch hazards and register renaming. Branch hazards, and with them branch prediction, are always an issue, and even more so in an architecture that uses the VLIW approach. To solve this, EPIC uses the compiler to assemble many different branch options into larger pieces and even to eliminate branches entirely where possible (1). The issue of register renaming stems from why it is done in a superscalar architecture in the first place: to handle data dependencies, a superscalar processor often renames registers to prevent data hazards from presenting themselves. The additional freedom given to the EPIC compiler to rework the code let it perform this renaming in the program itself, so an EPIC design could use the same number of registers as a superscalar processor while having all of them visible to the instructions at all times.

History of pipelining:
Pipelining was developed in the early years of microcomputer architecture, and it is part of what initially led the user base away from supercomputers and toward microprocessors. The basic concept is the 5-stage RISC pipeline, which allows one instruction to be completed per clock cycle and in general improves the overall throughput of a processor by a significant amount, as the timing sketch below illustrates.
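The following is a minimal sketch of why pipelining helps, modeling instruction timing through the classic five stages (IF, ID, EX, MEM, WB). The stage names and uniform single-cycle latencies are the textbook idealization, not any particular chip: with the pipeline full, one instruction completes per cycle, versus one per five cycles unpipelined.

```python
# Minimal sketch of ideal 5-stage pipeline timing (no hazards or stalls).
# Instruction i enters IF at cycle i and finishes WB at cycle i + 4, so a
# full pipeline retires one instruction per cycle.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timeline(n_instructions: int):
    """Return {instr: {stage: cycle}} for an ideal pipeline."""
    return {i: {s: i + j for j, s in enumerate(STAGES)}
            for i in range(n_instructions)}

n = 8
sched = timeline(n)
pipelined = max(sched[n - 1].values()) + 1   # cycles with pipelining
unpipelined = n * len(STAGES)                # one instruction at a time
print(f"{n} instructions: {pipelined} cycles pipelined, "
      f"{unpipelined} cycles unpipelined")
# 8 instructions: 12 cycles pipelined, 40 cycles unpipelined
```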
The chief distinctions in any pipeline design are the number of stages and how the stages are set up, or rather how the new stages divide up the tasks of the older stages. In general, the more numerous and more uniform the pipeline stages, the faster an instruction stream completes, because subdividing the work shortens the critical path through each stage and allows a faster clock cycle. The goal of processor engineers was initially to create as many pipeline stages as possible while keeping each stage's critical path roughly equal, so as to use the minimum clock period. However, as Intel realized a few years ago, there are some issues with this philosophy.

The original Pentium used a basic 5-stage pipeline. Intel then progressed quickly: the Pentium Pro ran a 12-stage pipeline, an approach that yielded a faster processor without significantly altering the underlying hardware technology. The next step was backwards a bit, to the Pentium III with a 10-stage pipeline, but on faster hardware that allowed significantly better clock speeds. The P3 design, however, was capped at right around 1 GHz, something the engineers at Intel looked to fix with the Pentium 4. They built the P4 with a 20-stage pipeline, allowing a drastic increase in attainable clock speed, with the downside that it did not actually accomplish twice as much as a P3 at twice the clock. To keep execution speed growing along with clock speed, they went deeper still, to a 31-stage pipeline in the Prescott cores of the P4 family. These cores topped out close to 4 GHz but in the end drew too much power (2). Intel's next step was to rein in the extreme stage counts. To keep power usage lower, they developed a newer, smaller architecture and ran the CPU with an efficient 14-stage pipeline. This pipeline was also wider than its predecessors, supporting 4 instructions at a time instead of the 3 supported by previous Intel processors. The shorter pipeline gave Intel fast processors without the issues associated with a very deep pipeline: larger branch misprediction penalties and an increased penalty on data hazards involving loads. Overall, Intel's pipelines initially grew deeper with time, but after the problems with the P4's NetBurst design, Intel dropped back to a more manageable depth and has not gone deeper since the move to the 14 stages of the Core microarchitecture (3).

ILP:
Pipelining is our practical means of meeting the demands of instruction-level parallelism. At the heart of this concept, we know that the demands of ILP cannot be perfectly met by hardware and coding schemes: imperfect branch and jump prediction, long memory accesses, and a limited number of registers will always be a problem. Nevertheless, since the 1960s we have been on a journey to make pipelined processors fit an ideal image of what instruction-level parallelism could look like. Although no processor yet fits that ideal, the techniques essential to today's processors are definite stepping stones toward that level of efficiency and throughput.

Scoreboarding:
There are a few options for implementing dynamic scheduling in a machine. One of the simpler yet effective methods of accomplishing multiple tasks simultaneously came with the scoreboard. The scoreboard exploits instruction-level parallelism by executing the instructions of a block simultaneously (a block here is a code sequence with no branches
except at the entry and exit). Scoreboarding then allows instructions to execute simultaneously as long as two conditions are met: there must be no data dependencies among the instructions being processed or recently issued, and the hardware resources must be available at the time the scoreboard logs the instruction. Instructions are therefore issued in order; however, if there are no dependencies, one instruction's execution may finish before an earlier one's (out-of-order completion). In this way, scoreboarding can be classified as a very simple form of dynamic scheduling (4). The CDC 6600 was one of the first machines to implement scoreboarding, and at the time it was unique, since superscalar architecture was not yet an established concept. Even though the CDC 6600 used a 10 MHz clock, its parallelism let it execute instructions at roughly the speed of a 40 MHz CPU (7).

Scoreboarding takes advantage of the MIPS-style approach to the pipeline in which structural and data hazards are checked in the early stages. Four stages handle each instruction: issue, read operands, execute, and write result. The issue stage checks that the hardware is available and analyzes the code for WAW hazards; if one is detected, the instruction is delayed. In the read-operands stage, the source operands are checked to see whether any previously issued instruction will write to them as a destination (RAW hazards). The instruction may then safely execute in the execute stage, and finally it notifies the scoreboard once it has obtained a result, so the scoreboard knows when it is safe to issue another instruction. After execution, the scoreboard checks for WAR hazards before writing the result to the destination. The scoreboard manages all of this by tracking the status of the instructions, the functional units, and the registers (4). A sketch of these hazard checks follows below.

Overall, one of the main purposes of this design is to reduce stalls; however, several factors work against that goal. As stated previously, this form of pipelining is useful only within code block segments, since branch conditions must be evaluated before the following instructions can be issued. Moreover, for the scoreboard and other processors using simultaneous execution, the window size, meaning the number of instructions the hardware can look ahead (not counting cached instructions), limits how many instructions can be issued in a cycle (4). The kinds and numbers of hardware elements are also a limitation, as structural hazards arise during dynamic scheduling. Finally, antidependencies (WAR) and output dependencies (WAW) still create stalls. Scoreboarding does not forward results to satisfy dependencies: both registers must be available before operands can be read. Results are, however, written into the register as soon as the execute stage completes, assuming no WAR hazard. One structural hazard arises because the scoreboard must communicate with the functional units, so the number of units in use cannot exceed the number of buses available. Unfortunately, scoreboarding does not take advantage of optimizing the code sequence, so a deeper pipeline might seem a good substitute; that, however, brings drawbacks for other parts of the system and the functional units.
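The sketch below models only the scoreboard's hazard bookkeeping described above, namely WAW at issue, RAW at read operands, and WAR before write result, over a table of pending instructions. The data structures and field names are invented for the example; a real scoreboard also tracks functional-unit status.

```python
# Minimal sketch of scoreboard hazard checks (WAW at issue, RAW at read
# operands, WAR before write).  "pending" holds instructions that have been
# issued but have not yet written their results.
from dataclasses import dataclass

@dataclass
class Instr:
    dst: str
    srcs: list
    stage: str = "issued"   # issued -> read -> done

pending = []

def can_issue(ins):
    # WAW: another pending instruction will write the same destination.
    return all(p.dst != ins.dst for p in pending)

def can_read_operands(ins):
    # RAW: a source is still the destination of an earlier pending instruction.
    return all(p.dst not in ins.srcs for p in pending if p is not ins)

def can_write_result(ins):
    # WAR: an earlier instruction still needs to read our destination.
    return all(ins.dst not in p.srcs
               for p in pending if p is not ins and p.stage == "issued")

i1 = Instr("r1", ["r2", "r3"])
i2 = Instr("r4", ["r1", "r5"])   # RAW on r1: must wait to read operands
assert can_issue(i1); pending.append(i1)
assert can_issue(i2); pending.append(i2)
print("i2 may read operands:", can_read_operands(i2))  # False until i1 writes
pending.remove(i1)                                     # i1 wrote its result
print("i2 may read operands:", can_read_operands(i2))  # True
```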
Because of these restraints, better implementations of dynamic scheduling came about that dealt specifically with the hazards and limitations scoreboarding could not handle. Even though the scoreboard developed a method for dealing with structural and data hazards, it was soon surpassed by Tomasulo's algorithm, which avoids WAW and WAR dependencies altogether through a register-renaming scheme.

Tomasulo:
After scoreboarding gained ground, more thought was given to minimizing the hazards that remained. The scheme provided by Robert Tomasulo aims to take care of exactly these hazards. Its main differences from scoreboarding are register renaming, reservation stations, a common data bus, and the fact that the algorithm can operate across branches. The technique has several highlights. For instance, the compiler can leave the code unoptimized and the machine still achieves high performance. Since caches had not yet been introduced when the scheme was designed, their later addition only compounded the algorithm's efficiency, because during a cache miss, out-of-order instructions can continue executing without penalty. For these reasons and more, this implementation of dynamic scheduling was adopted heavily during the growth of the 1990s.

Tomasulo's algorithm employs three stages for each instruction: issue, execute, and write result. The first improvement over other schemes appears right at issue. Reservation stations, which buffer the operands of issued instructions (in order to resolve dependencies), keep registers available and hold the operands ready to be operated on. From the coder's point of view, a single named register may be represented by two or more reservation stations simultaneously (6). This eliminates WAR and WAW hazards, because if an instruction fetches its operand from the reservation station rather than from the register, the register is left open for previously issued instructions to write to. This step effectively renames the registers specified in the code to whatever stations are available on that clock cycle. In the next stage, operands are placed into reservation stations as soon as they become available; once all of an instruction's operands are supplied, it can execute. Instructions must still be delayed, however, to avoid reading a value before it is written. When several instructions have their operands available on the same clock cycle, the functional unit takes them in arbitrary order, except for loads and stores, which are kept in the order they were written in the code. In the final step, the results are broadcast on a common data bus (7): instead of waiting for results to be placed in registers, dependent instructions can pick them up directly off the bus.

The IBM 360/91 was designed with Tomasulo's algorithm in order to achieve high floating-point performance. Even though it had only four registers with long operation latencies, and a two-address instruction format, the algorithm minimized the need to go through the registers by tracking instruction dependencies (6). With the common data bus and the reservation stations, many hazards can be avoided at the cost of an extra clock cycle: instead of results coming straight from a functional unit, they are buffered on the bus before the write stage takes place, and data need not be stored unless absolutely necessary. Compared with scoreboarding, data does not need to pass through a register between functional units, which increases how efficiently the system is used. A toy sketch of the renaming mechanism follows below.
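The sketch below illustrates the renaming idea at the heart of the algorithm: a register-status table maps each architectural register to the tag of the reservation station that will produce its next value, so later instructions wait on station tags rather than on register names. This is a toy under stated assumptions (a couple of stations, no timing model, single broadcast bus), not the IBM 360/91 design.

```python
# Toy Tomasulo-style issue: registers are renamed to reservation-station
# tags, so WAR/WAW hazards on register names disappear.
reg_status = {}          # register name -> tag of station producing it
stations = {}            # tag -> {"op", "operands": [value or tag], "dst"}
reg_file = {"r1": 10, "r2": 20, "r3": 30}
next_tag = 0

def issue(op, dst, srcs):
    """Rename srcs/dst and park the instruction in a reservation station."""
    global next_tag
    tag = f"RS{next_tag}"; next_tag += 1
    # A source is either a ready value or the tag of its producing station.
    operands = [reg_status.get(s, reg_file.get(s)) for s in srcs]
    stations[tag] = {"op": op, "operands": operands, "dst": dst}
    reg_status[dst] = tag      # later readers of dst now wait on this tag
    return tag

def broadcast(tag, value):
    """Common-data-bus write: wake up waiters and retire the station."""
    st = stations.pop(tag)
    for other in stations.values():
        other["operands"] = [value if o == tag else o
                             for o in other["operands"]]
    if reg_status.get(st["dst"]) == tag:   # only newest writer updates the reg
        reg_file[st["dst"]] = value
        del reg_status[st["dst"]]

t1 = issue("mul", "r1", ["r2", "r3"])   # r1 renamed to tag RS0
t2 = issue("add", "r2", ["r1", "r3"])   # reads r1 as tag RS0 (RAW respected)
broadcast(t1, 600)                       # RS0 completes: 20 * 30
print(stations[t2]["operands"])          # [600, 30] -- ready to execute
print(reg_file["r1"])                    # 600
```

Note how the second instruction never touches the r1 register file entry while waiting; it holds the tag RS0 and receives the value straight off the bus, which is exactly why WAR and WAW hazards on register names vanish.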
As far as branch prediction is concerned, the reservation stations work well when branches are correctly predicted; for example, multiple iterations of a loop can unfold in flight. Another addition to Tomasulo's algorithm was the reorder buffer, which essentially lets instructions issue in order, execute out of order, and then complete in order. In a sense the machine may not appear to be making use of dynamic scheduling, but all functional units are kept busy in the execute stage. Furthermore, Tomasulo's approach was built upon heavily by the addition of caches, hardware speculation, and branch prediction. With the ideas behind the approach, especially its handling of hazards and of wasted functional-unit time, Tomasulo's design had far more room for growth and advancement than scoreboarding.

Branch Prediction:
In discussing pipeline performance, we assumed that instruction-level parallelism occurs within segments or blocks of code bounded by branches. Because branches decide which instructions should be issued next, they drastically slow down the pipeline. To take a chance at decreasing this delay, branch prediction is implemented: if the prediction is right, there need be no penalty for carrying control dependencies through a branch instruction. Prediction splits into two categories, static and dynamic. Static branch prediction can be done when the code is compiled; some branching behavior, however, can only be predicted at run time, by the hardware.

On the static side, there are several methods of predicting branches. Since a branch has only two outcomes, they are easy to enumerate. The simplest method is to predict every branch as taken; however, the accuracy of this scheme is low, and it is generally less beneficial than the alternatives. A better concept is to base branch predictions on previous runs of the executed code: a far more accurate prediction can be made if, say, a branch is observed to be taken 90% of the time. As more conditional branches present themselves, though, static branch prediction loses effectiveness.

Dynamic schemes are useful in that they weigh the delay of branching against the cost of fetching from multiple PC addresses, and the logical way to implement them is to give the hardware a memory of past branches. Dynamic predictors are therefore built from small hardware buffers. One-bit prediction buffers have been used, but we would prefer to see two incorrect predictions before changing the prediction bit, so a two-bit buffer is more common; this accommodates branches that strongly favor being taken or not taken (4). If we think of the two-bit buffer as something like a cache entry, a taken branch moves the counter up one state and a not-taken branch moves it down, so two misses, or two hits, are required to flip the prediction. One could speculate about an n-bit counter, but this amounts to profiling, which reduces to asking whether 1 or 0 appears more often (8). Instead of a larger counter, it is more practical to keep a table of branch addresses that index separate 2-bit predictors (9). A somewhat stronger scheme still is to make predictions using correlating, or two-level, predictors, which look at the history of previous branch behavior to evaluate whether the current branch will be taken. Compounded onto the two-bit scheme, for example, the outcome of the previous branch can select between two separate two-bit counters for the current branch, one correlated with the previous branch being taken and one with it being not taken, and the gain in prediction accuracy can be substantial.
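The following is a minimal sketch of the two-bit scheme just described: each branch address indexes a table of saturating counters, and two consecutive mispredictions are needed before a strong prediction flips. The table size and the indexing by low address bits are assumptions made for the example.

```python
# Two-bit saturating-counter branch predictor.  Counter states 0-3:
# 0 and 1 predict not-taken; 2 and 3 predict taken.  Two consecutive wrong
# outcomes are needed to flip a strong prediction (state 0 or 3).
TABLE_SIZE = 1024                 # assumed number of entries
table = [1] * TABLE_SIZE          # start weakly not-taken

def predict(pc: int) -> bool:
    return table[pc % TABLE_SIZE] >= 2

def update(pc: int, taken: bool) -> None:
    i = pc % TABLE_SIZE
    if taken:
        table[i] = min(3, table[i] + 1)
    else:
        table[i] = max(0, table[i] - 1)

# A loop branch: taken nine times, then not taken once at loop exit.
pc = 0x4000
outcomes = [True] * 9 + [False]
correct = 0
for t in outcomes:
    correct += (predict(pc) == t)
    update(pc, t)
print(f"{correct}/{len(outcomes)} predicted correctly")   # 8/10
# The single exit misprediction does not flip the counter all the way,
# so the branch is still predicted taken on the next loop entry.
print("predict taken next time:", predict(pc))            # True
```

This is exactly the hysteresis the text describes: the one not-taken outcome at loop exit costs a misprediction, but the predictor stays in the taken half of the state space for the next run of the loop.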

Overall, branch prediction is extremely helpful in any pipelining strategy because it reduces the delay in executing instructions in parallel. Instead of waiting to evaluate a condition before exploiting instruction-level parallelism, resources are kept continually busy, with the mindset that it is better to keep the functional units working and be wrong a small percentage of the time than to waste time waiting for condition confirmations.

In all, we have seen pipelining grow and continually prove its usefulness over the years. The hazards present at the beginning occur less and less with each new design and algorithm, and even physical limitations pose less of an obstacle as branch prediction grows more accurate. What remains now is the refinement of these technologies to fit the processing needs of our present world.

Works Cited

(1) Michael S. Schlansker and B. Ramakrishna Rau. "EPIC: An Architecture for Instruction-Level Parallel Processors." Compiler and Architecture Research, HP Laboratories, February 2000. http://www.hpl.hp.com/techreports/1999/HPL-1999-111.pdf
(2) Brandon Bell. "Intel's Core 2 Extreme/Duo Processors." July 13, 2006. http://www.firingsquad.com/hardware/intel_core_2_performance/
(3) Scott M. Fulton. "First Intel next-gen news: Lower wattage, fewer pipeline stages." Tom's Hardware US, August 23, 2005. http://www.tomshardware.com/news/intel,1317.html
(4) Hennessy, John L., David A. Patterson, and Andrea C. Arpaci-Dusseau. Computer Architecture: A Quantitative Approach. 4th ed. Amsterdam: Morgan Kaufmann, 2007. Print.
(5) Grishman, Ralph. Assembly Language Programming for the Control Data 6000 Series and the Cyber 70 Series. New York, NY: Algorithmics Press, 1974.
(6) Stone, Harold S. High-Performance Computer Architecture (Addison-Wesley Series in Electrical and Computer Engineering). Repr. with corrections. Glenview: Addison-Wesley Educational Publishers Inc., 1988. Print.
(7) R. M. Tomasulo. "An efficient algorithm for exploiting multiple arithmetic units." IBM Journal of Research and Development, vol. 11, pp. 25-33, 1967.
(8) P. G. Emma. "Branch prediction." Chapter 3 of D. Kaeli and P.-C. Yew (eds.), Speculative Execution in High Performance Computer Architectures. CRC Press, Boca Raton, FL, 2005.
(9) S. McFarling. "Combining branch predictors." Technical Report WRL TN-36, Western Research Laboratory, Digital Equipment Corporation, Palo Alto, California, June 1993.
