Overview of Processors

Chapter 1
1-Intel Pentium Pro

1.1 Overview
The Intel Pentium Pro was launched after the Pentium and delivered about 50% higher performance
at the same clock speed. It offers super-pipelining, stronger multiprocessing support, and fewer
pipeline stalls in its RISC core. The Pentium Pro was aimed at workstations and high-end servers.
It remained popular in servers for many years because its integrated level 2 cache was well
suited to multiprocessing and because it outperformed the original Pentium.
1.2 Instruction Set architecture
It uses the x86 instruction set with the Pentium and Pentium Pro extensions, which it executes
internally as RISC-style micro-operations.
1.3 Type of instruction set
The Intel Pentium Pro takes CISC instructions and converts them into RISC micro-operations. This
conversion reduces limitations inherent in the x86 instruction set, such as irregular instruction
encoding and register-to-memory arithmetic operations. The micro-operations are then passed to an
out-of-order execution engine that determines whether they are ready for execution. If they are
not, the micro-operations are reordered to prevent pipeline stalls. Internally, the
micro-operations use a register-to-register architecture.
1.4 Micro architecture Diagram
1.5 Pipelining stages, clock cycle time, clock speed

It has a pipeline depth of 14 stages because of its super-pipelining, whereas the earlier Pentium
used 5 execution steps. The 14 stages are divided into three sections, called the three
independent-engine approach. The first eight stages form the in-order front end, which handles
decoding and issuing of instructions. The next three stages form the out-of-order core, which
executes the instructions. The last three stages handle in-order retirement.
Super-pipelining provides simultaneous, or parallel, processing within a CPU. It overlaps
operations by moving data or instructions into a conceptual pipe, with all stages of the pipe
processing simultaneously. Its clock speed is 200 MHz, which corresponds to a clock cycle time of
5 nanoseconds.
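The relationship between clock speed and clock cycle time is T = 1 / f. The short sketch below
works this out for a clock given in MHz; the function names are illustrative only, and the
pipeline latency figure assumes an idealized pipeline with no stalls.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Return the clock period in nanoseconds for a frequency given in MHz."""
    # period [s] = 1 / frequency [Hz]; with 1 MHz = 1e6 Hz and 1 s = 1e9 ns
    # this simplifies to period [ns] = 1000 / frequency [MHz].
    return 1000.0 / clock_mhz


def pipeline_latency_ns(stages: int, clock_mhz: float) -> float:
    """Time for one instruction to pass through every stage, ignoring stalls."""
    return stages * cycle_time_ns(clock_mhz)


if __name__ == "__main__":
    print(cycle_time_ns(200))            # 5.0 ns per cycle at 200 MHz
    print(pipeline_latency_ns(14, 200))  # 70.0 ns through 14 stages
```

A deeper pipeline does not shorten the time a single instruction needs; the gain comes from many
instructions overlapping in the pipe.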
1.6 Techniques being used to detect and correct hazards
The basic hazards are data hazards, structural hazards, and control hazards.
1.6.1 Hazard resolution
Static method: performed at compile time in software.
Dynamic method: performed at run time in hardware, using one of the following:
Stall
Flush
Forward
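To illustrate the difference between stalling and forwarding, here is a minimal sketch that counts
stall cycles for read-after-write data hazards. It assumes a textbook single-issue, in-order
five-stage pipeline, not the Pentium Pro's actual out-of-order logic, and the latencies in the
comments are the usual textbook values. Flushing (for control hazards) is not modelled.

```python
def count_stalls(program, forwarding):
    """Count stall cycles caused by read-after-write hazards.

    program: list of (dest, sources, is_load) tuples in program order.
    Assumed textbook latencies: without forwarding a dependent instruction
    must issue at least 3 slots after its producer; with forwarding it can
    issue 1 slot after an ALU instruction and 2 slots after a load.
    """
    ready = {}   # register -> earliest issue slot at which it can be read
    slot = 0     # issue slot of the current instruction
    stalls = 0
    for dest, sources, is_load in program:
        earliest = max([ready.get(reg, 0) for reg in sources], default=0)
        if earliest > slot:
            stalls += earliest - slot   # insert bubbles until operands are ready
            slot = earliest
        if forwarding:
            latency = 2 if is_load else 1
        else:
            latency = 3
        ready[dest] = slot + latency
        slot += 1
    return stalls


if __name__ == "__main__":
    program = [
        ("r1", ["r2"], True),         # lw  r1, 0(r2)
        ("r3", ["r1", "r4"], False),  # add r3, r1, r4
        ("r5", ["r3", "r1"], False),  # sub r5, r3, r1
    ]
    print(count_stalls(program, forwarding=False))  # 4
    print(count_stalls(program, forwarding=True))   # 1
```

With forwarding, only the load followed immediately by a use still costs a stall cycle.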
1.7 Branch prediction Strategies
The Pentium Pro has a branch target buffer twice the size of the Pentium's, and its prediction
accuracy is higher.
Branch Target Buffer
Stores information about previously executed branches
Indexed by instruction address
Specifies branch destination + whether or not taken
512 entries
Branch Processing
Look for instruction in BTB
If found, start fetching at destination
Branch condition resolved early in WB
If prediction correct, no branch penalty
If prediction incorrect, lose ~3 cycles
Which corresponds to > 3 instructions
Update BTB
Critical to Performance
11-15 cycle penalty for misprediction
Handling BTB misses

Detect in cycle 6
Predict taken for negative offset, not taken for positive
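The lookup and fallback behaviour listed above can be sketched as follows. This is a toy model
under stated assumptions: the buffer is treated as direct mapped with a full address tag, and each
entry remembers only the last outcome. Both are simplifications chosen for illustration and are
not taken from the Pentium Pro documentation. Only the 512 entries, the indexing by instruction
address, and the static rule for misses come from the text.

```python
class BranchTargetBuffer:
    """Toy BTB: direct mapped, last-outcome prediction (both are assumptions)."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}  # slot index -> (branch address, target, taken last time)

    def predict(self, pc, target):
        """Return (predicted_taken, source) for the branch at address pc."""
        entry = self.table.get(pc % self.entries)
        if entry is not None and entry[0] == pc:
            return entry[2], "btb"
        # BTB miss: static rule, taken for a negative offset (backward branch,
        # typically a loop), not taken for a positive offset.
        return target < pc, "static"

    def update(self, pc, target, taken):
        """Record the resolved outcome of the branch at address pc."""
        self.table[pc % self.entries] = (pc, target, taken)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(btb.predict(0x1000, 0x0F00))  # (True, 'static'): backward branch
    print(btb.predict(0x1000, 0x1100))  # (False, 'static'): forward branch
    btb.update(0x1000, 0x1100, True)
    print(btb.predict(0x1000, 0x1100))  # (True, 'btb'): now found in the BTB
```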
1.8 Multiprocessing
It has a single core, and it supports up to 4 processors on a compatible motherboard.
1.9 Support for multithreading and different types of multithreading being supported
It has a single core and does not support multithreading.
1.10 Turbo boost
It does not support turbo boost.
1.11 Types and hierarchy of caches being used
1.11.1 Register
Registers are at the top of the memory hierarchy. In the Pentium Pro they are 32 bits wide and
provide the fastest data access in an x86 CPU. They are also the most expensive form of storage.
1.11.2 Level 1 Cache
This is the next level of memory access in the Intel Pentium Pro. The level 1 cache is a
fixed-size cache of 16 KB: 8 KB for instructions and 8 KB for data. Level 1 cache costs much less
per bit than registers, but it is slower to access. It is 2-way set associative.
1.11.3 Level 2 Cache
After the registers and the level 1 cache, the level 2 cache comes next in the Pentium Pro's
cache hierarchy. It is an integrated level 2 cache, so it runs at the same clock speed as the
processor (a 5 ns cycle at 200 MHz).
Because the cache is integrated, it cannot be expanded without replacing the CPU. The level 2
cache can be as large as 1 MB. It is also very expensive and requires a large chip, which is
considered a drawback, and Intel did not use an integrated level 2 cache of this kind in the
versions that followed.
Chapter 2
2 MIPS R5000
2.1 Overview
In 1996 the R5000 processor was released by MIPS Technologies Inc.
The design goal for the R5000 was to provide several advanced features at a low cost. This was
accomplished by borrowing many features from the more advanced R10000 processor while retaining a
low chip complexity. The role of the R5000 was to provide good performance for the mid- to
low-range segments of the workstation market. Its main strength was its good price/performance
ratio in graphics processing.
2.2 Instruction Set architecture
The R5000 is a 64-bit superscalar RISC processor, but it is also capable of running in 32-bit
mode. It has 32 integer registers and 32 floating-point registers, all of which are 64 bits wide.
The processor can execute a single-precision floating-point instruction and a dissimilar
instruction (integer, load/store, etc.) simultaneously.
It implements the MIPS IV instruction set, which uses a register-to-register architecture.
2.4 Micro architecture Diagram
It consists of two sub-processors: the CPU and CPI.
2.5 Pipelining stages, clock cycle time, clock speed
The MIPS R5000 has five pipeline stages.
Every instruction the R5000 processes is sent through the five-stage pipeline. Having a pipeline
allows the chip to process several instructions at the same time. The five stages are as
follows:
1. Instruction fetch. Two instructions are fetched. The CPU checks which unit should handle each
instruction.
2. Register access. The CPU accesses the register file. Possible branch addresses are calculated.
3. Execute. The FPU and the integer unit execute the instructions.
4. Access data cache. The primary data cache is accessed by the instructions. Virtual address to
physical address translation is performed.
5. Write-back. The processed information is written to the appropriate destination.
Its clock speed is 200 MHz, which corresponds to a clock cycle time of 5 ns.
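As a rough illustration of why a five-stage pipeline helps, the sketch below counts cycles for an
idealized pipeline: the first instruction needs one cycle per stage and every later instruction
completes one cycle after its predecessor. It assumes single issue and no stalls, so it ignores
the R5000's ability to issue two instructions together; the function names are illustrative only.

```python
def pipelined_cycles(n_instructions: int, stages: int = 5) -> int:
    """Cycles to finish n instructions in an ideal single-issue pipeline."""
    if n_instructions <= 0:
        return 0
    # The first instruction fills the pipe (one cycle per stage); each
    # further instruction retires one cycle later.
    return stages + n_instructions - 1


def speedup_over_unpipelined(n_instructions: int, stages: int = 5) -> float:
    """Ideal speedup versus running every instruction start to finish."""
    unpipelined = n_instructions * stages
    return unpipelined / pipelined_cycles(n_instructions, stages)


if __name__ == "__main__":
    cycles = pipelined_cycles(100)
    print(cycles)                         # 104 cycles for 100 instructions
    print(cycles * 5, "ns at 200 MHz")    # 520 ns with a 5 ns cycle
    print(speedup_over_unpipelined(100))  # approaches 5 as n grows
```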
In addition to substantially increasing the amount of forwarding required, this longer-latency
pipeline increases both the load and branch delays.
1. Structural Hazard
Load delays are 2 cycles, since the data value is available at the end of DS.
2. Data Hazard
A load instruction followed by an immediate use results in a 2-cycle stall. Forwarding is
required for the result of a load instruction to a destination that is 3 or 4 cycles later.
3. Control Hazard
The basic branch delay is 3 cycles, since the branch condition is computed during EX.
There is no support for dynamic branch prediction.
2.8 Multiprocessing
The initial R5000 did not support multiprocessing, but the package reserved eight pins for the
future addition of this feature.
Later versions of the R5000 use these 8 external signals for multiprocessor support. The signals
allow for arbitration and data coherency between processors. Symmetric multiprocessing (SMP)
systems implementing the full MESI (Modified, Exclusive, Shared, Invalid) cache consistency
protocol in both primary and secondary caches, as well as other styles of multiprocessing, will
be supported by the R5000 family. The performance of SMP systems based on the R5000 is expected
to scale well up to 4 processors.
It does not support multithreading.
2.10 Turbo boost
It does not support turbo boost.

2.11 Types and hierarchy of caches being used
The MIPS R5000 has both internal and external caches to achieve higher performance. Each cache is
two-way set associative.
2.11.1 Internal cache
Its on-chip cache totals 64 KB: a 32 KB data cache and a 32 KB instruction cache. That is double
the cache size of its predecessors. The larger caches increase performance significantly, since
applications can more often avoid accessing the slower level 2 cache (or the much slower RAM).
Both the D-cache and the I-cache are two-way set associative, virtually indexed and physically
tagged. A fixed line size of 32 bytes, or 8 words, applies to both caches. The D-cache loads
8 bytes of data each cycle, which results in a bandwidth of 1.6 GB/s at 200 MHz.
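The figures above can be checked with a little arithmetic. The sketch below derives the number of
sets and the address bit split for a 32 KB, two-way, 32-byte-line cache, and recomputes the
bandwidth. It assumes power-of-two sizes, 1 KB = 1024 bytes and 1 GB = 10^9 bytes; the function
names are illustrative only.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    # For powers of two, bit_length() - 1 is the base-2 logarithm.
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def bandwidth_gb_per_s(bytes_per_cycle: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a given clock in MHz."""
    return bytes_per_cycle * clock_mhz * 1e6 / 1e9


if __name__ == "__main__":
    print(cache_geometry(32 * 1024, ways=2, line_bytes=32))  # (512, 5, 9)
    print(bandwidth_gb_per_s(8, 200))                        # 1.6
```

So each 32 KB cache has 512 sets, with 5 address bits selecting the byte within a line and 9 bits
selecting the set.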
2.11.2 External Cache
The R5000 has an optional level 2 cache of 512 KB, 1 MB or 2 MB. The processor has a dedicated
secondary cache interface, which carries signals between the processor, the secondary cache and
the secondary cache tag RAM. The external cache supports both write-back and write-through data
transfer protocols. The data in the secondary cache is accessed through a 64-bit-wide system bus.
2.11.3 Secondary Cache Support
The MIPS R5000 processor has a dedicated secondary cache interface. These signals provide an
efficient interface between the processor, the secondary cache, and the secondary cache tag RAM.
All tag RAM interface signals
Chapter 3
3 Intel Itanium (IA-64 Merced)
3.1 Overview
Itanium, also called Merced, is the first generation of processors built on the IA-64
architecture. It is the result of a cooperation between Intel and Hewlett-Packard whose goal was
to create a 64-bit post-RISC architecture for heavy server applications.
3.2 Instruction Set architecture
It is a 64-bit post-RISC architecture for heavy server applications.
It is a register-to-register architecture.
The format used to send instructions to the CPU is called very long instruction word (VLIW).
Each of these words is 16 bytes (128 bits) long and is called a bundle. A bundle is organized
into three 41-bit instruction slots and a template field. Each instruction can take one or two
slots depending on the size needed. The template field contains information about the
instructions in the bundle, and it tells whether the bundle can be executed in parallel with the
next bundle.
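The bundle sizes add up as follows: 16 bytes are 128 bits, and three 41-bit slots take 123 bits,
which leaves 5 bits for the template. The sketch below packs and unpacks such a bundle as a Python
integer. The field order (template in the lowest bits, followed by slots 0 to 2) is an assumption
made for this illustration, and the helper names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 128 - 3 * SLOT_BITS        # 5 bits remain for the template
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template: int, slots: list) -> int:
    """Pack a template and three 41-bit slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instruction slots")
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot does not fit in 41 bits")
        # Assumed layout: slot i starts right after the template and
        # any earlier slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & TEMPLATE_MASK
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots


if __name__ == "__main__":
    bundle = pack_bundle(0x10, [1, 2, 3])
    print(len(bundle.to_bytes(16, "little")))  # 16 bytes per bundle
    print(unpack_bundle(bundle))               # (16, [1, 2, 3])
```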

3.5 Pipelining stages, clock cycle time, clock speed
The Itanium has a highly parallel 10-stage pipeline. It can execute up to 6 instructions in
every clock cycle. The pipeline is divided into two groups of units, the front end and the back
end. The front end consists of the IPG, FET and ROT units, which are responsible for fetching
instruction bundles and doing branch prediction. It also does pre-fetching, which hides fetch
latency. The back end consists of the other units and is responsible for executing the
instructions. Bundles are passed between the two through a decoupling buffer with room for 8
bundles. This decoupling makes it possible for the front end to keep running, fetching and
pre-fetching, if the back end stalls.
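The effect of the decoupling buffer can be sketched with a small cycle-by-cycle simulation. The
capacity of 8 bundles comes from the text, and 2 bundles per cycle follows from 6 instructions
per cycle at 3 slots per bundle. The order of events inside a cycle (the back end drains before
the front end refills) and the function name are assumptions made for this illustration.

```python
from collections import deque


def simulate(backend_stalled, capacity=8, fetch_per_cycle=2, issue_per_cycle=2):
    """Simulate a front end feeding a back end through a bounded buffer.

    backend_stalled: one boolean per cycle, True when the back end is stalled.
    Returns (bundles fetched, bundles issued, buffer occupancy per cycle).
    """
    buffer = deque()
    fetched = issued = 0
    occupancy = []
    for stalled in backend_stalled:
        # Back end: consume bundles unless it is stalled this cycle.
        if not stalled:
            for _ in range(min(issue_per_cycle, len(buffer))):
                buffer.popleft()
                issued += 1
        # Front end: keep fetching as long as the buffer has free room.
        free = capacity - len(buffer)
        for _ in range(min(fetch_per_cycle, free)):
            buffer.append(fetched)
            fetched += 1
        occupancy.append(len(buffer))
    return fetched, issued, occupancy


if __name__ == "__main__":
    stalls = [False, True, True, True, True, True, False, False]
    print(simulate(stalls))
    # (12, 4, [2, 4, 6, 8, 8, 8, 8, 8])
```

During the stall the front end keeps fetching until the buffer holds 8 bundles, so the back end
finds work waiting as soon as it resumes.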
The pipeline stages are given below:
Instruction Pointer Generation
Fetch
Rotate
Expand
Rename
Word-Line Decode
Register Read
Execute
Exception Detect
Write Back
Its clock speed is 800 MHz, which corresponds to a clock cycle time of 1.25 ns.

3.6 Techniques being used to detect and correct hazards
The basic hazards are data hazards, structural hazards, and control hazards.
3.6.1 Hazard resolution
Static method: performed at compile time in software.
Dynamic method: performed at run time in hardware, using one of the following:
Stall
Flush
Forward
In the Itanium design the complexity of branch prediction has been moved from the CPU to the
compiler (EPIC). The CPU relies on the compiler to put information about branches into the code
being executed. The benefit is that the compiler has much more time to decide on a branch and has
access to the complete code, so it can often make better decisions. It also speeds up execution,
since the same decision does not have to be made every time the code runs. However, the runtime
behavior of a program is not always obvious from its code. This sometimes makes it impossible for
the compiler to correctly predict what could probably be predicted at runtime by logic on the CPU
itself. Using the information handed over by the compiler, the pre-fetch engine of the Itanium
can see to it that the instructions needed can be fetched from the L1 cache.
3.8 Multiprocessing
The Itanium supports multiprocessing.
Yes, the Intel Itanium (IA-64) architecture has support for Hyper-Threading. Hyper-Threading
introduces the idea of thread-level parallelism and is a way of making better use of
under-utilized functional units in superscalar (and VLIW) processors. It is equally applicable to
IA-32 and has appeared in later Pentiums (and in TV adverts). Essentially, in many processors,
many functional units are underused much of the time. Since a lot of modern code is
multi-threaded (particularly server code, and servers will be targeted first), it makes sense to
try to interleave multiple threads simultaneously. The hardware pretends to be multiple virtual
processors. This has to be done carefully. For example, suppose we have a simple machine with one
integer unit and one floating-point unit, executing mainly integer code. The limiting factor in
such a machine is likely to be the single integer unit, and unless we can interleave a separate,
mainly floating-point thread, we are unlikely to see any increase in performance. In fact, we are
likely to slow things down. However, suppose we have many integer units. Inter-instruction
dependencies are now likely to be the limiting factor, and one integer thread is unlikely to keep
many units busy simultaneously.
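The one-integer-unit example can be put into numbers with a very rough model. The sketch below
assumes every operation occupies its functional unit for one cycle and ignores inter-instruction
dependencies and any switching overhead, so it shows only the best case. The function names and
the operation counts are illustrative assumptions.

```python
import math
from collections import Counter


def cycles_shared(threads, units):
    """Best-case cycles when all threads share the functional units.

    threads: list of dicts mapping a unit kind to an operation count.
    units: dict mapping a unit kind to the number of such units.
    """
    demand = Counter()
    for ops in threads:
        demand.update(ops)   # add up the operations needed per unit kind
    return max(
        (math.ceil(count / units[kind]) for kind, count in demand.items()),
        default=0,
    )


def cycles_one_at_a_time(threads, units):
    """Best-case cycles when the threads run one after another."""
    return sum(cycles_shared([ops], units) for ops in threads)


if __name__ == "__main__":
    units = {"int": 1, "fp": 1}
    integer_thread = {"int": 100}
    float_thread = {"fp": 100}
    # Two integer threads compete for the single integer unit: no gain.
    print(cycles_shared([integer_thread, integer_thread], units))         # 200
    print(cycles_one_at_a_time([integer_thread, integer_thread], units))  # 200
    # An integer thread and a floating-point thread overlap fully.
    print(cycles_shared([integer_thread, float_thread], units))           # 100
    print(cycles_one_at_a_time([integer_thread, float_thread], units))    # 200
```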
3.10 Turbo boost
It does not support turbo boost.
3.11 Types and hierarchy of caches being used
3.11.1 Caches
The cache memory of the Itanium consists of a 2 or 4 MB off-die L3 cache, a 96 KB L2 cache and a
32 KB L1 cache. The L1 cache is divided in half between instructions and data. As can be seen in
Figure 8.1, the latency of the L3 cache is quite high, which is one of the main bottlenecks of
the CPU. The other latencies in the memory hierarchy are also quite high, and the bandwidth of
the memory bus is rather poor. All these things were vastly improved in the next generation of
the CPU.
Chapter 4
4.1 Critical thinking
Here is a comparative analysis of the three processors discussed in the previous chapters.
Manufacturer and version | Intel Pentium Pro | Intel Itanium (IA-64 Merced) | MIPS R5000
ISA support | x86 | Very wide RISC instructions | RISC (MIPS-4)
No. of pipelining stages | 14 | 10 | -
Clock speed | 200 MHz | 800 MHz | 200 MHz
No. of transistors | 5.5 million | 221 million | 3.6 million
Multiple instruction issue | Yes, 2-way | - | Yes, 2-way
Benefits/Drawbacks | Very difficult to increase cache size, because of its integrated cache. | It was one of the fastest processors of its time. It had an intense heating problem. | As mentioned, it was one of the fastest of its time with high memory support, which is why it sometimes had process-crashing problems at optimum usage.
Cost analysis | Costly at its time because its target was server and gaming machines. | Its cost was very high at its time. | We can increase the memory size by incurring more cost, because it has an extra memory slot.
Memory hierarchy | 16 KB level 1 and integrated level 2 cache. | It has caches up to level 3 (L1, L2, L3). | L1 and L2, internal and external cache, plus secondary cache support.
Multi-threading | No | Yes | No
4.2 Critical analysis of Pentium pro

The Pentium Pro was launched in 1995. Its CPU core consists of 5.5 million transistors, not
counting the level 2 cache. It was launched for servers and high-end workstations. It is a
superscalar processor with high-end features, optimized for 32-bit operation. The other, more critical
distinction of the Pentium Pro is its handling of instructions. It takes the Complex Instruction Set
Computer (CISC) x86 instructions and converts them into internal Reduced Instruction Set
Computer (RISC) micro-ops. The conversion is designed to help avoid some of the limitations
inherent in the x86 instruction set, such as irregular instruction encoding and register-to-memory
arithmetic operations. The micro-ops are then passed into an out-of-order execution engine that
determines whether instructions are ready for execution; if not, they are shuffled around to
prevent pipeline stalls.
The main characteristics of the Pentium Pro are given below.
Super pipelining: The Pentium Pro dramatically increases the number of execution steps, to 14.
32-Bit Optimization: The Pentium Pro is optimized for running 32-bit code and so gives a greater
performance improvement over the Pentium when using the latest software, back in 1995.
Wider Address Bus: The address bus on the Pentium Pro is widened to 36 bits, giving it a maximum
addressability of 64 GB of memory.
Greater Multiprocessing: Quad processor configurations are supported with the Pentium Pro,
compared to only dual with the Pentium.
Out of Order Completion: Instructions flowing down the execution pipelines can complete out of
order.
Superior Branch Prediction Unit: The branch target buffer is double the size of the Pentium's and
its accuracy is increased.
Register Renaming: This feature improves parallel performance of the pipelines.
Speculative Execution: The Pro uses speculative execution to reduce pipeline stall time in its
RISC core.
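The 64 GB figure for the wider address bus follows directly from the number of address bits. A
minimal sketch of the arithmetic, assuming byte addressing and binary units (1 GB = 2^30 bytes);
the function names are illustrative only:

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address bus width."""
    return 1 << address_bits


def to_gb(n_bytes: int) -> float:
    """Convert bytes to gigabytes, using 1 GB = 2**30 bytes."""
    return n_bytes / 2**30


if __name__ == "__main__":
    print(to_gb(addressable_bytes(32)))  # 4.0 GB with a 32-bit address bus
    print(to_gb(addressable_bytes(36)))  # 64.0 GB with a 36-bit address bus
```

Each extra address bit doubles the reachable memory, so four more bits give a factor of 16.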
4.2.1 Major drawbacks in Pentium pro

It has been very difficult for Intel to manufacture the Pentium Pro at the volumes and price
levels necessary for it to become a mainstream processor. There are two main reasons for this.
First, the cache itself is highly miniaturized and so much more expensive to produce than the
everyday SRAM chips used on a Pentium motherboard for level 2 cache. Second, some problems with
the cache are not found until after it has been mated with the processor and installed in their
shared package; once this happens the whole package (including the processor) must be thrown
away, reducing yields and increasing prices. Owing to the problems with its design, Intel has
abandoned the integrated-cache concept, and it is unlikely that any future PC processors will use
it in the same way that the Pentium Pro does.
4.3 Critical analysis of Itanium

At its core the Itanium is basically a RISC design, but with multiple functional units that
enable it to execute 6 instructions in each cycle (6-wide superscalar). What makes this
architecture stand out from an ordinary RISC design is the way instructions are fed to and
decoded by the CPU. One of its benefits is that the complexity of branch prediction has been
moved from the CPU to the compiler (EPIC). The CPU relies on the compiler to put information
about branches into the code being executed. Since the compiler has much more time to decide on a
branch and also has access to the complete code, it can often make better decisions. This also
speeds up the execution of the code.
4.3.1 Major drawbacks in Itanium
The main reason for its failure was heat. The Itanium was extremely power hungry and ran very
hot. It used up to 130 watts in some tests, which appears to be a really serious problem. As some
articles indicate, the problems may arise from the choice of a VLIW design, which may not be
suitable for a general-purpose CPU such as the Itanium.
4.4 Critical analysis of MIPS R5000
The R5000 processor obtains high performance at low cost. New features include the following:
The R5000 processor runs the MIPS IV instruction set which contains additional floating-point
instructions such as the multiply-add (MADD) which accelerates geometry-processing in 3-D
graphics applications.
A dual-issue instruction mechanism allows floating-point load (or store) and floating-point
ALU instructions to be dispatched in the same cycle. Highly pipelined floating point ALU
instructions greatly enhance 3D graphics performance.
An on-chip write buffer enhances bus performance by facilitating pipelined write operations.
Separate integer and floating-point ALUs enhance performance in mixed integer/floating point
applications by allowing long latency instructions such as floating-point divide and square-root
operations to be performed at the same time as integer ALU operations.
Large on-chip instruction and data caches, each 32 Kbytes in size, allow many common
applications to run within the primary cache, enhancing performance by reducing the number of
accesses to both secondary cache and main memory.
The multi-processing capability of the R5000 processor enables multiprocessor servers with
additional processors added at low-cost. This system configuration is especially valuable for the
Windows NT midrange server market.
4.5 Conclusion
In a nutshell, every processor has its own benefits and drawbacks. The Intel Pentium Pro and the
MIPS R5000 were both initially designed for server machines, but the Pentium Pro failed because
it lacked expandability: its cache is fixed in the processor package, and we cannot increase its
size. In the R5000 we can increase memory, because it has an extra memory slot. All three are
based on reduced instruction set computing and suffer fewer pipeline stalls.
The Intel Itanium's clock speed was much higher than that of the other two. It runs at 800 MHz
with 221 million transistors, and it is also newer than the Pentium Pro and the MIPS R5000, which
have 5.5 million and 3.6 million transistors respectively.
The Pentium Pro and the R5000 do not support multithreading, and it cannot be added to them, but
the Itanium does support multithreading. Of the three processors discussed above, I would
recommend building a machine around the Intel Itanium because of its high clock speed and its
multithreading support. As mentioned above, the Intel Itanium had a heating problem, which is why
it failed, and Intel overcame this issue in its later processors.