
Cell:

An overview of Sony, IBM, and Toshiba's multi-core processor

Max Morgan
Jamie Zhou

May 13, 2015

In 2005, the frequency wars between CPU vendors were dying down.
Performance gains from instruction level parallelism were stagnating between
generations, and increasing frequency even further resulted in astronomical power
consumption. More and more architects were shifting their focus to thread level
parallelism instead of instruction level parallelism to yield the gains that consumers
needed. Sony's ambitious Cell processor was an early attempt at taking advantage of
this shift to hardware level parallelism, emphasizing a substantially higher
quantity of weaker processors. In theory, the processor was capable of over 200
Gflops[1] of performance compared to an Athlon 64 x2 4600+'s 19 Gflops, a
staggering order of magnitude difference in performance. The processor was built
on two critical concepts:
1. One General Purpose Processor
2. Eight Weaker Processors
The Cell processor had a radically different set of priorities compared to rival
processors from AMD and Intel. The PS3 was the first major implementation of the
Cell processor, and also its last. This paper will cover the major quirks of Cell, as
well as the benefits and disadvantages behind each of them. Both theoretical and
real world performance will be considered, as well as the implications of their
relationship. Whether the design principles of the Cell processor were considerably
ahead of their time or only plausible in theory, its unusual design was Sony's
attempt at shifting the fundamental paradigms of processor design and computer
architecture.

One General Purpose Processor


The Cell processor was composed of one PowerPC Processing Element (PPE),
highlighted in Figure 1, and eight Synergistic Processing Elements (SPEs),
highlighted in Figure 2. The PPE is perhaps the simplest part of the design, and
represents a transitional step between conventional instruction level parallelism
(ILP) based designs and Cell's thread level parallelism (TLP).

Figure 1: PPE Highlighted[2]

The PPE is relatively similar to other consumer processors on the market. It
features two in-order dual-issue cores, simultaneous multithreading (similar to
Intel's hyperthreading), a 64KB L1 cache, and a 512KB L2 cache. It has 96 registers
divided three ways into general purpose registers, floating point registers, and
vector registers.[2] While in-order instruction execution was an unusual design
choice, the PPE's flexibility was intended to ease the transition for developers used
to working on IBM's other chips. Programs that could run on IBM's other
architectures could first run on the PPE, then be optimized to take advantage of the
eight SPEs.
Eight Weaker Processors
The Cell's SPEs are where its ingenuity really shines through. Each SPE is a
fully functional processor, with several restrictions. SPEs are similar to the PPE in
that they are in-order and dual-issue, but are limited in that they have no cache or
branch predictor.[2] These restrictions are where Sony's decision to utilize in-order
instruction execution pays off. Some context is important, so the first design
decision to consider is why Sony chose in-order over out-of-order execution.
In-Order vs. Out-of-Order
By the time the Cell processor rolled out on the market, in-order processors
were very uncommon as general purpose processors.

Figure 2: SPEs Highlighted[2]

In combination with caches, out-of-order processors were capable of mitigating
most of the latency caused by memory accesses by dynamically reordering
instructions while they are in flight. In-order processors, in comparison, take a
massive latency hit when they miss the cache, idling for hundreds of cycles while
the processor retrieves the appropriate data from memory. In exchange, in-order
processors have a much simpler physical implementation as well as a shortened
pipeline, leading to vastly cheaper and less power intensive designs.[2]
Mitigating Issues
By far, the biggest issue facing in-order processors is memory latency.
Idling while incredibly slow memory retrieves data wastes CPU cycles that could be
spent calculating something else. The modern approach of relying on caches is not
viable for in-order processors: while caches have high hit rates after warming up,
misses kill the processor's throughput and, even worse, are unpredictable. In
addition, large caches have been shown not to benefit the use cases Cell targets,
yielding less than a 2% improvement in games and media.[2] Accessing memory
directly is even worse. Cell sidesteps the problem by giving each SPE a local store
of 256KB of SRAM, memory with speed similar to a cache. By accessing this fast
memory explicitly through instructions instead of depending on hardware
algorithms to prioritize high demand memory accesses, memory latencies become
predictable while still being fast in most cases.[3]
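The software-managed local store can be pictured in plain C. The following is a conceptual sketch only: real SPE code would issue DMA commands rather than call memcpy, and the buffer size here is arbitrary. The program explicitly stages a chunk of "main memory" into a small local buffer, works on it there, and writes it back, so latency is paid only at the explicit, predictable copy points.

```c
#include <assert.h>
#include <string.h>

#define CHUNK 64  /* arbitrary stand-in; a real SPE local store is 256KB */

/* Process an array in chunks staged through a small local buffer.
 * Instead of a hardware cache deciding what to keep, the program
 * copies a chunk in ("DMA in"), works on it locally, and copies it
 * back out ("DMA out"). */
void add_constant(float *main_mem, int n, float c) {
    float local[CHUNK];                 /* plays the role of the local store */
    for (int off = 0; off < n; off += CHUNK) {
        int len = (n - off < CHUNK) ? (n - off) : CHUNK;
        memcpy(local, main_mem + off, len * sizeof(float));   /* "DMA in"  */
        for (int i = 0; i < len; i++)
            local[i] += c;              /* all work happens on local data */
        memcpy(main_mem + off, local, len * sizeof(float));   /* "DMA out" */
    }
}
```

The key property is that every memory transfer is visible in the source, so the compiler (or programmer) can schedule computation to overlap with it.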
Another interesting design decision Sony made was omitting the branch
predictor.[2] While this reduces hardware cost in each individual SPE, a saving
multiplied eightfold across the processor, it also means the processor is completely
dependent on software branch prediction. To help avoid branches, each SPE has
128 registers that can be utilized in loop unrolling. In combination with using SRAM
instead of caches, the Cell moves complexity from the hardware to the compiler. By
scheduling instructions around predictable memory latencies and minimizing
branches through loop unrolling, SPEs have a relatively low hardware cost while
providing an immense level of thread level parallelism in the hands of crafty
programmers.
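Loop unrolling can be illustrated in plain C (a generic sketch, not SPU-specific code): the loop body is replicated so the loop-closing branch executes once per four elements instead of once per element, which matters on a core that has no branch predictor to hide mispredicted branches.

```c
#include <assert.h>

/* Scale an array in place, unrolled four ways. The rolled version
 * takes one backward branch per element; this version takes one per
 * four elements, cutting branch count (and potential pipeline
 * flushes on a predictor-less core) by roughly 4x. */
void scale4(float *a, int n, float k) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* one branch per 4 elements */
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)             /* leftover elements */
        a[i] *= k;
}
```

Each replicated statement is independent of the others, so a dual-issue core can also pair them without stalling.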
Other Important Features
High Bandwidth Memory Controller
The Cell processor needs extremely high memory bandwidth to
accommodate up to ten threads, and uses an on-die dual channel XDR memory
controller capable of 25.6 GB/s of memory bandwidth, almost reaching GPU
memory bandwidth.[2] Having a high bandwidth memory controller on die plays a
substantial part in reducing memory latencies, again serving Cell's primary
directive of low memory latency and high thread level parallelism at an affordable
price.

Pseudo-NMOS Logic

Cell implements logic gates in a different fashion than most processors.
Most processors implement gates through static complementary metal oxide
semiconductor (CMOS) logic. Static CMOS gates are relatively slow and have a high
transistor cost, but vastly simplify circuit design; most hardware description
languages have static CMOS gates already written and ready to use. Cell instead
uses pseudo n-type metal oxide semiconductor (NMOS) gates, which use fewer
transistors and lower power usage.

Figure 3: CMOS Transistor[2]

Figure 4: NMOS Transistor[2]

The downside comes into play during manufacturing and design. Static
CMOS gates are normally organized behind a latch, and only that latch needs to be
clocked, as in Figure 3;[2,3] individual CMOS gates are not clocked. Cell's
implementation requires each NMOS gate to have two separate clocks, which
makes routing clocks quite complicated. NMOS gates are also vastly less common,
and are not available in libraries like most static CMOS gates. Finally, NMOS gates
require a more complicated manufacturing process than CMOS gates. In short,
NMOS gates trade a large amount of design and manufacturing effort for lower
power usage and transistor count.
Real World Implications
The Cell is a unique design from an architecture perspective in its extremely
high level of parallelism. Some of the decisions made, like supporting ten hardware
threads, distinguish the Cell in terms of computational ability. Others, like the lack
of a branch predictor, make it a liability for real world implementation. From a
computer architect's perspective, the tradeoffs may seem agreeable, especially for
applications that can take advantage of the Cell's high level of thread level
parallelism. Each SPE utilizes a single instruction multiple data (SIMD) architecture,
which means that each SPE is capable of operating on multiple elements of data in
each clock cycle. The amount of data that can be processed depends on the
operation being performed.[1]
The Cell's SPEs are unique in that they specialize in vector operations. In fact,
the Cell processor can only perform scalar operations through the preferred slot:
the scalar data element must be shifted into the preferred slot, operated on, and
then shifted back into its original location. In other words, SPEs were designed to
do vector operations almost exclusively. This limits software design for the
processor. Regular C code uses scalar operations, such as 4 + 5, as its most basic
operations. Fortunately, there are C language extensions that expose vector
operations; one example is known as intrinsics. This matters because using the
SPEs for scalar calculations does not come close to the theoretical throughput they
are capable of. SPEs are able to handle both scalar and vector operations
interchangeably, but any scalar operation is suboptimal. To eliminate scalar
operations entirely, individual elements of a vector should not be inserted or
extracted; instead, vector operations such as shifts, rotations, and shuffles may be
performed to modify vectors.[1] Since the vast majority of modern CPUs utilize
scalar processing,[4] many traditional developers will have to dedicate time to
learning vector operations. Graphics processing units (GPUs), however, do utilize
vector processing, so a developer experienced with GPUs would potentially be
better suited to programming the SPEs. The problem is that a developer working
with the GPU is usually more concerned with graphics engines than with physics
engines, and a team porting software from a more traditional CPU must choose
either to train one or more CPU developers to use vector operations or to have one
or more GPU developers convert the code from scalar operations to vector
operations. In any scenario, harnessing the power of the SPEs will take more time
than programming a conventional CPU.
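The preferred-slot overhead described above can be sketched in plain C. This is a simulation of the idea, not actual SPE intrinsics; the vec4 struct is a stand-in for a 128-bit SIMD register.

```c
#include <assert.h>

typedef struct { float v[4]; } vec4;   /* stand-in for a 128-bit SIMD register */

/* One "vector" add: four element results per operation. */
vec4 vec_add(vec4 a, vec4 b) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* A scalar add done the SPE way: the scalar sits in slot 0 (the
 * preferred slot) of a vector, the whole vector is added, and the
 * result is read back out of slot 0. Three of the four lanes do
 * wasted work, which is why scalar-heavy code squanders SPE
 * throughput. */
float scalar_add_via_vector(float x, float y) {
    vec4 a = {{x, 0, 0, 0}};
    vec4 b = {{y, 0, 0, 0}};
    vec4 r = vec_add(a, b);
    return r.v[0];
}
```

The same vec_add that produces one useful result here produces four when the code is written in terms of whole vectors, which is the ratio the text is describing.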
Theoretically, the Cell is capable of masking memory accesses through a
clever compiler and large matrices, by making computation take longer than the
communication with memory. When working with dense matrices, the Cell can
approach ideal performance. However, when working with sparse matrices, the
calculations do not take long enough to veil the memory accesses, and the
processor once again becomes limited by memory access speed. The upper bound
for calculations on 4-byte single precision floating point values is then 12.8 Gflops,
about 6% of the original 205 Gflops touted by Sony. While 12.8 Gflops is still
impressive, the problem is exacerbated by the fact that 12.8 Gflops can only be
reached in ideal coding situations, and generally can't be achieved.
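As a back-of-the-envelope check (our arithmetic, not from the cited sources, assuming the bound is set entirely by the 25.6 GB/s memory interface and one multiply-add, i.e. two flops, per 4-byte float streamed in), the 12.8 Gflops figure falls out directly:

```c
#include <assert.h>

/* Bandwidth-bound throughput ceiling: if every element must be
 * streamed from memory, the sustainable flop rate is the memory
 * bandwidth divided by the element size, times the flops performed
 * per element. */
double bandwidth_bound_flops(double bytes_per_sec, double bytes_per_elem,
                             double flops_per_elem) {
    return bytes_per_sec / bytes_per_elem * flops_per_elem;
}
```

With 25.6 GB/s, 4-byte floats, and two flops per element, this gives 25.6e9 / 4 * 2 = 12.8e9 flops per second, consistent with the 12.8 Gflops bound quoted in the text.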
While the Cell has no branch predictor, there are 128 registers on each SPE
that can be dedicated to loop unrolling. With the number of cycles it takes to
access main memory in mind, unrolling loops is essential to utilizing the SPEs. It is
unclear whether the developer must unroll loops by hand, but even if a compiler
does the job of loop unrolling, the code must be written in a way that minimizes
dependencies, or else there will be nops between instructions in the unrolled
loop.[1]

As can be seen, there are a few fundamental differences that developers must take
into account when they are writing code for the SPEs.
The PlayStation 3 is a gaming console that utilizes the Cell processor. SIMD
vector processors are normally associated with GPUs, yet Sony opted to implement
vector operations in their CPU. Robert Valdez offers an explanation in his article on
the website howstuffworks.com: "Vector processors are designed to quickly process
several pieces of data at once." Not only are there a maximum of ten possible
threads, but each thread is SIMD, enabling a large number of parallel instructions
to be executed. This is critical to Sony and the PS3, the console the Cell was
originally developed for.[3] In graphics processors, vector processing performs the
calculations needed to display many different pixels at the same time. Video games
also involve another highly parallel workload: the logic behind virtual physics and
artificial intelligence (AI) is very parallel in nature.[2] The Cell is well equipped for
both, as every rock that is moved and every enemy that is spawned requires its
own set of calculations to provide a realistic simulation.

Cell: A Retrospective
The Cell is a unique processor, and in the world of computers unique often
has negative connotations. Unique can mean difficult to work with, since
developers need to learn new coding techniques to take advantage of the
architecture's quirks, exemplified by the Cell and its oddities like its specialization
in vector processing and its approach to mitigating memory accesses. Did these
quirks cost Sony against its greatest competition, the Xbox 360? Will Greenwald,
from PC Magazine, admitted that the PS3 is probably a more powerful system
overall, but that cross-platform games, that is, games released on both consoles,
perform roughly equally on each.[5] The SPEs, and the difficulty of learning to code
for them, were a large detriment to the PS3's performance. Issues with optimizing
for the SPEs apply beyond the realm of video game development: any developer
looking to use the Cell processor must adapt their methodologies to the operations
the SPEs specialize in.[1] One way to gauge the success, or failure, of the Cell
processor is to look at the sales of the PS3 versus its primary competitor, the Xbox
360. As of January 2014, around 82 million PS3s had reportedly been sold
worldwide versus around 80 million Xbox 360s.[6] Overall, the difference in sales is
negligible. In some ways this can be read as a failure, as the Cell was intended to
dominate the competition and carried a hefty research price tag of 400 million
dollars.[7] Regardless, the Cell was remarkably ahead of its time, as the most
recent console processors also feature eight cores,[8] and it was a remarkable
attempt at reinvigorating the stagnant status quo in computer architecture.

Sources
[1] Alfredo Buttari et al., "A Rough Guide to Scientific Computing On the PlayStation
3," Innovative Computing Laboratory, University of Tennessee Knoxville, Knoxville,
TN, Tech. Rep. UT-CS-07-595, May 11, 2007.
[2] Anand Lal Shimpi. (March 2005). "Understanding the Cell
Microprocessor." Available: http://www.anandtech.com/show/1647 [May 13, 2015]
[3] J. A. Kahle et al. Introduction to the Cell multiprocessor, IBM J. RES. & DEV.
Vol. 49, No. 4/5, July/September 2005.
[4] Robert Valdes. () How PlayStation 3 Works [Online]. Available:
http://electronics.howstuffworks.com/playstation-three.htm. [May 11, 2015]
[5] Will Greenwald. (2012, June 7) Xbox 360 vs. PlayStation 3: Which Console Wins
the Gaming Game [Online]. Available:
http://www.pcmag.com/article2/0,2817,2405305,00.asp [May 12, 2015]
[6] Kate T. et al. () PlayStation 3 vs. Xbox 360 [Online]. Available:
http://www.diffen.com/difference/PlayStation_3_vs_Xbox_360 [May 8, 2015]
[7] Leigh Alexander. (2009, January 2) Report: Sonys Cell Dev Cost $400 Million,
Aided Microsoft Tech [Online]. Available:
http://www.gamasutra.com/view/news/112644/Report_Sonys_Cell_Dev_Cost_400_Mil
lion_Aided_Microsoft_Tech.php [May 13, 2015]
[8] Jamie Lendino. (2015, April 16) Xbox One vs PS4: How the Hardware Specs
Compare [Online]. Available: http://www.extremetech.com/gaming/156273-xbox720-vs-ps4-vs-pc-how-the-hardware-specs-compare [May 13, 2015]
[9] IEEE, "Overview of the Architecture, Circuit Design, and Physical Implementation
of a First-Generation Cell Processor," IEEE Journal of Solid-State Circuits, January
2006.
[10] (2012, October 26). Are GPUs just vector processors? [Online]. Available:
https://theincredibleholk.wordpress.com/2012/10/26/are-gpus-just-vectorprocessors/ [May 9, 2015]
