A Computer Science 625a Project Created For: Mark Daley Created By: Nathan Lemieux 250017145 December 2005
1. Introduction
The Cell was developed by the STI team, which consists of Sony, Toshiba and IBM. They teamed up to produce a scalable and flexible processor that is extremely powerful yet energy conscious. It is another attempt to repeat the RISC revolution by simplifying processor micro-architecture and moving complexity into software [2]. Over the past twenty years, the processor industry has introduced techniques that increased clock speeds dramatically, but at the cost of die space, power consumption and simplicity. I will start by outlining a brief history of the Cell's conception in Section 2 and then provide an overview of its architecture in Section 3. In Section 4, I will discuss how to use the resources available in the Cell. In Section 5, I will compare the Cell to other architectures, and in Section 6 I will look at why some of the design decisions were made. In Section 7, I will discuss what software will run on the Cell and what it will take to program for it, before concluding in Section 8.
2. History
The original Cell concept was developed by Sony Computer Entertainment Incorporated (SCEI) in 1999 after the release of its PlayStation 2 game console. The Emotion Engine inside the PS2 did not meet its publicized expectations, so SCEI went back to the drawing board. The STI group was formed in 2000, with the first design center opening in 2001. In the fall of 2002, the patent for the Cell was published. Sometime after that, a prototype was developed, and STI claims it was clocked at over 4.5 GHz. In February 2005, the final architectural design was released to the public at the International Solid-State Circuits Conference (ISSCC).

Each company in the STI group has its own requirements and expectations of what the Cell should be capable of. STI intends to scale the processor for low-end and high-end applications by varying the number of cores on a chip and the number of units in a single core, and by linking multiple chips to each other over a distributed network. IBM wants a powerful and scalable processor, Sony wants a cheap but powerful processor, and Toshiba wants a cheap yet energy-efficient processor.

IBM
IBM brings to the table its vast knowledge and expertise in developing and manufacturing state-of-the-art microprocessors. IBM has put its POWER architecture inside the Cell to act as the brains of the chip. IBM plans to put the Cell in its high-end server line.

Sony Corporation
Sony Corporation is a leading manufacturer of audio and video products for the consumer and professional markets. SCEI manufactures, distributes, and markets the PlayStation game console family. SCEI has already discussed plans to launch the new PS3 in 2006 with the Cell processor. This will be the first commercially available product that contains the Cell processor.

Toshiba
Toshiba Corporation is a leader in the development and manufacture of electronic devices and components for digital consumer products. Toshiba plans to put the Cell in HDTVs to decode multiple MPEG-2 streams simultaneously.

The STI group is a strategic alliance; each company brings a different skill set and a vast amount of capital to invest in the Cell's development. The purpose of developing the Cell was to reduce the cost of components by building their own. It is likely that STI will produce the Cell in vast numbers, because five-plus years of development usually result in an expensive product whose unit cost can only be reduced by manufacturing in volume.
3. Architecture
The Cell architecture is intended to be scalable through the use of vector processing elements (SPEs). You can scale up by adding SPEs or scale down by removing them. Depending on the usage requirements, the Cell processor will be capable of a number of different configurations. The basic configuration consists of the following:

PowerPC Processing Element (PPE)
Multiple (eight) Synergistic Processing Elements (SPEs)
Rambus Memory Interface Controller (MIC)
Rambus FlexIO interface
Element Interconnect Bus (EIB)
Theoretical Performance Calculations
With the above configuration, the Cell has a theoretical single-precision computing power of 256 GFLOPS (billion floating-point operations per second) when clocked at 4 GHz, not including the computing power of the PPE:

8 (SPEs) x 4 GHz x 4 (32-bit words in a vector) x 2 (a multiply-add counts as 2 operations) = 256 GFLOPS

So each SPE is capable of 32 SP GFLOPS [9]. An SPE can produce 2 DP FMADD operations every 7 cycles or 4 SP FMADD operations every cycle [8]. That is approximately 2.3 DP GFLOPS per SPE, or approximately 18.3 DP GFLOPS in total. Again, this does not include the processing power of the PPE, which will be capable of 8 DP GFLOPS because of the AltiVec unit. Supercomputer rankings are based on double-precision calculations. So, one Cell processor clocked at 4 GHz has a theoretical capability of approximately 26 DP GFLOPS [8].

For comparison, the current top supercomputer, BlueGene/L, developed by IBM, has a theoretical peak performance of 183500 GFLOPS but has only achieved 136800 GFLOPS. IBM's BlueGene/L has 65536 processors, giving each processor a theoretical peak performance of approximately 2.8 DP GFLOPS [14]. If the Cell achieves only half of its theoretical performance, say 13 DP GFLOPS, it will still far exceed the current supercomputers in performance per processor.
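The arithmetic above can be checked with a short script. This is only a sketch that reuses the figures quoted in the text (4 GHz clock, eight SPEs, 8 DP GFLOPS for the PPE); the function and constant names are my own and do not come from any STI document.

```python
# Peak-throughput arithmetic for the basic Cell configuration.
# All inputs are the figures quoted in the text; names are illustrative only.

CLOCK_GHZ = 4.0        # assumed clock speed used in the text
NUM_SPES = 8           # SPEs in the basic configuration
FLOPS_PER_FMADD = 2    # a multiply-add counts as two operations
PPE_DP_GFLOPS = 8.0    # double-precision figure quoted for the PPE


def sp_gflops_per_spe(clock_ghz=CLOCK_GHZ):
    # Four 32-bit words per vector, one FMADD per word per cycle.
    return 4 * FLOPS_PER_FMADD * clock_ghz


def dp_gflops_per_spe(clock_ghz=CLOCK_GHZ):
    # Two double-precision FMADDs every seven cycles.
    return 2 * FLOPS_PER_FMADD * clock_ghz / 7


sp_total = NUM_SPES * sp_gflops_per_spe()    # single precision, SPEs only
dp_spes = NUM_SPES * dp_gflops_per_spe()     # double precision, SPEs only
dp_total = dp_spes + PPE_DP_GFLOPS           # double precision, SPEs plus PPE
bluegene_per_cpu = 183500 / 65536            # BlueGene/L peak per processor

print(f"SP total:          {sp_total:.1f} GFLOPS")
print(f"DP per SPE:        {dp_gflops_per_spe():.2f} GFLOPS")
print(f"DP total:          {dp_total:.1f} GFLOPS")
print(f"BlueGene/L per CPU: {bluegene_per_cpu:.2f} GFLOPS")
```

Carrying full precision gives roughly 18.3 DP GFLOPS for the eight SPEs; rounding the per-SPE figure to 2.3 first gives 18.4. Either way the total with the PPE rounds to 26 DP GFLOPS.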
parallel and if not, then in program order. If necessary, after execution, the instructions are put back in sequence and the result is written back to local memory [4].

Each SPE contains a 256 KB local memory which STI calls the local store. The local store is visible to the PPE and can be addressed directly by software. There are no memory or coherency mechanisms used within the local stores. Each SPE contains a 128-entry register file with 128 bits per entry. The register file has six read ports and two write ports. The SPEs cannot operate directly on main memory; they have to move data to and from the local stores. The DMA device in each SPE handles the movement of data between main memory and the local store in blocks of 1024 bits (128 bytes). The SPEs operate on registers, which are read from or written to the local stores.

Unlike Power processors, the SPEs operate only on their local memory (LS). Code and data must be transferred into the associated LS for an SPE to execute or operate on. LS addresses do have an alias in the PPE address map, and transfers to and from LS to memory at large (including other LS) are coherent in the system. As a result, a pointer to a data structure that has been created on the PPE can be passed to an SPE, and the SPE can use this pointer to issue a DMA command to bring the data structure into its LS in order to perform operations on it. If the SPE (or PPE) issues a DMA command to place it back into non-LS memory after some computations, the transfer is again coherent [11].
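The data flow described above (the PPE hands over a pointer, the SPE pulls the data into its local store by DMA, works on it, and pushes it back) can be modelled in a few lines. This is a toy simulation under my own assumptions, not Cell code: memories are plain byte arrays, a "pointer" is just an offset, and the names dma_get and dma_put are hypothetical. Only the 256 KB local store size and the 128-byte block size come from the text.

```python
LOCAL_STORE_BYTES = 256 * 1024   # local store size per SPE, from the text
DMA_BLOCK_BYTES = 128            # 1024-bit DMA block, from the text


def _copy_blocks(src, src_addr, dst, dst_addr, size):
    """Copy `size` bytes in 128-byte blocks; return the number of blocks."""
    if src_addr < 0 or dst_addr < 0 or size < 0:
        raise ValueError("addresses and size must be non-negative")
    if src_addr + size > len(src) or dst_addr + size > len(dst):
        raise ValueError("transfer runs past the end of a memory")
    blocks = 0
    for offset in range(0, size, DMA_BLOCK_BYTES):
        n = min(DMA_BLOCK_BYTES, size - offset)  # last block may be partial
        dst[dst_addr + offset:dst_addr + offset + n] = \
            src[src_addr + offset:src_addr + offset + n]
        blocks += 1
    return blocks


def dma_get(main_memory, address, local_store, ls_addr, size):
    """Hypothetical helper: main memory -> local store."""
    return _copy_blocks(main_memory, address, local_store, ls_addr, size)


def dma_put(local_store, ls_addr, main_memory, address, size):
    """Hypothetical helper: local store -> main memory."""
    return _copy_blocks(local_store, ls_addr, main_memory, address, size)


# The "PPE" creates a data structure in main memory and passes a pointer.
main_memory = bytearray(range(256)) * 2      # 512 bytes of sample data
pointer, size = 128, 300

# The "SPE" brings the data into its local store, since it cannot
# operate on main memory directly.
local_store = bytearray(LOCAL_STORE_BYTES)
blocks_in = dma_get(main_memory, pointer, local_store, 0, size)

# The SPE computes on its local copy (here: add one to every byte).
for i in range(size):
    local_store[i] = (local_store[i] + 1) % 256

# The result is written back to main memory.
blocks_out = dma_put(local_store, 0, main_memory, pointer, size)
```

A 300-byte structure needs three blocks in each direction (two full blocks and one partial), and only the bytes between the pointer and pointer + size change in main memory.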
[10]
Four-Cell configuration with additional switch hardware for communication [13]
Since the Cell is extremely flexible in its configurations, there is a need for a number of potential resource utilization schemes. Tasks are divided into SPE and PPE modules, or jobs. Each SPE module is a sub-task which runs on one or more SPEs depending on the compute power needed; modules can also stream data to one another.

PPE Scheduling
The PPE maintains a job queue, schedules jobs on the SPEs and monitors progress. Each SPE performs its own job and synchronizes with the PPE. When an SPE has finished its execution, the next job in line is assigned to that SPE. These jobs are self-contained mini-programs.

SPE Self Scheduling
Scheduling is distributed across the SPEs. The SPEs run their own mini-kernel, which allows them to assign jobs to themselves without guidance from the PPE. The SPEs use shared memory for all jobs in this configuration, as the PPE still maintains the job queue.

Stream Processing
Each SPE runs a distinct program, and the programs are chained together. Data comes from an input stream and is sent to an SPE (or several SPEs) to be stored in its local store. When the SPE has finished processing, the output data is stored in its local store. The next SPE reads the output from the first SPE's local store, processes it and stores the result in its own local store. This process of passing data from SPE to SPE continues until the stream processing operation has finished.
In an example of the stream processing scheme on the Cell processor, SPEs load programs for reading a DVD, decoding video, decoding audio and display. The data is passed from SPE to SPE until it finally ends up on the TV [9]. With a good compiler or some clever programming, I believe it is possible to combine different schemes. For example, you could have four SPEs working in a parallel/queue scheme while the other four SPEs work in a serial/stream scheme. In the above schemes, all the SPEs are dynamically assigned, so the developer does not need to know how many SPEs there are or what the SPEs are currently computing. STI also intends to allow distributed processing among the different products that will contain the Cell (from a PDA to a camera to a microwave). The operating system will perform the necessary communication set-up and will use whatever local network technology is available.
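The PPE scheduling and stream processing schemes can be sketched as a toy model. Everything here is an assumption made for illustration: jobs are (name, duration) pairs in arbitrary time units, SPEs are just integer ids, stream stages are ordinary functions, and the names ppe_schedule and stream_chain are my own. It shows the queueing and chaining logic only, not how a real Cell runtime would do it.

```python
import heapq
from collections import deque


def ppe_schedule(jobs, num_spes=8):
    """PPE scheduling: the next job in the queue goes to the SPE that
    becomes free first. Returns (name, spe_id, start, finish) tuples."""
    queue = deque(jobs)                              # job queue kept by the PPE
    free_at = [(0, spe) for spe in range(num_spes)]  # (time free, SPE id)
    heapq.heapify(free_at)
    timeline = []
    while queue:
        name, duration = queue.popleft()
        start, spe = heapq.heappop(free_at)          # earliest available SPE
        finish = start + duration
        timeline.append((name, spe, start, finish))
        heapq.heappush(free_at, (finish, spe))       # SPE is busy until finish
    return timeline


def stream_chain(stages, stream):
    """Stream processing: each stage stands in for a program on one SPE,
    and the output of one stage is the input of the next."""
    results = []
    for item in stream:
        for stage in stages:
            item = stage(item)
        results.append(item)
    return results


# Three jobs on two SPEs: the third job waits for the first SPE to free up.
timeline = ppe_schedule([("a", 3), ("b", 1), ("c", 2)], num_spes=2)

# A three-stage chain loosely following the DVD example in the text.
stages = [
    lambda frame: frame + ":read",
    lambda frame: frame + ":decoded",
    lambda frame: frame + ":displayed",
]
frames = stream_chain(stages, ["f0", "f1"])
```

In the scheduling example, job "c" starts at time 1 on SPE 1, because that SPE finishes job "b" before SPE 0 finishes job "a". Neither function needs to be told which SPE does what in advance, which mirrors the point that the developer does not need to know how many SPEs there are.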
[9]
Above I discussed reducing the complexity of the Cell. This includes the removal of OOO hardware (thus reducing the amount of cache needed) and branch prediction hardware, and the substitution of locally addressable memory (LS) for cache. STI has removed some control logic from the SPEs to make room for more local storage and execution hardware. The SPE does not do register renaming, branch prediction or instruction reordering, as STI has eliminated the instruction window. Secondly, and most importantly, STI has replaced the level 1 cache with locally addressable memory, termed the local store, to reduce the memory latency gap. The basic idea is that the Cell moves memory closer to the execution units and lets the processor store frequently used data in that local memory [2]. Since there is no cache in the SPE, the burden of managing the local store has been moved into software. With a large unified register set, a good compiler can schedule the instructions so that dependencies have less impact. An advantage of a large register set is that loop unrolling and interleaving can be supported without the use of reorder hardware. The compiler, with the help of many registers, can do what the OOO hardware does. The result is a less complex processor that operates at a higher frequency with relatively low power consumption.
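To make the loop unrolling and interleaving point concrete, here is a small sketch of the transformation a compiler might apply. It is written in Python purely to show the shape of the code; Python has no notion of registers, so the four accumulators merely stand in for the independent registers a Cell compiler could use. The function names are my own.

```python
def dot_rolled(a, b):
    """Straightforward loop: every addition depends on the previous one."""
    acc = 0
    for i in range(len(a)):
        acc += a[i] * b[i]
    return acc


def dot_unrolled(a, b):
    """Same loop unrolled by four with four independent accumulators.
    The four multiply-adds in the body do not depend on each other, so
    in-order hardware could overlap them without any reorder logic."""
    acc0 = acc1 = acc2 = acc3 = 0
    n = len(a)
    main = n - n % 4                 # largest multiple of four within n
    for i in range(0, main, 4):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    acc = acc0 + acc1 + acc2 + acc3
    for i in range(main, n):         # leftover elements
        acc += a[i] * b[i]
    return acc


a = list(range(10))
b = list(range(10, 20))
assert dot_unrolled(a, b) == dot_rolled(a, b)
```

The unrolled version needs more live values at once (four accumulators instead of one), which is why a 128-entry register file makes this kind of scheduling practical. With floating-point data the two versions can differ in the last bits, because the additions are performed in a different order.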
7. Programming
Primarily, the Cell's PPE is a 64-bit PowerPC. Compatibility with the Power Architecture provides a base for porting existing software, including operating systems like Apple's OS X (this will still need some extensive re-writing of code, but significantly less than re-writing Microsoft Windows for the Cell). Sony, Toshiba, and IBM have already ported Linux (different versions) to run on the Cell's PPE core, but additional patches are being developed jointly to unlock the performance potential of the Cell's SPEs. According to STI, both the PPE and SPE are programmable in C/C++ using a common API. Existing 32-bit POWER applications will run on the Cell processor without modification. Any algorithm that can be vectorized or made parallel will unlock the performance gains of the SPEs. So, applications that use graphics, audio, video and encryption will perform especially well. Nevertheless, scalar operations will also run across different SPEs, just not at the same performance.
Since STI has removed some of the design complexity of the SPEs, it has increased the programming complexity for developers. Developing for the SPE will almost certainly be done at the hardware level. The tasks of managing the local store memory and system-level coherency will initially be done by the programmer or library writer, but over time this should be effectively handled by the compiler. These future compilers should be able to auto-vectorize code and create parallel code automatically, but currently this is not available. Just this month IBM released a Cell Software Development Kit (SDK). You can download the SDK for free and start programming for the Cell on any x86 machine (32-bit or 64-bit). The SDK comes with a collection of software such as Linux kernel patches, the GNU toolchain (GCC), the XL C compiler (with different optimization options from none to loop analysis to whole program analysis), a system simulator, code samples, libraries and of course lots of documentation. It requires a significant amount of system performance and needs the Red Hat Fedora Core 4 operating system to be installed. For more information see reference [15] (thanks, Mark). For information purposes only, STI has released a hardware development kit for the PS3 nicknamed Cytology. At 2.4 GHz, it is not as powerful as the PS3 promised for release in 2006 (3.2 GHz). If you are interested in obtaining one, you will have to wait at least until the New Year, as they are being distributed in very small numbers to developers.
8. Conclusion
The Cell is a general-purpose processor, but it is optimized for high-performance computing tasks. The intended basic configuration consists of 9 cores. The PPE core is a conventional POWER processor and acts as a controller. The other eight cores, the SPEs, are independent vector processors that can work alone or chained together and perform most of the computational workload. STI has given the Cell the ability to communicate with other Cells or bring in a huge amount of input via high-speed Rambus interconnects. STI's design decisions were made to create a powerful processor which is easily configured for different uses, from gaming systems to HDTVs to supercomputers, while still being energy conscious. The Cell is very scalable, as SPEs can be added to scale up or removed to scale down. It has taken STI over five years to design the Cell, so one would think it would be expensive once it is released. Since it is very scalable in its design and flexible in its uses, it should be produced in vast numbers by the STI group, providing exceptional power for a reasonable price. Programming for the Cell may be difficult at first, as the PPE and SPE are different types of processors. In the meantime, compilers and libraries will be developed, giving programmers a break. The Cell will be successful in the niche markets that the STI members have designed it for and will change the way we perceive current processors. However, with Apple's recent announcement that it is switching from the PowerPC to Intel, it may be a few years before we see the Cell in a desktop PC. Apple's reasons for the switch include the price of PowerPC chips, the inability of the PPC970FX to reach 3 GHz, and the failure to provide Apple with a low-power PowerPC chip for its PowerBook line.
Glossary of Acronyms
STI - Sony, Toshiba, IBM
SCEI - Sony Computer Entertainment Incorporated
ISSCC - International Solid-State Circuits Conference
PPE - Power Processing Element
SPE - Synergistic Processing Element
LS - Local Store
EIB - Element Interconnect Bus
IPC - Instructions Per Cycle
MIC - Memory Interface Controller
OOO - Out Of Order (Execution)
CPU - Central Processing Unit
DMA - Direct Memory Access
SIMD - Single Instruction Multiple Data
RISC - Reduced Instruction Set Computer
CISC - Complex Instruction Set Computer
POWER - Performance Optimization With Enhanced RISC
DVD - Digital Versatile Disc
IBM - International Business Machines
DP - Double Precision
SP - Single Precision
GFLOPS - Billion Floating-Point Operations per Second
FMADD - Floating-Point Multiply-Add Instruction
GPU - Graphics Processing Unit
KB - Kilobyte
GHz - Gigahertz (unit of frequency equal to one billion hertz)
MHz - Megahertz (unit of frequency equal to one million hertz)
API - Application Programming Interface
SoC - System on a Chip
SDK - Software Development Kit
GNU - GNU's Not Unix
PDA - Personal Digital Assistant
References
[1] Wikipedia, Cell (microprocessor), http://en.wikipedia.org/wiki/Cell_processor
[2] Stokes, Jon, Introducing the IBM/Sony/Toshiba Cell Processor, Part 1, http://arstechnica.com/articles/paedia/cpu/cell-1.ars/1, 02/07/2005
[3] Stokes, Jon, Introducing the IBM/Sony/Toshiba Cell Processor, Part 2, http://arstechnica.com/articles/paedia/cpu/cell-2.ars, 02/08/2005
[4] Suzuoki, Masakazu & Yamazaki, Takeshi, US Patent Application, http://appft1.uspto.gov/netacgi/nphParser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/PTO/searchbool.html&r=1&f=G&l=50&co1=AND&d=PG01&s1=20020138637&OS=20020138637&RS=20020138637, 10/26/2002
[5] Press Release, IBM, Sony, Sony Computer Entertainment Inc. and Toshiba Disclose Key Details of the Cell Chip, http://www.us.playstation.com/Pressreleases.aspx?id=252, 02/07/2005
[6] Hofstee, Peter, Power Efficient Processor Architecture and The Cell Processor, http://ieeexplore.ieee.org/iel5/9519/30167/01385948.pdf?arnumber=1385948, HPCA-11, 2005
[7] Press Release, Sony Computer Entertainment Inc. To launch its next generation computer entertainment system, http://www.us.playstation.com/Pressreleases.aspx?id=279, 05/16/2005
[8] Wang, David, Cell Microprocessor III, http://realworldtech.com/page.cfm?ArticleID=RWT072405191325, 07/24/2005
[9] Blachford, Nicholas, Cell Architecture Explained Version 2, http://blachford.info/computer/Cell/Cell0_v2.html, 2005
[10] Wang, David, ISSCC 2005: The Cell Microprocessor, http://realworldtech.com/page.cfm?ArticleID=RWT021005084318, 02/10/2005
[11] Wang, David, Cell Microprocessor Revisited, http://realworldtech.com/page.cfm?ArticleID=RWT022805234129, 02/28/2005
[12] Blachford, Nicholas, Cell Architecture Explained Version 1, http://blachford.info/computer/Cell/archive/Cell0.html, 2005
[13] J. Kahle, M. Day, H. Hofstee, C. Johns, T. Maeurer, D. Shippy, Introduction to the Cell multiprocessor, http://researchweb.watson.ibm.com/journal/rd/494/kahle.html, IBM Journal of Research and Development, Vol. 49, Number 4/5, 2005
[14] Top 500 List, http://www.top500.org/lists/2005/06/, 06/2005
[15] Get started with the Cell Broadband Engine Software Development Kit, http://www-128.ibm.com/developerworks/power/library/pa-cellstartsim/#N10276, IBM, 11/09/2005
[Table: system specifications. Row labels: CPU, GPU, Sound, Memory, System Bandwidth, System Floating Point Performance, Storage, HDD, USB, I/O, Memory Stick, SD, CompactFlash, Ethernet, Communication, Wi-Fi, Bluetooth, Controller. Values not recoverable.]