
GPU Architecture

CPU & APU Architecture

CS 6823-003 Spring12 @ ASU


CPU Computing
CPU performance is the product of many related advances:
- Increased transistor density
- Increased transistor performance
- Wider data paths
- Pipelining
- Superscalar execution
- Speculative execution
- Caching
- Chip- and system-level integration


Bandwidth Gravity of Modern Computer Systems


Bandwidth between key components ultimately dictates system performance:
- Especially true for massively parallel systems processing massive amounts of data
- Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
- Ultimately, performance falls back to what the speeds and feeds dictate
- Key components must both perform well themselves and cooperate well with each other


Superscalar Execution
Superscalar and, by extension, out-of-order execution is one solution that has been included in CPUs for a long time:
- Out-of-order scheduling logic requires a substantial area of the CPU die to maintain dependence information and queues of instructions for dynamic scheduling throughout the hardware
- The speculative instruction execution needed to widen the window of out-of-order instructions executing in parallel results in inefficient execution of throwaway work (the sketch below shows the kind of independent work such hardware exploits)
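To make the idea concrete, here is a minimal C sketch (mine, not from the slides): the single-accumulator loop forms one long dependence chain, while the four-accumulator version exposes independent instructions that superscalar, out-of-order hardware can issue in the same cycle.

/* Summing an array: one dependence chain vs. four independent ones. */
float sum_serial(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];                  /* each add waits on the previous */
    return s;
}

float sum_ilp(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];                 /* four independent chains: a     */
        s1 += a[i + 1];             /* superscalar core can schedule  */
        s2 += a[i + 2];             /* these adds in parallel         */
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];                 /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}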


Out-of-order Execution


VLIW
VLIW is a heavily compiler-dependent method for increasing instruction-level parallelism in a processor
VLIW moves the dependence analysis work into the compiler
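A small illustration (assumed, not from the slides) of what this means in practice. The C function below contains four independent operations; with the restrict qualifiers, a VLIW compiler can prove at compile time that no result feeds another and pack them into one wide instruction.

/* Four independent operations a VLIW compiler could bundle together. */
void vliw_demo(const int *restrict a, const int *restrict b,
               int *restrict c) {
    c[0] = a[0] + b[0];   /* slot 0 */
    c[1] = a[1] - b[1];   /* slot 1 */
    c[2] = a[2] * b[2];   /* slot 2 */
    c[3] = a[3] & b[3];   /* slot 3 */
}

/* A hypothetical four-slot bundle the compiler might emit (schematic,
 * not a real ISA):
 *   { ADD c0,a0,b0 | SUB c1,a1,b1 | MUL c2,a2,b2 | AND c3,a3,b3 }
 * The hardware issues the whole bundle with no dynamic scheduling;
 * all dependence analysis happened in the compiler. */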


SIMD and Vector Processing


SIMD, and its generalization in vector parallelism, approaches improved efficiency by performing the same operation on multiple data elements with a single instruction (see the sketch below)
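A minimal C sketch (mine, not from the slides) using the x86 SSE intrinsics: a single _mm_add_ps instruction adds four packed floats at once, i.e., one operation applied to multiple data elements.

#include <xmmintrin.h>   /* SSE intrinsics */

/* out[i] = a[i] + b[i], four elements per instruction. */
void add_arrays(const float *a, const float *b, float *out, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);      /* load 4 floats   */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vsum = _mm_add_ps(va, vb);     /* 4 adds at once  */
        _mm_storeu_ps(&out[i], vsum);         /* store 4 results */
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];                 /* scalar leftover */
}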


Multithreading


Two Threads Scheduled in Time Slice Fashion


Multi-Core Architectures
A large number of threads interleave execution to keep the device busy, even though each individual thread takes longer to execute than the theoretical minimum (see the sketch below)
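A minimal POSIX-threads sketch (mine, not from the slides): the work is split across several threads, which the OS interleaves across the available cores to keep them all busy. Compile with cc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

/* Each thread handles its own share of a larger task. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all workers */
    return 0;
}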


AMD Bobcat and Bulldozer CPUs


Bobcat (left) follows a traditional approach mapping functional units to cores, in a low-power design
Bulldozer (right) combines two cores within a module, sharing some functional units


The AMD Radeon HD6970 GPU


The device is divided into two halves; instruction control (scheduling and dispatch) is performed by a wave scheduler for each half
The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level-1 caches and local data shares (LDS)


CPU and GPU Architectures


Niagara 2 CPU from Sun/Oracle


Relatively similar to the GPU design


E350 "Zacate" AMD APU


Two "Bobcat" low-power x86 cores Two 8-wide SIMD cores with five-way VLIW units Connected via a shared bus and a single interface to DRAM


Intel Sandy Bridge


Intel combines four Sandy Bridge x86 cores with an improved version of its embedded graphics processor
The concept is the same as AMD's APU devices in this category


Intel's Core i7 Processor


Four CPU cores with simultaneous multithreading
Made with 45 nm process technology
Each chip has 731 million transistors, with a thermal design power of up to 130 W


Intel's Nehalem
Four-wide superscalar
Out-of-order, speculative execution
Simultaneous multithreading
Multiple branch predictors
On-die power gating
On-die memory controllers
Large caches
Multiple interprocessor interconnects


Today's Intel PC Architecture: Single-Core System


FSB connection between processor and Northbridge (82925X Memory Controller Hub)

Northbridge handles primary PCIe to video/GPU and DRAM

PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction; worked out below)

Southbridge (ICH6RW) handles other peripherals
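As a quick check on those numbers (assuming first-generation PCIe, which matches this platform's era): each lane signals at 2.5 GT/s with 8b/10b encoding, so

\[
16\ \text{lanes} \times 2.5\,\mathrm{GT/s} \times \tfrac{8}{10} = 32\,\mathrm{Gb/s} = 4\,\mathrm{GB/s}\ \text{per direction},\quad 8\,\mathrm{GB/s}\ \text{aggregate}
\]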


Today's Intel PC Architecture: Dual-Core System


Bensley platform:
- Blackford Memory Controller Hub (MCH) is now a PCIe switch that integrates the Northbridge/Southbridge functions
- FBD (Fully Buffered DIMMs) allow simultaneous read/write transfers at 10.5 GB/s per DIMM
- PCIe links form the backbone
- PCIe device upstream bandwidth now equals downstream
- Workstation version has a x16 GPU link via the Greencreek MCH

Two CPU sockets:
- Dual Independent Buses to the CPUs, each basically a FSB (Front-Side Bus)
- CPU feeds at 8.5-10.5 GB/s per socket
- Compared to the current single Front-Side Bus, whose CPU feed is 6.4 GB/s (see the arithmetic below)

PCIe bridges to legacy I/O devices
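As a sanity check on the 6.4 GB/s figure (assuming the 64-bit, 800 MT/s front-side bus typical of this era):

\[
800\,\mathrm{MT/s} \times 8\,\mathrm{B/transfer} = 6.4\,\mathrm{GB/s}
\]

The 8.5-10.5 GB/s per-socket figures correspond to roughly the same 64-bit bus running at 1066-1333 MT/s, with each socket getting its own independent bus.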

Source: http://www.2cpu.com/review.php?id=109

Today's AMD PC Architecture


AMD HyperTransport Technology bus replaces the Front-Side Bus architecture
HyperTransport similarities to PCIe:
- Packet-based, switching network
- Dedicated links for both directions

Shown in 4-socket configuration, 8 GB/s per link
Northbridge/HyperTransport is on die
Glueless logic:
- to DDR, DDR2 memory
- to PCI-X/PCIe bridges (usually implemented in Southbridge)


Today's AMD PC Architecture


Torrenza technology:
- Allows licensing of coherent HyperTransport to 3rd-party manufacturers to make socket-compatible accelerators/coprocessors
- Allows 3rd-party PPUs (Physics Processing Units), GPUs, and coprocessors to access main system memory directly and coherently
- Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory


HyperTransport Feeds and Speeds


Primarily a low-latency direct chip-to-chip interconnect; also supports mapping to board-to-board interconnects such as PCIe

HyperTransport 1.0 Specification
- 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

HyperTransport 2.0 Specification
- Added PCIe mapping
- 1.0 - 1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

HyperTransport 3.0 Specification
- 1.8 - 2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
- Added AC coupling to extend HyperTransport to long-distance system-to-system interconnect

(the bandwidth arithmetic is worked out below)
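These figures follow directly from the link parameters (assuming the full 32-bit link width; narrower links scale down proportionally). HyperTransport is double-pumped, so for HyperTransport 1.0:

\[
800\,\mathrm{MHz} \times 2\,\tfrac{\text{transfers}}{\text{clock}} \times 4\,\mathrm{B} = 6.4\,\mathrm{GB/s}\ \text{each way} = 12.8\,\mathrm{GB/s}\ \text{aggregate}
\]

The same arithmetic at 1.4 GHz gives 11.2 GB/s each way (HT 2.0), and at 2.6 GHz gives 20.8 GB/s each way (HT 3.0).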

Courtesy HyperTransport Consortium


Source: White Paper: AMD HyperTransport Technology-Based System Architecture
