
GPU Architecture

CPU & APU Architecture

CS 6823-003 Spring12 @ ASU


CPU Computing
CPU performance is the product of many related advances:
- Increased transistor density
- Increased transistor performance
- Wider data paths
- Pipelining
- Superscalar execution
- Speculative execution
- Caching
- Chip- and system-level integration


Bandwidth Gravity of Modern Computer Systems


Bandwidth between key components ultimately dictates system performance:
- Especially true for massively parallel systems processing massive amounts of data
- Tricks like buffering, reordering, and caching can temporarily defy the rules in some cases
- Ultimately, performance falls back to what the speeds and feeds dictate
- Key components must both perform well themselves and cooperate well with each other


Superscalar Execution
Superscalar and, by extension, out-of-order execution is one solution that has been included in CPUs for a long time:
- Out-of-order scheduling logic requires a substantial area of the CPU die to maintain dependence information and queues of instructions for dynamic scheduling throughout the hardware
- The speculative instruction execution needed to widen the window of out-of-order instructions executing in parallel results in inefficient execution of throwaway work (the sketch below shows the kind of independent work such hardware exploits)
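To make the idea concrete, here is a minimal C sketch (mine, not from the slides): the single-accumulator loop forms one long dependence chain, while the four-accumulator version exposes independent instructions that superscalar, out-of-order hardware can issue in the same cycle.

/* Summing an array: one dependence chain vs. four independent ones. */
float sum_serial(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i];                  /* each add waits on the previous */
    return s;
}

float sum_ilp(const float *a, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];                 /* four independent chains: a     */
        s1 += a[i + 1];             /* superscalar core can schedule  */
        s2 += a[i + 2];             /* these adds in parallel         */
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];                 /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}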


Out-of-order Execution


VLIW
VLIW is a heavily compiler-dependent method for increasing instruction-level parallelism in a processor
VLIW moves the dependence analysis work into the compiler
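A small illustration (assumed, not from the slides) of what this means in practice. The C function below contains four independent operations; with the restrict qualifiers, a VLIW compiler can prove at compile time that no result feeds another and pack them into one wide instruction.

/* Four independent operations a VLIW compiler could bundle together. */
void vliw_demo(const int *restrict a, const int *restrict b,
               int *restrict c) {
    c[0] = a[0] + b[0];   /* slot 0 */
    c[1] = a[1] - b[1];   /* slot 1 */
    c[2] = a[2] * b[2];   /* slot 2 */
    c[3] = a[3] & b[3];   /* slot 3 */
}

/* A hypothetical four-slot bundle the compiler might emit (schematic,
 * not a real ISA):
 *   { ADD c0,a0,b0 | SUB c1,a1,b1 | MUL c2,a2,b2 | AND c3,a3,b3 }
 * The hardware issues the whole bundle with no dynamic scheduling;
 * all dependence analysis happened in the compiler. */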


SIMD and Vector Processing


SIMD, and its generalization in vector parallelism, approaches improved efficiency by performing the same operation on multiple data elements with a single instruction (see the sketch below)
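A minimal C sketch (mine, not from the slides) using the x86 SSE intrinsics: a single _mm_add_ps instruction adds four packed floats at once, i.e., one operation applied to multiple data elements.

#include <xmmintrin.h>   /* SSE intrinsics */

/* out[i] = a[i] + b[i], four elements per instruction. */
void add_arrays(const float *a, const float *b, float *out, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);      /* load 4 floats   */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vsum = _mm_add_ps(va, vb);     /* 4 adds at once  */
        _mm_storeu_ps(&out[i], vsum);         /* store 4 results */
    }
    for (; i < n; i++)
        out[i] = a[i] + b[i];                 /* scalar leftover */
}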


Multithreading


Two Threads Scheduled in Time Slice Fashion


Multi-Core Architectures
A large number of threads interleave execution to keep the device busy, even though each individual thread takes longer to execute than the theoretical minimum (see the sketch below)
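A minimal POSIX-threads sketch (mine, not from the slides): the work is split across several threads, which the OS interleaves across the available cores to keep them all busy. Compile with cc -pthread.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

/* Each thread handles its own share of a larger task. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all workers */
    return 0;
}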


AMD Bobcat and Bulldozer CPUs


Bobcat (left) follows a traditional approach mapping functional units to cores, in a low-power design
Bulldozer (right) combines two cores within a module, sharing some functional units


The AMD Radeon HD6970 GPU


The device is divided into two halves; instruction control (scheduling and dispatch) is performed by a wave scheduler for each half
The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level-1 caches and local data shares (LDS)


CPU and GPU Architectures


Niagara 2 CPU from Sun/Oracle


Relatively similar to the GPU design


E350 "Zacate" AMD APU


Two "Bobcat" low-power x86 cores Two 8-wide SIMD cores with five-way VLIW units Connected via a shared bus and a single interface to DRAM


Intel Sandy Bridge


Intel combines four Sandy Bridge x86 cores with an improved version of its embedded graphics processor
The concept is the same as AMD's APU devices in this category


Intel's Core i7 Processor


Four CPU cores with simultaneous multithreading
Made with 45 nm process technology
Each chip has 731 million transistors, with a thermal design power of up to 130 W


Intel's Nehalem
Four-wide superscalar
Out-of-order, speculative execution
Simultaneous multithreading
Multiple branch predictors
On-die power gating
On-die memory controllers
Large caches
Multiple interprocessor interconnects


Today's Intel PC Architecture: Single-Core System


FSB connection between processor and Northbridge (82925X Memory Controller Hub)

Northbridge handles primary PCIe to video/GPU and DRAM

PCIe x16 bandwidth of 8 GB/s (4 GB/s in each direction; worked out below)

Southbridge (ICH6RW) handles other peripherals
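As a quick check on those numbers (assuming first-generation PCIe, which matches this platform's era): each lane signals at 2.5 GT/s with 8b/10b encoding, so

\[
16\ \text{lanes} \times 2.5\,\mathrm{GT/s} \times \tfrac{8}{10} = 32\,\mathrm{Gb/s} = 4\,\mathrm{GB/s}\ \text{per direction},\quad 8\,\mathrm{GB/s}\ \text{aggregate}
\]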


Today's Intel PC Architecture: Dual-Core System


Bensley platform:
- Blackford Memory Controller Hub (MCH) is now a PCIe switch that integrates the Northbridge/Southbridge functions
- FBD (Fully Buffered DIMMs) allow simultaneous read/write transfers at 10.5 GB/s per DIMM
- PCIe links form the backbone
- PCIe device upstream bandwidth now equals downstream
- Workstation version has a x16 GPU link via the Greencreek MCH

Two CPU sockets:
- Dual Independent Buses to the CPUs, each basically a FSB (Front-Side Bus)
- CPU feeds at 8.5-10.5 GB/s per socket
- Compared to the current single Front-Side Bus, whose CPU feed is 6.4 GB/s (see the arithmetic below)

PCIe bridges to legacy I/O devices
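As a sanity check on the 6.4 GB/s figure (assuming the 64-bit, 800 MT/s front-side bus typical of this era):

\[
800\,\mathrm{MT/s} \times 8\,\mathrm{B/transfer} = 6.4\,\mathrm{GB/s}
\]

The 8.5-10.5 GB/s per-socket figures correspond to roughly the same 64-bit bus running at 1066-1333 MT/s, with each socket getting its own independent bus.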

Source: http://www.2cpu.com/review.php?id=109

Today's AMD PC Architecture


AMD HyperTransport Technology bus replaces the Front-Side Bus architecture
HyperTransport similarities to PCIe:
- Packet-based, switching network
- Dedicated links for both directions

Shown in 4-socket configuration, 8 GB/s per link
Northbridge/HyperTransport is on die
Glueless logic:
- to DDR, DDR2 memory
- to PCI-X/PCIe bridges (usually implemented in Southbridge)


Today's AMD PC Architecture


Torrenza technology:
- Allows licensing of coherent HyperTransport to 3rd-party manufacturers to make socket-compatible accelerators/coprocessors
- Allows 3rd-party PPUs (Physics Processing Units), GPUs, and coprocessors to access main system memory directly and coherently
- Could make the accelerator programming model easier to use than, say, the Cell processor, where each SPE cannot directly access main memory


HyperTransport Feeds and Speeds


Primarily a low-latency direct chip-to-chip interconnect; also supports mapping to board-to-board interconnects such as PCIe

HyperTransport 1.0 Specification
- 800 MHz max, 12.8 GB/s aggregate bandwidth (6.4 GB/s each way)

HyperTransport 2.0 Specification
- Added PCIe mapping
- 1.0 - 1.4 GHz clock, 22.4 GB/s aggregate bandwidth (11.2 GB/s each way)

HyperTransport 3.0 Specification
- 1.8 - 2.6 GHz clock, 41.6 GB/s aggregate bandwidth (20.8 GB/s each way)
- Added AC coupling to extend HyperTransport to long-distance system-to-system interconnect

(the bandwidth arithmetic is worked out below)
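These figures follow directly from the link parameters (assuming the full 32-bit link width; narrower links scale down proportionally). HyperTransport is double-pumped, so for HyperTransport 1.0:

\[
800\,\mathrm{MHz} \times 2\,\tfrac{\text{transfers}}{\text{clock}} \times 4\,\mathrm{B} = 6.4\,\mathrm{GB/s}\ \text{each way} = 12.8\,\mathrm{GB/s}\ \text{aggregate}
\]

The same arithmetic at 1.4 GHz gives 11.2 GB/s each way (HT 2.0), and at 2.6 GHz gives 20.8 GB/s each way (HT 3.0).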

Courtesy HyperTransport Consortium


Source: White Paper: AMD HyperTransport Technology-Based System Architecture
