Pangaea: A Tightly-Coupled Heterogeneous IA32 Chip Multiprocessor
Henry Wong1, Anne Bracy2, Ethan Schuchman2, Tor M. Aamodt1, Jamison D. Collins2, Perry H. Wang2, Gautham Chinya2, Ankur Khandelwal Groen3, Hong Jiang4, Hong Wang2
henry@stuffedcow.net, anne.c.bracy@intel.com
1 Dept. of Electrical and Computer Engineering, University of British Columbia
2 Microarchitecture Research Lab, Microprocessor Technology Labs, Intel Corporation
3 Digital Enterprise Group, Intel Corporation
4 Graphics Architecture, Mobility Groups, Intel Corporation
Parallel Architectures and Compilation Techniques, October 27, 2008
Pangaea
Integrates IA32 CPU with GPU cores
Improved area/power efficiency
Tighter integration
Modular design
Pangaea. PACT 2008
Motivation
GPUs have low Energy Per Instruction
~100x less EPI than CPU
Parallel performance too
Pangaea targets non-graphics computation for further efficiency gains
Tightly-coupled
Easier to program
Lower communication latency
Minimize changes to existing software (OS)
Overview
Background on GPU Computation
Pangaea: IA32-GPU chip multiprocessor
User-Level Interrupt mechanism
Architecture trade-offs
Prototype performance
Conclusion
Programmable GPU
Rendering pipeline
Polygons go in, pixels come out
DX10 has 3 programmable stages
Nvidia CUDA, AMD Stream

Use shader processors without graphics API
C-like high-level language for convenience
GPU + CPU
Loosely-coupled to the CPU
Off-chip latency
Explicit data copy between memory spaces
Cooperation?
GPU Integration
Put them on the same chip
No off-chip latency
Explicit data copy between memory spaces
Cooperation??
Pangaea
Single-chip, tightly-coupled
No off-chip latency
Shared memory address space: share, not copy
Cooperation!
Pangaea Architecture
Tightly-integrated
User-level interrupts (ULI) for communication
Shared memory and cache
Use GPU cores for compute

Execution Unit (EU)
EU Thread Life Cycle
[Figure: EU thread life cycle. EU threads alternate between "Work..." and "Done!"; the CPU thread alternates between "Work...", "Signal", and "User Handler".]
User-Level Interrupts (ULI)

EMONITOR
Watches for an address invalidation
Calls user interrupt handler in response
ERETURN
Returns from user-level interrupt handler
SIGNAL
Tells Thread Spawner to start new thread.
Using ULI: CPU Code

{
    task_complete = false;
    EMONITOR(&task_complete, &handler);
    SIGNAL(&eu_routine, &eu_data);
    { Do some work }
}
Using ULI: EU Code

{
    Do some work;
    task_complete = true;
}
Using ULI: User Handler

handler() {
    if (task_complete) { Use EU result or start EU task }
    ERETURN();
}
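The three ULI snippets (CPU code, EU code, user handler) can be tied together in a small software model. This is only a sketch: it runs single-threaded, so it serializes what the hardware does concurrently, and the names `UliModel`, `eu_store`, and `run_eu_threads` are hypothetical helpers invented for the illustration, not part of Pangaea.

```python
from collections import deque


class UliModel:
    """Toy single-threaded model of EMONITOR / SIGNAL / ERETURN (assumed names)."""

    def __init__(self):
        self.memory = {}            # shared address space: name -> value
        self.monitors = {}          # monitored address -> user handler
        self.spawn_queue = deque()  # work queued at the Thread Spawner

    def emonitor(self, addr, handler):
        # Register a user handler for writes to `addr`.
        self.monitors[addr] = handler

    def signal(self, routine, data):
        # Ask the Thread Spawner to start an EU thread.
        self.spawn_queue.append((routine, data))

    def eu_store(self, addr, value):
        # An EU write stands in for the invalidation that raises the
        # user-level interrupt when the address is monitored.
        self.memory[addr] = value
        handler = self.monitors.get(addr)
        if handler is not None:
            handler(self)           # returning from the call models ERETURN

    def run_eu_threads(self):
        # Drain the spawn queue; handlers may enqueue more work.
        while self.spawn_queue:
            routine, data = self.spawn_queue.popleft()
            routine(self, data)


def eu_routine(model, data):
    model.memory["result"] = sum(x * x for x in data)  # do some work
    model.eu_store("task_complete", True)              # notify the CPU


results = []


def handler(model):
    if model.memory["task_complete"]:
        results.append(model.memory["result"])         # use EU result


model = UliModel()
model.memory["task_complete"] = False
model.emonitor("task_complete", handler)
model.signal(eu_routine, [1, 2, 3])
model.run_eu_threads()
```

In this model the handler runs as soon as the EU routine writes the monitored flag, so the CPU side never polls for completion.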
ULI Pipeline Modifications
Instruction decoder: Accept new instructions
Channel registers: Map scenario (address) to user handler
Interrupt unit: Support user-level interrupts
Microcode: New flows for EMONITOR, ERETURN, SIGNAL, and interrupt flow
Shared Memory Hierarchy

Shared address space
Address Translation Remapping: CPU handles memory translation when EU TLB misses
See Perry Wang, et al., EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System
Shared memory hierarchy

Share a cache with the CPU
Helps collaborative multi-threading
Avoids copying data between CPU and GPU
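A minimal sketch of the "CPU handles memory translation when EU TLB misses" idea follows. It is a simplification for illustration and does not reproduce the EXOCHI mechanism in detail. The class names, the 4 KB page size, the two-entry TLB, and the oldest-first eviction are all assumptions.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class CpuPageTable:
    """Stands in for the CPU-side translation (hypothetical model)."""

    def __init__(self, mapping):
        self.mapping = mapping   # virtual page number -> physical frame
        self.walks = 0           # how many times the CPU was asked

    def translate(self, vpn):
        self.walks += 1
        return self.mapping[vpn]  # a KeyError would model a page fault


class EuTlb:
    """Small EU-side TLB that defers every miss to the CPU."""

    def __init__(self, cpu, capacity=2):
        self.cpu = cpu
        self.capacity = capacity
        self.entries = {}        # vpn -> frame, kept in insertion order

    def lookup(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.entries:
            # TLB miss: the CPU performs the translation for the EU.
            frame = self.cpu.translate(vpn)
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict oldest
            self.entries[vpn] = frame
        return self.entries[vpn] * PAGE_SIZE + offset


cpu = CpuPageTable({0: 7, 1: 3})
tlb = EuTlb(cpu)
paddr = tlb.lookup(4100)  # virtual page 1, offset 4
```

Because the EU and the CPU resolve addresses through the same page table, a pointer produced on one side can be dereferenced on the other without copying the data.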
Area/Power Efficiency
Graphics pipeline area equals 9.5 EU cores
65 nm synthesis of Intel GMA X4500
Graphics pipeline power equals 4.9 EU cores
Replace graphics pipeline with Thread Spawner

Thread Spawner is tiny: 1% of core
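The area figures on this slide can be turned into a rough budget. The sketch below assumes that "core" means one EU and that area adds linearly; the function names are hypothetical and the outputs are back-of-the-envelope arithmetic, not measured results.

```python
PIPELINE_AREA_EUS = 9.5   # from the slide: graphics pipeline area in EU-equivalents
SPAWNER_AREA_EUS = 0.01   # "1% of core", assuming "core" means one EU


def pipeline_area_fraction(n_eus):
    """Share of (pipeline + EUs) area taken by the graphics pipeline."""
    return PIPELINE_AREA_EUS / (n_eus + PIPELINE_AREA_EUS)


def eus_in_same_area(n_eus):
    """EU-equivalents available if the pipeline is replaced by a Thread Spawner."""
    budget = n_eus + PIPELINE_AREA_EUS
    return budget - SPAWNER_AREA_EUS


fraction = pipeline_area_fraction(2)   # design with 2 EUs
capacity = eus_in_same_area(2)
```

Under these assumptions, a design with 2 EUs spends most of its GPU area on the fixed-function pipeline, which is the argument for removing it when graphics is not needed.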
[Figure: graphics pipeline front-end and back-end]
Pangaea Prototype
Synthesis of production-quality RTL code
2-issue, in-order IA32 CPU (37% of design)
2 EUs from Intel GMA X4500 (31% x 2)
Virtex 5 LX330, 136772 LUTs, 17 MHz
66% of LX330
Boots Linux, Windows, DOS, ...
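The prototype numbers can be cross-checked with a few lines of arithmetic. The sketch below takes the percentages and the LUT total from the slide. The LX330 capacity of 207,360 LUTs is an assumption taken from the public Virtex-5 device tables rather than from the deck, and the per-block LUT counts are derived estimates.

```python
TOTAL_LUTS = 136_772   # from the slide
LX330_LUTS = 207_360   # assumed device capacity (not stated in the slides)
CPU_PCT = 37           # IA32 CPU share of the design, from the slide
EU_PCT = 31            # share of each of the 2 EUs, from the slide

utilization_pct = 100 * TOTAL_LUTS / LX330_LUTS
other_pct = 100 - CPU_PCT - 2 * EU_PCT       # everything that is not CPU or EU

cpu_luts = TOTAL_LUTS * CPU_PCT / 100        # estimated LUTs used by the CPU
eu_luts = TOTAL_LUTS * EU_PCT / 100          # estimated LUTs used by one EU
```

With these inputs the utilization rounds to the 66% quoted on the slide, and the CPU and two EUs together account for 99% of the design.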
Thread Spawn Latency
Thread Spawn latency reduced by 60x when bypassing graphics pipeline

GPGPU driver software overhead not included
Throughput Performance
2 EUs vs. 1 CPU

k-means and svm collaborate with CPU
k-means is CPU-bound
Latency Sensitivity
Bicubic and FGT code larger than 4KB i-cache
k-means is CPU-bound
Insensitive to memory latency below ~60 cycles
Can trade off level of memory hierarchy to share
Conclusions
Added ULI communication to IA32, built on cache coherency mechanisms
Modularity allows scalable design
Shared memory and cache is good for ease of programming and collaboration
Highest-performance implementation not critical
Legacy graphics takes up 9.5 EUs of area, 4.9 EUs of power. Remove if not necessary.
Prototype shows it is ok to remove
Conclusions
IA32 ULI built on cache coherency mechanisms enables scalable, modular design
Shared memory and cache is good for ease of programming and collaboration
Legacy graphics fixed functions have high overhead
Questions?
EU vs. CPU Peak Throughput

2 EUs have 2x peak performance vs. CPU
TLP increases utilization (92% vs. 65%, linear)
Large register file (57% vs. 7.4% memory, bicubic)
Multiply-accumulate (55% of bicubic)
SIMD-8/16 instructions lower instruction count
Shaders
For each vertex, run a program.
...or each pixel
Program instances mutually independent
Shaders designed to run many independent instances of the same short program
GPGPU
For each _____, run a program.
Write shader programs to do something non-graphics
Sparse matrix solvers, linear algebra, sorting algorithms...
Brook for GPUs
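The "for each _____, run a program" pattern can be sketched as a kernel launched once per element. This is an illustration in plain Python, not GPU code: `launch` and `saxpy_kernel` are hypothetical names, and a thread pool merely stands in for the many hardware threads a GPU would use.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(i, a, x, y):
    """One program instance: computes output element i from its inputs only."""
    return a * x[i] + y[i]


def launch(kernel, n, *args, workers=4):
    # Instances are mutually independent, so they may run in any order;
    # pool.map still returns the results in index order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda i: kernel(i, *args), range(n)))


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = launch(saxpy_kernel, len(x), 2.0, x, y)
```

Because no instance reads another instance's output, the same kernel could be applied to vertices, pixels, or the elements of any non-graphics data set.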
Other Stuff from Intel MRL

Papers
Multiple Instruction Stream Processor
EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System
Ideas
User-level sequencer so OS isn't modified
Let CPU handle exceptions on behalf of sequencer
Shared memory space for ease of programming
Pangaea can be thought of as an extension of EXO
Pangaea Resource Usage

2-issue, in-order IA32 CPU
2 EUs from Intel GMA X4500
Virtex 5 LX330, 136772 LUTs, 17 MHz
Earlier Pangaea Prototype

1 CPU, 1 EU, 256 kB memory
Virtex 4 LX200, 130352 4-LUTs, 17.5 MHz