Pangaea: A Tightly-Coupled Heterogeneous IA32 Chip Multiprocessor

Henry Wong1, Anne Bracy2, Ethan Schuchman2, Tor M. Aamodt1, Jamison D. Collins2, Perry H. Wang2, Gautham Chinya2, Ankur Khandelwal Groen3, Hong Jiang4, Hong Wang2
henry@stuffedcow.net, anne.c.bracy@intel.com

1 Dept. of Electrical and Computer Engineering, University of British Columbia
2 Microarchitecture Research Lab, Microprocessor Technology Labs, Intel Corporation
3 Digital Enterprise Group, Intel Corporation
4 Graphics Architecture, Mobility Groups, Intel Corporation

Parallel Architectures and Compilation Techniques, October 27, 2008

Pangaea
Integrates IA32 CPU with GPU cores
Improved area/power efficiency
Tighter integration
Modular design

Pangaea. PACT 2008

Motivation
GPUs have low Energy Per Instruction
~100x less EPI than CPU
Parallel performance too
Pangaea targets non-graphics computation for further efficiency gains

Tightly-coupled
Easier to program
Lower communication latency

Minimize changes to existing software (OS)

Overview
Background on GPU Computation
Pangaea: IA32-GPU chip multiprocessor
User-Level Interrupt mechanism
Architecture trade-offs
Prototype performance

Conclusion

Programmable GPU
Rendering pipeline
Polygons go in
Pixels come out

DX10 has 3 programmable stages

Nvidia CUDA, AMD Stream


Use shader processors without graphics API
C-like high-level language for convenience

GPU + CPU
Loosely-coupled to the CPU
Off-chip latency
Explicit data copy between memory spaces
Cooperation?

GPU Integration
Put them on the same chip
No off-chip latency
Explicit data copy between memory spaces
Cooperation??

Pangaea
Single-chip, tightly-coupled
No off-chip latency
Shared memory address space: share, not copy
Cooperation!

Pangaea Architecture
Tightly-integrated
User-level interrupts (ULI) for communication
Shared memory and cache

Use GPU cores for compute


Execution Unit (EU)

EU Thread Life Cycle

[Figure: EU thread life cycle. Recoverable labels: "Work...", "Done!", "Signal", "User Handler".]

User-Level Interrupts (ULI)


EMONITOR
Watches for an address invalidation
Calls user interrupt handler in response

ERETURN
Returns from user-level interrupt handler

SIGNAL
Tells Thread Spawner to start new thread.

Using ULI: CPU Code

    {
        task_complete = false;
        EMONITOR(&task_complete, &handler);  // call handler when task_complete is written
        SIGNAL(&eu_routine, &eu_data);       // start eu_routine on an EU
        { Do some work }
    }

Using ULI: EU Code

    {
        Do some work;
        task_complete = true;  // this write triggers the CPU's monitor
    }

Using ULI: User Handler

    handler() {
        if (task_complete) {
            Use EU result or start EU task
        }
        ERETURN();
    }
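The three code fragments above can be tied together in a small software emulation. This is only a sketch in Python, not the hardware mechanism: the class `UliEmulator`, its `store` and `wait_for_eus` methods, and the dictionary standing in for shared memory are all hypothetical names introduced for illustration. One difference is worth noting: here the handler runs on the emulated EU thread that performs the write, whereas on Pangaea the handler interrupts the CPU thread.

```python
import threading


class UliEmulator:
    """Toy software model of EMONITOR / SIGNAL / ERETURN (not the hardware)."""

    def __init__(self):
        self.memory = {}      # stands in for the shared address space
        self.monitors = {}    # monitored address -> user handler
        self.threads = []     # emulated EU threads
        self.lock = threading.Lock()

    def emonitor(self, address, handler):
        # EMONITOR: watch an address and register the handler to call.
        self.monitors[address] = handler

    def signal(self, routine, data):
        # SIGNAL: ask the emulated Thread Spawner to start an EU thread.
        thread = threading.Thread(target=routine, args=(self, data))
        self.threads.append(thread)
        thread.start()

    def store(self, address, value):
        # A write to a monitored address invokes the registered handler.
        with self.lock:
            self.memory[address] = value
            handler = self.monitors.get(address)
        if handler is not None:
            handler()  # returning from the handler plays the role of ERETURN

    def wait_for_eus(self):
        # Emulation-only helper so the example finishes deterministically.
        for thread in self.threads:
            thread.join()


def run_example():
    uli = UliEmulator()
    results = []

    def handler():
        if uli.memory["task_complete"]:
            results.append(uli.memory["eu_result"])  # use the EU result

    def eu_routine(emu, eu_data):
        emu.store("eu_result", sum(x * x for x in eu_data))  # do some work
        emu.store("task_complete", True)                     # notify the CPU

    uli.store("task_complete", False)
    uli.emonitor("task_complete", handler)
    uli.signal(eu_routine, [1, 2, 3])
    cpu_work = sum(range(10))  # the CPU does its own work in the meantime
    uli.wait_for_eus()
    return cpu_work, results
```

Calling `run_example()` returns the CPU's own result together with the list of EU results collected by the handler.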

ULI Pipeline Modifications

Instruction decoder: accepts the new instructions
Channel registers: map a scenario (address) to the user interrupt handler
Interrupt unit: supports user-level interrupts
Microcode: new flows for the EMONITOR, ERETURN, and SIGNAL instructions, and a user interrupt flow
Shared Memory Hierarchy


Shared address space
Address Translation Remapping: the CPU handles memory translation when the EU TLB misses
See Perry Wang, et al., "EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System"

Shared memory hierarchy


Share a cache with the CPU
Helps collaborative multi-threading
Avoids copying data between CPU and GPU
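The "share, not copy" point can be illustrated with a small Python sketch. It is an analogy only: the function names are hypothetical, a `bytearray` stands in for memory, and the returned byte counts merely tally the explicit copies that a loosely-coupled design would need.

```python
def scale_with_copies(cpu_buffer, factor):
    """Loosely-coupled style: copy in, compute, copy back."""
    gpu_buffer = bytearray(cpu_buffer)       # explicit copy to "GPU memory"
    for i in range(len(gpu_buffer)):
        gpu_buffer[i] = (gpu_buffer[i] * factor) % 256
    cpu_buffer[:] = gpu_buffer               # explicit copy back
    return 2 * len(cpu_buffer)               # bytes moved between spaces


def scale_shared(shared_buffer, factor):
    """Shared address space style: operate on the same storage in place."""
    view = memoryview(shared_buffer)         # same storage, no copy
    for i in range(len(view)):
        view[i] = (view[i] * factor) % 256
    return 0                                 # nothing was copied
```

Both functions leave the same values in the caller's buffer; only the amount of data movement differs.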

Area/Power Efficiency
Graphics pipeline area equals that of 9.5 EU cores
65 nm synthesis of Intel GMA X4500
Graphics pipeline power equals that of 4.9 EU cores
Replace graphics pipeline with Thread Spawner
Thread Spawner is tiny: 1% of core
[Figure labels: Front-end, Back-end]
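To get a feel for these figures, the sketch below computes what share of an "EUs plus overhead block" budget the overhead block would take. It is a rough illustration under stated assumptions, not a result from the talk: the EU count is a free parameter, "1% of core" is read as 1% of one EU, and the CPU, caches, and interconnect are ignored.

```python
GRAPHICS_PIPELINE_AREA_EUS = 9.5    # from the slide
GRAPHICS_PIPELINE_POWER_EUS = 4.9   # from the slide
THREAD_SPAWNER_AREA_EUS = 0.01      # assumption: "1% of core" means 1% of one EU


def overhead_share(n_eus, overhead_eus):
    """Fraction of (EUs + overhead block) taken by the overhead block."""
    return overhead_eus / (n_eus + overhead_eus)


def compare(n_eus):
    """Return the three shares for a design with n_eus execution units."""
    return {
        "pipeline_area": overhead_share(n_eus, GRAPHICS_PIPELINE_AREA_EUS),
        "pipeline_power": overhead_share(n_eus, GRAPHICS_PIPELINE_POWER_EUS),
        "spawner_area": overhead_share(n_eus, THREAD_SPAWNER_AREA_EUS),
    }
```

For example, `compare(10)` evaluates the shares for a hypothetical ten-EU design; the pipeline share shrinks as the EU count grows, which is why the overhead matters most for small configurations.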

Pangaea Prototype
Synthesis of production-quality RTL code
2-issue, in-order IA32 CPU (37% of design)
2 EUs from Intel GMA X4500 (31% x 2)

Virtex 5 LX330, 136772 LUTs, 17 MHz


66% of LX330

Boots Linux, Windows, DOS, ...
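The utilization and breakdown figures above can be cross-checked with simple arithmetic. The sketch assumes the Virtex-5 LX330 provides 207,360 LUTs (a device figure not stated on the slide) and that the "% of design" shares are fractions of the same total.

```python
LX330_TOTAL_LUTS = 207_360   # assumption: device capacity, not from the slide
DESIGN_LUTS = 136_772        # from the slide


def device_utilization(used_luts, total_luts):
    """Fraction of the device's LUTs used by the design."""
    return used_luts / total_luts


def unaccounted_share(cpu_share, eu_share, n_eus):
    """Share of the design left after the CPU and EUs, e.g. for glue logic."""
    return 1.0 - (cpu_share + n_eus * eu_share)


utilization = device_utilization(DESIGN_LUTS, LX330_TOTAL_LUTS)
remainder = unaccounted_share(0.37, 0.31, 2)
```

Under these assumptions `utilization` rounds to the 66% quoted above, and the CPU plus two EUs account for nearly all of the design.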

Thread Spawn Latency

Thread Spawn latency reduced by 60x when bypassing graphics pipeline


GPGPU driver software overhead not included

Throughput Performance

2 EUs vs. 1 CPU


k-means and svm collaborate with CPU
k-means is CPU-bound

Latency Sensitivity

Bicubic and FGT code larger than 4KB i-cache
k-means is CPU-bound
Insensitive to memory latency < ~60 cycles
Can trade off level of memory hierarchy to share

Conclusions
Added ULI communication to IA32, built on cache coherency mechanisms
Modularity allows scalable design

Shared memory and cache are good for ease of programming and collaboration
Highest-performance implementation not critical

The legacy graphics pipeline takes up the area of 9.5 EUs and the power of 4.9 EUs. Remove it if not necessary.
The prototype shows that removing it is acceptable

Conclusions
IA32 ULI built on cache coherency mechanisms enables scalable, modular design
Shared memory and cache are good for ease of programming and collaboration
Legacy graphics fixed functions have high overhead

Questions?

EU vs. CPU Peak Throughput


2 EUs have 2x peak performance vs. CPU
TLP increases utilization (92% vs. 65%, linear)
Large register file (57% vs. 7.4% memory, bicubic)
Multiply-accumulate (55% of bicubic)
SIMD-8/16 instructions lower instruction count
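The last point, that wider SIMD lowers instruction count, comes down to a ceiling division. The sketch below counts instructions for a single element-wise operation only; the function name and the 64-element example are illustrative assumptions, and real instruction counts also depend on control flow and memory operations.

```python
def simd_instruction_count(n_elements, simd_width):
    """Instructions needed to apply one operation to n_elements values."""
    return -(-n_elements // simd_width)   # ceiling division


scalar_count = simd_instruction_count(64, 1)
simd8_count = simd_instruction_count(64, 8)
simd16_count = simd_instruction_count(64, 16)
```

When the element count is not a multiple of the width, the final instruction is only partly filled, which the ceiling accounts for.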

Shaders
For each vertex, run a program.
...or each pixel

Program instances mutually independent
Shaders designed to run many independent instances of the same short program

GPGPU
For each _____, run a program.
Write shader programs to do something non-graphics
Sparse matrix solvers, linear algebra, sorting algorithms...
Brook for GPUs
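The "for each _____, run a program" model can be sketched in a few lines of Python. The helper `run_for_each` and the SAXPY kernel are hypothetical illustrations: each kernel instance depends only on its own index, which is what allows a GPU to run the instances in parallel, although this sketch runs them sequentially.

```python
def run_for_each(kernel, n_instances, *buffers):
    """Run one independent kernel instance per index, like a shader per pixel."""
    return [kernel(i, *buffers) for i in range(n_instances)]


def saxpy_kernel(i, a, x, y):
    """One instance computes one output element: a * x[i] + y[i]."""
    return a * x[i] + y[i]


result = run_for_each(saxpy_kernel, 3, 2.0, [1, 2, 3], [10, 20, 30])
```

Because no instance reads another instance's output, the order of execution does not affect `result`.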

Other Stuff from Intel MRL


Papers
Multiple Instruction Stream Processor
EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System

Ideas
User-level sequencer so OS isn't modified
Let CPU handle exceptions on behalf of sequencer
Shared memory space for ease of programming

Pangaea can be thought of as extension of Exo

Pangaea Resource Usage


2-issue, in-order IA32 CPU
2 EUs from Intel GMA X4500
Virtex 5 LX330, 136772 LUTs, 17 MHz

Earlier Pangaea Prototype


1 CPU, 1 EU, 256 kB memory
Virtex 4 LX200, 130352 4-LUTs, 17.5 MHz
