Henry Wong1, Anne Bracy2, Ethan Schuchman2, Tor M. Aamodt1, Jamison D. Collins2, Perry H. Wang2, Gautham Chinya2, Ankur Khandelwal Groen3, Hong Jiang4, Hong Wang2
henry@stuffedcow.net, anne.c.bracy@intel.com
1 Dept. of Electrical and Computer Engineering, University of British Columbia
2 Microarchitecture Research Lab, Microprocessor Technology Labs, Intel Corporation
3 Digital Enterprise Group, Intel Corporation
4 Graphics Architecture, Mobility Groups, Intel Corporation
Pangaea
- Integrates IA32 CPU with GPU cores
- Improved area/power efficiency
- Tighter integration
- Modular design
Motivation
GPUs have low Energy Per Instruction (EPI):
- ~100x less EPI than the CPU
- Parallel performance too
Pangaea targets non-graphics computation for further efficiency gains.
Tightly coupled:
- Easier to program
- Lower communication latency
Overview
- Background on GPU computation
- Pangaea: IA32-GPU chip multiprocessor
  - User-level interrupt mechanism
  - Architecture trade-offs
  - Prototype performance
- Conclusion
Programmable GPU
Rendering pipeline
Polygons go in, pixels come out.
GPU + CPU
Loosely coupled to the CPU:
- Off-chip latency
- Explicit data copy between memory spaces
- Cooperation?
GPU Integration
Put them on the same chip:
- No off-chip latency
- Still explicit data copy between memory spaces
- Cooperation??
Pangaea
Single-chip, tightly coupled:
- No off-chip latency
- Shared memory address space: share, not copy
- Cooperation!
Pangaea Architecture
Tightly integrated:
- User-level interrupts (ULI) for communication
- Shared memory and cache
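Because CPU and GPU share one cache-coherent address space, cooperation can be expressed by handing over a pointer instead of copying buffers. A minimal C sketch of the contrast (function names are illustrative, not from the design):

```c
#include <stdlib.h>
#include <string.h>

/* Loosely-coupled discrete GPU: data must be copied into a separate
 * memory space before the GPU can use it (malloc stands in for a
 * device allocation here). */
float *offload_with_copy(const float *host_buf, size_t n) {
    float *dev_buf = malloc(n * sizeof *dev_buf);
    if (dev_buf)
        memcpy(dev_buf, host_buf, n * sizeof *dev_buf); /* explicit copy */
    return dev_buf;
}

/* Pangaea-style shared memory: the GPU threads operate on the same
 * cache-coherent memory, so "transferring" the data is just sharing
 * the pointer. */
const float *offload_shared(const float *host_buf) {
    return host_buf; /* share, not copy */
}
```

The copy variant also implies a second allocation and a reverse copy for results; the shared variant needs neither.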
ERETURN
Returns from a user-level interrupt handler.
SIGNAL
Tells the Thread Spawner to start a new thread.
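The ULI instructions can be pictured in one flow. This is a hedged sketch in assembly-like pseudocode; the operand forms and labels are assumptions, not the actual encodings:

```
; CPU (IA32) side
EMONITOR scenario, uli_handler  ; map scenario (address) to the user handler
SIGNAL   thread_desc            ; tell the Thread Spawner to start a GPU thread
...                             ; CPU continues running user code

uli_handler:                    ; invoked as a user-level interrupt when the
    ...                         ; GPU signals completion; read results from
    ERETURN                     ; shared memory, then resume interrupted code
```

The point of the mechanism is that delivery happens entirely at user level, with no OS modification.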
- Instruction Decoder: support for new EMONITOR, ERETURN, and SIGNAL instructions
- Channel registers: maps scenario (address) to user handler
- Interrupt unit: accept new user-level interrupts
- Microcode: new instruction flows and interrupt flow
Pangaea. PACT 2008
Area/Power Efficiency
Graphics pipeline area is equivalent to 9.5 EUs (65 nm synthesis of Intel GMA X4500)
[Chart: graphics pipeline front-end and back-end area breakdown]
Pangaea Prototype
Synthesis of production-quality RTL code
- 2-issue, in-order IA32 CPU (37% of design)
- 2 EUs from Intel GMA X4500 (31% × 2)
Throughput Performance
Latency Sensitivity
- Bicubic and FGT code are larger than the 4 KB i-cache
- k-means is CPU-bound
- Insensitive to memory latency below ~60 cycles
Can trade off which level of the memory hierarchy to share.
Conclusions
- Added ULI communication to IA32, built on cache coherency mechanisms
- Modularity allows a scalable design
- Shared memory and cache ease programming and collaboration
- Highest-performance implementation not critical
- Legacy graphics takes up 9.5 EUs of area and 4.9 EUs of power; remove it if not necessary
- Prototype shows it is OK to remove
Conclusions
- IA32 ULI built on cache coherency mechanisms enables a scalable, modular design
- Shared memory and cache ease programming and collaboration
- Legacy graphics fixed functions have high overhead
Questions?
Shaders
For each vertex, run a program.
...or each pixel
- Program instances are mutually independent
- Shaders are designed to run many independent instances of the same short program
GPGPU
For each _____, run a program.
- Write shader programs to do something non-graphics
- Sparse matrix solvers, linear algebra, sorting algorithms...
- Brook for GPUs
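The "for each _____, run a program" model can be sketched in plain C, with the per-element function standing in for the shader. The saxpy kernel below is an illustrative example, not one of the workloads from the slides:

```c
#include <stddef.h>

/* One "program instance": instances are mutually independent, which is
 * what lets the GPU run many of them at once. */
static float saxpy(float a, float x, float y) {
    return a * x + y;
}

/* "For each element, run a program." On a GPU the iterations of this
 * loop would execute as independent shader-style threads. */
void run_kernel(float a, const float *x, const float *y,
                float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = saxpy(a, x[i], y[i]);
}
```

Because no iteration reads another's output, the loop parallelizes trivially; that independence is the property GPGPU frameworks such as Brook rely on.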
Ideas
- User-level sequencer so the OS isn't modified
- Let the CPU handle exceptions on behalf of the sequencer
- Shared memory space for ease of programming