Darek Kawamoto
Introduction
Project Origin
Inverse Discrete Cosine Transform Kernel
Summary
Performance
Conclusion
Project Origin
Computer Animation for Scientific Visualization
Stuart Levy, UIUC / NCSA
Goal: Decode big (1920x1080) JPEG images fast (~30 fps)
A GPU is cheaper than specialized hardware
1-D IDCT:

p_x = (1/2) Σ_{i=0}^{7} C_i G_i cos((2x+1) i π / 16)

where C_f = 1/√2 when f = 0, and C_f = 1 otherwise.
The 2-D IDCT is separable: it is equivalent to the 1-D transform applied in each direction.
The kernel therefore uses only 1-D transforms.
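A sketch of the two facts above in plain Python (not the CUDA kernel; the function names are illustrative): applying the 1-D formula down the columns and then across the rows reproduces the direct 2-D double sum.

```python
import math

def idct_1d(G):
    """1-D IDCT of 8 coefficients, straight from the formula:
    p_x = 1/2 * sum_{i=0}^{7} C_i * G_i * cos((2x+1) i pi / 16),
    with C_0 = 1/sqrt(2) and C_i = 1 otherwise."""
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8))
            for x in range(8)]

def idct_2d_separable(block):
    """2-D IDCT of an 8x8 block as 1-D passes: columns first, then rows."""
    cols = [idct_1d([block[y][x] for y in range(8)]) for x in range(8)]
    tmp = [[cols[x][y] for x in range(8)] for y in range(8)]  # transpose back
    return [idct_1d(row) for row in tmp]

def idct_2d_direct(block):
    """Reference: the full 2-D double sum."""
    C = [1 / math.sqrt(2)] + [1.0] * 7
    out = [[0.0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            s = 0.0
            for v in range(8):
                for u in range(8):
                    s += (C[u] * C[v] * block[v][u]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = 0.25 * s
    return out
```

For a DC-only block both paths give a flat output, and the two paths agree to rounding error on any input.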
IDCT Kernel
Thread Parallelism
Each thread corresponds to an element of the matrix.
Threads compute the IDCT across columns, then across rows.
Other Optimizations
Careful use of the 16KB shared memory: 6 blocks per SMP
Unrolled 5x: each iteration computes five 2-D IDCTs
How precise?
short implementations do almost as well as float
double precision has no advantages
(short) Chen-Wang
11 Multiplies, 29 Adds per 1-D transform
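A rough illustration of why short (16-bit) arithmetic holds up against float. This is the naive matrix form with pre-scaled integer cosines, not the Chen-Wang butterfly factorization itself; `SCALE` and the function names are illustrative.

```python
import math

SCALE = 11  # fixed-point fraction bits (illustrative; fits 16-bit storage)

# Cosine basis pre-scaled to integers, as a short-int kernel would store it.
COS_Q = [[round(math.cos((2 * x + 1) * i * math.pi / 16) * (1 << SCALE))
          for i in range(8)] for x in range(8)]
C_Q = [round((1 / math.sqrt(2)) * (1 << SCALE))] + [1 << SCALE] * 7

def idct_1d_float(G):
    """Reference float 1-D IDCT, straight from the formula."""
    C = [1 / math.sqrt(2)] + [1.0] * 7
    return [0.5 * sum(C[i] * G[i] * math.cos((2 * x + 1) * i * math.pi / 16)
                      for i in range(8)) for x in range(8)]

def idct_1d_short(G):
    """Integer-only 1-D IDCT: multiply by pre-scaled cosines, shift back.
    (Naive matrix form, not the Chen-Wang butterfly factorization.)"""
    out = []
    for x in range(8):
        acc = 0
        for i in range(8):
            acc += (C_Q[i] * G[i] >> SCALE) * COS_Q[x][i]
        out.append(acc >> (SCALE + 1))  # the extra >> 1 is the 1/2 factor
    return out
```

On typical coefficient magnitudes the integer result tracks the float result to within about one unit, which is below the quantization noise already present in JPEG data.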
Speed
30 fps means each frame needs to be ready in 33 ms
How much time is left to perform the other JPEG functions?
With 67,200 blocks, we have 28.6 ms left
With 245,760 blocks, we have 16.4 ms left
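As a sanity check on the arithmetic (assuming "time left" means the 33.3 ms frame budget minus the IDCT kernel time), both block counts imply a consistent per-block cost:

```python
FRAME_MS = 1000.0 / 30.0  # 33.3 ms frame budget at 30 fps

# "Time left" figures from the slide, keyed by total 8x8 block count.
time_left_ms = {67_200: 28.6, 245_760: 16.4}

for blocks, left in time_left_ms.items():
    idct_ms = FRAME_MS - left              # implied IDCT kernel time
    per_block_ns = idct_ms / blocks * 1e6
    print(f"{blocks:>7} blocks: IDCT ~{idct_ms:.1f} ms, "
          f"~{per_block_ns:.0f} ns per block")
```

Both image sizes work out to roughly 70 ns of IDCT time per 8x8 block, so the two slide numbers are mutually consistent.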
Conclusion
Could not previously hope to process a frame in < 33 ms
The application now depends on the speedup of the other kernels
Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work
Design Approach
Spawn parallel work threads
Problems
Does the parallel speedup of successful synchronization offset the penalty of extra work?
Yes, if we choose our work wisely!
We do so by exploiting JPEG structure and probability.
Each decoder thread doesn't know how much data it will decode.
Allocate memory on the device using atomic functions.
New Approach
Suppose EOB = 0101
EOB Overhead
Overhead associated with finding EOB symbols
Implemented a kernel to do so, < 1 ms
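A sketch of the scan in Python over a bit-string (not the sub-millisecond CUDA kernel; the function name is illustrative): every match of the EOB pattern yields a candidate start offset, false hits included.

```python
def find_eob_candidates(bits, eob="0101"):
    """Return the bit offset just past every occurrence of the EOB pattern.
    Each offset is a candidate start for a speculative decoder thread;
    false hits (the pattern landing mid-codeword) are allowed here and are
    weeded out later during synchronization."""
    n = len(eob)
    return [i + n for i in range(len(bits) - n + 1) if bits[i:i + n] == eob]

print(find_eob_candidates("0101101011"))  # candidate starts at bits 4 and 9
```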
Decode Synchronization
Each decoder thread maintains the following information:
Where it started (the first bit it looked at)
Length and data of the decoder output
Where it is (the bit it currently looks at)
Synchronization occurs when the current thread location matches another thread's start location.
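The bookkeeping above can be simulated with a toy prefix code (illustrative, not the real JPEG Huffman tables): a speculative decoder's output is accepted when the preceding decoder's cursor lands exactly on its recorded start bit.

```python
# Toy prefix code standing in for the JPEG Huffman tables (illustrative).
CODE = {"0": "a", "10": "b", "11": "c"}

def decode(bits, start, stop):
    """Decode codewords from bit offset `start`, stopping once the cursor
    reaches `stop`.  Returns (symbols, final cursor position)."""
    out, i = [], start
    while i < stop:
        for cw, sym in CODE.items():
            if bits.startswith(cw, i):
                out.append(sym)
                i += len(cw)
                break
        else:
            raise ValueError("not a valid codeword")
    return out, i

# Stream encoding the symbols a b c a b c.
bits = "0" + "10" + "11" + "0" + "10" + "11"
serial, _ = decode(bits, 0, len(bits))

# Thread 1 speculates that bit 5 starts a codeword and decodes ahead;
# thread 0 decodes from the true start of the stream.
t1_start = 5
t1_out, _ = decode(bits, t1_start, len(bits))
t0_out, t0_end = decode(bits, 0, t1_start)

# Synchronization: thread 0's cursor lands exactly on thread 1's recorded
# start, so thread 1's speculative output is valid and is concatenated.
assert t0_end == t1_start
assert t0_out + t1_out == serial
```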
Problem
What happens to false EOB hits... do they synchronize?
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?
Synchronization
Problem
What happens to false EOB hits... do they synchronize?
Answer
In general, yes, they do. After several hundred bits they synchronize with the real stream and end at the next parallel section.
See the experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003).
Advanced logic can prevent falsely-hitting decoder threads from doing too much work.
Memory Allocation
Problem
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?
Solution
Store decoder output in chunks of global memory
Use atomic functions to acquire locks on chunks
Requires compute capability 1.1 (G92s)
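A host-side model of the scheme (Python threads with a lock standing in for atomicAdd, since Python has no device atomics; all names and the chunk size are illustrative): each decoder grabs a fresh chunk from a shared counter whenever its current chunk fills.

```python
import threading

CHUNK_WORDS = 16  # illustrative chunk size

class ChunkAllocator:
    """Chunked global-memory allocation, modeled on the host.  The shared
    `next_chunk` counter plays the role of a device-global counter bumped
    with atomicAdd; a lock stands in for the atomic here."""
    def __init__(self, n_chunks):
        self.n_chunks = n_chunks
        self.next_chunk = 0
        self._lock = threading.Lock()

    def alloc_chunk(self):
        with self._lock:              # models atomicAdd(&next_chunk, 1)
            if self.next_chunk >= self.n_chunks:
                raise MemoryError("out of chunks")
            idx = self.next_chunk
            self.next_chunk += 1
        return idx

def decoder_thread(alloc, n_words):
    """A decoder doesn't know its output size in advance, so it grabs a
    fresh chunk whenever the current one fills.  Returns its chunk list."""
    chunks, used = [alloc.alloc_chunk()], 0
    for _ in range(n_words):
        if used == CHUNK_WORDS:
            chunks.append(alloc.alloc_chunk())
            used = 0
        used += 1
    return chunks
```

Decoders emitting 5, 40, and 20 words end up owning 1, 3, and 2 disjoint chunks respectively, with no decoder needing to know its output size up front.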
Implementation
Implementing Parallel Decode is difficult
Naive schemes may get only marginal speedup
A full implementation of the presented design could yield meaningful speedup
Could not implement it with the equipment available (G80)
Implemented the EOB kernel: ~60 lines of code
Several decoder kernels may take ~500 lines of code
Does not fully implement the baseline JPEG specification
Markers can interrupt the block data, but can also be used to help synchronization.
Conclusion
IDCT Kernel speedup (5-59x) based on context
Because of the serial nature of JPEG, applications often cannot make use of wholesale transforms