
CUDA JPEG Essentials

Darek Kawamoto

Introduction
  Project Origin

Inverse Discrete Cosine Transform
  Kernel Summary
  Performance

Parallel Huffman Decode for JPEG
  Design Approach
  Design Problems, Solutions
  Implementation Remarks

Conclusion

Project Origin
Computer Animation for Scientific Visualization
  Stuart Levy, UIUC / NCSA
  Goal: decode big (1920x1080) JPEG images fast (~30 fps)
  GPU cheaper than specialized hardware

CUDA and Two JPEG Bottlenecks:
  Inverse Discrete Cosine Transform (IDCT): straightforward, similar to the class machine problems
  Huffman Decode Stage: tricky parallelization of a serial process

Inverse Discrete Cosine Transform


2-D IDCT:

$$p_{xy} = \frac{1}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} C_i\, C_j\, G_{ij} \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16}$$

1-D IDCT:

$$p_x = \frac{1}{2} \sum_{i=0}^{7} C_i\, G_i \cos\frac{(2x+1)i\pi}{16}$$

where $C_f = \tfrac{1}{\sqrt{2}}$ when $f = 0$, and $C_f = 1$ otherwise.
The 2-D IDCT is equivalent to the 1-D IDCT applied in each direction, so the kernel uses 1-D transforms (a reference transcription follows).
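For reference, a direct transcription of the 1-D formula into code. This is a minimal sketch; the function name and layout are illustrative, not the project's kernel.

```cuda
#include <math.h>

// Direct transcription of the 1-D IDCT above: 8 outputs, each a sum over
// 8 cosine terms (the 64 multiplies and 64 adds counted later).
__host__ __device__ void idct_1d(const float G[8], float p[8])
{
    const float PI = 3.14159265f;
    for (int x = 0; x < 8; ++x) {
        float sum = 0.0f;
        for (int i = 0; i < 8; ++i) {
            float Ci = (i == 0) ? 0.70710678f : 1.0f;  // C_0 = 1/sqrt(2)
            sum += Ci * G[i] * cosf((2 * x + 1) * i * PI / 16.0f);
        }
        p[x] = 0.5f * sum;
    }
}
```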

IDCT Kernel
Thread Parallelism
Each thread corresponds to one element of the 8x8 matrix.
Threads compute the IDCT across columns, then across rows.

Memory Access Patterns


Shared memory: accesses are broadcasts, or free of bank conflicts.
Global memory: accesses are buffered and coalesced.

Other Optimizations
Careful use of the 16 KB shared memory: 6 blocks per SM
Unrolled 5x: each iteration computes five 2-D IDCTs
(A simplified sketch of the thread layout follows.)
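A hypothetical sketch of the layout described above, not the original kernel: one thread block per 8x8 coefficient block, 64 threads, each owning one matrix element; the 5x unrolling is omitted.

```cuda
#include <math.h>

// Column pass then row pass through shared memory; each thread computes
// one output element per pass.
__global__ void idct_8x8(const float *in, float *out)
{
    __shared__ float tile[8][8];
    __shared__ float tmp[8][8];

    const float PI = 3.14159265f;
    int r = threadIdx.y, c = threadIdx.x;
    const float *blk = in + 64 * blockIdx.x;

    tile[r][c] = blk[8 * r + c];   // coalesced global load
    __syncthreads();

    // Column pass: 1-D IDCT down column c; thread (r,c) produces row r.
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) {
        float Ci = (i == 0) ? 0.70710678f : 1.0f;
        s += Ci * tile[i][c] * cosf((2 * r + 1) * i * PI / 16.0f);
    }
    tmp[r][c] = 0.5f * s;
    __syncthreads();

    // Row pass: 1-D IDCT across row r; whole rows read the same word,
    // which the hardware can broadcast.
    s = 0.0f;
    for (int j = 0; j < 8; ++j) {
        float Cj = (j == 0) ? 0.70710678f : 1.0f;
        s += Cj * tmp[r][j] * cosf((2 * c + 1) * j * PI / 16.0f);
    }
    out[64 * blockIdx.x + 8 * r + c] = 0.5f * s;   // coalesced store
}
```

Launched as, e.g., `idct_8x8<<<num8x8Blocks, dim3(8, 8)>>>(d_coeffs, d_pixels);`.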

IDCT Performance -- How...?


How to benchmark?
libJPEG: executes the processes serially
GPU: executes the IDCT process wholesale

How precise?
short implementations do almost as well as float
double precision has no advantage

How much work?


The GPU shines with > 64,000 blocks.
JPEG-specific: the CPU can short-circuit vectors of zeros.
Let the CPU short-circuit ~50% of columns in the first IDCT pass (see the sketch below).
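A sketch of the zero short-circuit on the CPU side (names are illustrative): if a column carries no AC energy, only the $i = 0$ term of the 1-D formula survives, and since $\cos(0) = 1$, the whole transform collapses to a constant fill.

```cuda
// If all AC terms of this column are zero, every output sample equals
// the scaled DC term: p_x = (1/2)(1/sqrt(2)) * G_0 for all x.
void idct_1d_column(const float G[8], float p[8])
{
    bool all_ac_zero = true;
    for (int i = 1; i < 8; ++i)
        if (G[i] != 0.0f) { all_ac_zero = false; break; }

    if (all_ac_zero) {
        float dc = G[0] * 0.35355339f;   // 1 / (2 * sqrt(2))
        for (int x = 0; x < 8; ++x) p[x] = dc;
        return;
    }
    /* ... otherwise, the full 1-D IDCT as before ... */
}
```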

IDCT Performance -- Cost


IDCT Implementations:

(float) Naïve 1-D
  64 multiplies and 64 adds per 1-D transform

(short) Chen-Wang
  11 multiplies, 29 adds per 1-D transform

(float) Arai, Agui, and Nakajima (AA&N)
  5 multiplies, 29 adds per 1-D transform
  Other multiplies folded into the de-quantization tables (see the sketch below)
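The folding works because AA&N's remaining multiplies are uniform output scalings, which can be pre-applied to the de-quantization table once per image. A sketch in the style of libjpeg's float IDCT, using the standard AA&N scale factors (names are illustrative):

```cuda
// Pre-scale the de-quantization table so the AA&N output multiplies
// come for free. aan_scale[k] = cos(k*pi/16) * sqrt(2), aan_scale[0] = 1.
void build_scaled_dequant(const unsigned short quant[64], float scaled[64])
{
    static const float aan_scale[8] = {
        1.0f, 1.387039845f, 1.306562965f, 1.175875602f,
        1.0f, 0.785694958f, 0.541196100f, 0.275899379f
    };
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            scaled[8 * r + c] = quant[8 * r + c]
                              * aan_scale[r] * aan_scale[c] / 8.0f;
}
```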

IDCT Performance -- Small


Approx. execution times for 67,200 blocks:

  (float) Naïve 1-D GPU             4.69 ms
  (float) Naïve 1-D CPU Serial       333 ms  (71x)
  (float) Naïve 1-D CPU Wholesale    100 ms  (21x)
  (short) Chen-Wang Serial            30 ms  (6.4x)
  (float) AA&N Wholesale              25 ms  (5.3x)
  (float) AA&N Serial                268 ms  (57x)

GPU: ~29 GFLOPS
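That figure is consistent with the naïve operation count, assuming 128 FLOPs (64 multiplies + 64 adds) per 1-D transform and 16 1-D transforms per 8x8 block: 67,200 × 16 × 128 ≈ 1.38 × 10^8 FLOPs, and 1.38 × 10^8 FLOPs / 4.69 ms ≈ 29 GFLOPS.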

IDCT Performance -- Big


Approx. execution times for 245,760 blocks:

  (float) Naïve 1-D GPU            16.94 ms
  (float) Naïve 1-D CPU Serial      1250 ms  (73x)
  (float) Naïve 1-D CPU Wholesale    375 ms  (22x)
  (short) Chen-Wang Serial           113 ms  (6.7x)
  (float) AA&N Wholesale              91 ms  (5.4x)
  (float) AA&N Serial               1000 ms  (59x)

GPU: ~30 GFLOPS

IDCT Performance Conclusion


Amount
  Wholesale transforms work much better than retail.
  67,200 IDCT blocks perform almost as well as 245,760.

Speed
  30 fps means each frame must be ready in 33 ms. How much time is left for the other JPEG functions?
  With 67,200 blocks, we have 28.6 ms left.
  With 245,760 blocks, we have 16.4 ms left.

Conclusion
  Previously we could not hope to process a frame in < 33 ms.
  The application now depends on the speedup of the other kernels.

Parallel Huffman Decode for JPEG


Huffman Compression
Prefix-free, variable-length code
Serial in nature: each symbol must be decoded in sequential order

Parallel Decoding Challenge


It is impossible to determine where symbols start and end without decoding all previous symbols (illustrated by the sketch below).
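A toy decoder makes the serial dependence concrete. The three-symbol code here is hypothetical, not JPEG's tables; the point is structural: the bit where symbol n+1 starts is only known after symbol n has been matched.

```cuda
#include <stdio.h>
#include <string.h>

// Toy prefix-free code over a '0'/'1' string: "0"->A, "10"->B, "11"->C.
// Note the loop-carried dependence: `pos` for the next symbol is known
// only after the current symbol's length has been determined.
static void serial_decode(const char *bits)
{
    size_t pos = 0, n = strlen(bits);
    while (pos < n) {
        if (bits[pos] == '0')          { putchar('A'); pos += 1; }
        else if (bits[pos + 1] == '0') { putchar('B'); pos += 2; }
        else                           { putchar('C'); pos += 2; }
    }
    putchar('\n');
}

int main(void)
{
    serial_decode("010110");   // prints ABCA
    return 0;
}
```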

Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work

Spawn parallel work threads.


Problems
Does the parallel speedup from successful synchronization offset the penalty of the extra work?
  Yes, if we choose our work wisely, by exploiting JPEG structure and probability.
Each decoder thread doesn't know how much data it will decode.
  Allocate memory on the device using atomic functions.

Choosing Work Wisely


Exploit Block Coding
  Each block of coefficients encodes a DC coefficient and assorted AC coefficients.
  Due to quantization and the coding scheme, a block is likely to end with an End of Block (EOB) symbol.
  If the EOB symbol is 4 or more bits long and cannot prefix itself, the probability of a random match is at most 1/16 (2^-4).
  In regions where we want to start a parallel decode thread, only start after possible EOB symbols.
  Any symbol could be used to attempt synchronization; EOB is arbitrary but practical, because the DC coefficient that follows a block boundary is coded differently.

New Approach
Suppose EOB = 0101

EOB Overhead
Overhead associated with finding EOB symbols
Implemented a kernel to do so, < 1 ms

Effectiveness depends on block-length statistics:
  If we can guarantee a true EOB hit in each section of the stream we look at, then we guarantee synchronization within that section.
  If we cannot guarantee synchronization, some threads may have to decode multiple sections.
  Research on these statistics is necessary to make design decisions that maximize the probability of true EOB hits while minimizing the number of false hits. (A sketch of the scan kernel follows.)
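A minimal sketch of such an EOB-scan kernel: one thread per bit offset, each testing whether the 4 bits starting there match the example pattern 0101 from the earlier slide. All names, and the pattern itself, are illustrative.

```cuda
// Record every bit offset at which a *possible* EOB ends; false hits
// are expected and handled later. Candidates come back unordered and
// are compacted with an atomic counter.
__global__ void find_eob_candidates(const unsigned char *stream,
                                    int n_bits,
                                    int *candidates, int *n_candidates)
{
    int bit = blockIdx.x * blockDim.x + threadIdx.x;
    if (bit + 4 > n_bits) return;

    // Gather the 4 bits at offset `bit` (MSB-first, as in JPEG).
    unsigned v = 0;
    for (int k = 0; k < 4; ++k) {
        int b = bit + k;
        v = (v << 1) | ((stream[b >> 3] >> (7 - (b & 7))) & 1u);
    }
    if (v == 0x5)   // binary 0101: a candidate EOB
        candidates[atomicAdd(n_candidates, 1)] = bit + 4;  // decode starts after it
}
```

A sort of `candidates` afterwards restores stream order before decoder threads are assigned.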

Decode Synchronization
Each decoder thread maintains information
  Where it started (the bit it first looked at)
  The length and data of its decoder output
  Where it is (the bit it currently looks at)
Synchronization occurs when a thread's current location matches another thread's start location (see the sketch below).
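One way to hold that bookkeeping, as a hypothetical sketch; `decode_symbol` stands in for the real Huffman table walk and is only declared here, not defined.

```cuda
// Per-thread state for speculative decode, mirroring the list above.
struct DecoderState {
    int    start_bit;   // where it started (the bit it first looked at)
    int    cur_bit;     // where it is (the bit it currently looks at)
    int    out_len;     // length of its decoder output so far
    short *out;         // its decoded data (scratch chunks)
};

// Assumed: the real table walk; decodes one symbol, advancing *bitpos.
__device__ int decode_symbol(const unsigned char *stream, int *bitpos);

// Decode until we land exactly on another thread's start bit -- the
// synchronization event. Linear scan kept for clarity; a real kernel
// would binary-search a sorted list of start offsets.
__device__ void run_decoder(DecoderState *s, const unsigned char *stream,
                            const int *starts, int n_starts)
{
    for (;;) {
        for (int i = 0; i < n_starts; ++i)
            if (starts[i] == s->cur_bit && s->cur_bit != s->start_bit)
                return;                       // synchronized
        s->out[s->out_len++] = (short)decode_symbol(stream, &s->cur_bit);
    }
}
```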

Problem
What happens to false EOB hits... do they synchronize?
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?

Synchronization
Problem
What happens to false EOB hits... do they synchronize?

Answer
In general, yes they do: after several hundred bits, they synchronize with the real stream and end at the next parallel section.
See the experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003).
More advanced logic can prevent false-hitting decoder threads from doing too much work.

Memory Allocation
Problem
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?

Solution
Store decoder output in chunks of global memory.
Use atomic functions to acquire locks on chunks (sketch below).
Requires compute capability 1.1 (G92s).
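A sketch of the chunk allocator; the chunk size and names are illustrative, and the global-memory `atomicAdd` is exactly what requires compute capability 1.1.

```cuda
#define CHUNK_WORDS 256   // illustrative chunk size

// Lock-free allocation from a pre-allocated global pool: each decoder
// thread grabs the next free chunk by bumping a shared counter.
__device__ short *alloc_chunk(short *pool, unsigned *next_chunk,
                              unsigned pool_chunks)
{
    unsigned idx = atomicAdd(next_chunk, 1u);
    if (idx >= pool_chunks) return (short *)0;   // pool exhausted
    return pool + (size_t)idx * CHUNK_WORDS;
}
```

When a decoder thread fills its current chunk, it calls `alloc_chunk` again and links the chunks together.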

Putting it All Together


After all decoder threads have finished
We figure out which threads did meaningful work.
We chain the decoded data together to create the output, making use of the per-thread decoder information (see the sketch below).
We clear out the scratch space (memory chunks) and throw away all of the extra work.
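A host-side sketch of the chain walk; the struct mirrors the per-thread state above and everything here is hypothetical.

```cuda
#include <string.h>

struct ThreadResult {                 // per-thread results copied back to host
    int start_bit, end_bit, out_len;
    const short *out;
};

// Walk the chain from bit 0: find the thread whose decode started exactly
// where the previous one stopped, append its output, and jump ahead.
// Threads never visited did wasted speculative work and are discarded.
static int stitch(const ThreadResult *r, int n, short *dst)
{
    int bit = 0, written = 0;
    for (;;) {
        int hit = -1;
        for (int i = 0; i < n; ++i)
            if (r[i].start_bit == bit && r[i].end_bit > bit) { hit = i; break; }
        if (hit < 0) break;           // end of stream (or an unsynchronized gap)
        memcpy(dst + written, r[hit].out, r[hit].out_len * sizeof(short));
        written += r[hit].out_len;
        bit = r[hit].end_bit;
    }
    return written;                   // total decoded coefficients
}
```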

Implementation
Implementation of the parallel decode is difficult:
  Naïve schemes may get only marginal speedup.
  A full implementation of the presented design could yield a meaningful speedup.
  Could not implement it with the equipment available (G80).
  Implemented the EOB kernel: ~60 lines of code.
  The several decoder kernels might take ~500 lines of code.
  This does not fully implement the baseline JPEG specification: markers can interrupt the block data, but they can also be used to help synchronization.

Conclusion
IDCT kernel speedup (5-59x) depends on context.
  Because of the serial nature of JPEG, applications often do not make use of wholesale transforms.

Parallel Huffman Decoding is Complex


It is now the main bottleneck of JPEG decompression.
There is a lot of potential speedup to be had, but it requires careful and precise research and development.

30-frames-per-second high-res JPEG animation


Possible and probable, with additional work
