
CUDA JPEG Essentials

Darek Kawamoto

Introduction
  Project Origin

Inverse Discrete Cosine Transform
  Kernel Summary
  Performance

Parallel Huffman Decode for JPEG
  Design Approach
  Design Problems, Solutions
  Implementation Remarks

Conclusion

Project Origin
Computer Animation for Scientific Visualization
  Stuart Levy, UIUC / NCSA
  Goal: decode big (1920x1080) JPEG images fast (~30 fps)
  GPU cheaper than specialized hardware

CUDA and Two JPEG Bottlenecks:
  Inverse Discrete Cosine Transform (IDCT): straightforward, similar to the class machine problems
  Huffman Decode Stage: tricky parallelization of a serial process

Inverse Discrete Cosine Transform


2-D IDCT:

$$p_{xy} = \frac{1}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} C_i\, C_j\, G_{ij} \cos\frac{(2x+1)i\pi}{16} \cos\frac{(2y+1)j\pi}{16}$$

1-D IDCT:

$$p_x = \frac{1}{2} \sum_{i=0}^{7} C_i\, G_i \cos\frac{(2x+1)i\pi}{16}$$

where $C_f = \tfrac{1}{\sqrt{2}}$ when $f = 0$, and $C_f = 1$ otherwise.
The 2-D IDCT is equivalent to the 1-D IDCT applied in each direction, so the kernel uses 1-D transforms (a reference transcription follows).
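For reference, a direct transcription of the 1-D formula into code. This is a minimal sketch; the function name and layout are illustrative, not the project's kernel.

```cuda
#include <math.h>

// Direct transcription of the 1-D IDCT above: 8 outputs, each a sum over
// 8 cosine terms (the 64 multiplies and 64 adds counted later).
__host__ __device__ void idct_1d(const float G[8], float p[8])
{
    const float PI = 3.14159265f;
    for (int x = 0; x < 8; ++x) {
        float sum = 0.0f;
        for (int i = 0; i < 8; ++i) {
            float Ci = (i == 0) ? 0.70710678f : 1.0f;  // C_0 = 1/sqrt(2)
            sum += Ci * G[i] * cosf((2 * x + 1) * i * PI / 16.0f);
        }
        p[x] = 0.5f * sum;
    }
}
```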

IDCT Kernel
Thread Parallelism
Each thread corresponds to one element of the 8x8 matrix.
Threads compute the IDCT across columns, then across rows.

Memory Access Patterns


Shared memory: accesses are broadcasts, or free of bank conflicts.
Global memory: accesses are buffered and coalesced.

Other Optimizations
Careful use of the 16 KB shared memory: 6 blocks per SM
Unrolled 5x: each iteration computes five 2-D IDCTs
(A simplified sketch of the thread layout follows.)
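A hypothetical sketch of the layout described above, not the original kernel: one thread block per 8x8 coefficient block, 64 threads, each owning one matrix element; the 5x unrolling is omitted.

```cuda
#include <math.h>

// Column pass then row pass through shared memory; each thread computes
// one output element per pass.
__global__ void idct_8x8(const float *in, float *out)
{
    __shared__ float tile[8][8];
    __shared__ float tmp[8][8];

    const float PI = 3.14159265f;
    int r = threadIdx.y, c = threadIdx.x;
    const float *blk = in + 64 * blockIdx.x;

    tile[r][c] = blk[8 * r + c];   // coalesced global load
    __syncthreads();

    // Column pass: 1-D IDCT down column c; thread (r,c) produces row r.
    float s = 0.0f;
    for (int i = 0; i < 8; ++i) {
        float Ci = (i == 0) ? 0.70710678f : 1.0f;
        s += Ci * tile[i][c] * cosf((2 * r + 1) * i * PI / 16.0f);
    }
    tmp[r][c] = 0.5f * s;
    __syncthreads();

    // Row pass: 1-D IDCT across row r; whole rows read the same word,
    // which the hardware can broadcast.
    s = 0.0f;
    for (int j = 0; j < 8; ++j) {
        float Cj = (j == 0) ? 0.70710678f : 1.0f;
        s += Cj * tmp[r][j] * cosf((2 * c + 1) * j * PI / 16.0f);
    }
    out[64 * blockIdx.x + 8 * r + c] = 0.5f * s;   // coalesced store
}
```

Launched as, e.g., `idct_8x8<<<num8x8Blocks, dim3(8, 8)>>>(d_coeffs, d_pixels);`.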

IDCT Performance -- How...?


How to benchmark?
libJPEG: executes the processes serially
GPU: executes the IDCT process wholesale

How precise?
short implementations do almost as well as float
double precision has no advantage

How much work?


The GPU shines with > 64,000 blocks.
JPEG-specific: the CPU can short-circuit vectors of zeros.
Let the CPU short-circuit ~50% of columns in the first IDCT pass (see the sketch below).
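A sketch of the zero short-circuit on the CPU side (names are illustrative): if a column carries no AC energy, only the $i = 0$ term of the 1-D formula survives, and since $\cos(0) = 1$, the whole transform collapses to a constant fill.

```cuda
// If all AC terms of this column are zero, every output sample equals
// the scaled DC term: p_x = (1/2)(1/sqrt(2)) * G_0 for all x.
void idct_1d_column(const float G[8], float p[8])
{
    bool all_ac_zero = true;
    for (int i = 1; i < 8; ++i)
        if (G[i] != 0.0f) { all_ac_zero = false; break; }

    if (all_ac_zero) {
        float dc = G[0] * 0.35355339f;   // 1 / (2 * sqrt(2))
        for (int x = 0; x < 8; ++x) p[x] = dc;
        return;
    }
    /* ... otherwise, the full 1-D IDCT as before ... */
}
```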

IDCT Performance -- Cost


IDCT Implementations:

(float) Naïve 1-D
  64 multiplies and 64 adds per 1-D transform

(short) Chen-Wang
  11 multiplies, 29 adds per 1-D transform

(float) Arai, Agui, and Nakajima (AA&N)
  5 multiplies, 29 adds per 1-D transform
  Other multiplies folded into the de-quantization tables (see the sketch below)
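The folding works because AA&N's remaining multiplies are uniform output scalings, which can be pre-applied to the de-quantization table once per image. A sketch in the style of libjpeg's float IDCT, using the standard AA&N scale factors (names are illustrative):

```cuda
// Pre-scale the de-quantization table so the AA&N output multiplies
// come for free. aan_scale[k] = cos(k*pi/16) * sqrt(2), aan_scale[0] = 1.
void build_scaled_dequant(const unsigned short quant[64], float scaled[64])
{
    static const float aan_scale[8] = {
        1.0f, 1.387039845f, 1.306562965f, 1.175875602f,
        1.0f, 0.785694958f, 0.541196100f, 0.275899379f
    };
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            scaled[8 * r + c] = quant[8 * r + c]
                              * aan_scale[r] * aan_scale[c] / 8.0f;
}
```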

IDCT Performance -- Small


Approx. execution times for 67,200 blocks:

  (float) Naïve 1-D GPU             4.69 ms
  (float) Naïve 1-D CPU Serial       333 ms  (71x)
  (float) Naïve 1-D CPU Wholesale    100 ms  (21x)
  (short) Chen-Wang Serial            30 ms  (6.4x)
  (float) AA&N Wholesale              25 ms  (5.3x)
  (float) AA&N Serial                268 ms  (57x)

GPU: ~29 GFLOPS
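That figure is consistent with the naïve operation count, assuming 128 FLOPs (64 multiplies + 64 adds) per 1-D transform and 16 1-D transforms per 8x8 block: 67,200 × 16 × 128 ≈ 1.38 × 10^8 FLOPs, and 1.38 × 10^8 FLOPs / 4.69 ms ≈ 29 GFLOPS.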

IDCT Performance -- Big


Approx. execution times for 245,760 blocks:

  (float) Naïve 1-D GPU            16.94 ms
  (float) Naïve 1-D CPU Serial      1250 ms  (73x)
  (float) Naïve 1-D CPU Wholesale    375 ms  (22x)
  (short) Chen-Wang Serial           113 ms  (6.7x)
  (float) AA&N Wholesale              91 ms  (5.4x)
  (float) AA&N Serial               1000 ms  (59x)

GPU: ~30 GFLOPS

IDCT Performance Conclusion


Amount
  Wholesale transforms work much better than retail.
  67,200 IDCT blocks perform almost as well as 245,760.

Speed
  30 fps means each frame must be ready in 33 ms. How much time is left for the other JPEG functions?
  With 67,200 blocks, we have 28.6 ms left.
  With 245,760 blocks, we have 16.4 ms left.

Conclusion
  Previously we could not hope to process a frame in < 33 ms.
  The application now depends on the speedup of the other kernels.

Parallel Huffman Decode for JPEG


Huffman Compression
Prefix-free, variable-length code
Serial in nature: each symbol must be decoded in sequential order

Parallel Decoding Challenge


It is impossible to determine where symbols start and end without decoding all previous symbols (illustrated by the sketch below).
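A toy decoder makes the serial dependence concrete. The three-symbol code here is hypothetical, not JPEG's tables; the point is structural: the bit where symbol n+1 starts is only known after symbol n has been matched.

```cuda
#include <stdio.h>
#include <string.h>

// Toy prefix-free code over a '0'/'1' string: "0"->A, "10"->B, "11"->C.
// Note the loop-carried dependence: `pos` for the next symbol is known
// only after the current symbol's length has been determined.
static void serial_decode(const char *bits)
{
    size_t pos = 0, n = strlen(bits);
    while (pos < n) {
        if (bits[pos] == '0')          { putchar('A'); pos += 1; }
        else if (bits[pos + 1] == '0') { putchar('B'); pos += 2; }
        else                           { putchar('C'); pos += 2; }
    }
    putchar('\n');
}

int main(void)
{
    serial_decode("010110");   // prints ABCA
    return 0;
}
```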

Design Approach
Start decoding in the middle of the stream at several places, combine results when synchronization occurs, and throw out all extra work

Spawn parallel work threads.


Problems
Does the parallel speedup from successful synchronization offset the penalty of the extra work?
  Yes, if we choose our work wisely, by exploiting JPEG structure and probability.
Each decoder thread doesn't know how much data it will decode.
  Allocate memory on the device using atomic functions.

Choosing Work Wisely


Exploit Block Coding
  Each block of coefficients encodes a DC coefficient and assorted AC coefficients.
  Due to quantization and the coding scheme, a block is likely to end with an End of Block (EOB) symbol.
  If the EOB symbol is 4 or more bits long and cannot prefix itself, the probability of a random match is at most 1/16 (2^-4).
  In regions where we want to start a parallel decode thread, only start after possible EOB symbols.
  Any symbol could be used to attempt synchronization; EOB is arbitrary but practical, because the DC coefficient that follows a block boundary is coded differently.

New Approach
Suppose EOB = 0101

EOB Overhead
Overhead associated with finding EOB symbols
Implemented a kernel to do so, < 1 ms

Effectiveness depends on block-length statistics:
  If we can guarantee a true EOB hit in each section of the stream we look at, then we guarantee synchronization within that section.
  If we cannot guarantee synchronization, some threads may have to decode multiple sections.
  Research on these statistics is necessary to make design decisions that maximize the probability of true EOB hits while minimizing the number of false hits. (A sketch of the scan kernel follows.)
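A minimal sketch of such an EOB-scan kernel: one thread per bit offset, each testing whether the 4 bits starting there match the example pattern 0101 from the earlier slide. All names, and the pattern itself, are illustrative.

```cuda
// Record every bit offset at which a *possible* EOB ends; false hits
// are expected and handled later. Candidates come back unordered and
// are compacted with an atomic counter.
__global__ void find_eob_candidates(const unsigned char *stream,
                                    int n_bits,
                                    int *candidates, int *n_candidates)
{
    int bit = blockIdx.x * blockDim.x + threadIdx.x;
    if (bit + 4 > n_bits) return;

    // Gather the 4 bits at offset `bit` (MSB-first, as in JPEG).
    unsigned v = 0;
    for (int k = 0; k < 4; ++k) {
        int b = bit + k;
        v = (v << 1) | ((stream[b >> 3] >> (7 - (b & 7))) & 1u);
    }
    if (v == 0x5)   // binary 0101: a candidate EOB
        candidates[atomicAdd(n_candidates, 1)] = bit + 4;  // decode starts after it
}
```

A sort of `candidates` afterwards restores stream order before decoder threads are assigned.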

Decode Synchronization
Each decoder thread maintains information
  Where it started (the bit it first looked at)
  The length and data of its decoder output
  Where it is (the bit it currently looks at)
Synchronization occurs when a thread's current location matches another thread's start location (see the sketch below).
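One way to hold that bookkeeping, as a hypothetical sketch; `decode_symbol` stands in for the real Huffman table walk and is only declared here, not defined.

```cuda
// Per-thread state for speculative decode, mirroring the list above.
struct DecoderState {
    int    start_bit;   // where it started (the bit it first looked at)
    int    cur_bit;     // where it is (the bit it currently looks at)
    int    out_len;     // length of its decoder output so far
    short *out;         // its decoded data (scratch chunks)
};

// Assumed: the real table walk; decodes one symbol, advancing *bitpos.
__device__ int decode_symbol(const unsigned char *stream, int *bitpos);

// Decode until we land exactly on another thread's start bit -- the
// synchronization event. Linear scan kept for clarity; a real kernel
// would binary-search a sorted list of start offsets.
__device__ void run_decoder(DecoderState *s, const unsigned char *stream,
                            const int *starts, int n_starts)
{
    for (;;) {
        for (int i = 0; i < n_starts; ++i)
            if (starts[i] == s->cur_bit && s->cur_bit != s->start_bit)
                return;                       // synchronized
        s->out[s->out_len++] = (short)decode_symbol(stream, &s->cur_bit);
    }
}
```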

Problem
What happens to false EOB hits... do they synchronize?
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?

Synchronization
Problem
What happens to false EOB hits... do they synchronize?

Answer
In general, yes they do: after several hundred bits, they synchronize with the real stream and end at the next parallel section.
See the experiments in Klein and Wiseman, "Parallel Huffman Decoding with Applications to JPEG Files" (2003).
More advanced logic can prevent false-hitting decoder threads from doing too much work.

Memory Allocation
Problem
How does each decoder thread know how much data it will decode?
How do we allocate memory for each thread?

Solution
Store decoder output in chunks of global memory.
Use atomic functions to acquire locks on chunks (sketch below).
Requires compute capability 1.1 (G92s).
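A sketch of the chunk allocator; the chunk size and names are illustrative, and the global-memory `atomicAdd` is exactly what requires compute capability 1.1.

```cuda
#define CHUNK_WORDS 256   // illustrative chunk size

// Lock-free allocation from a pre-allocated global pool: each decoder
// thread grabs the next free chunk by bumping a shared counter.
__device__ short *alloc_chunk(short *pool, unsigned *next_chunk,
                              unsigned pool_chunks)
{
    unsigned idx = atomicAdd(next_chunk, 1u);
    if (idx >= pool_chunks) return (short *)0;   // pool exhausted
    return pool + (size_t)idx * CHUNK_WORDS;
}
```

When a decoder thread fills its current chunk, it calls `alloc_chunk` again and links the chunks together.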

Putting it All Together


After all decoder threads have finished
We figure out which threads did meaningful work.
We chain the decoded data together to create the output, making use of the per-thread decoder information (see the sketch below).
We clear out the scratch space (memory chunks) and throw away all of the extra work.
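A host-side sketch of the chain walk; the struct mirrors the per-thread state above and everything here is hypothetical.

```cuda
#include <string.h>

struct ThreadResult {                 // per-thread results copied back to host
    int start_bit, end_bit, out_len;
    const short *out;
};

// Walk the chain from bit 0: find the thread whose decode started exactly
// where the previous one stopped, append its output, and jump ahead.
// Threads never visited did wasted speculative work and are discarded.
static int stitch(const ThreadResult *r, int n, short *dst)
{
    int bit = 0, written = 0;
    for (;;) {
        int hit = -1;
        for (int i = 0; i < n; ++i)
            if (r[i].start_bit == bit && r[i].end_bit > bit) { hit = i; break; }
        if (hit < 0) break;           // end of stream (or an unsynchronized gap)
        memcpy(dst + written, r[hit].out, r[hit].out_len * sizeof(short));
        written += r[hit].out_len;
        bit = r[hit].end_bit;
    }
    return written;                   // total decoded coefficients
}
```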

Implementation
Implementation of the parallel decode is difficult:
  Naïve schemes may get only marginal speedup.
  A full implementation of the presented design could yield a meaningful speedup.
  Could not implement it with the equipment available (G80).
  Implemented the EOB kernel: ~60 lines of code.
  The several decoder kernels might take ~500 lines of code.
  This does not fully implement the baseline JPEG specification: markers can interrupt the block data, but they can also be used to help synchronization.

Conclusion
IDCT kernel speedup (5-59x) depends on context.
  Because of the serial nature of JPEG, applications often do not make use of wholesale transforms.

Parallel Huffman Decoding is Complex


It is now the main bottleneck of JPEG decompression.
There is a lot of potential speedup to be had, but it requires careful and precise research and development.

30-frames-per-second high-res JPEG animation


Possible and probable, with additional work
