
# DATA COMPRESSION AND HUFFMAN ALGORITHM

## Technical Seminar Paper Submitted by

Presented by
Vineet Agarwala IT200118155

Anisur Rahman

## NATIONAL INSTITUTE OF SCIENCE & TECHNOLOGY

DATA COMPRESSION
Virtually all forms of data (text, numerical, image, video) contain redundant elements.
Data can be compressed by eliminating the redundant elements.
A code is substituted for each eliminated redundant element, where the code is shorter than the element it replaces.
When compressed data is retrieved from storage or received over
a communications link, it is expanded back to its original form, based
on the code.
Compression is used:
to save storage space
to reduce communications transmission requirements
Data compression is the art or science of compactly representing information.
Digital realm: using fewer bits to represent information
Data compression = information – redundancy
REDUNDANCY
Most types of computer files are fairly redundant -- they have the same
information listed over and over again. File-compression programs
simply get rid of the redundancy

“Ask not what your country can do for you -- ask what
you can do for your country.”

Ignoring the difference between capital and lower-case letters, roughly half of the phrase is redundant. Nine words -- ask, not, what, your, country, can, do, for, you -- give us almost everything we need for the entire quote.
Compression Techniques
Lossless
Data can be completely recovered after decompression
Recovered data is identical to original
Exploits redundancy in data
Lossy
Data cannot be completely recovered after decompression
Some information is lost permanently
Usually gives more compression than lossless
Discards “insignificant” data components
IMAGE COMPRESSION
Image compression can be lossy or lossless.
Methods for lossless image compression are:
Run-length encoding
Entropy coding
Adaptive dictionary algorithms such as LZW
Methods for lossy compression are:
Reducing the color space to the most common colors in the image. The selected colors are specified in the color palette in the header of the compressed image. Each pixel just references the index of a color in the color palette. This method can be combined with dithering to blur the color borders.
Transform coding. This is the most commonly used method. A Fourier-related transform such as the DCT or the wavelet transform is applied, followed by quantization and entropy coding.
Fractal compression.
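The palette method described above can be sketched in a few lines. This is only an illustration: the function names `nearest_index` and `to_indexed` are hypothetical, the palette is assumed to be given already (choosing the most common colors is not shown), and squared RGB distance is an assumed way of picking the closest palette entry.

```python
def nearest_index(pixel, palette):
    """Return the index of the palette color closest to pixel (squared RGB distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(pixel, palette[i])),
    )


def to_indexed(pixels, palette):
    """Replace every RGB pixel by the index of its nearest palette color."""
    return [nearest_index(p, palette) for p in pixels]


# Hypothetical three-color palette and four sample pixels.
palette = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
pixels = [(10, 5, 0), (250, 250, 240), (200, 30, 20), (255, 255, 255)]
indexed = to_indexed(pixels, palette)
```

Each pixel is now stored as a small index instead of three color bytes; the colors that were replaced by their nearest palette entry cannot be recovered, which is why the method is lossy.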
JPEG (TRANSFORM COMPRESSION)
JPEG is named after its origin, the Joint Photographic Experts Group.
It involves reducing the number of bits per sample or discarding some of the samples entirely.
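Reducing the number of bits per sample can be illustrated with a simple shift. This is a sketch of the general idea only, not of JPEG itself, which quantizes transform coefficients rather than raw samples; the function names and the 8-bit sample size are assumptions.

```python
def reduce_bits(samples, keep_bits, sample_bits=8):
    """Keep only the keep_bits most significant bits of each sample."""
    shift = sample_bits - keep_bits
    return [s >> shift for s in samples]


def restore_bits(codes, keep_bits, sample_bits=8):
    """Scale reduced samples back to the original range (low bits stay lost)."""
    shift = sample_bits - keep_bits
    return [c << shift for c in codes]


samples = [0, 17, 128, 255]
reduced = reduce_bits(samples, keep_bits=4)      # 4 bits per sample instead of 8
restored = restore_bits(reduced, keep_bits=4)    # close to, but not equal to, samples
```

The restored values differ slightly from the originals, which shows where the loss in lossy compression comes from.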
MULTIMEDIA COMPRESSION
Multimedia compression is a general term referring to the compression of any type of multimedia, most notably graphics, audio, and video.
MPEG (Moving Picture Experts Group): The future of this technology is to encode the compression and decompression algorithms directly into integrated circuits.
The approach used by MPEG can be divided into two types of compression: within-the-frame and between-frame.
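The idea behind between-frame compression is that successive frames usually differ very little, so the differences are cheaper to store than the frames. The sketch below shows plain per-pixel differencing on hypothetical one-row frames; it is an assumption-laden illustration, since MPEG itself uses motion-compensated prediction, which is considerably more elaborate.

```python
def frame_delta(previous, current):
    """Per-pixel difference between two frames of equal size."""
    return [c - p for p, c in zip(previous, current)]


def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame and the differences."""
    return [p + d for p, d in zip(previous, delta)]


frame1 = [10, 10, 10, 200, 200]
frame2 = [10, 10, 12, 200, 201]
delta = frame_delta(frame1, frame2)   # mostly zeros, which compress well
rebuilt = apply_delta(frame1, delta)
```

The difference list is dominated by zeros, which a later stage such as run-length or Huffman coding can represent compactly.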
DATA COMPRESSION ALGORITHMS

| Lossless compression | Lossy compression |
|---|---|
| Run-length encoding | CS&Q |
| Huffman coding | JPEG |
| Delta encoding | MPEG |
| LZW | |
RUN-LENGTH ENCODING
Data files frequently contain the same character repeated many times in a row.
Example of run-length encoding: each run of zeros is replaced by two characters in the compressed file: a zero to indicate that compression is occurring, followed by the number of zeros in the run.
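The zero-run scheme just described can be sketched as follows. The function names are hypothetical, and capping a run at 255 so that the count fits in one byte is an assumption not stated on the slide.

```python
def rle_zeros_encode(data: bytes) -> bytes:
    """Replace each run of zero bytes by the pair (0, run length)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            # Extend the run, but keep the count within a single byte (assumed limit).
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])
            i += run
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


def rle_zeros_decode(data: bytes) -> bytes:
    """Expand each (0, run length) pair back into a run of zero bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            out += b"\x00" * data[i + 1]
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out)


original = bytes([17, 8, 54, 0, 0, 0, 97, 5, 16, 0, 45])
encoded = rle_zeros_encode(original)
```

Note that a single isolated zero grows from one byte to two, so the scheme only pays off when zeros tend to occur in runs.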
HUFFMAN ENCODING
This method is named after D.A. Huffman, who developed the procedure in the 1950s.
In the example file, more than 96% of the content consists of only 31 characters out of 127, so giving the most frequent characters the shortest codes saves space.
HUFFMAN ENCODING EXAMPLE
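Character frequencies
A: 20% (.20)
B: 9% (.09)
C: 15% (.15)
D: 11% (.11)
E: 40% (.40)
F: 5% (.05)
First step: the two least frequent characters, B (.09) and F (.05), are joined under a new node BF with weight .14 (branch 0 leads to B, branch 1 to F).
Nodes after the first step: E (.40), BF (.14), D (.11), A (.20), C (.15)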
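HUFFMAN ENCODING EXAMPLE (CONTD.)

Codes
A: 010
B: 0000
C: 011
D: 001
E: 1
F: 0001
Code tree (each node splits into a 0 branch and a 1 branch)
ABCDEF (1.0): 0 → ABCDF (.6), 1 → E (.4)
ABCDF (.6): 0 → BFD (.25), 1 → AC (.35)
BFD (.25): 0 → BF (.14), 1 → D (.11)
AC (.35): 0 → A (.20), 1 → C (.15)
BF (.14): 0 → B (.09), 1 → F (.05)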
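The construction in this example can be reproduced with a short sketch that repeatedly merges the two lightest nodes. The function name `huffman_codes` is hypothetical, the frequencies are entered as whole percentages, and symbols are assumed to be plain strings (tuples are reserved for internal nodes). Which branch receives 0 or 1 is arbitrary, so the bit patterns may differ from the codes listed in the example, but the code lengths are the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict mapping symbol -> weight."""
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), sym) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two lightest nodes into one internal node.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: descend both branches
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: record the accumulated bits
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freqs = {"A": 20, "B": 9, "C": 15, "D": 11, "E": 40, "F": 5}
codes = huffman_codes(freqs)
average_bits = sum(freqs[s] * len(codes[s]) for s in freqs) / 100
```

With these frequencies the average works out to 2.34 bits per character, compared with 3 bits for a fixed-length code over six characters.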
Run Length Encoding
CTAAAAAGGGTCGTTTTTTGCCCGGGGGCCTCCCCCCC


CT5A3GTCG6TG3C5GCCT7C
Run length encoded: 21 symbols (down from 38 in the original)
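The encoding shown here replaces only runs of three or more identical symbols by a count followed by the symbol, and leaves shorter runs untouched. A minimal sketch of that rule follows; the function names and the `min_run` parameter are hypothetical, and the input is assumed to contain no digits, since digits are used for the counts.

```python
from itertools import groupby


def rle_encode(text, min_run=3):
    """Write runs of at least min_run symbols as <count><symbol>."""
    out = []
    for symbol, group in groupby(text):
        n = len(list(group))
        out.append(f"{n}{symbol}" if n >= min_run else symbol * n)
    return "".join(out)


def rle_decode(encoded):
    """Invert rle_encode; assumes the original text contained no digits."""
    out, digits = [], ""
    for ch in encoded:
        if ch.isdigit():
            digits += ch
        else:
            out.append(ch * int(digits or 1))
            digits = ""
    return "".join(out)


sequence = "CTAAAAAGGGTCGTTTTTTGCCCGGGGGCCTCCCCCCC"
encoded = rle_encode(sequence)
```

Leaving runs of one or two symbols as they are avoids making the output longer than the input for data with few repeats.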
Run Length Encoding (Contd.)
WWWBWWWWWBWWWBWWWWBWWWWWBWWWBWW
WWWBWWBWWWWWWBBBWWWWWWWBWBWWWWW
WWBWWBBWWWWWBWWWWBWWWWBWWWWB

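Start of the input: WWWBWWWWWBWWWBWWWWB….

Count and symbol: 3WB5WB3WB4WB….

Run lengths only: #W3151314….. (W and B alternate, so only the lengths are stored). This optimization requires an escape character.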
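For data with only two symbols, the runs alternate, so the starting symbol plus a list of run lengths is enough. The sketch below illustrates that idea under stated assumptions: the function names are hypothetical, the two symbols are assumed to be W and B, and the escape character that marks the optimized format is not modeled.

```python
from itertools import groupby


def run_lengths(pixels):
    """Return (first symbol, list of run lengths) for a two-symbol sequence."""
    runs = [(symbol, len(list(group))) for symbol, group in groupby(pixels)]
    first = runs[0][0] if runs else ""
    return first, [n for _, n in runs]


def expand(first, lengths, symbols="WB"):
    """Rebuild the sequence by alternating between the two symbols."""
    other = symbols[1] if first == symbols[0] else symbols[0]
    out, current = [], first
    for n in lengths:
        out.append(current * n)
        current = other if current == first else first
    return "".join(out)


row = "WWWBWWWWWBWWWBWWWWB"
first, lengths = run_lengths(row)
```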

Run Length Encoding (Contd.)
Is run length encoding practical for images?

No: The chances of three or more identical consecutive pixels are low for most real images, especially images with large color depth.

Yes: Some images do have long runs of identical consecutive pixels, especially images with low color depth. RLE is used by fax machines and by the BMP, TIFF and PCX file formats.
LZW Compression
LZW compression is named after its
developers, A. Lempel and J. Ziv, with later
modifications by Terry A. Welch. It is the
foremost technique for general purpose data
compression due to its simplicity and
versatility
LZW Compression (contd.)
LZW compression flowchart. The variable, CHAR, is a single byte. The variable, STRING, is a variable length sequence of bytes. Data are read from the input file (box 1 & 2) as single bytes, and written to the compressed file (box 4) as 12 bit codes.
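The loop described by the flowchart can be sketched as below, with `string` and `char` standing in for STRING and CHAR. Several details are assumptions: the function names are hypothetical, codes are returned as a list of integers rather than packed into 12 bits, and once the table reaches 4096 entries (the most that 12 bits can address) no further entries are added.

```python
def lzw_compress(data, max_codes=4096):
    """Return the list of LZW codes for a bytes object."""
    table = {bytes([i]): i for i in range(256)}  # codes 0-255: single bytes
    string = b""
    codes = []
    for byte in data:
        char = bytes([byte])
        if string + char in table:
            string += char                       # keep extending the match
        else:
            codes.append(table[string])          # emit code for longest match
            if len(table) < max_codes:
                table[string + char] = len(table)
            string = char
    if string:
        codes.append(table[string])
    return codes


def lzw_decompress(codes, max_codes=4096):
    """Rebuild the original bytes from a list of LZW codes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):                 # code defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        if len(table) < max_codes:
            table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


message = b"TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(message)
```

Here 24 input bytes become 16 codes; the decompressor rebuilds the same table on the fly, so the table itself never has to be stored in the compressed file.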
CONCLUSION
Is it possible to create a data compression
algorithm that will always compress data?
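For lossless compression the answer is no, by a counting argument: there are more bit strings of length n than there are shorter bit strings, so no reversible scheme can shorten every input. The small check below illustrates the count; the function names are hypothetical.

```python
def strings_of_length(n):
    """Number of distinct bit strings that are exactly n bits long."""
    return 2 ** n


def strings_shorter_than(n):
    """Number of distinct bit strings with fewer than n bits (lengths 0 to n-1)."""
    return sum(2 ** k for k in range(n))


# For every n there is one possible output too few, so at least two
# inputs would have to share a compressed form and could not both be recovered.
shortfall = [strings_of_length(n) - strings_shorter_than(n) for n in range(1, 17)]
```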

Is there an optimal data compression algorithm?

Lossless: No, compression rates depend on the data.
Lossy: No, the quality of compression is subjective.