
International Journal of Information and Computation Technology.

ISSN 0974-2239 Volume 3, Number 3 (2013), pp. 139-146


© International Research Publications House
http://www.irphouse.com/ijict.htm

Analysis and Comparison of Algorithms for Lossless Data Compression

Anmol Jyot Maan

Hyderabad, INDIA.

Abstract

Data compression is a technique used to reduce the size of a file.
The goal of data compression is to eliminate redundancy in a file's
code in order to reduce its size. It is useful in reducing data storage
space and the time needed to transmit the data. Data
compression can be either lossless or lossy. Lossless data compression
recreates the exact original data from the compressed data, while lossy
data compression cannot regenerate the original data perfectly from the
compressed data. Lossy methods are mainly used for compressing
sound, images, or video. Many data compression algorithms are
available to compress files of different formats. This paper discusses
and compares a selected set of lossless data compression algorithms.

Keywords: Data compression, Lossless Compression, Lossy
Compression, Huffman Coding, Arithmetic Coding, Run Length
Encoding.

1. Introduction
Data compression is the art of representing information in compact form. It reduces the
file size which in turn reduces the required storage space and makes the transmission
of data quicker. Compression techniques try to find redundant data and remove these
redundancies. Data compression can be divided into two broad classes: lossless data
compression and lossy data compression. In lossless compression, the exact original
data can be recovered from the compressed data. It is used when any difference between
the original and the decompressed data cannot be tolerated. Medical images, text needed
for legal purposes, and computer executable files are compressed using lossless
compression techniques. Lossy compression, as the name suggests, involves loss of
information. It is used in applications where an imperfect reconstruction is not an
issue. Video and audio are typically compressed using lossy compression.
The rapid growth of data that must be stored and transferred has increased the
demand for better transmission and storage techniques. Various lossless data
compression algorithms have been proposed and used. Huffman Coding, Arithmetic
Coding, the Shannon-Fano Algorithm, and Run Length Encoding are some of the
techniques in use. This paper examines Huffman Coding, Arithmetic Coding, and Run
Length Encoding.

2. Run Length Encoding


Run Length Encoding (RLE) is the simplest of the data compression algorithms. It
replaces runs of two or more of the same character with a number representing the
length of the run, followed by the original character. Single characters are coded as
runs of 1. The main task of this algorithm is to identify the runs in the source file and
to record the symbol and length of each run. The algorithm uses these runs to compress
the source file, while leaving all non-run characters unchanged.

Example of RLE:
Input: AAABBCCCCD
Output: 3A2B4C1D
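The encoding step described above can be sketched in Python (a minimal illustration, not the paper's implementation):

```python
def rle_encode(data: str) -> str:
    """Replace each run of identical characters with <length><character>."""
    if not data:
        return ""
    out = []
    run_char, run_len = data[0], 1
    for ch in data[1:]:
        if ch == run_char:
            run_len += 1
        else:
            out.append(f"{run_len}{run_char}")  # close the finished run
            run_char, run_len = ch, 1
    out.append(f"{run_len}{run_char}")  # close the final run
    return "".join(out)

print(rle_encode("AAABBCCCCD"))  # 3A2B4C1D
```

Note that a single character still costs two output characters ("1D"), which is why RLE can expand rather than shrink run-poor files, as discussed in Section 6.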

3. Huffman Coding
The Huffman coding algorithm was first developed by David Huffman in 1951. Huffman
coding is an entropy encoding algorithm used for lossless data compression. In this
algorithm, fixed-length codes are replaced by variable-length codes. When using
variable-length code words, it is desirable to create a prefix code, avoiding the need for
a separator to determine codeword boundaries. Huffman Coding uses such a prefix code.
The Huffman procedure works as follows:
1. Symbols with a high frequency are expressed using shorter encodings than
symbols which occur less frequently.
2. The two symbols that occur least frequently are assigned codewords of the same
length.
The Huffman algorithm uses a greedy approach, i.e. at each step the algorithm
chooses the best available option. A binary tree is built from the bottom up. To see
how Huffman Coding works, let us take an example. Assume that the characters in a
file to be compressed have the following frequencies:
A: 25 B: 10 C: 99 D: 87 E: 9 F: 66
The process of building this tree is:
1. Create a list of leaf nodes for each symbol and arrange the nodes in order
of decreasing frequency.

C:99 D:87 F:66 A:25 B:10 E:9


2. Select the two nodes with the lowest frequencies. Create a parent node for
these two nodes and assign it a frequency equal to the sum of the frequencies of
the two child nodes.

Now add the parent node to the list and remove the two child nodes from the list.
Repeat this step until only one node remains.

3. Now label each edge. The left child of each parent is labeled with the digit 0
and the right child with 1. The codeword for each source letter is the sequence of
labels along the path from the root to the leaf node representing that letter.

The resulting Huffman codes are shown in Table 1.

Table 1: Huffman Codes.

Symbol  Code
C       00
D       01
F       10
A       110
B       1110
E       1111
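The bottom-up, greedy construction described above can be sketched in Python (a minimal illustration, not the paper's implementation; the exact codewords depend on how ties are broken, but the code lengths agree with Table 1):

```python
import heapq

def huffman_codes(freqs: dict) -> dict:
    """Build a Huffman code table from a {symbol: frequency} mapping."""
    # Heap entries: (frequency, tie-breaker, {symbol: partial codeword}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Greedy step: merge the two lowest-frequency subtrees.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}        # left edge: 0
        merged.update({s: "1" + c for s, c in codes2.items()})  # right edge: 1
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

codes = huffman_codes({"A": 25, "B": 10, "C": 99, "D": 87, "E": 9, "F": 66})
# Code lengths: C, D, F -> 2 bits; A -> 3 bits; B, E -> 4 bits
```

The two least frequent symbols (B and E) are merged first, so they end up deepest in the tree with equal-length codewords, exactly as property 2 above requires.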

4. Arithmetic Coding
Arithmetic Coding is useful for small alphabets with highly skewed probabilities. In
this method, a code word is not used to represent each symbol of the text. Instead, a
single code is produced for an entire message. Arithmetic Coding assigns an interval to
each symbol, and a decimal number within that interval then identifies the message.
Initially, the interval is [0, 1). A message is represented by a half-open interval
[x, y), where x and y are real numbers between 0 and 1. The interval is divided into
sub-intervals, one per symbol of the alphabet, with each sub-interval's size
proportional to its symbol's probability of appearance. For each symbol of the message,
a new subdivision takes place within the last sub-interval.
Consider an example illustrating encoding in Arithmetic Coding.

Table 2: Encoding in Arithmetic Coding.

Symbol Probability Range


X 0.5 [0.0, 0.5)
Y 0.3 [0.5, 0.8)
Z 0.2 [0.8, 1.0)

Table 3: Encoding the symbol sequence "YXX".

Symbol   Range   Low Value   High Value
(start)    -       0           1
Y         1       0.5         0.8
X         0.3     0.5         0.65
X         0.15    0.5         0.575

In Table 3, the range, high value, and low value are calculated as:
Range = High value - Low value
High value = Low value + Range * high end of the current symbol's sub-interval
Low value = Low value + Range * low end of the current symbol's sub-interval

The string “YXX” is thus represented by any number within the interval [0.5,
0.575).

Figure 1: Graphical display of shrinking ranges.
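The interval-narrowing updates above can be sketched in Python (a minimal illustration using the probability table of Table 2; the function and variable names are my own):

```python
def arithmetic_encode(message: str, ranges: dict) -> tuple:
    """Narrow the interval [low, high) once per symbol of the message."""
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low            # Range = High value - Low value
        sym_low, sym_high = ranges[sym]
        # Both updates use the old value of low, as in the formulas above.
        high = low + span * sym_high
        low = low + span * sym_low
    return low, high

# Probability table from Table 2: {symbol: (low end, high end)}.
ranges = {"X": (0.0, 0.5), "Y": (0.5, 0.8), "Z": (0.8, 1.0)}
low, high = arithmetic_encode("YXX", ranges)
# The final interval is approximately [0.5, 0.575), matching Table 3.
```

A real coder would also emit a number from the final interval and use fixed-precision integer arithmetic to avoid the rounding drift that floating point introduces; that is beyond this sketch.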



5. Measuring Compression Performance

There are various criteria for measuring the performance of a compression algorithm.
However, the main concerns have always been space efficiency and time efficiency.
The following are some measurements used to evaluate the performance of lossless
algorithms.

1. Compression ratio: the ratio between the size of the compressed file and
the size of the source file.

2. Compression factor: the inverse of the compression ratio.

3. Saving percentage: the percentage by which the source file shrinks.
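The three measurements can be written out directly (a small illustrative sketch; the file sizes below are hypothetical):

```python
def compression_ratio(compressed_size: int, source_size: int) -> float:
    """Compressed size divided by source size (lower is better)."""
    return compressed_size / source_size

def compression_factor(compressed_size: int, source_size: int) -> float:
    """Inverse of the compression ratio (higher is better)."""
    return source_size / compressed_size

def saving_percentage(compressed_size: int, source_size: int) -> float:
    """Percentage by which the source file shrinks."""
    return (source_size - compressed_size) / source_size * 100

# A hypothetical 1000-byte file compressed to 400 bytes:
print(compression_ratio(400, 1000))   # 0.4
print(compression_factor(400, 1000))  # 2.5
print(saving_percentage(400, 1000))   # 60.0
```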

6. Comparing the Algorithms

1. Run Length Encoding: In the worst case, RLE generates output that is up to
twice the size of the input data. This happens when the source file contains
few runs; the resulting compression ratio is very high, and the algorithm
provides no significant improvement over the original file.
2. Huffman Coding vs. Arithmetic Coding: The Huffman coding algorithm uses a
static table for the whole coding process, so it is faster. However, it does not
produce as efficient a compression ratio.
On the contrary, arithmetic coding can achieve a better compression ratio, but
its compression speed is slow.
Table 4 presents a simple comparison between these compression methods.

Table 4: Huffman Coding vs. Arithmetic Coding.

Compression method           Arithmetic   Huffman
Compression ratio            Very good    Poor
Compression speed            Slow         Fast
Decompression speed          Slow         Fast
Memory space                 Very low     Low
Compressed pattern matching  No           Yes
Permits random access        No           Yes

Conclusion
Arithmetic coding outperforms Huffman coding and Run Length Encoding: its
compression ratio is better than that of the other two algorithms examined above.
This paper therefore finds Arithmetic Coding to be the most efficient, in terms of
compression, of the selected algorithms.

