
HUFFMAN CODING

Project Report
Submitted By
URVASHI GUPTA, 04220902717
In partial fulfillment of the requirements for
the award of the degree
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING


GOVIND BALLABH PANT GOVT. ENGINEERING COLLEGE,
NEW DELHI-110020
TABLE OF CONTENTS

Certificate
Acknowledgement
About the Organization
Abstract
1. Introduction
2. Compression
2.1 The Basic Idea
2.2 Building the Huffman Tree
2.3 An Example
3. Decompression
3.1 Storing the Huffman Tree
3.2 Creating the Huffman Table
3.3 Storing Sizes
4. Program Code and Output
5. Conclusion and Future Works
ACKNOWLEDGEMENT
ABOUT THE ORGANIZATION

Coding Blocks was founded in 2014 with a mission to create skilled software engineers for our
country and the world. Its motto is to bridge the gap between the quality of skills demanded
by industry and the quality of skills imparted by conventional institutes. At Coding Blocks,
instructors strive to increase student interest by providing hands-on practical training on every
concept taught in the classroom. The courses Coding Blocks offers comprise the most relevant
subjects and technologies used in industry today, with comprehensive topic coverage that is
constantly updated to keep pace with rapidly changing times. In addition to theoretical
teaching, Coding Blocks gives all its students hands-on experience during classroom teaching,
equipping them with the skills that industry requires.
ABSTRACT

Data reduction is one of the data pre-processing techniques that can be applied to obtain a
reduced representation of a data set that is much smaller in volume yet closely maintains the
integrity of the original data. That is, mining on the reduced data set should be more efficient
yet produce the same analytical results. Data compression is one such technique: encoding
mechanisms are used to reduce the data set size. In data compression, data encodings or
transformations are applied so as to obtain a reduced or compressed representation of the
original data. Huffman coding is a successful compression method originally used for text
compression. Huffman's idea is, instead of using a fixed-length code such as 8-bit extended
ASCII or EBCDIC for each symbol, to represent a frequently occurring character in a source
with a shorter codeword and a less frequently occurring one with a longer codeword.
Chapter 1
INTRODUCTION

The project aims at developing programs that transform a string of characters in some
representation into a new string (of bits) which contains the same information.
Compression refers to reducing the quantity of data used to represent a file, image or video
without excessively reducing the quality of the original data. It also reduces the number of
bits required to store and/or transmit digital media. To compress something means to take a
piece of data and decrease its size. There are different techniques for doing this, each with its
own advantages and disadvantages. One trick is to reduce redundant information, i.e., to
store something once instead of six times. Another is to find out which parts of the data are
not really important and simply leave them out.
Huffman Coding, also called Huffman Encoding, is a famous greedy algorithm used for the
lossless compression of data. It uses variable-length encoding, where variable-length codes
are assigned to the characters depending on how frequently they occur in the given text. The
character that occurs most frequently gets the smallest code and the character that occurs
least frequently gets the largest code. To prevent ambiguities while decoding, Huffman
coding enforces a rule known as the prefix rule, which ensures that the code assigned to any
character is not a prefix of the code assigned to any other character; for example, if 'g' is
assigned the code 00, no other character's code may begin with 00.
Chapter 2
COMPRESSION

2.1 The Basic Idea


The basic idea behind Huffman Coding is to use fewer bits for more frequently occurring
characters. We will see how this is done using a tree that stores characters at the leaves, and
whose root-to-leaf paths provide the bit sequence used to encode the characters.
We'll assume that each character has an associated weight equal to the number of times the
character occurs in a file. In the "go go gophers" example, the characters 'g' and 'o' have
weight 3, the space has weight 2, and the other characters have weight 1. When compressing
a file we'll need to calculate these weights; we'll ignore this step for now and assume that all
character weights have been calculated. A sketch of the counting step appears below.
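
The following C++ sketch shows one way the weights might be computed; the name
countWeights and the use of std::map are illustrative choices, not taken from this report's
actual program.

#include <iostream>
#include <map>
#include <string>

// Count how often each character occurs; these counts are the
// weights attached to the leaves of the Huffman tree.
std::map<char, int> countWeights(const std::string& text) {
    std::map<char, int> weights;
    for (char c : text)
        ++weights[c];
    return weights;
}

int main() {
    for (auto& [c, w] : countWeights("go go gophers"))
        std::cout << "'" << c << "': " << w << "\n";
    // Prints weight 3 for 'g' and 'o', 2 for ' ', and 1 for the rest.
}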

2.2 Building the Huffman Tree

Huffman's algorithm assumes that we're building a single tree from a group (or forest) of
trees. Initially, all the trees have a single node with a character and the character's weight.
Trees are combined by picking two trees, and making a new tree from the two trees. This
decreases the number of trees by one at each step since two trees are combined into one tree.
The algorithm is as follows:

1. Begin with a forest of trees. All trees are one node, with the weight of the tree equal to
the weight of the character in the node. Characters that occur most frequently have the
highest weights. Characters that occur least frequently have the smallest weights.
2. Repeat this step until there is only one tree:

Choose the two trees with the smallest weights; call these trees T1 and T2. Create a new
tree whose root has a weight equal to the sum of the weights of T1 and T2, and whose
left subtree is T1 and right subtree is T2.

3. The single tree left after the previous step is an optimal encoding tree.
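
The merging loop above maps naturally onto a priority queue ordered by weight. The C++
below is a minimal sketch, not this report's actual program; Node and buildTree are
illustrative names, and the weights map is assumed to come from a counting step like the one
in section 2.1.

#include <map>
#include <queue>
#include <vector>

// One node of the Huffman tree; ch is meaningful only for leaves.
struct Node {
    int weight;
    char ch;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Orders the priority queue so the smallest weight is popped first.
struct ByWeight {
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;
    }
};

// Steps 1-3 above: start with a forest of one-node trees, then
// repeatedly merge the two trees of smallest weight.
Node* buildTree(const std::map<char, int>& weights) {
    std::priority_queue<Node*, std::vector<Node*>, ByWeight> forest;
    for (const auto& [c, w] : weights)
        forest.push(new Node{w, c});
    while (forest.size() > 1) {
        Node* t1 = forest.top(); forest.pop();
        Node* t2 = forest.top(); forest.pop();
        forest.push(new Node{t1->weight + t2->weight, '\0', t1, t2});
    }
    return forest.top();   // the single optimal encoding tree
}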

2.3 An Example

We'll use the string "go go gophers" as an example. Initially we have a forest of single-node
trees, one per character, each carrying a weight/count equal to the number of times its
character occurs.

We pick two minimal nodes. There are five nodes with the minimal weight of one, and it
doesn't matter which two we pick. In a program, the deterministic aspects of the
implementation will dictate which two are chosen, e.g., the first two in an array, or the
elements returned by a priority queue. We create a new tree whose root is weighted by the
sum of the weights of the trees chosen. We now have a forest of seven trees.
Choosing two more minimal trees yields another tree with weight two. There are now six
trees in the forest that will eventually become a single encoding tree.

Again we must choose the two trees of minimal weight. The lowest weight is the 'e'-node/tree
with weight equal to one. There are three trees with weight two; we can choose any of these
to create a new tree whose weight will be three.

Now there are two trees with weight equal to two. These are joined into a new tree whose
weight is four. There are four trees left, one whose weight is four and three with a weight of
three.

Two minimal trees (each of weight three) are joined into a tree whose weight is six. Here we
choose the 'g' and 'o' trees (we could have chosen the 'g' tree and the space-'e' tree, or the
'o' tree and the space-'e' tree). There are three trees left.
The minimal trees now have weights of three and four; these are joined into a tree whose
weight is seven, leaving two trees.

Finally, the last two trees are joined into a final tree whose weight is thirteen, the sum of the
two weights six and seven. Note that different tie-breaking choices along the way would have
produced a different tree, with different bit patterns for each character, but the total number
of bits used to encode "go go gophers" would be the same.
The character encoding induced by the last tree is shown below, where again 0 is used for
left edges and 1 for right edges.

Character   Code

'g'         00
'o'         01
'p'         1110
'h'         1101
'e'         101
'r'         1111
's'         1100
' '         100

The string "go go gophers" would be encoded as shown (with spaces used for easier reading,
the spaces wouldn't appear in the real encoding).

00 01 100 00 01 100 00 01 1110 1101 101 1111 1100

In total, 37 bits are used to encode "go go gophers". There are several trees that yield an
optimal 37-bit encoding of "go go gophers". The tree that actually results from a programmed
implementation of Huffman's algorithm will be the same each time the program is run for the
same weights (assuming no randomness is used in creating the tree).
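
As an illustration of the encoding step itself, the C++ sketch below hard-codes the table
derived above; a real program would generate the table from the tree (see section 3.2), and
the name encode is illustrative.

#include <iostream>
#include <map>
#include <string>

// Replace each character with its code word; the result is a
// string of '0'/'1' characters for readability, not packed bits.
std::string encode(const std::string& text,
                   const std::map<char, std::string>& codes) {
    std::string bits;
    for (char c : text)
        bits += codes.at(c);
    return bits;
}

int main() {
    std::map<char, std::string> codes = {
        {'g', "00"}, {'o', "01"}, {' ', "100"}, {'e', "101"},
        {'s', "1100"}, {'h', "1101"}, {'p', "1110"}, {'r', "1111"}};
    std::cout << encode("go go gophers", codes).size() << " bits\n";  // 37 bits
}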
Chapter 3
DECOMPRESSION

The process of decompression is simply a matter of translating the stream of prefix codes to
individual byte values, usually by traversing the Huffman tree node by node as each bit is
read from the input stream (reaching a leaf node necessarily terminates the search for that
particular byte value). Before this can take place, however, the Huffman tree must somehow
be reconstructed.
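
A sketch of this traversal in C++, reusing the Node struct from the sketch in section 2.2
(decode is an illustrative name):

#include <string>

// Walk the tree bit by bit: 0 goes left, 1 goes right; reaching a
// leaf emits that leaf's character and restarts the walk at the root.
std::string decode(const Node* root, const std::string& bits) {
    std::string out;
    const Node* cur = root;
    for (char bit : bits) {
        cur = (bit == '0') ? cur->left : cur->right;
        if (!cur->left && !cur->right) {   // leaf reached
            out += cur->ch;
            cur = root;
        }
    }
    return out;
}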

3.1 Storing the Huffman Tree

• In the simplest case, where character frequencies are fairly predictable, the tree can
be preconstructed (and even statistically adjusted on each compression cycle) and
thus reused every time, at the expense of at least some measure of compression
efficiency.
• Otherwise, the information needed to reconstruct the tree must be sent.
• A naive approach might be to prepend the frequency count of each character to the
compression stream. Unfortunately, the overhead in such a case could amount to
several kilobytes, so this method has little practical use.
• Another method is to simply prepend the Huffman tree, bit by bit, to the output
stream. For example, assuming that the value of 0 represents a parent node and 1 a
leaf node, whenever the latter is encountered the tree-building routine simply reads
the next 8 bits to determine the character value of that particular leaf. The process
continues recursively until the last leaf node is reached; at that point, the Huffman
tree is faithfully reconstructed. The overhead using such a method ranges
from roughly 2 to 320 bytes (assuming an 8-bit alphabet). A sketch of this method
appears after this list.
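
A minimal sketch of this bit-by-bit method, again reusing the Node struct from section 2.2;
for readability the "bits" are stored as '0'/'1' characters in a string rather than packed into
bytes, and the function names are illustrative.

#include <bitset>
#include <cstddef>
#include <string>

// Preorder serialization: '0' for a parent node, '1' followed by
// 8 bits of the character value for a leaf.
void writeTree(const Node* n, std::string& out) {
    if (!n->left && !n->right) {                       // leaf
        out += '1';
        out += std::bitset<8>(
            static_cast<unsigned char>(n->ch)).to_string();
    } else {                                           // parent
        out += '0';
        writeTree(n->left, out);
        writeTree(n->right, out);
    }
}

// The matching reader: consumes the stream produced above and
// rebuilds an identical tree; pos advances through the string.
Node* readTree(const std::string& in, std::size_t& pos) {
    if (in[pos++] == '1') {
        char c = static_cast<char>(
            std::bitset<8>(in.substr(pos, 8)).to_ulong());
        pos += 8;
        return new Node{0, c};                         // weights no longer matter
    }
    Node* left = readTree(in, pos);
    Node* right = readTree(in, pos);
    return new Node{0, '\0', left, right};
}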

Many other techniques are possible as well. In any case, since the compressed data can
include unused "trailing bits", the decompressor must be able to determine when to stop
producing output. This can be accomplished either by transmitting the length of
the decompressed data along with the compression model or by defining a special code
symbol to signify the end of input (the latter method can adversely affect code-length
optimality, however).

3.2 Creating the Huffman Table

To create a table or map of coded bit values for each character you'll need to traverse the
Huffman tree (e.g., inorder, preorder, etc.), making an entry in the table each time you reach
a leaf. For example, if you reach a leaf that stores the character 'C' by following the path
left-left-right-right-left, then the entry in the 'C'-th location of the map should be set to 00110.
You'll need to make a decision about how to store the bit patterns in the map. At least two
methods are possible for implementing what could be a class/struct BitPattern:
• Use a string. This makes it easy to append a character (using +) to a string during tree
traversal and makes it possible to use a string as the BitPattern. Your program may be
slow, because appending characters to a string (when creating the bit pattern) and
accessing characters in a string (when writing 0's and 1's during compression) are
slower than the next approach.
• Alternatively, you can store an integer for the bitwise coding of a character. You need
to store the length of the code too, to differentiate between 01001 and 00101.
However, using an int restricts root-to-leaf paths to at most 32 edges, since an int
holds 32 bits. In a pathological file, a Huffman tree could have a root-to-leaf path of
over 100 edges. Because of this problem, you should use strings to store paths rather
than ints. A slow correct program is better than a fast incorrect program. A sketch of
the string-based approach appears below.
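
A minimal sketch of the string-based traversal, reusing the Node struct from section 2.2
(buildTable is an illustrative name):

#include <map>
#include <string>

// Recursive traversal that records the root-to-leaf path as a
// string of '0'/'1' characters; each leaf yields one table entry.
void buildTable(const Node* n, const std::string& path,
                std::map<char, std::string>& table) {
    if (!n->left && !n->right) {       // leaf: the path is the code
        table[n->ch] = path;
        return;
    }
    buildTable(n->left, path + '0', table);
    buildTable(n->right, path + '1', table);
}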

3.3 Storing Sizes

The operating system will buffer output, i.e., output to disk actually occurs only when some
internal buffer is full. In particular, it is not possible to write just one single bit to a file; all
output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any
case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but
you cannot be sure precisely when they're written during the execution of your program.
Also, because of buffering, if all output is done in eight-bit chunks and your
program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number
of bits written is a multiple of eight. Because of the potential for these "extra bits" when
reading one bit at a time, you cannot simply read bits until there are no more left, since your
program might then read the extra bits written due to buffering.
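
One common remedy is to record the number of meaningful bits in front of the packed data,
so the reader knows exactly when to stop and can ignore the padding in the final byte. The
C++ sketch below assumes the '0'/'1' string representation used in the earlier sketches;
writeBits is an illustrative name.

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>

// Write a 32-bit count of the meaningful bits, then the bits
// packed eight to a byte; the last byte is padded with zeros.
void writeBits(std::ofstream& out, const std::string& bits) {
    std::uint32_t n = static_cast<std::uint32_t>(bits.size());
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
    for (std::size_t i = 0; i < bits.size(); i += 8) {
        unsigned char byte = 0;
        for (std::size_t j = 0; j < 8 && i + j < bits.size(); ++j)
            if (bits[i + j] == '1')
                byte |= static_cast<unsigned char>(1u << (7 - j));
        out.put(static_cast<char>(byte));
    }
}

A program writing 61 bits this way would store the count 61 followed by eight bytes, the
last of which is padded with three zero bits that the reader knows to ignore.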
Chapter 4
PROGRAM CODE AND OUTPUT
