
HUFFMAN CODE

1. INTRODUCTION / HISTORY

In computer science and information theory, a Huffman code is a particular type of


optimal prefix code that is commonly used for lossless data compression. The process of finding
or using such a code proceeds by means of Huffman coding, an algorithm developed by David
A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method
for the Construction of Minimum-Redundancy Codes". The output from Huffman's algorithm can
be viewed as a variable-length code table for encoding a source symbol (such as a character in
a file).

The algorithm derives this table from the estimated probability or frequency of
occurrence (weight) for each possible value of the source symbol. As in other entropy encoding
methods, more common symbols are generally represented using fewer bits than less common
symbols. Huffman's method can be efficiently implemented, finding a code in time linear to the
number of input weights if these weights are sorted. However, although optimal among
methods encoding symbols separately, Huffman coding is not always optimal among all
compression methods.

In 1951, David A. Huffman and his MIT information theory classmates were given the
choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper
on the problem of finding the most efficient binary code. Huffman, unable to prove any codes
were the most efficient, was about to give up and start studying for the final when he hit upon
the idea of using a frequency-sorted binary tree and quickly proved this method the most
efficient.

In doing so, Huffman outdid Fano, who had worked with information theory inventor
Claude Shannon to develop a similar code. Huffman coding uses a specific method for choosing
the representation for each symbol, resulting in a prefix code (sometimes called "prefix-free
codes", that is, the bit string representing some particular symbol is never a prefix of the bit
string representing any other symbol). Huffman coding is such a widespread method for
creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix
code" even when such a code is not produced by Huffman's algorithm.
2. DISCUSSION

Huffman coding is an efficient method of compressing data without losing information.


In computer science, information is encoded as bits—1's and 0's. Strings of bits encode the
information that tells a computer which instructions to carry out. Video games, photographs,
movies, and more are encoded as strings of bits in a computer. Computers execute billions of
instructions per second, and a single video game can be billions of bits of data. It is easy to see
why efficient and unambiguous information encoding is a topic of interest in computer science.

Huffman coding provides an efficient, unambiguous code by analyzing the frequencies


that certain symbols appear in a message. Symbols that appear more often will be encoded as
shorter bit strings, while symbols that aren't used as much will be encoded as longer strings.
Since the frequencies of symbols vary across messages, there is no one Huffman coding that
will work for all messages. This means that the Huffman coding for sending message X may
differ from the Huffman coding used to send message Y. There is an algorithm for generating
the Huffman coding for a given message based on the frequencies of symbols in that particular
message.

Huffman coding works by using a frequency-sorted binary tree to encode symbols.

In information theory, the goal is usually to transmit information in the fewest bits
possible in such a way that each encoding is unambiguous. For example, to encode A, B, C, and
D in the fewest bits possible, each letter could be encoded as “1”. However, with this encoding,
the message “1111” could mean “ABCD” or “AAAA”—it is ambiguous.

Encodings can either be fixed-length or variable-length.

A fixed-length encoding is where the encoding for each symbol has the same number of bits.
For example:

A 00

B 01

C 10

D 11
A variable-length encoding is where symbols can be encoded with different numbers of bits. For
example:

A 000

B 1

C 110

D 1111
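To see the effect on message size, here is a small Python sketch (the message is a made-up example dominated by the frequent symbol B, and the helper name is our own) comparing the two tables above:

```python
fixed    = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable = {"A": "000", "B": "1", "C": "110", "D": "1111"}

def encode(message, table):
    # Concatenate each symbol's bit string.
    return "".join(table[ch] for ch in message)

msg = "BBABBCBB"                   # hypothetical B-heavy message
print(len(encode(msg, fixed)))     # 16 bits: 8 symbols x 2 bits each
print(len(encode(msg, variable)))  # 12 bits: B costs 1 bit, A and C cost 3
```

For a message full of D's the variable-length table would do worse than the fixed one, which is exactly the caveat raised below about messages with many rare symbols.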

Huffman code is a way to encode information using variable-length strings to represent


symbols depending on how frequently they appear. The idea is that symbols that are used more
frequently should be shorter while symbols that appear more rarely can be longer. This way,
the number of bits it takes to encode a given message will be shorter, on average, than if a
fixed-length code was used. In messages that include many rare symbols, the string produced
by variable-length encoding may be longer than one produced by a fixed-length encoding.

As shown in the above sections, it is important for an encoding scheme to be


unambiguous. Since variable-length encodings are susceptible to ambiguity, care must be taken
to generate a scheme where ambiguity is avoided. Huffman coding uses a greedy algorithm to
build a prefix tree that optimizes the encoding scheme so that the most frequently used
symbols have the shortest encoding. The prefix tree describing the encoding ensures that the
code for any particular symbol is never a prefix of the bit string representing any other symbol.
To determine the binary assignment for a symbol, make the leaves of the tree correspond to
the symbols, and the assignment will be the path it takes to get from the root of the tree to
that leaf.

The Huffman coding algorithm takes in information about the frequencies or


probabilities of a particular symbol occurring. It begins to build the prefix tree from the bottom
up, starting with the two least probable symbols in the list. It takes those symbols and forms a
subtree containing them, and then removes the individual symbols from the list. The algorithm
sums the probabilities of elements in a subtree and adds the subtree and its probability to the
list. Next, the algorithm searches the list and selects the two symbols or subtrees with the
smallest probabilities. It uses those to make a new subtree, removes the original
subtrees/symbols from the list, and then adds the new subtree and its combined probability to
the list. This repeats until there is one tree and all elements have been added.
Given the following probability table, create a Huffman tree to encode each symbol.

Symbol Probability

A 0.3

B 0.3

C 0.2

D 0.1

E 0.1

The two elements with the smallest probability are D and E. So we create the subtree:

And update the list to include the subtree DE with a probability of 0.1 + 0.1 = 0.2:

Symbol Probability

A 0.3

B 0.3

C 0.2

DE 0.2

The next two smallest probabilities are DE and C, so we create the subtree:
And update the list to include the subtree CDE with a probability of 0.2 + 0.2 = 0.4:

Symbol Probability

A 0.3

B 0.3

CDE 0.4

The next two smallest probabilities are A and B, so we create the subtree:

And update the list to include the subtree AB with a probability of 0.3 + 0.3 = 0.6:

Symbol Probability

AB 0.6

CDE 0.4

Now, we only have two elements left, so we build the subtree:


The probability of ABCDE is 1, which is expected, since exactly one of the symbols is certain to occur.

Here are the encodings we get from the tree:

Symbol Encoding

A 11

B 10

C 01

D 001

E 000

The Huffman Coding Algorithm

- Take a list of symbols and their probabilities.

- Select the two symbols with the lowest probabilities (if multiple symbols have the same
probability, select two arbitrarily).

- Create a binary tree out of these two symbols, labeling one branch with a "1" and the
other with a "0". It doesn't matter which side you label 1 or 0 as long as the labeling is
consistent throughout the problem (e.g. the left side should always be 1 and the right
side should always be 0, or vice versa).

- Add the probabilities of the two symbols to get the probability of the new subtree.

- Remove the symbols from the list and add the subtree to the list.

- Go back through the list and take the two symbols/subtrees with the smallest
probabilities and combine those into a new subtree. Remove the original
symbols/subtrees from the list, and add the new subtree to the list.

- Repeat until all of the elements are combined.
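The steps above can be sketched as a short Python program (an illustrative implementation using the standard library's heapq module; the function and variable names are our own, and because ties are broken arbitrarily, several equally optimal code tables are possible):

```python
import heapq
from itertools import count

def huffman_code(probabilities):
    """Build a code table {symbol: bitstring} from {symbol: probability}."""
    tiebreak = count()  # breaks ties so trees are never compared directly
    # Heap entries are (probability, tiebreak, tree); a tree is a bare
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(p, next(tiebreak), sym) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)   # two smallest probabilities
        p2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (left, right)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):         # internal node: descend
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                               # leaf: the root-to-leaf path is the code
            codes[tree] = prefix or "0"
    walk(heap[0][2], "")
    return codes

table = huffman_code({"A": 0.3, "B": 0.3, "C": 0.2, "D": 0.1, "E": 0.1})
# One optimal result: two-bit codes for A, B, C and three-bit codes for D, E.
```

Run on the probability table from the worked example above, this reproduces the expected code lengths: two bits each for A, B, C and three bits each for D and E.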


3. SAMPLE PROBLEM

Using the text ABCBAACD as an example and applying those steps, we have the
following tree:

So the new representation of the bytes on the text are:


- A: 0
- B: 10
- C: 111
- D: 110
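Because this table is a prefix code, a decoder can read the bit string left to right and emit a symbol the moment its buffer matches a code word. A small Python sketch (function names are ours) using the table above:

```python
codes = {"A": "0", "B": "10", "C": "111", "D": "110"}
decode_table = {bits: sym for sym, bits in codes.items()}

def encode(text):
    # Concatenate the code word for each symbol.
    return "".join(codes[ch] for ch in text)

def decode(bits):
    # Greedy left-to-right scan: unambiguous because no code word
    # is a prefix of another.
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:
            out.append(decode_table[buf])
            buf = ""
    return "".join(out)

encoded = encode("ABCBAACD")      # "0101111000111110" (16 bits)
assert decode(encoded) == "ABCBAACD"
```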

4. REFERENCES
https://www2.cs.duke.edu/csed/poop/huff/info/

https://brilliant.org/wiki/huffman-encoding/

https://en.wikipedia.org/wiki/Huffman_coding
LEMPEL-ZIV

1. INTRODUCTION

Lempel–Ziv–Welch (LZW) is a universal lossless data compression algorithm created by


Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an
improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The
algorithm is simple to implement and has the potential for very high throughput in hardware
implementations.[1] It is the algorithm of the widely used Unix file compression utility compress
and is used in the GIF image format.

The scenario described by Welch's 1984 paper[1] encodes sequences of 8-bit data as
fixed-length 12-bit codes. The codes from 0 to 255 represent 1-character sequences consisting
of the corresponding 8-bit character, and the codes 256 through 4095 are created in a
dictionary for sequences encountered in the data as it is encoded. At each stage in
compression, input bytes are gathered into a sequence until the next character would make a
sequence with no code yet in the dictionary. The code for the sequence (without that character)
is added to the output, and a new code (for the sequence with that character) is added to the
dictionary. The idea was quickly adapted to other situations. In an image based on a color
table, for example, the natural character alphabet is the set of color table indexes, and in the
1980s, many images had small color tables (on the order of 16 colors). For such a reduced
alphabet, the full 12-bit codes yielded poor compression unless the image was large, so the
idea of a variable-width code was introduced: codes typically start one bit wider than the
symbols being encoded, and as each code size is used up, the code width increases by 1 bit, up
to some prescribed maximum (typically 12 bits). When the maximum code value is reached,
encoding proceeds using the existing table, but new codes are not generated for addition to the
table.

Further refinements include reserving a code to indicate that the code table should be
cleared and restored to its initial state (a "clear code", typically the first value immediately after
the values for the individual alphabet characters), and a code to indicate the end of data (a
"stop code", typically one greater than the clear code). The clear code lets the table be
reinitialized after it fills up, which lets the encoding adapt to changing patterns in the input
data. Smart encoders can monitor the compression efficiency and clear the table whenever the
existing table no longer matches the input well.
2. DISCUSSION

Around 1977, Abraham Lempel and Jacob Ziv developed the Lempel-Ziv class of
adaptive dictionary data compression techniques. Also known as LZ77 coding, they are now
some of the most popular compression techniques. The LZ coding scheme takes into account
repetition in phrases, words or parts of words. These
repeated parts can either be text or binary. A flag is normally used to identify coded and
unencoded parts. An example piece of text could be: ‘The receiver requires a receipt which is
automatically sent when it is received.’ This has the repetitive sequence ‘recei’. The encoded
sequence could be modified with the flag sequence #m#n where m represents the number of
characters to trace back to find the character sequence and n the number of replaced
characters. Thus the encoded message could become: ‘The receiver requires a #20#5pt which
is automatically sent wh#6#2 it #30#2 #47#5ved.’ Normally a long sequence of text has many
repeated words and phrases, such as ‘and’, ‘there’, and so on. Note that in some cases this could
lead to longer files if short sequences were replaced with codes that were longer than the
actual sequence itself.

The Lempel-Ziv-Welch (LZW) algorithm (a refinement of LZ78) builds a dictionary of


frequently used groups of characters (or 8-bit binary values). Before the file is decoded, the
compression dictionary must be sent (if transmitting data) or stored (if data is being stored).
This method is good at compressing text files because text files contain ASCII characters (which
are stored as 8-bit binary values) but not so good for graphics files, which may have repeating
patterns of binary digits that might not be multiples of 8 bits.

Why do we need compression algorithms?

There are two categories of compression techniques: lossy and lossless. While each
uses different techniques to compress files, both have the same aim: to look for duplicate data
(in a GIF, this is what LZW does) and use a much more compact data representation. Lossless
compression reduces bits by identifying and eliminating statistical redundancy; no information is
lost. Lossy compression, on the other hand, reduces bits by removing
unnecessary or less important information. We need data compression mainly because:
- Uncompressed data can take up a lot of space, which is not good for limited hard drive
space and internet download speeds.

- While hardware gets better and cheaper, algorithms that reduce data size also help
technology evolve.

- Example: one minute of uncompressed HD video can be over 1 GB. How can we fit a
two-hour film on a 25 GB Blu-ray disc?

Lossy compression methods include DCT (Discrete Cosine Transform) and vector
quantisation, while lossless compression methods include Huffman coding, RLE (Run-Length
Encoding), string-table compression, LZW (Lempel–Ziv–Welch) and zlib. Several compression
algorithms exist, but here we concentrate on LZW.

What is the Lempel–Ziv–Welch (LZW) algorithm?

The LZW algorithm is a very common compression technique, typically used in GIF
images and optionally in PDF and TIFF files, as well as in Unix's ‘compress’ command. It
is lossless, meaning no data is lost when compressing. The algorithm is simple to implement
and has the potential for very high throughput in hardware implementations.

The idea relies on recurring patterns to save data space. LZW is the foremost technique for
general purpose data compression due to its simplicity and versatility. It is the basis of many PC
utilities that claim to “double the capacity of your hard drive”.

How does it work?

LZW compression works by reading a sequence of symbols, grouping the symbols into
strings, and converting the strings into codes. Because the codes take up less space than the
strings they replace, we get compression. Characteristic features of LZW include:

- LZW compression uses a code table, with 4096 as a common choice for the number of
table entries. Codes 0-255 in the code table are always assigned to represent single
bytes from the input file.

- When encoding begins the code table contains only the first 256 entries, with the
remainder of the table being blanks. Compression is achieved by using codes 256
through 4095 to represent sequences of bytes.

- As the encoding continues, LZW identifies repeated sequences in the data, and adds
them to the code table.

- Decoding is achieved by taking each code from the compressed file and translating it
through the code table to find what character or characters it represents.
Example: ASCII code. Typically, every character is stored with 8 binary bits, allowing up
to 256 unique symbols for the data. This algorithm tries to extend the library to 9 to 12 bits per
character. The new unique symbols are made up of combinations of symbols that occurred
previously in the string. It does not always compress well, especially with short, diverse strings,
but it is good for compressing redundant data, and it does not have to save the dictionary with
the data: this method can both compress and decompress data.

Implementation

The idea of the compression algorithm is the following: as the input data is being
processed, a dictionary keeps a correspondence between the longest encountered words and a
list of code values. The words are replaced by their corresponding codes and so the input file is
compressed. Therefore, the efficiency of the algorithm increases as the number of long,
repetitive words in the input data increases.

Advantages of LZW over Huffman:

- LZW requires no prior information about the input data stream.

- LZW can compress the input stream in one single pass.

- Another advantage of LZW is its simplicity, allowing fast execution.


The simple scheme described above focuses on the LZW algorithm itself. Many
applications apply further encoding to the sequence of output symbols. Some package the
coded stream as printable characters using some form of binary-to-text encoding; this increases
the encoded length and decreases the compression rate. Conversely, increased compression
can often be achieved with an adaptive entropy encoder. Such a coder estimates the probability
distribution for the value of the next symbol, based on the observed frequencies of values so
far. A standard entropy encoding such as Huffman coding or arithmetic coding then uses
shorter codes for values with higher probabilities.

LZW compression became the first widely used universal data compression method on
computers. A large English text file can typically be compressed via LZW to about half its
original size.

LZW was used in the public-domain program compress, which became a more or less
standard utility in Unix systems around 1986. It has since disappeared from many distributions,
both because it infringed the LZW patent and because gzip produced better compression ratios
using the LZ77-based DEFLATE algorithm, but as of 2008 at least FreeBSD includes
both compress and uncompress as a part of the distribution. Several other popular compression
utilities also used LZW or closely related methods.

LZW became very widely used when it became part of the GIF image format in 1987. It
may also (optionally) be used in TIFF and PDF files. (Although LZW is available in Adobe
Acrobat software, Acrobat by default uses DEFLATE for most text and color-table-based image
data in PDF files.)

The Variable-length-code LZW (VLC-LZW) uses a variation of the LZW algorithm where
variable-length codes are used to replace patterns detected in the original data. It uses a
dictionary constructed from the patterns encountered in the original data. Each new pattern is
entered into it and its indexed address is used to replace it in the compressed stream. The
transmitter and receiver maintain the same dictionary. The VLC part of the algorithm is based
on an initial code size (the LZW initial code size), which specifies the initial number of bits used
for the compression codes. When the number of patterns detected by the compressor in the
input stream exceeds the number of patterns encodable with the current number of bits then
the number of bits per LZW code is increased by one. The code size is initially transmitted (or
stored) so that the receiver (or uncompressor) knows the size of the dictionary and the length
of the codewords. In 1985 the LZW algorithm was patented by the Sperry Corp. It is used by
the GIF file format and is similar to the technique used to compress data in V.42bis modems.

LZ compression substitutes the detected repeated patterns with references to a


dictionary. Unfortunately, the larger the dictionary, the greater the number of bits that are
necessary for the references. The optimal size of the dictionary also varies for different types of
data: the more variable the data, the smaller the optimal size of the dictionary.

Since the codes emitted typically do not fall on byte boundaries, the encoder and
decoder must agree on how codes are packed into bytes. The two common methods are LSB-
first("least significant bit first") and MSB-first ("most significant bit first"). In LSB-first packing,
the first code is aligned so that the least significant bit of the code falls in the least significant
bit of the first stream byte, and if the code has more than 8 bits, the high-order bits left over
are aligned with the least significant bits of the next byte; further codes are packed with LSB
going into the least significant bit not yet used in the current stream byte, proceeding into
further bytes as necessary. MSB-first packing aligns the first code so that its most significant bit
falls in the MSB of the first stream byte, with overflow aligned with the MSB of the next byte;
further codes are written with MSB going into the most significant bit not yet used in the
current stream byte.
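The LSB-first scheme can be sketched as follows (Python; the helper name and the 9-bit code width are illustrative choices, not part of any particular file format):

```python
def pack_lsb(codes, width):
    """Pack fixed-width codes into bytes, least significant bit first."""
    bitbuf = 0           # accumulated bits; bit 0 is the oldest unwritten bit
    nbits = 0            # how many bits are waiting in bitbuf
    out = bytearray()
    for code in codes:
        bitbuf |= code << nbits    # new code lands above the leftover bits
        nbits += width
        while nbits >= 8:          # flush whole bytes, low bits first
            out.append(bitbuf & 0xFF)
            bitbuf >>= 8
            nbits -= 8
    if nbits:
        out.append(bitbuf & 0xFF)  # final partial byte, zero-padded
    return bytes(out)

# Two 9-bit codes fit into three bytes; the ninth bit of the first code
# becomes the lowest bit of the second byte.
packed = pack_lsb([0x101, 0x0FF], 9)   # b'\x01\xff\x01'
```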

Encoding

A dictionary is initialized to contain the single-character strings corresponding to all the


possible input characters (and nothing else except the clear and stop codes if they're being
used). The algorithm works by scanning through the input string for successively longer
substrings until it finds one that is not in the dictionary. When such a string is found, the index
for the string without the last character (i.e., the longest substring that is in the dictionary) is
retrieved from the dictionary and sent to output, and the new string (including the last
character) is added to the dictionary with the next available code. The last input character is
then used as the next starting point to scan for substrings.

In this way, successively longer strings are registered in the dictionary and available for
subsequent encoding as single output values. The algorithm works best on data with repeated
patterns, so the initial parts of a message see little compression. As the message grows,
however, the compression ratio tends asymptotically to the maximum (i.e., the compression
factor or ratio improves on an increasing curve, and not linearly, approaching a theoretical
maximum inside a limited time period rather than over infinite time).
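The encoding loop just described can be sketched in Python (an illustrative implementation for character strings; clear and stop codes, and the packing of codes into bits, are omitted for simplicity):

```python
def lzw_compress(data):
    """LZW encoding: extend the current string until it is no longer in
    the dictionary, then emit the code for the known prefix and register
    the new string under the next available code."""
    dictionary = {chr(i): i for i in range(256)}  # single-character strings
    next_code = 256
    w = ""            # longest string matched so far
    out = []
    for ch in data:
        wc = w + ch
        if wc in dictionary:
            w = wc                        # keep extending the match
        else:
            out.append(dictionary[w])     # emit the code for the known prefix
            dictionary[wc] = next_code    # register the new string
            next_code += 1
            w = ch                        # restart from the last character
    if w:
        out.append(dictionary[w])         # flush the final match
    return out

# "BA" -> 256, "AB" -> 257, "BAA" -> 258, "ABA" -> 259, "AA" -> 260
print(lzw_compress("BABAABAAA"))          # [66, 65, 256, 257, 65, 260]
```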

Decoding

The decoding algorithm works by reading a value from the encoded input and
outputting the corresponding string from the initialized dictionary. To rebuild the dictionary in
the same way as it was built during encoding, the decoder also obtains the next value from the
input and adds to the dictionary the concatenation of the current string and the first character
of the string obtained by decoding that next value. If the next value is not yet in the decoder's
dictionary, it must be the very code added during this iteration, so its first character must equal
the first character of the current string; in that case the decoder adds the concatenation of the
current string and its own first character. The decoder then proceeds to the next input value
(which was already read in as the "next value" in the previous pass) and repeats the process
until there is no more input, at which point the final input value is decoded without any further
additions to the dictionary.

In this way, the decoder builds a dictionary that is identical to that used by the encoder, and
uses it to decode subsequent input values. Thus, the full dictionary does not need to be sent
with the encoded data. Just the initial dictionary that contains the single-character strings is
sufficient (and is typically defined beforehand within the encoder and decoder rather than
explicitly sent with the encoded data.)
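The decoding procedure, including the special case for a code that is not yet in the dictionary, can be sketched as follows (Python, illustrative names, assuming the input was produced by a matching encoder over single characters):

```python
def lzw_decompress(codes):
    """Rebuild the encoder's dictionary while decoding."""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    it = iter(codes)
    w = dictionary[next(it)]       # the first code is always a single character
    out = [w]
    for code in it:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code was created in the same step that emitted it, so the
            # string must be the current string plus its own first character.
            entry = w + w[0]
        else:
            raise ValueError("invalid compressed code: %d" % code)
        out.append(entry)
        dictionary[next_code] = w + entry[0]  # mirror the encoder's update
        next_code += 1
        w = entry
    return "".join(out)

print(lzw_decompress([66, 65, 256, 257, 65, 260]))  # BABAABAAA
```

Note that only the implicit initial dictionary (the 256 single characters) is shared in advance; everything else is reconstructed from the code stream itself.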
3. SAMPLE PROBLEM

Compression using LZW


Example 1: Use the LZW algorithm to compress the string: BABAABAAA
The steps involved are systematically shown in the diagram below; the resulting output is the
code sequence <66> <65> <256> <257> <65> <260>.
LZW Decompression
The LZW decompressor creates the same string table during decompression. It starts
with the first 256 table entries initialized to single characters. The string table is
updated for each character in the input stream, except the first one. Decoding is achieved
by reading codes and translating them through the code table as it is being built.

4. REFERENCES
https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
https://www.geeksforgeeks.org/lzw-lempel-ziv-welch-compression-technique/
http://www.soc.napier.ac.uk/~bill/pdf/ADCO_C03.PDF
CRC
1. INTRODUCTION

CRCs are based on the theory of cyclic error-correcting codes. The use
of systematic cyclic codes, which encode messages by adding a fixed-length check value, for
the purpose of error detection in communication networks, was first proposed by W. Wesley
Peterson in 1961.[2] Cyclic codes are not only simple to implement but have the benefit of being
particularly well suited for the detection of burst errors: contiguous sequences of erroneous
data symbols in messages. This is important because burst errors are common transmission
errors in many communication channels, including magnetic and optical storage devices.
Typically an n-bit CRC applied to a data block of arbitrary length will detect any single error
burst not longer than n bits, and the fraction of all longer error bursts that it will detect is
(1 − 2^−n).

Specification of a CRC code requires definition of a so-called generator polynomial. This


polynomial becomes the divisor in a polynomial long division, which takes the message as
the dividend and in which the quotient is discarded and the remainder becomes the result. The
important caveat is that the polynomial coefficients are calculated according to the arithmetic of
a finite field, so the addition operation can always be performed bitwise-parallel (there is no
carry between digits).

In practice, all commonly used CRCs employ the Galois field of two elements, GF(2). The
two elements are usually called 0 and 1, comfortably matching computer architecture.

A CRC is called an n-bit CRC when its check value is n bits long. For a given n, multiple
CRCs are possible, each with a different polynomial. Such a polynomial has highest degree n,
which means it has n + 1 terms. In other words, the polynomial has a length of n + 1; its
encoding requires n + 1 bits. Note that most polynomial specifications either drop
the MSB or LSB, since they are always 1. The CRC and associated polynomial typically have a
name of the form CRC-n-XXX as in the table below.

The simplest error-detection system, the parity bit, is in fact a 1-bit CRC: it uses the generator
polynomial x + 1 (two terms), and has the name CRC-1.
2. DISCUSSION

A cyclic redundancy check (CRC) is an error-detecting code commonly used in


digital networks and storage devices to detect accidental changes to raw data. Blocks of data
entering these systems get a short check value attached, based on the remainder of
a polynomial division of their contents. On retrieval, the calculation is repeated and, in the
event the check values do not match, corrective action can be taken against data corruption.
CRCs can also be used for error correction.[1]

CRCs are so called because the check (data verification) value is a redundancy (it
expands the message without adding information) and the algorithm is based on cyclic codes.
CRCs are popular because they are simple to implement in binary hardware, easy to analyze
mathematically, and particularly good at detecting common errors caused by noise in
transmission channels. Because the check value has a fixed length, the function that generates
it is occasionally used as a hash function.

The CRC was invented by W. Wesley Peterson in 1961; the 32-bit CRC function, used in
Ethernet and many other standards, is the work of several researchers and was published in
1975.

Cyclic Redundancy Check (CRC) is a block code. It is commonly used to detect accidental
changes to data transmitted via telecommunications networks and storage devices.

CRC involves binary division of the data bits being sent by a predetermined divisor
agreed upon by the communicating system. The divisor is generated using polynomials. So,
CRC is also called polynomial code checksum.

Before sending the message over network channels, the sender encodes the message
using CRC. The receiver decodes the incoming message to detect errors. If the message is
error-free, it is accepted; otherwise, the receiver asks for re-transmission of the message.

The process is illustrated as follows:


Computation of CRC

When messages are encoded using CRC (a polynomial code), a fixed polynomial called the
generator polynomial, G(x), is used. The value of G(x) is mutually agreed upon by the sending
and the receiving parties. A k-bit word is represented by a polynomial whose terms range from
x^0 to x^(k−1). The order of this polynomial is the power of its highest term, i.e. (k − 1). The
length of G(x) should be less than the length of the message it encodes. Also, both its MSB
(most significant bit) and LSB (least significant bit) should be 1. In the process of encoding, CRC
bits are appended to the message so that the resultant frame is divisible by G(x).

- Algorithm for Encoding using CRC

o The communicating parties agree upon the size of the message, M(x), and the
generator polynomial, G(x).

o If r is the order of G(x), r zero bits are appended to the low-order end of M(x).
This makes the block size k + r bits, and its value is x^r M(x).

o The block x^r M(x) is divided by G(x) using modulo-2 division.

o The remainder after division is added to x^r M(x) using modulo-2 addition. The
result is the frame to be transmitted, T(x). The encoding procedure makes T(x)
exactly divisible by G(x).

- Algorithm for Decoding using CRC

o The receiver divides the incoming data frame T(x) by G(x) using modulo-2
division. Mathematically, if E(x) is the error, then the modulo-2 division of
[T(x) + E(x)] by G(x) is done.

o If there is no remainder, it implies that E(x) = 0, and the data frame is accepted.

o A remainder indicates a non-zero value of E(x), in other words the presence of an
error, so the data frame is rejected. The receiver may then send a negative
acknowledgment back to the sender, requesting retransmission.
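The modulo-2 division at the heart of both steps can be sketched in Python (the message bits and the generator x^3 + x + 1 are illustrative values, not taken from this document):

```python
def mod2div(bits, gen):
    """Polynomial long division over GF(2): XOR the generator into the
    working block wherever the leading bit is 1; return the remainder."""
    bits = list(bits)                 # work on a copy
    r = len(gen) - 1                  # degree of G(x) = number of CRC bits
    for i in range(len(bits) - r):
        if bits[i]:                   # leading bit set: "subtract" G(x)
            for j, g in enumerate(gen):
                bits[i + j] ^= g
    return bits[-r:]

msg = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0]  # example message bits M(x)
gen = [1, 0, 1, 1]                                 # G(x) = x^3 + x + 1
crc = mod2div(msg + [0] * 3, gen)                  # divide x^r M(x): [1, 0, 0]
frame = msg + crc                                  # transmitted frame T(x)
assert mod2div(frame, gen) == [0, 0, 0]            # T(x) divides exactly
```

The final assertion is the receiver's check: dividing the received frame by G(x) leaves a zero remainder when no error occurred.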

The concept of the CRC as an error-detecting code gets complicated when an implementer or
standards committee uses it to design a practical system. Here are some of the complications:

- Sometimes an implementation prefixes a fixed bit pattern to the bitstream to be checked.
This is useful when clocking errors might insert 0-bits in front of a message, an alteration
that would otherwise leave the check value unchanged.

- Usually, but not always, an implementation appends n 0-bits (n being the size of the CRC)
to the bitstream to be checked before the polynomial division occurs. Such appending is
explicitly demonstrated in the Computation of CRC section above. This has the convenience that the
remainder of the original bitstream with the check value appended is exactly zero, so the
CRC can be checked simply by performing the polynomial division on the received bitstream
and comparing the remainder with zero. Due to the associative and commutative properties
of the exclusive-or operation, practical table-driven implementations can obtain a result
numerically equivalent to zero-appending without explicitly appending any zeroes, by using
an equivalent,[8] faster algorithm that combines the message bitstream with the stream
being shifted out of the CRC register.

- Sometimes an implementation exclusive-ORs a fixed bit pattern into the remainder of the
polynomial division.
- Bit order: Some schemes view the low-order bit of each byte as "first", which then during
polynomial division means "leftmost", which is contrary to our customary understanding of
"low-order". This convention makes sense when serial-port transmissions are CRC-checked
in hardware, because some widespread serial-port transmission conventions transmit bytes
least-significant bit first.

- Byte order: With multi-byte CRCs, there can be confusion over whether the byte
transmitted first (or stored in the lowest-addressed byte of memory) is the least-significant
byte (LSB) or the most-significant byte (MSB). For example, some 16-bit CRC schemes
swap the bytes of the check value.
 Omission of the high-order bit of the divisor polynomial: Since the high-order bit is always
1, and since an n-bit CRC must be defined by an (n + 1)-bit divisor which overflows an n-
bit register, some writers assume that it is unnecessary to mention the divisor's high-order
bit.
 Omission of the low-order bit of the divisor polynomial: Since the low-order bit is always 1,
some writers likewise omit it when specifying the divisor.

Standards and common use

Numerous varieties of cyclic redundancy checks have been incorporated into technical
standards. By no means does one algorithm, or one of each degree, suit every purpose;
Koopman and Chakravarty recommend selecting a polynomial according to the application
requirements and the expected distribution of message lengths. The number of distinct CRCs in
use has confused developers, a situation which authors have sought to address. There are
three polynomials reported for CRC-12, twenty-two conflicting definitions of CRC-16, and seven
of CRC-32.

The polynomials commonly applied are not the most efficient ones possible. Since 1993,
Koopman, Castagnoli and others have surveyed the space of polynomials between 3 and 64 bits
in size, finding examples that have much better performance (in terms of Hamming distance for
a given message size) than the polynomials of earlier protocols, and publishing the best of
these with the aim of improving the error detection capacity of future standards. In
particular, iSCSI and SCTP have adopted one of the findings of this research, the CRC-32C
(Castagnoli) polynomial.

The design of the 32-bit polynomial most commonly used by standards bodies, CRC-32-
IEEE, was the result of a joint effort for the Rome Laboratory and the Air Force Electronic
Systems Division by Joseph Hammond, James Brown and Shyan-Shiang Liu of the Georgia
Institute of Technology and Kenneth Brayer of the Mitre Corporation. The earliest known
appearances of the 32-bit polynomial were in their 1975 publications: Technical Report 2956 by
Brayer for Mitre, published in January and released for public dissemination through DTIC in
August, and Hammond, Brown and Liu's report for the Rome Laboratory, published in
May. Both reports contained contributions from the other team. During December 1975, Brayer
and Hammond presented their work in a paper at the IEEE National Telecommunications
Conference: the IEEE CRC-32 polynomial is the generating polynomial of a Hamming code and
was selected for its error detection performance. Even so, the Castagnoli CRC-32C polynomial
used in iSCSI or SCTP matches its performance on messages from 58 bits to 131 kbits, and
outperforms it in several size ranges including the two most common sizes of Internet
packet. The ITU-T G.hn standard also uses CRC-32C to detect errors in the payload (although it
uses CRC-16-CCITT for PHY headers).

CRC32 computation is implemented in hardware as an operation of the SSE4.2 instruction
set, first introduced in Intel processors' Nehalem microarchitecture.

A CRC is conceptually simple: you take a polynomial represented as bits and the
data, and divide the polynomial into the data using modulo 2 arithmetic (or you
represent the data as a polynomial and do the same thing). The remainder, which lies
between 0 and the polynomial, is the CRC.

The way to understand CRCs is to try to compute a few using a short piece of data (16
bits or so) with a short polynomial -- 4 bits, perhaps. If you practice this way, you'll really
understand how you might go about coding it.

Computed repeatedly, a CRC is quite slow in software. Hardware computation is
much more efficient, requiring just a few gates.
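The bit-by-bit software approach described above can be sketched in Python (a minimal illustration; the function name and the message/generator values below are ours, chosen as a short practice example):

```python
def crc_remainder(bits: str, divisor: str) -> str:
    """Append len(divisor)-1 zero bits, then do modulo-2 (XOR) long division
    and return the remainder, i.e. the CRC check value."""
    n = len(divisor) - 1                  # number of CRC bits
    row = [int(b) for b in bits + "0" * n]
    div = [int(b) for b in divisor]
    for i in range(len(bits)):            # slide the divisor along the message
        if row[i]:                        # subtract (XOR) when the lead bit is 1
            for j, d in enumerate(div):
                row[i + j] ^= d
    return "".join(map(str, row[-n:]))

# A short message with a 5-bit generator (x^4 + x + 1 -> 10011):
print(crc_remainder("1101011011", "10011"))        # → 1110
# Receiver check: the message with its CRC appended leaves a zero remainder.
print(crc_remainder("11010110111110", "10011"))    # → 0000
```

Note that because T(x) is divisible by G(x) whenever the CRC is appended correctly, T(x)·x^n is divisible too, so the same zero-appending routine works for the receiver check.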

Requirements of CRC :

A CRC will be valid if and only if it satisfies the following requirements:

1. It should have exactly one bit less than the divisor.


2. Appending the CRC to the end of the data unit should result in a bit sequence which is
exactly divisible by the divisor.

The various steps followed in the CRC method are:

1. A string of n 0s is appended to the data unit. The predetermined divisor is n + 1 bits long.

2. The newly formed data unit, i.e. the original data plus the string of n 0s, is divided by the
divisor using binary division and the remainder is obtained. This remainder is called the CRC.
3. Now, the string of n 0s appended to the data unit is replaced by the CRC remainder (which is
also n bits long).

4. The data unit + CRC is then transmitted to receiver.

5. On receiving it, the receiver divides the data unit + CRC by the same divisor and checks the
remainder.

6. If the remainder of the division is zero, the receiver assumes that there is no error in the data
and accepts it.

7. If the remainder is non-zero, there is an error in the data and the receiver rejects it.
3. SAMPLE PROBLEM

Data word to be sent - 100100

Key - 1101 [or generator polynomial x^3 + x^2 + 1]

Sender Side:

Therefore, the remainder is 001 and hence the encoded data sent is 100100001.

Receiver Side:

Code word received at the receiver side 100100001

Therefore, the remainder is all zeros. Hence, the data received has no error.
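The worked example above can be checked with a short modulo-2 division routine (a sketch; the helper name is ours):

```python
def mod2_div(dividend: str, divisor: str) -> str:
    """Return the remainder of modulo-2 (XOR) division of two bit strings."""
    rem = [int(b) for b in dividend]
    div = [int(b) for b in divisor]
    for i in range(len(dividend) - len(divisor) + 1):
        if rem[i]:                        # subtract (XOR) when the lead bit is 1
            for j, d in enumerate(div):
                rem[i + j] ^= d
    return "".join(map(str, rem[-(len(divisor) - 1):]))

data, key = "100100", "1101"              # generator x^3 + x^2 + 1
crc = mod2_div(data + "000", key)         # sender appends n = 3 zero bits
codeword = data + crc
print(crc, codeword)                      # → 001 100100001
print(mod2_div(codeword, key))            # → 000, so the receiver accepts
```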
4. REFERENCES

https://en.wikipedia.org/wiki/Cyclic_redundancy_check

https://www.tutorialspoint.com/what-is-algorithm-for-computing-the-crc

https://www.geeksforgeeks.org/modulo-2-binary-division/

http://ecomputernotes.com/computernetworkingnotes/communication-
networks/cyclic-redundancy-check
HAMMING CODE
1. INTRODUCTION

In telecommunication, Hamming codes are a family of linear error-correcting


codes. Hamming codes can detect up to two-bit errors or correct one-bit errors without
detection of uncorrected errors. By contrast, the simple parity code cannot correct
errors, and can detect only an odd number of bits in error. Hamming codes are perfect
codes, that is, they achieve the highest possible rate for codes with their block length
and minimum distance of three.[1] Richard W. Hamming invented Hamming codes in
1950 as a way of automatically correcting errors introduced by punched card readers.
In his original paper, Hamming elaborated his general idea, but specifically focused on
the Hamming(7,4) code which adds three parity bits to four bits of data.[2]

In mathematical terms, Hamming codes are a class of binary linear codes. For
each integer r ≥ 2 there is a code with block length n = 2^r − 1 and message length
k = 2^r − r − 1. Hence the rate of Hamming codes is R = k / n = 1 − r / (2^r − 1), which is
the highest possible for codes with minimum distance of three (i.e., the minimal
number of bit changes needed to go from any code word to any other code word is
three) and block length 2^r − 1. The parity-check matrix of a Hamming code is
constructed by listing all columns of length r that are non-zero, which means that the
dual code of the Hamming code is the shortened Hadamard code. The parity-check
matrix has the property that any two columns are pairwise linearly independent.

Due to the limited redundancy that Hamming codes add to the data, they can
only detect and correct errors when the error rate is low. This is the case in computer
memory (ECC memory), where bit errors are extremely rare and Hamming codes are
widely used. In this context, an extended Hamming code having one extra parity bit is
often used. Extended Hamming codes achieve a Hamming distance of four, which
allows the decoder to distinguish between when at most one one-bit error occurs and
when any two-bit errors occur. In this sense, extended Hamming codes are single-error
correcting and double-error detecting, abbreviated as SECDED.
2. DISCUSSION

Errors and Error Correcting Codes

When bits are transmitted over a computer network, they are liable to be
corrupted due to interference and network problems. The corrupted bits lead to
spurious data being received by the receiver and are called errors.

Error-correcting codes (ECC) are a sequence of numbers generated by specific


algorithms for detecting and removing errors in data that has been transmitted over
noisy channels. Error correcting codes ascertain the exact number of bits that have been
corrupted and the location of the corrupted bits, within the limitations of the algorithm.

ECCs can be broadly categorized into two types −

 Block codes − The message is divided into fixed-sized blocks of bits, to which
redundant bits are added for error detection or correction.

 Convolutional codes − The message comprises data streams of arbitrary


length and parity symbols are generated by the sliding application of a Boolean
function to the data stream.

Hamming Code

Hamming code is a block code that is capable of detecting up to two simultaneous bit
errors and correcting single-bit errors. It was developed by R.W. Hamming for error
correction.

In this coding method, the source encodes the message by inserting redundant bits
within the message. These redundant bits are extra bits that are generated and
inserted at specific positions in the message itself to enable error detection and
correction. When the destination receives this message, it performs recalculations to
detect errors and find the bit position that has error.

Encoding a message by Hamming Code


The procedure used by the sender to encode the message encompasses the following
steps −

 Step 1 − Calculation of the number of redundant bits.

 Step 2 − Positioning the redundant bits.

 Step 3 − Calculating the values of each redundant bit.

Once the redundant bits are embedded within the message, this is sent to the user.

Step 1 − Calculation of the number of redundant bits.

If the message contains m data bits, r redundant bits are added to it so that
(m + r) bits are able to indicate at least (m + r + 1) different states: (m + r) states
indicate the location of an error in each of the (m + r) bit positions, and one
additional state indicates no error. Since r bits can indicate 2^r states, 2^r must be at
least equal to (m + r + 1). Thus the following equation should hold: 2^r ≥ m + r + 1
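The inequality above can be solved directly for r by trying successive values; a minimal sketch (function name is ours):

```python
def redundant_bits(m: int) -> int:
    """Smallest r satisfying 2**r >= m + r + 1."""
    r = 0
    while 2 ** r < m + r + 1:
        r += 1
    return r

print(redundant_bits(4))   # → 3  (the Hamming(7,4) code)
print(redundant_bits(7))   # → 4
```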

Step 2 − Positioning the redundant bits.

The r redundant bits are placed at bit positions that are powers of 2, i.e. 1, 2, 4, 8, 16, etc. They
are referred in the rest of this text as r1 (at position 1), r2 (at position 2), r3 (at position
4), r4 (at position 8) and so on.

Step 3 − Calculating the values of each redundant bit.

The redundant bits are parity bits. A parity bit is an extra bit that makes the number of
1s either even or odd. The two types of parity are −

 Even Parity − Here the total number of 1s in the message is made even.

 Odd Parity − Here the total number of 1s in the message is made odd.

Each redundant bit, ri, is calculated as the parity, generally even parity, based upon its
bit position. It covers all bit positions whose binary representation includes a 1 in the
ith position except the position of ri. Thus −
 r1 is the parity bit for all data bits in positions whose binary representation
includes a 1 in the least significant position excluding 1 (3, 5, 7, 9, 11 and so on)

 r2 is the parity bit for all data bits in positions whose binary representation
includes a 1 in the position 2 from right except 2 (3, 6, 7, 10, 11 and so on)

 r3 is the parity bit for all data bits in positions whose binary representation
includes a 1 in the position 3 from right except 4 (5-7, 12-15, 20-23 and so on)

Decoding a message in Hamming Code

Once the receiver gets an incoming message, it performs recalculations to detect errors
and correct them. The steps for recalculation are −

 Step 1 − Calculation of the number of redundant bits.

 Step 2 − Positioning the redundant bits.

 Step 3 − Parity checking.

 Step 4 − Error detection and correction

Step 1 − Calculation of the number of redundant bits

Using the same formula as in encoding, the number of redundant bits is ascertained.

2^r ≥ m + r + 1, where m is the number of data bits and r is the number of redundant
bits.

Step 2 − Positioning the redundant bits

The r redundant bits are placed at bit positions that are powers of 2, i.e. 1, 2, 4, 8, 16, etc.

Step 3 − Parity checking

Parity bits are calculated based upon the data bits and the redundant bits using the
same rule as during generation of c1, c2, c3, c4, etc. Thus

c1 = parity(1, 3, 5, 7, 9, 11 and so on)


c2 = parity(2, 3, 6, 7, 10, 11 and so on)

c3 = parity(4-7, 12-15, 20-23 and so on)

Step 4 − Error detection and correction

The decimal equivalent of the parity bits' binary values is calculated. If it is 0, there is no
error. Otherwise, the decimal value gives the bit position which has the error. For example,
if c1c2c3c4 = 1001, it implies that the data bit at position 9, the decimal equivalent of 1001,
has an error. The bit is flipped to get the correct message.

Hamming code

Hamming code is a set of error-correction codes that can be used to detect and
correct bit errors that can occur when computer data is moved or stored. Hamming code
is named for R. W. Hamming of Bell Labs.

Like other error-correction codes, Hamming code makes use of the concept
of parity and parity bits, which are bits that are added to data so that the validity of the
data can be checked when it is read or after it has been received in a data
transmission. Using more than one parity bit, an error-correction code can not only
identify a single bit error in the data unit, but also its location in the data unit.

In data transmission, the ability of a receiving station to correct errors in the received
data is called forward error correction (FEC) and can increase throughput on a data link
when there is a lot of noise present. To enable this, a transmitting station must add
extra data (called error correction bits) to the transmission. However, the correction
may not always represent a cost saving over that of simply resending the information.
Hamming codes make FEC less expensive to implement through the use of a block
parity mechanism.

Computing parity involves counting the number of ones in a unit of data, and adding
either a zero or a one (called a parity bit) to make the count odd (for odd parity) or
even (for even parity). For example, 1001 is a 4-bit data unit containing two one bits;
since that is an even number, a zero would be added to maintain even parity, or, if odd
parity was being maintained, another one would be added. To calculate even parity,
the XOR operator is used; to calculate odd parity, the XNOR operator is used. Single bit
errors are detected when the parity count indicates that the number of ones is
incorrect, indicating that a data bit has been flipped by noise in the line. Hamming
codes detect two bit errors by using more than one parity bit, each of which is
computed on different combinations of bits in the data. The number of parity bits
required depends on the number of bits in the data transmission, and is calculated by
the Hamming rule:

d + p + 1 ≤ 2^p     (1)

where d is the number of data bits and p is the number of parity bits. The total of the
two is called the Hamming code word, which is generated by multiplying the data bits
by a generator matrix.

General algorithm

The following general algorithm generates a single-error correcting (SEC) code for any
number of bits. The main idea is to choose the error-correcting bits such that the index-
XOR (the XOR of all the bit positions containing a 1) is 0. We use positions 1, 10, 100,
etc (in binary) as the error-correcting bits, which guarantees it is possible to set the
error-correcting bits so that the index-XOR of the whole message is 0. If the receiver
receives a string with index-XOR 0, they can conclude there were no corruptions, and
otherwise, the index-XOR indicates the index of the corrupted bit.

The following steps implement this algorithm:

1. Number the bits starting from 1: bit 1, 2, 3, 4, 5, 6, 7, etc.

2. Write the bit numbers in binary: 1, 10, 11, 100, 101, 110, 111, etc.
3. All bit positions that are powers of two (have a single 1 bit in the binary form of
their position) are parity bits: 1, 2, 4, 8, etc. (1, 10, 100, 1000)

4. All other bit positions, with two or more 1 bits in the binary form of their
position, are data bits.

5. Each data bit is included in a unique set of 2 or more parity bits, as determined
by the binary form of its bit position.

1. Parity bit 1 covers all bit positions which have the least significant bit set:
bit 1 (the parity bit itself), 3, 5, 7, 9, etc.

2. Parity bit 2 covers all bit positions which have the second least significant
bit set: bit 2 (the parity bit itself), 3, 6, 7, 10, 11, etc.

3. Parity bit 4 covers all bit positions which have the third least significant bit
set: bits 4–7, 12–15, 20–23, etc.

4. Parity bit 8 covers all bit positions which have the fourth least significant
bit set: bits 8–15, 24–31, 40–47, etc.

5. In general each parity bit covers all bits where the bitwise AND of the
parity position and the bit position is non-zero.

If a byte of data to be encoded is 10011010, then the data word (using _ to represent
the parity bits) would be __1_001_1010, and the code word is 011100101010.

The form of the parity is irrelevant. Even parity is mathematically simpler, but there is
no difference in practice.
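The numbered steps above can be sketched in Python. This is a minimal illustration of the index-XOR construction (function names are ours), reproducing the 10011010 → 011100101010 example:

```python
def hamming_encode(data: str) -> str:
    """Place data bits at non-power-of-two positions (1-indexed), then set the
    parity bits at positions 1, 2, 4, ... so the index-XOR of all 1-bits is 0."""
    m = len(data)
    r = 0
    while 2 ** r < m + r + 1:          # number of parity bits needed
        r += 1
    n = m + r
    code = [0] * (n + 1)               # 1-indexed; slot 0 is unused
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):            # not a power of two -> data position
            code[pos] = int(next(bits))
    syndrome = 0
    for pos in range(1, n + 1):        # XOR of positions holding a 1
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                 # write the syndrome into parity slots
        code[1 << i] = (syndrome >> i) & 1
    return "".join(map(str, code[1:]))

def hamming_correct(word: str) -> str:
    """Flip back the single-bit error, if any (the index-XOR is its position)."""
    bits = [0] + [int(b) for b in word]
    syndrome = 0
    for pos in range(1, len(bits)):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1
    return "".join(map(str, bits[1:]))

codeword = hamming_encode("10011010")
print(codeword)                        # → 011100101010, as in the example above
corrupted = codeword[:4] + str(1 - int(codeword[4])) + codeword[5:]
print(hamming_correct(corrupted))      # → 011100101010 (bit 5 flipped back)
```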
3. SAMPLE PROBLEM

Examples for Hamming code:

The message you want to send is 4-bits string:

0000 0001 0010 0011

0100 0101 0110 0111

1000 1001 1010 1011

1100 1101 1110 1111

There are 16 different messages.

The 4-bits messages are mapped to the following Sixteen Valid Codewords

0 0000000 8 1001011

1 0000111 9 1001100

2 0011001 A 1010010

3 0011110 B 1010101

4 0101010 C 1100001

5 0101101 D 1100110

6 0110011 E 1111000

7 0110100 F 1111111

The Hamming Code essentially defines 16 valid codewords. The sixteen words are arranged
such that the minimum distance between any two words is 3.

Check the hamming equation:

M=4, R=3, N=7

Left side: (M+R+1)*(2^M)=8*16=128

Right side: 2^N=128

Perfect match!

Exercise 1: Calculate the Hamming distance between any two codewords in the above table.
The sender will only send one of these 16 valid codewords. For example, the sender will never
send 0000001, which is not a valid codeword.

Due to transmission errors, the receiver might receive invalid codewords. Since the code
transmitted is 7 bits long, the total number of possible codes is 128.

When a code is received, the receiver will look for the closest valid codeword as a guess for
what was actually transmitted.

Decoding at the Receiver Side

For example: if the sender sends m=0000000, and the last bit is inverted due to a transmission
error, the receiver receives r=0000001. The receiver will calculate the Hamming distance
between r and all valid codewords. The codeword with the smallest Hamming distance will be
the one.

In fact, the table of D(0000001, x) is

Code word D(r, x) Code word D(r, x)

0000000 1 1001011 3

0000111 2 1001100 4

0011001 2 1010010 4

0011110 5 1010101 3

0101010 4 1100001 2

0101101 3 1100110 5

0110011 3 1111000 5

0110100 4 1111111 6

Thus the receiver concludes that the actual transmitted code is 0000000, which is correct.
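The table lookup just described is minimum-distance (nearest-codeword) decoding, which can be sketched directly from the codeword table above (helper names are ours):

```python
CODEWORDS = ["0000000", "0000111", "0011001", "0011110",
             "0101010", "0101101", "0110011", "0110100",
             "1001011", "1001100", "1010010", "1010101",
             "1100001", "1100110", "1111000", "1111111"]

def hamming_distance(a: str, b: str) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(received: str) -> str:
    """Pick the valid codeword closest to the received word."""
    return min(CODEWORDS, key=lambda c: hamming_distance(received, c))

print(nearest("0000001"))   # → 0000000, at distance 1, as in the table above
# The minimum distance of the code is 3, so any single-bit error is corrected:
print(min(hamming_distance(a, b)
          for a in CODEWORDS for b in CODEWORDS if a != b))   # → 3
```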

4. REFERENCES

https://www.tutorialspoint.com/error-correcting-codes-hamming-codes

https://en.wikipedia.org/wiki/Hamming_code

https://whatis.techtarget.com/definition/Hamming-code
REED SOLOMON

1. INTRODUCTION

Reed–Solomon codes are a group of error-correcting codes that were introduced


by Irving S. Reed and Gustave Solomon in 1960.[1] They have many applications, the most
prominent of which include consumer technologies such as CDs, DVDs, Blu-ray discs, QR
codes, data transmission technologies such as DSL and WiMAX, broadcast systems such as
satellite communications, DVB and ATSC, and storage systems such as RAID 6.

Reed–Solomon codes operate on a block of data treated as a set of finite field elements
called symbols. For example, a block of 4096 bytes (32,768 bits) could be treated as a set of
2731 12-bit symbols, where each symbol is a finite field element of GF(212), the last symbol
padded with four 0 bits. Reed–Solomon codes are able to detect and correct multiple symbol
errors. By adding t check symbols to the data, a Reed–Solomon code can detect any
combination of up to and including t erroneous symbols, or correct up to and
including ⌊t/2⌋ symbols. As an erasure code, it can correct up to and including t known
erasures, or it can detect and correct combinations of errors and erasures. Reed–Solomon
codes are also suitable as multiple-burst bit-error correcting codes, since a sequence
of b + 1 consecutive bit errors can affect at most two symbols of size b. The choice of t is up to
the designer of the code, and may be selected within wide limits. Error correcting codes are a
signal processing technique to correct errors. They are nowadays ubiquitous, such as in
communications (mobile phone, internet), data storage and archival (hard drives, optical discs
CD/DVD/BluRay, archival tapes), warehouse management (barcodes) and advertisement (QR
codes). Reed–Solomon error correction is a specific type of error correction code. It is one of
the oldest but it is still widely used, as it is very well defined and several efficient algorithms are
now available in the public domain.

Usually, error correction codes are hidden and most users do not even know about
them, nor when they are used. Yet, they are a critical component for some applications to be
viable, such as communication or data storage. Indeed, a hard drive that randomly lost
data every few days would be useless, and a phone able to place calls only on days with
cloudless weather would be seldom used. Using error correction codes allows a corrupted
message to be recovered into the full original message.
2. DISCUSSION

Reed–Solomon codes were developed in 1960 by Irving S. Reed and Gustave Solomon,
who were then staff members of MIT Lincoln Laboratory. Their seminal article was titled
"Polynomial Codes over Certain Finite Fields". (Reed & Solomon 1960). The original encoding
scheme described in the Reed & Solomon article used a variable polynomial based on the
message to be encoded where only a fixed set of values (evaluation points) to be encoded are
known to encoder and decoder. The original theoretical decoder generated potential
polynomials based on subsets of k (unencoded message length) out of n (encoded message
length) values of a received message, choosing the most popular polynomial as the correct one,
which was impractical for all but the simplest of cases. This was initially resolved by changing
the original scheme to a BCH code like scheme based on a fixed polynomial known to both
encoder and decoder, but later, practical decoders based on the original scheme were
developed, although slower than the BCH schemes. The result of this is that there are two main
types of Reed Solomon codes, ones that use the original encoding scheme, and ones that use
the BCH encoding scheme.

Also in 1960, a practical fixed polynomial decoder for BCH codes developed by Daniel
Gorenstein and Neal Zierler was described in an MIT Lincoln Laboratory report by Zierler in
January 1960 and later in a paper in June 1961.[2] The Gorenstein–Zierler decoder and the
related work on BCH codes are described in a book Error Correcting Codes by W. Wesley
Peterson (1961).[3] By 1963 (or possibly earlier), J. J. Stone (and others) recognized that Reed
Solomon codes could use the BCH scheme of using a fixed generator polynomial, making such
codes a special class of BCH codes,[4] but Reed Solomon codes based on the original encoding
scheme, are not a class of BCH codes, and depending on the set of evaluation points, they are
not even cyclic codes.

In 1969, an improved BCH scheme decoder was developed by Elwyn


Berlekamp and James Massey, and is since known as the Berlekamp–Massey decoding
algorithm.

In 1975, another improved BCH scheme decoder was developed by Yasuo Sugiyama,
based on the extended Euclidean algorithm.[5]
In 1977, Reed–Solomon codes were implemented in the Voyager program in the form
of concatenated error correction codes. The first commercial application in mass-produced
consumer products appeared in 1982 with the compact disc, where two interleaved Reed–
Solomon codes are used. Today, Reed–Solomon codes are widely implemented in digital
storage devices and digital communication standards, though they are being slowly replaced by
more modern low-density parity-check (LDPC) codes or turbo codes. For example, Reed–
Solomon codes are used in the Digital Video Broadcasting (DVB) standard DVB-S, but LDPC
codes are used in its successor, DVB-S2.

In 1986, an original scheme decoder known as the Berlekamp–Welch algorithm was


developed.

In 1996, variations of original scheme decoders called list decoders or soft decoders
were developed by Madhu Sudan and others, and work continues on these types of decoders –
see Guruswami–Sudan list decoding algorithm.

In 2002, another original scheme decoder was developed by Shuhong Gao, based on
the extended Euclidean algorithm.

Data storage

Reed–Solomon coding is very widely used in mass storage systems to correct the burst
errors associated with media defects.

Reed–Solomon coding is a key component of the compact disc. It was the first use of
strong error correction coding in a mass-produced consumer product, and DAT and DVD use
similar schemes. In the CD, two layers of Reed–Solomon coding separated by a 28-
way convolutional interleaver yield a scheme called Cross-Interleaved Reed–Solomon Coding
(CIRC). The first element of a CIRC decoder is a relatively weak inner (32,28) Reed–Solomon
code, shortened from a (255,251) code with 8-bit symbols. This code can correct up to 2 byte
errors per 32-byte block. More importantly, it flags as erasures any uncorrectable blocks, i.e.,
blocks with more than 2 byte errors. The decoded 28-byte blocks, with erasure indications, are
then spread by the deinterleaver to different blocks of the (28,24) outer code. Thanks to the
deinterleaving, an erased 28-byte block from the inner code becomes a single erased byte in
each of 28 outer code blocks. The outer code easily corrects this, since it can handle up to 4
such erasures per block.
The result is a CIRC that can completely correct error bursts up to 4000 bits, or about
2.5 mm on the disc surface. This code is so strong that most CD playback errors are almost
certainly caused by tracking errors that cause the laser to jump track, not by uncorrectable
error bursts.

DVDs use a similar scheme, but with much larger blocks, a (208,192) inner code, and a
(182,172) outer code.

Reed–Solomon error correction is also used in parchive files which are commonly posted
accompanying multimedia files on USENET. The distributed online storage
service Wuala (discontinued in 2015) also used to make use of Reed–Solomon when breaking
up files.

Bar code

Almost all two-dimensional bar codes such as PDF-417, MaxiCode, Datamatrix, QR Code,
and Aztec Code use Reed–Solomon error correction to allow correct reading even if a portion of
the bar code is damaged. When the bar code scanner cannot recognize a bar code symbol, it
will treat it as an erasure.

Reed–Solomon coding is less common in one-dimensional bar codes, but is used by


the PostBar symbology.

Data transmission

Specialized forms of Reed–Solomon codes, specifically Cauchy-RS and Vandermonde-RS,


can be used to overcome the unreliable nature of data transmission over erasure channels. The
encoding process assumes an RS(N, K) code, which generates N codewords of N symbols
each, storing K symbols of data; these are then sent over an erasure channel.

Any combination of K codewords received at the other end is enough to reconstruct all
of the N codewords. The code rate is generally set to 1/2 unless the channel's erasure likelihood
can be adequately modelled and is seen to be lower. Consequently, N is usually 2K, meaning that
at least half of all the codewords sent must be received in order to reconstruct all of the
codewords sent.
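The erasure-recovery property described above can be illustrated with the "original view" of RS encoding: treat the K data symbols as polynomial coefficients, evaluate at N points, and rebuild the polynomial from any K surviving points by Lagrange interpolation. The sketch below works over the prime field GF(257) for readability (real RS codes use GF(2^m)); the function names and symbol values are illustrative:

```python
P = 257  # a prime modulus; practical RS codes work in GF(2^m) instead

def rs_encode(data, n):
    """Evaluate the degree-(k-1) polynomial with coefficients `data`
    at the points x = 0 .. n-1, yielding n codeword symbols."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(data)) % P
            for x in range(n)]

def rs_recover(points, k):
    """Lagrange-interpolate from any k surviving (x, y) pairs and return
    the k polynomial coefficients, i.e. the original data symbols."""
    pts = points[:k]
    coeffs = [0] * k
    for xj, yj in pts:
        basis, denom = [1], 1               # basis polynomial L_j(x)
        for xm, _ in pts:
            if xm == xj:
                continue
            # multiply basis by (x - xm), tracked as a coefficient list
            out = [0] * (len(basis) + 1)
            for i, c in enumerate(basis):
                out[i + 1] = (out[i + 1] + c) % P
                out[i] = (out[i] - xm * c) % P
            basis = out
            denom = denom * (xj - xm) % P
        scale = yj * pow(denom, -1, P) % P  # modular inverse (Python 3.8+)
        for i, c in enumerate(basis):
            coeffs[i] = (coeffs[i] + scale * c) % P
    return coeffs

data = [72, 105, 33]                        # K = 3 message symbols
code = rs_encode(data, 6)                   # N = 6: survives any 3 erasures
survivors = [(x, code[x]) for x in (1, 3, 5)]
print(rs_recover(survivors, 3) == data)     # → True
```

Any 3 of the 6 transmitted symbols suffice, matching the "any K of N codewords" claim above.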

Reed–Solomon codes are also used in xDSL systems and CCSDS's Space
Communications Protocol Specifications as a form of forward error correction.
Space transmission

Deep-space concatenated coding system.

One significant application of Reed–Solomon coding was to encode the digital pictures
sent back by the Voyager space probe.

Voyager introduced Reed–Solomon coding concatenated with convolutional codes, a


practice that has since become very widespread in deep space and satellite (e.g., direct digital
broadcasting) communications.

Viterbi decoders tend to produce errors in short bursts. Correcting these burst errors is a
job best done by short or simplified Reed–Solomon codes.

Modern versions of concatenated Reed–Solomon/Viterbi-decoded convolutional coding


were and are used on the Mars Pathfinder, Galileo, Mars Exploration
Rover and Cassini missions, where they perform within about 1–1.5 dB of the ultimate limit,
being the Shannon capacity.

These concatenated codes are now being replaced by more powerful turbo codes.

Principles of error correction

Before detailing the code, it might be useful to understand the intuition behind error
correction. Indeed, although error correcting codes may seem daunting mathematically-wise,
most of the mathematical operations are high school grade (with the exception of Galois Fields,
but which are in fact easy and common for any programmer: it's simply doing operations on
integers modulo a number). However, the complexity of the mathematical ingenuity behind
error correction codes hides the quite intuitive goals and mechanisms at play.
Error correcting codes might seem like a difficult mathematical concept, but they are in
fact based on an intuitive idea with an ingenious mathematical implementation: let's make the
data structured, in a way that we can "guess" what the data was if it gets corrupted, just by
"fixing" the structure. Mathematically-wise, we use polynomials from the Galois Field to
implement this structure.

Let's take a more practical analogy: let's say you want to communicate messages to
someone else, but these messages can get corrupted along the way. The main insight of error
correcting codes is that, instead of using a whole dictionary of words, we can use a smaller set
of carefully selected words, a "reduced dictionary", so that each word is as different as any
other. This way, when we get a message, we just have to look up inside our reduced dictionary
to 1) detect which words are corrupted (as they are not in our reduced dictionary); 2)
correct corrupted words by finding the most similar word in our dictionary.

Let's take a simple example: we have a reduced dictionary with only three words of 4
letters: this, that and corn. Let's say we receive a corrupted word: co**, where * is an erasure.
Since we have only 3 words in our dictionary, we can easily compare our received word with
our dictionary to find the word that is the closest. In this case, it's corn. Thus the missing letters
are rn.

Now let's say we receive the word th**. Here the problem is that we have two words in
our dictionary that match the received word: this and that. In this case, we cannot be sure
which one it is, and thus we cannot decode. This means that our dictionary is not very good,
and we should replace that with another more different word, such as dash to maximize the
difference between each word. This difference, or more precisely the minimum number of
differing letters between any 2 words of our dictionary, is called the minimum Hamming
distance of our dictionary. Making sure that any 2 words of the dictionary share only a minimum
number of letters at the same position is called maximum separability.
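Computing this minimum pairwise Hamming distance is straightforward; a quick Python sketch (helper names are our own) shows why swapping "that" for "dash" improves the dictionary:

```python
from itertools import combinations

def hamming(a, b):
    # Number of positions at which two equal-length words differ
    return sum(x != y for x, y in zip(a, b))

def min_distance(dictionary):
    # Minimum Hamming distance over all pairs of dictionary words
    return min(hamming(a, b) for a, b in combinations(dictionary, 2))

print(min_distance(["this", "that", "corn"]))  # 2 ('this' vs 'that')
print(min_distance(["this", "dash", "corn"]))  # 4 (every pair differs in all positions)
```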

The same principle is used for most error correcting codes: we generate only a reduced
dictionary containing only words with maximum separability (we will detail more how to do that
in the third section), and then we communicate only with the words of this reduced dictionary.
What Galois Fields provide is the structure (i.e., the reduced dictionary basis), and Reed–Solomon is
a way to automatically create a suitable structure (make a reduced dictionary with maximum
separability tailored for a dataset), as well as provide the automated methods to detect and
correct errors (i.e., lookups in the reduced dictionary). To be more precise, Galois Fields are the
structure (thanks to their cyclic nature, the modulo an integer) and Reed–Solomon is the codec
(encoder/decoder) based on Galois Fields.

If a word gets corrupted in the communication, that's no big deal since we can easily fix
it by looking inside our dictionary and finding the closest word, which is probably the correct one
(there is however a chance of choosing a wrong one if the input message is too heavily
corrupted, but the probability is very small). Also, the longer our words are, the more separable
they are, since more characters can be corrupted without any impact.

3. SAMPLE PROBLEM

Example 1.

The following nine 4-tuples over F3 form a (4, 2, 3) linear code over the ternary field F3
= {0, 1, 2}, with generators g1 = (1110) and g2 = (0121):

C = {0000, 1110, 2220, 0121, 1201, 2011, 0212, 1022, 2102}.

By the group (permutation) property, C − c = C for any codeword c ∈ C, so the set of
Hamming distances between any codeword c ∈ C and all other codewords is independent of c.
The minimum Hamming distance d between codewords in a linear code C is thus equal to the
minimum Hamming weight of any nonzero codeword (the minimum distance between 0 and
any other codeword). An (n, k) linear code over Fq with minimum Hamming distance d is called
an (n, k, d) linear code.

More generally, the group property shows that the number of codewords at Hamming
distance w from any codeword c ∈ C is the number Nw of codewords of Hamming weight w in
C.
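The codewords of Example 1 can be checked by enumerating all F3-linear combinations of the two generators (an illustrative Python sketch; variable names are our own):

```python
g1 = (1, 1, 1, 0)
g2 = (0, 1, 2, 1)

# All linear combinations a*g1 + b*g2 over the ternary field F3 = {0, 1, 2}
C = sorted({tuple((a * x + b * y) % 3 for x, y in zip(g1, g2))
            for a in range(3) for b in range(3)})

weights = [sum(symbol != 0 for symbol in codeword) for codeword in C]

print(len(C))                            # 9 codewords (q^k = 3^2)
print(min(w for w in weights if w > 0))  # 3 = minimum distance d
print(weights.count(3))                  # N3 = 8: every nonzero codeword has weight 3
```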

Much of classical algebraic coding theory has been devoted to optimizing the parameters
(n, k, d); i.e., maximizing the size q^k of the code for a given length n, minimum distance d and
field Fq, or maximizing d for a given n, k and q. The practical motivation for this research has
been to maximize the guaranteed error-correction power of the code. Because Hamming
distance is a metric satisfying the triangle inequality, a code with Hamming distance d is
guaranteed to correct t symbol errors whenever 2t < d, or in fact to correct t errors and s
erasures whenever 2t + s < d. This elementary metric is not the whole story with regard to
performance on an AWGN channel, nor does it take into account decoding complexity; however,
it is a good first measure of code power.
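The bounds 2t &lt; d (errors only) and 2t + s &lt; d (errors plus erasures) reduce to simple arithmetic; a small sketch (the helper name is our own) makes the guarantees concrete:

```python
def correction_capability(d):
    # Errors-only: largest t with 2t < d.  Erasures-only: largest s with s < d.
    t_max = (d - 1) // 2
    s_max = d - 1
    return t_max, s_max

print(correction_capability(3))  # (1, 2): fix 1 symbol error, or 2 erasures
print(correction_capability(5))  # (2, 4)
```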

4. REFERENCES

https://web.stanford.edu/class/ee392d/Chap8.pdf

https://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction

https://en.wikiversity.org/wiki/Reed%E2%80%93Solomon_codes_for_coders

https://www.sciencedirect.com/topics/engineering/reed-solomon-code
QAM

1. INTRODUCTION

Quadrature amplitude modulation (QAM) is the name of a family of digital modulation
methods and a related family of analog modulation methods widely used
in modern telecommunications to transmit information. It conveys two analog message signals,
or two digital bit streams, by changing (modulating) the amplitudes of two carrier waves, using
the amplitude-shift keying (ASK) digital modulation scheme or amplitude modulation (AM)
analog modulation scheme. The two carrier waves of the same frequency are out of phase with
each other by 90°, a condition known as orthogonality or quadrature. The transmitted signal is
created by adding the two carrier waves together. At the receiver, the two waves can be
coherently separated (demodulated) because of their orthogonality property. Another key
property is that the modulations are low-frequency/low-bandwidth waveforms compared to the
carrier frequency, which is known as the narrowband assumption.

Phase modulation (analog PM) and phase-shift keying (digital PSK) can be regarded as a
special case of QAM, where the amplitude of the transmitted signal is a constant, but its phase
varies. This can also be extended to frequency modulation (FM) and frequency-shift
keying (FSK), for these can be regarded as a special case of phase modulation.

QAM is used extensively as a modulation scheme for digital telecommunication systems,
such as in 802.11 Wi-Fi standards. Arbitrarily high spectral efficiencies can be achieved with
QAM by setting a suitable constellation size, limited only by the noise level and linearity of the
communications channel.[1] QAM is being used in optical fiber systems as bit rates increase;
QAM16 and QAM64 can be optically emulated with a 3-path interferometer.

Quadrature Amplitude Modulation, QAM utilises both amplitude and phase components
to provide a form of modulation that is able to provide high levels of spectrum usage efficiency.

QAM, quadrature amplitude modulation has been used for some analogue transmissions
including AM stereo transmissions, but it is for data applications where it has come into its own.
It is able to provide a highly effective form of modulation for data and as such it is used in
everything from cellular phones to Wi-Fi and almost every other form of high speed data
communications system.
2. DISCUSSION

What is QAM, quadrature amplitude modulation

Quadrature Amplitude Modulation, QAM is a signal in which two carriers shifted in phase
by 90 degrees (i.e. sine and cosine) are modulated and combined. As a result of their 90°
phase difference they are in quadrature and this gives rise to the name. Often one signal is
called the In-phase or “I” signal, and the other is the quadrature or “Q” signal.

The resultant overall signal consisting of the combination of both I and Q carriers
contains both amplitude and phase variations. In view of the fact that both amplitude and
phase variations are present it may also be considered as a mixture of amplitude and phase
modulation.

A motivation for the use of quadrature amplitude modulation comes from the fact that a
straight amplitude modulated signal, i.e. double sideband even with a suppressed carrier
occupies twice the bandwidth of the modulating signal. This is very wasteful of the available
frequency spectrum. QAM restores the balance by placing two independent double sideband
suppressed carrier signals in the same spectrum as one ordinary double sideband suppressed
carrier signal.

Analogue and digital QAM

Quadrature amplitude modulation, QAM may exist in what may be termed either
analogue or digital formats. The analogue versions of QAM are typically used to allow multiple
analogue signals to be carried on a single carrier. For example it is used in PAL and NTSC
television systems, where the different channels provided by QAM enable it to carry the
components of chroma or colour information. In radio applications a system known as C-QUAM
is used for AM stereo radio. Here the different channels enable the two channels required for
stereo to be carried on the single carrier.

Digital formats of QAM are often referred to as "Quantised QAM" and they are being
increasingly used for data communications often within radio communications systems. Radio
communications systems ranging from cellular technology as in the case of LTE through
wireless systems including WiMAX, and Wi-Fi 802.11 use a variety of forms of QAM, and the use
of QAM will only increase within the field of radio communications.

Digital / Quantised QAM basics

Quadrature amplitude modulation, QAM, when used for digital transmission for radio
communications applications is able to carry higher data rates than ordinary amplitude
modulated schemes and phase modulated schemes.

Basic signals exhibit only two positions which allow the transfer of either a 0 or 1. Using
QAM there are many different points that can be used, each having defined values of phase and
amplitude. Plotted together, these points form what is known as a constellation diagram. The
different positions are assigned different values, and in this way a single signal is able to
transfer data at a much higher rate.
As shown above, the constellation points are typically arranged in a square grid with
equal horizontal and vertical spacing. Although data is binary, the most common forms of QAM,
though not all, are those where the constellation forms a square with the number of points
equal to a power of 2, i.e. 4, 16, 64, . . . , giving 16QAM, 64QAM, etc.
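Such a square constellation grid can be generated with a short sketch (illustrative Python; square_qam is our own helper, assuming the constellation size is a perfect square):

```python
import math

def square_qam(m):
    # Square m-QAM: sqrt(m) equally spaced amplitude levels on each axis
    side = math.isqrt(m)
    assert side * side == m, "square QAM needs m to be a perfect square"
    levels = [2 * k - (side - 1) for k in range(side)]  # e.g. [-3, -1, 1, 3] for 16QAM
    return [complex(i, q) for i in levels for q in levels]

points = square_qam(16)
print(len(points))  # 16 points, i.e. log2(16) = 4 bits per symbol
```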

By using higher order modulation formats, i.e. more points on the constellation, it is
possible to transmit more bits per symbol. The downside is that the constellation points are
closer together and the link is therefore more susceptible to noise and data errors. As a
result, higher order versions of QAM are only used when there is a sufficiently high signal to
noise ratio.

To provide an example of how QAM operates, the constellation diagram below shows
the values associated with the different states for a 16QAM signal. From this it can be seen that
a continuous bit stream may be grouped into fours and represented as a sequence.

Normally the lowest order QAM encountered is 16QAM. The reason for this being the
lowest order normally encountered is that 2QAM is the same as binary phase-shift keying,
BPSK, and 4QAM is the same as quadrature phase-shift keying, QPSK.

Additionally 8QAM is not widely used. This is because error-rate performance of 8QAM is
almost the same as that of 16QAM - it is only about 0.5 dB better and the data rate is only
three-quarters that of 16QAM. This arises from the rectangular, rather than square shape of the
constellation.

QAM advantages and disadvantages

Although QAM appears to increase the efficiency of transmission for radio
communications systems by utilising both amplitude and phase variations, it has a number of
drawbacks. The first is that it is more susceptible to noise because the states are closer
together so that a lower level of noise is needed to move the signal to a different decision point.
Receivers for use with phase or frequency modulation are both able to use limiting amplifiers
that are able to remove any amplitude noise and thereby improve the noise resilience. This is not
the case with QAM.
The second limitation is also associated with the amplitude component of the signal.
When a phase or frequency modulated signal is amplified in a radio transmitter, there is no
need to use linear amplifiers, whereas when using QAM that contains an amplitude component,
linearity must be maintained. Unfortunately linear amplifiers are less efficient and consume
more power, and this makes them less attractive for mobile applications.

QAM vs PSK & other modes

When deciding on a form of modulation it is worth comparing QAM vs PSK and other
modes, looking at what they each have to offer.

As there are advantages and disadvantages of using QAM it is necessary to compare
QAM with other modes before making a decision about the optimum mode. Some radio
communications systems dynamically change the modulation scheme dependent upon the link
conditions and requirements - signal level, noise, data rate required, etc.

The table below compares various forms of modulation:

SUMMARY OF TYPES OF MODULATION WITH DATA CAPACITIES

MODULATION    BITS PER SYMBOL    ERROR MARGIN      COMPLEXITY
OOK           1                  1/2    (0.5)      Low
BPSK          1                  1      (1)        Medium
QPSK          2                  1/√2   (0.71)     Medium
16QAM         4                  √2/6   (0.23)     High
64QAM         6                  √2/14  (0.1)      High
Typically it is found that if data rates above those that can be achieved using 8-PSK are
required, it is more usual to use quadrature amplitude modulation. This is because it has a
greater distance between adjacent points in the I - Q plane and this improves its noise
immunity. As a result it can achieve the same data rate at a lower signal level.
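This claim can be checked numerically by comparing the minimum spacing between points of 16-PSK and square 16QAM constellations normalized to the same average power (an illustrative sketch; helper names are our own):

```python
import math
from itertools import combinations

def min_dist(points):
    # Smallest Euclidean distance between any two constellation points
    return min(abs(a - b) for a, b in combinations(points, 2))

def normalize(points):
    # Scale to unit average power for a fair comparison
    rms = math.sqrt(sum(abs(p) ** 2 for p in points) / len(points))
    return [p / rms for p in points]

psk16 = [complex(math.cos(2 * math.pi * k / 16), math.sin(2 * math.pi * k / 16))
         for k in range(16)]
qam16 = normalize([complex(i, q) for i in (-3, -1, 1, 3) for q in (-3, -1, 1, 3)])

print(round(min_dist(psk16), 3))  # 0.39
print(round(min_dist(qam16), 3))  # 0.632 - wider spacing at the same average power
```

At 4 bits per symbol, the square 16QAM grid spreads its points roughly 60% further apart than 16-PSK for the same average power, which is the noise-immunity advantage described above.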

However the points are no longer all of the same amplitude. This means that the demodulator
must detect both phase and amplitude. Also the fact that the amplitude varies means that a
linear amplifier is required to amplify the signal.

QAM theory basics

Quadrature amplitude theory states that both amplitude and phase change within a
QAM signal.

The basic way in which a QAM signal can be generated is to generate two signals that
are 90° out of phase with each other and then sum them. This will generate a signal that is the
sum of both waves, which has a certain amplitude resulting from the sum of both signals and a
phase which again is dependent upon the sum of the signals.

If the amplitude of one of the signals is adjusted then this affects both the phase and
amplitude of the overall signal, the phase tending towards that of the signal with the higher
amplitude content.

As there are two RF signals that can be modulated, these are referred to as the I - In-
phase and Q - Quadrature signals.

The I and Q signals can be represented by the equations below:

I = A cos(Ψ) and Q = A sin(Ψ)

It can be seen that the I and Q components are represented as cosine and sine. This is
because the two signals are 90° out of phase with one another.

Using the two equations together with the trigonometric identity

cos(α + β) = cos(α)cos(β) − sin(α)sin(β)

and writing the carrier signal as A cos(2πft + Ψ), it is possible to express the signal as:

A cos(2πft + Ψ) = I cos(2πft) − Q sin(2πft)

Where f is the carrier frequency.

This expression shows that the resulting waveform is a periodic signal whose phase can
be adjusted by changing the amplitude of either or both I and Q. This can also result in an
amplitude change.
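The modulation equation above, and its coherent demodulation by correlating against the two orthogonal carriers, can be sketched as follows (illustrative Python with arbitrary example values; a toy sketch, not a production modem):

```python
import math

CARRIER_HZ = 4    # illustrative carrier frequency
SAMPLES = 1000    # one second at 1 kHz: a whole number of carrier cycles

def modulate(i_amp, q_amp):
    # s(t) = I cos(2*pi*f*t) - Q sin(2*pi*f*t)
    return [i_amp * math.cos(2 * math.pi * CARRIER_HZ * n / SAMPLES)
            - q_amp * math.sin(2 * math.pi * CARRIER_HZ * n / SAMPLES)
            for n in range(SAMPLES)]

def demodulate(signal):
    # Correlate with each carrier; their orthogonality isolates I and Q
    i_amp = 2 / SAMPLES * sum(s * math.cos(2 * math.pi * CARRIER_HZ * n / SAMPLES)
                              for n, s in enumerate(signal))
    q_amp = -2 / SAMPLES * sum(s * math.sin(2 * math.pi * CARRIER_HZ * n / SAMPLES)
                               for n, s in enumerate(signal))
    return i_amp, q_amp

i_rec, q_rec = demodulate(modulate(3.0, -1.0))
print(round(i_rec, 6), round(q_rec, 6))  # 3.0 -1.0
```

Because cos and sin average to zero against each other over a whole number of cycles, each correlation recovers only its own amplitude, which is the quadrature property the text describes.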

Accordingly it is possible to digitally modulate a carrier signal by adjusting the amplitude
of the two mixed signals.

3. SAMPLE PROBLEM

Using the signal constellation shown, answer the following questions.

a) What type of modulation does this represent?

There are 16 symbols, all with the same amplitude but different phases, so this is 16-PSK.

b) How many symbols are represented (M)?

M = 16

c) How many bits per symbol are used (N)?


N = log2M = log216 = 4

d) If the Baud Rate is 10,000 symbols/second, what is the bit rate (Rb)?

Rb = Rs x N = 10,000 symbols/sec x 4 bits/symbol = 40 kbps.
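Parts (c) and (d) amount to simple arithmetic, which a quick Python check confirms:

```python
import math

M = 16                                # number of symbols in the constellation
bits_per_symbol = int(math.log2(M))   # N = log2(M)
baud_rate = 10_000                    # Rs, symbols per second
bit_rate = baud_rate * bits_per_symbol  # Rb = Rs x N

print(bits_per_symbol)  # 4
print(bit_rate)         # 40000 bps = 40 kbps
```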

e) Would 16-QAM be more or less susceptible to noise than this type of modulation?

If correctly designed, 16 QAM should in general be less susceptible to noise because the
symbols would be spread further apart. This makes it less likely for the receiver to make an
error.

4. REFERENCES

https://www.electronics-notes.com/articles/radio/modulation/quadrature-amplitude-modulation-
what-is-qam-basics.php

https://en.wikipedia.org/wiki/Quadrature_amplitude_modulation

https://www.electronics-notes.com/articles/radio/modulation/quadrature-amplitude-modulation-
qam-theory-formulas-equations.php
