
Text Processing

CPT212 – Design & Analysis of Algorithms


Part 1

Pattern/String Matching
Teh Je Sen (2018)
Learning Outcomes
• In this topic we will learn about
–Pattern/string matching
• Brute force
• Boyer-Moore
• Knuth-Morris-Pratt



Notations
• In text processing, character strings
are used as a model for text
• The characters of a string come from a
known alphabet, denoted as Σ
–For DNA, the alphabet includes
Σ = {𝐴, 𝐶, 𝐺, 𝑇}
–For English, the alphabet includes
Σ = {𝑎, 𝑏, … 𝑧, 𝐴, 𝐵, … 𝑍}
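
A minimal sketch of the two alphabets above, assuming we model Σ as a Python set. The names `DNA_ALPHABET`, `ENGLISH_ALPHABET` and `is_over_alphabet` are illustrative and not part of the slides.

```python
import string

# The two alphabets from the slide, modeled as sets of characters.
DNA_ALPHABET = set("ACGT")
ENGLISH_ALPHABET = set(string.ascii_letters)  # a-z and A-Z


def is_over_alphabet(s: str, alphabet: set) -> bool:
    """Return True if every character of s belongs to the alphabet."""
    return all(ch in alphabet for ch in s)


print(is_over_alphabet("GATTACA", DNA_ALPHABET))           # True
print(is_over_alphabet("hello there", ENGLISH_ALPHABET))   # False: the space is not in Σ
```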



Notations

Substring:
Any string that occurs in a larger string

A;wiejr;ajeonnv;aknsdg;aijwe;rija;dsfad

Prefix:
A substring at the beginning of a string

Suffix:
A substring at the end of a string
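
The three notions can be sketched with Python slicing. The helper names `prefixes` and `suffixes` are hypothetical, and the sample string is the one shown on the slide.

```python
def prefixes(s: str) -> list:
    """All non-empty prefixes of s, shortest first."""
    return [s[:i] for i in range(1, len(s) + 1)]


def suffixes(s: str) -> list:
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


sample = "A;wiejr;ajeonnv;aknsdg;aijwe;rija;dsfad"
print("ajeonnv" in sample)         # substring test
print(sample.startswith("A;wie"))  # prefix test
print(sample.endswith("dsfad"))    # suffix test
print(prefixes("the"))             # ['t', 'th', 'the']
print(suffixes("the"))             # ['the', 'he', 'e']
```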



Pattern Matching Problem

Given a text string of length 𝑛 and a pattern string of length 𝑚 < 𝑛,
determine if the pattern is a substring of the text.

Approaches:
• Brute Force Search Algorithm
• Boyer-Moore Algorithm
• Knuth-Morris-Pratt Algorithm
• Trie Data Structure



Brute Force Pattern Matching



Brute Force Pattern Matching

Brute Force Algorithms:
Enumerate all possible configurations of the inputs involved and pick the best of all these enumerated configurations

Brute Force Pattern Matching:
Enumerate all possible placements of the pattern relative to the text



Brute Force Pattern Matching



Brute Force Pattern Matching
Strings are stored as character arrays
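
A minimal sketch of brute-force matching, assuming the usual formulation in which 𝑖 is the placement of the pattern in the text and 𝑘 is the index into the pattern. The function name `brute_force_match` is an illustrative choice, and Python strings stand in for the character arrays.

```python
def brute_force_match(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):  # try every placement of the pattern
        k = 0
        while k < m and text[i + k] == pattern[k]:
            k += 1              # characters agree, extend the match
        if k == m:              # all m characters matched
            return i
    return -1


print(brute_force_match("Hello there", "the"))  # 6
```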



Example
• Illustrate the brute force algorithm to
search for a string pattern “the” in a
text string “Hello there”

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
(_ marks the space character)



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
t h e

No match
n = 11, m = 3, i = 0, k = 0



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
  t h e

No match
n = 11, m = 3, i = 1, k = 0



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
    t h e

No match
n = 11, m = 3, i = 2, k = 0



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
            t h e

Match
n = 11, m = 3, i = 6, k = 0



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
            t h e

Match
n = 11, m = 3, i = 6, k = 1



Example

0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
            t h e

Match
n = 11, m = 3, i = 6, k = 2



Example

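0 1 2 3 4 5 6 7 8 9 10
H e l l o _ t h e r e
            t h e

n = 11, m = 3, i = 6, k = 3
Return (k = m, so the pattern occurs at index i = 6)

What is the worst-case complexity of the brute force search?
O(m(n − m + 1)), which is O(nm): there are n − m + 1 placements and each can cost up to m comparisons.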
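
To make the worst case concrete, here is a small sketch that counts character comparisons. The function name `count_comparisons` and the test input (a text of repeated "a" and a pattern ending in "b") are assumptions chosen for illustration: every placement matches m − 1 characters before failing on the last one.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count character comparisons made by brute-force matching."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for k in range(m):
            comparisons += 1
            if text[i + k] != pattern[k]:
                break            # mismatch, move to the next placement
        else:
            return comparisons   # no break: full match found
    return comparisons


n, m = 20, 5
worst = count_comparisons("a" * n, "a" * (m - 1) + "b")
print(worst, m * (n - m + 1))  # 80 80
```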



Boyer-Moore Algorithm



Boyer-Moore Algorithm
Main Idea

Improve the brute-force algorithm by adding two potentially time-saving heuristics:
• Looking-Glass Heuristic
• Character-Jump Heuristic

The two heuristics work as a team



Boyer-Moore Algorithm
Looking-Glass Heuristic:
When testing a pattern against the text, perform comparisons against the pattern from right to left.

Character-Jump Heuristic:
If there is a mismatch of character 𝑡𝑒𝑥𝑡[𝑖] = 𝑐 with the corresponding character 𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑘], perform the following:
• If 𝒄 does not exist anywhere in the pattern, shift the pattern completely past 𝑡𝑒𝑥𝑡[𝑖] = 𝑐.
• Else, shift the pattern until an occurrence of the character 𝒄 in the pattern gets aligned with 𝑡𝑒𝑥𝑡[𝑖].
Boyer-Moore Algorithm
[Figure annotations]
• Start comparison from right to left (Looking-Glass)
• There is a mismatch, and the character 𝑐 = "e" does not exist in the pattern. Shift the whole pattern completely past 𝑐 (Character-Jump)
• There is a mismatch, but 𝑐 = "s" exists in the pattern. Shift until 𝒄 aligns with its occurrence in the pattern (Character-Jump)



Boyer-Moore Algorithm
Additional rules for character-jump heuristic

If a mismatched character occurs somewhere in the pattern, there are two sub-cases:
• The last occurrence of the character 𝑐 is on the right side of the current position: shift the pattern by one position
• The last occurrence of the character 𝑐 is on the left side of the current position: shift the pattern until the mismatched character is aligned with its match

This is to accommodate the 𝑙𝑎𝑠𝑡(𝑐) function


Boyer-Moore Algorithm
Step 1:
Text String:    aaaaebdaabadbda
Pattern String: dabacbd
Character “e” does not exist in the pattern. Shift past the mismatch.
Step 2:
Text String:    aaaaebdaabadbda
Pattern String:      dabacbd
Character “a” exists in the pattern. Shift until the last occurrence of “a” in the pattern string is aligned with it.
Step 3:
Text String:    aaaaebdaabadbda
Pattern String:        dabacbd
Character “d” exists in the pattern. However, the last occurrence of “d” is on the right. Shift by one position.
Step 4:
Text String:    aaaaebdaabadbda
Pattern String:         dabacbd
No match.
Boyer-Moore Algorithm
• The algorithm depends on checking
the last occurrence of the
mismatching character, 𝑐
• 𝑙𝑎𝑠𝑡 can be implemented as a map
• Each character (in the pattern and text string) is a key, and the index of its last occurrence in the pattern is the value
–If the character does not exist in the pattern, the value is set to 𝑙𝑎𝑠𝑡(𝑐) = −1



Boyer-Moore Algorithm



Boyer-Moore Algorithm
Initialize the 𝑙𝑎𝑠𝑡 map

Store the position of characters in the pattern in the map. Remember, if the key already exists, the value is overwritten.
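
A minimal sketch of building the 𝑙𝑎𝑠𝑡 map with a Python dict. The function name `build_last` is an assumption; absent characters are handled at lookup time with a default of −1 rather than being stored.

```python
def build_last(pattern: str) -> dict:
    """Map each character of the pattern to the index of its last occurrence."""
    last = {}
    for index, ch in enumerate(pattern):
        last[ch] = index  # a later occurrence overwrites an earlier one
    return last


last = build_last("dabacbd")
print(last)               # {'d': 6, 'a': 3, 'b': 5, 'c': 4}
print(last.get("e", -1))  # -1, because "e" is not in the pattern
```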



Boyer-Moore Algorithm

Calculate how much to shift
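
The shift used in the worked example that follows is 𝑖 = 𝑖 + 𝑚 − min(𝑘, 1 + 𝑙𝑎𝑠𝑡(𝑐)). A small sketch of that rule, where the helper name `next_text_index` is hypothetical and the map is the one for the pattern "dabacbd":

```python
def next_text_index(i: int, k: int, c: str, last: dict, m: int) -> int:
    """New text index after a mismatch of text[i] = c against pattern[k]."""
    return i + m - min(k, 1 + last.get(c, -1))


last = {"a": 3, "b": 5, "c": 4, "d": 6}  # last map for "dabacbd"
m = 7

print(next_text_index(4, 4, "e", last, m))   # 11: "e" is absent, jump past it
print(next_text_index(10, 5, "a", last, m))  # 13: align the last "a"
print(next_text_index(11, 4, "d", last, m))  # 14: last "d" is to the right, shift by one
```

After each jump, 𝑘 is reset to 𝑚 − 1 so that the comparison restarts from the right end of the pattern.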



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d

Fill the corresponding 𝒍𝒂𝒔𝒕 map

a b c d e



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d

Solution:

a b c d e
3 5 4 6 -1



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

What are the corresponding variables?
𝑛 =
𝑚 =
𝑖 =
𝑘 =



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖=6
𝑘=6





Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖=5
𝑘=5



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖=4
𝑘=4

Calculate 𝒊 and 𝒌


Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖 = 𝑖 + 7 − min(4, 1 + 𝑙𝑎𝑠𝑡.𝑔𝑒𝑡(𝑒)), where currently 𝑖 = 4
𝑘 = 7 − 1, where currently 𝑘 = 4



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖 = 𝑖 + 7 − min(4, 1 + (−1)), where currently 𝑖 = 4
𝑘 = 6 (previously 𝑘 = 4)



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚=7
𝑖 = 11 (previously 𝑖 = 4)
𝑘 = 6 (previously 𝑘 = 4)



Example
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
a a a a e b d a a b a d b d a
          d a b a c b d
a b c d e
3 5 4 6 -1

𝑛 = 15
𝑚 = 7
𝑖 = 11
𝑘 = 6

Complete the rest of it yourself to verify the correctness of the algorithm.
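
One way to check your hand trace is a short sketch of the search loop with the character-jump heuristic only. The function name `boyer_moore` and the extra list of pattern placements it returns are assumptions made for this illustration.

```python
def boyer_moore(text: str, pattern: str):
    """Return (index of first match or -1, list of pattern placements tried)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, []
    last = {ch: idx for idx, ch in enumerate(pattern)}  # last-occurrence map
    placements = []
    i = k = m - 1                      # start at the right end (looking-glass)
    while i < n:
        start = i - k                  # where the pattern currently begins
        if not placements or placements[-1] != start:
            placements.append(start)
        if text[i] == pattern[k]:
            if k == 0:
                return i, placements   # whole pattern matched
            i -= 1
            k -= 1
        else:
            # character-jump: shift according to the last occurrence of text[i]
            i += m - min(k, 1 + last.get(text[i], -1))
            k = m - 1
    return -1, placements


print(boyer_moore("aaaaebdaabadbda", "dabacbd"))  # (-1, [0, 5, 7, 8])
print(boyer_moore("Hello there", "the")[0])       # 6
```

The placements 0, 5, 7 and 8 correspond to Steps 1 to 4 of the earlier walkthrough, which ends with no match.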



Knuth-Morris-Pratt Algorithm



KMP Pattern Matching
• Basic concept:
–If the matched substring has a prefix
that is the same as the suffix, continue
the matching from the location right after
the prefix.



KMP Pattern Matching
• Basic concept:

Continue search
here



KMP Pattern Matching
• Implementation
–Requires computing a failure function, 𝑓(𝑘)
–The failure function indicates where to continue the matching after a mismatch



KMP Pattern Matching - Example
No match. Check the failure function for the previous character.
Continue the matching from location 1



KMP Pattern Matching - Example
No match. Check the failure function for the previous character.
Continue the matching from location 0



KMP Pattern Matching - Example

Complete the remaining iterations



KMP Pattern Matching - Example



KMP Pattern Matching - Algorithm
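
A minimal sketch of the matching loop, assuming the failure table is already available. The function name `kmp_search`, the sample text, and passing the table in as a plain list are illustrative choices; the table used here is the one for the pattern "abacab" from these slides.

```python
def kmp_search(text: str, pattern: str, fail: list) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    j = 0  # index into the text (never moves backwards)
    k = 0  # index into the pattern
    while j < n:
        if text[j] == pattern[k]:
            if k == m - 1:
                return j - m + 1  # the match ends at j
            j += 1
            k += 1
        elif k > 0:
            k = fail[k - 1]       # reuse the prefix that is known to match
        else:
            j += 1                # nothing to reuse, advance in the text
    return -1


FAIL_ABACAB = [0, 0, 1, 0, 1, 2]  # failure table for "abacab"
print(kmp_search("abacaabaccabacabaabb", "abacab", FAIL_ABACAB))  # 10
```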



KMP Pattern Matching
We know how to use the failure function
to perform pattern matching.
But how do we compute the function?

Example string:
abacab

𝒍 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 0 0 0 0



KMP Pattern Matching – Fail Function



KMP Pattern Matching
Initialize 𝑘 = 0, 𝑗 = 1

pattern[𝑘] and pattern[𝑗] do not match

if 𝑘 > 0, 𝑘 = 𝑓𝑎𝑖𝑙(𝑘 − 1)
else 𝑗++
Pointers: 𝑘 = 0, 𝑗 = 1

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 0 0 0 0



KMP Pattern Matching
pattern[𝑘] and pattern[𝑗] match

𝑓𝑎𝑖𝑙(𝑗) = 𝑘 + 1
𝑗++
𝑘++

Pointers: 𝑘 = 0, 𝑗 = 2

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 0 0 0 0



KMP Pattern Matching
pattern[𝑘] and pattern[𝑗] do not match

if 𝑘 > 0, 𝑘 = 𝑓𝑎𝑖𝑙(𝑘 − 1)
else 𝑗++

Pointers: 𝑘 = 1, 𝑗 = 3

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 0 0



KMP Pattern Matching
pattern[𝑘] and pattern[𝑗] do not match

if 𝑘 > 0, 𝑘 = 𝑓𝑎𝑖𝑙(𝑘 − 1)
else 𝑗++

Pointers: 𝑘 = 0, 𝑗 = 3

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 0 0



KMP Pattern Matching
pattern[𝑘] and pattern[𝑗] match

𝑓𝑎𝑖𝑙(𝑗) = 𝑘 + 1
𝑗++
𝑘++

Pointers: 𝑘 = 0, 𝑗 = 4

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 0 0



KMP Pattern Matching
pattern[𝑘] and pattern[𝑗] match

𝑓𝑎𝑖𝑙(𝑗) = 𝑘 + 1
𝑗++
𝑘++

Pointers: 𝑘 = 1, 𝑗 = 5

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 1 0



KMP Pattern Matching

Recall:

𝑙 0 1 2 3 4 5
𝑝𝑎𝑡𝑡𝑒𝑟𝑛[𝑙] a b a c a b
𝑓𝑎𝑖𝑙(𝑙) 0 0 1 0 1 2
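
The step-by-step rules above (match: set 𝑓𝑎𝑖𝑙(𝑗) = 𝑘 + 1 and advance both; mismatch: fall back with 𝑘 = 𝑓𝑎𝑖𝑙(𝑘 − 1) or advance 𝑗) can be sketched as follows. The function name `compute_fail` is an assumption.

```python
def compute_fail(pattern: str) -> list:
    """Failure table: fail[j] is the length of the longest proper prefix
    of pattern[0..j] that is also a suffix of pattern[0..j]."""
    m = len(pattern)
    fail = [0] * m
    j, k = 1, 0
    while j < m:
        if pattern[j] == pattern[k]:
            fail[j] = k + 1   # k + 1 characters match so far
            j += 1
            k += 1
        elif k > 0:
            k = fail[k - 1]   # fall back to a shorter prefix
        else:
            j += 1            # no prefix matches, fail[j] stays 0
    return fail


print(compute_fail("abacab"))  # [0, 0, 1, 0, 1, 2]
```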



KMP Pattern Matching – Fail Function



End of Part 1



Part 2

Tries & Text Compression
Learning Outcomes
• In this topic we will learn about
–Pattern/string matching
• Tries
–Text compression – Huffman Coding



Re-cap
Pattern Matching Algorithms: Brute Force, Boyer-Moore, Knuth-Morris-Pratt

Boyer-Moore and Knuth-Morris-Pratt both preprocess the pattern string.

How about preprocessing the text string?

Suitable for applications in which many queries are performed on a fixed text.

The cost to preprocess the text string is compensated by the speedup in queries
Tries (Pronounced as “try”)
• Tree-based structure for storing
strings
–Ordered Tree
• Applied in information retrieval.
• Uses parts of the key to navigate the
search
–Each key is a sequence of characters



Example of a Standard Trie
Let 𝑆 be a set of 8 strings where:
𝑆 = {𝑏𝑒𝑎𝑟, 𝑏𝑒𝑙𝑙, 𝑏𝑖𝑑, 𝑏𝑢𝑙𝑙, 𝑏𝑢𝑦, 𝑠𝑒𝑙𝑙, 𝑠𝑡𝑜𝑐𝑘, 𝑠𝑡𝑜𝑝}
Longest word = 5 chars
A standard trie for 𝑆 is as shown below:

[Figure: standard trie for 𝑆; height of tree = 5, 8 leaves]



A Standard Trie
Properties of a standard trie that stores a collection 𝑆 of 𝑠 strings:
• Each node of the tree (except the root) is labeled with a character
• Each child of an internal node has a distinct label
• The tree has 𝑠 leaves, where each leaf corresponds to a word in 𝑆
• The height of the tree is equal to the length of the longest string in 𝑆
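
These properties can be checked on the example set with a small sketch, assuming a trie represented as nested Python dicts (each key is a child's character label, and an empty dict is a leaf). The helper names are illustrative, and the leaf count relies on no word in 𝑆 being a prefix of another.

```python
def build_trie(words):
    """Standard trie as nested dicts: key = character label of a child."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})  # reuse or create the child
    return root


def count_leaves(node) -> int:
    if not node:  # empty dict = leaf
        return 1
    return sum(count_leaves(child) for child in node.values())


def height(node) -> int:
    if not node:
        return 0
    return 1 + max(height(child) for child in node.values())


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = build_trie(S)
print(count_leaves(trie), height(trie))  # 8 5
```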



A Standard Trie
Take note of the following property:
The tree has 𝑠 leaves, where each leaf corresponds to a word in 𝑆.

What happens if there is a word that is a prefix of another word? E.g. 𝑆 = {𝑡ℎ𝑒, 𝑡ℎ𝑒𝑟𝑒, 𝑡ℎ𝑒𝑠𝑒}
Use a special character to mark the end of the word. E.g. 𝑆 = {𝑡ℎ𝑒#, 𝑡ℎ𝑒𝑟𝑒, 𝑡ℎ𝑒𝑠𝑒}



A Standard Trie

Tries can be used to perform word matching, where we want to find a specific word in a text.
E.g. “bull”
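
A minimal sketch of word matching on a trie, assuming nested dicts and an end-of-word marker "#" appended to every stored word (a common variant of the special-character idea; the slides append it only where needed). The names `END`, `build_trie` and `contains_word` are illustrative.

```python
END = "#"  # end-of-word marker, assumed not to occur inside any word


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word + END:
            node = node.setdefault(ch, {})
    return root


def contains_word(trie, word: str) -> bool:
    """Follow the characters of word from the root; succeed only at an end marker."""
    node = trie
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_trie(["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"])
print(contains_word(trie, "bull"))  # True
print(contains_word(trie, "bu"))    # False: "bu" is only a prefix
```

The search visits at most one node per character of the query, so its cost depends on the word length and not on how many words are stored.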



Compressed Tries
• Rule:
–Each internal node must have at least
two children
• Chains of single child nodes are
compressed
–Nodes are thus labeled with strings
rather than characters



Compressed Tries
Original trie
What do you think it will look like after compression?



Advantage of Compressed Tries

Reduces the number of nodes from 𝑂(𝑛) to 𝑂(𝑠), where 𝑛 is the total length of all strings in 𝑆 and 𝑠 is the number of strings in 𝑆.
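
A sketch of the compression step on a nested-dict trie, assuming no stored word is a prefix of another (so every word ends at a leaf). The helper names `compress` and `count_nodes` are hypothetical.

```python
def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Merge every chain of single-child nodes into one string-labeled node."""
    compressed = {}
    for label, child in node.items():
        while len(child) == 1:                      # child has exactly one child
            (next_label, next_child), = child.items()
            label += next_label                     # absorb it into the label
            child = next_child
        compressed[label] = compress(child)
    return compressed


def count_nodes(node) -> int:
    """Number of nodes below the root."""
    return sum(1 + count_nodes(child) for child in node.values())


S = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
trie = build_trie(S)
small = compress(trie)
print(count_nodes(trie), count_nodes(small))  # 21 13
print(sorted(small["s"].keys()))              # ['ell', 'to']
```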



Suffix Tries
• Let 𝑆 consist of strings that are all
suffixes of a string 𝑋.
• A trie for 𝑆 in this case is known as a
suffix trie.



Suffix Tries
• Example: Let 𝑆 consist of all suffixes
of the word “minimize”
𝑆 = {𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒, 𝑖𝑛𝑖𝑚𝑖𝑧𝑒, 𝑛𝑖𝑚𝑖𝑧𝑒, 𝑖𝑚𝑖𝑧𝑒, 𝑚𝑖𝑧𝑒, 𝑖𝑧𝑒, 𝑧𝑒, 𝑒}

Draw the resulting trie!



Example
𝑆 = {𝑚𝑖𝑛𝑖𝑚𝑖𝑧𝑒, 𝑖𝑛𝑖𝑚𝑖𝑧𝑒, 𝑛𝑖𝑚𝑖𝑧𝑒, 𝑖𝑚𝑖𝑧𝑒, 𝑚𝑖𝑧𝑒, 𝑖𝑧𝑒, 𝑧𝑒, 𝑒}

[Figure: standard suffix trie for “minimize”; the root has children m, i, n, z and e]

Compress this trie!
Example



Advantage of Suffix Tries

Can perform efficient matching to determine if a pattern is a substring of text 𝑋
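
A minimal sketch of this idea, assuming an uncompressed suffix trie stored as nested dicts. The construction shown is the naive one that inserts every suffix character by character, which is fine for a short word like “minimize” but quadratic in general; the function names are illustrative.

```python
def build_suffix_trie(text: str):
    """Insert every suffix of text into a nested-dict trie (naive construction)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(trie, pattern: str) -> bool:
    """A pattern is a substring of the text iff it is a prefix of some suffix,
    i.e. iff it can be traced from the root of the suffix trie."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("minimize")
print(is_substring(trie, "nimi"))  # True
print(is_substring(trie, "minz"))  # False
```

Once the trie is built, each query costs time proportional to the pattern length, independent of the length of 𝑋.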



Text Compression
Introduction to Text Compression
Definition: Text Compression
Given a string 𝑋 defined over some alphabet (such as ASCII), efficiently encode 𝑋 into a smaller binary string 𝑌 using only the characters 0 and 1.

Definition: Coding
Representing data in another form (e.g. binary). An encoded character is called a codeword.



Coding

Definition: Fixed Length Code
Each codeword has the same number of bits.

Definition: Variable Length Code
Codewords can have different numbers of bits.



Fixed Code - Example
String: “hello how are you”
Number of characters: 17
Number of bits: 7 × 17 = 119

Char   ASCII Codeword   Freq
h      1101000          2
e      1100101          2
l      1101100          2
o      1101111          3
w      1110111          1
a      1100001          1
r      1110010          1
y      1111001          1
u      1110101          1
_      0100000          3
(_ marks the space character)



Fixed Code - Example
Represent the characters in a code tree
Each character is a leaf.
Left = 0, Right = 1
The codeword is the path from root to leaf.

[Figure: binary code tree with leaves h, e, l, o, w, a, r, y, u, _ from left to right]

𝒉 = 𝟎𝟎𝟎𝟎
Fixed Code - Example
String: “hello how are you”
Number of characters: 17
Number of bits: 7 × 17 = 119
After using fixed code: 4 × 17 = 68

Char   ASCII     Code   Freq
h      1101000   0000   2
e      1100101   0001   2
l      1101100   0010   2
o      1101111   0011   3
w      1110111   0100   1
a      1100001   0101   1
r      1110010   0110   1
y      1111001   0111   1
u      1110101   1000   1
_      0100000   1001   3
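
The 4-bit figure follows from the number of distinct characters: 10 symbols need ⌈log2 10⌉ = 4 bits each. A small sketch of that arithmetic, where the function name `fixed_length_bits` is an assumption:

```python
def fixed_length_bits(text: str):
    """Return (bits per codeword, total bits) for a minimal fixed-length code."""
    distinct = len(set(text))
    # (distinct - 1).bit_length() equals ceil(log2(distinct)) for distinct >= 1
    bits_per_char = max(1, (distinct - 1).bit_length())
    return bits_per_char, bits_per_char * len(text)


print(fixed_length_bits("hello how are you"))  # (4, 68)
```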



Variable-length Code

Huffman Coding (a greedy algorithm)
• Use short codewords to encode high-frequency characters.
• Basic rule: No codeword can be a prefix of another codeword. This simplifies decoding.



Huffman Coding – How do we do it?
Char         ASCII     Fixed-length   Huffman   Freq
h            1101000   0000           0111      2
e            1100101   0001           110       2
l            1101100   0010           111       2
o            1101111   0011           010       3
w            1110111   0100           0000      1
a            1100001   0101           0001      1
r            1110010   0110           0010      1
y            1111001   0111           0011      1
u            1110101   1000           0110      1
_            0100000   1001           10        3
Total Bits   119       68             55
(Huffman total = sum of codeword length × frequency = 8 + 6 + 6 + 9 + 4 + 4 + 4 + 4 + 4 + 6 = 55)



Huffman Coding
Basic Process:
1. Sort the characters based on frequency in ascending
order.
2. Initialize a tree for each character with a single root
node.
3. While there is more than 1 tree:
i. Take two trees, 𝑇1 , 𝑇2 with the lowest frequencies.
ii. Create a new tree with 𝑇1 and 𝑇2 as children.
iii. Let the frequency of the new tree be equal to the sum
of the frequency of the children.
iv. Associate 0 with the left branch and 1 with the right
branch
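
The basic process above can be sketched with a priority queue instead of an explicitly sorted list. This is one possible implementation, not necessarily the one intended by the slides: the function name `huffman_codes`, the use of `heapq`, and the tie-breaking counter are assumptions, and ties may be resolved differently from the hand-drawn example, so individual codewords can differ while the total length stays the same.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(text: str) -> dict:
    """Return a dict mapping each character of text to its Huffman codeword."""
    freq = Counter(text)
    if not freq:
        return {}
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    # Heap entry: (frequency, tiebreak, tree); a tree is a char or a (left, right) pair.
    heap = [(f, next(tiebreak), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # two trees with the lowest frequencies
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))
    codes = {}

    def assign(tree, prefix):
        if isinstance(tree, tuple):
            assign(tree[0], prefix + "0")  # left branch = 0
            assign(tree[1], prefix + "1")  # right branch = 1
        else:
            codes[tree] = prefix

    assign(heap[0][2], "")
    return codes


text = "hello how are you"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(total_bits)  # 55
```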



Huffman Coding - Example
Before sorting:
Char   Freq
h      2
e      2
l      2
o      3
w      1
a      1
r      1
y      1
u      1
_      3

After sorting (ascending frequency):
Char   Freq
w      1
a      1
r      1
y      1
u      1
h      2
e      2
l      2
o      3
_      3



Huffman Coding - Example
Initialize a tree for each character with a single root node.

Char   Freq
w      1
a      1
r      1
y      1
u      1
h      2
e      2
l      2
o      3
_      3

[Figure: ten single-node trees: w a r y u h e l o _]



Huffman Coding - Example
While there is more than 1 tree:
i. Take two trees, 𝑇1, 𝑇2 with the lowest frequencies.
ii. Create a new tree with 𝑇1 and 𝑇2 as children.
iii. Let the frequency of the new tree be equal to the sum of the frequency of the children.
iv. Associate 0 with the left branch and 1 with the right branch

[Figure: w (1) and a (1) are merged into a new tree of frequency 2, with w on branch 0 and a on branch 1; r, y, u, h, e, l, o and _ are still single-node trees]



Huffman Coding - Example
Repeat steps i to iv while there is more than 1 tree.

[Figure: four trees have been formed: (w, a) with frequency 2, (r, y) with frequency 2, (u, h) with frequency 3 and (e, l) with frequency 4; o (3) and _ (3) are still single-node trees]

What do we do next?



Huffman Coding - Example
Repeat steps i to iv while there is more than 1 tree.

[Figure: the smallest frequency trees, (w, a) and (r, y), are merged into a new tree of frequency 4; the figure also marks the next smallest frequency trees]



Huffman Coding - Example
Repeat steps i to iv while there is more than 1 tree.

[Figure: o (3) and (u, h) (3) are merged into a new tree of frequency 6, next to the tree of frequency 4 built from (w, a) and (r, y); the figure marks the smallest frequency trees for the next merge]



Huffman Coding - Example
Repeat steps i to iv while there is more than 1 tree.

[Figure: _ (3) and (e, l) (4) are merged into a tree of frequency 7; the trees of frequency 4 and 6 are merged into a tree of frequency 10; finally the trees of frequency 10 (branch 0) and 7 (branch 1) are merged into the root with frequency 17]



Huffman Coding – Confirm Results
Char         Huffman   Freq
h            0111      2
e            110       2
l            111       2
o            010       3
w            0000      1
a            0001      1
r            0010      1
y            0011      1
u            0110      1
_            10        3
Total Bits   55

[Figure: final Huffman tree with root frequency 17, whose children have frequencies 10 (branch 0) and 7 (branch 1)]
Huffman Coding
Characteristics
Based on the Greedy Method.
Why Greedy?

Given an optimization problem, the solution is derived by repeatedly choosing the decision that leads to the best immediate cost improvement.

The greedy method does not always lead to optimal solutions. But for this particular problem it does (due to having the greedy-choice property): the global optimum can be obtained by selecting locally optimum choices.



Huffman Coding
Characteristics
Not unique – there can be multiple possible solutions depending on how equal-frequency trees are selected.

Ultimately, the average codeword length remains the same.



Compression Rate

$$\text{compression rate} = \frac{\text{length}_{\text{input}} - \text{length}_{\text{output}}}{\text{length}_{\text{input}}}$$

What is the compression rate for the fixed-length encoding and Huffman?

Char         ASCII     Fixed-length   Huffman   Freq
h            1101000   0000           0111      2
e            1100101   0001           110       2
l            1101100   0010           111       2
o            1101111   0011           010       3
w            1110111   0100           0000      1
a            1100001   0101           0001      1
r            1110010   0110           0010      1
y            1111001   0111           0011      1
u            1110101   1000           0110      1
_            0100000   1001           10        3
Total Bits   119       68             55



Solution
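• Fixed Length:
$$\frac{119 - 68}{119} \approx 0.429$$
• Huffman:
$$\frac{119 - 55}{119} \approx 0.538$$
Huffman is the superior method: its total of 55 bits is the sum of codeword length × frequency over the table.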
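
The rates can be recomputed directly from the codeword table. In this sketch the dictionary `HUFFMAN` copies the Huffman column of the table, and the function name `compression_rate` is an assumption; the total length is obtained by summing codeword lengths over the 17 characters of the string.

```python
HUFFMAN = {
    "h": "0111", "e": "110", "l": "111", "o": "010", "w": "0000",
    "a": "0001", "r": "0010", "y": "0011", "u": "0110", " ": "10",
}


def compression_rate(length_input: int, length_output: int) -> float:
    return (length_input - length_output) / length_input


text = "hello how are you"
ascii_bits = 7 * len(text)                            # 7-bit ASCII codewords
fixed_bits = 4 * len(text)                            # 4-bit fixed-length code
huffman_bits = sum(len(HUFFMAN[ch]) for ch in text)   # variable-length code

print(ascii_bits, fixed_bits, huffman_bits)                   # 119 68 55
print(round(compression_rate(ascii_bits, fixed_bits), 3))     # 0.429
print(round(compression_rate(ascii_bits, huffman_bits), 3))   # 0.538
```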



Practice
Come up with the Huffman code for the following string:
“a fast runner need never be afraid of the dark”
1. Sort the characters based on frequency in ascending
order.
2. Initialize a tree for each character with a single root
node.
3. While there is more than 1 tree:
i. Take two trees, 𝑇1 , 𝑇2 with the lowest frequencies.
ii. Create a new tree with 𝑇1 and 𝑇2 as children.
iii. Let the frequency of the new tree be equal to the
sum of the frequency of the children.
iv. Associate 0 with the left branch and 1 with the
right branch



Solution



End of Part 2

