
TEL-AVIV UNIVERSITY

The Iby and Aladar Fleischman Faculty of Engineering

Class Coding and Two Applications

A thesis submitted toward the degree of


Master of Science in Electrical and Electronic Engineering

By
Avishay Orpaz

July 2004
TEL-AVIV UNIVERSITY
The Iby and Aladar Fleischman Faculty of Engineering

Class Coding and Two Applications

A thesis submitted toward the degree of


Master of Science in Electrical and Electronic Engineering

By
Avishay Orpaz

This research was carried out in the Department of Electrical


Engineering – Systems under the supervision of
Dr. Shlomo Weiss

July 2004
To my wife Idit,
who spent on this work as much time as I did
Contents

1 Introduction and Related Work 1

1.1 Goals and Objectives . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2.1 Code Compression in Embedded Systems . . . . . . . . 2

1.2.1.1 Motivation . . . . . . . . . . . . . . . . . . . 2

1.2.1.2 Related Work . . . . . . . . . . . . . . . . . . 3

1.2.2 Compressed Pattern Matching . . . . . . . . . . . . . . 6

1.2.2.1 Motivation . . . . . . . . . . . . . . . . . . . 6

1.2.2.2 Pattern Matching . . . . . . . . . . . . . . . . 7

1.2.2.3 Text Compression . . . . . . . . . . . . . . . 7

1.2.2.4 Compressed Matching Problem Definition . . 8

1.2.2.5 Related Work . . . . . . . . . . . . . . . . . . 9

1.3 Outline of the proposed algorithm . . . . . . . . . . . . . . . . 11

1.4 Structure of the work . . . . . . . . . . . . . . . . . . . . . . . 12

2 Class Compression 15

2.1 The Class Compression System . . . . . . . . . . . . . . . . . 15

2.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.2.1 Compression Ratio . . . . . . . . . . . . . . . . . . . . 17

2.2.2 Redundancy . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 The optimal class structure problem . . . . . . . . . . . . . . . 22

2.3.1 Brute force approach . . . . . . . . . . . . . . . . . . . 22

2.3.2 A faster algorithm . . . . . . . . . . . . . . . . . . . . 23

2.3.3 The Optimization Algorithm and the Prefix . . . . . . 27

2.3.4 Complexity . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.5 Example . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.6 Implementation . . . . . . . . . . . . . . . . . . . . . . 32

2.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . 33

2.4.1 Implementation Issues . . . . . . . . . . . . . . . . . . 34

2.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3 Code Compression for Embedded Systems 37

3.1 System Parameters . . . . . . . . . . . . . . . . . . . . . . . . 37

3.1.1 Probability Model . . . . . . . . . . . . . . . . . . . . . 37

3.1.2 Codebook Limit . . . . . . . . . . . . . . . . . . . . . . 39

3.1.3 Static vs. Dynamic Structure . . . . . . . . . . . . . . 39

3.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 42

4 Multi-resolution string matching 47

4.1 Low resolution text . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2 A string matching algorithm . . . . . . . . . . . . . . . . . . . 49

4.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.2 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.3 Double-byte symbols . . . . . . . . . . . . . . . . . . . 53



4.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Relation of number of classes to search speed . . . . . . 57
4.4 Extra space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 58
4.5.1 Implementation Notes . . . . . . . . . . . . . . . . . . 58
4.5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Enhancement Possibilities . . . . . . . . . . . . . . . . . . . . 64

5 Summary 67
List of Figures

1.1 Computing System with Code Compression . . . . . . . . . . 5

2.1 Entropy Coder . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 Class Structure Optimization Example . . . . . . . . . . . . . 32

2.3 Compression ratio for some files from the testset with different
number of classes . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.4 Approaching the Huffman limit (X axis is logarithmic for convenience) . . . 36

3.1 Class Structure of CodePack . . . . . . . . . . . . . . . . . . . 41

3.2 Comparison of Probability Models for Code Compression . . . 42

3.3 Comparison of Dictionary Sizes for Different Models . . . . . . 43

3.4 The Effect of Codebook Limit (with varying N ) . . . . . . . . 44

3.5 The Effect of Codebook Limit (fixed N ) . . . . . . . . . . . . 44

3.6 Static Class Configuration . . . . . . . . . . . . . . . . . . . . 45

4.1 Multiple resolution text using class coding . . . . . . . . . . . 49

4.2 Class-compressed text stream format . . . . . . . . . . . . . . 50

4.3 Illustration of string matching procedure . . . . . . . . . . . . 54


4.4 Class coding compared with other compression algorithms . . 61


4.5 False matches results . . . . . . . . . . . . . . . . . . . . . . . 61
4.6 Search time results . . . . . . . . . . . . . . . . . . . . . . . . 62
4.7 Search time versus false matches . . . . . . . . . . . . . . . . . 62
4.8 Optimal selection of N . . . . . . . . . . . . . . . . . . . . . . 63
4.9 Comparison of Search Times . . . . . . . . . . . . . . . . . . . 63
List of Tables

1.1 Taxonomy of compressed matching methods . . . . . . . . . . 10

2.1 Symbol Probabilities for the example . . . . . . . . . . . . . . 31


2.2 Class Assignment for the example . . . . . . . . . . . . . . . . 31
2.3 Final Symbol Coding . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Datasets for compression experiments . . . . . . . . . . . . . . 34

Abstract

This work presents a coding system called class coding and two of its appli-
cations. The class coding system is a compression scheme that has unique
features. The system is analyzed and a novel algorithm is presented to op-
timally adjust its parameters. The first application of this system is code
compression. The work explores the parameters of the system that are rele-
vant to such usage. The second application is compressed matching – a class
of algorithms that are designed to search a text within another, compressed
text, without having to completely decompress it. The features of class coding
are used to provide a fast, flexible and efficient search algorithm.

Chapter 1

Introduction and Related Work

1.1 Goals and Objectives

This work deals with two subjects: code compression in embedded systems
and compressed pattern matching. The common denominator to these, usu-
ally unrelated, problems is a compression method called class compression.
The objectives of the work are as follows:

• Study the class compression method, its properties and develop an


algorithm to optimize it (in the sense of compression ratio)

• Explore the use of class compression for the application of code com-
pression in embedded systems. Parameters that have influence over the
system performance and tradeoffs will be introduced and studied.

• Develop a method to perform pattern matching in class-compressed


files, making use of its special properties to achieve performance bene-
fits.


1.2 Introduction

1.2.1 Code Compression in Embedded Systems

1.2.1.1 Motivation

Driven by an expanding market for consumer electronics and communications


equipment, embedded software is becoming increasingly complex [37, 38]. In
high-end embedded products, 32-bit microprocessor cores provide the com-
puting power needed to run complex algorithms in real-time. Timely devel-
opment of such complex and large embedded applications requires the use
of high-level languages and compilers instead of manually crafted assembly
code. From the system point of view, larger applications and compiled code
are both factors that add up to a requirement for larger instruction memory.

Another factor is the system software. Currently, many embedded prod-


ucts use real-time operating systems with modest memory requirements, typ-
ically in the range of 10KB to 100KB. Embedded versions of Linux and Win-
dows are becoming increasingly popular [50] in high-end 32-bit applications
and in products that do not have tight real-time requirements, such as set-
top boxes and networked game consoles. Scaled down versions of Linux or
Windows NT may require a few megabytes of memory.

The RISC architecture, which is now dominant in embedded systems


(practically in all but computers based on the old x86 architecture) uses
fixed length instruction words to enable fast decoding and branching. Unlike
variable length instruction word architectures, the fixed-length approach sacrifices
code size for speed (because some instructions include unused bits).

This is usually not a concern in PCs and other non-embedded systems, where
memory is abundant. In embedded systems the picture is different. These
systems are typically subject to space, power and cost constraints, in which
instruction memory, like any other resource, is limited.

The available instruction memory space may be better utilized by encod-


ing embedded software in a compact format. This cost effective approach
received a lot of attention in recent years. The concept is to compress the
instructions in the memory (using one of the many available compression
methods or new ones) thereby saving memory, while the decompression is
performed in real time, just before the code is executed. This function is
usually performed by a special hardware unit. Not every compression
method is suitable for such an application. The system must be capable of
decompressing the code quickly, affecting performance as little as possible, and
handling the compressed address space efficiently. All these mechanisms are
usually required to work transparently with respect to the software.

1.2.1.2 Related Work

Several approaches have been used to produce compact code.

Thumb [56] and MIPS-16 [28] offer extended instruction sets that include
short instructions for embedded applications. The use of short instructions
adds minimal run-time overhead. This is not a compression method per se, but
code size is reduced. The major drawback of this technique is that code has
to be recompiled (if it is hand written in assembly, it should be rewritten).
Moreover, all the development tools must be modified to utilize the new
instructions. This method is also very processor-specific: it cannot be ported
to another processor without being almost completely redesigned.

A second approach is to reduce the size of the compiled code by generating


a custom instruction set, matched to the characteristics of the compiled pro-
gram [18, 22]. The custom instructions are interpreted at a speed slower by
a factor of 2-4 relative to compiler generated code [19], or a tailored decoder
may produce the processor’s internal signals directly [34].

The third approach, illustrated in figure 1.1, takes the binary image of
regular code, generated by the regular development tools, and compresses
it. This compressed code cannot, of course, be executed on its own – it has
to be decompressed by some hardware unit in the target system. Several
implementations of this approach have been suggested. Chen [13, 12] describes
a dictionary based system, in which common code segments are replaced
by a single reference. At runtime, the decompression core retrieves
these code segments back for execution. Another paper based on the same
method is by Lefurgy [39]. Other researchers tried other compression algorithms.
Kozuch [32, 33] studied the statistical properties of RISC code using
0-order and 1-order models. Ernst [16] uses a method called operand factorization
along with a Markov probability model. Lekatsas [40] and Xie [64] use
arithmetic coding (reduced to accommodate the fast decoding requirement);
Liao [41] uses a dictionary based method called EPM; Wolfe and Chanin [62]
use Huffman based methods. Finally, IBM's CodePack system [20] is of special
interest to our work; it will be described in detail in a later chapter.

Another important aspect of code compression systems is the address


mapping problem. After compression, the address of some instructions changes.
Branch instructions specify a destination address, which was calculated by

"% %

!
# $
" "
%""

"% %

" & !
# $
' " "
%""

Figure 1.1: Computing System with Code Compression

(a) Hardware configuration; (b) Hardware configuration with code compression;


(c) Software development workflow; (d) Software development workflow with code
compression

the compiler according to the uncompressed code. After compression, this
address may no longer be correct. It may also be possible that the new
address does not even fall on a byte boundary. Two common methods exist
to overcome this problem. The first (introduced in [62]) uses a line address
table, which is used at runtime to translate addresses from the
uncompressed address space to the compressed address space. The second
approach, introduced in [39], scans the compressed code and patches all the
branching instructions to their correct destinations.

1.2.2 Compressed Pattern Matching

1.2.2.1 Motivation

The problem of pattern matching is a fundamental problem in computer
science, concerned with finding the first (or all) occurrences of a string within
another, larger string. In this work we address an instance of the string matching
problem usually known as the compressed matching problem. We are
interested in finding an occurrence of a string within another string, but in
this case the second string is compressed by some compression method. We
want to do that without having to decompress the large string first. It is also
desirable that the algorithms allowing such an operation do it faster than
the simple solution and, of course, afford as much compression as possible.

This problem is becoming more and more important due to the vast
amount of information modern servers and personal computers have to deal
with.

1.2.2.2 Pattern Matching

The problem of pattern matching is formalized as follows (from [54]):

Given a pattern string x, with |x| = m, and a text string y, with


|y| = n, where m, n ≥ 0 and m ≤ n, if x occurs as a substring of y
then determine the position within y of the first occurrence of x,
i.e. return the least value of i such that y(i, i + m − 1) = x(1, m).

The simplest form of string matching is the brute force approach. This
algorithm simply scans y and compares every character with a character
in x. This algorithm has two drawbacks: first, it requires the ability to go
back in y; and second, it has a worst case time complexity of O(m · n).
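For illustration, a minimal sketch of this brute-force scan (the function name and the test string are ours, not from the thesis):

    def brute_force_match(x, y):
        """Return all starting positions (0-based) of pattern x in text y."""
        m, n = len(x), len(y)
        occurrences = []
        for i in range(n - m + 1):            # try every alignment of x against y
            j = 0
            while j < m and y[i + j] == x[j]:
                j += 1                        # may back up and re-read characters of y
            if j == m:
                occurrences.append(i)
        return occurrences

    print(brute_force_match("ab", "cabab"))   # [1, 3]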
Two algorithms that overcome these problems are the Knuth-Morris-
Pratt [31] and Boyer-Moore [7] algorithms. Both have a worst case time complexity
of O(m + n), but the latter achieves sublinear time in the average case,
which makes it extremely important and useful.
It is also important to note here a fundamental result by Rivest [47], who
showed that in the worst case, no algorithm can perform string matching
with less than n − m + 1 comparisons, thus making a sublinear-in-the-worst-
case algorithm impossible.

1.2.2.3 Text Compression

The goal of text compression algorithms is to encode a given string (text) in
such a way that it will consume as little storage as possible. Certainly, the
encoding must be reversible: for a given encoding algorithm C, there exists a
decoding algorithm D such that t = D(C(t)).

This problem has been extensively studied, and the two most common
coding families today are entropy coding and dictionary coding. In
entropy coding, the goal is to find the most suitable code for every symbol,
such that the total length of the encoded text is minimized. Some members of
this family are Huffman [23] and arithmetic coding. Dictionary methods,
in contrast, try to replace frequently occurring substrings with references to
their previous occurrences. Some members are the Lempel-Ziv algorithms and their variants.

In recent years, some newer methods have emerged, such as anti-
dictionaries [14], byte pair encoding [14] and block sorting [8].

The performance of a compression algorithm is measured by two factors:


the compression it can attain and the speed of compression/decompression.

1.2.2.4 Compressed Matching Problem Definition

The methods described in the previous section do achieve their goals, but
the encoded text, in the general case, cannot be searched for a pattern using
the algorithms described in 1.2.2.2. In order to achieve compression, these
algorithms change the structure, the symbol encoding and the order of the text,
thus making the direct application of the plain text matching algorithms
impossible.

Searching compressed texts is possible in one of two ways: 1) decompressing
the file and then employing a known searching algorithm; or 2) using a new
algorithm designed specifically for compressed pattern matching. The first
approach has some drawbacks. Decompressing a file can consume, in some
cases, considerable time and space. Moreover, in certain cases it is desirable
to search a text repeatedly for different patterns; with the first approach we
would have to decompress the same text over and over again. In some other
cases, we are not interested in the entire file, but only in the area where the
desired string has been found. Again, we would have to decompress the entire
text.
The second approach has the potential of overcoming all these drawbacks
if a good algorithm is found.
Amir and Benson[1] were the first to define this problem and give perfor-
mance goals (in the following definition u is the length of the uncompressed
text, n is the length of the compressed text and m is the length of the pattern
being searched):

• An algorithm that finds all the occurrences of the pattern in the text
in O(u) time is efficient

• An algorithm that finds all the occurrences of the pattern in the text
in O(n log m + m) time is almost optimal

• An algorithm that finds all the occurrences of the pattern in the text
in O(n + m) time is optimal

In a later paper, Amir, Benson and Farach [2] observe that there is sometimes
a tradeoff between the execution speed and the extra space consumed.
The amount of extra space used is, therefore, another important performance
measure.

1.2.2.5 Related Work

The work on this subject can be coarsely divided by the type of algorithm
used to compress the text. Table 1.1 presents this taxonomy.

Run-length: Amir and Benson [1], Eilam-Tzoreff and Vishkin [15], Amir, Landau and Vishkin [3]
Huffman: De Moura et al. [44], Manber [42], Klein and Shapira [30]
LZ*: Kida et al. [26], Kida et al. [27], Amir, Benson and Farach [2], Klein and Shapira [29], Farach and Thorup [17], Ga̧sieniec and Rytter [21]
Other: Shibata et al. [51], Shibata et al. [52], Takeda [55], Mitarai et al. [43], Tarhio et al. [46]

Table 1.1: Taxonomy of compressed matching methods

The first group (also chronologically) of methods is based on run-length
encoding. The most salient members are Eilam-Tzoreff and Vishkin [15], who studied
the problem in one dimension, and Amir, Landau and Vishkin [3] and Amir and
Benson [1], whose works studied the problem in two dimensions. The
last one, as already mentioned, is the work that formulated and defined the problem of
compressed matching.

The second group is based on Huffman [23] encoding. De Moura et al. [44]
suggested a method based on compressing complete words, rather than characters,
using Huffman code. A special mark bit is used to denote a word beginning,
allowing fast searches (thereby reducing the compression level). Klein and Shapira [30]
take an interesting approach and use a probabilistic search method that
relies on the resynchronization property of Huffman codes.

The third group is based on the Lempel-Ziv family of compression algorithms.
It should not come as a surprise that this is the largest group of all,
since the Lempel-Ziv algorithms are extensively employed to compress
text, with very good results. Some algorithms in this group try to perform the
search on files compressed by standard tools, particularly the Unix compress
tool, which uses LZW [60]. Two examples of this approach are Amir, Benson
and Farach [2] and Kida et al. [26], who try to make use of the efficient Shift-
And algorithm for searching in (uncompressed) text. Other authors change
the base encoding scheme to facilitate the search. Klein and Shapira [29], for example,
replace the back pointers in the LZSS scheme by forward pointers.

The fourth and final group contains other kinds of methods. Manber [42]
designed a system in which the 128 codes unused in regular texts (assuming
ASCII encoding) are utilized to encode common character pairs. The author
describes a method for selecting these pairs in such a way that no ambiguity
arises. Shibata et al. [51] use a hierarchical method in which every stage
consists of removing common symbol pairs from the text and replacing them
by a new symbol. Another scheme by Shibata [53] searches for patterns in text
compressed using antidictionaries (which are lists of forbidden phrases, see
Crochemore [14]). Two recent results are due to Tarhio [46], who generalizes
the method proposed by De Moura [44], and Shibata et al. [52], who use the Boyer-
Moore algorithm to search through BPE-compressed text.

1.3 Outline of the proposed algorithm

We propose an algorithm that compresses text (over some arbitrary alphabet Σ)
and allows searching directly in the compressed text.
The basic compression algorithm is inspired by CodePack, an algorithm used
by IBM for code compression in their PowerPC processor[20]. It is a zero-

order substitution scheme, in which every symbol is replaced by a variable


length codeword. The codeword is comprised of two parts. The first part
selects a specific subset of the alphabet, and the second part selects a symbol
within this subset, thereby resolving the original symbol. When looking at
the first part of all the codewords of the text, we get a low resolution image
of the text. This image has certain properties that allow us to search it
very quickly and to filter out many locations in which the pattern
being searched cannot occur.

We present an algorithm that optimally partitions the alphabet into subsets,
and then we prove that searching the low resolution image can indeed reduce
the necessity of completely decompressing the text and searching in it. All
the algorithms have been implemented, and the experimental results are also
given.

The algorithms described here were previously published in [4] and [5].

1.4 Structure of the work

In the first chapter, we give a general overview of the two problems we


are dealing with – code compression in embedded systems and compressed
pattern matching. We give an overview of the works that have been done in
these research areas along with some recent results.

The second chapter deals with the compression method that we propose
– class compression. After explaining the scheme, we develop an algorithm
for optimizing the compression and discuss the influence of various
parameters on the compression rate.

The third chapter is devoted to code compression. We describe the usage


of class compression in this application and explore the parameters influenc-
ing the system performance.
The fourth chapter describes the compressed pattern matching applica-
tion. The algorithm is described, analyzed and experimental results are
given.
The fifth and final chapter summarizes the work.
Chapter 2

Class Compression

2.1 The Class Compression System

The class compression system, used by CodePack [20, 25] and previously used
by Said and Pearlman [49] (with some variations), is a 0-order entropy coder.
Each symbol in the original text is replaced by some codeword. The length
(in bits) of the input symbols is fixed, but the length of the output codeword
is variable. By tuning the codeword lengths according to the symbols' probabilities of
occurrence, compression can be attained. The best known coder of this kind
is the Huffman [23] code, which describes a way to optimally select the length
and the bits of the codeword for every symbol. The Huffman code possesses
the prefix property, which means that no codeword is a prefix of another
codeword. That way, the code can be decoded without ambiguity.

The class compression system takes a similar approach.

Figure 2.1: Entropy Coder

Let Σ = {σ_1, σ_2, . . . , σ_K} be some arbitrary alphabet. Let {C_1, C_2, . . . , C_{N+1}}
be a set of N + 1 proper subsets of Σ, such that

    Σ = C_1 ∪ C_2 ∪ ... ∪ C_N                (Cover)          (2.1)

    C_i ∩ C_j = ∅   for all i ≠ j            (Uniqueness)     (2.2)

Each Cj is assigned a codeword that identifies it and then another code-


word for every member within the selected subset. The first codeword is
called prefix and the latter is the index. Every symbol in the alphabet is
represented by a prefix-index pair and from (2.1) and (2.2) it is clear that
that every symbol belongs to one and only one subset (will be called a class
from now on), thus the process can be reversed and the original text can be
extracted from the codewords.

The codewords to the prefix and to the index are allocated differently. The
prefix uses Huffman codes. The index uses a fixed length binary code – every
class has an associated index length (which may be different among classes).
The length of that code must be selected according to the cardinality of the

class in order to provide a representation for every member. More specifically,
the length of the index of class C_j is not less than ⌈log_2 |C_j|⌉.

In order to decompress a stream of class coded text, another piece of
information is required – a codebook is necessary to translate between the
codewords and the original symbols. However, including all the symbols in
the codebook may not be economical, since some appear very infrequently;
some may even appear only once. To solve this problem, the literal class is
added to the system. The members of the literal class are all the symbols
that are not members of any other class. It has a prefix exactly like any other
class, but instead of an index, the original symbol is copied verbatim. Therefore,
it is not necessary to include these symbols in the codebook. The drawback here is
that the compressed codeword is longer than the uncompressed symbol, but
since such symbols are infrequent, they have little influence on the overall length of
the compressed text. This situation also occurs in other entropy coders.
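To make the encoding concrete, here is a minimal sketch of how a symbol becomes a prefix-index pair (our own illustration; the three-class structure used is the one derived in the worked example of section 2.3.5, and the names are not part of CodePack):

    # Toy class structure: class 1 = {C} (prefix "00", 0 index bits),
    # class 2 = {E, A} (prefix "1", 1 index bit),
    # class 3 = literal class (prefix "01", raw 3-bit symbol copied verbatim).
    CLASSES = [
        ("00", ["C"], 0),
        ("1",  ["E", "A"], 1),
    ]
    LITERAL_PREFIX = "01"
    RAW = {s: format(i, "03b") for i, s in enumerate("ABCDEFGH")}   # 3-bit raw symbols

    def encode_symbol(sym):
        for prefix, members, index_bits in CLASSES:
            if sym in members:
                index = format(members.index(sym), "0%db" % index_bits) if index_bits else ""
                return prefix + index
        return LITERAL_PREFIX + RAW[sym]      # literal class: prefix + verbatim symbol

    print(" ".join(encode_symbol(s) for s in "CEAD"))   # 00 10 11 01011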

2.2 Analysis

2.2.1 Compression Ratio

Let
Σ = {σ1 , σ2 , ..., σK } (2.3)

be an alphabet, with every symbol σ_i having an associated occurrence
probability p_i. Without loss of generality, we can assume that the
symbols are sorted by their probability, that is p_1 ≥ p_2 ≥ ... ≥ p_K.

Let
C = {C1 , C2 , ..., CN , CN +1 } (2.4)

be a set of classes, with CN +1 being the literal class.

A class configuration is a mapping

G:Σ→C (2.5)

A class configuration shall be called valid if it complies with (2.1), (2.2) and
one more rule:

    σ_i ∈ C_j ⇒ σ_{i+1} ∈ C_{j'},   where j' = j or j + 1        (2.6)

The compression ratio¹ that will be achieved under such a parameter system
can be obtained by counting the bits in the individual components of
the code:

    R = [ Σ_{i=1}^{N} P_i · (π_i + log_2 |C_i|)  +  (B + π_{N+1}) · P_{N+1}  +  B · Σ_{i=1}^{N} |C_i| ] / B        (2.7)

(the three terms in the numerator are designated I, II and III below)

where B is the word length of each symbol in the uncompressed text and
P_i is the probability that a symbol belongs to C_i:

    P_i = Σ_{j: σ_j ∈ C_i} p_j        (2.8)

¹ The literature defines the compression ratio in two different ways: |C(T)|/|T| and 1 − |C(T)|/|T|, where |T| is the length of the text and |C(T)| is the length of the compressed text. We will adhere to the first form, meaning that the smaller the number, the better the compression.

Equation (2.7) has 3 components. The first component (designated I)
represents all the symbols that are encoded in all the classes except for the
literal class. Every such symbol is encoded by a prefix, whose length (in bits)
for class C_i is π_i, and an index whose length is ⌈log_2 |C_i|⌉. The second part,
designated II, counts the bits in the symbols included in the literal class. Their
codes include the prefix of the literal class and the uncompressed codeword.
The last part, III, is the space required by the codebook, which has one entry
of the length of the uncompressed codeword for every symbol that does not
belong to the literal class.
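As a small illustration of how (2.7) can be evaluated for a candidate class structure, a sketch (variable names are ours; the prefix lengths π_i are assumed to be given):

    from math import log2

    def compression_ratio(P, sizes, pi, B):
        """Evaluate R of eq. (2.7).  P[i], sizes[i], pi[i] describe class C_{i+1};
        the last entries belong to the literal class C_{N+1}.  B is the word
        length (in bits) of an uncompressed symbol."""
        N = len(sizes) - 1
        part_I   = sum(P[i] * (pi[i] + log2(sizes[i])) for i in range(N))  # coded symbols
        part_II  = (B + pi[N]) * P[N]                                      # literal symbols
        part_III = B * sum(sizes[:N])                                      # codebook entries
        return (part_I + part_II + part_III) / B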
Several factors govern the length of the codeword stream given a text (the
compression ratio):

• Number of classes – To understand the influence of the number of classes
on the compression ratio, we look at two extreme cases. The first case
is N = 0 (there is only one class – the literal class). Since there is only
one class, its prefix is 0 bits long and we have only the index. The
index in the literal class, however, is a copy of the original symbols, so
in this case we have |C(T)| = |T|.

The second case is when N = |Σ|. In this case every symbol in Σ has
its own class with a 0 bit index. The prefix is Huffman coded, so we get
that every symbol is Huffman coded. So in this case |C(T)| = |H̃(T)|
(where H̃(T) is the Huffman coding of T).

Between these two extremes, the compression ratio improves monotonically.
To prove that, let us assume that we have a class structure with
N classes over some alphabet. We build a new class structure with
N + 1 classes using the following procedure. First, we pick some class
arbitrarily having two symbols or more in it. Then, we split it into
two classes, each containing exactly half the symbols of the original
class (always possible because class cardinality is an integral power of
2). The number of bits required to represent each symbol in each new
class is one bit less than was required by the original class. The prefixes
of the two new classes would be the prefix of the original class
with a single bit appended, selecting one of the new classes. The new
structure obtained is clearly valid, has N + 1 classes as required, and
uses exactly the same number of bits. This structure is not necessarily
optimal, meaning that there might be some other structure that uses
fewer bits (but not more). Therefore, the compression ratio, as a function
of the number of classes, is monotonic.

To sum up, the more classes we allow, the better the compression ratio we
expect to achieve.

• Class structure – In its documentation ([24, 25, 20]) IBM does not reveal
how it determined its class structure. This is the main issue dealt
with here; an algorithm for optimal class structure selection will be presented
in the next section.

• Model – The model used to decide what the symbols of the text are
and what the probability of every symbol is. Two choices are possible
with class compression. The first is a static model, built using a-priori
knowledge of the probabilistic properties of the text being compressed;
the second is a semi-static model, in which the text is scanned prior to compression
and the number of symbols of each type is counted. We will
restrict ourselves to zero-order models, which means that a symbol is
always replaced by a unique codeword, regardless of its position in the text
or its context. Class coding can be used for higher order models by
augmenting the alphabet appropriately.

2.2.2 Redundancy

We start our analysis following [49]. Our model is an i.i.d. source of symbols
σ_i, each having a probability of p_i. The entropy of the source is:

    H = − Σ_{i=1}^{K} p_i log_2 p_i        (2.9)

After separating index and prefix, the entropy would be:

    H = − Σ_{i=1}^{N} P_i log_2 P_i − Σ_{i=1}^{N} P_i · Σ_{j∈C_i} (p_j/P_i) log_2 (p_j/P_i)
      = − Σ_{i=1}^{N} P_i log_2 P_i − Σ_{i=1}^{N} Σ_{j∈C_i} p_j log_2 (p_j/P_i)        (2.10)

In our coding scheme, the indices are not entropy coded, but have a fixed
length. Thus the second additive term in (2.10) changes:

    H' = − Σ_{i=1}^{N} P_i log_2 P_i − Σ_{i=1}^{N} Σ_{j∈C_i} p_j log_2 (1/|C_i|)        (2.11)

Taking the difference between (2.10) and (2.11) we get:

    ΔH = H' − H = Σ_{i=1}^{N} Σ_{j∈C_i} p_j log_2 (p_j · |C_i| / P_i)        (2.12)

which leads us to an important conclusion: losses due to the fixed length
of the index are minimized if p_i ≈ P_j/|C_j| for σ_i ∈ C_j.
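A short sketch of evaluating the loss ΔH of (2.12) for a given assignment of symbol probabilities to classes (our own illustration, not from the thesis):

    from math import log2

    def index_loss(classes):
        """Delta-H of eq. (2.12); `classes` is a list of lists, each inner list
        holding the probabilities p_j of the symbols assigned to one class."""
        loss = 0.0
        for probs in classes:
            P = sum(probs)                          # class probability P_i
            for p in probs:
                loss += p * log2(p * len(probs) / P)
        return loss

    # Zero loss for the equiprobable class, a small positive loss for the other:
    print(round(index_loss([[0.25, 0.25], [0.3, 0.2]]), 4))   # 0.0145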

2.3 The optimal class structure problem

As shown in the previous section, the class configuration has a
major impact on the compression ratio. In this section we investigate this
problem and propose an efficient algorithm to find such a mapping.

2.3.1 Brute force approach

Our goal is to assign K symbols to N + 1 classes under the constraints (2.2),
(2.1) and (2.6). The number of members of each class (with the exception of
the literal class) is an integral power of 2, therefore it is possible to write:

    |C_i| = 2^{z_i},   z_i integer, 1 ≤ i ≤ N        (2.13)

and from (2.2) and (2.1) we get

    Σ_{i=1}^{N} |C_i| ≤ |Σ|        (2.14)

Substituting (2.13) into (2.14):

    Σ_{i=1}^{N} 2^{z_i} ≤ K        (2.15)

The problem now is how many vectors Z = (z_1, z_2, ..., z_N) satisfy (2.15).
To bound this, let us assume z_1 = z_2 = ... = z_N = z. Now (2.15) becomes

    N · 2^z ≤ K        (2.16)

and therefore

    z ≤ log_2(K/N) = log_2 K − log_2 N ≈ log_2 K        (2.17)

and every Z with z_n ≤ z would now satisfy (2.15). The last approximation
is due to the fact that in real world applications K ≫ N. It is evident
that this is a lower bound, since there might be some vectors Z in which some
component is greater than z that still satisfy (2.15).

Finally, a brute force scheme would have to check at least

    z^N = (log_2 K)^N ≈ O((log K)^N)        (2.18)

vectors.

In practice (as we shall see later), N is rather small, so that brute force
search is feasible. Nevertheless, it is desirable to find an algorithm that can
solve the problem in much shorter time (both from a complexity point of view
and in practice).

2.3.2 A faster algorithm

We introduce a graph G = (V, A) having K +1 nodes – one node correspond-


ing to each symbol in Σ and one additional, final node.

V = {vk |1 ≤ k ≤ K + 1} (2.19)

The nodes are connected by arcs of two types – final and non-final:

    A = A^{NF} ∪ A^{F}        (2.20)

Non-final arcs are defined as:

    A^{NF} = { a_{i,j} | i < j ≤ K and (j − i) is an integral power of 2 }        (2.21)

Final arcs are defined as:

    A^{F} = { a_{i,j} | j = K + 1 }        (2.22)

Each arc in the graph corresponds to a feasible class – a class that may
be selected to be included in C. A non-final arc extending from v_i to v_j
represents a class in which the symbols σ_i, σ_{i+1}, ..., σ_{j−1} are included. A final
arc extending from v_i to v_{K+1} (note that a final arc always ends at the final
node) represents a literal class containing the symbols from σ_i to the end
of the alphabet. Following these definitions, we look at a path from v_1 to
v_{K+1}:

    P = a_{1,z_1} → a_{z_1,z_2} → ... → a_{z_N,K+1}        (2.23)

such that
    1 < z_1 < z_2 < ... < z_N        (2.24)

Such a path defines a class configuration:

    C(P) = {C_1, C_2, ..., C_N, C_{N+1}}

where

    σ_1, ..., σ_{z_1−1} ∈ C_1
    σ_{z_1}, ..., σ_{z_2−1} ∈ C_2
    ...
    σ_{z_ℓ}, ..., σ_{z_{ℓ+1}−1} ∈ C_{ℓ+1}
    ...
    σ_{z_N}, ..., σ_K ∈ C_{N+1}        (2.25)

We now claim that a class structure (2.25) corresponding to a path of
the form (2.23) over V is valid. In order to prove that C(P) is valid, it is
necessary to show that it satisfies (2.2), (2.1) and (2.6).

The uniqueness and cover properties are a direct corollary of (2.24). For
any σ_i there exists one and only one pair z_k, z_{k+1} such that z_k ≤ i < z_{k+1},
and from (2.25) we conclude that σ_i belongs to exactly one class.

To prove that (2.25) satisfies (2.6), let us assume σ_i, σ_j ∈ C_{ℓ+1} (i < j). This
means that z_ℓ ≤ i < j ≤ z_{ℓ+1} − 1, and from (2.24) we see that for any σ_k
with i < k < j, also σ_k ∈ C_{ℓ+1}.

Each arc has a weight:

    w_{i,j} = (Σ_{ℓ=i}^{j−1} p_ℓ) · log_2(j − i) + (j − i) · B     if a_{i,j} ∈ A^{NF}
    w_{i,j} = (Σ_{ℓ=i}^{K} p_ℓ) · B                                if a_{i,j} ∈ A^{F}        (2.26)

The weight is taken directly from (2.7). The weight of non-final arcs is
taken from parts I and III, while the weight of final arcs is taken from part II.
The weight of the arc is the average number of bits per symbol² that such a class
would have required in the compressed text. Note that this weight includes
the number of bits required in the codebook. The arcs that extend from any
node v_i to the final node correspond to a literal class that begins at symbol
σ_i.
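Equation (2.26) translates almost directly into code. The following sketch (our own; `p` holds the symbol probabilities sorted in descending order, `B` is the uncompressed word length, and nodes are numbered from 1 as in the text) is reused by the implementation sketch in section 2.3.6:

    from math import log2

    def arc_weight(p, B, i, j, K):
        """Weight w_{i,j} of eq. (2.26); node K+1 is the final node."""
        if j == K + 1:                                       # final arc: literal class
            return sum(p[i - 1:K]) * B
        span = j - i                                         # class size (a power of two)
        return sum(p[i - 1:j - 1]) * log2(span) + span * B   # index bits + codebook entries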

Comparing (2.26) and (2.7), one can notice that the terms π_k are absent
from the former. These terms are related to the prefix bits. A key property
of (2.26) is that the weight of a certain class depends only on its size and
content (the probabilities of the symbols in it) and is absolutely independent
of any other class. Including the prefixes changes this situation. The
prefixes are determined by applying Huffman's algorithm [23] to the weights
of all the classes (2.8). Therefore, knowledge of all the classes is required to
find even a single class prefix length. We claim, however, that the influence
of ignoring the prefix length is minor. This claim shall be left as a heuristic,
since it cannot be proved exactly; only a general argument will be given.
² When using the symbol probability p_i as a measure, (2.26) gives the average number of bits per symbol. In practice, it is much more efficient to make these calculations in terms of actual bits, thereby eliminating costly floating-point calculations. This can be achieved easily by multiplying all the probabilities by the text length, making them count the actual number of occurrences of a symbol in the text. This step has no influence when seeking a minimum or maximum.

If, instead of Huffman codes, we had used another code for the prefix,
one that achieves the entropy bound, it would be possible to completely
separate the problems of index and prefix (as in (2.10)). Huffman codes
have some redundancy over the entropy, so this statement does not hold.
It is also very difficult to bound this redundancy based on a single symbol
probability [61].

Finally, we can write the optimization problem as a graph-theory problem:

    Given a graph G = (V, A) with V and A as defined, find the
    shortest path from v_1 to v_{K+1} using at most N + 1 steps.

Solving this problem requires only a small modification of the well known Bellman-
Ford shortest path algorithm (see, for example, [36]). In this algorithm the
j'th iteration finds the shortest paths from the origin to all the other
nodes using at most j + 1 steps (arcs). Usually, this process continues until no
improvement can be made to the path, or when all the arcs have been tested.
In our case, we will simply stop the algorithm after N + 1 iterations and take
the shortest path found between the origin and the final node. Having the
path, we translate it into a class configuration using (2.25), which is
now guaranteed to be both valid and optimal in the sense of compression ratio.

2.3.3 The Optimization Algorithm and the Prefix

As already mentioned, the prefix was not included in the optimization problem.
In this section we will explain why it can be approximately ignored.

First, we will explain why it is impossible to include the prefix in the optimization
method described in the last section. In order to know the
length of the prefix required for a certain class, it is not enough to
know which symbols belong to that class, as is the case with the index. To
correctly compute the prefix, we need to know the weights of all the classes.
This fact makes it impossible to accurately include the length of the prefix
in any algorithm that finds the optimal solution in a step-by-step manner.
Moreover, the relationship between symbol probabilities and their code lengths
(using Huffman encoding) is highly nonlinear ([65]), which renders numerical
optimization methods useless.

However, we claim that under certain conditions, ignoring the prefix has
negligible effect.

Let ρ be the redundancy of the prefix code. We write equation (2.7),
but instead of using the π_i notation for the bit length of each prefix, we add the
entropy of the prefixes:

    R = [ Σ_{i=1}^{N} P_i log_2 |C_i| + P_{N+1} · B + B · Σ_{i=1}^{N} |C_i| + (H_prefix + ρ) ] / B        (2.27)

where

    H_prefix = − Σ_{i=1}^{N+1} P_i · log_2 P_i        (2.28)

Expanding H_prefix we get

    R = [ Σ_{i=1}^{N} P_i (log_2 |C_i| − log_2 P_i) + P_{N+1} (B − log_2 P_{N+1}) + B · Σ_{i=1}^{N} |C_i| ] / B + ρ/B        (2.29)

The first part of the equation has the form of equation (2.7), thus we can
write:

    R = R̃ + ρ/B        (2.30)

R̃ is the "ideal" compression ratio – the one obtained when the prefix coding hits the entropy
bound. ρ, as defined, is the redundancy of the Huffman code. A basic property
of such a code is ρ ≤ 1 ([65]). This gives us a loose but absolute bound on
the redundancy (due to the prefix only) of class coding using Huffman coded
prefixes:

    ρ_class ≤ 1/B        (2.31)

We would like to improve this bound further, but for that we need to add
some assumptions. The extensive research on the redundancy of the Huffman
code reveals that if the probability of the most probable symbol is below
0.5, the upper bound drops dramatically ([65]). In natural languages, the
probability of the most frequent symbol is well below that threshold (see, for
example, the English language statistics in [6]). Moreover, calculating the cumulative
probability of some of the most frequent characters still yields a probability
that is much smaller than 0.5. Remembering the tendency of the optimization
algorithm to gather symbols with similar probabilities leads us to the
assumption that in such applications the redundancy is negligible and
the optimization algorithm can safely ignore the prefix.

2.3.4 Complexity

The complexity of the general Bellman-Ford algorithm is O(K³). In our
case, however, we make at most N iterations, thus reducing the complexity
to O(N · K²). Carefully inspecting the structure of the arcs in the graph,
we find that the number of arcs going into node v_k is at most log_2 K, so by
removing unnecessary comparisons the time complexity can be reduced even
further, to O(N · K · log_2 K).

The space complexity will be discussed later; it is O(N · K).

2.3.5 Example

As an example we will use the following text:

HDFCABACDFCCEEDCEBCBACDGDCCACACCEAEEE

The statistics of this text can be easily extracted (see table 2.1). We are
interested in coding this text using 2 classes (N = 2). Using (2.26), we build
a graph for the optimization problem, shown in figure 2.2 (for clarity, not all
arcs are drawn).
It can be immediately verified that the emphasized path is the shortest
path that uses at most three arcs in total (one for each class and one final arc for
the literal class). The following tables summarize the final coding: Table
2.2 summarizes the class configuration and Table 2.3 gives the coding for each
symbol in the compressed text.
Counting the bits in both the uncompressed and the compressed text
reveals that the original representation requires 111 bits whereas the compressed
text requires only 110 bits.
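For completeness, the count follows directly from Tables 2.1 and 2.2: the uncompressed text has 37 symbols × 3 bits = 111 bits, while the compressed text takes 12 × 2 bits for class C1, 13 × 2 bits for class C2 and 12 × (2 + 3) bits for the literal class C3, i.e. 24 + 26 + 60 = 110 bits (the codebook is not counted here).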

Symbol Prob. Code


C 12 010
E 7 100
A 6 000
D 5 011
B 3 001
F 2 101
G 1 110
H 1 111

Table 2.1: Symbol Probabilities for the example

Class |Ci | Pi Prefix Index Len. Members


C1 1 12 00 0 C
C2 2 13 1 1 E, A
C3 5 12 01 – D, B, F, G, H

Table 2.2: Class Assignment for the example

Symbol Code
C 00
E 10
A 11
D 01 011
B 01 001
F 01 101
G 01 110
H 01 111

Table 2.3: Final Symbol Coding



Figure 2.2: Class Structure Optimization Example


The selected path is emphasized



Several points in this example should be noticed:

• The algorithm tends to gather symbols that have close probability values.
The difference between p_1 and p_2 is large, but p_2 and p_3 are very
close, so they were assigned to a class of their own. p_4, however, was not
assigned to C_2 even though it is close to p_2 and p_3, since that would
require adding p_5 to C_2 as well, and the price is too high.

• Equation (2.7) can be verified here. The compression attained is rather
unimpressive (the compressed message is one bit shorter than the uncompressed
message); this is due to the small number of classes.

2.3.6 Implementation

The implementation starts with the definition of three arrays: f[k], which
holds p_k; u[l], which is the shortest path from the first node to the l'th node; and
trace[l,j], which holds the node through which the shortest path to node
j in the l'th step passes. We initialize u[l] with w_{1,l} according to (2.26),
which is the shortest path from the first node to the l'th node using one step
only. If there is no arc between v_1 and v_l, u[l] should be given an "infinite"
value (in practice MAXINT can be used).

Next, we run the following loop:

    for l = 2 to N+1
        for j = 2 to K+1
            u[j] ⇐ min over i < j of (u[i] + w_{i,j})
            trace[l,j] ⇐ the i for which the minimum is attained

After completion, u[K+1] will hold the value of the shortest path (which
is the length of the compressed message without the prefixes). trace[l,j]
allows us to recover the nodes through which that path has passed, and that
gives us the class structure.

The size of the trace array governs the space complexity of the algorithm,
which is O(N · K).
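A compact, runnable version of this procedure might look as follows (a sketch only; it reuses the arc_weight helper shown in section 2.3.2 and keeps whole paths instead of a trace array, so the recovered node boundaries give the class structure directly):

    def is_arc(i, j, K):
        """Arc a_{i,j} exists if it is a final arc or spans an integral power of two."""
        return j == K + 1 or (j <= K and (j - i) & (j - i - 1) == 0)

    def optimal_classes(p, B, N):
        """Shortest path from node 1 to node K+1 using at most N+1 arcs.
        Returns (cost, boundaries); boundaries [1, z_1, ..., K+1] encode the classes."""
        K = len(p)
        best = {j: (arc_weight(p, B, 1, j, K), [1, j])
                for j in range(2, K + 2) if is_arc(1, j, K)}      # one-arc paths
        for _ in range(N):                                        # allow up to N more arcs
            for i, (cost_i, path) in list(best.items()):
                for j in range(i + 1, K + 2):
                    if is_arc(i, j, K):
                        cand = cost_i + arc_weight(p, B, i, j, K)
                        if j not in best or cand < best[j][0]:
                            best[j] = (cand, path + [j])
        return best[K + 1]

    # The example of section 2.3.5: counts of C,E,A,D,B,F,G,H with 3-bit symbols, N = 2
    print(optimal_classes([12, 7, 6, 5, 3, 2, 1, 1], B=3, N=2))
    # (58.0, [1, 2, 4, 9])  ->  classes {C}, {E, A}, literal class {D, B, F, G, H}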

2.4 Experimental results

The algorithm described in the previous sections was implemented on a PC
system and tested for compression performance. The texts used are taken
from sources that are commonly used to evaluate compression, such as the
Calgary and the Canterbury corpora. Comparative data is also available
for some of these texts using previously introduced compressed matching
algorithms (described in Chapter 1). Table 2.4 summarizes the
text datasets.

Name         Description               Size     Reference
calgary      The Calgary corpus        3.1MB    [9]
world95      CIA World factbook 95     2.9MB    [63]
canterbury   The Canterbury corpus     2.66MB   [10]

Table 2.4: Datasets for compression experiments
algorithms (that were described on Chapter 1). Table 2.4 summarizes the
text datasets.

2.4.1 Implementation Issues

One of the most important parameters in an entropy coding system is the
model used to extract the probability information for the symbols. An in-depth
discussion of model choice for text compression is beyond the scope
of this work and can be found, for example, in [6]. The model used here
is a simple zero-order model, using character pairs as symbols. It is to be
noted that more elaborate models can be used. The original CodePack,
for example, divided each 32 bit word into two strictly alternating alphabets,
coding each separately. When we deal with the string matching problem,
such a simple model will be imperative.

Another issue is the transmission of the codebook. In order to improve
compression, a new codebook is selected for every file, and a method must
be established to transfer it. The method we used sends a small header for
each class (containing its size, prefix etc.) and then a list of the symbols in that
class. This method is optimized for a relatively small number of classes. Thus,
when we increase the number of classes, heading for the Huffman limit, the
codebook becomes less efficient and the compression degrades.

2.4.2 Results

The first variable we would like to examine is the number of classes. In section
2.2 we claimed that the compression ratio improves monotonically as the
number of classes increases. Figure 2.4 shows a single text (world95.txt)
compressed with a varying number of classes. The dashed line is the length
of the text compressed using Huffman code, and the approach towards the
latter is clearly seen.
Figure 2.3 shows compression results for several files selected from the
datasets. The figure shows that when the number of classes is small, increasing
it has a big impact on the result, whereas from some point on the curve
flattens and increasing the number of classes no longer yields a real improvement
in compression. This behavior is consistent across all the texts shown.


Figure 2.3: Compression ratio for some files from the testset with different
number of classes

Figure 2.4: Approaching the Huffman limit (X axis is logarithmic for conve-
nience)
Chapter 3

Code Compression for


Embedded Systems

3.1 System Parameters

When considering a code compression application, several design parameters


have to be decided on concerning the compression rate, decompression speed
and chip space requirements and the tradeoffs among them. The goal of this
chapter is to explore the influence of some design parameters of the class
compression system on these factors.

3.1.1 Probability Model

In order to apply an entropy encoding system, a proper probability model


must be used. The probability model determines the probability that the
next symbol (in the string to be compressed) is σ, for all σ ∈ Σ. Models are
characterized by their order. A zero order model specifies a constant proba-


bility for each symbol, regardless of the previous (or future) symbols. This is
similar to a source emitting random symbols at some specified probabilities.
An N-th order model assigns a probability to each symbol depending also on
its context, which is the previous N symbols. Higher order models have the
potential to provide a better compression ratio, since they capture more of the
behavior of the text. Higher order models, however, are more complicated
and require more time to build. In a code compression application, the decompression
hardware is in the critical path from the memory to the execution
units. Thus, it must perform its function very fast, which means that high
order models are not practical.

Three models were considered for our experiments:

• Full instruction. In this model, every 32 bit instruction was considered
as one atomic unit. The probability of each word was determined by
counting. This model is designated as "Model 32" in the results.

• Half instruction. In this model, every symbol consists of 16 bits, which
is half an instruction. There is no distinction between the lower and the
higher halves. This model is designated as "Model 16" in the results.

• High/low separation. This is similar to the half instruction model, but
the high and low parts are treated separately, each having its own probability
set. This is the model used by CodePack. Compared with the
other models, it requires approximately double the hardware. This
model is designated as "Model 16/16" in the results. (A sketch of how
symbols are extracted under each model is given below.)
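To illustrate the difference between the three models, here is a small sketch (our own; the instruction words are arbitrary example values) of how a stream of 32-bit instruction words is split into symbols under each model:

    def symbols_model32(words):
        """Model 32: each 32-bit instruction word is one symbol."""
        return list(words)

    def symbols_model16(words):
        """Model 16: every 16-bit half is a symbol; both halves share one alphabet."""
        return [half for w in words for half in (w >> 16, w & 0xFFFF)]

    def symbols_model16_16(words):
        """Model 16/16: high and low halves form two separate alphabets."""
        return [w >> 16 for w in words], [w & 0xFFFF for w in words]

    words = [0x27BDFFE0, 0xAFBF0014]                  # two arbitrary 32-bit words
    print([hex(s) for s in symbols_model16(words)])   # ['0x27bd', '0xffe0', '0xafbf', '0x14']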

3.1.2 Codebook Limit

The codebook is required when decompressing the code at runtime. Every
code word passes through it (except literals) in order to get translated to the
uncompressed word. The codebook must be available at all times and must
have fast access to minimize delays in the decompression process. To achieve
this, it is implemented as fast RAM in the decompression core. RAM costs
chip space proportional to its size, and thus it is desirable to have some facility to limit
its size. In the optimization algorithm presented in the last chapter there
was no such facility (the codebook size is determined along with the class
structure to obtain maximum compression). Luckily, only a small modification
is required.

The codebook contains all the codes that are not in the literal class. If
we look at the graph representation of the problem, we notice that the size
of the literal class is determined by the node from which the (selected) final
arc extends. If we want to limit the codebook size to K̃ symbols, we must
make sure that the selected final arc extends from a node no later than v_{K̃+1}. This is
easily accomplished by removing the arcs that extend from nodes beyond v_{K̃+1}.

Obviously, the class structure selected by the new graph may be subop-
timal in comparison with the unlimited graph, but it is reasonable to expect
degradation in compression performance after imposing a new constraint.
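One way to realize this limit in the optimization sketch of Chapter 2 is to generate final arcs only from nodes up to v_{K̃+1}, since a final arc leaving node v_i implies a codebook of i − 1 symbols (a sketch under our reading of the construction; names are ours):

    def is_arc_limited(i, j, K, codebook_limit=None):
        """Arc test of the optimization graph with an optional codebook limit K-tilde."""
        if j == K + 1:                          # final arc: codebook holds symbols 1..i-1
            return codebook_limit is None or i - 1 <= codebook_limit
        return j <= K and (j - i) & (j - i - 1) == 0   # non-final arc: power-of-two span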

3.1.3 Static vs. Dynamic Structure

In order to achieve maximum compression, a new class structure has to
be calculated for every code image. Such a method requires the hardware to
be capable of adjusting itself to new class structures at runtime (a possible
implementation of such hardware can be found in [59]). On the other hand,
a static structure, precomputed on some typical code, has less potential to
achieve good compression, but can be implemented in smaller and faster fixed
hardware. CodePack takes the latter approach. Its class structure is shown in
Figure 3.1.
Figure 3.1.

In order to determine the class structure for a static configuration, some
method must be applied. We have tried three methods (in the descriptions
below, the term characteristic set refers to a set of programs that are typical
of the target architecture):

• Averaging. In this method, we calculate the optimal class configuration
for every program in the characteristic set and then average the sizes
of the corresponding classes (i.e. all first classes, all second classes etc.). The result
is then rounded to the nearest integral power of two (a small sketch of
this method is given below).

• Most Frequent. Again, we calculate the optimal structure for all the
programs, and for every class we select the most frequently occurring
size.

• Set-optimal. In this method, every optimal class structure is used to
compress all the other programs in the characteristic set, and the structure
that gives the best result on average is selected.
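A sketch of the averaging method (our own illustration; it reuses the optimal_classes sketch from Chapter 2 and assumes every program in the set yields the same number of non-literal classes):

    from math import log2

    def average_structure(programs, B, N):
        """Average the optimal non-literal class sizes over a characteristic set and
        round each average to the nearest integral power of two."""
        all_sizes = []
        for counts in programs:               # counts: sorted symbol counts of one program
            _, boundaries = optimal_classes(counts, B, N)
            all_sizes.append([boundaries[k + 1] - boundaries[k]
                              for k in range(len(boundaries) - 2)])
        averages = [sum(col) / len(col) for col in zip(*all_sizes)]
        return [2 ** round(log2(a)) for a in averages]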

Figure 3.1: Class Structure of CodePack



Figure 3.2: Comparison of Probability Models for Code Compression

3.2 Experimental Results

Our experiments were performed on a set of programs from the SPEC2000 [58]
suite. The programs were compiled to the Alpha architecture, and the code
sections were extracted from the object files using the GNU binary utilities.

Figure 3.2 shows the influence of the probability model on compression.
The parameters used were N = 6 (similar to that of CodePack) with no
codebook limit. The three models discussed above are shown,
along with the compression rate attained by the ZIP compression utility. ZIP is
a general purpose compression program based on the Lempel-Ziv algorithm, and
it is given as a reference since it generally attains good results (ZIP files,
however, contain some management overhead). The results show that the
16/16 model provides the best results. Another conclusion is that the 0-order
approximation is reasonable for this kind of application (since the ZIP results
are not much better). Figure 3.3 compares the dictionary sizes for these models.
The 32 bit model requires a great amount of dictionary space, which is a major
disadvantage.

Figure 3.3: Comparison of Dictionary Sizes for Different Models

Figures 3.4 and 3.5 show compression ratio results for various codebook
limit values. The relationship between the codebook limit and the compression
performance is immediate – as the codebook limit gets tighter, less compression
is attained. It is to be noted that when the codebook size is not
limited, it does not grow to include all the symbols, since from some point on,
adding symbols to the dictionary only increases the size of the overall compressed
text. The optimization algorithm determines the optimal codebook
size along with the optimal class structure.

[Plot: compression ratio (%) versus number of classes, for codebook limits D = 128, D = 1024, D = 4096 and unlimited.]

Figure 3.4: The Effect of Codebook Limit (with varying N )

[Plot: compression ratio (%) versus dictionary size (words).]

Figure 3.5: The Effect of Codebook Limit (fixed N )



!"

Figure 3.6: Static Class Configuration

The last figure, 3.6, shows compression results for a static class configuration.
The class configuration was determined using the methods described in
the previous section and then applied to some files from the set (the characteristic
set was all the programs from SPEC2000). The results show that using
a static class structure incurs very little penalty, and thus it is a feasible design
option.
Chapter 4

Multi-resolution string
matching

4.1 Low resolution text

Let

T = t1 t2 t3 ...tℓ (4.1)

be a string (text) over Σ (as defined in equation (2.3)). The low resolution
image of T is defined as:

T̂ (T ) = t̂1 t̂2 ...t̂ℓ (4.2)

over the alphabet

Σ̂ = {σ̂1 , σ̂2 , ..., σ̂N , σ̂N+1 } (4.3)


where
t̂i = σ̂j , ti ∈ Cj (4.4)

Cj ∈ C is a class, as defined in chapter 2. Following this definition, T̂ has


three important properties:

• Uniqueness. For any text T , there is one and only one low resolution
image T̂ . This property is a direct corollary of the uniqueness property
of the class definition (2.2). The reverse does not hold: for any low
resolution image, there exist many texts that would produce it.

• Existence. A low resolution image exists for any text. This property is
a direct corollary of the cover property (2.1).

• Information content. The information content of T̂ is lower than the
information content of the text that was used to produce it, although
the number of symbols is identical in both. This is because
|Σ̂| < |Σ|. This means that fewer bits are required to represent the low
resolution image of the text than the actual text.

In order to recover a text from its low resolution image, more information
needs to be supplied. This information will be called resolving information.
A text represented as a low resolution image and resolving information will
be called a multi-resolution text coding. Note that the process can
be repeated several times for the text, which will yield a coding with several
levels of resolution. From this point on, we will restrict ourselves to two-level
coding only.

Figure 4.1: Multiple resolution text using class coding

Class coding, presented in the previous chapter, is one possible way
to code a text in a multi-resolution fashion. When encoding a text, class coding
produces a stream of prefix-index pairs. Taking the prefixes only, we get the
desired low-resolution image. In order to decode the text completely, we need
the indices, which are used as resolving information. Figure 4.1 illustrates
the process using the text from section 2.3.5.
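
As a small illustration (a sketch only, with an assumed list-of-lists representation of the class structure; Huffman coding of the prefixes and indices is omitted), the following fragment shows the decomposition: each symbol contributes its class number to the low resolution image and its position within the class to the resolving information.

    def multi_resolution_encode(text, classes):
        """classes[j] is assumed to hold the symbols of class C_{j+1}."""
        prefixes = []   # low resolution image: one class number per symbol
        indices = []    # resolving information: one within-class index per symbol
        for symbol in text:
            for j, cls in enumerate(classes):
                if symbol in cls:
                    prefixes.append(j)
                    indices.append(cls.index(symbol))
                    break
        return prefixes, indices

Searching only the prefixes list then corresponds to searching T̂ ; the indices list is consulted only when a candidate occurrence has to be resolved, as described in the next section.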

4.2 A string matching algorithm

4.2.1 Definition

Let

S = s1 s2 s3 ...sr ,   sj ∈ Σ (4.5)

Figure 4.2: Class-compressed text stream format

be a string. We wish to find the first (or all¹) occurrences of S in T. T,
however, is given in class compressed form, ordered as two separate blocks
- first comes the prefix block, then the index block, as illustrated in figure
4.2. The simplest way to accomplish the task is to extract the compressed
file into some temporary storage and then to apply known algorithms. As
discussed in the first chapter, this method has several drawbacks.

¹ The two problems are identical. We will refer only to finding all the occurrences.

The following algorithm is able to search the compressed file without


decompressing it first:

• Step 1. Using equation (4.4) and the class transformation G used to


build T , construct a low resolution image of S, Ŝ(S).

• Step 2. Find all the occurrences of Ŝ in T̂ (where T̂ is the prefix block
of the compressed text T ).

• Step 3. For any occurrence found, decode T locally and check whether
S actually occurs.

4.2.2 Details

In this section, we will focus on a detailed description of the algorithm and
on its feasibility and correctness. A discussion of performance and complexity will
be given in the following sections.

In the first step of the algorithm, a low resolution image of S has to be
built. This is done by coding S using the same class structure used to encode
T and discarding the indices. This class structure must be known, either
by including it in the compressed stream or by any other means (such as
a static class structure). The resulting string (over Σ̂) is unique, exactly as the
compressed form of T is unique for T .

In the second step, Ŝ is searched for in T̂ . Since the prefixes are Huffman
coded, this step includes a search within a variable-length code. This
fact means that the usual methods for accelerating searches (such as Boyer-
Moore [7] or Knuth-Morris-Pratt [31]) cannot be used. Moreover, each prefix
also carries information about the length of the index in its prefix-index pair.
As we shall see later, it is necessary to decode the index lengths from the
beginning of the prefix block up to each possible occurrence of Ŝ. Therefore,
the search strategy must be to decode all the prefixes serially. Note
that decoding all the prefixes of T (its low resolution image) is
significantly less costly than decoding the complete text, since the number

of possible prefixes is much smaller than the number of symbols, and their
Huffman codes are shorter accordingly.

The result of this step would be a list of matches. This list, however, is
valid only for the low resolution string. Since a single low resolution string
can be produced by many full resolution strings, this list may contain false
matches. The uniqueness property, on the other hand, asserts that if S
appears in the text, Ŝ must appear in T̂ . Thus, another step is required to
determine which of the occurrences of Ŝ is a true occurrence of S. To do
that, we need the resolving information stored as indices.

One possible way is to decode the indices, starting at the first symbol,
up to the occurrence found. This, however, means completely decoding
T , which is not desirable. Instead, it is possible to access the index block
“randomly”, start decoding at the occurrence and decode only as many symbols
as required. The indices, like the prefixes, are stored as variable-length
codes. The length of each index is, however, known at an earlier stage – when
its prefix was decoded. To access the i’th index, we only need to decode
the prefixes of t1 ...ti and accumulate the length of the index paired with every
prefix. That way, the exact location of the index of ti within the index block
is known without having to decode all the indices along the way.

After the indices have been checked and the occurrence of the string
being sought has been confirmed, it can be marked as a true occurrence.
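
The following sketch (illustrative Python under assumed representations: the prefix block already decoded into a list of class numbers, the index block as a string of '0'/'1' characters, and idx_len[c] giving the index length in bits for class c) shows how a candidate occurrence could be resolved without decoding the whole index block.

    def verify_match(prefix_classes, index_bits, match_pos, pattern, classes, idx_len):
        # accumulate the index lengths of all symbols preceding the candidate
        offset = sum(idx_len[c] for c in prefix_classes[:match_pos])
        # decode only len(pattern) indices starting at that offset and compare
        for k, symbol in enumerate(pattern):
            c = prefix_classes[match_pos + k]
            width = idx_len[c]
            index = int(index_bits[offset:offset + width], 2) if width else 0
            if classes[c][index] != symbol:
                return False      # the low resolution images agree, the texts do not
            offset += width
        return True               # a true occurrence of the pattern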

The process is illustrated in figure 4.3. In part (a) of the figure, the
text is shown (the text, the classes and the symbol codes are taken from the
example at section 2.3.5) and the low resolution image is given just below.
The numbers 1, 2 and 3 denote here that the symbol belongs to C1 , C2 or C3 .

This information is stored as an integral part of the compressed form of T , so it
need not be calculated before any search. Part (b) shows the string sought,
S, and its low resolution image. In contrast to the former, this image needs
to be calculated before any search (when S changes). The class structure
used is exactly the one that was used to encode T , which means that the
optimization process need not be repeated. The next step is to search for all
the occurrences of Ŝ in T̂ . These occurrences are marked in part (a). Ŝ
appears 3 times in T̂ , of which only one is a true match.

The process of resolving the occurrences is shown in part (c). Decoding
the prefix block yields not only the prefixes but also the length of the index
associated with each prefix. This length is shown over every prefix. The
first Ŝ occurs after 3 symbols, and accumulating the index lengths of the first
three symbols provides a pointer into the index block. From this point, three
symbols (the length of S) have to be decoded to determine that this is a true
occurrence.

The same procedure is applied to the next occurrences of Ŝ, but resolving
the original text in these cases reveals that these are false matches.
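
Putting the steps together, the outline below combines the two sketches given earlier in this chapter (multi_resolution_encode and verify_match). For brevity the low resolution search is written as a naive scan; the actual implementation uses an automaton-based search, as described in section 4.5.1.

    def compressed_search(prefix_classes, index_bits, pattern, classes, idx_len):
        s_hat, _ = multi_resolution_encode(pattern, classes)       # step 1: build the pattern's low resolution image
        matches = []
        for pos in range(len(prefix_classes) - len(s_hat) + 1):    # step 2: scan the low resolution text
            if prefix_classes[pos:pos + len(s_hat)] == s_hat:
                if verify_match(prefix_classes, index_bits, pos,   # step 3: resolve using the index block
                                pattern, classes, idx_len):
                    matches.append(pos)
        return matches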

4.2.3 Double-byte symbols

In order to improve the compression performance, every symbol, as presented
in the previous chapter, consists of two characters. This poses a problem
for the search algorithm, since the basic symbol of the string being searched
is only one character long. This may cause ambiguity, since every single string
can be encoded in two different ways.

Figure 4.3: Illustration of string matching procedure



Consider, for example, the following text:

TODAY IS THE DAY

It is encoded as double-bytes in the following way:

TO DA Y␣ IS ␣T HE ␣D AY
σ1  σ2  σ3  σ4  σ5  σ6  σ7  σ8

We wish to search for the string DAY within it. The naïve approach will
encode this string as σ2 σ3 . There may actually be several possibilities for
the last symbol, so it can simply be dropped (remembering to match the last
character at some later stage). This choice, however, will miss the second
occurrence of the string.
The solution to this situation is to perform two searches: the first is
done using the conversion described above; the second starts matching the
target string from its second character. In the last example, the
first search would try to find the sequence σ2 σ3 and the second would try to
find the sequence σ8 . Combining the results from both searches yields
the desired result.
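
The following sketch (illustrative only; char_pair_to_class is a hypothetical mapping from a two-character pair to its class number) derives the two low resolution search patterns from a character string. Unpaired characters at either end are left to be matched during the verification step.

    def double_byte_patterns(pattern, char_pair_to_class):
        """Return the low resolution pattern for both possible pair alignments."""
        variants = []
        for start in (0, 1):                              # the two alignments
            body = pattern[start:]
            body = body[:len(body) - (len(body) % 2)]     # drop an unpaired last character
            variants.append([char_pair_to_class[body[i:i + 2]]
                             for i in range(0, len(body), 2)])
        return variants

Both variants are searched, and the union of the (verified) results is reported.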

4.3 Analysis

The common tool for estimating the performance of algorithms is asymptotic
complexity. Reviewing the algorithm described, we see that in the first
step, the prefix block, which is ℓ symbols long, has to be searched for all the
occurrences of a single string. The complexity of such an algorithm is O(ℓ + r)
in the worst case (assuming the use of a slightly more intelligent algorithm

than brute force). In the second step, in the worst case (all the characters
are decoded), another brute force search is performed, thus the overall complexity
is O(ℓ + r). This places the algorithm in the efficient category according to
the taxonomy of Amir and Benson [1].

From the complexity point of view, the algorithm is efficient (since it runs in
polynomial time), but it is no faster than a brute force search (in the uncompressed
text).

This is, however, not the complete picture. The prefix decoding stage
consists of a sequential pass over the compressed file, whereas the index
decoding stage is based on random access. In most cases (when the file is
stored on a device such as a disk, CD-ROM etc.) random access is much slower.
Therefore, the real performance measure is the rate of false matches, i.e.
matches in the low resolution part that are not matches in the text. This
phenomenon cannot be observed with the asymptotic complexity tool, since it
ignores constants.

In order to estimate the rate of false matches, a program was designed
to implement the algorithm and measure this value (along with the search
time). The test methodology follows the one presented in [29]. For each
tested file (a subset of the files in table 2.4), random strings were selected
having lengths from 4 to 12 symbols, 10 of each. Though [29] selected
even longer strings, it would not have much meaning here because a) the false
match rate falls sharply as the searched string gets longer; and b) such long
strings are uncommon. The results of the experiments are given below.

Finally, some actual search time measurements are made.



4.3.1 Relation of number of classes to search speed

Increasing the number of classes has two effects. The first is an increase in the
number of distinct symbols in the low resolution text. The outcome is
that the low-resolution part becomes closer to the full text, and the false
match rate is expected to drop. On the other hand, the increasing number
of symbols in the prefix block requires more bits to represent each prefix, and the size of
the prefix block increases. An increase in the prefix block size will increase the
execution time of the first stage.

The conclusion is that there is expected to be a number of classes that
is optimal with respect to search time. The location of the optimum depends
heavily on the access time of the device where the compressed text
is stored. It is therefore also highly dependent on speedup techniques commonly
employed, such as caching and buffering.

A possible search system configuration is one where the prefix block is stored on a
fast but space-limited device (local disk or memory) and the index block is
stored on a larger, but slower, device. In such a system, the size of the prefix
block is of greater importance and N can be controlled to achieve a desired
size.

4.4 Extra space

While running, the search program needs no extra space that depends
on the length of the uncompressed file. The only memory required is used to
accelerate decoding of the prefixes' Huffman codes and searching through them.

4.5 Experimental Results

4.5.1 Implementation Notes

Several points in the implementation of the algorithm are worth noting:

• Prefix Decoding. The method used to decode the prefix block is byte
oriented. Each prefix is aligned to a byte boundary and then fed into
a lookup table (containing 256 entries) which decodes the prefix at the
beginning of the byte and outputs the class and its actual length. The
length is then used to determine the location of the next prefix, and the
class continues to the low-resolution string matching algorithm. The
table has constant length, and it is built in constant time. The only
drawback is that the length of the longest prefix is limited to 8 bits (we
can, of course, use a larger table). In practice, however, this limit is
sufficient. A different way to overcome this limitation is to allocate the
prefix codes using a length-limited algorithm (for instance [35] or [57]),
which comes at the expense of compression. A sketch of such a lookup
table is given after this list.

• String Matching Algorithm. The string matching algorithm in question is
the one used to find the low resolution image of S in the low resolution
section of T . As the prefixes are decoded one by one, serially, no
benefit can be gained from character-skipping algorithms such as Boyer-
Moore. On the other hand, using the simple brute force algorithm can
lead to unnecessarily long search times. The solution is an algorithm
that matches the target pattern within a text using a deterministic
finite automaton (DFA). This is a two stage process. In the first step

the automaton is built (the algorithm is described in [48] and has a
reference implementation in [11]) and then every new symbol can be
fed to it. If the automaton enters its final state, a match has been found.

• Measuring Execution Time. In order to measure time correctly and accurately,
all the programs were run under the DOS operating system. To
prevent inaccuracies due to I/O access, the tested files were copied in their
entirety to memory and the processing time was measured for execution
on that copy. The timing results are treated as simulation times,
and the time measurement units were deliberately not specified. The
results are for comparison purposes only and should not be interpreted
as absolute times.
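
As an illustration of the byte-oriented table described in the first item above (a sketch only, not the actual implementation; the prefix block is represented here as a string of '0'/'1' characters, and codes is assumed to map each class to its prefix code given as a bit string), the fragment below builds the 256-entry table and uses it to decode the class sequence.

    def build_prefix_table(codes):
        """codes: e.g. {0: '0', 1: '10', 2: '110', 3: '111'}; the longest
        prefix code is assumed to be at most 8 bits long, as in the text."""
        table = [None] * 256
        for cls, code in codes.items():
            pad = 8 - len(code)
            base = int(code, 2) << pad
            for tail in range(1 << pad):          # every byte that starts with this code
                table[base | tail] = (cls, len(code))
        return table

    def decode_prefixes(bits, table):
        """Yield the class of every prefix in the prefix block `bits`."""
        pos = 0
        while pos < len(bits):
            byte = int(bits[pos:pos + 8].ljust(8, '0'), 2)   # align the next code to a byte
            cls, length = table[byte]
            yield cls
            pos += length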

4.5.2 Results

Figure 4.4 shows the compression results of some of the text files used in the
experiments, compared with ZIP. Unlike the code compression application, natural
language has more built-in redundancy ([14]), and the use of a 0-order model
cannot exploit it, so ZIP (which does) gives much better results.

Figures 4.5, 4.6 and 4.7 show the results obtained from the search runs
described in the previous sections.

The graph in figure 4.5 shows the rate of false matches as a function of the
number of classes (N ) and the length of the searched pattern. The Y axis gives
the percentage of all the substrings of T̂ that are false matches. It is evident
that the false match rate drops with increasing length and number of
classes. The drop due to the number of classes is attributed to the fact that

as the number of possible symbols (in the prefix) rises, the probability of finding
an arbitrary string of symbols (that is, one that does not result from a true
match) drops. The dependence of search speed on the pattern length exists
in all search algorithms, but in this algorithm it is much more pronounced.
Again, this is due to the fact that the probability of finding some arbitrary
string of symbols drops sharply as the length increases. Eventually, the only
prefix string that will match the pattern being searched is the true match,
and the number of false matches will drop to zero.

The graph in figure 4.6 shows the search time depending on the same
variables. The resemblance between this graph and the previous graph is
salient. This fact supports the claim that the search time depends mainly on
the false match rate. The last graph shows the relation between the
search time and the rate of false matches. Again, the relation is clear.

The graph in figure 4.8 shows the search time as a function of the
number of classes. This graph was taken without operating system
caching. The expected behavior of the search time, as previously discussed, is
evident from the graph. Under these conditions, the optimal number of classes is
about 10. The file that was searched is world95 and the string searched was
6 characters long.

Finally, figure 4.9 shows a comparison between our algorithm and
the lzgrep program [45]. The graph shows that in some cases, our algorithm
slightly outperforms lzgrep, even though the latter uses the fast Boyer-
Moore method. This comes, however, at the expense of compression (the LZ-
compressed file is about 20 percent smaller).

Figure 4.4: Class coding compared with other compression algorithms

Figure 4.5: False matches results. [Plot: false match ratio (0–0.18) versus number of classes N (0–35), for pattern lengths r = 4, 6, 10, 12.]



Figure 4.6: Search time results. [Plot: search time (0–0.0016) versus number of classes N (0–35), for pattern lengths r = 4, 6, 10, 12.]

Figure 4.7: Search time versus false matches. [Plot: search time (0–0.002) versus false match ratio (0.25 down to 0), for N = 4, 6, 10.]



Figure 4.8: Optimal selection of N

Figure 4.9: Comparison of Search Times



4.6 Enhancement Possibilities

In this section, several possible enhancements to the base algorithm are presented.
They were not implemented, tested or studied, but they are examples
of how this approach can be generalized.

• Non-text files. Throughout this work, only text files were mentioned
and considered. However, no feature of the algorithm relies on this fact. Unlike
other methods in this field of research ([42] for instance), the method
presented here can be used without any changes for arbitrary file types,
including text in various encoding schemes and binary files.

• Approximate matching. Approximate pattern matching is a generaliza-


tion of the (exact) pattern matching problem. In this problem, we try
to find all the occurrences of some pattern within a text, but we allow
some differences (errors) in the matches. The differences are usually
expressed as some distance function and some maximum distance is
allowed. When looking only at the first stage of our algorithm, i.e.
the search in the low resolution part, we get this kind of algorithm
exactly. The number of classes can be used to control the allowed
distance. Moreover, a variation of this problem is the approximate
counting problem, in which we are interested in counting the number
of occurrences of the pattern. Yet again, by counting the number of the
occurrences in the low resolution part such result is naturally obtained.

• Regular expressions. Regular expressions are expressions describing sets
of strings. They are commonly searched for in texts. The usual

way of handling these expressions is to compile them into a deterministic
finite automaton, into which the symbols of the text are
fed. This approach can also be applied to the low resolution part of
the text (given the appropriate translation) and then, in the same
way described here, the resolving information is checked to find the desired
results.
Chapter 5

Summary

In this work we have dealt with several problems. The first problem was
class coding and its optimization. The method was introduced and an algorithm
for optimizing the class structure was developed. This algorithm
is efficient and was shown to run faster than the brute force approach. Though
the algorithm handles only the index part of the code, a heuristic argument
suggesting that this method gives good results was given.

The second problem is the problem of compressed code execution.
The benefits of class coding for this application were detailed, followed by
experimental results demonstrating the influence of several system parameters
on the final results. The improvement here over the CodePack system
(which uses class coding) is the ability to control the parameters easily and
to adapt this method to processors other than PowerPC.

The third and last problem is the problem of compressed matching. This
problem, which has gained popularity recently, was defined, and a novel
algorithm based on class compression was suggested. The algorithm is


based on a decomposition of the text into two parts – a low resolution part
and resolving information. Unlike in the image processing area (from which
the term “low resolution image” is taken), the low resolution part has little
meaning by itself (the original text cannot be recovered from this part alone),
but it is very helpful for quickly filtering the text, so that the actual search
can be made fast. The chapter concluded with experimental results
showing search times.
Bibliography

[1] A. Amir and G. Benson. Efficient two-dimensional compressed matching.
In Proc. Second IEEE Data Compression Conference, pages 279–288,
1992.

[2] A. Amir, G. Benson, and M. Farach. Let sleeping files lie: Pattern
matching in z-compressed files. Journal of Computer and System Sci-
ences, 52:299–307, 1996.

[3] A. Amir, G.M. Landau, and U. Vishkin. Efficient pattern matching with
scaling. In Proceedings of the first annual ACM-SIAM symposium on
Discrete algorithms, pages 344–357, 1990.

[4] A. Orpaz and S. Weiss. A study of CodePack: Optimizing embedded
code space. In Proceedings of the Tenth International Symposium on
Hardware/Software Codesign, pages 103–108, 2002.

[5] A. Orpaz and S. Weiss. Pattern matching by means of multi-resolution
compression. In Proceedings DCC 2003 Data Compression Conference,
page 441, 2003.


[6] T. C. Bell, J. G. Cleary, and I. H. Witten. Text Compression. Prentice


Hall, New Jersey, 1990.

[7] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Communications
of the ACM, 20(10):762–772, 1977.

[8] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression


algorithm. Technical Report 124, Digital Equipment Corp, 1994.

[9] The Calgary Corpus.


http://links.uwaterloo.ca/calgary.corpus.html.

[10] The Canterbury Corpus. http://corpus.canterbury.ac.nz.

[11] C. Charras and T. Lecroq. Exact String Matching Algorithms.
http://www-igm.univ-mlv.fr/~lecroq/string/index.html.

[12] I.C. Chen. Enhancing Instruction Fetching Mechanism Using Data


Compression. PhD thesis, University of Michigan, 1997.

[13] I.C. Chen, P. Bird, and T. Mudge. The impact of instruction compres-
sion on I-cache performance. Technical Report CSE-TR-330-97, EECS
Department, University of Michigan, 1996.

[14] M. Crochemore, F. Mignosi, A. Restivo, and S. Salemi. Text compression


using antidictionaries. Lecture Notes in Computer Science, 1644:261–
270, 1999.

[15] T. Eilam-Tzoreff and U. Vishkin. Matching patterns in strings subject to


multi-linear transformations. Theoretical computer science, 60(3):231–
254, 1988.

[16] J. Ernst, C.W. Fraser, W. Evans, S. Lucco, and T.A. Proebsting. Code
compression. In Proc. Conf. on Programming Languages Design and
Implementation, pages 358–365, June 1997.

[17] M. Farach and M. Thorup. String matching in Lempel-Ziv compressed
strings. In Proc. of the twenty-seventh annual ACM symposium on Theory
of computing (STOC), pages 703–712, 1995.

[18] M. Franz and T. Kistler. Slim binaries. Communications of the ACM,


40(12):87–94, 1997.

[19] C.W. Fraser and T.A. Proebsting. Finite-state code generation. In Proc.
Conf. on Programming Languages Design and Implementation, pages
270–280, May 1999.

[20] M. Game and A. Booker. CodePack: Code Compression for PowerPC


Processors. International Business Machines (IBM) Corporation, 1998.

[21] L. Gąsieniec and W. Rytter. Almost-optimal fully LZW-compressed pattern
matching. In Proceedings DCC'99 Data Compression Conference,
1999.

[22] J. Hoogerbrugge, L. Augusteijn, J. Trum, and R. van de Wiel. A code


compression system based on pipelined interpreters. Software - Practice
and Experience, 29(11):1005–1023, 1999.

[23] D.A. Huffman. A method for the construction of minimum redundancy


codes. Proc. IRE, 40(9):1098–1101, September 1952.

[24] IBM. CodePack: PowerPC Code Compression Utility User’s Manual.


Version 3.0. International Business Machines (IBM) Corporation, 1998.

[25] T.M. Kemp, R.M. Montoye, J.D. Harper, J.D. Palmer, and D.J. Auer-
bach. A decompression core for PowerPC. IBM Journal of Research and
Development, 42(6):807–812, Nov 1998.

[26] T. Kida, M. Takeda, A. Shinohara, and S. Arikawa. Shift-and approach


to pattern matching in LZW compressed text. In Proceedings of the
10th Annual Symposium on Combinatorial Pattern Matching, pages 1–
13. Springer-Verlag, 1999.

[27] T. Kida, M. Takeda, A. Shinohara, M. Miyazaki, and S. Arikawa. Mul-


tiple pattern matching in LZW compressed text. In Data Compression
Conference, pages 103–112, 1998.

[28] K. Kissell. MIPS16: High-density MIPS for the Embedded Market. Sil-
icon Graphics MIPS Group, 1997.

[29] S. Klein and D. Shapira. A new compression method for compressed


matching. In IEEE Data Compression Conference (DCC), pages 400–
409, Snowbird, Utah, March 2000.

[30] S. Klein and D. Shapira. Pattern matching in Huffman encoded texts.


In Proceedings of the Data Compression Conference, 2001.

[31] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in
strings. SIAM J. of Computing, 6:323–350, 1977.

[32] M. Kozuch and A. Wolfe. Compression of embedded system programs.


In Proc. Int’l Conf. on Computer Design, 1994.

[33] M. Kozuch and A. Wolfe. Performance analysis of the compressed


code RISC processor. Technical Report CE-A95-2, Princeton Univer-
sity Computer Engineering, 1995.

[34] S.Y. Larin and T.M. Conte. Compiler-driven cached code compression
schemes for embedded ILP processors. In Proc. Int’l Symp. on Microar-
chitecture, pages 82–92, November 1999.

[35] L. Larmore and D.S. Hirschberg. A fast algorithm for optimal length-
limited Huffman codes. Journal of the ACM, 37(3):464–473, Jul 1990.

[36] E.L. Lawler. Combinatorial Optimization. Holt, Rinehart and Winston,
July 1976.

[37] A.C. Lear. Shedding light on embedded systems. IEEE Software,


16(1):122–125, January/February 1999.

[38] E.A. Lee. What’s ahead for embedded software? IEEE Computer,
33(9):18–26, September 2000.

[39] C. Lefurgy, P. Bird, I.C. Chen, and T. Mudge. Improving code density
using compression techniques. In Proc. Int’l Symp. on Microarchitecture,
pages 194–203, December 1997.

[40] H. Lekatsas, J. Henkel, and W. Wolf. Code compression for low power
embedded system design. In Proceedings of the 37th conference on De-
sign automation, pages 294–299, 2000.

[41] S. Liao, S. Devadas, and K. Keutzer. A text compression based method


for code size minimization in embedded systems. ACM Transactions on
Design Automation of Electronic Systems, 4(1):12–38, January 1999.

[42] U. Manber. A text compression scheme that allows fast searching di-
rectly in the compressed file. In Proceedings of the 5th Annual Sym-
posium on Combinatorial Pattern Matching, pages 113–124. Springer-
Verlag, Berlin, 1994.

[43] S. Mitarai, M. Hirao, T. Matsumoto, A. Shinohara, M. Takeda, and


S. Arikawa. Compressed pattern matching for SEQUITUR. In Data
Compression Conference, pages 469–, 2001.

[44] E.S. De Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and


flexible word searching on compressed text. ACM Transactions on In-
formation Systems, 18(2):113–139, April 2000.

[45] G. Navarro and J. Tarhio. lzgrep – a direct compressed text search tool.

www.dcc.uchile.cl/gnavarro/software.

[46] J. Rautio, J. Tanninen, and J. Tarhio. String matching with stopper


encoding and code splitting. In Combinatorial Pattern Matching, 13th
Annual Symposium, CPM 2002, Fukuoka, Japan, July 3-5, 2002, Pro-
ceedings. Springer, July 2002.

[47] R.L. Rivest. On the worst-case behaviour of string-searching algorithms.
SIAM J. of Computing, 6(4):669–674, December 1977.

[48] G. Rozenberg and A. Salomaa. Handbook of formal languages, vol. 1:


word, language, grammar. Springer-Verlag New York, Inc., 1997.

[49] A. Said and W.A. Pearlman. Low-complexity waveform coding via al-
phabet and sample-set partitioning. In Visual Communications and
Image Processing ’97, Proc. SPIE Vol. 3024, pages 25–37, Feb. 1997.

[50] B. Santo. Embedded battle royale. IEEE Spectrum, 38(12):36–41, De-


cember 2001.

[51] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shino-


hara, and S. Arikawa. Byte pair encoding: a text compression scheme
that accelerates pattern matching. Technical Report DOI-TR-CS-161,
Department of Informatics, Kyushu University, April 1999.

[52] Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara, T. Shinohara,
and S. Arikawa. A Boyer-Moore type algorithm for compressed pattern
matching. In Proceedings of the 11th Annual Symposium on Combinatorial
Pattern Matching. Springer, 2000.

[53] Y. Shibata, M. Takeda, A. Shinohara, and S. Arikawa. Pattern matching


in text compressed by using antidictionaries. Lecture Notes in Computer
Science, 1645:37–49, 1999.

[54] Graham A. Stephen. String Searching Algorithms. World Scientific,


1994.

[55] M. Takeda. Pattern matching machine for text compressed using fi-
nite state model. Technical report, Department of Informatics, Kyushu
University, October 1997.

[56] J.L. Turley. Thumb squeezes ARM code size. Microprocessor Report,
9(4), March 1995.

[57] D.C. Van Voorhis. Constructing codes with bounded codeword lengths.
IEEE Transactions on Information Theory, 20(3):288–290, March 1974.

[58] C.T. Weaver. SPEC 2000 Binaries.
www.eecs.umich.edu/~chriswea/benchmarks/spec2000.html.

[59] S. Weiss and S. Beren. Class-based decompressor design for compressed
instruction memory in embedded processors. IEEE Transactions on
Computers, 52(11):1495–1500, Nov 2003.

[60] T.A. Welch. A technique for high-performance data compression. IEEE


Computer, 17(6):8–19, June 1984.

[61] W. Szpankowski. Asymptotic average redundancy of Huffman (and other)
block codes. IEEE Transactions on Information Theory, 46(7):2434–
2443, Dec 2000.

[62] A. Wolfe and A. Chanin. Executing compressed programs on an em-


bedded RISC architecture. In Proc. Int’l Symp. on Microarchitecture,
pages 81–91, 1992.

[63] The 1995 CIA World Factbook.
http://www.ibiblio.org/gutenberg/etext96/world95.txt.

[64] Y. Xie, W. Wolf, and H. Lekatsas. A code decompression architecture


for VLIW processors. In Proceedings of the 34th annual ACM/IEEE
international symposium on Microarchitecture, pages 66–75, 2001.

[65] C. Ye and R.W. Yeung. A simple upper bound on the redundancy of


Huffman codes. IEEE Transactions on Information Theory, 48:2132–
2138, July 2002.
