Kida Data Compression Using Variable-To-Fixed Length Codes

Data Compression using
Variable-to-Fixed Length Codes

Hokkaido University
Graduate school of Information Science and
Technology, Division of Computer Science
Takuya Kida
kida@ist.hokudai.ac.jp
with Satoshi Yoshida, Hirohito Sasakawa, and Kei Sekine
Compact Data Structure for Bigdata @Shonan meeting1
2013/9/29
Contents
Background and aim
Related-works
Variable-to-Fixed length codes (VF codes)

Grammar-based compression
Re-Pair algorithm
Proposed method (Re-Pair VF codes)

Experimental results
Summary
Biography
Background and aim

Meteorological satellite data
Aerial photographs
Internet SNS
Road Information
Information from Citizens
Remote Sensing Data
UGC
User Generated Content
Data
GPS
GPS log
Life log
Open Smart Federation Architecture
GPS Logs & Life Logs
Cloud
Computing
Big Data
Various and Massive Stream Data

Each data entry isnt significant. The data are usually too redundant to store into an
RDB system
Develop an accessible data compression whose

compressed data are reusable for data searching and
analyzing
Related works
Problem Existing methods are inconvenient for
reusing the compressed data, because they must follow
the decompression process.
Amir & Benson 92
Tunstall 67
Dawn of Compressed Pattern Matching
Root of VF codes
Larsson & Moffat 99
Kida et al.98, 99, 00

Efficient CPM Algorithms
Kieffer & Yang 00
Klein & Shapira 08

Suffix tree based VF coding
Kida 09, 10
Brisaboa et al. 03, 10

Dence codings for CPM
Efficient STVF coding algorithms
Yoshida & Kida 12
Combination of Grammar Comp. & VF Codes
Originality We have data compression techniques

that can search keywords on compressed data
directly without decompressing and reach a relatively
good compression performance.
Variable-to-Fixed Length Codes

Compression method that splits the input text into
variable length substring and then converts them
into fixed length codewords.
Strong point Easy to handle the compressed data
=> Enables fast information retrieval or data mining
Weak point Hard to get a good compression ratio
Grammar
Compressio
ns
model data very
well
VF codes
provide high
accessibility
Satisfy both of high comp. ratio and accessibility!

5
Variations of codes
Input Text
Fixed
Length
Compressed
Text
Variable
Length
Fixed Length
Variable Length
FF Code
VF Code
(Fixed length to Fixed

length code)
(Variable length to Fixed

length code)
FV Code
VV Code
(Fixed length to
Variable length code)
(Variable length to Variable

length code)
Tunstall
Code
Grammar-based compression
Input text
ABDABCABCCABDABC
Grammar
E ABC
F ABDE
FECF
Construct a (context
free) grammar that
generates only the
input text
Encode the
grammar
Compressed
text
0101110011101
000011100
Re-pair algorithm
Replace most frequent bigram into a new symbol
dictionary
D AA
E DA
F EB
G CF
Encode with
variable length code
(a variation of Huffman coding)
Proposed Method
(Re-Pair VF codes [Kida&Yoshida
DCC2013])
Compact Data Structure for Bigdata @Shonan meeting
2013/9/29
The basic idea

In the Re-Pair algorithm,
symbol replacement does not always imply
improvement of compression ratio.
We want to encode with the best intermediate
dictionary, which gives highest compression ratio.
We find the best dictionary and then

expand the rules that do not included in it
when translating to a grammar.
Finally, we encode the obtained grammar
by Fixed length codes.
10
How to find the best?

Dict.
D AA
E DA
F EB
G CF
H GE
I FF
J IG
K HI
AADAEBCFGEFFIGHI EIKJBCEFICEK
Compressed sequence:
EIKJBCEFICEK
n
2s + n
The number of sort of symbols: s +
We need log(s + ) bits for each symbol.
sum: (2 s + n )log(s + ) bits

Calculate this value for
each substitution to store
the size of dictionary that
gives
the
highest
compression ratio.
11
Rewind the dictionary

Dictionary
D AA
E DA
F EB
G CF
H GE
I FF
J IG
K HI
Compressed sequence
EIKJBCEFICEK
expand this part
Compressed sequence by the

best dictionary:
EFFGEFFFFFFGBCEFFFCEGEFF
AADAEBCF EFFGEFFFFFFGBCEFFFCEGEFF
12
Experiments
We compared compression ratios, compression times,
decompression times, and pattern matching speeds among these
methods.
Method
VF code via Re-pair (Re-Pair-VF)

Original Re-pair (Re-Pair)
Tunstall code
STVF code (proposed by Kida)
gzip (zgrep which decompress the data then search
keywords)
Data
dazai.utf.txt (Japanese text (UTF-8 encoded), 7MB)

dblp2003.xml (XML data, 90MB)
gbhtg119 (DNA data, 87MB)
reuters21578 (English text, 19MB)
13
Compression ratio
Compression ratio (%)
[Yoshida&Kida2013]
80
60
Re-Pair-VF
Re-Pair
40
Tunstall
STVF
20
gzip
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
Re-Pair-VF overcomes gzip!
14
Compression speed
[Yoshida&Kida2013]
35
) 30
c
e
s
/
B 25
M
(
d
e 20
e
p
s
n 15
o
i
s
s
e
r 10
p
m
o 5
C
Re-Pair-VF
Re-Pair
Tunstall
STVF
gzip
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
15
Decompression speed
[Yoshida&Kida2013]
Decompression speed (MB/sec)
140
120
100
Re-Pair-VF
80
Re-Pair
Tunstall
60
STVF
40
gzip
20
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
16
Search speed [Yoshida&Kida2013]

(for reuters21578[English news paper])
Throughput (MB/sec)
200
150
100
Re-pair-VF best
Tunstall
gzip
50
0
0
10
20
30
40
50
Pattern length
17
Application to large data [Sekine etal.2013]

(English 2.2GB text from Pizza&Chili corpus)
Our proposed method achieved a good compression ratio

comparable to bzip2 (about 27%) for large texts
(codeword length 19bit, block size 128MB, ratio of shared dictionary 3/8, sampling text size 128MB)
18
Application to large data [Sekine etal.2013]

(English 2.2GB text from Pizza&Chili corpus)
The proposed method takes much time for

compressing,
but it can decompress at high speed!
19
Summary
We focused attention to VF codes, which are useful
for reusing compressed data, and developed an
efficient VF coding based on Re-pair algorithm.
The coding enables us to do fast keyword search, in
addition to make partial decompression and recompression easy.
Future works
Improvement of compression speed
Realizing pinpoint access to compressed data with an
original text position.
Development an online coding algorithm
20
Thank you for your kind

attention!
21
Biography
Takuya Kida received the B.S. degree in
Physics, the M.S. and Dr. (Information
Science) degree all from Kyushu University
in 1997, 1999, and 2001, respectively. He
was a Full-time Lecturer of Kyushu University
Library from October 2001 to March 2004.
He is currently an Associate Professor of
Division of Computer Science, Graduate
School
of
Information
Science
and
Technology, Hokkaido University, since April
2004. His research interests include Text
Algorithms and Data Structures, Information
Retrieval, and Data Compression. He is a
member of IEICE, IPSJ, and DBSJ.
22

Kida Data Compression Using Variable-To-Fixed Length Codes

Hochgeladen von

Dokumentinformationen

Copyright

Verfügbare Formate

Dieses Dokument teilen

Dokument teilen oder einbetten

Freigabeoptionen

Stufen Sie dieses Dokument als nützlich ein?

Sind diese Inhalte unangemessen?

Copyright:

Verfügbare Formate

Kida Data Compression Using Variable-To-Fixed Length Codes

Hochgeladen von

Copyright:

Verfügbare Formate

Data Compression using

Variable-to-Fixed Length Codes

Compact Data Structure for Bigdata @Shonan meeting1

Variable-to-Fixed length codes (VF codes)

Proposed method (Re-Pair VF codes)

Background and aim

Remote Sensing Data

Open Smart Federation Architecture

GPS Logs & Life Logs

Various and Massive Stream Data

Develop an accessible data compression whose

Dawn of Compressed Pattern Matching

Larsson & Moffat 99

Kida et al.98, 99, 00

Kieffer & Yang 00

Klein & Shapira 08

Brisaboa et al. 03, 10

Efficient STVF coding algorithms

Yoshida & Kida 12

Combination of Grammar Comp. & VF Codes

Originality We have data compression techniques

Variable-to-Fixed Length Codes

Satisfy both of high comp. ratio and accessibility!

(Fixed length to Fixed

(Variable length to Fixed

(Variable length to Variable

Compact Data Structure for Bigdata @Shonan meeting

The basic idea

We find the best dictionary and then

How to find the best?

sum: (2 s + n )log(s + ) bits

Rewind the dictionary

Compressed sequence by the

VF code via Re-pair (Re-Pair-VF)

dazai.utf.txt (Japanese text (UTF-8 encoded), 7MB)

Compression ratio (%)

Re-Pair-VF overcomes gzip!

Decompression speed (MB/sec)

Search speed [Yoshida&Kida2013]

Application to large data [Sekine etal.2013]

Our proposed method achieved a good compression ratio

Application to large data [Sekine etal.2013]

The proposed method takes much time for

Thank you for your kind

Das könnte Ihnen auch gefallen