Beruflich Dokumente
Kultur Dokumente
Takuya Kida
kida@ist.hokudai.ac.jp
with Satoshi Yoshida, Hirohito Sasakawa, and Kei Sekine
2013/9/29
Contents
Background and aim
Related-works
Internet SNS
Road Information
Information from Citizens
UGC
User Generated Content
Data
GPS
GPS log
Life log
Cloud
Computing
Big Data
Related works
Problem Existing methods are inconvenient for
reusing the compressed data, because they must follow
the decompression process.
Amir & Benson 92
Tunstall 67
Root of VF codes
Kida 09, 10
Grammar
Compressio
ns
model data very
well
VF codes
provide high
accessibility
Variations of codes
Input Text
Fixed
Length
Compressed
Text
Variable
Length
Fixed Length
Variable Length
FF Code
VF Code
FV Code
VV Code
(Fixed length to
Variable length code)
Tunstall
Code
Grammar-based compression
Input text
ABDABCABCCABDABC
Grammar
E ABC
F ABDE
FECF
Construct a (context
free) grammar that
generates only the
input text
Encode the
grammar
Compressed
text
0101110011101
000011100
Re-pair algorithm
Replace most frequent bigram into a new symbol
dictionary
D AA
E DA
F EB
G CF
Encode with
variable length code
(a variation of Huffman coding)
Proposed Method
(Re-Pair VF codes [Kida&Yoshida
DCC2013])
2013/9/29
10
D AA
E DA
F EB
G CF
H GE
I FF
J IG
K HI
AADAEBCFGEFFIGHI EIKJBCEFICEK
Compressed sequence:
EIKJBCEFICEK
n
2s + n
The number of sort of symbols: s +
We need log(s + ) bits for each symbol.
11
Compressed sequence
EIKJBCEFICEK
expand this part
AADAEBCF EFFGEFFFFFFGBCEFFFCEGEFF
12
Experiments
We compared compression ratios, compression times,
decompression times, and pattern matching speeds among these
methods.
Method
Data
13
Compression ratio
[Yoshida&Kida2013]
80
60
Re-Pair-VF
Re-Pair
40
Tunstall
STVF
20
gzip
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
14
Compression speed
[Yoshida&Kida2013]
35
) 30
c
e
s
/
B 25
M
(
d
e 20
e
p
s
n 15
o
i
s
s
e
r 10
p
m
o 5
C
Re-Pair-VF
Re-Pair
Tunstall
STVF
gzip
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
15
Decompression speed
[Yoshida&Kida2013]
140
120
100
Re-Pair-VF
80
Re-Pair
Tunstall
60
STVF
40
gzip
20
0
dazai.utf.txt
dblp2003.xml
gbhtg119
reuters21578
16
Throughput (MB/sec)
200
150
100
Re-pair-VF best
Tunstall
gzip
50
0
0
10
20
30
40
50
Pattern length
17
19
Summary
We focused attention to VF codes, which are useful
for reusing compressed data, and developed an
efficient VF coding based on Re-pair algorithm.
The coding enables us to do fast keyword search, in
addition to make partial decompression and recompression easy.
Future works
Improvement of compression speed
Realizing pinpoint access to compressed data with an
original text position.
Development an online coding algorithm
20
21
Biography
Takuya Kida received the B.S. degree in
Physics, the M.S. and Dr. (Information
Science) degree all from Kyushu University
in 1997, 1999, and 2001, respectively. He
was a Full-time Lecturer of Kyushu University
Library from October 2001 to March 2004.
He is currently an Associate Professor of
Division of Computer Science, Graduate
School
of
Information
Science
and
Technology, Hokkaido University, since April
2004. His research interests include Text
Algorithms and Data Structures, Information
Retrieval, and Data Compression. He is a
member of IEICE, IPSJ, and DBSJ.
22