
RDM: Rapid Deduplication for Mobile Cloud Storage

Ryan N. S. Widodo
Department of Ubiquitous IT
Dongseo University, 617-716 Busan, South Korea
ryannswidodo@gmail.com

Hyotaek Lim*
Division of Information Engineering
Dongseo University, 617-716 Busan, South Korea
htlim@dongseo.ac.kr

ABSTRACT
Cloud storage is used to expand the storage of mobile devices and can be accessed through an Internet connection. On mobile devices, however, network data can be expensive in terms of both tariff and power consumption. Deduplication can be used to reduce the amount of data to be transferred. This paper proposes a fast deduplication system for mobile devices named rapid deduplication for mobile cloud storage (RDM). RDM is based on smart deduplication for mobile cloud storage (SDM) and rapid asymmetric maximum (RAM): RDM uses RAM as the chunking algorithm instead of Rabin, which is used in SDM. Our experimental results show that RDM is 39.5% to 44.7% faster at the cost of a 4.2% to 9.1% larger size after deduplication.

CCS Concepts
• Information systems ➝ Mobile information processing systems • Theory of computation ➝ Data compression

Keywords
Cloud storage; mobile devices; content-defined chunking; multiple deduplication methods; network access
1. INTRODUCTION
Mobile data traffic grows rapidly year by year; it is predicted that mobile data traffic will grow tenfold in the next five years [14]. Higher data traffic requires higher-bandwidth communication, which can be costly. One of the main contributors to this growth is cloud storage services. Cloud storage is accessible from anywhere at any time, as long as a network connection is available, and its popularity on mobile devices grows quickly because of the limited storage capacity of those devices. To further improve cloud storage service, multiple cloud storages can also be combined [19]. However, cloud storage relies on a network connection to be accessed from mobile devices, and network traffic can be costly in terms of both energy consumption and data tariff. Data compression techniques are often used to reduce the amount of data.

Data deduplication is a technique to save storage by keeping only one copy of redundant data. In addition to saving storage, deduplication also reduces the data transferred when uploading files if it is performed at the client side. Deduplication is commonly applied on storage servers; however, applying deduplication on mobile devices can also reduce the amount of data to be stored [16]. Deduplication is therefore useful for saving both storage and bandwidth, and it can be a solution for mobile devices that use cloud storage to expand their storage capacity.
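To make the mechanism concrete, the following minimal sketch (our illustration, not code from SDM or RDM) shows client-side file-level deduplication, where a whole file's fingerprint decides whether any data is transferred:

import hashlib

stored_fingerprints = set()   # fingerprints of files already in cloud storage

def upload_if_new(data: bytes) -> bool:
    # Client-side file-level deduplication: hash first, transfer only if the
    # fingerprint has not been seen before, otherwise send just a reference.
    fingerprint = hashlib.sha256(data).hexdigest()
    if fingerprint in stored_fingerprints:
        return False                      # duplicate: no network transfer
    stored_fingerprints.add(fingerprint)
    # ... the actual upload of `data` would happen here ...
    return True

print(upload_if_new(b"holiday photo bytes"))   # True: first copy is uploaded
print(upload_if_new(b"holiday photo bytes"))   # False: duplicate detected

The duplicate case costs one local hash computation instead of a full upload, which is exactly the trade the rest of this paper quantifies.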
In our previous work [16], we proposed smart deduplication for mobile cloud storage (SDM). It utilizes multiple deduplication methods, file-level and block-level, to reduce the process time. The deduplication method is chosen by a learning system based on the data collected during the deduplication process. The main drawback of SDM is the long chunking process for block-level deduplication: Rabin rolling hash [13], the hashing algorithm used as the chunking algorithm at the block level, takes a significant amount of time in the deduplication process.

Hash-based chunking algorithms are computationally heavy because they must calculate a new hash each time the sliding window moves. Hashless chunking algorithms avoid this by finding the cut-point without hashing the data stream. Local maximum chunking (LMC) [2] uses two symmetrical sliding windows, treats each byte as a number, and compares values to find the cut-point; its many comparisons make the process slow. Asymmetric extremum (AE) [20], another hashless chunking algorithm, uses two non-sliding windows to find the cut-point; because the windows do not move, AE performs fewer comparisons than LMC. Rapid asymmetric maximum (RAM) [15] is based on AE and uses the same windows with a different window placement, which allows RAM to perform even fewer comparisons and achieve a higher chunking throughput. In this paper, we propose rapid deduplication for mobile cloud storage (RDM), which is based on SDM and RAM. RAM is used as the block-level chunking algorithm instead of Rabin rolling hash. Our implementation results show a 39.5% to 44.7% reduction in deduplication process time with a drawback of a 4.2% to 9.1% larger size after deduplication.

The rest of this paper is organized as follows. Section 2 contains the related work on chunking algorithms and mobile deduplication. Section 3 discusses the design of RDM. In Section 4, we compare RDM with SDM, and we discuss the results in Section 5. Lastly, we conclude our paper in Section 6.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org.
ICCMS '17, January 20-23, 2017, Canberra, Australia
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-4816-4/17/01 …$15.00
DOI: http://dx.doi.org/10.1145/3036331.3036357

2. RELATED WORK
This section discusses the background of content-defined chunking (CDC) algorithms, related work, their limitations, and our motivation.

Deduplication eliminates duplicate data by comparing the fingerprints of the data, which can be computed with a mathematical hash function. Deduplication is not limited to storage data [18]; it can also be applied to virtual disk images [7, 17], memory [1, 8, 9], and network traffic [4, 6, 12]. Based on the deduplication method, deduplication can be categorized into file-level and block-level.

The simplest deduplication method is file-level deduplication, also called whole-file chunking. At the file level, the deduplication process is fast because no chunking is involved. However, file-level deduplication is fragile under modification, because a small modification results in a completely different hash.

Block-level deduplication solves this by splitting the file into chunks: when a byte changes, only a few chunks are affected. Block-level deduplication can use fixed-size or variable-size blocks. Deduplication with fixed-size blocks is faster than with variable-size blocks because it requires no additional processing, but fixed-size blocks have a weakness: shifting the file by adding a byte or bytes at the head of the file results in a completely different fingerprint for every block. Variable-size blocks, or CDC, are more resistant to byte shifting because chunk boundaries are derived from the content itself rather than from fixed offsets; a CDC algorithm is used to determine the chunk boundaries.
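The weakness of fixed-size blocks and the resilience of CDC can be seen in a few lines. This toy sketch is ours, not from the paper, and its cut-point rule (byte value divisible by 16) is purely illustrative:

import hashlib

def fixed_blocks(data: bytes, size: int = 8):
    # Fixed-size chunking: boundaries depend only on offsets.
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined(data: bytes):
    # Toy CDC: cut after any byte whose value is a multiple of 16, so
    # boundaries depend on the content and re-synchronize after a shift.
    chunks, start = [], 0
    for i, byte in enumerate(data):
        if byte % 16 == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    chunks.append(data[start:])
    return [c for c in chunks if c]

def fingerprints(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

data = b"abcdefghijklmnopqrstuvwxyz012345"
shifted = b"X" + data   # one byte prepended at the file header

print(len(fingerprints(fixed_blocks(data)) & fingerprints(fixed_blocks(shifted))))        # 0 shared blocks
print(len(fingerprints(content_defined(data)) & fingerprints(content_defined(shifted))))  # 2 shared chunks

After the shift, no fixed-size block survives, while the content-defined boundaries realign and most chunks still deduplicate.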
CDC algorithms can be categorized into two groups: hash-based and hashless. Hash-based CDC algorithms use a hash function to find the cut-point. The most commonly used hash function is Rabin rolling hash [13]: it uses a sliding window and recalculates the hash of the window each time the window slides, and a cut-point is found when the hash matches a predefined pattern. Hashless CDC algorithms use a different approach to find the cut-point: LMC [2], AE [20], and RAM [15] treat each byte as a number, and a cut-point is found when the cut-point condition is fulfilled. Commonly, the condition is that the cut-point byte must be larger than every byte in the window.
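As an illustration of the hash-based approach, the following sketch is a simplified rolling-hash chunker in the spirit of Rabin [13]; the base, modulus, window size, and mask are our illustrative values, not the configuration used in SDM or RDM:

import os

WINDOW = 48            # bytes covered by the sliding window
MASK = (1 << 13) - 1   # cut-point when the low 13 hash bits are all zero
BASE = 263
MOD = (1 << 61) - 1

def hash_based_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)        # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:             # slide: remove the oldest byte's term
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD         # one hash update per position
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])    # predefined pattern matched
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])         # trailing bytes form the last chunk
    return chunks

sizes = [len(c) for c in hash_based_chunks(os.urandom(500_000))]
print(len(sizes), sum(sizes) // max(len(sizes), 1))   # average near 8 KB

Every byte advanced costs a hash update plus a mask test; this per-byte hashing is precisely the overhead that the hashless algorithms avoid.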
According to Meister et al. [11], some deduplication methods and chunking algorithms are more suitable for some file types than for others. To take advantage of this fact, deduplication systems utilizing multiple deduplication methods have been proposed. Marques et al. [10] and Haustein et al. [5] proposed such systems, in which the deduplication method for each file type is predefined. In the system of Haustein et al., the deduplication method of a file type is rotated if the target deduplication ratio is not reached. Both systems have a problem with new file types that are not in the configuration, and the method rotation in the Haustein et al. system may cause hash incompatibility because the deduplication method changes. We addressed these issues with SDM [16]. SDM uses a learning system to decide the deduplication method for each file type; the learning system also solves the hash incompatibility problem by using all deduplication methods in the learning phase instead of rotating them.

The main issue with SDM is its low block-level deduplication throughput: most of the deduplication process time is spent on the chunking process. In this paper, we propose rapid deduplication for mobile cloud storage (RDM), in which RAM, a hashless chunking algorithm, is used to lower the computational overhead of block-level deduplication.

3. RAPID DEDUPLICATION FOR MOBILE CLOUD STORAGE
Rapid deduplication for mobile cloud storage (RDM) is based on SDM [16] and RAM [15]. RAM is used as the block-level chunking algorithm in place of Rabin; the purpose of using RAM instead of Rabin is to improve the deduplication process time. In this section, we discuss the design of SDM and RAM.

3.1 SDM design
SDM is a deduplication system designed for mobile devices, proposed in our previous work [16]. SDM utilizes multiple deduplication methods to optimize the process time and the duplicate detection. Meister et al. [11] show that some deduplication methods are better than others for different file types; SDM uses this idea to reduce the processing time while removing a similar amount of duplicates.

Figure 1 illustrates the design of SDM. SDM uses two deduplication methods, file-level and block-level. In [16], we used Rabin as the block-level chunking algorithm. The learning system manages the deduplication method assigned to each file type. After a file is processed by the chunking algorithm, the chunks are passed to the chunk management to find duplicate data. The chunk management used by SDM is composed of two levels, a Bloom filter [3] and a hash table: the Bloom filter serves as the fast lookup because it produces no false negatives, and the hash table is used to cover the Bloom filter's false positives.

[Figure 1. Overview of the SDM. On the mobile device, files from local storage pass through the learning system, which routes them to file-level or block-level deduplication; chunks are then checked in the chunk management before upload to cloud storage.]
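A minimal sketch of this two-level lookup follows (our illustration; SDM's actual structure sizes and hash functions are given in [16], and the bit count and hash count below are illustrative choices):

import hashlib

class TwoLevelChunkIndex:
    # Two-level duplicate lookup in the style of SDM's chunk management:
    # the Bloom filter answers "definitely new" quickly (no false negatives),
    # and the hash table confirms suspected duplicates, covering the Bloom
    # filter's false positives.
    def __init__(self, bits: int = 1 << 20, hashes: int = 4):
        self.bits, self.hashes = bits, hashes
        self.bloom = bytearray(bits // 8)
        self.table = {}                              # fingerprint -> chunk id

    def _positions(self, fp: bytes):
        for k in range(self.hashes):
            d = hashlib.sha256(fp + bytes([k])).digest()
            yield int.from_bytes(d[:8], "big") % self.bits

    def is_duplicate(self, chunk: bytes) -> bool:
        fp = hashlib.sha256(chunk).digest()
        maybe = all(self.bloom[p >> 3] & (1 << (p & 7)) for p in self._positions(fp))
        if maybe and fp in self.table:               # hash table confirms it
            return True
        for p in self._positions(fp):                # record the new chunk
            self.bloom[p >> 3] |= 1 << (p & 7)
        self.table[fp] = len(self.table)
        return False

idx = TwoLevelChunkIndex()
print(idx.is_duplicate(b"chunk A"))   # False: first occurrence is stored
print(idx.is_duplicate(b"chunk A"))   # True: duplicate, not stored again

Most new chunks are rejected by the Bloom filter alone, so the hash table is consulted only for likely duplicates.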
The learning system of SDM: The learning system of SDM chooses the best deduplication method for each file type. The decision is based on a calculation made from the block-level chunking throughput, the amount of duplicates detected, the file size, and the upload speed. The learning system calculates the amount of duplicates eliminated by block-level deduplication and the amount of time saved by not uploading the duplicates. When the time saved by uploading less data is larger than the time spent on the deduplication process, i.e., when ∑F/Sb < ∑f/Us, the learning system assigns block-level deduplication to the file type. Here ∑F is the total file size (kB), ∑f is the total duplicate length for the file type (kB), Sb is the block-level chunking throughput (kBps), and Us is the upload speed (kBps). The decision is made after processing a few files of a file type. Because of space constraints, the details can be found in [16].
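Stated as code, the decision rule reads as follows; this is our paraphrase of the rule above, and the numbers in the example are illustrative rather than measured values:

def choose_block_level(total_size_kb: float, duplicate_kb: float,
                       chunking_kbps: float, upload_kbps: float) -> bool:
    # Assign block-level deduplication when the chunking time it costs
    # (sum(F)/Sb) is smaller than the upload time it saves (sum(f)/Us).
    time_spent_chunking = total_size_kb / chunking_kbps    # sum(F) / Sb
    time_saved_uploading = duplicate_kb / upload_kbps      # sum(f) / Us
    return time_spent_chunking < time_saved_uploading

# Illustrative numbers: 100 MB of a file type with 30 MB of detected
# duplicates, 20 MB/s chunking throughput, 2 MB/s upload speed.
print(choose_block_level(102_400, 30_720, 20_480, 2_048))  # True

With these numbers, chunking costs 5 s while skipping the duplicate upload saves 15 s, so block-level deduplication is assigned to the file type.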
3.2 RAM algorithm design
RAM is a hashless content-defined chunking (CDC) algorithm [15]. It treats each byte as a number, which allows the algorithm to find a cut-point without high computational overhead. RAM is similar to AE [20] in that it also uses two windows, a fixed-size window and a variable-size window; the window configurations can be seen in Figure 2, (a) for RAM and (b) for AE. The placement of the windows differs from AE: in RAM, the fixed-size window is located at the beginning of the chunk, followed by the variable-size window and the maximum-valued byte. As can be observed in Figure 2, the maximum-valued byte is included at the end of the chunk. The windows used in RAM and AE are not sliding windows. The condition for a byte to become the cut-point is that its value is larger than (or equal to) the value of every byte in the fixed window; the algorithm searches for the first byte that fulfills this condition. The pseudocode is given in Algorithm 1.

Algorithm 1: Algorithm for RAM chunking
Input: input string Str; length of the string L
Output: cut-point i
Predefined values: window size w
function RAMChunking(Str, L)
  max.value = -1
  i = 1
  while (i < L)
    if Str[i].value >= max.value then
      if i > w then
        return i          // cut-point found past the fixed window
      end if
      max.value = Str[i].value
      max.position = i
    end if
    i = i + 1
  end while
  return L                // end of stream reached without a cut-point
end function
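For reference, a direct transcription of Algorithm 1 into runnable Python (our sketch, zero-indexed and extended with a loop that chunks a whole buffer):

def ram_cutpoint(data: bytes, w: int) -> int:
    # Track the maximum byte value inside the fixed-size window of w bytes;
    # past the window, the first byte >= that maximum becomes the cut-point.
    # Returns the chunk length (the cut-point byte is included in the chunk).
    max_value = -1
    for i, byte in enumerate(data):
        if byte >= max_value:
            if i >= w:               # beyond the fixed window: cut here
                return i + 1         # include the maximum-valued byte
            max_value = byte         # still filling the fixed window
    return len(data)                 # stream ended before a cut-point

def ram_chunks(data: bytes, w: int):
    # Split a whole buffer by repeatedly applying ram_cutpoint.
    chunks, start = [], 0
    while start < len(data):
        end = start + ram_cutpoint(data[start:], w)
        chunks.append(data[start:end])
        start = end
    return chunks

if __name__ == "__main__":
    import os
    sizes = [len(c) for c in ram_chunks(os.urandom(1_000_000), w=256)]
    print(len(sizes), sum(sizes) / len(sizes))  # every chunk is longer than w

Note that each position costs at most one comparison and, inside the fixed window, one assignment; there is no per-byte hash computation.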
RAM is a content-defined chunking algorithm because it can re-adjust the chunk size when a byte is inserted into the chunk. When a byte is inserted at any position in the chunk and its value is less than that of the cut-point byte, only one chunk is affected. If the value of the inserted byte is larger than that of the cut-point byte and it is inserted in the variable window, the number of affected chunks increases. If a byte is inserted in the fixed-size window and its value is larger than the previous maximum of the fixed window, the insertion keeps affecting subsequent chunks until a cut-point byte larger than the inserted value is reached.

[Figure 2. (a) RAM chunk structure: a fixed-size window at the start of the chunk followed by a variable-size window; the maximum-valued byte is the cut-point at the end of the chunk. (b) AE chunk structure: a variable-size window containing the extreme value, followed by a fixed-size window that ends at the cut-point.]

RAM is faster than AE and Rabin in our experiments [15] because of its fewer comparisons, at the cost of a higher chunk variance, which leads to a lower number of duplicates detected. RAM is preferable for deduplication on low-computation devices such as mobile and IoT devices because of its low computational overhead. Our analysis shows that RAM has a lower probability of long chunks than Rabin. This makes the chunk distribution of RAM much closer to the mean, which may allow a lower chunk variance. The probability of long chunks is presented in Table 1. Because of space limitations, our complete analysis of RAM can be found in [15].

Table 1. The probability of long chunks for AE, RAM, and Rabin.

M | AE: 1/[(e-1)M]! | RAM: 1/(2M)! | Rabin: e^(-M)
2 | 0.166667 | 0.041667 | 0.135335
3 | 0.008333 | 0.001389 | 0.049787
4 | 0.001389 | 2.48E-05 | 0.018316
5 | 2.48E-05 | 2.76E-07 | 0.006738
6 | 2.76E-07 | 2.09E-09 | 0.002479
7 | 2.51E-08 | 1.15E-11 | 0.000912
8 | 1.61E-10 | 4.78E-14 | 0.000335
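The RAM and Rabin columns of Table 1 follow directly from the closed forms in the header row and can be reproduced in a couple of lines (we leave the AE column aside here):

from math import exp, factorial

# Reproduce the RAM and Rabin columns of Table 1:
# P_RAM(M) = 1/(2M)!  and  P_Rabin(M) = e^(-M).
for M in range(2, 9):
    print(M, 1 / factorial(2 * M), exp(-M))
# e.g. M = 2 gives 1/4! = 0.041667 and e^-2 = 0.135335, matching the table.

The factorial decay of RAM's tail against Rabin's exponential decay is what keeps RAM's chunk sizes concentrated near the mean.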
4. PERFORMANCE EVALUATION
This section discusses the performance of RDM in an experimental environment. RDM is installed as an Android application and run on an Android device, a Samsung Galaxy S3. The Samsung Galaxy S3 has a 1.4 GHz quad-core CPU, 1.786 GB of RAM, Android version 4.4.4, and build number KTU84P.E210SKSUKNK3.

The systems tested in this experiment are as follows:
1. File-level deduplication system (FDS)
2. Rabin block-level deduplication system (RabinBDS)
3. RAM block-level deduplication system (RAMBDS)
4. Smart deduplication for mobile cloud storage with Rabin (SDM)
5. Rapid deduplication for mobile cloud storage (RDM)

As the names indicate, the file-level deduplication system (FDS) performs deduplication at the file level, and a block-level deduplication system (BDS) performs deduplication at the block level. There are two variants of BDS, Rabin BDS (RabinBDS) and RAM BDS (RAMBDS): RabinBDS uses only Rabin as the chunking algorithm, while RAMBDS uses only RAM. SDM uses both file-level and block-level deduplication to reduce the deduplication process time, and it has two variants, SDM with Rabin (SDM) and SDM with RAM (RDM). The purpose of comparing these deduplication systems is to show the performance of RAM on low-computation devices relative to Rabin-based deduplication systems.

Datasets: In our implementation, we used three datasets to show the performance of each system under different conditions. The content of each dataset is shown in Table 2. Dataset 1 contains audio files, which are unique; dataset 2 consists of Android application data with the OBB extension; dataset 3 includes dataset 1, dataset 2, and additional JPG and PDF files. Each dataset
represents a different condition: unique files are represented by dataset 1, redundant files by dataset 2, and a standard backup, which may consist of both unique and redundant files, by dataset 3.

Table 2. The datasets used in the implementation.

Dataset | Content | Size
1 | Audio (MP3 files) | 560 MB (128 files)
2 | Android application data (OBB files) | 524 MB (20 files)
3 | Mixed (JPG, PDF, MP3, OBB) | 1.53 GB (226 files)

Evaluation methodology: We used the five deduplication systems to evaluate the performance of RAM on a mobile device. Processing time and size after deduplication are the performance metrics of the evaluation. The processing time represents the deduplication throughput and the time taken by a system to complete a deduplication process. The size after deduplication shows the final size of the datasets after a deduplication process is applied, and thus the deduplication accuracy: a lower size after deduplication means higher accuracy, because more duplicates have been eliminated. This evaluation methodology is also used in our previous work [16] and in other papers [2, 20].

Configurations: The average block size needs to be equal for all algorithms for a valid comparison. Rabin's average chunk size can be configured by changing the average size, minimum size, maximum size, and window size, while for RAM only the window size can be changed to adjust the average chunk size. However, configuring Rabin's average chunk size is harder than configuring RAM's because of RAM's lower probability of long chunks. In our implementation, we ran a test with Rabin and configured RAM to match Rabin's average chunk size. The configuration used for Rabin is 16 KB, 4 KB, 32 KB, and 48 bytes for the average, minimum, and maximum chunk size and the window size, respectively. The window size used for RAM is 18 KB. FDS and BDS are SDM with file-level or block-level
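Because RAM exposes only this single knob, matching Rabin's measured average reduces to a one-dimensional search. The sketch below is our illustration, not the paper's procedure: it assumes a Rabin run has already produced the target average (16 KB here) and calibrates w on sample data. On real file data the resulting w will differ from the random-data result; the paper arrived at an 18 KB window for a 16 KB Rabin average.

import os

def ram_average_chunk(data: bytes, w: int) -> float:
    # Mean chunk size produced by RAM (same cut-point rule as Algorithm 1).
    cuts, start, max_value = 0, 0, -1
    for i, byte in enumerate(data):
        if byte >= max_value:
            if i - start >= w:                 # past the fixed window: cut
                cuts, start, max_value = cuts + 1, i + 1, -1
            else:
                max_value = byte
    return len(data) / max(cuts, 1)

def calibrate_window(sample: bytes, target_avg: float) -> int:
    # Binary-search the window size w until the average chunk size on the
    # sample is close to the target measured from a Rabin run.
    lo, hi = 64, len(sample)
    while lo < hi:
        mid = (lo + hi) // 2
        if ram_average_chunk(sample, mid) < target_avg:
            lo = mid + 1                       # chunks too small: grow w
        else:
            hi = mid
    return lo

print(calibrate_window(os.urandom(2_000_000), target_avg=16 * 1024))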
Evaluation results: The processing time and the size after deduplication are displayed in Figures 3 and 4. As can be seen in Figure 3, RAM consumes 26.3% (dataset 1) to 43% (dataset 2) less time than Rabin when used in a block-level deduplication system (BDS) on mobile devices. When used in SDM, RAM shows only a small improvement on dataset 1 compared with RAMBDS, because when processing dataset 1 SDM utilizes file-level deduplication. On dataset 2 and dataset 3, RAM improves the processing time by 44.7% and 39.5%, respectively.

[Figure 3. Time consumed for the deduplication process (lower is better): processing time in seconds for FDS, RabinBDS, RAMBDS, SDM, and RDM on datasets 1-3.]

Although RAM is faster than Rabin, Rabin can eliminate more duplicates than RAM on dataset 2 and dataset 3, as can be seen in Figure 4. Rabin eliminated 9.1% (51.07 MB) and 4.2% (64.18 MB) more duplicates on datasets 2 and 3, respectively; on dataset 1, RAM eliminated 5.14 MB more duplicate data than Rabin. The results are similar when RAM is utilized in SDM. To ease the comparison between deduplication systems, we include ND, which stands for no deduplication, in Figure 4; the ND size is the full size of the dataset.

[Figure 4. Size of the dataset after deduplication (lower is better): post-deduplication size in MB for ND (no deduplication), FDS, RabinBDS, RAMBDS, SDM, and RDM on datasets 1-3.]

5. DISCUSSION
On mobile devices, RAM performs 26.3% to 44.7% better than Rabin in terms of processing time. However, the improvement is not as big as when RAM is applied on a desktop PC. In our other work [15], RAM is 500% faster than Rabin when the algorithm runs on an Intel i7 6700 with an SSD, where the deduplication throughput is similar to the SSD read speed. This shows that the storage read speed of our mobile device bottlenecks the deduplication process; with a better storage read speed, RAM shows a higher improvement over Rabin.

On some datasets, Rabin eliminates more duplicate data than RAM, similar to our evaluation in [15]. On mobile devices, there are nevertheless cases where RAM is preferable to Rabin: when the upload speed is high enough that uploading the extra bytes is negligible; in local deduplication, where no upload is involved and the processing time matters for battery life; and in applications where the processor consumes more power than the network adapter. To improve duplicate elimination, the average chunk size of RAM can be lowered; however, this also increases the amount of chunk metadata.

6. CONCLUSION
In this paper, we have discussed how deduplication can improve the feasibility of cloud storage usage on mobile devices and how using a better chunking algorithm can reduce the deduplication process time. We proposed a deduplication system based on SDM and rapid asymmetric maximum (RAM); the system uses RAM instead of Rabin for block-level deduplication. The main advantage of RAM is its low computational overhead, which allows a faster deduplication process. When RAM is used in SDM as the block-level chunking algorithm (RDM), the process time is reduced on all datasets, but the amount of duplicates found is also reduced on some datasets. In some cases, using RAM in SDM offers 39.5% to 44.7% less processing time with an increase of 4.2% to 9.1% in size after deduplication compared with Rabin.

7. ACKNOWLEDGEMENT
This work was supported by the National Research Foundation of Korea under Grant 2016R1D1A1A09916932.

8. REFERENCES
[1] Ahn, J. and Shin, D. 2014. Optimizing power consumption of memory deduplication scheme. (2014), 1-2.
[2] Bjørner, N. et al. 2010. Content-dependent chunking for differential compression, the local maximum approach. Journal of Computer and System Sciences. 76, 3-4 (2010), 154-203.
[3] Bloom, B.H. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM. 13, 7 (1970), 422-426.
[4] Bremler-Barr, A. et al. Leveraging traffic repetitions for high-speed deep packet inspection.
[5] Haustein, N. et al. 2009. Method of and system for adaptive selection of a deduplication chunking technique. United States Patent. 2, 12 (2009), 1-7.
[6] Hua, K.A. et al. 2015. Redundancy control through traffic deduplication. (2015), 10-18.
[7] Jin, K. and Miller, E.L. 2009. The effectiveness of deduplication on virtual machine disk images. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. (2009), 1-12.
[8] Kim, S.H. et al. 2014. Selective memory deduplication for cost efficiency in mobile smart devices. IEEE Transactions on Consumer Electronics. 60, 2 (2014), 276-284.
[9] Lee, B. et al. 2015. MemScope: Analyzing memory duplication on Android systems. Proceedings of the 6th Asia-Pacific Workshop on Systems. ACM (2015), 19.
[10] Marques, L. and Costa, C.J. 2011. Secure deduplication on mobile devices. Proceedings of the 2011 Workshop on Open Source and Design of Communication (OSDOC '11). (2011), 19.
[11] Meister, D. and Brinkmann, A. 2009. Multi-level comparison of data deduplication in a backup scenario. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference. (2009), 8.
[12] Papapanagiotou, I. et al. 2012. Chunk and object level deduplication for web optimization: A hybrid approach. (2012), 1393-1398.
[13] Rabin, M.O. 1981. Fingerprinting by random polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory, Harvard University.
[14] Cisco. 2016. Cisco Visual Networking Index: Forecast and methodology, 2014-2019. White paper.
[15] Widodo, R.N.S. et al. 2016. A new content-defined chunking algorithm for data deduplication in cloud storage. Manuscript. (2016).
[16] Widodo, R.N.S. et al. 2016. SDM: Smart deduplication for mobile cloud storage. Future Generation Computer Systems. (2016).
[17] Xu, J. et al. 2016. Clustering-based acceleration for virtual machine image deduplication in the cloud environment. Journal of Systems and Software. 121 (2016), 144-156.
[18] Yang, C. et al. 2015. Provable ownership of files in deduplication cloud storage. (2015), 2457-2468.
[19] Yeo, H.S. et al. 2014. Leveraging client-side storage techniques for enhanced use of multiple consumer cloud storage services on resource-constrained mobile devices. Journal of Network and Computer Applications. 43 (2014), 142-156.
[20] Zhang, Y. et al. 2015. AE: An asymmetric extremum content defined chunking algorithm for fast and bandwidth-efficient data deduplication. (2015), 1337-1345.

