

(Hard cover page with black cover and gold emboss content as below)
(It should be also cover page[first page])
(Four blank spaces)
A DISSERTATION REPORT ON (12/bold, upper case)
(Two blank spaces)

YOUR PROJECT TITLE GOES HERE (16/bold, upper case)

(Two blank spaces)


FOR THE AWARD OF THE DEGREE (12, upper case)
(One blank space)
OF (12, upper case)
(Two blank spaces)

MASTER OF ENGINEERING (Computer Engineering)

(14/bold, upper case)
(Two blank spaces)

(14, upper case)
(Two blank spaces)

Anil Samale Exam No : B210808

(14, bold/upper case)
(Two blank spaces)
Under the guidance of (14/bold, upper case)
Prof ……………


S. NO. 111/1, WARJE, PUNE-411058.

CERTIFICATE (16, bold/upper case)

(Three blank spaces)

This is to certify that the project report entitled (12, sentence case)
(One blank space)
(Two blank spaces)
Submitted by
(One blank space)

Anil Samale Exam No : B210808

(12, bold/upper case)
(One blank space)
is a bonafide work carried out by him under the supervision of Prof. …………………………..
and it is submitted towards the partial fulfillment of the requirement of the University of Pune, Pune
for the award of the degree of Master of Engineering (Computer Engineering). (12, Sentence case)

(Four blank spaces)

(Prof. ……………… ) (Prof. )

Guide Head,
Department of Computer Engineering Department of Computer Engineering
(12, Sentence case)

Seal/Stamp of the College

(Dr. )
RMD Sinhgad School of Engineering Pune – 58
(On Company Letter head/seal)

CERTIFICATE (16, bold/upper case)

(Three blank spaces)

This is to certify that the dissertation report entitled (12, sentence case)
(One blank space)
“YOUR DISSERTATION TITLE GOES HERE” (12, bold/upper case)
(Two blank spaces)

Submitted by
(One blank space)

Name of the Candidate Exam No: ---------------------

(12, Title case)
(One blank space)

is a bonafide work carried out by him/her with the sponsorship from ----------------- under
the supervision of Mr/Ms/Miss ………………………….. and has been completed.
(Four blank spaces)

(Mr. ……………… ) ( Mr. ……………… )

(External Guide) (Internal Guide)
(12, Sentence case)
Place :
Date :

Certificate by Guide

This is to certify that Mr/Ms/Mrs. ............................................ has completed the dissertation

work under my guidance and supervision, and that I have verified the work for its originality in
documentation, problem statement, implementation and results presented in the dissertation. Any
reproduction of others' work has been made with prior permission, and due ownership has been
acknowledged in the references.

Date: Signature of Guide
(Name of guide)

ABSTRACT (14, bold/upper case)



The biggest challenge for big data from a security point of view is the protection of users'
privacy. Big data frequently contains huge amounts of personally identifiable information, and
therefore the privacy of users is a major concern. However, encrypted data introduces new
challenges for cloud data deduplication, which becomes crucial for big data storage and
processing in the cloud. Traditional deduplication schemes cannot work on encrypted data, and
existing solutions for encrypted data deduplication suffer from security weaknesses: they cannot
flexibly support data access control and revocation. Therefore, few of them can be readily
deployed in practice. In this paper, we propose a scheme to deduplicate encrypted data stored in
the cloud, based on ownership challenge and proxy re-encryption. It integrates cloud data
deduplication with access control. We evaluate its performance based on extensive analysis and
computer simulations. The results show the superior efficiency and effectiveness of the scheme
for potential practical deployment, especially for big data deduplication in cloud storage.
Index Terms: Access control, Big data, Cloud computing, Data de-duplication, Proxy re-encryption.

(Four blank spaces)

CONTENTS (14, bold, uppercase)

LIST OF PUBLICATIONS ……………………………. II
LIST OF DIAGRAMS ……………………………. IV
LIST OF TABLES ……………………………. V
ABSTRACT ……………………………. VI


1. Synopsis
Group Id
Dissertation Title
Deduplication on Encrypted Big Data using HDFS Framework

Sponsorship: (if any)

External Guide
Internal Guide
Technical Key Words (in brief)
Access control, Big data, Cloud computing, Data de-duplication, Proxy re-encryption.

Relevant mathematics associated with the Dissertation

Set Theory:
S = {s, e, X, Y}
s = start of the program
e = end of the program
X = input of the program
Y = output of the program
The file is first fragmented, then encoded, and the fragments are allocated. Finally, when
we request a file download, we get the file as output. The system performs auditing and notifies
the results to the data owner and the proxy agent. The proxy agent replaces the modified code
suggested by the third-party auditor.
X, Y ∈ U
Let U be the set of the system:
U = {Client, F, S, T, M, D, TP, PA}
where Client, F, S, T, M, D, TP, and PA are the elements of the set.
Let S1 be a set of different parameters which support a complex file query:
S1 = {U1, U2, U3, …, Un} // set of files
MapReduce phase
Let S2 define whether the user is authenticated or not.
Ui is the master node, which has different storage nodes as clusters.

Activity III
S3 and S4:
S3 = {data1, data2, …, datan} // materialized view
S4 = {a1, a2, …, an} // each file with a similarity weight score from the database
Fig.: Venn diagram
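The MapReduce phase mentioned above can be sketched as a grouping of content fingerprints. The following is a minimal in-memory stand-in (plain Java, not the Hadoop API) for how a map step could emit (hash(block), block) pairs and a reduce step could collapse duplicates to a single stored copy; the class and method names are illustrative only, not part of the actual system.

```java
import java.security.MessageDigest;
import java.util.*;
import java.util.stream.Collectors;

public class MapReduceDedup {
    // Content fingerprint of a block (hex-encoded SHA-256).
    static String hash(String block) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256").digest(block.getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Map: emit (hash(block), block). Reduce: keep one stored copy per hash.
    static Map<String, String> deduplicate(List<String> blocks) {
        return blocks.stream().collect(Collectors.toMap(
                MapReduceDedup::hash,     // key: content fingerprint
                b -> b,                   // value: the block itself
                (first, dup) -> first));  // reducer: duplicates collapse to one copy
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("alpha", "beta", "alpha", "alpha", "gamma");
        System.out.println(deduplicate(blocks).size()); // 3 unique blocks stored
    }
}
```

In a real Hadoop job the same grouping happens in the shuffle between Mapper and Reducer; the stream collector stands in for that here.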

Names of at least two conferences where paper (Sem I and Sem II) can be published
Add Publish Papers
Review of Conference/Journal Papers supporting dissertation idea
Add Publish Papers
Plan of dissertation execution
The table below summarizes the various tasks carried out, with the estimated duration in weeks.

Milestone  Task Name                                  Begin date   End date     Remarks
1          Selecting project domain                   15 Aug 2016  20 Aug 2016  Done
2          Understanding project need                 21 Aug 2016  25 Aug 2016  Done
3          Understanding prerequisites                26 Aug 2016  30 Aug 2016  Done
4          Information Gathering                      1 Sep 2016   15 Sep 2016  Done
5          Literature Survey                          16 Sep 2016  15 Sep 2016  Done
6          Refine Project Scope                       16 Sep 2016  18 Sep 2016  Done
7          Concept Development                        19 Sep 2016  20 Sep 2016  Done
8          Planning and Scheduling                    21 Sep 2016  23 Sep 2016  Done
9          Requirements analysis                      24 Sep 2016  25 Sep 2016  Done
10         Risk identification and monitoring         26 Sep 2016  27 Sep 2016  Done
11         Design and modeling                        28 Sep 2016  15 Oct 2016  Done
12         Design review and refinement               16 Oct 2016  20 Oct 2016  Done
13         GUI design                                 21 Oct 2016  20 Nov 2016  Done
14         Implementation                             21 Nov 2016  15 Feb 2017
15         Review and suggestions for implementation  15 Mar 2017  20 Mar 2017
16         Outcome assessment                         21 Mar 2017  30 Mar 2017
17         Testing and Quality Assurance              1 Apr 2017   10 Apr 2017
18         Review and suggestions for Testing and QA  11 Apr 2017  15 Apr 2017
19         Refined QA activities                      16 Apr 2017  30 May 2017

Table 1.9: Project Schedule

Problem statement, solving approach and Efficiency issues

• To protect data confidentiality along with secure de-duplication, the notion of authorized de-
duplication is proposed in the HDFS framework, which can provide parallel processing with
minimum time complexity.
• To carry out the duplicate check, the privileges assigned to the user are checked first; instead
of the data itself, the duplicate check is based on the differential privileges of users.

• Here, the problem of privacy-preserving de-duplication in a cloud environment is

considered, and an advanced scheme supporting differential authorization and an authorized
duplicate check is proposed.

• This project addresses the issue in authorized de-duplication to achieve better security.

• We show that our authorized duplicate check scheme incurs minimal overhead
compared to convergent encryption and network transfer.
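The differential-privilege duplicate check described above can be sketched as follows. This is a simplified illustration, not the system's actual protocol: it assumes a hypothetical per-privilege HMAC key, so the duplicate-check token depends on both the file content and the user's privilege, and tokens from different privilege levels do not match each other.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class DuplicateCheckToken {
    // Derive a duplicate-check token from the file content and the user's
    // privilege key, so the check is bound to the user's privilege level.
    static byte[] token(byte[] fileContent, byte[] privilegeKey) {
        try {
            byte[] fileHash = MessageDigest.getInstance("SHA-256").digest(fileContent);
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(privilegeKey, "HmacSHA256"));
            return mac.doFinal(fileHash);          // privilege-bound token
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        byte[] file = "report contents".getBytes(StandardCharsets.UTF_8);
        byte[] keyA = "privilege-level-A".getBytes(StandardCharsets.UTF_8);
        byte[] keyB = "privilege-level-B".getBytes(StandardCharsets.UTF_8);
        // Same file, same privilege -> same token (duplicate detected).
        System.out.println(MessageDigest.isEqual(token(file, keyA), token(file, keyA)));
        // Same file, different privilege -> different token (no leak across privileges).
        System.out.println(MessageDigest.isEqual(token(file, keyA), token(file, keyB)));
    }
}
```

In the scheme described in the literature, the privilege keys would be held by the private cloud server rather than by users; that distribution is omitted here.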

2. Technical Keywords (refer ACM Keywords) [In detail]

Big data - Big data is a term that describes the large volume of data – both structured and
unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data
that’s important. It’s what organizations do with the data that matters. Big data can be
analyzed for insights that lead to better decisions and strategic business moves.

Cloud computing - Simply put, cloud computing is the delivery of computing services—
servers, storage, databases, networking, software, analytics and more—over the Internet (“the
cloud”). Companies offering these computing services are called cloud providers and typically
charge for cloud computing services based on usage, similar to how you are billed for water or
electricity at home.


Data de-duplication - In computing, data deduplication is a specialized data compression

technique for eliminating duplicate copies of repeating data. Related and somewhat
synonymous terms are intelligent (data) compression and single-instance (data) storage.
Proxy re-encryption - Proxy re-encryption schemes are cryptosystems which allow third
parties (proxies) to alter a ciphertext which has been encrypted for one party, so that it may
be decrypted by another.
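To make the idea concrete, here is a toy ElGamal-style proxy re-encryption sketch in the spirit of the classic Blaze-Bleumer-Strauss construction. It is an illustration only (toy-sized keys, no encoding or padding, and not the scheme used in this dissertation): the proxy turns a ciphertext for Alice into one for Bob using a re-encryption key rk = b/a mod q, without ever seeing the plaintext.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class ToyProxyReEncryption {
    static final SecureRandom RND = new SecureRandom();
    final BigInteger p, q, g;  // safe prime p = 2q + 1; g generates the order-q subgroup

    ToyProxyReEncryption(int bits) {
        BigInteger qq, pp;
        do {  // search for a safe prime p = 2q + 1
            qq = BigInteger.probablePrime(bits, RND);
            pp = qq.shiftLeft(1).add(BigInteger.ONE);
        } while (!pp.isProbablePrime(40));
        q = qq;
        p = pp;
        g = BigInteger.valueOf(4);  // 4 = 2^2 is a square mod p, so its order is q
    }

    // A random exponent in [1, q-1], usable as a secret key.
    BigInteger keyGen() {
        return new BigInteger(q.bitLength(), RND).mod(q.subtract(BigInteger.ONE)).add(BigInteger.ONE);
    }

    // Encrypt a subgroup element m under public key pubA = g^a: (m * g^k, pubA^k).
    BigInteger[] encrypt(BigInteger m, BigInteger pubA) {
        BigInteger k = keyGen();
        return new BigInteger[] { m.multiply(g.modPow(k, p)).mod(p), pubA.modPow(k, p) };
    }

    // Re-encryption key from Alice (secret a) to Bob (secret b): b / a mod q.
    BigInteger reKey(BigInteger a, BigInteger b) {
        return b.multiply(a.modInverse(q)).mod(q);
    }

    // The proxy turns g^(a*k) into g^(b*k); it never sees m.
    BigInteger[] reEncrypt(BigInteger[] c, BigInteger rk) {
        return new BigInteger[] { c[0], c[1].modPow(rk, p) };
    }

    // The holder of secret sk recovers g^k = c2^(1/sk), then m = c1 / g^k.
    BigInteger decrypt(BigInteger[] c, BigInteger sk) {
        BigInteger gk = c[1].modPow(sk.modInverse(q), p);
        return c[0].multiply(gk.modInverse(p)).mod(p);
    }

    public static void main(String[] args) {
        ToyProxyReEncryption pre = new ToyProxyReEncryption(64);   // toy parameters only
        BigInteger a = pre.keyGen(), b = pre.keyGen();
        BigInteger m = pre.g.modPow(pre.keyGen(), pre.p);          // a message in the subgroup
        BigInteger[] forAlice = pre.encrypt(m, pre.g.modPow(a, pre.p));
        BigInteger[] forBob = pre.reEncrypt(forAlice, pre.reKey(a, b));
        System.out.println(m.equals(pre.decrypt(forBob, b)));      // true
    }
}
```

Production proxy re-encryption schemes are typically pairing-based and unidirectional; this bidirectional toy only shows the delegation idea.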

3. Introduction
Dissertation Idea
Motivation of the dissertation
Literature survey mainly containing survey of mathematical models referred in paper
Dissertation Idea / Motivation
Data deduplication offers many benefits. Lower storage space requirements save money on
disk expenditures. The more efficient use of disk space also allows for longer disk retention
periods, which provides better recovery time objectives (RTO) and reduces the need for tape
backups. Data deduplication also reduces the data that must be sent across a WAN for remote
backups, replication, and disaster recovery. In actual practice, data deduplication is often used in
conjunction with other forms of data reduction such as conventional compression and delta
differencing. Taken together, these three techniques can be very effective at optimizing the use
of storage space. In existing surveys we find the following issues:
• Data redundancy issues arise in cloud storage.
• Some systems can take a very long time for data processing.
• Many systems do not yet provide a data security mechanism.
After identifying these issues in existing approaches, we aim to develop a system that can work
with parallel processing while detecting duplication.
Literature Survey
1.1 A Hybrid Cloud Approach for Secure Authorized De-duplication

De-duplication of data has many forms. Typically, there is no single best way to implement data
de-duplication across a whole organization. Instead, to maximize the benefits, organizations
may deploy more than one deduplication strategy. Cloud data storage services mostly rely on de-
duplication, which removes redundant data by storing only a single copy of every file or block
[1]. It is essential to understand backups and their challenges when selecting de-duplication
as a solution.
Advantages: This de-duplication technique reduces the space and bandwidth requirements of
data storage services, and is most effective when applied across multiple users, a common
practice in cloud storage offerings.
Limitations: Data deduplication does not work with traditional encryption techniques. Using
data deduplication should also not reduce the fault-tolerance mechanism. Types of data
de-duplication are described below:
File-level de-duplication: This technique is commonly called single-instance storage. File-level
data de-duplication compares a file that has to be archived or backed up with those already
stored by checking all its attributes against the index. The index is updated and stored only if the
file is unique; if not, only a pointer to the existing stored file is referenced. As a result, only a
single instance of the file is saved, and subsequent copies are replaced by a "stub" which points
to the original file [1].
Block-level de-duplication: Block-level data de-duplication operates at the sub-file level. As the
name implies, the file is broken into segments (blocks or chunks) that are examined for
redundancy against previously stored information. The popular approach to determining
redundant data is to assign an identifier to each chunk of data, for example by using a hash
algorithm that generates a unique ID for that particular block. The unique ID is then compared
with the central index. If the ID is already present, it means the data has been processed and
stored before; therefore only a pointer reference to the previously stored data is saved. If the ID
is new and does not exist, the block is unique: the chunk is stored and the unique ID is added to
the index. The size of the chunk to be checked varies from vendor to vendor [1].
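The block-level procedure above can be sketched as follows. This is a simplified illustration with fixed-size chunks and an in-memory index; real products use variable-size chunking and persistent indexes.

```java
import java.security.MessageDigest;
import java.util.*;

public class BlockLevelDedup {
    private final Map<String, byte[]> index = new HashMap<>();  // unique ID -> stored chunk
    private final int chunkSize;

    BlockLevelDedup(int chunkSize) { this.chunkSize = chunkSize; }

    // Returns the chunk IDs that represent the file; only new chunks are
    // stored, while duplicates become pointers to existing index entries.
    List<String> store(byte[] file) {
        List<String> pointers = new ArrayList<>();
        for (int off = 0; off < file.length; off += chunkSize) {
            byte[] chunk = Arrays.copyOfRange(file, off, Math.min(off + chunkSize, file.length));
            String id = hash(chunk);        // unique ID for the chunk
            index.putIfAbsent(id, chunk);   // store only if the ID is new
            pointers.add(id);               // always keep a pointer reference
        }
        return pointers;
    }

    int uniqueChunks() { return index.size(); }

    // Hex-encoded SHA-256 as the chunk identifier.
    static String hash(byte[] chunk) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256").digest(chunk);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        BlockLevelDedup store = new BlockLevelDedup(4);
        store.store("AAAABBBBAAAA".getBytes());   // chunks: AAAA, BBBB, AAAA
        System.out.println(store.uniqueChunks()); // 2 unique chunks stored
    }
}
```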
1.2 Content Addressable Storage
Eliminating multiple copies of any file is a form of de-duplication. Single-instance storage
(SIS) environments can detect and eliminate redundant copies of identical files. After a file is
stored in a single-instance storage system, all other references to the same file refer to
the original, single copy. Single-instance storage systems compare the content of files to detect
whether an incoming file is identical to an existing file in the storage system. Content-addressed
storage is typically combined with single-instance storage functionality [5]. While file-level de-
duplication avoids storing files that are duplicates of another file, many files that are considered
unique by the single-instance storage measurement may have a large amount of redundancy
within or between files. For example, it would take only one small element (e.g., a new date
inserted into the title slide of a presentation) for single-instance storage to treat two large files
as different and require them to be stored without further de-duplication [7].
Advantages: A CAS system provides higher search speed for documents.
Limitations: This system only provides performance benefits when there are more read
operations than update operations.
1.3 Convergent Encryption
Convergent encryption provides data confidentiality in de-duplication. A user (or data owner)
derives a convergent key from each original data copy and encrypts the data copy with that key.
The basic idea of convergent encryption (CE) is to derive the encryption key from the hash of
the plaintext. A simple implementation of convergent encryption can be defined as follows:
Alice derives the encryption key from her message M such that K = H(M), where H is a
cryptographic hash function; she can then encrypt the message with this key: C = E(K, M) =
E(H(M), M), where E is a block cipher. With this technique, two users with two identical
plaintexts obtain two identical ciphertexts, since the encryption key is the same; hence the cloud
storage provider is able to perform de-duplication on such ciphertexts. The encryption keys are
generated, kept, and protected by the users. As the encryption key is deterministically generated
from the plaintext, users do not have to communicate with each other to establish an agreement
on the key used to encrypt a given plaintext. Therefore, convergent encryption seems to be a
good candidate for the joint adoption of encryption and deduplication in the cloud storage
domain. In addition, the user derives a tag for the data copy, such that the tag can be used to
detect duplicates [9][11].
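The construction K = H(M), C = E(H(M), M) can be illustrated with a short sketch. Note this is a toy: it uses SHA-256 for H and AES in ECB mode only so that E is deterministic; a real convergent-encryption implementation would use a properly designed deterministic mode of operation.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;
import java.util.Arrays;

public class ConvergentEncryption {
    // K = H(M): the AES key is derived from the hash of the plaintext.
    static byte[] convergentKey(byte[] plaintext) {
        try {
            byte[] h = MessageDigest.getInstance("SHA-256").digest(plaintext);
            return Arrays.copyOf(h, 16);   // 128-bit AES key from the hash
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // C = E(H(M), M). ECB keeps the toy deterministic; do not use ECB in practice.
    static byte[] encrypt(byte[] plaintext) {
        try {
            Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(convergentKey(plaintext), "AES"));
            return aes.doFinal(plaintext);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        byte[] userA = encrypt("identical file contents".getBytes());
        byte[] userB = encrypt("identical file contents".getBytes());
        // Two users encrypting the same plaintext obtain the same ciphertext,
        // so the storage provider can de-duplicate without seeing the plaintext.
        System.out.println(Arrays.equals(userA, userB)); // true
    }
}
```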
A typical convergent encryption scheme is insecure because a brute-force attack launched by the
cloud server can recover files falling into a known set. To understand this, consider that the
public cloud server knows the given ciphertext file is drawn from a message space S = {F1, F2,
…, Fn} of size n; then it can recover the file F using at most n trial encryptions. This can be done
by encrypting each file Fi, where i ∈ {1, 2, …, n}, to get the ciphertext Ci. If C = Ci, the
underlying file is Fi. This means convergent encryption is insecure for predictable files. M.
Bellare et al. designed a system, DupLESS, that combines a CE-type scheme with the ability to
derive message-based keys with the help of a key server (KS) shared by a group of users. The
users interact with the KS via a protocol for oblivious PRFs, ensuring that the KS can
cryptographically mix secret material into the per-message keys while learning nothing about
the files stored by users. These mechanisms ensure that DupLESS provides strong security
against external attacks and that its security degrades gracefully in the face of compromised
systems. Even when a user can be compromised, learning the plaintext underlying another
user's ciphertext requires mounting an online brute-force attack [11].
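The brute-force attack on predictable files described above is easy to demonstrate. In this sketch, a curious server that knows the message space S = {F1, …, Fn} recovers the stored file with n trial encryptions; the candidate messages are invented for illustration.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class PredictableFileAttack {
    // Convergent encryption as the server can re-run it: key = H(M), deterministic cipher.
    static byte[] ceEncrypt(byte[] m) {
        try {
            byte[] key = Arrays.copyOf(MessageDigest.getInstance("SHA-256").digest(m), 16);
            Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            return aes.doFinal(m);
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // The server knows the target lies in a small message space S = {F1, ..., Fn};
    // n trial encryptions suffice to recover the file.
    static String recover(byte[] observed, List<String> candidates) {
        for (String f : candidates)
            if (Arrays.equals(ceEncrypt(f.getBytes()), observed)) return f; // match -> recovered
        return null;
    }

    public static void main(String[] args) {
        List<String> space = Arrays.asList("salary: 10000", "salary: 20000", "salary: 30000");
        byte[] stored = ceEncrypt("salary: 20000".getBytes());
        System.out.println(recover(stored, space)); // prints: salary: 20000
    }
}
```

This is exactly why DupLESS-style designs route key derivation through a key server: the server-held secret makes such offline trial encryption impossible.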
M. Bellare et al. formalize a new cryptographic primitive, Message-Locked Encryption (MLE),
in which the key under which encryption and decryption are performed is itself derived from the
message. MLE provides a way to achieve secure de-duplication, a goal formerly targeted by
numerous cloud storage providers. They supply definitions of privacy and a form of integrity
that they call tag consistency. They provide ROM security analyses of a natural family of MLE
schemes that includes deployed techniques, and they establish connections with deterministic
encryption and hash functions secure on correlated inputs [10].
Another classification criterion is the location at which de-duplication is applied: if data are
de-duplicated at the user side, it is called source-based de-duplication, otherwise target-based.
In source-based de-duplication, the user first hashes every data segment he wishes to upload and
sends the resulting digests to the storage provider to check whether such data are already stored;
thus only non-duplicated data segments are actually uploaded by the user. While de-duplication
at the user side yields bandwidth savings, it can unfortunately make the system vulnerable to
side-channel attacks, whereby attackers can directly discover whether a certain piece of data is
already stored or not. On the other hand, by de-duplicating data at the storage provider, the
system is protected against side-channel attacks, but such a solution does not decrease the
communication overhead. Wang et al. proposed a new system which provides secure and
efficient access to outsourced data [5].
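Source-based de-duplication as described above can be sketched as a digest-first exchange. This simplified model also makes the side channel visible: from the number of segments actually transferred, a client can tell whether a segment was already stored by someone else.

```java
import java.security.MessageDigest;
import java.util.*;

public class SourceBasedDedup {
    private final Set<String> serverIndex = new HashSet<>();  // digests of stored segments

    // Hex-encoded SHA-256 digest of one data segment.
    static String digest(byte[] segment) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-256").digest(segment);
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Client side: hash every segment first and ask the server which ones it
    // lacks; only those segments cross the wire. Returns the number uploaded.
    int upload(List<byte[]> segments) {
        int sent = 0;
        for (byte[] seg : segments) {
            if (serverIndex.add(digest(seg))) sent++; // server lacked it -> transfer segment
            // else: already stored, only the digest was sent
        }
        return sent;
    }

    public static void main(String[] args) {
        SourceBasedDedup provider = new SourceBasedDedup();
        int first = provider.upload(Arrays.asList("seg-1".getBytes(), "seg-2".getBytes()));
        int second = provider.upload(Arrays.asList("seg-1".getBytes(), "seg-3".getBytes()));
        System.out.println(first + " " + second); // 2 1
    }
}
```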
Here the end user sends a request for data access to the data owner; the data owner then sends
back an encryption key and an access certificate to the end user; the end user then presents that
access certificate to the data storage provider, and the data storage provider sends the encrypted
data blocks to the end user. The advantage of this approach is its low storage overhead, but it
requires cloud server support to enforce policies. Roxana et al. proposed a new system which
overcomes the drawback of Wang's approach; their system supports the use of multiple policies.
Here we focus on a new approach named FADE [13]. The authors in their study proposed a
protocol called Vanish which provides data privacy and self-deleting data. The study mainly
focuses on data that should be accessible only for a limited period of time. After the time
expires, the data is accessible neither to the users nor to the data owner. The Vanish protocol is
applicable only to sensitive data. To obtain the self-deleting property, Vanish first encrypts the
user's data locally with an encryption key that is not revealed even to the user, then destroys the
local copies of the key, and finally sprinkles the key bits randomly into a DHT. The drawback of
this system is that it provides assured deletion based only on time: even legitimate users may not
be able to access the data after the time expires [15].
Sven B. et al. [10] proposed a twin-cloud architecture for secure deduplication in cloud storage.
As the name suggests, their approach uses one public cloud and one private cloud. The user
communicates with the private cloud (an organization-maintained cloud), which encrypts data
before outsourcing it to the public cloud. The private cloud is also responsible for verifying the
data stored in the public cloud. Their architecture uses the private cloud for operations requiring
security, whereas other kinds of queries are processed by the public cloud. The technique allows
maximum utilization of the private cloud's resources, and only high-load queries are processed
on demand by the public cloud. The trusted cloud requires a constant amount of storage and is
used in the setup phase for pre-computing encryption. The public cloud provides a large amount
of storage and is used for time-critical query operations. Zhang et al. also proposed a hybrid
cloud system [1][7] named Sedic [7]. The system supports privacy-aware data computing and is
based on the MapReduce function. They address the problem of authorized deduplication of
public cloud data; here the private cloud is assumed to be honest but curious.
Advantages: Convergent encryption provides security during the deduplication process.
Security in the deduplication process can be increased using the twin-cloud or hybrid-cloud
approach and random encryption keys.
Limitations: Increased complexity of the deduplication process is the main limitation of the
above approaches.

4. Problem Definition
We model the dispatching and execution of user requests for a cloud-computing service. First we
describe the portal server and the dynamic resource allocation system. Next we explain the
workload of user requests, electricity price, workload queue, and workload scheduler. Then we
give the queuing delay constraint for the workload and the capacity constraint for resource
allocation sites. We further formulate the energy cost minimization problem for the dynamic
resource allocation.
NP-Hard Analysis
When solving problems, we have to decide the difficulty level of our problem. Three classes are
provided for this purpose:
1) P class
2) NP-hard class
3) NP-complete class
A problem is NP-hard if solving it in polynomial time would make it possible to solve all
problems in class NP in polynomial time. Some NP-hard problems are also in NP (these are
called "NP-complete"); some are not. If you could reduce an NP problem to an NP-hard problem
and then solve it in polynomial time, you could solve all NP problems. Also, there are decision
problems that are NP-hard but not NP-complete, such as the infamous halting problem.
A decision problem L is NP-complete if it is in the set of NP problems so that any given solution
to the decision problem can be verified in polynomial time, and also in the set of NP-hard
problems so that any NP problem can be converted into L by a transformation of the inputs in
polynomial time.
The complexity class NP-complete is the set of problems that are the hardest problems in NP, in
the sense that they are the ones most likely not to be in P. If you can find a way to solve an NP-
complete problem quickly, then you can use that algorithm to solve all NP problems quickly.
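The "verified in polynomial time" property can be made concrete with subset-sum, a classic NP-complete problem: finding a subset of the numbers that sums to the target may take exponential time, but checking a proposed certificate (a set of indices) is a single linear scan.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SubsetSumVerifier {
    // Verifying a certificate (a list of distinct indices) takes one linear
    // scan, even though *finding* such a subset may take exponential time.
    static boolean verify(int[] numbers, int target, List<Integer> certificate) {
        Set<Integer> seen = new HashSet<>();
        long sum = 0;
        for (int i : certificate) {
            if (i < 0 || i >= numbers.length || !seen.add(i)) return false; // bad or repeated index
            sum += numbers[i];
        }
        return sum == target;
    }

    public static void main(String[] args) {
        int[] nums = {3, 34, 4, 12, 5, 2};
        System.out.println(verify(nums, 9, Arrays.asList(2, 4))); // 4 + 5 = 9 -> true
        System.out.println(verify(nums, 9, Arrays.asList(0, 1))); // 3 + 34 != 9 -> false
    }
}
```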
1. Improve the system performance by using parallel processing in the HDFS framework with
an efficient duplication prevention approach.

2. Indistinguishability of the file token/duplicate-check token: it is required that any user,
without querying the private cloud server for some file token, cannot get any useful
information from the token, including the file information or the privilege.

3. Provide the access control system using a proxy re-encryption approach which can eliminate
data collusion as well as SQL injection attacks.

1. To develop the system in the HDFS framework with a 4-to-16-node cluster.
2. Motivated to save cloud storage and preserve the privacy of data holders by proposing
a scheme to manage encrypted data storage with deduplication.
3. Flexibly support data sharing with deduplication even when the data holder is offline,
without intruding on the privacy of data holders.
4. Propose an effective approach to verify data ownership and check duplicate storage
with secure challenge and big data support.
5. Integrate cloud data deduplication with data access control in a simple way, thus
reconciling data deduplication and encryption.
6. Prove the security and assess the performance of the proposed scheme through
analysis and simulation. The results show its efficiency, effectiveness and practicality.
7. To increase the storage utilization and reduce network bandwidth for Hadoop storage.
8. To remove the duplicate copies of data and improve the reliability.
9. To improve integrity.

10. To enhance data security and achieve data confidentiality.

5. Dissertation Plan
5.1 Purpose of dissertation
The purpose of the dissertation plan is to present the plan for executing the project, including
the timeline chart, the software design life cycle, and the dissertation schedule.

5.2 Software Design Life Cycle

The waterfall model is a sequential design process, used in software development, in which
progress is seen as flowing steadily downwards (like a waterfall) through the phases of
conception, initiation, analysis, design, construction, testing, production/implementation and
maintenance. The waterfall approach was the first SDLC model to be widely used in software
engineering to ensure the success of a project. In the waterfall approach, the whole process of
software development is divided into separate phases, and the outcome of one phase typically
acts as the input for the next phase sequentially. Following is a diagrammatic representation of
the different phases of the waterfall model.
1. Requirement Gathering and analysis: All possible requirements of the system to be
developed are captured in this phase and documented in a requirement specification doc.

2. System Design: The requirement specifications from first phase are studied in this phase and
system design is prepared. System Design helps in specifying hardware and system requirements
and also helps in defining overall system architecture.
Figure 5.2: Waterfall Model
3. Implementation: With inputs from system design, the system is first developed in small
programs called units, which are integrated in the next phase. Each unit is developed and tested
for its functionality which is referred to as Unit Testing.
4. Integration and Testing: All the units developed in the implementation phase are integrated
into a system after testing of each unit. Post integration the entire system is tested for any faults
and failures.
5. Deployment of system: Once the functional and nonfunctional testing is done, the product is
deployed in the customer environment or released into the market.
6. Maintenance: There are some issues which come up in the client environment. To fix those
issues patches are released. Also to enhance the product some better versions are released.
Maintenance is done to deliver these changes in the customer environment.

All these phases are cascaded, with progress flowing steadily downwards (like a waterfall)
through the phases. The next phase is started only after the defined set of goals is achieved for
the previous phase and it is signed off, hence the name "Waterfall Model". In this model the
phases do not overlap.
5.3 Dissertation Schedule
The chart below summarizes the various tasks carried out, with the estimated duration in weeks.
Table 5.3: Dissertation schedule using Gantt Chart
6. Software requirement
6.1 Introduction
It specifies the software requirements for searching particular documents and the various
activities to be performed.

6.2 Purpose and Scope Of The Document

The purpose of this document is to list the various software frameworks required to build the
system. The document covers the following points:
• Responsibilities of the developer
• System architecture
• Use-case scenarios
The proposed system works on reducing message-passing overhead. In the gossip-based mining
approach, overall message-passing overhead occurs because it follows a random approach to
communication: if any update occurs, messages must be passed randomly to all nodes. In our
proposed strategy, the generated global dataset is centralized, so each node needs only one
message to communicate. The issue of fault tolerance is automatically resolved since we use
Hadoop for the implementation: Hadoop works with a name node and data nodes, and
replication is provided by the platform, so efficiency is achieved automatically.
6.3 Overview of Responsibilities Of Developer
Developers may be part of a team that includes analysts, programmers and project managers, or
they may take on all the roles required to develop software programs. The key responsibilities of
a developer are to understand the problem that the software is supposed to solve, design a
solution, and develop and test it before release.
Before they begin detailed design, developers work with users to obtain a full understanding of
the software's requirements. They analyze users' needs and recommend new software programs
or upgrades to existing programs. In larger teams, developers may collaborate with business or
systems analysts who carry out the detailed investigation into software requirements.
Developers translate the functional requirements of the software into a specification for detailed
design. They may provide instructions that enable computer programmers to create the code for
the software or they may write the code themselves. If they are instructing programmers,
developers must have a detailed understanding of code so that they can evaluate the work of
other team members.
Software testing is a critical part of the development process. Developers test programs to ensure
that they meet the requirements of the specification and that they are free of errors, known as
bugs. Developers test the programs by entering data and trying out all program functions. They
may also ask users to try test versions of programs to ensure that they are easy to use.
Developers prepare detailed documentation for software programs. Documentation provides a
description of the functions and operation of the software that team members can refer to if they
need to modify or upgrade the program. Documentation also provides the basis for operating
instructions, guides for users, training programs and marketing guides.
6.12 Non Functional Requirements
1. Performance Requirements:
The performance of the system lies in the way it is handled. Every user must be given proper
guidance regarding how to use the system. The other factor which affects the performance is the
absence of any of the suggested requirements.
2. Safety Requirement:
To ensure the safety of the system, perform regular monitoring so as to trace its proper working.
Internal staff must be trained to ensure the safety of the system and to handle extreme error
cases.
3. Security Requirement:
Any unauthorized user should be prevented from accessing the system. Password authentication
can be introduced.
4. Planned approach towards working: -
The working in the organization will be well planned and organized. The data will be stored
properly in cloud database, which will help in retrieval of information as well as its storage.
5. Accuracy: -
The level of accuracy in the proposed system will be higher. All operations would be done
correctly, and it ensures that whatever information comes from the center is accurate.
6. Reliability: -
The reliability of the proposed system will be high due to the above-stated reasons. The reason
for the increased reliability of the system is that now there would be proper storage of
7.5 Feasibility Study

7.5.1 NP Hard and NP Complete Analysis
A problem is NP-hard if solving it in polynomial time would make it possible to solve all
problems in class NP in polynomial time. Some NP-hard problems are also in NP (these are
called "NP-complete"); some are not. If you could reduce an NP problem to an NP-hard problem
and then solve it in polynomial time, you could solve all NP problems. Also, there are decision
problems that are NP-hard but not NP-complete, such as the infamous halting problem.
A decision problem L is NP-complete if it is in the set of NP problems so that any given solution
to the decision problem can be verified in polynomial time, and also in the set of NP-hard
problems so that any NP problem can be converted into L by a transformation of the inputs in
polynomial time.
The complexity class NP-complete is the set of problems that are the hardest problems in NP, in
the sense that they are the ones most likely not to be in P. If you can find a way to solve an NP-
complete problem quickly, then you can use that algorithm to solve all NP problems quickly.
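To make the verification idea concrete, here is a small illustrative example (not part of the system): checking a proposed certificate for the NP-complete Subset Sum problem takes only one linear pass, even though finding such a subset may take exponential time.

```java
// Illustrative only: verifying a certificate for Subset Sum (an NP-complete
// problem) is polynomial. 'chosen' is the proposed solution: chosen[i] marks
// whether set[i] is in the claimed subset.
public class SubsetSumVerifier {
    static boolean verify(int[] set, boolean[] chosen, int target) {
        int sum = 0;
        for (int i = 0; i < set.length; i++)
            if (chosen[i]) sum += set[i];   // one pass over the input: O(n)
        return sum == target;
    }

    public static void main(String[] args) {
        int[] set = {3, 7, 12, 5};
        boolean[] certificate = {true, false, true, false}; // claims 3 + 12 = 15
        System.out.println(verify(set, certificate, 15)); // true
    }
}
```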

7.5.2 Feasibility Analysis On Completed Project

Software projects are bound to delivery dates and available resources. Hence, a feasibility
study determines whether the system is feasible or not.

1. Economic Feasibility
This is a cost/benefit analysis. Implementing the proposed system does not require
upgrading a large amount of hardware or software; a system can be developed by installing
NetBeans, the JDK, etc. These tools are easily available at low cost.
2. Technical Feasibility
If the developed system performs well, then it is technically feasible. The technical issues
usually raised during the feasibility stage of the investigation include the following:

 Does the necessary technology exist or not?
 Does the proposed equipment have the technical capacity to hold the data required
by the new system?
 Can the system be upgraded in future?
 Are there technical guarantees of accuracy, reliability, ease of access and data
security?
3. Operational Feasibility
Our proposed work maintains privacy and security in data publishing with minimum
computation time. As explained earlier, our problem statement is to publish an anonymized view
of the data while maintaining the security and privacy of sensitive attributes. Our proposed
system solves this problem; in chapter 8 we discuss the outcome and performance, which show
that the system fulfills operational feasibility.

4. Time Feasibility
Time feasibility measures whether the system is completed on time and whether its
computation time is acceptable. Our system's computation time is lower than that of the
compared encryption algorithms; chapter 8 discusses performance in terms of computation time
and time complexity. The system's development was completed within the desired time, which
shows that it fulfills time feasibility.

7.7 Languages/Technology used

Hardware Resources required
1. System : Pentium IV 2.4 GHz
2. Hard Disk : 40 GB
3. Monitor : 15" VGA Color
4. Mouse : Logitech
5. CD-Drive
6. RAM : 2 GB
Software Resources required
1. Operating system : Windows XP/7 or higher
2. Programming Language : Java/J2EE
3. Tools : Eclipse, HeidiSQL, JDK 1.7 or higher, HDFS 2.5 or higher
4. Database : MySQL 5.1, MongoDB
7. Detailed Design Document
System Architecture

The proposed research work designs and implements a system that provides parallel processing
to address the data de-duplication problem in a big data environment. The system also provides
access control for data management and proxy revocation. It can flexibly support access control
on encrypted data with deduplication, lowers the cost of storage, and performs big data
deduplication efficiently.

The system contains three types of entities:
1) CSP, which offers storage services and cannot be fully trusted, since it is curious about the
contents of stored data but should perform data storage honestly in order to gain commercial
profit.
2) Data holder (ui), which uploads and saves its data at the CSP. The system may contain a
number of eligible data holders (ui, i = 1, ..., n) that could save the same encrypted raw data at
the CSP. The data holder that produces or creates the file is regarded as the data owner; it has
higher priority than other normal data holders.
3) An authorized party (AP) that does not collude with the CSP and is fully trusted by the data
holders to verify data ownership and handle data deduplication. In this case, the AP cannot know
the data stored in the CSP, and the CSP should not know the plain user data in its storage. In
theory the CSP and its users (e.g., data holders) could collude. In practice, however, such
collusion would cost the CSP its reputation due to data leakage; a bad reputation loses the CSP
its users and finally its profits, while the users would lose the convenience and benefits of
storing data in the CSP. Thus, collusion between the CSP and its users is profitable for neither
of them; a concrete analysis based on game theory is provided in [26]. We therefore assume that
the CSP does not collude with its users, e.g., by performing re-encryption for unauthorized users
to allow them to access data. Additional assumptions are: data holders honestly provide the
encrypted hash codes of data for ownership verification; the data owner has the highest priority;
a data holder must provide a valid certificate in order to request special treatment; users, the CSP
and the AP communicate with each other through a secure channel (e.g., SSL); and the CSP can
authenticate its users in the process of cloud data storage. We further assume that the user policy
Policy(u) for data storage, sharing and deduplication is provided to the CSP during user
registration.
A. Our scheme contains the following aspects:
Encrypted Data Upload:
If the data-duplication check is negative, the data holder encrypts its data using a randomly
selected symmetric key DEK in order to ensure the security and privacy of the data, and stores
the encrypted data at the CSP together with the token used for the duplication check. The data
holder encrypts DEK with pkAP and passes the encrypted key to the CSP.
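A minimal sketch of this upload step follows, with AES standing in for the symmetric cipher and RSA standing in for the AP's public key pkAP (the actual scheme uses PRE; all names here are illustrative, not the system's real API):

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;

// Hedged sketch of the encrypted upload: a random DEK encrypts the file,
// and the DEK itself is wrapped with the AP's public key.
public class EncryptedUpload {
    // Returns {encrypted data, encrypted DEK}.
    static byte[][] upload(byte[] data, PublicKey pkAP) throws Exception {
        // Randomly selected symmetric key DEK
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey dek = kg.generateKey();

        // Encrypt the data with DEK (an authenticated mode such as GCM is
        // preferable in practice; ECB only keeps the sketch short)
        Cipher aes = Cipher.getInstance("AES/ECB/PKCS5Padding");
        aes.init(Cipher.ENCRYPT_MODE, dek);
        byte[] encryptedData = aes.doFinal(data);

        // Wrap DEK with pkAP so only the AP can authorize re-encryption later
        Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsa.init(Cipher.ENCRYPT_MODE, pkAP);
        byte[] encryptedDek = rsa.doFinal(dek.getEncoded());

        return new byte[][] { encryptedData, encryptedDek };
    }

    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair ap = kpg.generateKeyPair(); // stand-in for the AP's key pair
        byte[][] r = upload("file contents".getBytes(StandardCharsets.UTF_8), ap.getPublic());
        System.out.println(r[0].length > 0 && r[1].length == 256); // 2048-bit RSA -> 256-byte wrap
    }
}
```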
Data Deduplication:
Data duplication occurs when a data holder u tries to store data that has already been stored at
the CSP. The CSP checks this through token comparison. If the comparison is positive, the CSP
contacts the AP for deduplication by providing the token and the data holder's PRE public key.
The AP challenges data ownership, checks the eligibility of the data holder, and then issues a
re-encryption key that converts the encrypted DEK into a form that only the eligible data holder
can decrypt.
Data Deletion:
When a data holder deletes data from the CSP, the CSP first updates the records of duplicated
data holders by removing this user's duplication record. If the remaining records are not empty,
the CSP does not delete the stored encrypted data but blocks data access from the holder that
requested deletion. If the remaining records are empty, the encrypted data is removed at the CSP.
Data Owner Management:
In case a real data owner uploads the data later than a data holder, the CSP saves the data
encrypted by the real data owner in the cloud with the owner-generated DEK; later on, the AP
supports re-encryption of DEK at the CSP for eligible data holders.
Encrypted Data Update:
In case DEK is updated by a data owner to DEK′ and the new encrypted raw data is provided to
the CSP to replace the old storage for the sake of better security, the CSP issues the newly
re-encrypted DEK′ to all data holders with the support of the AP.
B. Functional model:
Secret Sharing Scheme:
The secret sharing scheme performs two operations, Share and Recover. The secret is divided
and distributed using Share; with enough shares, it can be extracted and reconstructed with
Recover. The input to this module is a file, which is divided into fixed-size blocks (shares).
These blocks are encoded and allocated on the cloud server at different nodes. When a user
requests the file, the blocks are decrypted and combined, and the file is returned to the user.
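The Share/Recover pair above can be sketched with simple XOR-based n-of-n splitting. This is a simplification: all n shares are needed for recovery, whereas a threshold scheme such as Shamir's would tolerate missing shares.

```java
import java.security.SecureRandom;

// Minimal XOR-based n-of-n secret sharing: n-1 shares are random, and the
// last share is the secret XORed with all of them, so XORing every share
// together recovers the secret.
public class XorSecretSharing {
    static byte[][] share(byte[] secret, int n) {
        SecureRandom rnd = new SecureRandom();
        byte[][] shares = new byte[n][secret.length];
        byte[] last = secret.clone();
        for (int i = 0; i < n - 1; i++) {
            rnd.nextBytes(shares[i]);                 // random share
            for (int j = 0; j < secret.length; j++)
                last[j] ^= shares[i][j];              // fold into the last share
        }
        shares[n - 1] = last;
        return shares;
    }

    static byte[] recover(byte[][] shares) {
        byte[] secret = new byte[shares[0].length];
        for (byte[] s : shares)
            for (int j = 0; j < secret.length; j++)
                secret[j] ^= s[j];                    // XOR of all shares
        return secret;
    }
}
```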
Tag Generation:
Here, tag similarity is treated as a semantic relationship between tags, measured by their relative
co-occurrence, known as the Jaccard coefficient. The input to this module is the set of file
blocks; the module assigns a tag to each block for the duplication check and outputs the blocks
with tags assigned.
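The Jaccard coefficient mentioned above is |A ∩ B| / |A ∪ B| for two tag sets A and B; a minimal sketch (tag values are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Jaccard coefficient of two tag sets: |A ∩ B| / |A ∪ B|.
public class TagSimilarity {
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);                     // A ∩ B
        Set<String> union = new HashSet<>(a);
        union.addAll(b);                        // A ∪ B
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> t1 = Set.of("cloud", "dedup", "hdfs");
        Set<String> t2 = Set.of("cloud", "dedup", "storage");
        System.out.println(jaccard(t1, t2)); // 2 common tags of 4 total = 0.5
    }
}
```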
Mapreduce module :
In this module the system processes all execution in parallel. A hash table is used at
data-insertion time: once data has been inserted into the database, a history entry is made in the
hash table, which is then used for efficient retrieval.
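The insertion-time history table described above can be sketched as follows (class and method names are illustrative): each block's tag is recorded in a hash map, so a later duplicate check is an O(1) lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative history table for the duplicate check: maps a block tag to the
// id of the first block stored under that tag.
public class DedupHistory {
    private final Map<String, String> history = new HashMap<>();

    // Records the tag and returns true if it was already present (duplicate).
    public boolean insert(String tag, String blockId) {
        if (history.containsKey(tag)) return true;  // duplicate: already recorded
        history.put(tag, blockId);                  // first occurrence: record it
        return false;
    }
}
```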
Convergent Encryption Module

Traditional encryption, while providing data confidentiality, is incompatible with data
de-duplication. Specifically, traditional encryption requires different users to encrypt their data
with their own keys; identical data copies belonging to different users thus produce different
ciphertexts, making de-duplication impossible. Convergent encryption has been proposed to
enforce data confidentiality while making de-duplication feasible. It encrypts/decrypts a data
copy with a convergent key, obtained by computing the cryptographic hash of the content of the
data copy. After key generation and data encryption, users retain the keys and send the
ciphertext to the cloud. Since the encryption operation is deterministic and derived from the data
content, identical data copies generate the same convergent key and hence the same ciphertext.
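A minimal sketch of convergent encryption, assuming SHA-256 as the hash and AES as the cipher (ECB is used only to keep the example short and deterministic, not as a recommendation):

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

// Convergent encryption sketch: the key is derived from the content itself,
// so identical plaintexts always yield identical ciphertexts, which is what
// lets the CSP detect duplicates without seeing the plaintext.
public class ConvergentEncryption {
    static byte[] encrypt(byte[] data) throws Exception {
        byte[] key = MessageDigest.getInstance("SHA-256").digest(data); // convergent key
        SecretKeySpec k = new SecretKeySpec(Arrays.copyOf(key, 16), "AES"); // AES-128
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, k);
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] c1 = encrypt("same file".getBytes(StandardCharsets.UTF_8));
        byte[] c2 = encrypt("same file".getBytes(StandardCharsets.UTF_8));
        System.out.println(Arrays.equals(c1, c2)); // true: duplicates are detectable
    }
}
```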
Figure: - DFD (level 1)

Figure: - Sequence diagram

Figure: - Class diagram

Figure: - Use case diagram

Figure: - State diagram for data owner

Figure: - State diagram for user

Space complexity
Once the system implementation is done, the space complexity of the system can be calculated.

Figure: - System flow diagram

Time Complexity
The proposed system takes its inputs from online data files and runs in logarithmic time,
T(n) = O(log n), so n can vary freely. Algorithms taking logarithmic time are commonly found
in operations on binary trees or when using binary search. An O(n log n) algorithm is considered
highly efficient, as the operations required per instance decrease with each instance. The system
accuracy is around 90% (estimated), which is better than existing approaches; the system also
classifies the NP-complete problem, and its complexity is finally classified as O(log n).
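As a concrete example of a logarithmic-time operation mentioned above, binary search halves the search space at every step, so n elements need only about log2(n) comparisons:

```java
// Binary search over a sorted array: O(log n) comparisons.
public class BinarySearchDemo {
    static int search(int[] sorted, int target) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;           // overflow-safe midpoint
            if (sorted[mid] == target) return mid;
            if (sorted[mid] < target) lo = mid + 1; // discard left half
            else hi = mid - 1;                      // discard right half
        }
        return -1; // not found
    }
}
```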
Implementation Steps (Algorithm)

The algorithms used in deduplication technique and the implementation process details are
described in this chapter.

5.2.1 SHA-256

• SHA-256 is a cryptographic hash function with a digest length of 256 bits.

• SHA-256 operates in the manner of MD4, MD5, and SHA-1: the message to be hashed is first

(1) padded with its length in such a way that the result is a multiple of 512 bits long, and
(2) parsed into 512-bit message blocks M(1), M(2), ..., M(N).

The message blocks are processed one at a time: beginning with a fixed initial hash value H(0),
sequentially compute

H(i) = H(i-1) + C_M(i)(H(i-1)),

where C is the SHA-256 compression function and + denotes word-wise addition mod 2^32.
H(N) is the hash of M.
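In Java, the digest computation described above is available through the JDK's MessageDigest class, which performs the padding, block parsing, and compression rounds internally:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Computing a SHA-256 digest with the JDK's built-in implementation.
public class Sha256Demo {
    static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                                .digest("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(d.length * 8); // 256: digest length in bits
        System.out.println(hex(d));       // standard test vector for "abc"
    }
}
```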
2. Algorithm for Calculating VM Load (Load Balancing)
Input: i-th node.
Output: Idle, Normal or Overloaded (in percent).
Compute Load(VM id):
1. Define a load-factor set F = {F1, F2, ..., Fm}.
2. Calculate the weighted load degree: Load Degree(N) = Σ_{i=1..m} αi·Fi.
3. Compute the common cloud-partition load degree from the node load-degree statistics:
   Load Degree_avg = (Σ_{i=1..n} Load Degree(Ni)) / n.
4. Initialize the threshold T.
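The steps above can be sketched as follows; the factors Fi, weights αi, and thresholds are illustrative configuration choices, not values taken from the system.

```java
// Sketch of the load-degree computation: each factor Fi (e.g. CPU, memory)
// is weighted by αi and summed, and the result is classified against thresholds.
public class VmLoad {
    // Load Degree(N) = Σ αi·Fi
    static double loadDegree(double[] f, double[] alpha) {
        double degree = 0;
        for (int i = 0; i < f.length; i++)
            degree += alpha[i] * f[i];
        return degree;
    }

    // Classify the node: below idleT -> Idle, above overT -> Overloaded.
    static String classify(double degree, double idleT, double overT) {
        if (degree < idleT) return "Idle";
        if (degree > overT) return "Overloaded";
        return "Normal";
    }
}
```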
3. Vector-Based Cosine Similarity (VCS)
Input: query Q, threshold t.
Output: duplicate if the similarity test succeeds, unique otherwise.

Here we find the similarity of two vectors A and B:

sim(A, B) = (A · B) / (||A|| ||B||) = Σ_{i=1..n} Ai·Bi / (√(Σ_{i=1..n} Ai²) · √(Σ_{i=1..n} Bi²)),

where Ai and Bi are the components of the vectors (features of the document, or values for each
word of the comment) and n is the dimension of the vectors.

Step 1: Read each row R from dataset D.
Step 2: For each column c of R:
Step 3: score = Formula1(R, Q)
        If (score > t)
            early stop; // duplicate found
        Else continue with Step 2
End for
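The similarity measure above can be sketched directly; Formula1 in the pseudocode is assumed here to be this cosine score.

```java
// Cosine similarity of two feature vectors, plus the threshold check used
// as the early-stop (duplicate) condition in the pseudocode above.
public class CosineSimilarity {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];   // A · B
            na  += a[i] * a[i];   // ||A||²
            nb  += b[i] * b[i];   // ||B||²
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static boolean isDuplicate(double[] row, double[] query, double t) {
        return cosine(row, query) > t;   // score > t -> early stop
    }
}
```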
8. Test Specification
8.1 Introduction

This document is a procedural guide listing the testing activities that should be carried out for
the secure authorized deduplication project. It describes the software test environment,
identifies the tests to be performed, and provides schedules for test activities.

8.2 Purpose of the Document

The purpose and objectives are to:

 Identify all the activities involved in testing;
 Identify the resources required to execute testing activities and the monitoring mechanisms;
 Describe strategies for generating system test cases;
 Improve the reliability of the proposed system;
 Reduce the computation time; and
 Maximize the security of private data.
8.3 Test Strategy
8.3.1 Testing Process

The testing process highlights the broad-level phases to be executed; each of these phases has a
series of steps. The broad-level phases are:
i. Identify the requirements to be tested.
ii. Identify the expected results for each test.
iii. Identify the testing-related equipment and reference documents required to execute the
testing process.
iv. Set up the test environment.

8.3.2 Types of Testing

Along with the type of testing, also mention the approach to be followed, i.e. manual testing or
automated testing; use an automated testing plan for planning automation activities in detail.
The different types of testing that may be carried out in the project are:
1. Unit Testing
2. Integration Testing
3. System Testing

Unit testing
Individual components are tested independently to ensure their quality. The focus is on
uncovering errors in design and implementation, with attention to the following parameters:
 Registration of a new user in the login module
 Program logic and program structure in a module
 Each module tested with graphs

Integration testing
Groups of dependent components are tested together to ensure the quality of their integration.
The purpose is to uncover errors when the different modules of the project are integrated: unit
testing only checks individual modules, while integration testing checks the result of combining
them. After integration testing, the final result is confirmed. The table below shows the suite of
test cases that were executed and passed.


8.4.1 White Box Testing

White-box testing, sometimes called glass-box testing, is a test-case design method that uses the
control structure of the procedural design to derive test cases. Using white-box methods, the
software engineer can derive test cases that (1) guarantee that all independent paths within a
module have been exercised at least once, (2) exercise all logical decisions on their true and
false sides, (3) execute all loops at their boundaries and within their operational bounds, and (4)
exercise internal data structures to ensure their validity. White-box testing of software is
predicated on close examination of procedural detail through unit and integration testing.
Logical paths through the software are tested by providing test cases that exercise specific sets
of conditions and/or loops, and the status of the program may be examined at various points to
determine whether the expected or asserted status corresponds to the actual status. Basis path
testing is a white-box technique first proposed by Tom McCabe; it enables the test-case designer
to derive a logical complexity measure of a procedural design and use this measure as a guide
for defining a basis set of execution paths. Test cases derived to exercise the basis set are
guaranteed to execute every statement in the program at least once during testing.
In this system, white-box testing checked whether the calculations produced the right output for
the data provided, and whether wrong data was reported with an appropriate error.

8.4.2 Black Box Testing

Black-box testing, also called behavioral testing, focuses on the functional requirements of the
software. That is, it enables the software engineer to derive sets of input conditions that will
fully exercise all functional requirements for a program. Black-box testing is not an alternative
to white-box techniques; rather, it is a complementary approach that is likely to uncover a
different class of errors than white-box methods. When computer software is considered,
black-box testing alludes to tests conducted at the software interface. Although they are
designed to uncover errors, black-box tests are also used to demonstrate that software functions
are operational, that input is properly accepted and output correctly produced, and that the
integrity of external information is maintained. A black-box test examines some fundamental
aspect of a system with little regard for the internal logical structure of the software. Black-box
testing attempts to find errors in the following categories:
1. Incorrect or missing functions
2. Interface errors
3. Errors in data structures or external database access
4. Behavior or performance errors
5. Initialization and termination errors
By applying black-box techniques, a set of test cases is derived that satisfies the following
criteria:
 Test cases that reduce, by a count greater than one, the number of additional test cases
that must be designed to achieve reasonable testing.
 Test cases that tell us something about the presence or absence of classes of errors,
rather than about an error associated only with the specific test at hand.
White-box testing should not, however, be dismissed as impractical: a limited number of
important logical paths can be selected and exercised, and important data structures can be
probed for validity. The attributes of black- and white-box testing can be combined to provide
an approach that validates the software interface and selectively ensures that the internal
workings of the software are correct. Black-box testing of this system checked its external
behavior: whether the system works properly in each case and what kinds of errors appear in
the database design. To find faults and mistakes, different black-box methods such as system
testing, performance testing and load testing were used.

8.4.3 System Testing

System testing tests the whole system. It follows the black-box approach, which requires no
knowledge of the design of the code or logic, and verifies fulfillment of the functional
requirement specification (FRS) and the software requirement specification (SRS). It covers the
graphical user interface, usability, performance, compatibility, exception handling, load, volume,
stress, security, accessibility, and failure and recovery.

8.4.4 Performance Testing

Software quality is measured by attributes such as reliability, scalability and resource usage.
Performance testing determines how the system performs in terms of responsiveness and
stability under maximum workload. The proposed system responds positively to performance
testing: it works properly and gives correct output for a large database, and resources are fully
utilized. The only caveat is that for a very large database the system's computation time
increases. The table below lists test cases executed according to the above testing strategies and
shows the extent to which the software fulfills them.
Registration Phase: Login Test

No. | Test Flow   | Expected Result                                        | Actual Result                                  | P/F
1   | User Login  | Allow login to authenticated user only                 | System allows login to authenticated user only | P
2   | Upload file | Show "duplicate" if the file is duplicated; otherwise save | File saved / discarded successfully        | P
9. Data Tables and Discussions

For the system performance evaluation, the accuracy metrics are calculated. The system is
implemented on a Java 3-tier MVC architecture framework with an Intel 3.0 GHz i5 processor
and 8 GB RAM, using Hadoop. The computation costs of VCS were also evaluated for varying
values of k, l and K. Throughout this sub-section we fix m = 6 and n = 2000; however, the
running time of VCS was observed to grow almost linearly with n and m.
Table 1 below shows the current system evaluation outcome.

Approach: Serial input records

Data Records | Time (seconds)
2000         | 35
4000         | 68
6000         | 102
8000         | 132
10000        | 171

Table 1: Time required for query processing when m = 6, k = 5 and K = 512

After the complete implementation, the system was evaluated with different experiments. The
second experiment focuses on the time complexity of the cryptography algorithm: the system
takes different amounts of time for data encryption and for data decryption. Figure 1 below
shows the encryption and decryption time complexity.
Figure 1: Data encryption and decryption performance
In the first experiment, the system's time complexity was compared with different existing
algorithms. The graph below shows how much time, in milliseconds, is required to calculate the
relevancy for a given number of documents.



Figure 2: Time complexity existing vs proposed

The graph in Figure 3 below shows the time complexity of the system with and without Hadoop.



Figure 3: Time complexity with hadoop and without hadoop

10. Summary and Conclusion

1. To protect data confidentiality along with secure de-duplication, the notion of authorized
de-duplication is proposed in the HDFS framework, which provides parallel processing with
minimum time complexity.

2. To carry out the duplicate check, the privileges assigned to the user are checked first; instead
of the data itself, the duplicate check is based on the differential privileges of users.

3. The problem of privacy preservation for de-duplication in a cloud environment is considered,
and an advanced scheme supporting differential authorization and authorized duplicate check is
presented.

4. This project addresses the issues in authorized de-duplication to achieve better security.

5. We showed that our authorized duplicate check scheme incurs minimal overhead compared
to convergent encryption and network transfer.

11. Future Enhancement

12. References (In IEEE format)

[1] Jin Li, Yan Kit Li, Xiaofeng Chen, Patrick P. C. Lee and Wenjing Lou, "A Hybrid Cloud
Approach for Secure Authorized Deduplication", IEEE Transactions on Parallel and Distributed
Systems, Vol. PP, No. 99, 2014.
[2] Maneesha Sharma, Himani Bansal and Amit Kumar Sharma, "Cloud Computing: Different
Approach & Security Challenge", IJSCE, Vol. 2, Issue 1, March 2012.
[3] Kangchan Lee, "Security Threats in Cloud Computing Environments", International Journal
of Security and Its Applications, Vol. 6, No. 4, October 2012.
[4] Sashank Dara, "Cryptography Challenges for Computational Privacy in Public Clouds",
International Journal of Security and Its Applications, Vol. 4, 2002.
[5] David Pointcheval, "Asymmetric Cryptography and Practical Security", International Journal
of Security and Its Applications, Vol. 4, 2002.
[6] Yogesh Kumar, Rajiv Munjal and Harsh Sharma, "Comparison of Symmetric and
Asymmetric Cryptography with Existing Vulnerabilities and Countermeasures", International
Journal of Computer Science and Management Studies, Vol. 11, Issue 03, October 2011.
[7] Jan Stanek, Alessandro Sorniotti, Elli Androulaki and Lukas Kencl, "A Secure Data
De-duplication Scheme for Cloud Storage", IBM Research, Zurich, May 1994.
[8] Jin Li, Xiaofeng Chen, Mingqiang Li, Jingwei Li, Patrick P. C. Lee and Wenjing Lou,
"Secure Auditing and De-duplicating Data in Cloud", IEEE Transactions on
[9] Deepak Mishra and Sanjeev Sharma, "Comprehensive Study of Data De-duplication",
International Conference on Cloud, Big Data and Trust, Vol. 13, No. 15, November 2013.
[10] Paul Anderson and Le Zhang, "Fast and Secure Laptop Backups with Encrypted
De-duplication", Proceedings of Eurocrypt, Vol. 6, March 2013.
[11] Mihir Bellare, Sriram Keelveedhi and Thomas Ristenpart, "Message-Locked Encryption
and Secure Deduplication", Proceedings of Eurocrypt, Vol. 6, March 2013.
[12] R. Di Pietro and A. Sorniotti, "Boosting Efficiency and Security in Proof of Ownership for
Deduplication", ACM Symposium on Information, 2012.
[13] David Pointcheval, "Asymmetric Cryptography and Practical Security", International
Journal of Security and Its Applications, Vol. 4, 2002.
[14] Sashank Dara, "Cryptography Challenges for Computational Privacy in Public Clouds",
International Journal of Security and Its Applications, Vol. 4, 2002.
[15] Sean Quinlan and Sean Dorward, "Venti: A New Approach to Archival Storage", Bell Labs,
Lucent Technologies, Vol. 6, 1998.
[16] Sotomayor, Rubén S. Montero, Ignacio M. Llorente and Ian Foster, "Virtual Infrastructure
Management in Private and Hybrid Clouds", IEEE Computer Society, 2009.