Abstract:
Hadoop is a popular open source implementation of the MapReduce programming
model for cloud computing. However, it faces a number of issues in extracting the best
performance from the underlying systems. These include a serialization barrier that delays the
reduce phase, repetitive merges and disk accesses, and a lack of portability to different
interconnects. To keep up with the increasing volume of data sets, Hadoop also requires efficient
I/O capability from the underlying computer systems to process and analyze data. We describe
Hadoop-A, an acceleration framework that optimizes Hadoop with plug-in components for fast
data movement, overcoming these limitations. A novel network-levitated merge algorithm
is introduced to merge data without repetition or disk access. In addition, a full pipeline is
designed to overlap the shuffle, merge, and reduce phases. Our experimental results show that
Hadoop-A significantly speeds up data movement in MapReduce and doubles the throughput of
Hadoop. In addition, Hadoop-A significantly reduces the disk accesses caused by intermediate
data. In this paper, we also propose APSO, a distributed frequent subgraph mining method over
MapReduce. Given a graph database and a minimum support threshold, APSO generates the
complete set of frequent subgraphs. To ensure completeness, it constructs and retains all patterns
in a partition that have non-zero support in the map phase of the mining; then, in the reduce
phase, it decides whether a pattern is frequent by aggregating the supports computed for it in the other
partitions on different computing nodes. To overcome the dependency among the states of the
mining process, APSO runs in an iterative fashion, where the output from the reducers of
iteration i−1 is used as the input to the mappers of iteration i. The mappers of iteration i
generate candidate subgraphs of size i (measured in edges) and also compute the local support of
each candidate pattern. The reducers of iteration i then find the true frequent subgraphs (of size i)
by aggregating their local supports. They also write to disk the data that are processed in
subsequent iterations.
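The reduce-phase decision described above, aggregating per-partition local supports and keeping a pattern only if the total meets the minimum support threshold, can be sketched in plain Java. The class and method names below are hypothetical illustrations of the logic, not APSO's actual code:

```java
import java.util.*;

public class SupportAggregator {
    // Reduce-phase logic: a pattern is frequent iff the sum of its
    // per-partition local supports meets the minimum support threshold.
    public static Map<String, Integer> frequentPatterns(
            Map<String, List<Integer>> localSupports, int minSupport) {
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : localSupports.entrySet()) {
            int total = 0;
            for (int s : e.getValue()) total += s;  // aggregate across partitions
            if (total >= minSupport) frequent.put(e.getKey(), total);
        }
        return frequent;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> supports = new HashMap<>();
        supports.put("A-B", Arrays.asList(3, 2, 1));  // total 6
        supports.put("B-C", Arrays.asList(1, 0, 1));  // total 2
        System.out.println(frequentPatterns(supports, 4)); // prints {A-B=6}
    }
}
```

Because a pattern's global support is the sum of its local supports, no partition can decide frequency alone, which is exactly why the aggregation must happen in the reducers.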
Existing System
The size of the resultant feature set is assumed to be fixed: users are required to
explicitly specify the maximum dimension of the feature subset. Although this
reduces the number of combinations, the major drawback is that users may not
know in advance what the ideal size s would be.
By the principle of removing redundancy, the feature set may shrink to its most
minimal size.
The feature selection methods are custom designed for a particular classifier
and optimizer. Two classical algorithms are the Classification And Regression
Tree (CART) algorithm for decision tree induction and rough-set discrimination.
Each time fresh data arrive, which is typical in the data collection process that
inflates big data into bigger data, the traditional induction method needs to
re-run, and the model that was built needs to be rebuilt with the new data
included.
Proposed:
System Requirement:
Hardware requirements:
Software requirements:
MAP-REDUCE has emerged as a popular and easy-to-use programming model for cloud
computing. It has been used by numerous organizations to process explosive amounts of data,
perform massive computation, and extract critical knowledge for business intelligence. Hadoop
is an open source implementation of MapReduce, currently maintained by the Apache
Foundation, and supported by leading IT companies such as Facebook and Yahoo!. Hadoop
implements the MapReduce framework with two categories of components: a JobTracker and many
TaskTrackers.
The JobTracker commands TaskTrackers (a.k.a. slaves) to process data in parallel
through two main functions: map and reduce. In this process, the JobTracker is in charge of
scheduling map tasks (MapTasks) and reduce tasks (ReduceTasks) to TaskTrackers. It also
monitors their progress, collects runtime execution statistics, and handles possible faults and
errors through task reexecution. Between the two phases, a ReduceTask needs to fetch a part of
the intermediate output from all finished MapTasks. Globally, this leads to the shuffling of
intermediate data (in segments) from all MapTasks to all ReduceTasks. For many data-intensive
MapReduce programs, data shuffling can lead to a significant number of disk operations,
contending for the limited I/O bandwidth.
This presents a severe problem of disk I/O contention in MapReduce programs, which
calls for further research on efficient data shuffling and merging algorithms. Prior work has
proposed the MapReduce Online architecture to open up direct network channels between MapTasks and
ReduceTasks and speed up the delivery of data from MapTasks to ReduceTasks. It remains a
critical issue to examine the relationship of Hadoop MapReduce's three data processing phases,
i.e., shuffle, merge, and reduce, and their implications for the efficiency of Hadoop. With an
extensive examination of Hadoop MapReduce framework, particularly its ReduceTasks, we
reveal that the original architecture faces a number of challenging issues to exploit the best
performance from the underlying system. To ensure the correctness of MapReduce, no
ReduceTasks can start reducing data until all intermediate data have been merged together. This
results in a serialization barrier that significantly delays the reduce operation of ReduceTasks.
More importantly, the current merge algorithm in Hadoop merges intermediate data segments
from MapTasks when the number of available segments (including those that are already
merged) goes over a threshold.
These segments are spilled to local disk storage when their total size is bigger than the
available memory. This algorithm causes data segments to be merged repetitively and, therefore,
multiple rounds of disk accesses of the same data. To address these critical issues for Hadoop
MapReduce framework, we have designed Hadoop-A, a portable acceleration framework that
can take advantage of plug-in components for performance enhancement and protocol
optimizations. Several enhancements are introduced: 1) a novel algorithm that enables
ReduceTasks to perform data merging without repetitive merges and extra disk accesses; 2) a full
pipeline is designed to overlap the shuffle, merge, and reduce phases for ReduceTasks; and 3) a
portable implementation of Hadoop-A that can support both TCP/IP and remote direct memory
access (RDMA).
Hadoop File System
Highly fault-tolerant
High throughput
Fault tolerance:
Failure is the norm rather than the exception. An HDFS instance may consist of thousands of
server machines, each storing part of the file system's data. Since there are a huge number of
components and each component has a non-trivial probability of failure, some component is
always non-functional. Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.
Data Characteristics:
Write-once-read-many: a file once created, written and closed need not be changed – this
assumption simplifies coherency
Data Replication:
HDFS is designed to store very large files across machines in a large cluster. Each file is
a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are
replicated for fault tolerance. Block size and replicas are configurable per file. The
NameNode receives a Heartbeat and a Block Report from each DataNode in the cluster;
the Block Report lists all the blocks on that DataNode.
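As a sketch of the arithmetic implied above, the following plain-Java helper (class and method names are hypothetical) computes how many blocks a file occupies and its raw on-disk footprint. The 64 MB block size and replication factor 3 used in the example were classic Hadoop defaults; both are configurable per file:

```java
public class BlockMath {
    // Number of HDFS blocks for a file: all blocks are blockSize bytes
    // except possibly the last, so round the division up.
    public static long blockCount(long fileBytes, long blockSize) {
        return (fileBytes + blockSize - 1) / blockSize;
    }

    // Raw bytes actually consumed on the cluster with r replicas per block.
    public static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with the classic 64 MB default block size:
        System.out.println(blockCount(200 * mb, 64 * mb)); // prints 4
        // With the default replication factor of 3 it occupies 600 MB raw:
        System.out.println(rawBytes(200 * mb, 3) / mb);    // prints 600
    }
}
```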
Replica Placement:
The placement of the replicas is critical to HDFS reliability and performance. Optimizing
replica placement distinguishes HDFS from other distributed file systems. Rack-aware
replica placement:
Goal: improve reliability, availability and network bandwidth utilization
Many racks, communication between racks are through switches. Network bandwidth
between machines on the same rack is greater than those in different racks. NameNode
determines the rack id for each DataNode. Replicas are typically placed on unique racks
Simple but non-optimal
Writes are expensive
Replication factor is 3
Replicas are placed: one on a node in a local rack, one on a different node in the local
rack and one on a node in a different rack. 1/3 of the replica on a node, 2/3 on a rack and
1/3 distributed evenly across remaining racks.
NameNode:
Keeps image of entire file system namespace and file Blockmap in memory. 4GB of local
RAM is sufficient to support the above data structures that represent the huge number of
files and directories. When the NameNode starts up, it reads the FsImage and EditLog from
its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of
the updated FsImage back to the file system as a checkpoint. Periodic checkpointing is done so that
the system can recover to the last checkpointed state in case of a crash.
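The startup sequence, read the FsImage, replay the EditLog over it, and write the result back as a fresh checkpoint, can be illustrated with a toy namespace model. This is a simulation for intuition only, not HDFS internals; all names are hypothetical:

```java
import java.util.*;

public class CheckpointDemo {
    // FsImage: the last checkpointed namespace (path -> file size).
    // EditLog: the operations applied since that checkpoint.
    // At startup the NameNode replays the EditLog over the FsImage;
    // the merged state becomes the new FsImage checkpoint.
    static Map<String, Long> applyEditLog(Map<String, Long> fsImage,
                                          List<String[]> editLog) {
        Map<String, Long> ns = new TreeMap<>(fsImage);
        for (String[] op : editLog) {
            if (op[0].equals("CREATE")) ns.put(op[1], Long.parseLong(op[2]));
            else if (op[0].equals("DELETE")) ns.remove(op[1]);
        }
        return ns; // written back as the new checkpoint
    }

    public static void main(String[] args) {
        Map<String, Long> fsImage = new TreeMap<>();
        fsImage.put("/a.txt", 10L);
        List<String[]> editLog = Arrays.asList(
            new String[]{"CREATE", "/b.txt", "20"},
            new String[]{"DELETE", "/a.txt"});
        System.out.println(applyEditLog(fsImage, editLog)); // prints {/b.txt=20}
    }
}
```

After the replay, the EditLog can be truncated, which is why periodic checkpointing keeps recovery time bounded.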
Datanode:
A Datanode stores data in files in its local file system. Datanode has no knowledge about
HDFS file system. It stores each block of HDFS data in a separate file. Datanode does not
create all files in the same directory. It uses heuristics to determine optimal number of
files per directory and creates directories appropriately.
When a DataNode starts up, it scans its local file system, generates a list of all the HDFS
blocks it holds, and sends this report, the Blockreport, to the NameNode.
Hadoop includes a fault‐tolerant storage system called the Hadoop Distributed File
System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally
and survive the failure of significant parts of the storage infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be
built with inexpensive computers. If one fails, Hadoop continues to operate the cluster
without losing data or interrupting work, by shifting work to the remaining machines in the
cluster.
HDFS manages storage on the cluster by breaking incoming files into pieces, called
“blocks,” and storing each of the blocks redundantly across the pool of servers. HDFS has
several useful features. In the very simple example shown, any two servers can fail, and the
entire file will still be available.
HDFS notices when a block or a node is lost, and creates a new copy of missing data
from the replicas it manages. Because the cluster stores several copies of every block, more
clients can read them at the same time without creating bottlenecks. Of course there are
many other redundancy techniques, including the various strategies employed by RAID
machines.
HDFS offers two key advantages over RAID: It requires no special hardware, since it
can be built from commodity servers, and can survive more kinds of failure – a disk, a node
on the network or a network interface. The one obvious objection to HDFS – its
consumption of three times the necessary storage space for the files it manages – is not so
serious, given the plummeting cost of storage.
Many popular tools for enterprise data management, relational database systems for
example, are designed to make simple queries run quickly. They use techniques like indexing to
examine just a small portion of all the available data in order to answer a question. Hadoop is a
different sort of tool. Hadoop is aimed at problems that require examination of all the available
data. For example, text analysis and image processing generally require that every single record
be read, and often interpreted in the context of similar records. Hadoop uses a technique called
MapReduce to carry out this exhaustive analysis quickly.
In the previous section, we saw that HDFS distributes blocks from a single file among a
large number of servers for reliability. Hadoop takes advantage of this data distribution by
pushing the work involved in an analysis out to many different servers. Each of the servers runs
the analysis on its own block from the file.
Running the analysis on the nodes that actually store the data delivers much better
performance than reading data over the network from a single centralized server. Hadoop
monitors jobs during execution, and will restart work lost due to node failure if necessary. In
fact, if a particular node is running very slowly, Hadoop will restart its work on another server
with a copy of the data.
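The map/reduce division of labor that the preceding paragraphs describe can be illustrated with a self-contained word count, the canonical MapReduce example. This sketch simulates the phases in one plain-Java process rather than using the Hadoop API; the names are illustrative:

```java
import java.util.*;

public class MiniWordCount {
    // Map phase: emit a (word, 1) pair for every word in every record.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String rec : records)
            for (String w : rec.toLowerCase().split("\\s+"))
                if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("cat cat dog", "dog mouse");
        System.out.println(reduce(map(records))); // prints {cat=2, dog=2, mouse=1}
    }
}
```

In real Hadoop, the map calls run on the nodes holding each block and only the intermediate pairs cross the network, which is the data-locality advantage described above.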
OVERVIEW OF JAVA:
Java technology is both a programming language and a Platform. The Java programming
language is a high-level language that can be characterized by all of the following buzzwords:
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, first you translate a program into an
intermediate language called Java byte codes —the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java byte code
instruction on the computer.
Compilation happens just once; interpretation occurs each time the program is executed.
The following figure illustrates how this works.
- WORKING OF JAVA
You can think of Java byte codes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java byte codes help make “write
once, run anywhere” possible. You can compile your program into byte codes on any platform
that has a Java compiler. The byte codes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
A platform is the hardware or software environment in which a program runs. The Java
platform differs from most other platforms in that it’s a software-only platform that runs on top
of other hardware-based platforms.
The following figure depicts a program that’s running on the Java platform. As the figure
shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs directly on a specific
hardware platform. As a platform-independent environment, the Java platform can be a bit
slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte
code compilers can bring performance close to that of native code without threatening
portability.
JAVA FEATURES
An application is a standalone program that runs directly on the Java platform. A special
kind of application known as a server serves and supports clients on a network. Examples of
servers are Web servers, proxy servers, mail servers, and print servers. Another specialized
program is a servlet. A servlet can almost be thought of as an applet that runs on the server side.
Java Servlets are a popular choice for building interactive web applications, replacing the use of
CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications.
Instead of working in browsers, though, servlets run within Java Web servers, configuring or
tailoring the server.
How does the API support all these kinds of programs? It does so with packages of
software components that provide a wide range of functionality. Every full implementation of
the Java platform gives you the following features:
The essentials: Objects, strings, threads, numbers, input and output, data structures, system
properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol)
sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide.
Programs can automatically adapt to specific locales and be displayed in the appropriate
language.
Security: Both low level and high level, including electronic signatures, public and private
key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component
architectures.
Object serialization: Allows lightweight persistence and communication via Remote
Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of
relational databases. The Java platform also has APIs for 2D and 3D graphics, accessibility,
servers, collaboration, telephony, speech, animation, and more. The following figure depicts
what is included in the Java 2 SDK.
-JAVA 2 SDK
ODBC
The ODBC system files are not installed on your system by Windows 95. Rather, they
are installed when you set up a separate database application, such as SQL Server Client or
Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program,
and each maintains a separate list of ODBC data sources.
The advantages of this scheme are so numerous that you are probably
thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as
talking directly to the native database interface. ODBC has had many detractors make the charge
that it is too slow. Microsoft has always claimed that the critical factor in performance is the
quality of the driver software that is used. In our humble opinion, this is true. The availability of
good ODBC drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never match the
speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the
opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers
get faster every year.
Networking
TCP/IP stack
– TCP/IP STACK
IP datagrams
The IP layer provides a connectionless and unreliable delivery system. It considers each
datagram independently of the others. Any association between datagrams must be supplied by
the higher layers. The IP layer supplies a checksum that includes its own header, which
contains the source and destination addresses. The IP layer handles routing through an internet. It
is also responsible for breaking large datagrams into smaller ones for transmission and
reassembling them at the other end.
TCP
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a
virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme
for machines so that they can be located. The address is a 32-bit integer, the IP
address, which encodes a network ID and further addressing. The network ID falls into various
classes according to the size of the network address.
Network address
Class A uses 8 bits for the network address, with 24 bits left over for other addressing.
Class B uses 16-bit network addressing, Class C uses 24-bit network addressing, and Class D uses
all 32 bits.
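These classful ranges follow directly from the leading bits of the first octet; a small illustrative helper (hypothetical, not part of any standard API) makes the boundaries concrete:

```java
public class AddressClass {
    // Classful IPv4: the class is determined by the leading bits of the
    // first octet (A: 0xxxxxxx, B: 10xxxxxx, C: 110xxxxx, D: 1110xxxx).
    public static char classOf(int firstOctet) {
        if (firstOctet < 128) return 'A';   // 8-bit network, 24-bit host part
        if (firstOctet < 192) return 'B';   // 16-bit network addressing
        if (firstOctet < 224) return 'C';   // 24-bit network addressing
        if (firstOctet < 240) return 'D';   // multicast, no network/host split
        return 'E';                         // reserved
    }

    public static void main(String[] args) {
        System.out.println(classOf(10));   // prints A
        System.out.println(classOf(172));  // prints B
        System.out.println(classOf(224));  // prints D
    }
}
```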
Subnet address
Internally, the UNIX network is divided into sub networks. Building 11 is currently on
one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address
8 bits are finally used for host addresses within our subnet. This places a limit of 256
machines that can be on the subnet.
Total address
The 32 bit address is usually written as 4 integers separated by dots.
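Extracting the four dot-separated integers from the 32-bit address is plain bit arithmetic, as this small illustrative helper (hypothetical name) shows:

```java
public class DottedQuad {
    // Write a 32-bit address as four dot-separated octets,
    // most significant byte first.
    public static String toDotted(long addr) {
        return ((addr >> 24) & 0xFF) + "." + ((addr >> 16) & 0xFF) + "."
             + ((addr >> 8) & 0xFF) + "." + (addr & 0xFF);
    }

    public static void main(String[] args) {
        // 0x7F000001 is the loopback address:
        System.out.println(toDotted(0x7F000001L)); // prints 127.0.0.1
    }
}
```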
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a
message to a server, you send it to the port for that service of the host that it is running on. This
is not location transparency! Certain of these ports are "well known".
Sockets
#include <sys/types.h>
#include <sys/socket.h>

int sockfd = socket(family, type, protocol);

Here "family" will be AF_INET for IP communications, "protocol" will be zero, and "type" will
depend on whether TCP or UDP is used. Two processes wishing to communicate over a network
create a socket each. These are similar to the two ends of a pipe, but the actual pipe does not yet
exist.
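A minimal Java equivalent of the two-socket setup described above, using the loopback interface so it is self-contained; the C socket() call maps to Java's ServerSocket/Socket classes, and the class name here is illustrative:

```java
import java.io.DataInputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketPairDemo {
    public static void main(String[] args) throws Exception {
        // One end: a listening socket on an OS-assigned loopback port.
        ServerSocket server =
            new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        int port = server.getLocalPort();

        // Other end: connect, completing the "pipe" between the two sockets.
        Socket client = new Socket(InetAddress.getLoopbackAddress(), port);
        Socket accepted = server.accept();

        // TCP now provides a reliable byte stream between the endpoints.
        client.getOutputStream().write("hello".getBytes("UTF-8"));
        client.getOutputStream().flush();
        byte[] buf = new byte[5];
        new DataInputStream(accepted.getInputStream()).readFully(buf);
        System.out.println(new String(buf, "UTF-8")); // prints hello

        client.close(); accepted.close(); server.close();
    }
}
```

Port 0 asks the OS for any free port, which is why the client must read the actual port back with getLocalPort().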
SYSTEM DESIGN
System design is a bridge between the analysis and implementation phases. It
illustrates how to reach the solution domain from the problem domain. The main
objective of design is to transform the high-level analysis concepts used to
describe the problem domain into an implementation form. System design is the
evaluation of alternative solutions and the specification of a detailed computer-based
solution.
ARCHITECTURAL DESIGN
[Architecture diagram: seismic vibration sensors 1..N send earthquake vibration readings to a
seismic data collector; the cluster manager distributes the data to counters 1..N; the admin
registers (2. register) and logs in (1. login); the mapper applies partition() and combine() to the
input file, the reducer applies sort() and reduce() to the mapped data, and the Job Tracker
coordinates them to produce the formatted output file.]
DATA COLLECTOR:
The data collector gathers the seismic vibration readings, with their associated parameters, from
the different seismic vibration sensors. It collects parameters such as Src, Eqid, Version,
Date, Time, Lat, Lon, Magnitude, Depth, NST, and Region. The seismic readings are then sent to
the cluster manager; every time a seismic vibration sensor collects data,
the data are forwarded to the cluster manager for further processing.
SHUFFLE-ON-WRITE:
The cluster manager distributes the data to the different Hadoop counters, and the
admin then signs in to proceed with the map-reduce process. The admin has to log in, once
registration is done, with a valid user name and password. The data can be processed only by an
authenticated user, and mapping permission has to be given by that user. Mapping
includes two processes: partition() and combine().
The mapped data are then subjected to the reduce process. The reducer includes two
functions: sort() and reduce(). The Job Tracker coordinates the map and reduce phases and
reports the likely earthquake zones, prioritized according to the readings. The output file
maintains the database that records the readings. The readings are transferred to the admin, who
gives the station advance notice of the earthquake zone.
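For intuition about how mapped data reach a particular reducer, the following mirrors the logic of Hadoop's default HashPartitioner; whether this project's partition() does exactly this is an assumption, and the class name is illustrative:

```java
public class PartitionDemo {
    // Hadoop's default HashPartitioner logic: a given key always lands on
    // the same reducer, so all values for one key meet in one reduce() call.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Deterministic: the same key always maps to the same partition.
        System.out.println(partition("magnitude", reducers) ==
                           partition("magnitude", reducers)); // prints true
        // The partition index is always within [0, reducers).
        System.out.println(partition("depth", reducers) >= 0
                        && partition("depth", reducers) < reducers); // prints true
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, so a negative hashCode() still yields a non-negative partition index.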
FEASIBILITY ANALYSIS
The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, a
feasibility study of the proposed system is carried out to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential. Three key considerations involved in the feasibility
analysis are
Economical feasibility
Technical feasibility
Operational feasibility
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact the system will have on the
organization. The amount of funds that the company can pour into research and development
of the system is limited, so the expenditures must be justified. The developed system is well
within budget, which was achieved because most of the technologies used are freely
available; only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the available
technical resources, as this would lead to high demands being placed on the client. The
developed system must have modest requirements, as only minimal or no changes are
required to implement it.
OPERATIONAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user, which
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must accept it as a necessity. The level of acceptance by the
users depends on the methods employed to educate users about the system and
make them familiar with it. Their confidence must be raised so that they can also offer
constructive criticism, which is welcomed, as they are the final users of the system.
Overall DFD
[Level 0: start, starting server, network traffic, monitoring system, store data, stop.]
[Level 1: start, starting server, network traffic, monitoring system, collecting data.]
[Level 2: cluster manager, map/reduce, application-specific records, store data, stop.]
Sequence Diagram:
[Sequence diagram: request, update, and response messages flow through resource allocation,
the task scheduler, the cloud optimizer, and QoS preservation, followed by a returned
acknowledgement, connection termination, and connection close.]
Collaboration Diagram:
[Collaboration diagram: the user interacts with the Hadoop File System at a reliable
transaction rate.]
ER-Diagrams:
[ER diagram: User, HDFS, File, MapReduce Task, Mapper, Reducer, TaskTracker, JobTracker,
and Job Details entities and their relationships.]
Screenshots:
1. Go to the Hadoop directory.
2. The information file is sent to the Hadoop File System (HDFS) through a MapReduce protocol.
3. The Hadoop MapReduce program running in Cloudera terminals.
6. Information about the data transfer rate and packet rate listed in the terminals.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StubDriver {

  public static void main(String[] args) throws Exception {

    /*
     * Validate that two arguments were passed from the command line.
     */
    if (args.length != 2) {
      System.out.printf("Usage: StubDriver <input dir> <output dir>\n");
      System.exit(-1);
    }

    /*
     * Instantiate a Job object for your job's configuration.
     */
    Job job = new Job();

    /*
     * Specify the jar file that contains your driver, mapper, and reducer.
     * Hadoop will transfer this jar file to nodes in your cluster running
     * mapper and reducer tasks.
     */
    job.setJarByClass(StubDriver.class);

    /*
     * Specify an easily-decipherable name for the job.
     * This job name will appear in reports and logs.
     */
    job.setJobName("Stub Driver");

    /*
     * Wire up the input/output paths, the mapper and reducer classes,
     * and the intermediate and final key/value types.
     */
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(StubMapper.class);
    job.setReducerClass(StubReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    /*
     * Start the MapReduce job and wait for it to finish.
     * If it finishes successfully, return 0. If not, return 1.
     */
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
MapReduce Mapper Class Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StubMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /*
     * TODO implement
     */
  }
}
MapReduce Reducer Class Code:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StubReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    /*
     * TODO implement
     */
  }
}
MRUnit Test Harness Code:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import static org.junit.Assert.fail;
import org.junit.Before;
import org.junit.Test;

public class StubTest {

  /*
   * Declare harnesses that let you test a mapper, a reducer, and
   * a mapper and a reducer working together.
   */
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, DoubleWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, DoubleWritable> mapReduceDriver;
/*
* Set up the test. This method will be called before every test.
*/
@Before
public void setUp() {
/*
* Set up the mapper test harness.
*/
StubMapper mapper = new StubMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
/*
* Set up the reducer test harness.
*/
StubReducer reducer = new StubReducer();
reduceDriver = new ReduceDriver<Text, IntWritable, Text, DoubleWritable>();
reduceDriver.setReducer(reducer);
/*
* Set up the mapper/reducer test harness.
*/
mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text,
DoubleWritable>();
mapReduceDriver.setMapper(mapper);
mapReduceDriver.setReducer(reducer);
}
/*
* Test the mapper.
*/
@Test
public void testMapper() {
/*
* For this test, the mapper's input will be "1 cat cat dog"
* TODO: implement
*/
fail("Please implement test.");
}

/*
 * Test the reducer.
 */
@Test
public void testReducer() {
/*
* For this test, the reducer's input will be "cat 1 1".
* The expected output is "cat 2".
* TODO: implement
*/
fail("Please implement test.");
}

/*
 * Test the mapper and reducer working together.
 */
@Test
public void testMapReduce() {
/*
* For this test, the mapper's input will be "1 cat cat dog"
* The expected output (from the reducer) is "cat 2", "dog 1".
* TODO: implement
*/
fail("Please implement test.");
}
}
Conclusion:
In Big Data analytics, the high dimensionality and the streaming nature of the
incoming data pose great computational challenges for data mining. Big Data
grows continually, with fresh data being generated at all times; hence it requires
an incremental computation approach that is able to monitor large-scale data
dynamically. Lightweight incremental algorithms should be considered that are
capable of achieving robustness, high accuracy, and minimum pre-processing
latency. In this paper, we investigated the possibility of using a group of
incremental classification algorithms for classifying the collected data streams
pertaining to Big Data. As a case study, the empirical data streams were represented by
five datasets from the UCI archive, drawn from different domains and having very
large numbers of features. We compared traditional classification model induction with
its incremental counterpart. In particular, we proposed a novel
lightweight feature selection method using Swarm Search and Accelerated PSO,
which is intended to be useful for data stream mining. The evaluation results
showed that the incremental method obtained a higher gain in accuracy per second
incurred in pre-processing. The contribution of this paper is a spectrum of
experimental insights for anybody who wishes to design data stream mining
applications for big data analytics using a lightweight feature selection approach
such as Swarm Search and APSO. In particular, APSO is designed for
data mining of data streams on the fly. The combinatorial explosion is addressed
by a swarm search approach applied in an incremental manner. This approach also
fits better with real-world applications, where data arrive in streams. In
addition, an incremental data mining approach is more likely to meet the demands of
big data problems in service computing.