Abstract:
Hadoop is a popular open source implementation of the MapReduce programming
model for cloud computing. However, it faces a number of issues in extracting the best
performance from the underlying systems. These include a serialization barrier that delays the
reduce phase, repetitive merges and disk accesses, and a lack of portability to different
interconnects. To keep up with the increasing volume of data sets, Hadoop also requires efficient
I/O capability from the underlying computer systems to process and analyze data. We describe
Hadoop-A, an acceleration framework that optimizes Hadoop with plug-in components for fast
data movement, overcoming these limitations. A novel network-levitated merge algorithm
is introduced to merge data without repetition or disk access. In addition, a full pipeline is
designed to overlap the shuffle, merge, and reduce phases. Our experimental results show that
Hadoop-A significantly speeds up data movement in MapReduce and doubles the throughput of
Hadoop. In addition, Hadoop-A significantly reduces the disk accesses caused by intermediate
data. In this paper, we also propose APSO, a distributed frequent subgraph mining method over
MapReduce. Given a graph database and a minimum support threshold, APSO generates the
complete set of frequent subgraphs. To ensure completeness, it constructs and retains all patterns
in a partition that have non-zero support in the map phase of the mining; then, in the reduce
phase, it decides whether a pattern is frequent by aggregating the supports computed for it in the other
partitions on different computing nodes. To overcome the dependency among the states of the
mining process, APSO runs in an iterative fashion, where the output from the reducers of
iteration i−1 is used as the input to the mappers of iteration i. The mappers of iteration i
generate candidate subgraphs of size i (measured in edges) and also compute the local support of
each candidate pattern. The reducers of iteration i then find the true frequent subgraphs (of size i)
by aggregating their local supports. They also write to disk the data that are processed in
subsequent iterations.
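The reduce-phase decision described above, aggregating per-partition local supports and keeping a pattern only if the total meets the minimum support threshold, can be sketched in plain Java. The class and method names below are hypothetical illustrations of the logic, not APSO's actual code:

```java
import java.util.*;

public class SupportAggregator {
    // Reduce-phase logic: a pattern is frequent iff the sum of its
    // per-partition local supports meets the minimum support threshold.
    public static Map<String, Integer> frequentPatterns(
            Map<String, List<Integer>> localSupports, int minSupport) {
        Map<String, Integer> frequent = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : localSupports.entrySet()) {
            int total = 0;
            for (int s : e.getValue()) total += s;  // aggregate across partitions
            if (total >= minSupport) frequent.put(e.getKey(), total);
        }
        return frequent;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> supports = new HashMap<>();
        supports.put("A-B", Arrays.asList(3, 2, 1));  // total 6
        supports.put("B-C", Arrays.asList(1, 0, 1));  // total 2
        System.out.println(frequentPatterns(supports, 4)); // prints {A-B=6}
    }
}
```

Because a pattern's global support is the sum of its local supports, no partition can decide frequency alone, which is exactly why the aggregation must happen in the reducers.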
Existing System
The size of the resultant feature set is assumed to be fixed: users are required to
explicitly specify the maximum dimension of the feature subset. Although this
reduces the number of combinations, the major drawback is that users may not
know in advance what the ideal size s would be.
By the principle of removing redundancy, the feature set may shrink to its most
minimal size.
The feature selection methods are custom designed for a particular classifier
and optimizer. Two classical algorithms are the Classification And Regression
Tree (CART) algorithm for decision tree induction and rough-set discrimination.
Each time fresh data arrive, which is typical in the data collection process that
inflates big data into bigger data, the traditional induction method needs to
re-run, and the model that was built needs to be rebuilt with the new data
included.
Proposed:
System Requirement:
Hardware requirements:
Software requirements:
MAP-REDUCE has emerged as a popular and easy-to-use programming model for cloud
computing. It has been used by numerous organizations to process explosive amounts of data,
perform massive computation, and extract critical knowledge for business intelligence. Hadoop
is an open source implementation of MapReduce, currently maintained by the Apache
Foundation, and supported by leading IT companies such as Facebook and Yahoo!. Hadoop
implements the MapReduce framework with two categories of components: a JobTracker and many
TaskTrackers.
The JobTracker commands TaskTrackers (a.k.a. slaves) to process data in parallel
through two main functions: map and reduce. In this process, the JobTracker is in charge of
scheduling map tasks (MapTasks) and reduce tasks (ReduceTasks) to TaskTrackers. It also
monitors their progress, collects runtime execution statistics, and handles possible faults and
errors through task reexecution. Between the two phases, a ReduceTask needs to fetch a part of
the intermediate output from all finished MapTasks. Globally, this leads to the shuffling of
intermediate data (in segments) from all MapTasks to all ReduceTasks. For many data-intensive
MapReduce programs, data shuffling can lead to a significant number of disk operations,
contending for the limited I/O bandwidth.
This presents a severe problem of disk I/O contention in MapReduce programs, which
calls for further research on efficient data shuffling and merging algorithms. Prior work has
proposed the MapReduce Online architecture to open up direct network channels between MapTasks and
ReduceTasks and speed up the delivery of data from MapTasks to ReduceTasks. It remains a
critical issue to examine the relationship of Hadoop MapReduce's three data processing phases,
i.e., shuffle, merge, and reduce, and their implications for the efficiency of Hadoop. With an
extensive examination of Hadoop MapReduce framework, particularly its ReduceTasks, we
reveal that the original architecture faces a number of challenging issues to exploit the best
performance from the underlying system. To ensure the correctness of MapReduce, no
ReduceTasks can start reducing data until all intermediate data have been merged together. This
results in a serialization barrier that significantly delays the reduce operation of ReduceTasks.
More importantly, the current merge algorithm in Hadoop merges intermediate data segments
from MapTasks when the number of available segments (including those that are already
merged) goes over a threshold.
These segments are spilled to local disk storage when their total size is bigger than the
available memory. This algorithm causes data segments to be merged repetitively and, therefore,
multiple rounds of disk accesses of the same data. To address these critical issues for Hadoop
MapReduce framework, we have designed Hadoop-A, a portable acceleration framework that
can take advantage of plug-in components for performance enhancement and protocol
optimizations. Several enhancements are introduced: 1) a novel algorithm that enables
ReduceTasks to perform data merging without repetitive merges and extra disk accesses; 2) a full
pipeline is designed to overlap the shuffle, merge, and reduce phases for ReduceTasks; and 3) a
portable implementation of Hadoop-A that can support both TCP/IP and remote direct memory
access (RDMA).
Hadoop File System
Highly fault-tolerant
High throughput
Fault tolerance:
Failure is the norm rather than the exception. An HDFS instance may consist of thousands of
server machines, each storing part of the file system's data. Since there are a huge number of
components and each component has a non-trivial probability of failure, some component is
always non-functional. Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS.
Data Characteristics:
Write-once-read-many: a file once created, written and closed need not be changed – this
assumption simplifies coherency
Data Replication:
HDFS is designed to store very large files across machines in a large cluster. Each file is
a sequence of blocks. All blocks in the file except the last are of the same size. Blocks are
replicated for fault tolerance. Block size and replicas are configurable per file. The
NameNode receives a Heartbeat and a Block Report from each DataNode in the cluster;
the Block Report lists all the blocks on that DataNode.
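As a sketch of the arithmetic implied above, the following plain-Java helper (class and method names are hypothetical) computes how many blocks a file occupies and its raw on-disk footprint. The 64 MB block size and replication factor 3 used in the example were classic Hadoop defaults; both are configurable per file:

```java
public class BlockMath {
    // Number of HDFS blocks for a file: all blocks are blockSize bytes
    // except possibly the last, so round the division up.
    public static long blockCount(long fileBytes, long blockSize) {
        return (fileBytes + blockSize - 1) / blockSize;
    }

    // Raw bytes actually consumed on the cluster with r replicas per block.
    public static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB file with the classic 64 MB default block size:
        System.out.println(blockCount(200 * mb, 64 * mb)); // prints 4
        // With the default replication factor of 3 it occupies 600 MB raw:
        System.out.println(rawBytes(200 * mb, 3) / mb);    // prints 600
    }
}
```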
Replica Placement:
The placement of the replicas is critical to HDFS reliability and performance. Optimizing
replica placement distinguishes HDFS from other distributed file systems. Rack-aware
replica placement:
Goal: improve reliability, availability and network bandwidth utilization
Many racks, communication between racks are through switches. Network bandwidth
between machines on the same rack is greater than those in different racks. NameNode
determines the rack id for each DataNode. Replicas are typically placed on unique racks
Simple but non-optimal
Writes are expensive
Replication factor is 3
Replicas are placed: one on a node in a local rack, one on a different node in the local
rack and one on a node in a different rack. 1/3 of the replica on a node, 2/3 on a rack and
1/3 distributed evenly across remaining racks.
NameNode:
Keeps image of entire file system namespace and file Blockmap in memory. 4GB of local
RAM is sufficient to support the above data structures that represent the huge number of
files and directories. When the NameNode starts up, it reads the FsImage and EditLog from
its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of
the updated FsImage back to the file system as a checkpoint. Periodic checkpointing is done so that
the system can recover to the last checkpointed state in case of a crash.
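The startup sequence, read the FsImage, replay the EditLog over it, and write the result back as a fresh checkpoint, can be illustrated with a toy namespace model. This is a simulation for intuition only, not HDFS internals; all names are hypothetical:

```java
import java.util.*;

public class CheckpointDemo {
    // FsImage: the last checkpointed namespace (path -> file size).
    // EditLog: the operations applied since that checkpoint.
    // At startup the NameNode replays the EditLog over the FsImage;
    // the merged state becomes the new FsImage checkpoint.
    static Map<String, Long> applyEditLog(Map<String, Long> fsImage,
                                          List<String[]> editLog) {
        Map<String, Long> ns = new TreeMap<>(fsImage);
        for (String[] op : editLog) {
            if (op[0].equals("CREATE")) ns.put(op[1], Long.parseLong(op[2]));
            else if (op[0].equals("DELETE")) ns.remove(op[1]);
        }
        return ns; // written back as the new checkpoint
    }

    public static void main(String[] args) {
        Map<String, Long> fsImage = new TreeMap<>();
        fsImage.put("/a.txt", 10L);
        List<String[]> editLog = Arrays.asList(
            new String[]{"CREATE", "/b.txt", "20"},
            new String[]{"DELETE", "/a.txt"});
        System.out.println(applyEditLog(fsImage, editLog)); // prints {/b.txt=20}
    }
}
```

After the replay, the EditLog can be truncated, which is why periodic checkpointing keeps recovery time bounded.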
Datanode:
A Datanode stores data in files in its local file system. Datanode has no knowledge about
HDFS file system. It stores each block of HDFS data in a separate file. Datanode does not
create all files in the same directory. It uses heuristics to determine optimal number of
files per directory and creates directories appropriately.
When a DataNode starts up, it scans its local file system, generates a list of all the HDFS
blocks it holds, and sends this report, the Blockreport, to the NameNode.
Hadoop includes a fault‐tolerant storage system called the Hadoop Distributed File
System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally
and survive the failure of significant parts of the storage infrastructure without losing data.
Hadoop creates clusters of machines and coordinates work among them. Clusters can be
built with inexpensive computers. If one fails, Hadoop continues to operate the cluster
without losing data or interrupting work, by shifting work to the remaining machines in the
cluster.
HDFS manages storage on the cluster by breaking incoming files into pieces, called
“blocks,” and storing each of the blocks redundantly across the pool of servers. HDFS has
several useful features. In the very simple example shown, any two servers can fail, and the
entire file will still be available.
HDFS notices when a block or a node is lost, and creates a new copy of missing data
from the replicas it manages. Because the cluster stores several copies of every block, more
clients can read them at the same time without creating bottlenecks. Of course there are
many other redundancy techniques, including the various strategies employed by RAID
machines.
HDFS offers two key advantages over RAID: It requires no special hardware, since it
can be built from commodity servers, and can survive more kinds of failure – a disk, a node
on the network or a network interface. The one obvious objection to HDFS – its
consumption of three times the necessary storage space for the files it manages – is not so
serious, given the plummeting cost of storage.
Many popular tools for enterprise data management, relational database systems for
example, are designed to make simple queries run quickly. They use techniques like indexing to
examine just a small portion of all the available data in order to answer a question. Hadoop is a
different sort of tool. Hadoop is aimed at problems that require examination of all the available
data. For example, text analysis and image processing generally require that every single record
be read, and often interpreted in the context of similar records. Hadoop uses a technique called
MapReduce to carry out this exhaustive analysis quickly.
In the previous section, we saw that HDFS distributes blocks from a single file among a
large number of servers for reliability. Hadoop takes advantage of this data distribution by
pushing the work involved in an analysis out to many different servers. Each of the servers runs
the analysis on its own block from the file.
Running the analysis on the nodes that actually store the data delivers much better
performance than reading data over the network from a single centralized server. Hadoop
monitors jobs during execution, and will restart work lost due to node failure if necessary. In
fact, if a particular node is running very slowly, Hadoop will restart its work on another server
with a copy of the data.
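The map/reduce division of labor that the preceding paragraphs describe can be illustrated with a self-contained word count, the canonical MapReduce example. This sketch simulates the phases in one plain-Java process rather than using the Hadoop API; the names are illustrative:

```java
import java.util.*;

public class MiniWordCount {
    // Map phase: emit a (word, 1) pair for every word in every record.
    static List<Map.Entry<String, Integer>> map(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String rec : records)
            for (String w : rec.toLowerCase().split("\\s+"))
                if (!w.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        return pairs;
    }

    // Shuffle + reduce phase: group pairs by key, then sum each group.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("cat cat dog", "dog mouse");
        System.out.println(reduce(map(records))); // prints {cat=2, dog=2, mouse=1}
    }
}
```

In real Hadoop, the map calls run on the nodes holding each block and only the intermediate pairs cross the network, which is the data-locality advantage described above.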
OVERVIEW OF JAVA:
Java technology is both a programming language and a Platform. The Java programming
language is a high-level language that can be characterized by all of the following buzzwords:
Simple
Architecture neutral
Object oriented
Portable
Distributed
High performance
Interpreted
Multithreaded
Robust
Dynamic
Secure
With most programming languages, you either compile or interpret a program so that you
can run it on your computer. The Java programming language is unusual in that a program is
both compiled and interpreted. With the compiler, first you translate a program into an
intermediate language called Java byte codes —the platform-independent codes interpreted by
the interpreter on the Java platform. The interpreter parses and runs each Java byte code
instruction on the computer.
Compilation happens just once; interpretation occurs each time the program is executed.
The following figure illustrates how this works.
- WORKING OF JAVA
You can think of Java byte codes as the machine code instructions for the Java Virtual
Machine (Java VM). Every Java interpreter, whether it’s a development tool or a Web browser
that can run applets, is an implementation of the Java VM. Java byte codes help make “write
once, run anywhere” possible. You can compile your program into byte codes on any platform
that has a Java compiler. The byte codes can then be run on any implementation of the Java VM.
That means that as long as a computer has a Java VM, the same program written in the Java
programming language can run on Windows 2000, a Solaris workstation, or on an iMac.
A platform is the hardware or software environment in which a program runs. The Java
platform differs from most other platforms in that it’s a software-only platform that runs on top
of other hardware-based platforms.
The following figure depicts a program that’s running on the Java platform. As the figure
shows, the Java API and the virtual machine insulate the program from the hardware.
Native code is code that, after compilation, runs directly on a specific
hardware platform. As a platform-independent environment, the Java platform can be a bit
slower than native code. However, smart compilers, well-tuned interpreters, and just-in-time byte
code compilers can bring performance close to that of native code without threatening
portability.
JAVA FEATURES
An application is a standalone program that runs directly on the Java platform. A special
kind of application known as a server serves and supports clients on a network. Examples of
servers are Web servers, proxy servers, mail servers, and print servers. Another specialized
program is a servlet. A servlet can almost be thought of as an applet that runs on the server side.
Java Servlets are a popular choice for building interactive web applications, replacing the use of
CGI scripts. Servlets are similar to applets in that they are runtime extensions of applications.
Instead of working in browsers, though, servlets run within Java Web servers, configuring or
tailoring the server.
How does the API support all these kinds of programs? It does so with packages of
software components that provide a wide range of functionality. Every full implementation of
the Java platform gives you the following features:
The essentials: Objects, strings, threads, numbers, input and output, data structures, system
properties, date and time, and so on.
Applets: The set of conventions used by applets.
Networking: URLs, TCP (Transmission Control Protocol) and UDP (User Datagram Protocol)
sockets, and IP (Internet Protocol) addresses.
Internationalization: Help for writing programs that can be localized for users worldwide.
Programs can automatically adapt to specific locales and be displayed in the appropriate
language.
Security: Both low level and high level, including electronic signatures, public and private
key management, access control, and certificates.
Software components: Known as JavaBeans™, these can plug into existing component
architectures.
Object serialization: Allows lightweight persistence and communication via Remote
Method Invocation (RMI).
Java Database Connectivity (JDBC™): Provides uniform access to a wide range of
relational databases. The Java platform also has APIs for 2D and 3D graphics, accessibility,
servers, collaboration, telephony, speech, animation, and more. The following figure depicts
what is included in the Java 2 SDK.
-JAVA 2 SDK
ODBC
The ODBC system files are not installed on your system by Windows 95. Rather, they
are installed when you set up a separate database application, such as SQL Server Client or
Visual Basic 4.0. When the ODBC icon is installed in Control Panel, it uses a file called
ODBCINST.DLL. It is also possible to administer your ODBC data sources through a stand-
alone program called ODBCADM.EXE. There is a 16-bit and a 32-bit version of this program,
and each maintains a separate list of ODBC data sources.
The advantages of this scheme are so numerous that you are probably
thinking there must be some catch. The only disadvantage of ODBC is that it isn’t as efficient as
talking directly to the native database interface. ODBC has had many detractors make the charge
that it is too slow. Microsoft has always claimed that the critical factor in performance is the
quality of the driver software that is used. In our humble opinion, this is true. The availability of
good ODBC drivers has improved a great deal recently. And anyway, the criticism about
performance is somewhat analogous to those who said that compilers would never match the
speed of pure assembly language. Maybe not, but the compiler (or ODBC) gives you the
opportunity to write cleaner programs, which means you finish sooner. Meanwhile, computers
get faster every year.
Networking
TCP/IP stack
– TCP/IP STACK
IP datagrams
The IP layer provides a connectionless and unreliable delivery system. It considers each
datagram independently of the others. Any association between datagrams must be supplied by
the higher layers. The IP layer supplies a checksum that includes its own header, which
contains the source and destination addresses. The IP layer handles routing through an internet. It
is also responsible for breaking large datagrams into smaller ones for transmission and
reassembling them at the other end.
TCP
TCP supplies logic to give a reliable connection-oriented protocol above IP. It provides a
virtual circuit that two processes can use to communicate.
Internet addresses
In order to use a service, you must be able to find it. The Internet uses an address scheme
for machines so that they can be located. The address is a 32-bit integer, the IP
address, which encodes a network ID and further addressing. The network ID falls into various
classes according to the size of the network address.
Network address
Class A uses 8 bits for the network address, with 24 bits left over for other addressing.
Class B uses 16-bit network addressing, Class C uses 24-bit network addressing, and Class D uses
all 32 bits.
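These classful ranges follow directly from the leading bits of the first octet; a small illustrative helper (hypothetical, not part of any standard API) makes the boundaries concrete:

```java
public class AddressClass {
    // Classful IPv4: the class is determined by the leading bits of the
    // first octet (A: 0xxxxxxx, B: 10xxxxxx, C: 110xxxxx, D: 1110xxxx).
    public static char classOf(int firstOctet) {
        if (firstOctet < 128) return 'A';   // 8-bit network, 24-bit host part
        if (firstOctet < 192) return 'B';   // 16-bit network addressing
        if (firstOctet < 224) return 'C';   // 24-bit network addressing
        if (firstOctet < 240) return 'D';   // multicast, no network/host split
        return 'E';                         // reserved
    }

    public static void main(String[] args) {
        System.out.println(classOf(10));   // prints A
        System.out.println(classOf(172));  // prints B
        System.out.println(classOf(224));  // prints D
    }
}
```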
Subnet address
Internally, the UNIX network is divided into sub networks. Building 11 is currently on
one sub network and uses 10-bit addressing, allowing 1024 different hosts.
Host address
8 bits are finally used for host addresses within our subnet. This places a limit of 256
machines that can be on the subnet.
Total address
The 32 bit address is usually written as 4 integers separated by dots.
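Extracting the four dot-separated integers from the 32-bit address is plain bit arithmetic, as this small illustrative helper (hypothetical name) shows:

```java
public class DottedQuad {
    // Write a 32-bit address as four dot-separated octets,
    // most significant byte first.
    public static String toDotted(long addr) {
        return ((addr >> 24) & 0xFF) + "." + ((addr >> 16) & 0xFF) + "."
             + ((addr >> 8) & 0xFF) + "." + (addr & 0xFF);
    }

    public static void main(String[] args) {
        // 0x7F000001 is the loopback address:
        System.out.println(toDotted(0x7F000001L)); // prints 127.0.0.1
    }
}
```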
Port addresses
A service exists on a host, and is identified by its port. This is a 16 bit number. To send a
message to a server, you send it to the port for that service of the host that it is running on. This
is not location transparency! Certain of these ports are "well known".
Sockets
#include <sys/types.h>
#include <sys/socket.h>

int sockfd = socket(family, type, protocol);

Here "family" will be AF_INET for IP communications, "protocol" will be zero, and "type" will
depend on whether TCP or UDP is used. Two processes wishing to communicate over a network
create a socket each. These are similar to the two ends of a pipe, but the actual pipe does not yet
exist.
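A minimal Java equivalent of the two-socket setup described above, using the loopback interface so it is self-contained; the C socket() call maps to Java's ServerSocket/Socket classes, and the class name here is illustrative:

```java
import java.io.DataInputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketPairDemo {
    public static void main(String[] args) throws Exception {
        // One end: a listening socket on an OS-assigned loopback port.
        ServerSocket server =
            new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
        int port = server.getLocalPort();

        // Other end: connect, completing the "pipe" between the two sockets.
        Socket client = new Socket(InetAddress.getLoopbackAddress(), port);
        Socket accepted = server.accept();

        // TCP now provides a reliable byte stream between the endpoints.
        client.getOutputStream().write("hello".getBytes("UTF-8"));
        client.getOutputStream().flush();
        byte[] buf = new byte[5];
        new DataInputStream(accepted.getInputStream()).readFully(buf);
        System.out.println(new String(buf, "UTF-8")); // prints hello

        client.close(); accepted.close(); server.close();
    }
}
```

Port 0 asks the OS for any free port, which is why the client must read the actual port back with getLocalPort().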
SYSTEM DESIGN
System design is a bridge between the analysis and implementation phases. It
illustrates how to reach the solution domain from the problem domain. The main
objective of design is to transform the high-level analysis concepts used to
describe the problem domain into an implementation form. System design is the
evaluation of alternative solutions and the specification of a detailed computer-based
solution.
ARCHITECTURAL DESIGN
[Architecture diagram: seismic vibration sensors 1..N send earthquake vibration readings to a
seismic data collector; the cluster manager distributes the data to counters 1..N; the admin
registers (2. register) and logs in (1. login); the mapper applies partition() and combine() to the
input file, the reducer applies sort() and reduce() to the mapped data, and the Job Tracker
coordinates them to produce the formatted output file.]
DATA COLLECTOR:
The data collector gathers the seismic vibration readings, with their associated parameters, from
the different seismic vibration sensors. It collects parameters such as Src, Eqid, Version,
Date, Time, Lat, Lon, Magnitude, Depth, NST, and Region. The seismic readings are then sent to
the cluster manager; every time a seismic vibration sensor collects data,
the data are forwarded to the cluster manager for further processing.
SHUFFLE-ON-WRITE:
The cluster manager distributes the data to the different Hadoop counters, and the
admin then signs in to proceed with the map-reduce process. The admin has to log in, once
registration is done, with a valid user name and password. The data can be processed only by an
authenticated user, and mapping permission has to be given by that user. Mapping
includes two processes: partition() and combine().
The mapped data are then subjected to the reduce process. The reducer includes two
functions: sort() and reduce(). The Job Tracker coordinates the map and reduce phases and
reports the likely earthquake zones, prioritized according to the readings. The output file
maintains the database that records the readings. The readings are transferred to the admin, who
gives the station advance notice of the earthquake zone.
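For intuition about how mapped data reach a particular reducer, the following mirrors the logic of Hadoop's default HashPartitioner; whether this project's partition() does exactly this is an assumption, and the class name is illustrative:

```java
public class PartitionDemo {
    // Hadoop's default HashPartitioner logic: a given key always lands on
    // the same reducer, so all values for one key meet in one reduce() call.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Deterministic: the same key always maps to the same partition.
        System.out.println(partition("magnitude", reducers) ==
                           partition("magnitude", reducers)); // prints true
        // The partition index is always within [0, reducers).
        System.out.println(partition("depth", reducers) >= 0
                        && partition("depth", reducers) < reducers); // prints true
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, so a negative hashCode() still yields a non-negative partition index.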
FEASIBILITY ANALYSIS
The feasibility of the project is analyzed in this phase, and a business proposal is put forth
with a very general plan for the project and some cost estimates. During system analysis, a
feasibility study of the proposed system is carried out to ensure that the proposed
system is not a burden to the company. For feasibility analysis, some understanding of the major
requirements for the system is essential. Three key considerations involved in the feasibility
analysis are
Economical feasibility
Technical feasibility
Operational feasibility
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact the system will have on the
organization. The amount of funds that the company can pour into research and development
of the system is limited, so the expenditures must be justified. The developed system is well
within budget, which was achieved because most of the technologies used are freely
available; only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on the available
technical resources, as this would lead to high demands being placed on the client. The
developed system must have modest requirements, as only minimal or no changes are
required to implement it.
OPERATIONAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user, which
includes the process of training the user to use the system efficiently. The user must not feel
threatened by the system, but must accept it as a necessity. The level of acceptance by the
users depends on the methods employed to educate users about the system and
make them familiar with it. Their confidence must be raised so that they can also offer
constructive criticism, which is welcomed, as they are the final users of the system.
Overall DFD
[Level 0: start, starting server, network traffic, monitoring system, store data, stop.]
[Level 1: start, starting server, network traffic, monitoring system, collecting data.]
[Level 2: cluster manager, map/reduce, application-specific records, store data, stop.]
Sequence Diagram:
[Sequence diagram: request, update, and response messages flow through resource allocation,
the task scheduler, the cloud optimizer, and QoS preservation, followed by a returned
acknowledgement, connection termination, and connection close.]
Collaboration Diagram:
[Collaboration diagram: the user interacts with the Hadoop File System at a reliable
transaction rate.]
ER-Diagrams:
[ER diagram: User, HDFS, File, MapReduce Task, Mapper, Reducer, TaskTracker, JobTracker,
and Job Details entities and their relationships.]
Screenshots:
1. Go to the Hadoop directory.
2. The information file is sent to the Hadoop File System (HDFS) through a MapReduce protocol.
3. The Hadoop MapReduce program running in Cloudera terminals.
6. Information about the data transfer rate and packet rate listed in the terminals.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StubDriver {

  public static void main(String[] args) throws Exception {

    /*
     * Validate that two arguments were passed from the command line.
     */
    if (args.length != 2) {
      System.out.printf("Usage: StubDriver <input dir> <output dir>\n");
      System.exit(-1);
    }

    /*
     * Instantiate a Job object for your job's configuration.
     */
    Job job = new Job();

    /*
     * Specify the jar file that contains your driver, mapper, and reducer.
     * Hadoop will transfer this jar file to nodes in your cluster running
     * mapper and reducer tasks.
     */
    job.setJarByClass(StubDriver.class);

    /*
     * Specify an easily-decipherable name for the job.
     * This job name will appear in reports and logs.
     */
    job.setJobName("Stub Driver");

    /*
     * Wire up the input/output paths, the mapper and reducer classes,
     * and the intermediate and final key/value types.
     */
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(StubMapper.class);
    job.setReducerClass(StubReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);

    /*
     * Start the MapReduce job and wait for it to finish.
     * If it finishes successfully, return 0. If not, return 1.
     */
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
MapReduce Mapper Class Code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StubMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    /*
     * TODO implement
     */
  }
}
MapReduce Reducer Class Code:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StubReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    /*
     * TODO implement
     */
  }
}
MRUnit Test Harness Code:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import static org.junit.Assert.fail;
import org.junit.Before;
import org.junit.Test;

public class StubTest {

  /*
   * Declare harnesses that let you test a mapper, a reducer, and
   * a mapper and a reducer working together.
   */
  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, DoubleWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, DoubleWritable> mapReduceDriver;
/*
* Set up the test. This method will be called before every test.
*/
@Before
public void setUp() {
/*
* Set up the mapper test harness.
*/
StubMapper mapper = new StubMapper();
mapDriver = new MapDriver<LongWritable, Text, Text, IntWritable>();
mapDriver.setMapper(mapper);
/*
* Set up the reducer test harness.
*/
StubReducer reducer = new StubReducer();
reduceDriver = new ReduceDriver<Text, IntWritable, Text, DoubleWritable>();
reduceDriver.setReducer(reducer);
/*
* Set up the mapper/reducer test harness.
*/
mapReduceDriver = new MapReduceDriver<LongWritable, Text, Text, IntWritable, Text,
DoubleWritable>();
mapReduceDriver.setMapper(mapper);
mapReduceDriver.setReducer(reducer);
}
/*
* Test the mapper.
*/
@Test
public void testMapper() {
/*
* For this test, the mapper's input will be "1 cat cat dog"
* TODO: implement
*/
fail("Please implement test.");
}

/*
 * Test the reducer.
 */
@Test
public void testReducer() {
/*
* For this test, the reducer's input will be "cat 1 1".
* The expected output is "cat 2".
* TODO: implement
*/
fail("Please implement test.");
}

/*
 * Test the mapper and reducer working together.
 */
@Test
public void testMapReduce() {
/*
* For this test, the mapper's input will be "1 cat cat dog"
* The expected output (from the reducer) is "cat 2", "dog 1".
* TODO: implement
*/
fail("Please implement test.");
}
}
Conclusion:
In Big Data analytics, the high dimensionality and the streaming nature of the
incoming data pose great computational challenges for data mining. Big Data
grows continually, with fresh data being generated at all times; hence it requires
an incremental computation approach that is able to monitor large-scale data
dynamically. Lightweight incremental algorithms should be considered that are
capable of achieving robustness, high accuracy, and minimum pre-processing
latency. In this paper, we investigated the possibility of using a group of
incremental classification algorithms for classifying the collected data streams
pertaining to Big Data. As a case study, the empirical data streams were represented by
five datasets from the UCI archive, drawn from different domains and having very
large numbers of features. We compared traditional classification model induction with
its incremental counterpart. In particular, we proposed a novel
lightweight feature selection method using Swarm Search and Accelerated PSO,
which is intended to be useful for data stream mining. The evaluation results
showed that the incremental method obtained a higher gain in accuracy per second
incurred in pre-processing. The contribution of this paper is a spectrum of
experimental insights for anybody who wishes to design data stream mining
applications for big data analytics using a lightweight feature selection approach
such as Swarm Search and APSO. In particular, APSO is designed for
data mining of data streams on the fly. The combinatorial explosion is addressed
by a swarm search approach applied in an incremental manner. This approach also
fits better with real-world applications, where data arrive in streams. In
addition, an incremental data mining approach is more likely to meet the demands of
big data problems in service computing.