
BIG DATA ANALYSIS

SHUBHAM GUPTA
B.TECH COMPUTERS
BATCH: E3, ROLL NO.: E059

ABSTRACT

The need for methods other than traditional database management systems
What big data is
The factors by which big data can be characterized
The importance of big data
Big data tools
Issues and challenges faced by big data
Indexing methods
Applications
Brief details about data mining

BRIEF DESCRIPTION

DBMS and Big Data

A database management system (DBMS) is the primary and simplest method of storing data.
Data has been increasing rapidly for over a decade.
An RDBMS, being a simple system, is not efficient at this scale and collapses.
This huge pool of data is big data.

BIG DATA
Big data requires certain approaches: techniques, tools, and architecture.
The motive is to solve old and new problems in a better way.
Big data generates value from the storage and processing of very large quantities of digital information that cannot be analysed by traditional methods.

EXAMPLES

RFID readings
Transactions at Walmart
VISA transactions
Tweets
Active users on Facebook generating social interaction data, etc.

CHARACTERISTICS OF BIG DATA

Volume: data quantity
Velocity: data speed
Variety: data types

IMPORTANCE OF BIG DATA


The main motive is to take the large data, find the relevant information in it, and analyze it to find solutions that reduce time and cost.
It helps to develop new products and supports smarter decision making.
With the help of proper analytics tools we can also find the root causes of failures, issues, and defects in near real time, and optimize routes.

BIG DATA ANALYTICS


Examining large amounts of data
Extracting appropriate information
Identification of hidden patterns
Competitive advantage
Better business decisions, strategic and operational
Effective marketing, customer satisfaction, and increased revenue

BIG DATA TOOLS


NoSQL solutions: they provide a more dynamic model that is comparatively less rigid than the relational one.

Apache Cassandra: a better fit since it can handle high-velocity data easily and effectively, uses a schema that supports a larger variety of data sets, provides continuous availability, and is much cheaper than an RDBMS.

Apache Hadoop: it permits distributed processing of large datasets across clusters of systems; it handles both data storage and data processing.

BIG DATA TOOLS


SAP HANA: it helps process large amounts of data in a short period of time. As an in-memory database, HANA performs processing on data held in RAM, so we can get immediate results from user transactions and data analysis.

CHALLENGES & ISSUES


The challenges include capture, curation, storage, algorithms, search, sharing, transfer, analysis, and visualization.

Issues:
Analysis of the data and data quality
Searching the data through retrieval algorithms
Addressing data quality
Displaying meaningful results

Another major issue is security.

INDEXING METHOD: MapReduce

Map(): this function sorts the input into queues and generates intermediate keys.
Reduce(): this function is used to find occurrences or frequencies; it merges the occurrences of similar intermediate keys.
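As a sketch of how these two functions cooperate, here is a minimal single-process word-count simulation of the map, shuffle, and reduce phases. The names map_phase, shuffle, and reduce_phase are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit an intermediate (key, 1) pair for every word."""
    intermediate = []
    for doc in documents:
        for word in doc.split():
            intermediate.append((word.lower(), 1))
    return intermediate

def shuffle(intermediate):
    """Shuffle: group intermediate pairs by key before reducing."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: merge occurrences of each intermediate key into a count."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data needs big tools", "data tools"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"], counts["data"], counts["tools"])  # prints: 2 2 2
```

In a real MapReduce framework the shuffle is performed by the runtime across machines; here it is simulated in one process with a dictionary.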

INDEXING METHOD: Parallel Indexing

Two-level indexing
Global level: key -> node
Like a table, it should be partitioned and replicated
Problems: index partitioning, and update propagation for replicas
Local level (within a node): like a normal index
Easier with hash tables, bitmaps, and inverted files
More complex with trees or graphs
Index partitioning: replicate the top levels, partition the low levels
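A minimal sketch of the two-level scheme, under simplifying assumptions: the global level is plain hash partitioning of keys across nodes, each node's local index is an ordinary hash table, and replication and update propagation are omitted. The class TwoLevelIndex is hypothetical, not from any library:

```python
class TwoLevelIndex:
    """Toy two-level index: a global key -> node mapping over
    per-node local hash-table indexes (no replication)."""

    def __init__(self, num_nodes=4):
        # Local level: one ordinary hash index (dict) per node.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Global level: hash partitioning decides which node owns the key.
        return hash(key) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        # Route through the global level, then probe the local index.
        return self.nodes[self._node_for(key)].get(key)

idx = TwoLevelIndex()
idx.put("user:42", {"name": "Ada"})
print(idx.get("user:42"))  # prints: {'name': 'Ada'}
```

With a tree-structured local index instead of a dict, the lookup within a node would be the "more complex" case the slide mentions.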

Applications
Maximum temperature: weather sensors collect data across the globe; all the data is gathered by the MapReduce method and sorted, and the maximum temperature can be calculated.
Word count: used to count the number of times each word appears in a document; a mapper emits each word, and a reducer sums the counts for each word and emits a single value.
Anagram: a word play in which all the letters of a word are each used exactly once to create a new word.
Election commission: an election requires a lot of data, such as a person's history, region, location, party details, candidate details, etc.
Natural disaster data: every year, details of damage, climate, and affected areas are gathered; together they form big data and need to be handled by big data tools.
Mutual friend problem: there are too many requests, so mutual friends are computed daily in advance and a quick response is sent.
University database: holds courses, student data, enrolment data, and employee data, which is fetched by various mining techniques, etc.
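The maximum-temperature application above can be sketched as a tiny in-process MapReduce. The station names and temperature values below are invented for illustration:

```python
from collections import defaultdict

# Each record: (station_id, temperature_celsius) from a weather sensor.
readings = [
    ("delhi", 41.0), ("delhi", 44.5), ("oslo", 19.2),
    ("oslo", 23.1), ("delhi", 39.8),
]

def map_reading(record):
    """Map: emit the station as the intermediate key, temperature as value."""
    station, temp = record
    return (station, temp)

def reduce_max(groups):
    """Reduce: keep only the maximum temperature per station."""
    return {station: max(temps) for station, temps in groups.items()}

# Shuffle: group the mapped pairs by station before reducing.
groups = defaultdict(list)
for key, value in (map_reading(r) for r in readings):
    groups[key].append(value)

print(reduce_max(groups))  # prints: {'delhi': 44.5, 'oslo': 23.1}
```

The same map/shuffle/reduce skeleton, with `sum` in place of `max`, gives the word-count application.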

DATA MINING
Big data is a term for a large data set.
Data mining refers to the activity of going through big data sets to look for relevant or pertinent information.

Data mining techniques

Cluster Detection.
Decision Trees.
Memory-Based Reasoning.
Neural Networks.
Genetic Algorithm.
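To illustrate the first of these techniques, cluster detection, here is a minimal one-dimensional k-means sketch; kmeans_1d is an illustrative helper written for this slide, not a library routine:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # pick k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute centroids; keep the old one if a cluster is empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups: values near 1 and values near 10.
data = [0.9, 1.0, 1.1, 9.8, 10.0, 10.2]
print(kmeans_1d(data, k=2))  # roughly [1.0, 10.0]
```

Real data-mining toolkits run the same idea in many dimensions and with smarter initialisation.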


ADVANTAGES & DISADVANTAGES

ADVANTAGES
Provides geographic diversity and redundancy.
Ample computing options are available.
Huge workloads can be handled by adding new, good-quality hardware along with the latest processing capability.
It can also work stand-alone, and can be integrated with SoftLayer solutions and services.
Multi-level security protects the big data against various physical and electronic threats, and a number of additional safety measures are taken.
Payments can be made according to the services availed by the application.

DISADVANTAGES
It encourages large data collections that hold on to incomprehensible data in case it proves to be useful.
People are unaware of what information is collected and where it is located.
Personal information of people is combined with vast data sets, so personal details can be inferred.
It permits the public to be manipulated.
Risk analysis can treat people unfairly and carelessly.
It is also used by spies and governments to find out about terrorist activities, but the major problem is that sometimes even the common public can face adverse conditions, and data mining cannot efficiently single out the actual terrorists.

INFERENCE

We can infer that big data is efficient since it uses certain indexing techniques to organise data.
We also performed a comparative study of different algorithms, as seen earlier.

INFERENCE
We got to know about the different indexing methods, their similarities and their differences.
For example, the MapReduce method sorts the queue and generates intermediate keys, whereas parallel indexing builds structures such as trees and graphs.
From this we can conclude that if we use proper algorithms and techniques, so that bulks of data are managed and handled in a timely manner, then that data will be of use; otherwise it will just occupy space and be of no use.

QUESTIONS ANSWERED

The questions answered are:

Why do we use big data?
How feasible is it, and at what scale do companies use it?
How is data processed immediately?
How are different kinds of data handled?
How is data indexing done?

QUESTIONS TO BE ANSWERED

Yet to be answered:
How do we maintain a balance between security and privacy?
How should risk analysis be done?
How do we build an analytics architecture that can balance real-time data and historic data?
How do we achieve efficient distributed processing?

CONCLUSION

As data is increasing day by day, databases and DBMSs are no longer enough to handle it.
So, to organise the pool of data, i.e. big data, and extract the real information from it, we use tools like Hadoop and MongoDB, two of the leading Big Data technologies for big data analysis.
Using the MapReduce model, a computation can process many terabytes of data on thousands of machines, and programmers find the system easy to use.
Data mining and big data are closely related: information can be retrieved from the pool of data by applying the most optimised data mining algorithm to the big data.

