
Big Data & Hadoop

Agenda
Introduction to Big Data
Why Big Data?
Big Data Overview
Hadoop Overview
Why Hadoop?
Who can learn Hadoop?

#Trending: Jobs for Hadoop and Java


Hadoop : Architecture & Ecosystem

Introduction to Big Data


Big Data: a term for collections of data sets with large and complex volumes of data.
Volumes are in Petabytes (1,024 TB) or Exabytes (1,024 PB) and will soon be in Zettabytes (1,024 EB).
Hence, the data are hard to interpret and process with existing, traditional data processing applications and tools.

Why Big Data


To manage huge volumes of data in a better way.
To benefit from the speed, capacity and scalability of cloud storage.
To gain potential insights through data analysis methods.
Companies can find new prospects and business opportunities.
Unlike other methods, with Big Data, business users can visualize the data.

Big Data Overview


Big Data includes:
Traditional structured databases from inventories, orders and customer information.
Unstructured data from the web, social networking sites, etc.
The problem with these massive data sets is that they cannot be analyzed with standard tools and procedures.
Processing these data appropriately can help an organization gain useful insights into its business prospects.

Unstructured data Growth

No. of emails sent per second: 2.9 Million
Video uploaded to YouTube per minute: 20 hours
Data processed by Google per day: 20 Petabytes
Tweets per day: 50 Million
Minutes spent on Facebook per month: 700 Billion
Data sent & received by mobile users per day: 1.3 Exabytes
Products ordered on Amazon per second: 73 items

*Source: http://ibm.com/

Unstructured data Growth

*Source: http://forbes.com/

Hadoop Overview
Hadoop allows batch processing of colossal data sets (Petabytes & Exabytes) as a series of parallel processes.
A Hadoop cluster comprises a number of server "nodes".
Nodes store and process data in a parallel and distributed fashion.
It is a parallelized, distributed storage & processing framework that can operate on commodity servers.

Commodity Hardware
Commodity hardware means an average amount of computing resources.
It does not imply low quality, but affordability.
Hadoop clusters run on commodity servers.
Commodity servers have an average ratio of disk space to memory, unlike specialized servers with high memory or CPU.
These servers are not designed specifically for a distributed storage and processing framework, but they are made to fit the purpose.

Benefits of Hadoop
Scalable: Hadoop can store and distribute very large data sets across hundreds of inexpensive servers that operate in parallel.
Fault Tolerant: HDFS replicates each file a specified number of times and automatically re-replicates data blocks that were stored on failed nodes (see the sketch below).
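To make the fault-tolerance point concrete, here is a minimal sketch (not part of the original slides) of how the replication factor can be controlled with the standard Hadoop FileSystem Java API; the file path /data/orders.csv and the replication values are illustrative assumptions.

```java
// Illustrative sketch: setting HDFS replication via the Hadoop FileSystem API.
// The path and replication factors below are assumptions for the example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");      // default number of copies per block

        FileSystem fs = FileSystem.get(conf);  // connect to the configured HDFS
        // Keep 5 copies of an existing (hypothetical) file; if a node fails,
        // the NameNode schedules re-replication of the blocks it held.
        fs.setReplication(new Path("/data/orders.csv"), (short) 5);
        fs.close();
    }
}
```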

Benefits of Hadoop
Cost-Effective: Hadoop is a scale-out architecture that can store all of a company's data for later use, offering computing and storage capabilities at a reasonable price.
Speed: Hadoop's storage method is based on a distributed file system, resulting in much faster data processing.
Flexible: Hadoop easily accesses new data sources and different types of data to generate insights.

*Source: http://datanami.com/

Why Hadoop
It provides insights into daily operations.
Drives new product ideas.
Used by companies for research and development and for marketing analysis.
Image and text processing.
Analyzes huge amounts of data in comparatively little time.
Network monitoring.
Log and/or clickstream analysis of various kinds.

Hadoop Forecast

*Source: http://alliedmarketresearch.com/

Who can Learn Hadoop


Anyone with basic knowledge of Java & Linux.
Even if you have not been introduced to Java & Linux before, you can learn them in parallel with Hadoop.
Hadoop project roles include Architect, Developer, Tester and Linux/Network/Hardware Administrator.
Some of these roles need knowledge of Java and some don't.

Who can Learn Hadoop


SQL knowledge will help in learning HiveQL, the query language of Hive in the Hadoop ecosystem.
Knowledge of Linux will be helpful in understanding Hadoop command-line parameters.
But even without any prerequisite knowledge of Java & Linux, you can learn Hadoop with the help of a few basic classes.

#Trending: Hadoop Jobs

*Source: http://the451group.com/

Job Opportunities in Hadoop


MNCs like IBM, Microsoft & Oracle have integrated Hadoop into their products.
Also, companies like Facebook, Hortonworks, Amazon, eBay and Yahoo! are currently looking for Hadoop professionals.
So, companies are looking for IT professionals with solid Hadoop MapReduce skills.

Salary Trend in Hadoop

*Source: http://itproportal.com/

Hadoop Architecture
The 2 main components of Hadoop are:
Hadoop Distributed File System (HDFS), the storage component, which breaks files into blocks, then replicates and stores them across the cluster.
MapReduce, the processing component, which distributes the workload for operations on files stored in HDFS and automatically restarts failed work (see the sketch after this slide).

*Source: http://cloudera.com/
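To show how the two components fit together, here is a hedged sketch of the classic WordCount job in Java: HDFS supplies the input blocks, mappers run in parallel on the nodes holding those blocks, and a reducer sums the per-word counts. The class name and the input/output paths are illustrative, not from the slides.

```java
// Illustrative WordCount sketch using the standard Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input split it is given.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word across all mappers.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```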

Hadoop Ecosystem
Apache Hadoop Distributed File System (HDFS) offers storage of large files across multiple machines.
Apache MapReduce is a programming model for processing large data sets with a parallel & distributed algorithm on a cluster.
Apache Hive is a data warehouse on top of distributed storage, facilitating data summarization, queries and management of large datasets.
Apache Pig is an engine for executing data flows in parallel on Apache Hadoop.
Apache HBase is a non-relational distributed database performing real-time operations on large tables (see the sketch below).
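As a small illustration of the real-time reads and writes mentioned for HBase, the sketch below stores and fetches one row with the standard HBase Java client; the table name "orders", the column family "info" and the row key are assumptions for the example.

```java
// Illustrative HBase client sketch; table, column family and row key are assumptions.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("orders"))) {

            // Write one cell: row "order#1001", column family "info", qualifier "status".
            Put put = new Put(Bytes.toBytes("order#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("status"), Bytes.toBytes("SHIPPED"));
            table.put(put);

            // Read the same row back in real time.
            Result result = table.get(new Get(Bytes.toBytes("order#1001")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("status"))));
        }
    }
}
```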

Hadoop Ecosystem
Apache Flume collects and aggregates unstructured (log and event) data into HDFS.
Apache Sqoop is a system for transferring bulk data between HDFS and relational databases.
Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Apache ZooKeeper is a coordination service with tools for writing correct distributed applications.
Apache Avro is a framework for modelling, serializing and making Remote Procedure Calls.

Q&A


THANK YOU
