
www.edureka.in/hadoop Slide 1
www.edureka.in/hadoop Slide 2
How It Works
LIVE Online classes
Class recordings in Learning Management System (LMS)
Module-wise Quizzes, Coding Assignments
24x7 on-demand technical support
Project work on large Datasets
Online certification exam
Lifetime access to the LMS
Complimentary Java Classes
www.edureka.in/hadoop Slide 3
Module 1
Understanding Big Data
Hadoop Architecture
Module 2
Hadoop Cluster Configuration
Data loading Techniques
Hadoop Project Environment
Module 3
Hadoop MapReduce framework
Programming in MapReduce
Module 4
Advanced MapReduce
MRUnit testing framework
Module 5
Analytics using Pig
Understanding Pig Latin
Module 6
Analytics using Hive
Understanding HiveQL
Module 7
Advanced Hive
NoSQL Databases and HBase
Module 8
Advanced HBase
ZooKeeper Service
Module 9
Hadoop 2.0 New Features
Programming in MRv2
Module 10
Apache Oozie
Real world Datasets and Analysis
Project Discussion
Course Topics
www.edureka.in/hadoop Slide 4
What is Big Data?
Limitations of the existing solutions
Solving the problem with Hadoop
Introduction to Hadoop
Hadoop Eco-System
Hadoop Core Components
HDFS Architecture
MapReduce Job execution
Anatomy of a File Write and Read
Hadoop 2.0 (YARN or MRv2) Architecture
Topics for Today
www.edureka.in/hadoop Slide 5
Lots of Data (Terabytes or Petabytes)
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information.
NYSE generates about one terabyte of new trade data per day, used to perform stock-trading analytics to determine trends for optimal trades.
What Is Big Data?
www.edureka.in/hadoop Slide 6
Unstructured Data Is Exploding
www.edureka.in/hadoop Slide 7
IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Diagram: IBM characterizes Big Data along three Vs: Volume, Velocity, and Variety. Variety spans web logs, images, videos, audio, and sensor data.
IBM's Definition
www.edureka.in/hadoop Slide 8
Hello there!
My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
Annie's Introduction
www.edureka.in/hadoop Slide 9
Annie's Question
Map the following to the corresponding data type:
- XML Files
- Word Docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)
www.edureka.in/hadoop Slide 10
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
Annie's Answer
www.edureka.in/hadoop Slide 11
More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop
http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
http://en.wikipedia.org/wiki/Big_Data
IBM's Definition: Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Further Reading
www.edureka.in/hadoop Slide 12
Web and e-tailing
Recommendation Engines
Ad Targeting
Search Quality
Abuse and Click-Fraud Detection
Telecommunications
Customer Churn Prevention
Network Performance Optimization
Call Detail Record (CDR) Analysis
Analyzing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios
www.edureka.in/hadoop Slide 13
Government
Fraud Detection and Cyber Security
Welfare Schemes
Justice
Healthcare & Life Sciences
Health Information Exchange
Gene Sequencing
Serialization
Healthcare Service Quality Improvements
Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
www.edureka.in/hadoop Slide 14
Banks and Financial Services
Modeling True Risk
Threat Analysis
Fraud Detection
Trade Surveillance
Credit Scoring and Analysis
Retail
Point-of-Sale Transaction Analysis
Customer Churn Analysis
Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
www.edureka.in/hadoop Slide 15
Insight into data can provide business advantage.
Some key early indicators can mean fortunes to business.
More precise analysis is possible with more data.
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
Hidden Treasure
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
Diagram: Sears' existing architecture. Instrumentation and Collection feed a storage-only grid holding the original raw data (mostly append), where 90% of the ~2PB sits archived. An ETL compute grid moves aggregated data into an RDBMS that serves BI reports and interactive apps, so a meagre 10% of the ~2PB data is available for BI.
Limitations:
1. Can't explore original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Limitations of Existing Data Analytics Architecture
www.edureka.in/hadoop Slide 17
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with the existing non-Hadoop solutions.
Diagram: Instrumentation and Collection feed a combined Hadoop storage + compute grid (mostly append, no data archiving), so the entire ~2PB of data is available for processing; an RDBMS holding aggregated data still serves BI reports and interactive apps.
Benefits:
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever
Solution: A Combined Storage + Compute Layer
www.edureka.in/hadoop Slide 18
Diagram: Hadoop's differentiating factors: Accessible, Robust, Simple, Scalable.
Hadoop Differentiating Factors
www.edureka.in/hadoop Slide 19
Hadoop vs. traditional systems (RDBMS / EDW / MPP / NoSQL on the left, Hadoop on the right):

Attribute     | RDBMS / EDW / MPP / NoSQL                  | HADOOP
Data Types    | Structured                                 | Multi and unstructured
Processing    | Limited, no data processing                | Processing coupled with data
Governance    | Standards & structured                     | Loosely structured
Schema        | Required on write                          | Required on read
Speed         | Reads are fast                             | Writes are fast
Cost          | Software license                           | Support only
Resources     | Known entity                               | Growing, complexities, wide
Best Fit Use  | Interactive OLAP analytics, complex ACID   | Data discovery, processing unstructured
              | transactions, operational data store       | data, massive storage/processing

Hadoop: It's about Scale and Structure
www.edureka.in/hadoop Slide 20
Read 1 TB of data:
1 machine, 4 I/O channels, each channel 100 MB/s: 45 minutes
10 machines, 4 I/O channels each, 100 MB/s per channel: 4.5 minutes
Why DFS?
www.edureka.in/hadoop Slide 23
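A quick back-of-the-envelope check on these numbers, taking 1 TB as 1,000,000 MB:

1 machine:   4 channels x 100 MB/s = 400 MB/s
             1,000,000 MB / 400 MB/s = 2,500 s, roughly 42 minutes (the slide rounds to 45)
10 machines: aggregate bandwidth 10 x 400 MB/s = 4,000 MB/s
             1,000,000 MB / 4,000 MB/s = 250 s, roughly 4.2 minutes (about 4.5)

That is the core motivation for a distributed file system: aggregate I/O bandwidth grows linearly with the number of machines reading in parallel.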
Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
What Is Hadoop?
www.edureka.in/hadoop Slide 24
Diagram: Hadoop features: Reliable, Economical, Flexible, Scalable.
Hadoop Key Characteristics
www.edureka.in/hadoop Slide 25
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets
Annie's Question
www.edureka.in/hadoop Slide 26
Annie's Answer
Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience its true power, one needs data in Terabytes, because that is where an RDBMS takes hours and fails, whereas Hadoop does the same in a couple of minutes.
www.edureka.in/hadoop Slide 27
Diagram: the Hadoop Eco-System. Apache Oozie (workflow) sits at the top. Pig Latin (data analysis), Mahout (machine learning), and Hive (DW system) run on the MapReduce framework, which runs on HDFS (Hadoop Distributed File System), alongside HBase. Flume imports unstructured or semi-structured data; Sqoop imports or exports structured data.
Hadoop Eco-System
www.edureka.in/hadoop Slide 28
Hadoop and MapReduce magic in action: LinkedIn Recommendations.
Write intelligent applications using Apache Mahout:
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Machine Learning with Mahout
www.edureka.in/hadoop Slide 29
Hadoop is a system for large-scale data processing. It has two main components:
HDFS: Hadoop Distributed File System (Storage)
Distributed across nodes
Natively redundant
NameNode tracks locations
Self-healing, high-bandwidth clustered storage
MapReduce (Processing)
Splits a task across processors, near the data, and assembles the results
JobTracker manages the TaskTrackers
Hadoop Core Components
www.edureka.in/hadoop Slide 30
Diagram: a Hadoop cluster. The admin node runs the JobTracker (MapReduce engine) and the NameNode (HDFS); each slave node runs a DataNode (HDFS cluster) together with a TaskTracker (MapReduce engine).
Hadoop Core Components (Contd.)
www.edureka.in/hadoop Slide 31
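To make the two components concrete, here is a minimal WordCount sketch against the standard org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are ours for illustration; the Mapper/Reducer base classes and Context methods are the standard Hadoop ones.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: runs on a TaskTracker near the data; emits (word, 1) per word.
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce task: receives all counts for one word and assembles the total.
class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(word, new IntWritable(sum));
  }
}

The Mapper runs as a map task near the data block it processes; the Reducer assembles the partial results, exactly as the diagram above describes.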
Diagram: HDFS architecture. A client issues metadata ops to the NameNode, which keeps the metadata (name, replicas, ...), e.g. /home/foo/data, 3, .... Clients read from and write to the DataNodes directly; the DataNodes, spread across racks (Rack 1, Rack 2), store the blocks and carry out block ops and replication between themselves.
HDFS Architecture
www.edureka.in/hadoop Slide 32
NameNode:
master of the system
maintains and manages the blocks which are present on the DataNodes
DataNodes:
slaves which are deployed on each machine and provide the actual storage
responsible for serving read and write requests for the clients
Main Components Of HDFS
www.edureka.in/hadoop Slide 33
Meta-data in Memory
The entire metadata is in main memory
No demand paging of FS metadata
Types of Metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. access time, replication factor
A Transaction Log
Records file creations, file deletions, etc.
Diagram: the NameNode (stores metadata only) keeps track of the overall file directory structure and the placement of each data block, e.g.
METADATA:
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
NameNode Metadata
www.edureka.in/hadoop Slide 34
Secondary NameNode:
Not a hot standby for the NameNode
Connects to the NameNode every hour*
Housekeeping, backup of NameNode metadata
Saved metadata can be used to rebuild a failed NameNode
Diagram: the NameNode (a Single Point of Failure) hands its metadata to the Secondary NameNode every hour ("You give me metadata every hour, I will make it secure").
Secondary Name Node
www.edureka.in/hadoop Slide 35
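The hourly interval marked * above is a configurable default, not a fixed part of the design. As a sketch, in Hadoop 1.x the interval is set in seconds via fs.checkpoint.period (Hadoop 2 renames it dfs.namenode.checkpoint.period):

<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between Secondary NameNode checkpoints -->
</property>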
The NameNode:
a) is the Single Point of Failure in a cluster
b) runs on Enterprise-class hardware
c) stores meta-data
d) All of the above
Annie's Question
www.edureka.in/hadoop Slide 36
Annie's Answer
All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is a Single Point of Failure in a Hadoop cluster.
www.edureka.in/hadoop Slide 37
Annie's Question
When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:
a) TRUE
b) FALSE
www.edureka.in/hadoop Slide 38
Annie's Answer
False. The Secondary NameNode is used for creating NameNode checkpoints. A failed NameNode can be recovered manually using the edits log and FSImage stored by the Secondary NameNode.
www.edureka.in/hadoop Slide 39
Diagram: job submission, part 1.
1. The user copies the input files into the DFS
2. and submits the job through the client.
3. The client gets the input files' info from the DFS,
4. creates the input splits,
5. uploads the job information (Job.xml, Job.jar) to the DFS,
6. and submits the job to the JobTracker.
JobTracker
www.edureka.in/hadoop Slide 40
Diagram: job submission, part 2.
6. The client submits the job to the JobTracker,
7. which initializes the job in the job queue,
8. reads the job files (Job.xml, Job.jar) from the DFS,
9. and creates the map and reduce tasks: as many maps as there are input splits.
JobTracker (Contd.)
www.edureka.in/hadoop Slide 41
Diagram: task assignment.
10. TaskTrackers (here on hosts H1-H4) send periodic heartbeats to the JobTracker, whose job queue holds pending tasks.
11. The JobTracker picks tasks (data-local if possible)
12. and assigns them to the TaskTrackers.
JobTracker (Contd.)
www.edureka.in/hadoop Slide 42
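The submission flow in the three diagrams above is what a driver program triggers from the client side. A minimal sketch, reusing the WordCountMapper and WordCountReducer classes sketched earlier (Job.getInstance is the Hadoop 2.x form; Hadoop 1.x used new Job(conf, ...)):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input already in DFS (step 1)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output dir must not exist yet
    // Submits the job (steps 2-6) and polls while the JobTracker
    // initializes it and assigns tasks to TaskTrackers (steps 7-12).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}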
Which of the following daemons does the Hadoop framework pick for scheduling a task?
a) NameNode
b) DataNode
c) TaskTracker
d) JobTracker
Annie's Question
www.edureka.in/hadoop Slide 43
Annie's Answer
(d) The JobTracker takes care of all job scheduling and assigns tasks to the TaskTrackers.
www.edureka.in/hadoop Slide 44
Diagram: anatomy of a file write.
1. The HDFS client calls create on the DistributedFileSystem,
2. which creates the file in the NameNode's namespace.
3. The client writes the data;
4. write packets stream down a pipeline of DataNodes, each forwarding the packet to the next,
5. and ack packets travel back up the pipeline.
6. The client closes the stream,
7. and the DistributedFileSystem signals complete to the NameNode.
Anatomy of a File Write
www.edureka.in/hadoop Slide 45
Diagram: anatomy of a file read.
1. The HDFS client (in the client JVM on the client node) calls open on the DistributedFileSystem,
2. which gets the block locations from the NameNode.
3.-5. The client reads the blocks through an FSDataInputStream, pulling each block from a DataNode that holds a replica.
6. The client closes the stream.
Anatomy of a File Read
www.edureka.in/hadoop Slide 46
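Both sequences run through the FileSystem API rather than by contacting DataNodes by hand. A minimal sketch using the standard org.apache.hadoop.fs classes; the /user/demo/hello.txt path is a placeholder, and the cluster address is assumed to come from core-site.xml on the classpath:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");  // placeholder path

    // Write: create() asks the NameNode for the file, then the returned
    // stream pushes packets down the DataNode pipeline (steps 1-7 above).
    try (FSDataOutputStream out = fs.create(file)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: open() fetches block locations from the NameNode, then the
    // stream pulls each block from a DataNode holding a replica (steps 1-6).
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
      in.readFully(buf);
      System.out.println(new String(buf, StandardCharsets.UTF_8));
    }
  }
}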
Replication and Rack Awareness
www.edureka.in/hadoop Slide 47
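Two facts behind this diagram worth stating: HDFS keeps three replicas of each block by default, and the classic placement policy is rack-aware (the first replica on the writer's node or rack, the other two on a different rack). The per-cluster default replication factor is a plain configuration property; a sketch of hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value> <!-- default number of replicas per HDFS block -->
</property>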
In HDFS, blocks of a file are written in parallel; however, the replication of the blocks is done sequentially:
a) TRUE
b) FALSE
Annie's Question
www.edureka.in/hadoop Slide 48
Annie's Answer
True. A file is divided into blocks; these blocks are written in parallel, but the replication of each block happens in sequence, down the DataNode pipeline.
www.edureka.in/hadoop Slide 49
Annie's Question
A file of 400MB is being copied to HDFS. The system has finished copying 250MB. What happens if a client tries to access that file?
a) It can read up to the block that was successfully written.
b) It can read up to the last bit successfully written.
c) It will throw an exception.
d) It cannot see that file until the copy is finished.
www.edureka.in/hadoop Slide 50
Annie's Answer
The client can read up to the successfully written data block; the answer is (a).
www.edureka.in/hadoop Slide 51
Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Further Reading
www.edureka.in/hadoop Slide 52
Set up the Hadoop development environment using the documents present in the LMS:
Hadoop Installation: set up the Cloudera CDH3 Demo VM
Execute basic Linux commands
Execute HDFS hands-on commands
Attempt the Module-1 assignments present in the LMS.
Module-2 Pre-work
www.edureka.in/hadoop Slide 53
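For the HDFS hands-on step, a few representative shell commands as a sketch; the /user/demo path and file name are placeholders:

hadoop fs -mkdir /user/demo              # create a directory in HDFS
hadoop fs -put localfile.txt /user/demo  # copy a local file into HDFS
hadoop fs -ls /user/demo                 # list the directory
hadoop fs -cat /user/demo/localfile.txt  # print the file's contents
hadoop fs -rm /user/demo/localfile.txt   # remove the file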
What's within the LMS
This section will give you an insight into the Big Data and Hadoop course.
This section will give some prerequisites on Java to understand the course better.
Old batch recordings, handy for you.
This section will give an insight into the Hadoop installation.
www.edureka.in/hadoop Slide 54
What's within the LMS
Click here to expand and view all the elements of this Module.
www.edureka.in/hadoop Slide 55
What's within the LMS
Recording of the Class
Presentation
Hands-on Guides
www.edureka.in/hadoop Slide 56
What's within the LMS
Quiz
Assignment
Thank You
See You in Class Next Week
