
Big Data Architecture

12 June 2017 | Proprietary and confidential information. © Mphasis 2017


Agenda

✓ Hadoop Features
✓ Hadoop Components
✓ Hadoop Processes
✓ Hadoop Architecture
✓ MapReduce Framework
✓ What is YARN
✓ What is ZooKeeper

Hadoop Features

➢ The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware.

➢ HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.


➢ HDFS provides high throughput access to application data and is suitable for applications that have large data sets.

➢ HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

➢ HDFS was originally built as infrastructure for the Apache Nutch web search engine project.

➢ HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/

Hadoop Components

o Hadoop is a combination of two independent components

o HDFS (Hadoop Distributed File System)
  ➢ Designed for scaling in terms of storage and I/O bandwidth

o MR framework (MapReduce)
  ➢ Designed for scaling in terms of performance

Hadoop Process

➢ Achieving data localization

  ❖ Moving the application to the place where the data resides, OR

  ❖ Making the data local to the application

Hadoop Processes

o Processes running on Hadoop

  Used by HDFS
  ▪ NameNode
  ▪ DataNode
  ▪ Secondary NameNode

  Used by the MapReduce framework
  ▪ JobTracker
  ▪ TaskTracker

Hadoop Processes

➢ Two masters

  NameNode
  • If it is down, HDFS cannot be accessed

  JobTracker
  • If it is down, MapReduce jobs cannot run, but HDFS can still be accessed

NameNode and DataNode

HDFS has a master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the
file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster,
which manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes.

NameNode and DataNode

The NameNode executes file system namespace operations like opening, closing,
and renaming files and directories.

It also determines the mapping of blocks to DataNodes.

DataNodes are responsible for serving read and write requests from the file system’s
clients.

DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.

HDFS is built using the Java language, so any machine that supports Java can run the
NameNode or the DataNode software.

Hadoop Architecture

File System Namespace

✓ HDFS supports a traditional hierarchical file organization.

✓ A user or an application can create directories and store files inside these directories.

✓ The file system namespace hierarchy is similar to most other existing file systems; one can
  create and remove files, move a file from one directory to another, or rename a file.

✓ HDFS does not yet implement user quotas and does not support hard links or soft links.
  However, the HDFS architecture does not preclude implementing these features.

File System Namespace

✓ The NameNode maintains the file system namespace.

✓ Any change to the file system namespace or its properties is recorded by the NameNode.

✓ An application can specify the number of replicas of a file that should be maintained by HDFS.

✓ The number of copies of a file is called the replication factor of that file. This information is
  stored by the NameNode.

Data Replication

✓ An application can specify the number of replicas of a file that should be maintained by HDFS.

✓ The number of copies of a file is called the replication factor of that file. This information is
  stored by the NameNode.

Overview of HDFS

✓ NameNode is the single point of contact

  ❖ Holds the metadata of HDFS

  ❖ If it fails, HDFS is inaccessible

✓ DataNodes hold the actual data

  ❖ Store the data in blocks

  ❖ Blocks are stored on the local file system

Overview of MapReduce

✓ A MapReduce job consists of two kinds of tasks

  ▪ Map tasks
  ▪ Reduce tasks

✓ Blocks of data distributed across several machines are processed by map tasks in parallel

✓ Results are aggregated in the reducer

✓ Works only on KEY/VALUE pairs

Secondary NameNode

✓ Not a backup or standby NameNode

✓ Its only purpose is to take a snapshot of the NameNode by merging the edit log contents
  into the metadata file on the local file system

✓ It is a CPU-intensive operation

✓ In a big cluster it runs on a separate machine

Secondary NameNode Cont’d

✓ Two important files, present under the directory
  /home/software/hadoop-temp/dfs/name/previous.checkpoint
  ✓ Edits file
  ✓ Fsimage file

✓ When the Hadoop cluster is started (start-all.sh), the NameNode
  • Restores the previous state of HDFS by reading the fsimage file
  • Then applies the modifications recorded in the edits file to the metadata
  • Once the modifications are applied, it empties the edits file
  • This process happens only during start-up

Secondary NameNode Cont’d

✓ Over a period of time the edits file can grow very large and the next start-up can take much
  longer

✓ The Secondary NameNode periodically merges the edits file contents with the fsimage file to
  keep the edits file within a reasonable size

Job Tracker

✓ MapReduce master
✓ Client submits the job to JobTracker
✓ JobTracker talks to the NameNode to get the list of blocks
✓ Job Tracker locates the task tracker on the machine where data is located
✓ Data Localization
✓ Job Tracker then first schedules the mapper tasks
✓ Once all the mapper tasks are over it runs the reducer tasks

Task Tracker

✓ Responsible for running the tasks (map or reduce tasks) delegated by the JobTracker

✓ For every task, a separate JVM process is spawned

✓ Periodically sends a heartbeat signal to inform the JobTracker about

  • the number of available slots

  • the status of running tasks

Heartbeat

What is a heartbeat in Hadoop?

Heartbeat Cont’d

✓ A heartbeat is a signal indicating that a node is alive.

✓ A DataNode sends heartbeats to the NameNode, and a TaskTracker sends heartbeats to the
  JobTracker.
✓ If the NameNode or JobTracker does not receive a heartbeat, it concludes that the DataNode
  has a problem or that the TaskTracker is unable to perform the assigned task.

Typical data storage in Hadoop

✓ How data is stored in Hadoop

✓ Example: storing data with a replication factor of 3

✓ What is the replication factor?
  • The replication factor describes how data is replicated across multiple nodes, i.e. how many
    copies of the data need to be maintained
  • This is how HDFS achieves fault tolerance and high availability (see the sketch below)
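A minimal sketch of applying a replication factor from client code, using the public FileSystem API; the file path below is a placeholder, not taken from the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask the NameNode to keep 3 copies of each block of this (hypothetical) file.
    fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);
    fs.close();
  }
}

The same effect can be had from the shell with "hdfs dfs -setrep 3 /user/demo/data.txt"; the cluster-wide default comes from the dfs.replication property.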

How data is stored in HDFS

[Diagram: three racks (Rack 1, Rack 2, Rack 3); blocks A, B and C are each replicated across
the racks, illustrating a replication factor of 3.]

Test Your Knowledge

✓ What are the Hadoop components?

✓ What are the Hadoop processes?
✓ What are the master components in Hadoop?
✓ What is meant by data replication?
✓ How many tasks does a MapReduce job consist of, and what are they?
✓ What is the Secondary NameNode?
✓ What is a heartbeat?

Reading from HDFS

✓ The client connects to the NameNode (NN)

✓ It asks the NN for the list of DataNodes (DNs) hosting the replicas of the blocks of the file

✓ The client then reads directly from the DataNodes without contacting the NN again

✓ Along with the data, a checksum is also shipped for verifying data integrity
  • If a replica is corrupt, the client informs the NN and tries to get the data from another
    DataNode (DN); a client-side sketch follows below
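A minimal client-side sketch of this read path, assuming a cluster configured through core-site.xml / hdfs-site.xml; the file path is a placeholder.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/input.txt");   // hypothetical HDFS path

    // open() consults the NameNode for block locations; the data itself is then
    // streamed directly from the DataNodes, with checksums verified during the read.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}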

MapReduce Framework

✓ What is MapReduce job?

✓ What is input split?

✓ What is mapper?

✓ What is reducer?

Typical MapReduce Workflow

Typical MapReduce Workflow Cont’d

MR Job for word count
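The original slide showed the word-count program as an image. Below is a minimal sketch of such a job against the standard org.apache.hadoop.mapreduce API; the class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative rather than taken from the slide.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: called once per record (line); emits (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(line.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: receives one key and the list of its values; sums the counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}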

Checking MR job status on console

MR Job status on admin

MapReduce Job

✓ It’s a framework for processing the data residing on HDFS


• Distributes the task (map/reduce)
✓ Consist of typically 5 phases:
• Map
• Partitioning
• Sorting
• Shuff ling
• Reduce

✓ A single map task works typically on one block of data (dfs.block.size)


• No of blocks / input split = No of map tasks
✓ After all map tasks are completed the output from the map is passed on to the machines where reduce
task will run

MR Terminology

✓ What is a job?
  • The complete execution of mappers and reducers over the entire data set
✓ What is a task?
  • A single unit of execution (a map or reduce task)
  • A map task typically executes over one block of data (dfs.block.size)
  • A reduce task works on mapper output
✓ What is a "task attempt"?
  • An instance of an attempt to execute a task (map or reduce task)
  • If a task fails while working on a particular portion of data, another attempt will run on that
    portion of data on the same machine
  • If a task fails 4 times, the task is marked as failed and the entire job fails
  • The framework makes sure that at least one attempt of the task is run on a different machine

MR Terminology Cont’d

✓ How many task attempts can run on a portion of data?

  • Maximum 4
  • If "speculative execution" is ON, more attempts will run
✓ What is a "failed task"?
  • A task can fail due to an exception, machine failure, etc.
  • A failed task will be re-attempted (up to 4 times)
✓ What is a "killed task"?
  • If a task fails 4 times, the task is killed and the entire job fails
  • A task that runs as part of speculative execution will also be marked as killed

Input Split

✓ The portion or chunk of data on which a mapper operates

✓ An input split is just a reference to the data

✓ Typically the input split size is equal to one block of data (dfs.block.size)

✓ Each mapper works on only one input split

Input Split Cont’d

✓ The input split size can be controlled

  • Useful for performance tuning
  • Generally the input split is equal to the block size (64 MB)
  • What if you want a mapper to work on only 32 MB of a block's data?
✓ Controlled by 3 properties (see the configuration sketch below):
  • mapred.min.split.size (default 1)
  • mapred.max.split.size (default Long.MAX_VALUE)
  • dfs.block.size (default 64 MB)
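A minimal sketch of forcing 32 MB splits through these properties; the class name is illustrative, and newer Hadoop releases expose the same knobs as mapreduce.input.fileinputformat.split.minsize / split.maxsize.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("mapred.min.split.size", 1L);                 // default 1
    conf.setLong("mapred.max.split.size", 32L * 1024 * 1024);  // cap each split at 32 MB
    // dfs.block.size stays as configured on the cluster (64 MB here);
    // the effective split size is max(minSize, min(maxSize, blockSize)) = 32 MB.
    Job job = Job.getInstance(conf, "smaller splits");
    // ... set mapper/reducer and input/output paths as usual ...
  }
}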

What is Mapper

✓ The mapper is the first phase of a MapReduce job

✓ It typically works on one block of data

✓ The MapReduce framework ensures that a map task runs close to the data to avoid
  network traffic
  • Several map tasks run in parallel on different machines, each working on a different
    portion (block) of the data

✓ A mapper reads key/value pairs and emits key/value pairs

Mapper Cont’d

✓ The mapper can use or ignore the input key

✓ For each input, the mapper can emit

  ✓ zero key/value pairs
  ✓ one key/value pair
  ✓ "n" key/value pairs

Mapper Cont’d

✓ The map function is called for one record at a time

  • An input split consists of records
  • For each record in the input split, the map function is called once
  • Each record is sent to the map function as a key/value pair
  • So when writing a map function, keep ONLY one record in mind
  • The mapper does not keep state about how many records it has processed or how many
    records are still to come
  • It knows only the current record

What is reducer

✓ The reducer runs when all the mapper tasks are completed

✓ After the mapper phase, all the intermediate values for a given intermediate key are
  grouped together to form a list

✓ This list is given to the reducer

  • The reducer operates on a key and its list of values

  • When writing a reducer, keep ONLY one key and its list of values in mind
  • Reduce operates on ONLY one key and its list of values at a time

Reducer Cont’d

✓ NOTE: all the values for a particular intermediate key go to one reducer

✓ There can be zero, one, or "n" reducers

  • For better load balancing you should have more than one reducer

Input Format

Text Input Format
  • Key: offset of the line within the file
  • Value: the entire line up to "\n"

Key Value Text Input Format
  • Key: the part of the record up to the first delimiter
  • Value: the remainder of the record after the first delimiter

Sequence File Input Format
  • Key: determined from the file header
  • Value: determined from the file header

Input Format Cont’d

Text Input Format
  • Key data type: LongWritable
  • Value data type: Text

Key Value Text Input Format
  • Key data type: Text
  • Value data type: Text

Sequence File Input Format
  • Key data type: ByteWritable
  • Value data type: ByteWritable
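A minimal sketch of selecting one of these input formats on a job, using the standard library classes listed above. The delimiter property name shown is the Hadoop 2.x one; older releases used key.value.separator.in.input.line, and the default delimiter is a tab.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class InputFormatExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Split each record into key and value at the first occurrence of this delimiter.
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

    Job job = Job.getInstance(conf, "key-value input");
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setMapOutputKeyClass(Text.class);   // the mapper now sees Text keys, not LongWritable offsets
    job.setMapOutputValueClass(Text.class);
    // ... mapper/reducer and input/output paths as usual ...
  }
}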

Text Input Format

Efficient for processing text data

Example:

Hello, How are you

Hey, I am good

I am from Mphasis

Text Input Format Cont’d

✓ Internally, every line is associated with an offset

✓ The offset is treated as the key; the first column below is the offset

✓ For simplicity, line numbers are shown instead of byte offsets

  0  Hello, How are you

  1  Hey, I am good

  2  I am from Mphasis

Text Input Format Cont’d

0 Hello, How are you

1 Hey, I am good

2 I am from Mphasis

Key Value
0 Hello, How are you
1 Hey, I am good
2 I am from Mphasis

How Input Split is processed by Mapper

✓ The input split by default is the block size (dfs.block.size)

✓ Each input split / block comprises records
  • A record is one line in the input split, terminated by "\n" (the newline character)

✓ Every input format has a "RecordReader"

  • The RecordReader reads the records from the input split
  • The RecordReader reads ONE record at a time and calls the map function
  • If the input split has 4 records, the map function is called 4 times, once for each record
  • It sends each record to the map function as a key/value pair
Combiner

✓ A large number of mappers running will produce a large amount of intermediate data

  • This data needs to be passed to the reducer over the network
  • Lots of network traffic
  • Shuffling/copying the mapper output to the machine where the reducer will run takes a
    lot of time

Combiner Cont’d

✓ Similar to a reducer
✓ Runs on the same machine as the mapper task
✓ Runs the reducer code on the intermediate output of the mapper
✓ Thus minimizes the intermediate key/value pairs
✓ The combiner runs on the intermediate output of each mapper

Advantages
✓ Minimizes data transfer across the network
✓ Speeds up execution
✓ Reduces the burden on the reducer

Combiner Cont’d

✓ A combiner has the same signature as the reducer class

✓ You can make an existing reducer run as the combiner if
  ✓ the operation is associative and commutative in nature
    Example: sum, multiplication
  ✓ an average operation cannot be reused this way

✓ The combiner may or may not run; this depends on the framework

✓ It may run more than once on the mapper machine (see the sketch below)
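A minimal sketch of wiring a combiner into a job, reusing the mapper and reducer from the word-count sketch shown earlier (class names are illustrative). Summation is associative and commutative, so the reducer class can double as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);  // same code as the reducer
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the framework is free to invoke the combiner zero, one, or several times per mapper, the job's result must not depend on whether the combiner actually ran.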

Partitioner

✓ The partitioner is called after you emit your key/value pairs from the mapper:
  context.write(key, value)
✓ A large number of mappers running will generate a large amount of data
  • If only one reducer is specified, every intermediate key and its list of values goes to that
    single reducer
✓ Copying will take a lot of time
✓ Sorting will also be time consuming
✓ Can a single machine even handle that much intermediate data?

✓ The solution is to have more than one reducer


Partitioner Cont’d

✓ The partitioner divides the keys among the reducers

  • If more than one reducer is running, the partitioner decides which key/value pair
    should go to which reducer

✓ The default is the "HashPartitioner"

  • It calculates the hash code of the key and applies the modulo operator with the total
    number of reducers that are running (see the sketch below)
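A minimal sketch of a custom partitioner next to the default hash behaviour described above; the class name and the "route by first letter" rule are purely illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // The default HashPartitioner would compute:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // Here instead, keys starting with a-m go to reducer 0 and the rest to reducer 1
    // (assuming at least two reducers are configured).
    if (numReduceTasks == 0 || key.getLength() == 0) {
      return 0;
    }
    char first = Character.toLowerCase(key.toString().charAt(0));
    return (first <= 'm' ? 0 : 1) % numReduceTasks;
  }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(2).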

Speculative Execution

Speculative Execution – Cont’d

➢ When the Hadoop framework finds that a certain task (mapper or reducer) is taking longer on
  average than the other tasks from the same job, it clones the "long running" task and runs it
  on another node. This is called speculative execution.

➢ In other words, Hadoop speculates that something is wrong with the "long running" task and
  runs a clone of that task on another node.

➢ The slowness of the "long running" task could be due to faulty hardware, network congestion,
  or simply a busy node.

➢ Most of the time this is a false alarm and the task which was considered long running or
  problematic completes successfully. In that case Hadoop kills the cloned task and proceeds
  with the results from the completed task.

NextGen – MR ( YARN )

What is YARN
✓ In MRv2, the JobTracker is split into its two major functionalities:
  ✓ resource management and
  ✓ job scheduling/monitoring

✓ The ResourceManager (RM) is the ultimate authority that arbitrates resources among all the
  applications in the system

✓ The ApplicationMaster (AM) is a framework-specific library tasked with negotiating resources
  from the ResourceManager and working with the NodeManager(s) to execute and monitor
  the tasks

NextGen – MR ( YARN ) – Cont’d

NextGen – MR ( YARN ) – Cont’d

✓ The ResourceManager has two components

  • Scheduler
  • ApplicationsManager

✓ The Scheduler is responsible for allocating resources to the various running applications

✓ The Scheduler offers no guarantees about restarting failed tasks, whether they fail due to
  application failure or hardware failure

✓ The Scheduler performs its scheduling function based on the resource requirements of
  the applications
NextGen – MR ( YARN ) – Cont’d

✓ The ApplicationsManager is responsible for job submissions

✓ The NodeManager is the per-machine framework agent which is responsible for containers,
  monitoring their resource usage (CPU, memory, disk, network) and reporting the same to
  the ResourceManager/Scheduler

✓ The ApplicationMaster has the responsibility of negotiating appropriate resource containers
  from the Scheduler, tracking their status and monitoring their progress

Differences b/w MR1 and MR2

MR Job execution on YARN

ZooKeeper

ZooKeeper

What is ZooKeeper

✓ ZooKeeper is an open source Apache project that provides a centralized infrastructure and
  services that enable synchronization across a cluster.

✓ ZooKeeper maintains common objects needed in large cluster environments. Examples of
  these objects include configuration information, a hierarchical naming space, and so on.

A Typical ZooKeeper Service

Test Your Knowledge

✓ What is an input split?

✓ What is a mapper?
✓ What is a reducer?
✓ What are the different types of input formats?
✓ What is a combiner?
✓ What is a partitioner?
✓ What is YARN?
✓ What is ZooKeeper?

References

https://en.wikipedia.org/wiki/Apache_Hadoop

https://hadoop.apache.org/

https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/YARN.html

