
YARN

CONTROL OVER PARALLELISM

Agenda
MR1 limitations.
YARN architecture.

MR1 limitations
Scalability
  Maximum cluster size ~4,500 nodes.
  Maximum concurrent tasks ~40,000.
Availability
  A JobTracker failure kills all queued and running jobs.
Inflexible slots
  Fixed map and reduce slots result in low resource utilization.
Lacks support for alternate paradigms and services.

Hadoop Versions

Beyond MR
Applications run natively in Hadoop:
  BATCH (MapReduce)
  INTERACTIVE (Tez)
  ONLINE (HBase)
  STREAMING (Storm, S4, ...)
  GRAPH (Giraph)
  IN-MEMORY (Spark)
  HPC MPI (OpenMPI)
  OTHER (Search, Weave)
All running on top of:
  YARN (Cluster Resource Management)
  HDFS2 (Redundant, Reliable Storage)

Store ALL DATA in one place.
Interact with that data in MULTIPLE WAYS, with control over the resources.

YARN Components - RM
ResourceManager (RM)
  Runs as a standalone daemon on a dedicated machine.
  Master of the cluster resources.
  Primary responsibility is allocating resources to the different applications based on demand and availability (scheduling).
  Does not provide the status of running applications (that is the ApplicationMaster's job).
  Does not provide information about previously executed applications (that is the Job History Server's job).
  Manages nodes and tracks heartbeats from NMs.
  Manages AMs.
  Manages containers.
  Handles AM requests for resources.
  De-allocates containers when an application finishes.
  Manages security for applications.
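For illustration, a minimal client-side sketch (Java, YARN client API) of asking the RM, as master of the cluster resources, about the nodes it manages. The class name is an assumption; the API calls are the standard YarnClient ones.

    import java.util.List;

    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ClusterInfoSketch {
      public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();   // reads yarn-site.xml from the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the RM for all NodeManagers that are currently RUNNING
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
          System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }
        yarnClient.stop();
      }
    }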

YARN Components - RM
The ResourceManager contains several internal services: the Scheduler, the ApplicationsManager, the AM Liveliness Monitor, and the NM Liveliness Monitor.

Scheduler
  Responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc.
ApplicationsManager
  Responsible for accepting job submissions and negotiating the first container for executing the application's ApplicationMaster.
YARN Components - NM
NodeManager (NM)
  A daemon running on each worker node.
  Responsible for monitoring resource availability on the node it runs on.
  Registers with the RM and sends information about the resources available on the node.
  Communicates with the RM to provide information on node services.
  Sends heartbeats and container status.
  Manages processes in containers.
  Launches AMs on request from the RM.
  Launches application processes on request from the AM.
  Monitors resource usage by containers.
  Kills orphan processes.
  Provides logging services.
YARN Components - NM
The NodeManager contains several internal components: the NodeStatusUpdater, the ContainerManager, and the ContainerExecutor.

NodeStatusUpdater
  Registers with the RM and sends information about the resources available on the node.
ContainerManager
  Core of the NM; contains the RPC server, ContainerLauncher, ContainerMonitor, LogHandler, etc.
ContainerExecutor
  Interacts with the underlying operating system to securely place files and directories needed by containers, and subsequently to launch and clean up the processes corresponding to containers.

YARN Components - AM
ApplicationMaster (AM)
  One per application.
  Application specific.
  Runs in a container.
  Requests more containers to run application tasks.
  Negotiates resources with the ResourceManager and works with the NodeManager(s) to execute and manage the containers and their resource consumption (a sketch follows).
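A hedged sketch of that negotiation using the AMRMClient API; the class name, host name, and messages are hypothetical, and container requests/task launches are only indicated by a comment.

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AmSketch {
      public static void main(String[] args) throws Exception {
        // The AM itself runs inside a container started by an NM.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register with the RM (hypothetical host, port, and tracking URL).
        rmClient.registerApplicationMaster("am-host.example.com", 0, "");

        // ... add container requests, call allocate(), launch tasks via the NMs ...

        // Unregister so the RM can de-allocate the application's containers.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "done", "");
        rmClient.stop();
      }
    }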

YARN Components - Containers
Containers
  A right to use a specific amount of resources on a specific machine in the cluster, created by the NodeManager upon request from the ResourceManager.
  Allocates a certain amount of resources (CPU, memory) from a worker node.
  Applications run in one or more containers.

YARN Cluster - Running a YARN Application

1. Client application request.
2. Response with a new Application ID.
3. Copy job resources to HDFS.
4. Submit application.
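A minimal sketch of steps 1-4 with the YARN Java client API; the application name is a placeholder, and staging of job resources to HDFS (step 3) is only indicated by a comment.

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class SubmitSketch {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Steps 1-2: ask the RM for a new application and receive its Application ID
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationId appId = app.getNewApplicationResponse().getApplicationId();
        System.out.println("Got application id " + appId);

        // Step 3 (not shown): copy job resources (jars, config) to HDFS so NMs can localize them

        // Step 4: fill in the submission context and submit
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("my-app");   // hypothetical name
        // ctx.setAMContainerSpec(...), ctx.setResource(...), ctx.setQueue(...) -- see the ASC & CLC slide
        yarnClient.submitApplication(ctx);
      }
    }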

ASC & CLC

ApplicationSubmissionContext
  Complete specification of the ApplicationMaster.
  Provided by the client.
  Contains details like Application ID, user, queue, etc.
  Specifies job files, security tokens, etc.
  Contains the ContainerLaunchContext.

ContainerLaunchContext
  The ApplicationMaster has to provide more information to the NodeManager to actually launch a container.
  Specifies the command line to launch the process within the container.
  Environment variables, local resources needed on the machine prior to launch, etc. (a sketch of both contexts follows).
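A sketch of how the two contexts might be filled in with the YARN records API; the launch command, application name, queue, and AM container size are assumptions made for illustration.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;

    public class ContextSketch {
      static void configure(YarnClientApplication app) {
        // ContainerLaunchContext: what the NM needs to actually start the AM container
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
            Collections.emptyMap(),   // local resources (job files already staged in HDFS)
            Collections.emptyMap(),   // environment variables
            Collections.singletonList(
                "java -Xmx512m com.example.MyAppMaster 1>stdout 2>stderr"),  // hypothetical command
            null,                     // service data
            null,                     // security tokens
            null);                    // ACLs

        // ApplicationSubmissionContext: what the RM needs to place and track the application
        ApplicationSubmissionContext asc = app.getApplicationSubmissionContext();
        asc.setApplicationName("my-app");                 // hypothetical
        asc.setQueue("default");
        asc.setPriority(Priority.newInstance(0));
        asc.setResource(Resource.newInstance(1024, 1));   // memory (MB) and vcores for the AM container
        asc.setAMContainerSpec(clc);
      }
    }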

ApplicationSubmissionContext (sent by the client to the ResourceManager)
  Application ID.
  Application user.
  Application name.
  Application priority.
  ContainerLaunchContext.

ContainerLaunchContext (sent to the NodeManager)
  Container ID.
  Resource allocated to the container.
  User to whom the container is allocated.
  Security tokens and local resources.
  Environment variables.
  Command to launch the container.

Resource Request & Containers

5. Start the client's ApplicationMaster and send a Resource Request.
6. Respond with Resource Capabilities.
7. Requested containers.
8. Assigned containers.

ResourceRequest
It has the following form:
  <resource-name, priority, resource-requirement, number-of-containers>
resource-name is either a hostname, a rack name, or * to indicate no preference. In the future, we expect to support even more complex topologies for virtual machines on a host, more complex networks, etc.
priority is the intra-application priority for this request (to stress: this is not across multiple applications).
resource-requirement is the required capabilities, such as memory, CPU, etc. (at the time of writing, YARN only supports memory and CPU).
number-of-containers is just a multiple of such containers.

Resource Request:
  Request priority.
  Name of the machine or rack (* to signify any machine or rack).
  Resource required for each request.
  Number of containers.
  A boolean "relax locality" flag.

Containers (assigned by the ResourceManager to the AM):
  Container IDs.
  Nodes.
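The ResourceRequest tuple maps onto AMRMClient.ContainerRequest in the Java client API. Below is a sketch requesting four 1 GB / 1 vcore containers anywhere in the cluster; the sizes, count, and class name are chosen only for illustration.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class RequestSketch {
      static void requestContainers(AMRMClient<ContainerRequest> rmClient) {
        Resource capability = Resource.newInstance(1024, 1);  // resource-requirement: 1 GB, 1 vcore
        Priority priority = Priority.newInstance(0);          // intra-application priority

        // resource-name: specific hosts/racks, or null/null for "*" (no preference);
        // number-of-containers: add one request per container wanted;
        // the final flag is the relax-locality boolean.
        for (int i = 0; i < 4; i++) {
          rmClient.addContainerRequest(
              new ContainerRequest(capability, null, null, priority, true));
        }
      }
    }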

AM & NM Communication
  Start containers by sending the CLC.
  Request container status.
  Status response.
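A sketch of this AM-to-NM exchange using the NMClient API: start a container by sending its CLC, then ask the same NM for the container's status. Error handling and the surrounding AM logic are omitted; the class name is an assumption.

    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.ContainerStatus;
    import org.apache.hadoop.yarn.client.api.NMClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class LaunchSketch {
      static void launch(Container container, ContainerLaunchContext clc) throws Exception {
        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(new YarnConfiguration());
        nmClient.start();

        // Start the container on its NM by sending the CLC
        nmClient.startContainer(container, clc);

        // Ask that NM for the container's current status
        ContainerStatus status =
            nmClient.getContainerStatus(container.getId(), container.getNodeId());
        System.out.println(status.getState());
      }
    }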

Progress & Status Updates
  Poll for status (client -> ApplicationMaster).
  Continuous heartbeat (ApplicationMaster -> ResourceManager).
  Status updates (containers -> NodeManager).
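One way a client can poll for progress is through the RM's application report (an AM may additionally expose its own status endpoint). A sketch with an assumed one-second polling interval:

    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.YarnApplicationState;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class PollSketch {
      static void waitForCompletion(YarnClient yarnClient, ApplicationId appId) throws Exception {
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        while (report.getYarnApplicationState() != YarnApplicationState.FINISHED
            && report.getYarnApplicationState() != YarnApplicationState.FAILED
            && report.getYarnApplicationState() != YarnApplicationState.KILLED) {
          System.out.println("progress=" + report.getProgress());
          Thread.sleep(1000);                                  // assumed polling interval
          report = yarnClient.getApplicationReport(appId);     // re-read the RM's view of the app
        }
      }
    }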

Elephants Can Remember

Job History Server (reachable from the ResourceManager Web UI)
  Look at key metrics for a MapReduce job.
  Understand the performance of each job.
  Optimize future job runs.

Containers Configurations

yarn.scheduler.minimum-allocation-mb (default 1024)
  Minimum memory allocation, in MB, for every container request at the RM.
yarn.scheduler.maximum-allocation-mb (default 8192)
  Maximum memory allocation, in MB, for every container request at the RM.
yarn.scheduler.minimum-allocation-vcores (default 1)
  Minimum number of virtual cores for every container request at the RM.
yarn.scheduler.maximum-allocation-vcores (default 32)
  Maximum number of virtual cores for every container request at the RM.

Containers Configurations

yarn.nodemanager.resource.memory-mb (default 8192)
  Amount of physical memory, in MB, that can be allocated for containers on a node.
yarn.nodemanager.resource.cpu-vcores (default 8)
  Number of vcores that can be allocated for containers on a node.
yarn.nodemanager.vmem-pmem-ratio (default 2.1)
  Ratio of virtual to physical memory for containers.
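A sketch of reading these settings programmatically; in practice they are defined in yarn-site.xml, and the fallback values below simply mirror the defaults listed above.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ConfigSketch {
      public static void main(String[] args) {
        YarnConfiguration conf = new YarnConfiguration();  // loads yarn-site.xml from the classpath

        int minMb   = conf.getInt("yarn.scheduler.minimum-allocation-mb", 1024);
        int maxMb   = conf.getInt("yarn.scheduler.maximum-allocation-mb", 8192);
        int nmMb    = conf.getInt("yarn.nodemanager.resource.memory-mb", 8192);
        float ratio = conf.getFloat("yarn.nodemanager.vmem-pmem-ratio", 2.1f);

        System.out.printf("container memory: %d-%d MB per request, %d MB per node, vmem ratio %.1f%n",
            minMb, maxMb, nmMb, ratio);
      }
    }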

Containers Configuration Recommendations

Total Memory per Node | Recommended Reserved System Memory | Recommended Reserved HBase Memory
4 GB    | 1 GB  | 1 GB
8 GB    | 2 GB  | 1 GB
16 GB   | 2 GB  | 2 GB
24 GB   | 4 GB  | 4 GB
48 GB   | 6 GB  | 8 GB
64 GB   | 8 GB  | 8 GB
72 GB   | 8 GB  | 8 GB
96 GB   | 12 GB | 16 GB
128 GB  | 24 GB | 24 GB
256 GB  | 32 GB | 32 GB
512 GB  | 64 GB | 64 GB

Containers Configuration Recommendations

Total RAM per Node     | Recommended Min. Container Size
< 4 GB                 | 256 MB
Between 4 GB and 8 GB  | 512 MB
Between 8 GB and 24 GB | 1024 MB
Above 24 GB            | 2048 MB
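A worked example of turning the two tables into node settings, assuming a 48 GB node that also hosts HBase; the numbers are illustrative, not mandated.

    public class SizingSketch {
      public static void main(String[] args) {
        // Example row from the tables above: a 48 GB node
        int totalGb = 48;
        int reservedSystemGb = 6;   // recommended reserved system memory
        int reservedHbaseGb = 8;    // recommended reserved HBase memory (if HBase runs on the node)

        int yarnGb = totalGb - reservedSystemGb - reservedHbaseGb;  // memory left for containers
        int minContainerMb = 2048;  // recommended minimum container size for nodes above 24 GB

        System.out.println("yarn.nodemanager.resource.memory-mb = " + (yarnGb * 1024));
        System.out.println("max containers per node (by memory) = " + (yarnGb * 1024) / minContainerMb);
      }
    }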

MapReduce on YARN

The ResourceManager starts an MRAppMaster in a container on one NodeManager; the MRAppMaster then runs the map and reduce tasks in further containers on the NodeManagers.

MapReduce Configurations

yarn-site.xml
  yarn.resourcemanager.hostname (default 0.0.0.0)
    The host name of the ResourceManager.
  yarn.nodemanager.aux-services
    The valid service name(s); should only contain a-z, A-Z, 0-9, and cannot start with a number.

mapred-site.xml
  mapreduce.framework.name (default local)
    The runtime framework for executing MapReduce jobs. Can be one of local, classic, or yarn.
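A sketch of pointing a MapReduce job at YARN from code; normally these keys live in mapred-site.xml and yarn-site.xml, and the RM host name here is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MrOnYarnSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally set in mapred-site.xml / yarn-site.xml; shown programmatically for illustration
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "rm-host.example.com");  // hypothetical host

        Job job = Job.getInstance(conf, "wordcount-on-yarn");
        // job.setJarByClass(...), mapper/reducer classes, input/output paths, then:
        // job.waitForCompletion(true);
      }
    }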

MapReduce Configurations

mapreduce.map.memory.mb (default 1024)
  Amount of memory to request from the scheduler for each map task.
mapreduce.map.cpu.vcores (default 1)
  The number of virtual cores to request from the scheduler for each map task.
mapreduce.reduce.memory.mb (default 1024)
  Amount of memory to request from the scheduler for each reduce task.
mapreduce.reduce.cpu.vcores (default 1)
  The number of virtual cores to request from the scheduler for each reduce task.
MapReduce Configurations

mapred.child.java.opts (default -Xmx200m)
  The JVM options used to launch the container process that runs map and reduce tasks.
mapreduce.map.java.opts (default -Xmx200m)
  The JVM options used for the child process that runs map tasks.
mapreduce.reduce.java.opts (default -Xmx200m)
  The JVM options used for the child process that runs reduce tasks.
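These settings interact: the JVM heap (-Xmx) must fit inside the task's container memory. A sketch using the common (but not mandated) rule of thumb of roughly 80% of the container size; the concrete values are assumptions.

    import org.apache.hadoop.conf.Configuration;

    public class TaskMemorySketch {
      public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Container size requested from the scheduler for each task
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);

        // JVM heap must fit inside the container; ~80% of the container size is a common rule of thumb
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
      }
    }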

Shuffle Services
  Required for parallel MapReduce job operation.
  Reducers fetch the output of all the maps by shuffling map output data from the nodes where the map tasks ran.
  Implemented as an auxiliary service in the NodeManager.
  The NodeManager starts a Netty web server in its address space which knows how to handle MapReduce-specific shuffle requests.
  Hadoop 2.0 provides an Encrypted Shuffle option: HTTPS with optional client authentication.
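Enabling the shuffle auxiliary service is a yarn-site.xml concern on every NodeManager; the sketch below only shows the property keys involved, set programmatically for illustration.

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ShuffleConfigSketch {
      public static void main(String[] args) {
        // These properties normally live in yarn-site.xml on every NodeManager;
        // they are shown here only to illustrate the keys involved.
        YarnConfiguration conf = new YarnConfiguration();
        conf.set("yarn.nodemanager.aux-services", "mapreduce_shuffle");
        conf.set("yarn.nodemanager.aux-services.mapreduce_shuffle.class",
                 "org.apache.hadoop.mapred.ShuffleHandler");
      }
    }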
