
Book Report 
Apache Hadoop  
 

Harish 
17th December, 2019 
 

   

 
 

Introduction 
1. Title 
2. Author 
3. Illustrator 
4. Fiction or Nonfiction 
5. Why did you choose this book? 

Reading Rainbow Tip: Was the title interesting? Did the cover spark your curiosity? Was it 
something else? Talk about why you chose the book to help your classmates understand more 
about you! 

   


 

Hadoop Distributed File System (HDFS)

● HDFS  is  a  filesystem  designed  for  storing  very  large  files  with  streaming  data  access 
patterns, running on clusters of commodity hardware 

● HDFS  is  built  around  the  idea  that  the  most  efficient  data  processing  pattern  is  a 
write-once,  read-many-times  pattern.  A  dataset  is  typically  generated  or  copied  from 
source, then various analyses are performed on that dataset over time 

● Hadoop  doesn’t  require  expensive,  highly  reliable  hardware  to  run  on.  It  is  designed  to 
run  on  clusters  of  commodity  hardware  for  which  the  chance  of  node  failure  across  the 
cluster  is  high,  at  least  for  large  clusters.  HDFS  is  designed  to carry on working without a 
noticeable interruption to the user in the face of such failure. 

Blocks

● A disk has a block size, which is the minimum amount of data that it can read or write. 

● Filesystem  blocks  are  typically  a  few  kilobytes  in  size,  while  disk  blocks  are normally 512 
bytes. 

● HDFS, too, has the concept of a block, but it is a much larger unit—64 MB by default in older releases (128 MB by default in recent versions). 

● Why Is a Block in HDFS So Large? - HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate. 
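
● For example, assuming a seek time of around 10 ms and a transfer rate of 100 MB/s (typical commodity-disk figures, used here only for illustration), a block of roughly 100 MB keeps the seek overhead near 1% of the transfer time: about 10 ms to seek versus 100 MB / 100 MB/s = 1 s to transfer. 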

● To insure against corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level. 


 

● Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. Every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing, and when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. 

*Network latency is an expression of how much time it takes for a packet of data to get from one
designated point to another. In some environments (for example, AT&T), latency is measured
by sending a packet that is returned to the sender; the round-trip time is considered the latency.

*Seek time - the time taken for a disk drive to locate the area on the disk where the data to be
read is stored.

● It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command. This feature is useful if you have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might want to see whether it can be salvaged before you delete it. 
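
A minimal sketch of reading a possibly corrupt file with checksum verification disabled (the class name and path argument are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Sketch: dump a possibly corrupt HDFS file to stdout without CRC verification.
public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.setVerifyChecksum(false);               // must be called before open()
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}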

% hadoop fsck /

Example of checking the whole filesystem for a small cluster

% hadoop fsck /user/tom/part-00007 -files -blocks -racks

/user/tom/part-00007 25582428 bytes, 1 block(s): OK


0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/10.251.43.2:50010,
/default-rack/10.251.27.178:50010, /default-rack/10.251.123.163:50010]

This says that the file /user/tom/part-00007 is made up of one block and shows the
datanodes where the block is located. The fsck options used are as follows:

The -files option shows the line with the filename, size, number of blocks, and its health (whether there
are any missing blocks).
The -blocks option shows information about each block in the file, one line per block.


 

The -racks option displays the rack location and the datanode addresses for each block.

Namenodes and Datanodes

An  HDFS  cluster  has  two  types  of  node  operating  in  a  master-worker  pattern:  a  namenode  (the 
master)  and  a  number  of  datanodes  (workers).  The  namenode  manages  the  filesystem 
namespace.  It  maintains  the  filesystem  tree  and  the  metadata  for  all  the  files  and  directories  in 
the  tree.  This  information  is  stored  persistently  on  the  local  disk  in  the  form  of  two  files:  the 
namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts. 

Datanodes  are  the  workhorses  of  the  filesystem.  They  store  and  retrieve  blocks  when  they  are 
told  to  (by  clients  or  the  namenode), and they report back to the namenode periodically with lists 
of blocks that they are storing. 


 

Without  the  namenode,  the  filesystem  cannot  be  used.  In  fact,  if  the  machine  running  the 
namenode  were  obliterated,  all  the  files  on  the  filesystem  would be lost since there would be no 
way  of  knowing  how  to  reconstruct  the  files  from  the blocks on the datanodes. For this reason, it 
is  important  to  make the namenode resilient to failure, and Hadoop provides two mechanisms for 
this. 

The  first  way  is  to  back  up  the  files  that  make  up  the persistent state of the filesystem metadata. 
Hadoop  can  be  configured  so  that  the  namenode  writes  its  persistent  state  to  multiple 
filesystems.  These  writes  are  synchronous  and  atomic.  The  usual  configuration choice is to write 
to local disk as well as a remote NFS mount. 

It  is  also  possible  to  run  a  secondary  namenode,  which  despite  its  name  does  not  act  as  a 
namenode. Its main role is to periodically merge the namespace image with the edit log to 
prevent  the  edit  log  from  becoming  too  large.  The  secondary  namenode  usually  runs  on  a 
separate  physical  machine,  since  it  requires  plenty  of  CPU  and  as  much  memory  as  the 
namenode  to perform the merge. It keeps a copy of the merged namespace image, which can be 
used  in  the  event  of  the  namenode  failing.  However,  the  state  of  the  secondary  namenode lags 
that  of  the  primary,  so  in  the  event  of  total  failure  of  the  primary,  data  loss  is  almost certain. The 
usual  course  of  action  in  this  case  is  to  copy  the  namenode’s  metadata  files  that  are  on  NFS  to 
the secondary and run it as the new primary. 

FSimage and edit log 

When  a  filesystem  client  performs  a  write  operation  (such  as  creating  or  moving  a  file),  it  is  first 
recorded  in  the  edit  log.  The  namenode  also  has  an  in-memory  representation  of  the  filesystem 
metadata,  which  it updates after the edit log has been modified. The in-memory metadata is used 
to serve read requests. 

The  edit  log  is  flushed  and  synced  after  every  write  before  a  success  code  is  returned  to  the 
client.  For  namenodes  that  write  to  multiple  directories,  the  write  must  be flushed and synced to 
every  copy  before  returning  successfully.  This  ensures  that  no  operation  is  lost  due  to  machine 
failure. 


 

The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is not updated for every filesystem write operation, since writing out the fsimage file, which can grow to be gigabytes in size, would be very slow. This does not compromise resilience, however, because if the namenode fails, then the latest state of its metadata can be reconstructed by loading the fsimage from disk into memory, then applying each of the operations in the edit log. 

As described, the edits file would grow without bound. Though this state of affairs would have no impact on the system while the namenode is running, if the namenode were restarted, it would take a long time to apply each of the operations in its (very long) edit log. During this time, the filesystem would be offline, which is generally undesirable. 

The solution is to run the secondary namenode, whose purpose is to produce checkpoints of the primary's in-memory filesystem metadata. The checkpointing process proceeds as follows: 

1. The secondary asks the primary to roll its edits file, so new edits go to a new file. 

2. The secondary retrieves fsimage and edits from the primary (using HTTP GET). 

3.  The  secondary  loads  fsimage  into  memory,  applies  each  operation  from  edits,  then  creates  a 
new consolidated fsimage file. 

4. The secondary sends the new fsimage back to the primary (using HTTP POST). 

5.  The  primary  replaces  the  old  fsimage  with  the  new  one  from  the  secondary, and the old edits 
file  with  the  new  one  it  started  in  step  1.  It  also  updates  the  fstime file to record the time that the 
checkpoint was taken. 

HDFS Federation 

The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling. HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files rooted under /user, say, and a second namenode might handle files under /share. 


 

Under  federation,  each  namenode  manages  a  namespace  volume,  which  is  made  up  of  the 
metadata  for  the  namespace,  and  a  block  pool  containing  all  the  blocks  for  the  files  in  the 
namespace.  Namespace  volumes  are  independent  of  each  other,  which  means  namenodes  do 
not  communicate  with  one  another, and furthermore the failure of one namenode does not affect 
the  availability  of  the  namespaces  managed  by  other  namenodes.  Block  pool  storage  is  not 
partitioned,  however,  so  datanodes  register  with  each  namenode  in  the  cluster and store blocks 
from multiple block pools. 

HDFS High-Availability 

The  combination  of  replicating  namenode  metadata  on  multiple  filesystems,  and  using  the 
secondary  namenode  to  create  checkpoints  protects  against  data  loss,  but  does  not  provide 
high-availability  of  the  filesystem.  The  namenode  is  still  a  single  point  of  failure (SPOF), since if it 
did  fail,  all  clients—including  MapReduce  jobs—would  be  unable  to  read,  write,  or  list  files, 
because  the  namenode  is  the  sole  repository  of  the  metadata  and  the  file-to-block  mapping.  In 
such  an  event  the  whole  Hadoop  system  would  effectively  be  out  of  service  until  a  new 
namenode  could  be  brought  online.  To  recover  from  a  failed  namenode  in  this  situation,  an 
administrator  starts  a  new  primary  namenode  with  one  of  the  filesystem  metadata  replicas,  and 
configures  datanodes  and  clients  to  use  this  new  namenode.  The  new  namenode  is  not  able  to 
serve  requests  until  it  has  i)  loaded  its  namespace  image  into  memory,  ii)  replayed  its  edit  log, 
and  iii)  received  enough  block  reports  from  the  datanodes  to leave safe mode. On large clusters 
with  many  files  and  blocks,  the  time it takes for a namenode to start from cold can be 30 minutes 
or more. 

The  long recovery time is a problem for routine maintenance too. In fact, since unexpected failure 
of  the  namenode  is so rare, the case for planned downtime is actually more important in practice. 
The 0.23 release series of Hadoop remedies this situation by adding support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. 

A few architectural changes are needed to allow this to happen: 


• The namenodes must use highly available shared storage to share the edit log. (In the initial implementation of HA this will require an NFS filer, but in future releases more options will be provided, such as a BookKeeper-based system built on ZooKeeper.) When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode. 
•  Datanodes  must  send  block  reports to both namenodes since the block mappings are stored in 
a namenode’s memory, and not on disk. 
•  Clients  must  be  configured  to  handle  namenode  failover,  which  uses  a  mechanism  that  is 
transparent to users. 
 
If  the  active  namenode  fails,  then  the  standby  can  take  over  very  quickly  (in  a  few  tens  of 
seconds)  since  it  has  the  latest  state  available  in  memory:  both  the latest edit log entries, and an 
up-to-date block mapping. 

Failover and fencing

The  transition  from  the  active  namenode  to  the  standby  is  managed  by  a  new  entity  in 
the  system  called  the  failover  controller.  There  are  various  failover  controllers,  but  the 
default  implementation  uses  ZooKeeper  to  ensure  that  only  one  namenode  is  active. 
Each  namenode  runs  a  lightweight  failover  controller  process  whose  job  it  is  to  monitor 
its  namenode  for  failures  (using  a  simple  heartbeating  mechanism)  and  trigger  a  failover 
should a namenode fail. 

Failover  may  also  be  initiated  manually  by  an  administrator,  for  example,  in  the  case  of 
routine  maintenance.  This  is  known  as  a  graceful  failover,  since  the  failover  controller 
arranges an orderly transition for both namenodes to switch roles. 

In the case of an ungraceful failover, however, it is impossible to be sure that the failed namenode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active namenode is still running and thinks it is still the active namenode. The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption—a method known as fencing.

What is Quorum journal manager?


The Quorum Journal Manager (QJM) allows the Active and Standby NameNodes to share edit logs.

Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes.

What are JournalNodes?

In order for the Standby node to keep its state coordinated with the Active node, both nodes communicate with a group of separate daemons called 'JournalNodes' (JNs). When any namespace modification is performed by the Active node, it logs a record of the changes made in the JournalNodes. The Standby node is capable of reading the amended information from the JNs, and is regularly monitoring them for changes. As the Standby node sees the changes, it then applies them to its own namespace. In case of a failover, the Standby will make sure that it has read all the changes from the JournalNodes before changing its state to 'Active'. This guarantees that the namespace state is fully synced before a failover occurs. 

To provide a fast failover, it is essential that the Standby node has up-to-date information regarding the location of blocks in the cluster. For this to happen, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both. 

It is essential that only one of the NameNodes be Active at a time. Otherwise, the namespace state would deviate between the two and lead to data loss or erroneous results. In order to avoid this, the JournalNodes will only permit a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will take over the responsibility of writing to the JournalNodes.

Epoch numbers and their use 

The journal manager uses epoch numbers. Epoch numbers are integers which always increase and have a unique value once assigned. The Namenode generates an epoch number using a simple algorithm and uses it while sending RPC requests to the QJM. When you configure Namenode HA, the first Active Namenode will get epoch value 1. In case of a failover or restart, the epoch number is increased. The Namenode with the higher epoch number is considered newer than any Namenode with an earlier epoch number. 

Now suppose both Namenodes think that they are active and send write requests to the quorum journal manager with their epoch numbers. How does the QJM handle this situation?

The quorum journal manager stores the epoch number locally; this is called the promised epoch. Whenever a JournalNode receives an RPC request along with an epoch number from a Namenode, it compares the epoch number with the promised epoch. If the request is coming from a newer node, which means the epoch number is greater than the promised epoch, then it records the new epoch number as the promised epoch. If the request is coming from a Namenode with an older epoch number, then the QJM simply rejects the request. 
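
The rule the JournalNodes apply can be sketched roughly as follows (an illustration of the logic described above, not the actual QJM code):

import java.io.IOException;

// Illustrative sketch of the promised-epoch check described above.
class PromisedEpochCheck {
    private long promisedEpoch = 0;   // highest epoch accepted so far

    synchronized void accept(long requestEpoch) throws IOException {
        if (requestEpoch < promisedEpoch) {
            // Request comes from a NameNode with an older epoch: reject it.
            throw new IOException("Epoch " + requestEpoch
                    + " is older than promised epoch " + promisedEpoch);
        }
        // Request is from the newest NameNode: record its epoch as the new promised epoch.
        promisedEpoch = requestEpoch;
    }
}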


MAP REDUCE PARADIGM

MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages. MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. 

A map reduce example

Datafiles are organized by date and weather station. There is a directory for each year from 1901 
to 2001, each containing a gzipped file for each weather station with its readings for that year. For 
example, here are the first entries for 1990: 

There are tens of thousands of weather stations, so the whole dataset is made up of a large number of relatively small files. Below is the format of the data. 
 


Analyzing the Data with Hadoop

To  take  advantage  of  the  parallel  processing  that  Hadoop  provides,  we  need  to  express  our 
query  as  a  MapReduce  job.  After  some  local,  small-scale  testing,  we  will  be  able  to  run  it  on  a 
cluster of machines. 

MAP AND REDUCE

MapReduce  works  by  breaking  the  processing  into  two  phases:  the  map  phase  and  the  reduce 
phase.  Each  phase  has key-value pairs as input and output, the types of which may be chosen by 
the  programmer.  The  programmer  also  specifies  two  functions:  the  map function and the reduce 
function. 

The  input  to  our  map  phase  is  the  raw  NCDC  data.  We  choose  a  text  input  format  that  gives  us 
each  line  in  the  dataset  as  a  text  value. The key is the offset of the beginning of the line from the 
beginning of the file, but as we have no need for this, we ignore it. 

Our  map  function  is  simple.  We  pull  out  the  year  and  the air temperature, because these are the 
only  fields  we  are  interested  in.  In  this  case,  the  map  function  is  just  a  data  preparation  phase, 
setting  up  the  data  in  such  a  way  that  the  reduce  function  can  do  its  work  on  it:  finding  the 
maximum  temperature  for  each  year.  The  map function is also a good place to drop bad records: 
here we filter out temperatures that are missing, suspect, or erroneous. 
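
A mapper along these lines might look like the following sketch (the class name and the fixed-width field offsets assume the NCDC record layout; they are not given in this report):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a max-temperature mapper: emit (year, temperature) for valid readings.
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;   // sentinel for a missing reading (assumed)

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);                 // year field (assumed offsets)
        int airTemperature;
        if (line.charAt(87) == '+') {                         // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Drop readings that are missing, suspect, or erroneous, as described above.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}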

To  visualize  the  way  the  map  works,  consider  the  following  sample  lines  of  input  data  (some 
unused columns have been dropped to fit the page, indicated by ellipses): 

These lines are presented to the map function as the key-value pairs: 

The  keys  are  the  line  offsets  within  the  file,  which  we  ignore  in  our  map  function.  The  map 
function  merely  extracts  the  year  and  the  air  temperature (indicated in bold text), and emits them 
as its output (the temperature values have been interpreted as integers): 


The  output  from  the  map  function  is  processed  by  the MapReduce framework before being sent 
to  the  reduce  function.  This  processing  sorts  and  groups  the  key-value  pairs  by  key.  So, 
continuing the example, our reduce function sees the following input: 

Each  year  appears  with  a  list  of  all  its  air  temperature  readings. All the reduce function has to do 
now is iterate through the list and pick up the maximum reading: 
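
A matching reducer can be sketched as:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a max-temperature reducer: keep the largest reading seen for each year.
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}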

This is the final output: the maximum global temperature recorded in each year.


DATA FLOW
First,  some  terminology.  A  MapReduce job is a unit of work that the client wants to be performed: 
it  consists  of  the  input  data,  the  MapReduce  program,  and  configuration  information.  Hadoop 
runs  the  job  by  dividing  it  into  tasks,  of  which  there  are  two  types:  map  tasks  and  reduce  tasks. 
The  tasks  are  scheduled  using  YARN  and  run  on  nodes  in  the  cluster.  If  a  task  fails,  it  will  be 
automatically rescheduled to run on a different node. 

Hadoop  divides  the  input  to  a  MapReduce  job  into  fixed-size  pieces  called  input  splits,  or  just 
splits.  Hadoop  creates  one  map  task  for  each  split,  which  runs the user-defined map function for 
each record in the split. 

Having  many  splits  means  the  time  taken  to  process  each  split  is  small  compared  to  the  time  to 
process  the  whole  input.  So  if  we  are  processing  the  splits  in  parallel,  the  processing  is  better 
load  balanced  when  the  splits  are  small,  since  a  faster  machine  will  be  able  to  process 
proportionally  more  splits  over  the course of the job than a slower machine. Even if the machines 
are  identical,  failed  processes  or  other  jobs  running  concurrently make load balancing desirable, 
and the quality of the load balancing increases as the splits become more fine grained. 

On  the  other  hand,  if  splits  are  too  small,  the  overhead  of  managing  the  splits  and  map  task 
creation  begins  to  dominate  the total job execution time. For most jobs, a good split size tends to 
be  the  size  of  an  HDFS  block,  which  is  128  MB  by  default,  although  this  can  be  changed  for the 
cluster (for all newly created files) or specified when each file is created. 

Hadoop  does  its  best  to  run  the  map  task  on  a  node  where  the  input  data  resides  in  HDFS, 
because  it  doesn’t  use  valuable  cluster  bandwidth.  This  is  called  the  data  locality  optimization. 
Sometimes,  however,  all  the  nodes  hosting  the  HDFS  block  replicas  for  a  map  task’s  input  split 
are  running  other  map  tasks,  so  the  job  scheduler  will  look  for  a  free  map  slot  on  a  node  in  the 
same  rack  as one of the blocks. Very occasionally even this is not possible, so an off-rack node is 
used,  which  results  in  an  inter-rack  network  transfer.  The  three  possibilities  are  illustrated  in 
Figure 2-2​. 

It  should now be clear why the optimal split size is the same as the block size: it is the largest size 
of  input  that  can  be  guaranteed  to  be  stored  on  a  single  node.  If the split spanned two blocks, it 
would  be  unlikely  that  any  HDFS  node  stored  both blocks, so some of the split would have to be 
transferred  across  the  network  to  the  node  running  the  map  task,  which  is  clearly  less  efficient 
than running the whole map task using local data. 

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So, storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output. 

Reduce  tasks  don’t  have  the  advantage  of  data  locality;  the  input  to  a  single  reduce  task  is 
normally  the  output  from  all  mappers.  In  the  present  example,  we  have  a single reduce task that 
is  fed  by  all  of  the  map  tasks.  Therefore,  the  sorted  map  outputs  have  to  be  transferred  across 
the  network  to  the  node  where  the  reduce  task  is  running,  where  they  are  merged  and  then 
passed  to  the  user-defined  reduce  function.  The  output of the reduce is normally stored in HDFS 
for  reliability.  As  explained  in  ​Chapter  3​,  for  each  HDFS  block  of  the  reduce  output,  the  first 
replica  is  stored  on  the  local  node,  with  other  replicas  being  stored  on  off-rack  nodes  for 
reliability.  Thus,  writing  the  reduce  output  does  consume  network  bandwidth,  but  only  as  much 
as a normal HDFS write pipeline consumes. 

The  whole  data  flow  with  a  single  reduce  task  is  illustrated  in  ​Figure  2-3​.  The  dotted  boxes 
indicate  nodes,  the  dotted  arrows show data transfers on a node, and the solid arrows show data 
transfers between nodes. 


Figure 2-3. MapReduce data flow with a single reduce task

The  number  of  reduce  tasks  is  not  governed  by  the  size  of  the  input,  but  instead  is  specified 
independently.  In  ​The Default MapReduce Job​, you will see how to choose the number of reduce 
tasks for a given job. 

When  there  are  multiple  reducers,  the  map  tasks  partition  their  output,  each  creating  one 
partition  for  each  reduce  task.  There  can  be  many  keys  (and  their  associated  values)  in  each 
partition,  but  the  records  for  any  given  key  are  all  in  a  single  partition.  The  partitioning  can  be 
controlled  by  a  user-defined  partitioning  function,  but  normally  the  default  partitioner—which 
buckets keys using a hash function—works very well. 
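
Conceptually, the default behaviour amounts to a partitioner like the sketch below (the class name is illustrative; it simply hash-buckets keys, with types matching the word count example later in this report):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning: records with the same key always land in the same partition.
public class HashBucketPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the modulo result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}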

The  data  flow  for  the  general  case  of  multiple  reduce  tasks  is  illustrated  in  ​Figure  2-4​.  This 
diagram  makes  it clear why the data flow between map and reduce tasks is colloquially known as 
“the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than 
this  diagram  suggests,  and  tuning  it  can  have  a  big  impact on job execution time, as you will see 
in ​Shuffle and Sort​. 


Figure 2-4. MapReduce data flow with multiple reduce tasks

Finally, it's also possible to have zero reduce tasks. This can be appropriate when you don't need the shuffle because the processing can be carried out entirely in parallel (a few examples are discussed in NLineInputFormat). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-5). 

COMBINER FUNCTIONS 

Many  MapReduce  jobs  are  limited  by  the  bandwidth  available  on  the  cluster,  so  it  pays  to 
minimize  the  data  transferred  between  map  and reduce tasks. Hadoop allows the user to specify 
a  combiner  function  to  be  run  on  the  map  output,  and  the  combiner  function’s  output  forms  the 
input  to  the  reduce  function.  Because the combiner function is an optimization, Hadoop does not 
provide  a  guarantee  of  how  many  times  it  will  call  it  for a particular map output record, if at all. In 
other  words,  calling  the  combiner  function  zero,  one,  or  many  times  should  produce  the  same 
output from the reducer. 


Figure 2-5. MapReduce data flow with no reduce tasks

The  contract  for  the  combiner  function  constrains  the  type  of  function  that  may  be  used.  This  is 
best  illustrated  with  an  example.  Suppose  that  for  the  maximum  temperature  example,  readings 
for  the  year  1950  were  processed  by  two  maps  (because  they  were  in  different  splits).  Imagine 
the first map produced the output: 

and the second produced:


The reduce function would be called with a list of all the values:

with output:

since  25  is  the  maximum  value  in  the  list.  We  could  use  a  combiner  function  that,  just  like  the 
reduce function, finds the maximum temperature for each map output. The reduce function would 
then be called with: 

and  would  produce  the  same  output  as  before.  More  succinctly,  we  may  express  the  function 
calls on the temperature values in this case as follows: 

Not all functions possess this property.[20] For example, if we were calculating mean temperatures, we could not use the mean as our combiner function, because the mean of per-map means is not, in general, equal to the mean of all the readings. 
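
To make the distinction concrete with some illustrative readings: max(0, 20, 10, 25, 15) = 25, and max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25, so the maximum can be taken per map and again at the reducer. By contrast, mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15, so the mean of per-map means is not the overall mean. 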


The combiner function doesn't replace the reduce function. (How could it? The reduce function is still needed to process records with the same key from different maps.) But it can help cut down the amount of data shuffled between the mappers and the reducers, and for this reason alone it is always worth considering whether you can use a combiner function in your MapReduce job. 


MAP REDUCE PROGRAMMING


Word Count

package solution;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

/*
* MapReduce jobs are typically implemented by using a driver class.
* The purpose of a driver class is to set up the configuration for the
* MapReduce job and to run the job.
* Typical requirements for a driver class include configuring the input
* and output data formats, configuring the map and reduce classes,
* and specifying intermediate data formats.
*
* The following is the code for the driver class:
*/
public class WordCount {

public static void main(String[] args) throws Exception {

/*
* The expected command-line arguments are the paths containing
* input and output data. Terminate the job if the number of
* command-line arguments is not exactly 2.
*/
if (args.length != 2) {
System.out.printf(
"Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}

/*
* Instantiate a Job object for your job's configuration.
*/
Job job = new Job();

/*
* Specify the jar file that contains your driver, mapper, and reducer.
* Hadoop will transfer this jar file to nodes in your cluster running
* mapper and reducer tasks.
*/
job.setJarByClass(WordCount.class);

/*
* Specify an easily-decipherable name for the job.
* This job name will appear in reports and logs.
*/
job.setJobName("Word Count");

/*
* Specify the paths to the input and output data based on the
* command-line arguments.
*/
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

/*
* Specify the mapper and reducer classes.
*/
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);

/*
* For the word count application, the input file and output
* files are in text format - the default format.
*
* In text format files, each record is a line delineated by a
* line terminator.
*
* When you use other input formats, you must call the
* setInputFormatClass method. When you use other
* output formats, you must call the setOutputFormatClass method.
*/

/*
* For the word count application, the mapper's output keys and
* values have the same data types as the reducer's output keys
* and values: Text and IntWritable.
*
* When they are not the same data types, you must call the
* setMapOutputKeyClass and setMapOutputValueClass
* methods.
*/

/*
* Specify the job's output key and value classes.
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(2);

/*
* Start the MapReduce job and wait for it to finish.
* If it finishes successfully, return 0. If not, return 1.
*/
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
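
Packaged into a JAR, a driver like this is typically launched with the hadoop jar command, for example hadoop jar wordcount.jar solution.WordCount <input dir> <output dir> (the JAR file name here is illustrative).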


package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
* To define a map function for your MapReduce job, subclass
* the Mapper class and override the map method.
* The class definition requires four parameters:
* The data type of the input key
* The data type of the input value
* The data type of the output key (which is the input key type
* for the reducer)
* The data type of the output value (which is the input value
* type for the reducer)
*/

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

/*
* The map method runs once for each line of text in the input file.
* The method receives a key of type LongWritable, a value of type
* Text, and a Context object.
*/
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {

/*
* Convert the line, which is received as a Text object,
* to a String object.
*/
String line = value.toString();

/*
* The line.split("\\W+") call uses regular expressions to split the
* line up by non-word characters.
*
* If you are not familiar with the use of regular expressions in
* Java code, search the web for "Java Regex Tutorial."
*/
for (String word : line.split("\\W+")) {
if (word.length() > 0) {

/*
* Call the write method on the Context object to emit a key
* and a value from the map method.
*/
context.write(new Text(word), new IntWritable(1));
}
}
}
}
package solution;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
* To define a reduce function for your MapReduce job, subclass
* the Reducer class and override the reduce method.
* The class definition requires four parameters:
* The data type of the input key (which is the output key type
* from the mapper)
* The data type of the input value (which is the output value
* type from the mapper)
* The data type of the output key
* The data type of the output value
*/
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

/*
* The reduce method runs once for each key received from
* the shuffle and sort phase of the MapReduce framework.
* The method receives a key of type Text, a set of values of type
* IntWritable, and a Context object.
*/
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;

/*
* For each value in the set of values passed to us by the mapper:
*/
for (IntWritable value : values) {

/*
* Add the value to the word count counter for this key.
*/
wordCount += value.get();
}

/*
* Call the write method on the Context object to emit a key
* and a value from the reduce method.
*/
context.write(key, new IntWritable(wordCount));
}
}


Anatomy of a MapReduce Job Run 

You can run a MapReduce job with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn't been submitted already, then waits for it to finish).[51] This method call conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job. 

The whole process is illustrated in Figure 7-1. At the highest level, there are five independent entities:[52] 

● The client, which submits the MapReduce job. 

● The  YARN  resource  manager,  which  coordinates  the  allocation  of  compute  resources on 
the cluster. 

● The  YARN  node  managers,  which  launch  and  monitor  the  compute  containers  on 
machines in the cluster. 

● The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers. 

● The distributed filesystem, which is used for sharing job files between the other entities. 


Fig 7.1 

JOB SUBMISSION
The  submit()  method  on  Job  creates  an  internal  JobSubmitter  instance  and  calls 
submitJobInternal()  on  it  (step  1 in ​Figure 7-1​). Having submitted the job, waitForCompletion() polls 
the  job’s  progress  once  per  second  and  reports  the  progress  to  the  console  if  it  has  changed 
since  the  last  report.  When  the  job  completes  successfully,  the  job  counters  are  displayed. 
Otherwise, the error that caused the job to fail is logged to the console. 

The job submission process implemented by JobSubmitter does the following: 

● Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2). 

● Checks  the  output  specification  of  the  job.  For  example,  if  the  output  directory  has  not 
been  specified  or  it  already  exists,  the  job  is  not  submitted  and  an  error  is  thrown  to  the 
MapReduce program. 


● Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program. 

● Copies  the  resources  needed  to  run  the  job,  including  the  job  JAR  file,  the  configuration 
file,  and  the  computed  input  splits,  to the shared filesystem in a directory named after the 
job  ID  (step  3).  The  job  JAR  is  copied  with  a  high  replication  factor  (controlled  by  the 
mapreduce.client.submit.file.replication  property,  which  defaults  to  10)  so  that  there  are 
lots  of  copies  across  the  cluster  for the node managers to access when they run tasks for 
the job. 

● Submits the job by calling submitApplication() on the resource manager (step 4). 

JOB INITIALIZATION
When  the  resource  manager  receives  a  call  to  its  submitApplication()  method,  it  hands  off  the 
request  to  the  YARN  scheduler.  The  scheduler  allocates  a  container,  and  the  resource manager 
then  launches  the  application  master’s  process  there,  under  the  node  manager’s  management 
(steps 5a and 5b). 

The  application  master  for  MapReduce  jobs  is  a  Java  application  whose  main  class  is 
MRAppMaster.  It  initializes  the  job  by  creating  a number of bookkeeping objects to keep track of 
the job’s progress, as it will receive progress and completion reports from the tasks (step 6). Next, 
it  retrieves  the  input  splits  computed  in  the  client  from  the  shared  filesystem  (step  7).  It  then 
creates  a  map  task  object  for  each  split,  as  well  as  a  number  of  reduce  task objects determined 
by  the  mapreduce.job.reduces  property (set by the setNumReduceTasks() method on Job). Tasks 
are given IDs at this point. 

The application master must decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. Such a job is said to be uberized, or run as an uber task. 

What qualifies as a small job? By default, a small job is one that has less than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block. (Note that these values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) Uber tasks must be enabled explicitly (for an individual job, or across the cluster) by setting mapreduce.job.ubertask.enable to true. 

Finally,  before  any  tasks  can  be  run,  the  application  master  calls  the  setupJob()  method  on  the 
OutputCommitter.  For  FileOutputCommitter,  which  is  the  default,  it  will  create  the  final  output 
directory  for  the  job and the temporary working space for the task output. The commit protocol is 
described in more detail in ​Output Committers​. 

TASK ASSIGNMENT 

If  the  job  does  not  qualify  for  running  as  an  uber  task,  then  the  application  master  requests 
containers  for  all  the  map  and  reduce  tasks  in  the  job  from  the  resource  manager  (step  8). 
Requests for map tasks are made first and with a higher priority than those for reduce tasks, since 
all  the  map  tasks  must  complete  before  the  sort  phase  of  the  reduce  can  start  (see  ​Shuffle  and 
Sort​).  Requests for reduce tasks are not made until 5% of map tasks have completed (see ​Reduce 
slow start​). 

Reduce  tasks  can  run  anywhere  in  the  cluster,  but  requests  for  map  tasks  have  data  locality 
constraints  that  the  scheduler  tries  to  honor  (see  ​Resource  Requests​).  In  the  optimal  case,  the 
task  is  data  local—that  is,  running  on  the  same  node  that  the  split  resides  on.  Alternatively,  the 
task  may  be  rack  local:  on  the  same  rack,  but  not  the  same  node,  as  the  split.  Some  tasks  are 
neither  data  local  nor rack local and retrieve their data from a different rack than the one they are 
running  on.  For  a  particular  job  run,  you  can  determine  the  number  of  tasks  that  ran  at  each 
locality level by looking at the job’s counters (see ​Table 9-6​). 

Requests  also  specify  memory  requirements  and  CPUs  for  tasks.  By  default,  each  map  and 
reduce  task  is  allocated  1,024  MB  of  memory  and  one  virtual  core.  The  values  are  configurable 
on  a  per-job  basis  (subject  to  minimum  and  maximum  values  described  in  ​Memory  settings  in 
YARN  and  MapReduce​)  via  the  following  properties:  mapreduce.map.memory.mb, 
mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores. 
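
As a sketch, these defaults could be overridden for a particular job through its configuration (the class name and values below are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: request more memory and vcores for the tasks of one job (illustrative values).
public class MemoryConfigExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);      // 2 GB per map task
        conf.setInt("mapreduce.reduce.memory.mb", 3072);   // 3 GB per reduce task
        conf.setInt("mapreduce.map.cpu.vcores", 2);        // 2 vcores per map task
        Job job = Job.getInstance(conf, "memory demo");
        // ... set mapper, reducer, input and output paths, then submit the job
    }
}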

TASK EXECUTION 

Once  a  task  has  been  assigned  resources  for  a  container  on  a  particular  node  by  the  resource 
manager’s  scheduler, the application master starts the container by contacting the node manager 
(steps  9a  and  9b).  The  task  is  executed  by  a  Java  application  whose  main  class  is  YarnChild. 
Before  it  can  run  the  task,  it  localizes  the  resources  that  the  task  needs,  including  the  job 
configuration  and  JAR  file,  and  any  files  from  the  distributed  cache  (step  10;  see  ​Distributed 
Cache​). Finally, it runs the map or reduce task (step 11). 


The  YarnChild  runs  in  a  dedicated  JVM,  so  that  any  bugs  in  the  user-defined  map  and  reduce 
functions (or even in YarnChild) don’t affect the node manager—by causing it to crash or hang, for 
example. 

Each task can perform setup and commit actions, which are run in the same JVM as the task itself and are determined by the OutputCommitter for the job (see Output Committers). For file-based jobs, the commit action moves the task output from a temporary location to its final location. The commit protocol ensures that when speculative execution is enabled (see Speculative Execution), only one of the duplicate tasks is committed and the other is aborted. 


Map Reduce 2 or YARN (Yet Another Resource Negotiator):

Resource Manager:

● Master daemon
● Consists of the scheduler and the App manager
● 1 RM per cluster
● Only takes care of CPU/memory management
● App Manager
o Responsible for accepting job submissions
o Negotiates the first container for executing the application specific app master
o Provide the service for restarting the application master on any failures
o Negotiates resource or container. ( Container is CPU,RAM,Cores)

● Scheduler
○ FIFO scheduler - First in, first out. Used in older versions
○ FAIR scheduler - Multiple applications get a fair share of resources. If there are 2 queues and one is underutilized, the other can take up its resources and operate
○ Capacity scheduler - Each queue is allocated its own resources

● Once a container has been approved by the Node manager, the RM starts the App master in one of the data nodes inside that container.

Node Manager:

● Slave daemon
● 1 NM per Node
● Container management
● Task execution

● The Nodemanager is allocated around 60-70% of the resources of the data node (e.g., out of 100 GB RAM and 60 cores, it allocates 60 GB RAM and 30 cores for the node manager)

● The Node manager determines the size of the container in the yarn-site.xml

Application Master

● Not a daemon
● 1 AM per application
● Job execution
● The App master will run the processing tasks on the containers. It authenticates and provides rights to an application to use a specific amount of resources.
● If an uber job (<10 tasks) comes in, the AM runs it in its own JVM.
● If it is not an uber job, the AM negotiates with the RM to find where resources are free. It instructs the NM to create a container there to execute the jar.


● Every task sends a ping to the AM and also to the JHS (Job History Server). In case of AM failure, the JHS will have the progress of every task.

YARN separates these two roles into two independent daemons: a resource manager to manage 
the use of resources across the cluster, and an application master to manage the lifecycle of 
applications running on the cluster. The idea is that an application master negotiates with the 
resource manager for cluster resources—described in terms of a number of containers each with 
a certain memory limit—then runs application specific processes in those containers. The 
containers are overseen by node managers running on cluster nodes, which ensure that the 
application does not use more resources than it has been allocated 

● The client, which submits the MapReduce job. 

● The YARN resource manager, which coordinates the allocation of compute resources on 
the cluster. 

● The YARN node managers, which launch and monitor the compute containers on 
machines in the cluster. 

● The MapReduce application master, which coordinates the tasks running the MapReduce 
job. The application master and the MapReduce tasks run in containers that are 
scheduled by the resource manager, and managed by the node managers. 


YARN in Detail

In a YARN cluster, there are two types of hosts:

● The ​ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to ​NodeManagers.​
● A ​NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.

YARN Configuration File


The YARN configuration file is an XML file that contains properties. This file is placed in a
well-known location on each host in the cluster and is used to configure the ResourceManager
and NodeManager. By default, this file is named ​yarn-site.xml​. The basic properties in this file
used to configure YARN are covered in the later sections.

YARN Requires a Global View


YARN currently defines two resources, vcores and memory. Each NodeManager tracks its own
local resources and communicates its resource configuration to the ResourceManager, which
keeps a running total of the cluster’s available resources. By keeping track of the total, the
ResourceManager knows how to allocate resources as they are requested. (Vcore has a special meaning in YARN. You can think of it simply as a "usage share of a CPU core." If you expect your tasks to be less CPU-intensive (sometimes called I/O-intensive), you can set the ratio of vcores to physical cores higher than 1 to maximize your use of hardware resources.)

Containers
Containers are an important YARN concept. You can think of a container as a request to hold
resources on the YARN cluster. Currently, a container hold request consists of vcore and
memory, as shown in Figure 4 (left).


For the next section, two new YARN terms need to be defined:

An application is a YARN client program that is made up of one or more tasks (see Figure 5).
For each running application, a special piece of code called an ApplicationMaster helps
coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the
application starts.

An application running tasks on a YARN cluster consists of the following steps:

1. The application starts and talks to the ResourceManager for the cluster:

Figure 5: Application starting up before tasks are assigned to the cluster


2. The ResourceManager makes a single container request on behalf of the application:

Figure 6: Application + allocated container on a cluster


3. The ApplicationMaster starts running within that container:

Figure 7: Application + ApplicationMaster running in the container on the cluster

4. The ApplicationMaster requests subsequent containers from the ResourceManager that are allocated to run tasks for the application. Those tasks do most of the status communication with the ApplicationMaster (allocated in Step 3):


Figure 8: Application + ApplicationMaster + tasks running in multiple containers on the cluster
5. Once all tasks are finished, the ApplicationMaster exits. The last container is de-allocated
from the cluster.
6. The application client exits. (The ApplicationMaster launched in a container is more
specifically called a managed AM. Unmanaged ApplicationMasters run outside of YARN’s
control. ​Llama​ is an example of an unmanaged AM.)


VERIFYING YARN CONFIGURATION IN THE RM UI


As mentioned before, the ResourceManager has a snapshot of the available resources on the
YARN cluster.

Example: Assume you have the following configuration on your 50 Worker nodes:

1. yarn.nodemanager.resource.memory-mb = 90000
2. yarn.nodemanager.resource.vcores = 60

Doing the math, your cluster totals should be:

1. memory: 50*90GB=4500GB=4.5TB
2. vcores: 50*60 vcores= 3000 vcores

On the ResourceManager Web UI page, the cluster metrics table shows the total memory and
total vcores for the cluster, as seen in Figure 2 below.

Figure 2: Verifying YARN Cluster Resources on ResourceManager Web UI

CONTAINER CONFIGURATION
At this point, the YARN Cluster is properly set up in terms of Resources. YARN uses these
resource limits for allocation, and enforces those limits on the cluster.

YARN Container Memory Sizing


Minimum: yarn.scheduler.minimum-allocation-mb
Maximum: yarn.scheduler.maximum-allocation-mb


YARN Container VCore Sizing


Minimum: yarn.scheduler.minimum-allocation-vcores
Maximum: yarn.scheduler.maximum-allocation-vcores
YARN Container Allocation Size Increments
Memory Increment: yarn.scheduler.increment-allocation-mb
VCore Increment: yarn.scheduler.increment-allocation-vcores

Restrictions and recommendations for Container values:

Memory properties:
Minimum required value of 0 for yarn.scheduler.minimum-allocation-mb.
Any of the memory sizing properties must be less than or equal to
yarn.nodemanager.resource.memory-mb.
Maximum value must be greater than or equal to the minimum value.
VCore properties:
Minimum required value of 0 for yarn.scheduler.minimum-allocation-vcores.
Any of the vcore sizing properties must be less than or equal to
yarn.nodemanager.resource.vcores.
Maximum value must be greater than or equal to the minimum value.
Recommended value of 1 for yarn.scheduler.increment-allocation-vcores. Higher values will
likely be wasteful.

Note that in YARN it’s possible to do some very easy misconfiguration. If the Container
memory request minimum (yarn.scheduler.minimum-allocation-mb) is larger than the memory
available per node (yarn.nodemanager.resource.memory-mb), then it would be impossible for
YARN to fulfill that request. A similar argument can be made for the Container vcore request
minimum.

MAPREDUCE CONFIGURATION
The Map task memory property is mapreduce.map.memory.mb. The Reduce task memory
property is mapreduce.reduce.memory.mb. Since both types of tasks must fit within a Container,
the value should be less than the Container maximum size. (We won't get into details about Java and the YARN properties that affect launching the Java Virtual Machine here. They may be covered in a future installment.)

SPECIAL CASE: APPLICATIONMASTER MEMORY CONFIGURATION


The property yarn.app.mapreduce.am.resource.mb is used to set the memory size for the
ApplicationMaster. Since the ApplicationMaster must fit within a container, the property should
be less than the Container maximum.

Scheduling in YARN
The ResourceManager (RM) tracks resources on a cluster, and assigns them to applications that need them. The scheduler is the part of the RM that does this matching, honoring organizational policies on sharing resources. Please note that:

YARN uses queues to share resources among multiple tenants. This will be covered in more
detail in “Introducing Queues” below.
The ApplicationMaster (AM) tracks each task’s resource requirements and coordinates container
requests. This approach allows better scaling since the RM/scheduler doesn’t need to track all
containers running on the cluster.

Scheduling in general is a difficult problem and there is no one “best” policy, which is why
YARN provides a choice of schedulers and configurable policies. They are as follows:

● The FIFO scheduler
● The Fair scheduler
● The Capacity scheduler

The FIFO (First In First Out) scheduler  


FIFO means First In First Out. As the name indicates, the job submitted first will get priority to
execute; in other words, the job runs in the order of submission. FIFO is a queue-based
scheduler. It is a very simple approach to scheduling and it does not guarantee performance
efficiency, as each job would use a whole cluster for execution. So other jobs may keep waiting
to finish their execution.


The Capacity Scheduler  


The Capacity scheduler is designed to allow applications to share cluster resources in a
predictable and simple fashion. These are commonly known as “job queues”. The main idea
behind capacity scheduling is to allocate available resources to the running applications, based
on individual needs and requirements.

Specifying the following property in yarn-site.xml enables the Capacity scheduling policy in
your YARN cluster:
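
The property in question is yarn.resourcemanager.scheduler.class; setting its value to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler selects the Capacity scheduler.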

The Fair Scheduler  


The fair scheduler is one of the most famous pluggable schedulers for large clusters. It enables
memory-intensive applications to share cluster resources in a very efficient way. Fair scheduling
is a policy that enables the allocation of resources to applications in a way that all applications
get, on average, an equal share of the cluster resources over a given period.


Introducing Queues

Queues are the organizing structure for YARN schedulers, allowing multiple tenants to share the cluster. As applications are submitted to YARN, they are assigned to a queue by the scheduler. The root queue is the parent of all queues. All other queues are each a child of the root queue or another queue (also called hierarchical queues). It is common for queues to correspond to users, departments, or priorities. 

Hadoop cluster sizes 

Hadoop  vendors tend to segregate clusters into three sizes: small, medium, and large. How many 
nodes  constitute  a  small,  medium,  or  large  is  mostly  a  heuristic  exercise.  Some  say  if  you  can 
manage  a  50-node  cluster  then  you  can  manage  a  1,000-node  cluster.  Generally  speaking,  a 
small  cluster  will  tend  to  be  less  than  32  nodes,  a  medium cluster is between 32 and 150 nodes, 
and a large cluster is anything over 150. Another common design template is whether your cluster will fit on a single rack or multiple racks. 


Hadoop commands 

Either ‘hadoop fs’ or ‘hdfs dfs’ can be used; on recent Hadoop versions the older ‘hadoop dfs’ form is deprecated in favor of ‘hdfs dfs’. 

hadoop fs -ls /dev/data/harish 

hadoop fs -mkdir -p /dev/data/harish 

hadoop fs -cp /dev/data/harish/1.xml /dev/data/sara/ 

hadoop fs -mv /dev/data/harish/1.xml /dev/data/sara/ - Absolute paths must be provided 

hadoop fs -cat /dev/data/harish/1.xml - To view contents of the file 

hadoop fs -rm /dev/data/harish/1.xml - To remove a file 

hadoop fs -du /dev/data/harish/ - To view disk usage of a directory 

hadoop fs -df -h /dev/data/harish/ - To view free space 

hadoop fs -text /dev/data/harish/part-000 - To view contents of a file stored in a binary format such as a SequenceFile 

To retrieve data from hadoop 

hadoop fs -get /dev/data/harish/1.xml /usr/bin/bash - Can get multiple files 

hadoop fs -copyToLocal /dev/data/harish/1.xml /usr/bin/bash - Can get only one file 

hadoop fs -moveToLocal /dev/data/harish/1.xml /usr/bin/bash - Can get only one file 

hadoop fs -getmerge /dev/data/mapred/part-000* usr/bin/bash - To retrieve multiple files into one single file. Usually used to read mapreduce outputs 

To push data to hadoop 

hadoop fs -put 1.xml /dev/data/harish - Can copy multiple files 

hadoop fs -copyFromLocal 1.xml /dev/data/harish - Can copy only one file 

hadoop fs -moveFromLocal 1.xml /dev/data/harish - Can move only one file 

hadoop fs -touchz /dev/data/harish/joker.txt - Creates a zero-length file at the path, with the current time as its timestamp 

hadoop fs -test -{option} /dev/data/harish/jo - Tests the path and returns 0 if the test passes, non-zero otherwise. The options are: 

-d: if the path given by the user is a directory, then it gives 0 output. 

-e: if the path given by the user exists, then it gives 0 output. 

-f: if the path given by the user is a file, then it gives 0 output. 

-s: if the path given by the user is not empty, then it gives 0 output. 

-z: if the file is zero length, then it gives 0 output. 

hadoop fs -stat /dev/data/harish - This HDFS filesystem command prints information about the path. The format specifiers are: 

%b: File size in bytes 

%n: Filename 

%o: Block size 

%r: Replication 

%y, %Y: Modification date 

hadoop fs -appendToFile 1.xml /dev/data/harish - This HDFS command appends a single source or multiple sources from the local file system to the file at the destination 

hadoop fs -checksum /dev/data/harish/1.xml - This HDFS command returns the checksum value of the file. A checksum is the outcome of running an algorithm, called a cryptographic hash function, on a piece of data, usually a single file. Comparing the checksum that you generate from your version of the file with the one provided by the source of the file helps ensure that your copy of the file is genuine and error free. 


hadoop distcp /dev/data/harish/1.xml /prd/data/harish/ - To copy files from one cluster to another cluster 

hadoop fs -count -q -h -v /dev/data/harish - Provides the number of files and the sum of the sizes of all files in all subdirectories 

hadoop fs -rm -r /dev/data/harish/ds={2019,2018,2017} - Deletes subdirectories 2017, 2018, and 2019 in a single command 
