
Big Data Hadoop and Spark Developer

Lesson 2—HDFS and YARN

© Simplilearn. All rights reserved.


Learning Objectives

Discuss Hadoop Distributed File System (HDFS)

Explain HDFS architecture and components

Describe YARN and its features

Explain YARN architecture


HDFS and YARN
Topic 1—Hadoop Distributed File System (HDFS)
Why HDFS?

In the traditional system, storing and retrieving volumes of data had three major issues:

• Speed: Search and analysis is time-consuming
• Reliability: Fetching data is difficult
• Cost: $10,000 to $14,000 per terabyte
Why HDFS? (Contd.)

HDFS resolves all major issues of the traditional file system.

• Speed (search and analysis is time-consuming): Hadoop clusters read/write terabytes of data per second.
• Reliability (fetching data is difficult): HDFS copies the data multiple times.
• Cost ($10,000 to $14,000 per terabyte): HDFS offers zero licensing and support costs.
What Is HDFS?

HDFS is a distributed file system that provides access to data across Hadoop clusters.
It manages and supports analysis of very large volumes of Big Data.
Characteristics of HDFS

HDFS has high fault-tolerance

HDFS has high throughput

HDFS is economical
How Does HDFS Work?
EXAMPLE

A patron gifts his books to a college library.
The librarian decides to arrange the books on a small rack.
The librarian then distributes multiple copies of each book on other racks based on the category.

Similarly, HDFS creates multiple copies of a data block and stores them in separate systems for easy access.
HDFS Storage

[Figure: a very large data file is split into blocks B1 to B4; copies of the blocks are spread across Nodes A to E, and the NameNode holds the metadata.]

• HDFS stores files in a number of blocks.
• Data is divided into blocks of 128 MB each.
• Each block is replicated to a few separate computers.
• Metadata keeps information about the block and its replication. It is stored in the NameNode.
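The block arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the 128 MB block size is the figure quoted on the slide, and the replication factor of 3 is an assumption used for the storage estimate.

```python
from typing import List

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide
REPLICATION = 3                 # assumed replication factor, for illustration only


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[int]:
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, rest = divmod(file_size, block_size)
    # Every block is full-sized except possibly the last one.
    return [block_size] * full + ([rest] if rest else [])


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


one_gb = 1024 ** 3
blocks = split_into_blocks(one_gb + 1)
print(len(blocks), blocks[-1])  # 9 1 (eight full blocks plus a 1-byte block)
```

Note that the last block only occupies as much space as the data it holds, which is why the sketch returns the leftover size rather than rounding up to a full block.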
HDFS and YARN
Topic 2—HDFS Architecture and Components
HDFS Architecture
It is also known as the master and slave architecture.

[Figure: the NameNode (master) holds the file system metadata, for example File.txt = A, C with DN1: A,C; DN2: A,C; DN3: A,C. The Secondary NameNode maintains the edit log and FsImage. Data Node 1 to Data Node N (slaves) store the blocks.]

• The NameNode (master) is responsible for accepting requests from clients.
• The metadata stores the block location and its replication.
• A file is split into one or more blocks, stored, and replicated in the slave nodes.
• Data required for the operation is loaded and segregated into chunks of data blocks.
HDFS Components

The main components of HDFS:

• Namenode
• Secondary Namenode
• File system
• Metadata
• Datanode
HDFS Components
NAMENODE

The NameNode server is the core component of an HDFS cluster.

It maintains and executes file system namespace operations such as opening, closing, and renaming of files and directories that are present in HDFS.

NameNode is a single point of failure.

HDFS Components
NAMENODE: OPERATION

The NameNode maintains two persistent files:

• A transaction log called an Edit Log
• A namespace image called an FsImage

At startup, the NameNode retrieves the edit log and updates the FsImage with the edit log information.
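The relationship between the two files can be illustrated with a toy replay loop. This is a simplified sketch, not the on-disk format Hadoop uses: here the FsImage is assumed to be a dictionary of path to replication factor, and the edit log a list of hypothetical (operation, path, argument) tuples.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical, simplified model: the FsImage maps a path to its replication
# factor, and each edit is an (operation, path, argument) tuple.
FsImage = Dict[str, int]
Edit = Tuple[str, str, Any]


def replay(fsimage: FsImage, edit_log: List[Edit]) -> FsImage:
    """Apply logged namespace changes to a copy of the checkpointed image."""
    namespace = dict(fsimage)
    for op, path, arg in edit_log:
        if op == "create":
            namespace[path] = int(arg)
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[str(arg)] = namespace.pop(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return namespace


image = {"/logs/a.txt": 3}
edits = [
    ("create", "/logs/b.txt", 2),
    ("rename", "/logs/a.txt", "/logs/old.txt"),
    ("delete", "/logs/b.txt", None),
]
print(replay(image, edits))  # {'/logs/old.txt': 3}
```

The point of the design is that the image is a snapshot while the log records only the changes since that snapshot, so replaying the log on top of the image recovers the current namespace.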


HDFS Components
SECONDARY NAMENODE

The Secondary NameNode server is responsible for maintaining a backup of the NameNode server. It maintains the edit log and namespace image information in sync with the NameNode server.

[Figure: in an HDFS cluster, the NameNode is the master and the DataNodes are the slaves; the Secondary NameNode maintains the edit log and FsImage.]

There can be only one Secondary NameNode server in a cluster. It cannot be treated as a disaster recovery server (it partially restores the NameNode server in case of failure).
HDFS Components
FILE SYSTEM

HDFS exposes a file system namespace and allows user data to be stored in files.

The file system supports operations such as create, remove, move, and rename.

[Figure: a namespace tree managed by the NameNode, with the root / at the top, directories such as /Dir 1, /Dir 1.1, and /Dir 2.1 below it, and files such as File A and File B as leaves.]
HDFS Components
METADATA

HDFS metadata is the structure of HDFS directories and files in a tree.

It includes attributes of directories and files, such as ownership, permissions, quotas, and replication factor.
HDFS Components
DATANODE

The DataNode is a multiple-instance server.

It is responsible for storing and maintaining the data blocks. It also retrieves the blocks when asked by clients or the NameNode.

[Figure: clients send metadata operations to the NameNode, which holds metadata (name, replicas, …), for example /home/foo/data, 3, …; the blocks are stored on DataNode 1 to DataNode 5, which are grouped into racks.]
Data Block Split
Data block split is an important process of HDFS architecture. Each file is split into one or more blocks, and the blocks are stored and replicated in DataNodes.

[Figure: a file split into blocks b1, b2, b3, …; under the NameNode, DataNode1 to DataNode4 each manage a subset of the blocks (b1 b2 …, b2 b3 …, b1 b3 …, b1 b2 …).]

By default, each file block is 128 MB.


Block Replication Architecture

Block replication refers to creating copies of a block in multiple DataNodes. Usually, the data is split into parts, such as part-0 and part-1.

[Figure: NameNode and JobTracker with blocks B1, B2, and B3; block replication between DataNode server 1 and DataNode server 2; labels "Job 1" and "Resubmit Job 1".]
Replication Method

• Each file is split into a sequence of blocks (of the same size, except the last one).
• Blocks are replicated for fault-tolerance.
• The block replication factor is usually configured at the cluster level (can also be done at the file level).
• The NameNode receives a heartbeat and a block report from each DataNode in the cluster.
• A block report lists the blocks on a DataNode.
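As a rough illustration of how heartbeats and block reports combine, the sketch below counts live replicas per block and flags the ones that fall short of the target. The data shapes and names (`block_reports`, `live_nodes`) are assumptions made for this example and do not mirror the NameNode's internal structures.

```python
from collections import Counter
from typing import Dict, List, Set


def under_replicated(
    block_reports: Dict[str, Set[str]],
    live_nodes: Set[str],
    replication: int = 3,
) -> List[str]:
    """Blocks with fewer live replicas than the target replication factor."""
    counts: Counter = Counter()
    for node, blocks in block_reports.items():
        if node in live_nodes:  # nodes that missed their heartbeats are ignored
            counts.update(blocks)
    all_blocks = set().union(*block_reports.values())
    return sorted(b for b in all_blocks if counts[b] < replication)


reports = {
    "dn1": {"b1", "b2"},
    "dn2": {"b1", "b2"},
    "dn3": {"b1"},
    "dn4": {"b2"},
}
# dn4 has stopped sending heartbeats, so its copy of b2 no longer counts.
print(under_replicated(reports, live_nodes={"dn1", "dn2", "dn3"}))  # ['b2']
```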
What Is a Rack?

A rack is a collection of machines that are physically located in a single place/data center and connected through a network.
In Hadoop, a rack is a physical collection of slave machines put together at a single location for data storage.
Rack Awareness in Hadoop

• In large Hadoop clusters, to reduce network traffic while reading/writing HDFS files, the NameNode chooses DataNodes that are on the same rack or a nearby rack to serve the read/write request.

• The NameNode obtains this rack information by maintaining the rack IDs of each DataNode.

• This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
Replication and Rack Awareness in Hadoop

The topology of the replicas is critical to ensure the reliability of HDFS. Usually, data is replicated thrice. The suggested replication topology is as follows:

• The first replica is placed on the same node as that of the client.
• The second replica is placed on a different rack from that of the first replica.
• The third replica is placed on the same rack as that of the second one, but on a different node.

[Figure: the NameNode directs a client's block B1 to nodes on Rack 1 and Rack 2, for example R1N3: B1, R2N1: B1, and R2N3: B1.]
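The three placement rules above can be expressed as a short function. This is a minimal sketch under stated assumptions: the client is assumed to run on a cluster node, the `topology` dictionary and node names are hypothetical, and random choice stands in for whatever load-aware selection a real NameNode performs.

```python
import random
from typing import Dict, List, Optional


def place_replicas(
    client_node: str,
    topology: Dict[str, List[str]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick three nodes following the placement rules listed above.

    topology maps a rack name to the nodes on that rack.
    """
    rng = rng or random.Random()
    local_rack = next(rack for rack, nodes in topology.items() if client_node in nodes)
    first = client_node  # rule 1: same node as the client
    remote_racks = [r for r in topology if r != local_rack and len(topology[r]) >= 2]
    if not remote_racks:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)  # rule 2: a different rack
    # rule 3: same rack as the second replica, but a different node
    second, third = rng.sample(topology[remote_rack], 2)
    return [first, second, third]


topology = {
    "rack1": ["R1N1", "R1N2", "R1N3"],
    "rack2": ["R2N1", "R2N2", "R2N3"],
}
print(place_replicas("R1N3", topology))
```

With this layout, losing all of rack 1 still leaves two copies on rack 2, while only one of the three writes has to cross a rack boundary twice.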
Replication and Rack Awareness: Example
The diagram illustrates a Hadoop cluster with three racks. Each rack contains multiple nodes.

• R1N1 represents Node 1 on Rack 1, and so on.
• The NameNode decides which DataNode belongs to which rack.

[Figure: the NameNode places blocks B1, B2, and B3 on nodes across the three racks.]
Implementing Rack Awareness in Hadoop

1. Create a topology data file on the NameNode as /etc/hadoop/conf/topology.map

2. Create a topology.py script file: Topology scripts are used by Hadoop to determine the rack location of nodes. The file is present in /etc/hadoop/conf/topology.sh

3. Add this property to core-site.xml on the NameNode only.
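A topology script receives one or more host names or IP addresses as arguments and prints one rack path per argument. The sketch below shows the lookup logic only. The map file format (one `<host-or-ip> <rack>` pair per line) is an assumption for this example, as are the function names; `/default-rack` is the rack Hadoop assigns when nothing else is known.

```python
from typing import Dict, List

DEFAULT_RACK = "/default-rack"


def load_map(text: str) -> Dict[str, str]:
    """Parse lines of the assumed form '<host-or-ip> <rack>'; '#' starts a comment."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 2:
            mapping[parts[0]] = parts[1]
    return mapping


def resolve(hosts: List[str], mapping: Dict[str, str]) -> List[str]:
    """Return one rack per host, falling back to the default rack."""
    return [mapping.get(host, DEFAULT_RACK) for host in hosts]


sample_map = """
# host or IP    rack
10.0.0.11       /rack1
10.0.0.12       /rack1
10.0.0.21       /rack2
"""
print("\n".join(resolve(["10.0.0.11", "10.0.0.21", "10.0.0.99"], load_map(sample_map))))
```

A deployed script would read the map from the topology data file, take the hosts from `sys.argv[1:]`, and print the result in the same way.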


Anatomy of File Write

Step 1: The client creates the file by calling the create() method on DistributedFileSystem.

Step 2: DistributedFileSystem makes an RPC call to the NameNode to create a new file in the filesystem's namespace. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, DFSOutputStream splits it into packets and writes them to an internal data queue, which is consumed by the DataStreamer. The DataStreamer streams the packets to the first DataNode in the pipeline.

Step 4: The first DataNode stores each packet and forwards it to the second DataNode in the pipeline. Similarly, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.

Step 5: DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the ack queue.

Step 6: When the client has finished writing data, it calls close() on the stream.
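The data queue, pipeline, and ack queue from the steps above can be simulated in memory. This is a toy model, not the HDFS client: the packet size is deliberately tiny, acknowledgements are assumed to arrive immediately, and `write_through_pipeline` is a hypothetical name.

```python
from collections import deque
from typing import Deque, Dict, List

PACKET_SIZE = 4  # bytes; tiny on purpose, chosen only for this illustration


def write_through_pipeline(data: bytes, pipeline: List[str]) -> Dict[str, bytes]:
    """Simulate packets flowing through a DataNode pipeline with an ack queue."""
    data_queue: Deque[bytes] = deque(
        data[i:i + PACKET_SIZE] for i in range(0, len(data), PACKET_SIZE)
    )
    ack_queue: Deque[bytes] = deque()
    stored: Dict[str, bytearray] = {node: bytearray() for node in pipeline}

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)   # held until every node has acknowledged it
        for node in pipeline:      # each node stores the packet, then forwards it
            stored[node].extend(packet)
        ack_queue.popleft()        # all nodes acknowledged: drop it from the ack queue

    assert not ack_queue           # nothing is left unacknowledged at close()
    return {node: bytes(buf) for node, buf in stored.items()}


result = write_through_pipeline(b"hello hdfs", ["dn1", "dn2", "dn3"])
print(result["dn3"])  # b'hello hdfs'
```

Keeping a packet in the ack queue until every node confirms it is what lets a real client resend data if a DataNode in the pipeline fails mid-write.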
Anatomy of File Read

Step 1: First, the client opens the file by calling the open() method on the FileSystem object.

Step 2: DistributedFileSystem calls the NameNode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block.

Step 3: The client then calls read() on the DFSInputStream, which connects to the closest DataNode for the first block in the file.

Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.

Step 5: When the end of the block is reached, DFSInputStream closes the connection to the DataNode and then finds the best DataNode for the next block.

Step 6: Blocks are read in order. When the client has finished reading, it calls close() on the FSDataInputStream.
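Choosing "the closest DataNode" for each block can be sketched as a minimum over a distance function. The distance values below (0 for the same node, 2 for the same rack, 4 for a different rack) are a convention assumed for this example, and the node names and `rack_of` mapping are hypothetical.

```python
from typing import Dict, List


def distance(client: str, node: str, rack_of: Dict[str, str]) -> int:
    """Assumed convention: 0 = same node, 2 = same rack, 4 = different rack."""
    if client == node:
        return 0
    return 2 if rack_of[client] == rack_of[node] else 4


def plan_read(
    client: str,
    block_locations: List[List[str]],
    rack_of: Dict[str, str],
) -> List[str]:
    """For each block, in file order, choose the closest DataNode holding a replica."""
    return [
        min(replicas, key=lambda node: distance(client, node, rack_of))
        for replicas in block_locations
    ]


rack_of = {"client": "r1", "dn1": "r1", "dn2": "r2", "dn3": "r2"}
# Replica locations as the NameNode might return them, one list per block.
locations = [["dn2", "dn1"], ["dn3", "dn2"]]
print(plan_read("client", locations, rack_of))  # ['dn1', 'dn3']
```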
HDFS Access

HDFS provides the following access mechanisms:

• Java API for applications
• Python access and C language wrapper for non-Java applications
• Web GUI utilized through an HTTP browser
• FS shell for executing commands on HDFS
HDFS Command Line

• Copy file simplilearn.txt from local disk to the user's directory in HDFS. This will copy the file to /user/username/simplilearn.txt:
$ hdfs dfs -put simplilearn.txt simplilearn.txt

• Get a directory listing of the user's home directory in HDFS:
$ hdfs dfs -ls

• Create a directory called testing under the user's home directory:
$ hdfs dfs -mkdir testing

• Delete the directory testing and all its contents:
$ hdfs dfs -rm -r testing
Demo
HDFS Commands

In this demonstration, you will view how to run a few basic command lines of HDFS. You can also
execute the commands in your lab environment for practice.
Hue File Browser

The file browser in Hue helps to view and manage HDFS directories and files.
Demo
HDFS Access using Hue

In this demonstration, you will see how to access HDFS using Hue. You will also learn how to view and manage
your HDFS directories and files using Hue. You can also execute it in your lab environment for practice.
HDFS and YARN
Topic 3—Introduction to YARN (Yet Another Resource Negotiator)
What Is YARN: Case Study

Yahoo was the first company to embrace Hadoop and became a trendsetter within the Hadoop ecosystem.

In late 2012, Yahoo struggled to handle iterative and stream processing of data on the Hadoop infrastructure
due to MapReduce limitations.

Both iterative and stream processing were important for Yahoo in facilitating its move from batch computing
to continuous computing.

How could this issue be solved?


What Is YARN: Case Study

Yahoo implemented YARN. It installed:

• Over 30,000 production nodes on Spark for iterative processing
• Storm for stream processing
• Hadoop for batch processing

This helped in handling more than 100 billion events, such as clicks, impressions, email content, and metadata, per day.
What Is YARN?

YARN is a resource manager created by separating the processing engine and the management function of MapReduce. It monitors and manages workloads, maintains a multi-tenant environment, manages the high availability features of Hadoop, and implements security controls.
Advantages of Using YARN

Advantages of the single-cluster approach:

• Reduced data motion: There's no need to move data between Hadoop YARN and systems running on different clusters of computers.
• Higher cluster utilization: Resources not utilized by a framework can be consumed by another.
• Lower operational costs: Only one "do-it-all" cluster needs to be managed.
YARN Infrastructure

The YARN Infrastructure is responsible for providing computational resources for application executions.

HADOOP 2.7

• MapReduce (data processing) and others (data processing) run on top of YARN.
• YARN (cluster resource management) provides resources for running an application.
• HDFS (redundant, reliable storage) provides storage.
HDFS and YARN
Topic 4—YARN Architecture
YARN Architecture

The three important elements of the YARN architecture are Resource Manager, ApplicationMaster, and Node Manager.

[Figure: a client submits work to the Resource Manager, which contains the Scheduler and the Applications Manager. Each Node Manager runs alongside a Data Node and hosts an App Master and containers.]


Components of YARN Architecture
RESOURCE MANAGER

The Resource Manager (RM) is the master. It runs several services, including the Resource Scheduler.

The Resource Manager mediates the available resources in the cluster among competing applications and ensures maximum cluster utilization.

It includes a pluggable scheduler called the YarnScheduler, which allows different policies for managing constraints, such as capacity, fairness, and Service-Level Agreements.
Components of YARN Architecture
RESOURCE MANAGER COMPONENT: SCHEDULER

The Scheduler is responsible for allocating resources to various running applications depending on the common constraints of capacities, queues, and so on.

The Scheduler has a policy plug-in to partition cluster resources among various applications. Examples: CapacityScheduler, FairScheduler.
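To give a feel for what a fairness policy does, the sketch below implements a generic max-min fair split of one resource. It is not the FairScheduler's actual algorithm or API; the function name, the single-resource model, and the sample demands are all assumptions made for illustration.

```python
from typing import Dict


def fair_share(total: float, demands: Dict[str, float]) -> Dict[str, float]:
    """Max-min fair allocation: no application gets more than it asks for, and
    leftover capacity is split evenly among applications that still want more."""
    allocation = {app: 0.0 for app in demands}
    remaining = float(total)
    unsatisfied = {app for app, demand in demands.items() if demand > 0}
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for app in sorted(unsatisfied):
            grant = min(share, demands[app] - allocation[app])
            allocation[app] += grant
            remaining -= grant
        unsatisfied = {a for a in unsatisfied if demands[a] - allocation[a] > 1e-9}
    return allocation


# 12 units of capacity shared by three applications with different demands.
print(fair_share(12, {"a": 2, "b": 5, "c": 10}))  # {'a': 2.0, 'b': 5.0, 'c': 5.0}
```

The small application is fully served, and the capacity it leaves unused is divided between the two larger ones instead of sitting idle.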
Components of YARN Architecture
RESOURCE MANAGER COMPONENT: APPLICATIONS MANAGER

The Applications Manager is an interface that maintains a list of applications that have been submitted, are currently running, or have completed.

The Applications Manager accepts job submissions, negotiates the first container for executing the application, and restarts the ApplicationMaster container on failure.
Components of YARN Architecture
RESOURCE MANAGER: INTERNAL COMPONENTS

The Resource Manager contains the following internal components:

• ClientService
• AdminService
• ApplicationMasterService
• ResourceTrackerService
• YARN Scheduler
• NodesListManager
• NMLivelinessMonitor
• AMLivelinessMonitor
• Applications Manager
• ApplicationMasterLauncher
• ContainerAllocationExpirer
• Security
Components of YARN Architecture
HOW RESOURCE MANAGER OPERATES

• The Resource Manager communicates with the clients through an interface called the ClientService.

• Administrative requests are served by a separate interface called the AdminService. Operators can get updated information about the cluster operation using this.

• In parallel, the ResourceTrackerService receives node heartbeats from the Node Manager to track new or decommissioned nodes.

• The NMLivelinessMonitor and NodesListManager keep an updated status of which nodes are healthy so that the Scheduler and the ResourceTrackerService can allocate work appropriately.

• The ApplicationMasterService manages Application Masters on all nodes, keeping the Scheduler informed.

• The AMLivelinessMonitor keeps a list of Application Masters and their last heartbeat times to let the Resource Manager know which applications are healthy on the cluster.

• Any ApplicationMaster that does not send a heartbeat within a certain interval is marked as dead and re-scheduled to run on a new container.
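The heartbeat bookkeeping described above can be sketched as a small class that remembers when each Application Master was last heard from. The class and method names are hypothetical, and time is passed in explicitly so the example stays deterministic; the real monitors are internal Resource Manager services.

```python
from typing import Dict, List


class LivelinessMonitor:
    """Tracks last heartbeat times and reports which entries have expired."""

    def __init__(self, expiry_seconds: float) -> None:
        self.expiry_seconds = expiry_seconds
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, name: str, now: float) -> None:
        """Record that `name` was heard from at time `now` (in seconds)."""
        self.last_seen[name] = now

    def expired(self, now: float) -> List[str]:
        """Names whose last heartbeat is older than the expiry interval."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )


monitor = LivelinessMonitor(expiry_seconds=10)
monitor.heartbeat("am1", now=0)
monitor.heartbeat("am2", now=5)
monitor.heartbeat("am1", now=8)
print(monitor.expired(now=16))  # ['am2'], last heard 11 seconds ago
```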
Components of YARN Architecture
RESOURCE MANAGER: HIGH AVAILABILITY MODE

Before Hadoop 2.4, the Resource Manager was the single point of failure in a YARN cluster.

The High Availability, or HA, feature adds an Active/Standby Resource Manager pair to remove this single point of failure.
Components of YARN Architecture
NODE MANAGER

The Node Manager (NM) is the slave. When it starts, it announces itself to the RM and offers resources to the cluster.

Each Node Manager takes instructions from the Resource Manager, and manages and reports on the containers on a single node.

When a container is leased to an application, the Node Manager sets up the container's environment, including the resource constraints specified in the lease and any dependencies.

The Node Manager runs on each node and manages the following:

• Container lifecycle management
• Container dependencies
• Container leases
• Node and container resource usage
• Node health
• Log management
• Reporting node and container status to the RM
Components of YARN Architecture
NODE MANAGER COMPONENT: YARN CONTAINER

A YARN container is the result of a successful resource allocation; that is, the RM has granted an application a lease to use specified resources on a specific node.
Components of YARN Architecture
NODE MANAGER COMPONENT: LAUNCHING A CONTAINER

To launch the container, the Application Master must provide a container launch context (CLC) that includes the following information:

• Environment variables
• Dependencies, that is, local resources such as data files or shared objects needed prior to launch
• Security tokens
• The command necessary to create the process that the application wants to launch
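The four pieces of information in a CLC map naturally onto a small record type. The dataclass below is an illustrative stand-in only: it is not the YARN class of the same purpose, and the field names, paths, and command shown are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ContainerLaunchContextSketch:
    """Illustrative stand-in for the four pieces of a CLC; not the YARN class."""
    environment: Dict[str, str] = field(default_factory=dict)
    local_resources: Dict[str, str] = field(default_factory=dict)  # name -> location
    security_tokens: bytes = b""
    commands: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """A context with nothing to run cannot start a process."""
        if not self.commands:
            raise ValueError("a launch context needs at least one command")


clc = ContainerLaunchContextSketch(
    environment={"APP_HOME": "/opt/example-app"},            # hypothetical value
    local_resources={"app.jar": "hdfs:///apps/example.jar"},  # hypothetical value
    commands=["java -jar app.jar"],                            # hypothetical command
)
clc.validate()
print(clc.commands[0])  # java -jar app.jar
```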
Components of YARN Architecture
APPLICATION MASTER

The Application Master in YARN is a framework-specific library that negotiates resources from the RM and works with the Node Manager or managers to execute and monitor containers and their resource consumption.
Components of YARN Architecture
APPLICATION MASTER: USES

The Application Master:

• manages the application lifecycle
• makes dynamic adjustments to resource consumption
• manages execution flow
• manages faults
• provides status and metrics to the Resource Manager
• interacts with Node Manager and Resource Manager using extensible communication protocols

While every application has its own instance of an AppMaster, it is possible to implement an AppMaster for a set of applications as well.
Applications of YARN

There can be many different workloads running on a Hadoop YARN cluster.

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search, Weave, …)

These workloads run on YARN (cluster resource management), which runs on top of the distributed file system.
Running an Application Through YARN

01 The client submits an application to the Resource Manager
02 The Resource Manager allocates a container
03 The Application Master contacts the related Node Manager
04 The Node Manager launches the container
05 The container executes the Application Master
Running an Application Through YARN
STEP 1

Users submit applications to the Resource Manager by typing the hadoop jar command.

The Resource Manager maintains the list of applications on the cluster and available resources on the Node Manager and determines the next application that receives a portion of the cluster resource.

[Figure: the client runs $ my-Hadoop-app against the Resource Manager; the cluster also shows the NameNode, an Application Master, and four Node Manager/DataNode pairs.]
Running an Application Through YARN
STEP 2

When the Resource Manager accepts a new application submission, the Scheduler first selects a container.

Next, the Application Master is started and is responsible for the entire life-cycle of the application.

[Figure: the Application Master sends a resources request to the Resource Manager: 1 x Node1/1GB/1 core and 1 x Node2/1GB/1 core.]
Running an Application Through YARN
STEP 3

After a container is allocated, the Application Master asks the Node Manager managing the host on which the container was allocated to use these resources to launch an application-specific task.

[Figure: the Resource Manager answers the Application Master with "Here are your containers".]
Running an Application Through YARN
STEP 4

The Node Manager does not monitor tasks; it only monitors the resource usage in the containers.

The Application Master negotiates containers to launch all of the tasks needed to complete the application.

[Figure: the Resource Manager answers the Application Master with "Here are your containers".]
Running an Application Through YARN
STEP 5

After the application is complete, the Application Master shuts itself down and releases its own container.

[Figure: the Application Master reports "I'm done!" to the Resource Manager.]
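The five steps can be traced with a tiny simulation that produces an event log. This is a sketch under simplifying assumptions: memory is the only resource, the first node with enough free memory is chosen, and `run_application` and the node names are hypothetical.

```python
from typing import Dict, List


def run_application(app: str, free_gb: Dict[str, int], am_gb: int = 1) -> List[str]:
    """Return an event log for one application; free_gb maps node -> free memory in GB."""
    log = [f"01 client submits {app} to the Resource Manager"]
    node = next((n for n in sorted(free_gb) if free_gb[n] >= am_gb), None)
    if node is None:
        raise RuntimeError("no Node Manager has enough free memory")
    free_gb[node] -= am_gb  # the lease reserves the resources on that node
    log.append(f"02 Resource Manager allocates a {am_gb} GB container on {node}")
    log.append(f"03 the related Node Manager on {node} is contacted")
    log.append(f"04 Node Manager on {node} launches the container")
    log.append(f"05 the container executes the Application Master for {app}")
    free_gb[node] += am_gb  # released when the Application Master shuts down
    return log


free = {"node1": 0, "node2": 2}
for event in run_application("my-Hadoop-app", free):
    print(event)
```

Because the container is released at the end, the free-memory table returns to its starting state once the application completes.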
Tools for YARN Development

Hadoop includes three tools for YARN developers:

• YARN Web UI
• Hue Job Browser
• YARN Command Line
Tools for YARN Development
YARN WEB UI

The YARN web UI runs on port 8088 by default.

It also provides a better view than Hue; however, you can't control or configure anything from the YARN web UI.
Tools for YARN Development
HUE JOB BROWSER

The Hue Job Browser allows you to monitor the status of a job, kill a running job, and view logs.
Tools for YARN Development
YARN COMMAND LINE

Most of the YARN commands are for administrators rather than developers.

A few useful commands:

- yarn -help
  lists all yarn commands

- yarn version
  prints the version

- yarn logs -applicationId <app-id>
  views the logs of the specified application ID
Demo
Using YARN Web UI, Hue Job Browser, and YARN command Line

In this demonstration, you will learn how to use YARN Web UI, Hue Job Browser, and YARN
command line. You can also execute it in your lab environment for practice.
Key Takeaways

HDFS is the storage layer for Hadoop.

HDFS chunks data into blocks and distributes them across the cluster.

Slave nodes run DataNode daemons that are managed by a single NameNode.

HDFS can be accessed using Hue, HDFS command, or HDFS API.

YARN manages resources in a Hadoop cluster and schedules jobs.

YARN works with HDFS to run tasks where the data is stored.

YARN executes jobs that can be monitored using Hue, YARN Web UI, or YARN
command.
Quiz
QUIZ 1
How is NameNode failure in non-HA mode tackled?

a. Secondary NameNode is switched on as NameNode
b. Secondary NameNode automatically starts working as NameNode
c. Primary NameNode is recreated from Secondary NameNode image backup
d. Another NameNode in cluster with replication works as main NameNode

The correct answer is c.

NameNode failure in non-HA mode is tackled by taking an image backup of the NameNode from the Secondary NameNode and incorporating it into a new NameNode machine.
QUIZ 2
Which of the following statements best describes how a large (100 GB) file is stored in HDFS?

a. The file is replicated three times by default. Each copy of the file is stored on a separate DataNode.
b. The master copy of the file is stored on a single DataNode. The replica copies are divided into fixed-size blocks, which are stored on multiple DataNodes.
c. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same DataNode.
d. The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. HDFS guarantees that different blocks from the same file are never on the same DataNode.

The correct answer is c.

The file is divided into fixed-size blocks, which are stored on multiple DataNodes. Each block is replicated three times by default. Multiple blocks from the same file might reside on the same DataNode; HDFS only keeps the replicas of any one block on different DataNodes.
QUIZ 3
Which of the following describes how a client reads a file from HDFS?

a. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
b. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
c. The client contacts the NameNode for the block location(s).
d. The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode and then from the NameNode to the client.

The correct answer is a.

The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client, and the client then reads the data directly off the DataNode(s). No file data passes through the NameNode.
QUIZ 4
Which of the following is/are valid statements?

a. HDFS is optimized for storing a large number of files smaller than the HDFS block size.
b. HDFS supports a write once-read once data access model.
c. HDFS is a distributed file system that replaces ext3 or ext4 on Linux nodes in a Hadoop cluster.
d. HDFS is a distributed file system that runs on top of native OS file systems and is well-suited for storage of very large datasets.

The correct answer is d.

HDFS is a distributed file system that runs on top of native OS file systems and is well-suited for storage of very large datasets.
QUIZ 5
The NameNode uses RAM:

a. To store the file contents in HDFS
b. To store filenames, list of blocks, and other meta information
c. To store the edits log that keeps track of changes in HDFS
d. To manage distributed read and write locks on files in HDFS

The correct answer is b.

NameNode uses RAM to store filenames, list of blocks, and other meta information.
QUIZ 6
You need to move a file titled weblogs into HDFS. You are not allowed to copy the file. You have ample space on your DataNodes. What action should you take to relieve this situation and store more files in HDFS?

a. Increase the block size on all current files in HDFS
b. Increase the block size on your remaining files
c. Decrease the block size on your remaining files
d. Increase the amount of memory for the NameNode

The correct answer is c.

It is recommended that you decrease the block size on your remaining files.
This concludes “HDFS and YARN.”
The next lesson is “MapReduce and Sqoop.”
