Apache Hadoop
Harish
17th December, 2019
Introduction
● HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware
● HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time
● Hadoop doesn’t require expensive, highly reliable hardware to run on. It is designed to
run on clusters of commodity hardware for which the chance of node failure across the
cluster is high, at least for large clusters. HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such failure.
Blocks
● A disk has a block size, which is the minimum amount of data that it can read or write.
● Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512
bytes.
● HDFS, too, has the concept of a block, but it is a much larger unit—128 MB by default in recent releases (64 MB in older ones). The sketch below shows how to read the block size back through the FileSystem API.
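A minimal sketch, assuming a hypothetical file path, of how the configured default block size and the block size actually used for an existing file can be read back through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/tom/part-00007");   // hypothetical example path

    // Default block size HDFS would use for new files written to this path.
    System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");

    // Block size that was actually used when this file was created.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size of " + file + ": " + status.getBlockSize() + " bytes");
  }
}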
● Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. Every I/O operation on the disk or network carries a small chance of introducing errors into the data it is reading or writing, and when the volumes of data flowing through the system are as large as the ones Hadoop is capable of handling, the chance of data corruption occurring is high. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
*Network latency is an expression of how much time it takes for a packet of data to get from one
designated point to another. In some environments (for example, AT&T), latency is measured
by sending a packet that is returned to the sender; the round-trip time is considered the latency.
*Seek time - the time taken for a disk drive to locate the area on the disk where the data to be
read is stored.
● It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using the -ignoreCrc option with the -get or the equivalent -copyToLocal command. This feature is useful if you have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might want to see whether it can be salvaged before you delete it (see the sketch below).
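A minimal sketch of the programmatic route described above, assuming the example path /user/tom/part-00007 used in the fsck output below and a made-up local destination file:

import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadIgnoringChecksum {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Must be called before open(), otherwise checksums are still verified.
    fs.setVerifyChecksum(false);

    try (FSDataInputStream in = fs.open(new Path("/user/tom/part-00007"));
         OutputStream out = Files.newOutputStream(Paths.get("part-00007.local"))) {
      // Copy the raw bytes to the local filesystem for inspection,
      // roughly what `-get -ignoreCrc` does from the shell.
      IOUtils.copyBytes(in, out, 4096, false);
    }
  }
}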
% hadoop fsck /user/tom/part-00007 -files -blocks -racks
This says that the file /user/tom/part-00007 is made up of one block and shows the datanodes where the blocks are located. The fsck options used are as follows:
The -files option shows the line with the filename, size, number of blocks, and its health (whether there
are any missing blocks).
The -blocks option shows information about each block in the file, one line per block.
The -racks option displays the rack location and the datanode addresses for each block.
An HDFS cluster has two types of node operating in a master-worker pattern: a namenode (the
master) and a number of datanodes (workers). The namenode manages the filesystem
namespace. It maintains the filesystem tree and the metadata for all the files and directories in
the tree. This information is stored persistently on the local disk in the form of two files: the
namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing.
Without the namenode, the filesystem cannot be used. In fact, if the machine running the
namenode were obliterated, all the files on the filesystem would be lost since there would be no
way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it
is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for
this.
The first way is to back up the files that make up the persistent state of the filesystem metadata.
Hadoop can be configured so that the namenode writes its persistent state to multiple
filesystems. These writes are synchronous and atomic. The usual configuration choice is to write
to local disk as well as a remote NFS mount.
It is also possible to run a secondary namenode, which despite its name does not act as a
namenode. Its main role is to periodically merge the namespace image with the edit log to
prevent the edit log from becoming too large. The secondary namenode usually runs on a
separate physical machine, since it requires plenty of CPU and as much memory as the
namenode to perform the merge. It keeps a copy of the merged namespace image, which can be
used in the event of the namenode failing. However, the state of the secondary namenode lags
that of the primary, so in the event of total failure of the primary, data loss is almost certain. The
usual course of action in this case is to copy the namenode’s metadata files that are on NFS to
the secondary and run it as the new primary.
When a filesystem client performs a write operation (such as creating or moving a file), it is first
recorded in the edit log. The namenode also has an in-memory representation of the filesystem
metadata, which it updates after the edit log has been modified. The in-memory metadata is used
to serve read requests.
The edit log is flushed and synced after every write before a success code is returned to the
client. For namenodes that write to multiple directories, the write must be flushed and synced to
every copy before returning successfully. This ensures that no operation is lost due to machine
failure.
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then creates a
new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and the old edits
file with the new one it started in step 1. It also updates the fstime file to record the time that the
checkpoint was taken.
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling.
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by adding
namenodes, each of which manages a portion of the filesystem namespace. For example, one
namenode might manage all the files rooted under /user, say, and a second namenode might
handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of the
metadata for the namespace, and a block pool containing all the blocks for the files in the
namespace. Namespace volumes are independent of each other, which means namenodes do
not communicate with one another, and furthermore the failure of one namenode does not affect
the availability of the namespaces managed by other namenodes. Block pool storage is not
partitioned, however, so datanodes register with each namenode in the cluster and store blocks
from multiple block pools.
HDFS High-Availability
The combination of replicating namenode metadata on multiple filesystems, and using the
secondary namenode to create checkpoints protects against data loss, but does not provide
high-availability of the filesystem. The namenode is still a single point of failure (SPOF), since if it
did fail, all clients—including MapReduce jobs—would be unable to read, write, or list files,
because the namenode is the sole repository of the metadata and the file-to-block mapping. In
such an event the whole Hadoop system would effectively be out of service until a new
namenode could be brought online. To recover from a failed namenode in this situation, an
administrator starts a new primary namenode with one of the filesystem metadata replicas, and
configures datanodes and clients to use this new namenode. The new namenode is not able to
serve requests until it has i) loaded its namespace image into memory, ii) replayed its edit log,
and iii) received enough block reports from the datanodes to leave safe mode. On large clusters
with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes
or more.
The long recovery time is a problem for routine maintenance too. In fact, since unexpected failure
of the namenode is so rare, the case for planned downtime is actually more important in practice.
The 0.23 release series of Hadoop remedies this situation by adding support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an active-standby
configuration. In the event of the failure of the active namenode, the standby takes over its
duties to continue servicing client requests without a significant interruption.
A few architectural changes are needed to allow this to happen:
• The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes since the block mappings are stored in
a namenode’s memory, and not on disk.
• Clients must be configured to handle namenode failover, which uses a mechanism that is
transparent to users.
If the active namenode fails, then the standby can take over very quickly (in a few tens of
seconds) since it has the latest state available in memory: both the latest edit log entries, and an
up-to-date block mapping.
The transition from the active namenode to the standby is managed by a new entity in
the system called the failover controller. There are various failover controllers, but the
default implementation uses ZooKeeper to ensure that only one namenode is active.
Each namenode runs a lightweight failover controller process whose job it is to monitor
its namenode for failures (using a simple heartbeating mechanism) and trigger a failover
should a namenode fail.
Failover may also be initiated manually by an administrator, for example, in the case of
routine maintenance. This is known as a graceful failover, since the failover controller
arranges an orderly transition for both namenodes to switch roles.
In the case of an ungraceful failover, however, it is impossible to be sure that the failed
namenode has stopped running. For example, a slow network or a network partition can
trigger a failover transition, even though the previously active namenode is still running
and thinks it is still the active namenode. The HA implementation goes to great lengths to
ensure that the previously active namenode is prevented from doing any damage and
causing corruption—a method known as fencing.
Importantly, when using the Quorum Journal Manager, only one NameNode will ever
be allowed to write to the JournalNodes.
In order for the Standby node to keep its state coordinated with the Active node, both nodes communicate with a group of separate daemons called ‘JournalNodes’ (JNs). When any namespace modification is performed by the Active node, it logs a record of the change in the JournalNodes. The Standby node is capable of reading the edits from the JournalNodes, and it constantly applies them to its own namespace. In the event of a failover, the Standby will make sure that it has read all the changes from the JournalNodes before changing its state to ‘Active’. This guarantees that the namespace state is fully synchronized before a failover occurs.
To provide a fast failover, it is also essential that the Standby node have up-to-date information regarding the location of blocks in the cluster. For this to happen, the DataNodes are configured with the location of both NameNodes, and they send block location information and heartbeats to both.
It is essential that only one of the NameNodes be Active at a time. Otherwise, the namespace state would diverge between the two and lead to data loss or erroneous results. In order to avoid this, the JournalNodes will only permit a single NameNode to be a writer at a time. During a failover, the NameNode which is to become active takes over the role of writing to the JournalNodes, which effectively prevents the other NameNode from continuing in the Active state.
The quorum journal manager uses epoch numbers. Epoch numbers are integers that always increase and are unique once assigned. A NameNode generates an epoch number using a simple algorithm and includes it in the RPC requests it sends to the QJM. When you configure NameNode HA, the first Active NameNode gets epoch value 1. On each failover or restart, the epoch number is incremented, and the NameNode with the higher epoch number is considered newer than any NameNode with an earlier epoch number.
Now suppose both NameNodes think they are active and send write requests to the quorum journal manager with their epoch numbers. How does the QJM handle this situation?
Each JournalNode stores the epoch number locally; this is called the promised epoch. Whenever a JournalNode receives an RPC request along with an epoch number from a NameNode, it compares the epoch number with the promised epoch. If the request comes from the newer node (its epoch number is greater than the promised epoch), the JournalNode records the new epoch number as the promised epoch. If the request comes from a NameNode with an older epoch number, it simply rejects the request.
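This promised-epoch rule is simple enough to sketch in a few lines of Java. The class below is an illustrative simplification of the behaviour described above, not Hadoop's actual QJM implementation:

public class JournalNodeSketch {

  // The highest epoch this JournalNode has promised to honour.
  // A real JournalNode persists this value to disk.
  private long promisedEpoch = 0;

  /** Returns true if the write is accepted, false if it is rejected (fenced off). */
  public synchronized boolean acceptWrite(long senderEpoch) {
    if (senderEpoch < promisedEpoch) {
      return false;                  // request from a stale namenode: reject it
    }
    promisedEpoch = senderEpoch;     // remember the newest epoch seen so far
    return true;
  }

  public static void main(String[] args) {
    JournalNodeSketch jn = new JournalNodeSketch();
    System.out.println(jn.acceptWrite(1));   // first active namenode -> true
    System.out.println(jn.acceptWrite(2));   // new active namenode after failover -> true
    System.out.println(jn.acceptWrite(1));   // old namenode still running -> false (fenced)
  }
}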
MapReduce
Datafiles are organized by date and weather station. There is a directory for each year from 1901
to 2001, each containing a gzipped file for each weather station with its readings for that year. For
example, here are the first entries for 1990:
There are tens of thousands of weather stations, so the whole dataset is made up of a large number of relatively small files. Below is the format of the data.
To take advantage of the parallel processing that Hadoop provides, we need to express our
query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a
cluster of machines.
MapReduce works by breaking the processing into two phases: the map phase and the reduce
phase. Each phase has key-value pairs as input and output, the types of which may be chosen by
the programmer. The programmer also specifies two functions: the map function and the reduce
function.
The input to our map phase is the raw NCDC data. We choose a text input format that gives us
each line in the dataset as a text value. The key is the offset of the beginning of the line from the
beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, because these are the
only fields we are interested in. In this case, the map function is just a data preparation phase,
setting up the data in such a way that the reduce function can do its work on it: finding the
maximum temperature for each year. The map function is also a good place to drop bad records:
here we filter out temperatures that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data (some
unused columns have been dropped to fit the page, indicated by ellipses):
These lines are presented to the map function as the key-value pairs:
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature, and emits them as its output (the temperature values have been interpreted as integers):
The output from the map function is processed by the MapReduce framework before being sent
to the reduce function. This processing sorts and groups the key-value pairs by key. So,
continuing the example, our reduce function sees the following input:
Each year appears with a list of all its air temperature readings. All the reduce function has to do
now is iterate through the list and pick up the maximum reading:
This is the final output: the maximum global temperature recorded in each year.
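As an illustration (the readings below are invented for this sketch, not the original NCDC sample), the map output, the grouped reduce input, and the final output might look like this:
Map output: (1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)
Reduce input: (1949, [111, 78]), (1950, [0, 22, -11])
Final output: (1949, 111), (1950, 22)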
DATA FLOW
First, some terminology. A MapReduce job is a unit of work that the client wants to be performed:
it consists of the input data, the MapReduce program, and configuration information. Hadoop
runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.
The tasks are scheduled using YARN and run on nodes in the cluster. If a task fails, it will be
automatically rescheduled to run on a different node.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just
splits. Hadoop creates one map task for each split, which runs the user-defined map function for
each record in the split.
Having many splits means the time taken to process each split is small compared to the time to
process the whole input. So if we are processing the splits in parallel, the processing is better
load balanced when the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine. Even if the machines
are identical, failed processes or other jobs running concurrently make load balancing desirable,
and the quality of the load balancing increases as the splits become more fine grained.
On the other hand, if splits are too small, the overhead of managing the splits and map task
creation begins to dominate the total job execution time. For most jobs, a good split size tends to
be the size of an HDFS block, which is 128 MB by default, although this can be changed for the
cluster (for all newly created files) or specified when each file is created.
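A minimal sketch, assuming a Job configured elsewhere, of how the split size can be bounded per job through FileInputFormat (the sizes shown are arbitrary examples, not recommendations):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void configureSplits(Job job) {
    // Splits will be at least 128 MB and at most 256 MB,
    // regardless of the HDFS block size of the input files.
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}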
Hadoop does its best to run the map task on a node where the input data resides in HDFS,
because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization.
Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split
are running other map tasks, so the job scheduler will look for a free map slot on a node in the
same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is
used, which results in an inter-rack network transfer. The three possibilities are illustrated in
Figure 2-2.
It should now be clear why the optimal split size is the same as the block size: it is the largest size
of input that can be guaranteed to be stored on a single node. If the split spanned two blocks, it
would be unlikely that any HDFS node stored both blocks, so some of the split would have to be
transferred across the network to the node running the map task, which is clearly less efficient
than running the whole map task using local data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. So storing it in HDFS with replication would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, Hadoop will automatically rerun the map task on another node to re-create the map output.
Reduce tasks don’t have the advantage of data locality; the input to a single reduce task is
normally the output from all mappers. In the present example, we have a single reduce task that
is fed by all of the map tasks. Therefore, the sorted map outputs have to be transferred across
the network to the node where the reduce task is running, where they are merged and then
passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS
for reliability. As explained in Chapter 3, for each HDFS block of the reduce output, the first
replica is stored on the local node, with other replicas being stored on off-rack nodes for
reliability. Thus, writing the reduce output does consume network bandwidth, but only as much
as a normal HDFS write pipeline consumes.
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted boxes
indicate nodes, the dotted arrows show data transfers on a node, and the solid arrows show data
transfers between nodes.
The number of reduce tasks is not governed by the size of the input, but instead is specified
independently. In The Default MapReduce Job, you will see how to choose the number of reduce
tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in each
partition, but the records for any given key are all in a single partition. The partitioning can be
controlled by a user-defined partitioning function, but normally the default partitioner—which
buckets keys using a hash function—works very well.
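For reference, the default behaviour can be reproduced with a tiny custom partitioner; the class below mirrors that hash-bucketing logic and could be plugged in with job.setPartitionerClass(), though with the default partitioner there is normally no need to:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always a valid partition index;
    // records with the same key always map to the same reduce task.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}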
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This
diagram makes it clear why the data flow between map and reduce tasks is colloquially known as
“the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than
this diagram suggests, and tuning it can have a big impact on job execution time, as you will see
in Shuffle and Sort.
COMBINER FUNCTIONS
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to
minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify
a combiner function to be run on the map output, and the combiner function’s output forms the
input to the reduce function. Because the combiner function is an optimization, Hadoop does not
provide a guarantee of how many times it will call it for a particular map output record, if at all. In
other words, calling the combiner function zero, one, or many times should produce the same
output from the reducer.
The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce function would then be called with:
(1950, [20, 25])
and would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property.[20] For example, if we were calculating mean temperatures, we couldn't use the mean as our combiner function, because:
mean(0, 20, 10, 25, 15) = 14
But:
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
package solution;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
/*
* MapReduce jobs are typically implemented by using a driver class.
* The purpose of a driver class is to set up the configuration for the
* MapReduce job and to run the job.
* Typical requirements for a driver class include configuring the input
* and output data formats, configuring the map and reduce classes,
* and specifying intermediate data formats.
*
* The following is the code for the driver class:
*/
public class WordCount {

  public static void main(String[] args) throws Exception {
/*
* The expected command-line arguments are the paths containing
* input and output data. Terminate the job if the number of
* command-line arguments is not exactly 2.
*/
if (args.length != 2) {
System.out.printf(
"Usage: WordCount <input dir> <output dir>\n");
System.exit(-1);
}
/*
* Instantiate a Job object for your job's configuration.
*/
Job job = new Job();
/*
* Specify the jar file that contains your driver, mapper, and reducer.
* Hadoop will transfer this jar file to nodes in your cluster running
* mapper and reducer tasks.
*/
job.setJarByClass(WordCount.class);
/*
* Specify an easily-decipherable name for the job.
* This job name will appear in reports and logs.
*/
job.setJobName("Word Count");
/*
* Specify the paths to the input and output data based on the
* command-line arguments.
*/
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
/*
* Specify the mapper and reducer classes.
*/
job.setMapperClass(WordMapper.class);
job.setReducerClass(SumReducer.class);
job.setCombinerClass(SumReducer.class);
/*
* For the word count application, the input file and output
* files are in text format - the default format.
*
* In text format files, each record is a line delineated by a
* line terminator.
*
* When you use other input formats, you must call the
* setInputFormatClass method. When you use other
* output formats, you must call the setOutputFormatClass method.
*/
/*
* For the word count application, the mapper's output keys and
* values have the same data types as the reducer's output keys
* and values: Text and IntWritable.
*
* When they are not the same data types, you must call the
* setMapOutputKeyClass and setMapOutputValueClass
* methods.
*/
/*
* Specify the job's output key and value classes.
*/
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(2);
/*
* Start the MapReduce job and wait for it to finish.
* If it finishes successfully, return 0. If not, return 1.
*/
boolean success = job.waitForCompletion(true);
System.exit(success ? 0 : 1);
}
}
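Assuming the driver above and the mapper and reducer classes that follow are packaged into a JAR (the JAR name and paths below are made up for illustration), the job can be launched like any other MapReduce job:
% hadoop jar wordcount.jar solution.WordCount /dev/data/books /dev/data/wordcounts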
package solution;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
/*
* To define a map function for your MapReduce job, subclass
* the Mapper class and override the map method.
* The class definition requires four parameters:
* The data type of the input key
* The data type of the input value
* The data type of the output key (which is the input key type
* for the reducer)
* The data type of the output value (which is the input value
* type for the reducer)
*/
public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
/*
* Convert the line, which is received as a Text object,
* to a String object.
*/
String line = value.toString();
/*
* The line.split("\\W+") call uses regular expressions to split the
* line up by non-word characters.
*
* If you are not familiar with the use of regular expressions in
* Java code, search the web for "Java Regex Tutorial."
*/
for (String word : line.split("\\W+")) {
if (word.length() > 0) {
/*
* Call the write method on the Context object to emit a key
* and a value from the map method.
*/
context.write(new Text(word), new IntWritable(1));
}
}
}
}
package solution;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
/*
* To define a reduce function for your MapReduce job, subclass
* the Reducer class and override the reduce method.
* The class definition requires four parameters:
* The data type of the input key (which is the output key type
* from the mapper)
* The data type of the input value (which is the output value
* type from the mapper)
* The data type of the output key
* The data type of the output value
*/
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
/*
* The reduce method runs once for each key received from
* the shuffle and sort phase of the MapReduce framework.
* The method receives a key of type Text, a set of values of type
* IntWritable, and a Context object.
*/
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int wordCount = 0;
/*
* For each value in the set of values passed to us by the mapper:
*/
for (IntWritable value : values) {
/*
* Add the value to the word count counter for this key.
*/
wordCount += value.get();
}
/*
* Call the write method on the Context object to emit a key
* and a value from the reduce method.
*/
context.write(key, new IntWritable(wordCount));
}
}
You can run a MapReduce job with a single method call: submit() on a Job object (you can also
call waitForCompletion(), which submits the job if it hasn’t been submitted already, then waits for
it to finish).[51] This method call conceals a great deal of processing behind the scenes. This
section uncovers the steps Hadoop takes to run a job.
The whole process is illustrated in Figure 7-1. At the highest level, there are five independent
entities:[52]
● The client, which submits the MapReduce job.
● The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
● The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
● The MapReduce application master, which coordinates the tasks running the MapReduce job.
● The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
Fig 7.1
JOB SUBMISSION
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in Figure 7-1). Having submitted the job, waitForCompletion() polls
the job’s progress once per second and reports the progress to the console if it has changed
since the last report. When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
● Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
● Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
● Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
● Copies the resources needed to run the job, including the job JAR file, the configuration
file, and the computed input splits, to the shared filesystem in a directory named after the
job ID (step 3). The job JAR is copied with a high replication factor (controlled by the
mapreduce.client.submit.file.replication property, which defaults to 10) so that there are
lots of copies across the cluster for the node managers to access when they run tasks for
the job.
● Submits the job by calling submitApplication() on the resource manager (step 4).
JOB INITIALIZATION
When the resource manager receives a call to its submitApplication() method, it hands off the
request to the YARN scheduler. The scheduler allocates a container, and the resource manager
then launches the application master’s process there, under the node manager’s management
(steps 5a and 5b).
The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of
the job’s progress, as it will receive progress and completion reports from the tasks (step 6). Next,
it retrieves the input splits computed in the client from the shared filesystem (step 7). It then
creates a map task object for each split, as well as a number of reduce task objects determined
by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks
are given IDs at this point.
If the job is small, the application master may choose to run the tasks in the same JVM as itself; such a job is said to be run as an uber task. What qualifies as a small job? By default, a small job is one that has less than 10 mappers, only
one reducer, and an input size that is less than the size of one HDFS block. (Note that these
values may be changed for a job by setting mapreduce.job.ubertask.maxmaps,
mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) Uber tasks must
be enabled explicitly (for an individual job, or across the cluster) by setting
mapreduce.job.ubertask.enable to true.
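A minimal sketch of opting a single job into uber-task execution using the properties quoted above (the thresholds shown simply restate the defaults described above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EnableUberTask {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true); // uber tasks are off by default
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // fewer than 10 mappers
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most one reducer
    return Job.getInstance(conf, "small job");
  }
}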
Finally, before any tasks can be run, the application master calls the setupJob() method on the
OutputCommitter. For FileOutputCommitter, which is the default, it will create the final output
directory for the job and the temporary working space for the task output. The commit protocol is
described in more detail in Output Committers.
TASK ASSIGNMENT
If the job does not qualify for running as an uber task, then the application master requests
containers for all the map and reduce tasks in the job from the resource manager (step 8).
Requests for map tasks are made first and with a higher priority than those for reduce tasks, since
all the map tasks must complete before the sort phase of the reduce can start (see Shuffle and
Sort). Requests for reduce tasks are not made until 5% of map tasks have completed (see Reduce
slow start).
Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality
constraints that the scheduler tries to honor (see Resource Requests). In the optimal case, the
task is data local—that is, running on the same node that the split resides on. Alternatively, the
task may be rack local: on the same rack, but not the same node, as the split. Some tasks are
neither data local nor rack local and retrieve their data from a different rack than the one they are
running on. For a particular job run, you can determine the number of tasks that ran at each
locality level by looking at the job’s counters (see Table 9-6).
Requests also specify memory requirements and CPUs for tasks. By default, each map and
reduce task is allocated 1,024 MB of memory and one virtual core. The values are configurable
on a per-job basis (subject to minimum and maximum values described in Memory settings in
YARN and MapReduce) via the following properties: mapreduce.map.memory.mb,
mapreduce.reduce.memory.mb, mapreduce.map.cpu.vcores and mapreduce.reduce.cpu.vcores.
TASK EXECUTION
Once a task has been assigned resources for a container on a particular node by the resource
manager’s scheduler, the application master starts the container by contacting the node manager
(steps 9a and 9b). The task is executed by a Java application whose main class is YarnChild.
Before it can run the task, it localizes the resources that the task needs, including the job
configuration and JAR file, and any files from the distributed cache (step 10; see Distributed
Cache). Finally, it runs the map or reduce task (step 11).
The YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce
functions (or even in YarnChild) don’t affect the node manager—by causing it to crash or hang, for
example.
Resource Manager:
● Master daemon
● Consists of the scheduler and the App manager
● 1 RM per cluster
● Only takes care of CPU/memory management
● App Manager
o Responsible for accepting job submissions
o Negotiates the first container for executing the application specific app master
o Provide the service for restarting the application master on any failures
o Negotiates resources/containers (a container is made up of CPU, RAM, and cores)
● Scheduler
○ FIFO scheduler - First in, first out. Used in older versions.
○ FAIR scheduler - Multiple applications run with a fair share of resources. If there are two queues and one is underutilized, the other can take up its resources and operate.
○ Capacity scheduler - Each queue is allocated its own resources.
Node Manager:
● Slave daemon
● 1 NM per Node
● Container management
● Task execution
● The NodeManager is allocated around 60-70% of the resources of the data node. (For example, out of 100 GB RAM and 60 cores, it might allocate 60 GB RAM and 30 cores to the node manager.)
● The node manager determines the size of the container in yarn-site.xml.
Application Master
● Not a daemon
● 1 AM per application
● Job execution
● The App Master runs the processing tasks in the containers. It authenticates and grants an application the right to use a specific amount of resources.
● If an uber job (<10 mappers) comes in, the AM runs it in its own JVM.
● If it is not an uber job, the AM negotiates with the RM to find where resources are free. It instructs the NM to create a container there to execute the jar.
● Every task sends a ping to the AM and also to the JHS (Job History Server). In case of AM failure, the JHS will have the progress of every task.
YARN separates the two roles of the classic MapReduce jobtracker (job scheduling and task progress monitoring) into two independent daemons: a resource manager to manage the use of resources across the cluster, and an application master to manage the lifecycle of applications running on the cluster. The idea is that an application master negotiates with the resource manager for cluster resources—described in terms of a number of containers, each with a certain memory limit—then runs application-specific processes in those containers. The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
● The YARN resource manager, which coordinates the allocation of compute resources on
the cluster.
● The YARN node managers, which launch and monitor the compute containers on
machines in the cluster.
● The MapReduce application master, which coordinates the tasks running the MapReduce
job. The application master and the MapReduce tasks run in containers that are
scheduled by the resource manager, and managed by the node managers.
YARN in Detail
● The ResourceManager is the master daemon that communicates with the client, tracks
resources on the cluster, and orchestrates work by assigning tasks to NodeManagers.
● A NodeManager is a worker daemon that launches and tracks processes spawned on
worker hosts.
A vcore (virtual core) has no special physical meaning in YARN. You can think of it simply as a “usage share of a CPU core.” If you expect your tasks to be less CPU-intensive (sometimes called I/O-intensive), you can set the ratio of vcores to physical cores higher than 1 to maximize your use of hardware resources.
Containers
Containers are an important YARN concept. You can think of a container as a request to hold
resources on the YARN cluster. Currently, a container hold request consists of vcore and
memory, as shown in Figure 4 (left).
For the next section, two new YARN terms need to be defined:
An application is a YARN client program that is made up of one or more tasks (see Figure 5).
For each running application, a special piece of code called an ApplicationMaster helps
coordinate tasks on the YARN cluster. The ApplicationMaster is the first process run after the
application starts.
1. The application starts and talks to the ResourceManager for the cluster.
2. The ResourceManager makes a single container request on behalf of the application.
3. The ApplicationMaster starts running within that container.
4. The ApplicationMaster requests subsequent containers from the ResourceManager that are required to run tasks for the application.
5. Once all tasks are finished, the ApplicationMaster exits and the last container is de-allocated from the cluster.
6. The application client exits.
Example: Assume you have the following configuration on each of your 50 worker nodes:
1. yarn.nodemanager.resource.memory-mb = 90000
2. yarn.nodemanager.resource.vcores = 60
The cluster then has the following aggregate resources:
1. memory: 50 * 90 GB = 4,500 GB = 4.5 TB
2. vcores: 50 * 60 vcores = 3,000 vcores
On the ResourceManager Web UI page, the cluster metrics table shows the total memory and
total vcores for the cluster, as seen in Figure 2 below.
CONTAINER CONFIGURATION
At this point, the YARN Cluster is properly set up in terms of Resources. YARN uses these
resource limits for allocation, and enforces those limits on the cluster.
Memory properties:
● Minimum required value of 0 for yarn.scheduler.minimum-allocation-mb.
● Any of the memory sizing properties must be less than or equal to yarn.nodemanager.resource.memory-mb.
● The maximum value must be greater than or equal to the minimum value.
VCore properties:
● Minimum required value of 0 for yarn.scheduler.minimum-allocation-vcores.
● Any of the vcore sizing properties must be less than or equal to yarn.nodemanager.resource.vcores.
● The maximum value must be greater than or equal to the minimum value.
● Recommended value of 1 for yarn.scheduler.increment-allocation-vcores. Higher values will likely be wasteful.
Note that in YARN it’s possible to do some very easy misconfiguration. If the Container
memory request minimum (yarn.scheduler.minimum-allocation-mb) is larger than the memory
available per node (yarn.nodemanager.resource.memory-mb), then it would be impossible for
YARN to fulfill that request. A similar argument can be made for the Container vcore request
minimum.
MAPREDUCE CONFIGURATION
The Map task memory property is mapreduce.map.memory.mb. The Reduce task memory
property is mapreduce.reduce.memory.mb. Since both types of tasks must fit within a Container,
the value should be less than the Container maximum size. (We won’t get into details about Java
and the YARN properties that affect launching the Java Virtual Machine here. They may be
covered in a future installment.)
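A minimal sketch of setting these two properties for a single job; the 2 GB and 4 GB figures are arbitrary examples and must stay within the container minimum/maximum sizes discussed above:

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {
  public static Configuration withTaskMemory() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);     // container size for each map task
    conf.setInt("mapreduce.reduce.memory.mb", 4096);  // container size for each reduce task
    return conf;
  }
}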
Scheduling in YARN
The ResourceManager (RM) tracks resources on a cluster and assigns them to applications that need them. The scheduler is the part of the RM that does this matching, honoring organizational policies on sharing resources. Please note that:
YARN uses queues to share resources among multiple tenants. This will be covered in more
detail in “Introducing Queues” below.
The ApplicationMaster (AM) tracks each task’s resource requirements and coordinates container
requests. This approach allows better scaling since the RM/scheduler doesn’t need to track all
containers running on the cluster.
Scheduling in general is a difficult problem and there is no one “best” policy, which is why YARN provides a choice of schedulers and configurable policies: the FIFO Scheduler, the Capacity Scheduler, and the Fair Scheduler.
By setting the yarn.resourcemanager.scheduler.class property in yarn-site.xml to org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler, you can enable the Capacity scheduling policy in your YARN cluster.
Introducing Queues
Hadoop vendors tend to segregate clusters into three sizes: small, medium, and large. How many
nodes constitute a small, medium, or large is mostly a heuristic exercise. Some say if you can
manage a 50-node cluster then you can manage a 1,000-node cluster. Generally speaking, a
small cluster will tend to be less than 32 nodes, a medium cluster is between 32 and 150 nodes,
and a large cluster is anything over 150 nodes. Another common design question is whether your cluster will fit on a single rack or will span multiple racks.
Hadoop commands
Either ‘hadoop fs’ or ‘hdfs dfs’ can be used as the command prefix; the older ‘hadoop dfs’ form is deprecated in recent Hadoop versions.
hadoop fs -ls /dev/data/harish - To list the contents of a directory
hadoop fs -mkdir -p /dev/data/harish - To create a directory, including any missing parent directories
hadoop fs -cp /dev/data/harish/1.xml /dev/data/sara/ - To copy a file
hadoop fs -mv /dev/data/harish/1.xml /dev/data/sara/ - Absolute paths must be provided
hadoop fs -cat /dev/data/harish/1.xml - To view contents of the file
hadoop fs -rm /dev/data/harish/1.xml - To remove a file
hadoop fs -du /dev/data/harish/ - To view disk usage of a directory
hadoop fs -df -h /dev/data/harish/ - To view free space
hadoop fs -text /dev/data/harish/part-000 - To view contents of a file that is in sequence-file format
hadoop fs -get /dev/data/harish/1.xml /usr/bin/bash - Can get multiple files
hadoop fs -copyToLocal /dev/data/harish/1.xml /usr/bin/bash - Can get only one file
hadoop fs -moveToLocal /dev/data/harish/1.xml /usr/bin/bash - Can get only one file
hadoop fs -getmerge /dev/data/mapred/part-000* /usr/bin/bash - To retrieve multiple files into one single file; usually used to read MapReduce outputs
hadoop fs -touchz /dev/data/harish/joker.txt - Creates a zero-length file at the path, with the current time as its timestamp
hadoop fs -test -[defs] /dev/data/harish - Tests the given path:
-d: if the path given by the user is a directory, then the command returns 0.
-e: if the path given by the user exists, then the command returns 0.
-f: if the path given by the user is a file, then the command returns 0.
-s: if the path given by the user is not empty, then the command returns 0.
hadoop fs -stat /dev/data/harish - Prints information about the path. Format specifiers include:
%n: filename
%r: replication
hadoop fs -appendToFile 1.xml /dev/data/harish - Appends single or multiple sources from the local file system to the file at the destination
hadoop fs -checksum /dev/data/harish/1.xml - Returns the checksum value of the file. A checksum is the outcome of running an algorithm, called a cryptographic hash function, on a piece of data, usually a single file. Comparing the checksum that you generate from your version of the file with the one provided by the source of the file helps ensure that your copy of the file is genuine and error free.
hadoop distcp /dev/data/harish/1.xml /prd/data/harish/ - To copy files from one cluster to another cluster
hadoop fs -rm -r /dev/data/harish/ds={2019,2018,2017} - Deletes subdirectories ds=2017, ds=2018, and ds=2019 in a single command