BIG DATA HADOOP BANK

FAQs For Data Science


1. What is the biggest data set that you have processed and how did you process it? What was the result?
2. Tell me two success stories about your analytic or computer science projects? How was the lift (or success)
measured?
3. How do you optimize a web crawler to run much faster, extract better information and summarize data to
produce cleaner databases?
4. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? And
which languages would you choose for semi-structured text data reconciliation?
5. State any 3 positive and negative aspects about your favorite statistical software.
6. You are about to send one million email (marketing campaign). How do you optimize delivery and its
response? Can both of these be done separately?
7. How would you turn unstructured data into structured data? Is it really necessary? Is it okay to store data as
flat text files rather than in an SQL-powered RDBMS?
8. In terms of access speed (assuming both fit within RAM) is it better to have 100 small hash tables or one big
hash table in memory? What do you think about in-database analytics?
9. Can you perform logistic regression with Excel? If yes, how can it be done? Would the result be good?
10. Give examples of data that does not have a Gaussian distribution, or log-normal. Also give examples of data
that has a very chaotic distribution?
11. How can you prove that one improvement you've brought to an algorithm is really an improvement over not
doing anything? How familiar are you with A/B testing?
12. What is sensitivity analysis? Is it better to have low sensitivity and low predictive power? How do you perform
good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity
of your models?
13. Compare logistic regression with decision trees and neural networks. How have these technologies improved
over the last 15 years?
14. What is root cause analysis? How to identify a cause Vs a correlation? Give examples.
15. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule
redundancy, rule discovery and the combinatorial nature of the problem? Can an approximate solution to the
rule set problem be okay? How would you find an okay approximate solution? What factors will help you decide
that it is good enough and stop looking for a better one?
16. Which tools do you use for visualization? What do you think of Tableau, R and SAS? (for graphs). How to
efficiently represent 5 dimension in a chart or in a video?
17. Which is better: Too many false positives or too many false negatives?
18. Have you used any of the following: Time series models, Cross-correlations with time lags, Correlograms,
Spectral analysis, Signal processing and filtering techniques? If yes, in which context?
19. What is the computational complexity of a good and fast clustering algorithm? What is a good clustering
algorithm? How do you determine the number of clusters? How would you perform clustering in one million
unique keywords, assuming you have 10 million data points and each one consists of two keywords and a
metric measuring how similar these two keywords are? How would you create this 10 million data points table
in the first place?
20. How can you fit Non-Linear relations between X (say, Age) and Y (say, Income) into a Linear Model?


21. What is regularization? What is the difference in the outcome (coefficients) between the L1 and L2 norms?
22. What is Box-Cox transformation?
23. What is multicollinearity? How can we solve it?
24. Does the Gradient Descent method always converge to the same point?
25. Is it necessary that the Gradient Descent Method will always find the global minima?

FAQs For MongoDB


What are the best features of MongoDB?

Document-oriented

High performance

High availability

Easy scalability

Rich-query language

When using replication, can some members use journaling and others not?
Yes!

Can journaling feature be used to perform safe hot backups?


Yes!

What are the 32-bit nuances?


There is extra memory-mapped file activity with journaling, which further constrains the already limited database size of
32-bit builds. For now, journaling is therefore disabled by default on 32-bit systems.

Will there be journal replay programs in case of incomplete entries (if there is a failure in the
middle of one)?
Each journal (group) write is consistent and won't be replayed during recovery unless it is complete.

What is the role of profiler in MongoDB?


MongoDB includes a database profiler which shows performance characteristics of each operation against the
database. With this profiler you can find queries (and write operations) which are slower than they should be and use
this information for determining when an index is needed.
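As an illustration (not from the original text), a minimal mongo shell session for working with the profiler might look like this; the 100 ms threshold is just an example value:
db.setProfilingLevel(1, 100)                            // record operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5)      // inspect the five most recent profiled operations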

What is a namespace?
MongoDB stores BSON objects in collections. The concatenation of the database name and the collection name (with
a period in between) is called a namespace.

When an object attribute is removed, is it deleted from the store?


Yes, you can remove the attribute and then re-save() the object.

Are null values allowed?



Yes, but only for the members of an object. A null cannot be added to the database collection as it isn't an object. However,
{} can be added.

Does an update fsync to disk immediately?


No. Writes to disk are lazy by default. A write may only hit the disk a couple of seconds later. For example, if the
database receives a thousand increments to an object within one second, it will only be flushed to disk once. (Note: fsync
options are available both at the command line and via getLastError.)
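For example, a hedged sketch using the legacy getLastError command to force a flush for one write on older MongoDB releases (newer drivers express this through write concerns):
db.runCommand({ getLastError: 1, fsync: true })    // ask the server to fsync before acknowledging the last write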

How do I do transactions/locking?
MongoDB does not use traditional locking or complex transactions with rollback, as it is designed to be lightweight,
fast and predictable in its performance. It can be thought of as analogous to MySQL's MyISAM autocommit
model. By keeping transaction support extremely simple, performance is enhanced, especially in a system that may
run across many servers.

Why are data files so large?


MongoDB does aggressive preallocation of reserved space to avoid file system fragmentation.

How long does replica set failover take?


It may take 10-30 seconds for the primary to be declared down by the other members and a new primary to be elected.
During this window of time, the cluster is down for primary operations, i.e., writes and strongly consistent reads. However,
eventually consistent queries may be executed against secondaries at any time (in slaveOk mode), including during this
window.

What's a Master or Primary?


This is a node/member which is currently the primary and processes all writes for the replica set. During a failover
event in a replica set, a different member can become primary.

What's a Secondary or Slave?


A secondary is a node/member which applies operations from the current primary. This is done by tailing the replication
oplog (local.oplog.rs). Replication from primary to secondary is asynchronous, however, the secondary will try to stay
as close to current as possible (often this is just a few milliseconds on a LAN).

Is it required to call getLastError to make a write durable?


No. If getLastError (aka Safe Mode) is not called, the server behaves exactly as if it had been called.
The getLastError call simply allows one to get confirmation that the write operation was successfully committed. Of
course, you will often want that confirmation, but the safety and durability of the write are independent of that call.

Should you start out with Sharded or with a Non-Sharded MongoDB environment?
We suggest starting with Non-Sharded for simplicity and quick startup, unless your initial data set will not fit on a single
server. Upgrading from Non-Sharded to Sharded is easy and seamless, so there is not a lot of advantage in setting
up Sharding before your data set is large.

How does Sharding work with replication?



Each Shard is a logical collection of partitioned data. The shard could consist of a single server or a cluster of replicas.
Using a replica set for each Shard is highly recommended.

When will data be on more than one Shard?


MongoDB Sharding is range-based, so all the objects in a collection initially lie in a single chunk. Only when there is more than one
chunk is there an option for multiple Shards to hold data. Right now, the default chunk size is 64 MB, so you need at
least 64 MB of data before a migration will occur.

What happens when a document is updated on a chunk that is being migrated?


The update will go through immediately on the old Shard and then the change will be replicated to the new Shard
before ownership transfers.

What happens when a Shard is down or slow when querying?


If a Shard is down, the query will return an error unless the 'partial' query option is set. If a Shard is responding slowly,
Mongos will wait for it.

Can the old files in the moveChunk directory be removed?


Yes, these files are made as backups during normal Shard balancing operations. Once the operations are done then
they can be deleted. The clean-up process is currently manual so this needs to be taken care of to free up space.

How do you see the connections used by Mongos?


The following command needs to be used: db._adminCommand(connPoolStats);

If a moveChunk fails, is it necessary to cleanup the partially moved docs?


No, chunk moves are consistent and deterministic. The move will retry and when completed, the data will be only on
the new Shard.

What are the disadvantages of MongoDB?


1. A 32-bit edition has a 2 GB data limit. After that it will corrupt the entire DB, including the existing data. A 64-bit
edition won't suffer from this bug/feature.
2. A default installation of MongoDB has asynchronous and batch commits turned on. Meaning, it lies when asked
to store something in the DB and commits all changes in a batch at a later time. If there is a server crash
or power failure, all those commits buffered in memory will be lost. This functionality can be disabled, but then
it will perform only as well as, or worse than, MySQL.
3. MongoDB is only ideal for implementing things like analytics/caching, where the impact of a small data loss is
negligible.
4. In MongoDB, it's difficult to represent relationships between data, so you end up doing that manually by creating
another collection to represent the relationship between documents in two or more collections.

FAQs For Hadoop Administration


Explain checkpointing in Hadoop and why is it important?
Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It's crucial for efficient
NameNode recovery and restart, and is an important indicator of overall cluster health.
The NameNode persists filesystem metadata. At a high level, the NameNode's primary responsibility is to store the HDFS
namespace, meaning things like the directory tree, file permissions and the mapping of files to block IDs. It is essential
that this metadata is safely persisted to stable storage for fault tolerance.
This filesystem metadata is stored in two different parts: the fsimage and the edit log. The fsimage is a file that
represents a point-in-time snapshot of the filesystem's metadata. However, while the fsimage file format is very
efficient to read, it's unsuitable for making small incremental updates like renaming a single file. Thus, rather than
writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation
in the edit log for durability. This way, if the NameNode crashes, it can restore its state by first loading the fsimage and
then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state
of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the
namesystem modifications made since the creation of the fsimage.
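For illustration only, an operator can also force a checkpoint manually with the dfsadmin module (a sketch assuming HDFS superuser privileges):
bin/hadoop dfsadmin -safemode enter     # block namespace modifications
bin/hadoop dfsadmin -saveNamespace      # write a fresh fsimage and roll the edit log
bin/hadoop dfsadmin -safemode leave     # resume normal operation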

What is the default block size in HDFS and what are the benefits compared with smaller block sizes?
Most block-structured file systems use a block size on the order of 4 or 8 KB. By contrast, the default block size in
HDFS is 64 MB or larger. This allows HDFS to decrease the amount of metadata storage required per file.
Furthermore, it allows fast streaming reads of data by keeping large amounts of data sequentially organized on the
disk. As a result, HDFS is designed for very large files that are read sequentially. Unlike a file system such as
NTFS or EXT, which holds numerous small files, HDFS stores a modest number of very large files: hundreds of
megabytes, or gigabytes, each.
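As a sketch, the default can be overridden per cluster in hdfs-site.xml; the 128 MB value below is only an example, and the property name shown is the Hadoop 1.x one:
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>  <!-- 128 MB expressed in bytes -->
</property>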

What are the two main modules which help you interact with HDFS and what are they used for?
user@machine:hadoop$ bin/hadoop moduleName -cmd args
The moduleName tells the program which subset of Hadoop functionality to use. -cmd is the name of a specific
command within this module to execute, and its arguments follow the command name.
The two modules relevant to HDFS are: dfs and dfsadmin.
The dfs module, also known as FsShell, provides basic file manipulation operations and works with objects within
the file system. The dfsadmin module manipulates or queries the file system as a whole.
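A few illustrative invocations (the paths shown are placeholders):
bin/hadoop dfs -ls /user/data               # FsShell: list a directory in HDFS
bin/hadoop dfs -put local.txt /user/data/   # FsShell: copy a local file into HDFS
bin/hadoop dfsadmin -report                 # dfsadmin: summarize datanode status and capacity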

How can I set up Hadoop nodes (datanodes/namenodes) to use multiple volumes/disks?
Datanodes can store blocks in multiple directories, typically located on different local disk drives. In order to set up
multiple directories one needs to specify a comma-separated list of pathnames as the value of the configuration
parameter dfs.data.dir/dfs.datanode.data.dir. Datanodes will attempt to place equal amounts of data in each of the directories.
The Namenode also supports multiple directories, which store the namespace image and edit log. In order to set up
multiple directories one needs to specify a comma-separated list of pathnames as the value of the configuration
parameter dfs.name.dir/dfs.namenode.name.dir. The namenode directories provide namespace data redundancy so that the
image and log can be restored from the remaining disks/volumes if one of them fails.
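A hedged hdfs-site.xml sketch of such a layout, with placeholder mount points:
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>  <!-- datanode block storage spread over two disks -->
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/mnt/remote/hdfs/name</value>  <!-- redundant copies of the namespace image and edit log -->
</property>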

How do you read a file from HDFS?


The following are the steps for doing this:
Step 1: The client uses a Hadoop client program to make the request.
Step 2: The client program reads the cluster config file on the local machine, which tells it where the namenode is located.
This has to be configured ahead of time.
Step 3: The client contacts the NameNode and requests the file it would like to read.
Step 4: Client validation is checked, by username or by a strong authentication mechanism like Kerberos.
Step 5: The client's validated request is checked against the owner and permissions of the file.
Step 6: If the file exists and the user has access to it, the NameNode responds with the first block id and provides
a list of datanodes where a copy of the block can be found, sorted by their distance to the client (reader).
Step 7: The client now contacts the most appropriate datanode directly and reads the block data. This process repeats
until all blocks in the file have been read or the client closes the file stream.
If a datanode dies while the file is being read, the client library will automatically attempt to read another replica of the data from
another datanode. If all replicas are unavailable, the read operation fails and the client receives an exception. If
the information returned by the NameNode about block locations is outdated by the time the client attempts to
contact a datanode, a retry will occur if there are other replicas, or the read will fail.
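Seen from code, the same flow is hidden behind the FileSystem API. A minimal sketch (the class name and file path are placeholders):
// Reads an HDFS file and copies it to standard output.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                  // contacts the NameNode described above
        FSDataInputStream in = fs.open(new Path("/user/data/sample.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);    // block-by-block reads from the datanodes
        } finally {
            IOUtils.closeStream(in);
        }
    }
}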

What are schedulers and what are the three types of schedulers that
can be used in Hadoop cluster?
Schedulers are responsible for assigning tasks to open slots on tasktrackers. The scheduler is a plug-in within the
jobtracker. The three types of schedulers are:

FIFO (First in First Out) Scheduler

Fair Scheduler

Capacity Scheduler

How do you decide which scheduler to use?


The Capacity Scheduler (CS) can be used in the following situations:
1. When you know a lot about your cluster workloads and utilization and simply want to enforce resource allocation.
2. When you have very little fluctuation within queue utilization. The CS's more rigid resource allocation makes sense when all queues are at capacity almost all the time.
3. When you have high variance in the memory requirements of jobs and you need the CS's memory-based scheduling support.
4. When you demand scheduler determinism.

The Fair Scheduler can be used over the Capacity Scheduler under the following conditions:
1. When you have a slow network and data locality makes a significant difference to job runtime; features like delay scheduling can make a dramatic difference in the effective locality rate of map tasks.
2. When you have a lot of variability in the utilization between pools; the Fair Scheduler's pre-emption model achieves much greater overall cluster utilization by giving away otherwise reserved resources when they're not used.
3. When you require jobs within a pool to make equal progress rather than running in FIFO order.

Why are the dfs.name.dir and dfs.data.dir parameters used? Where are they specified and what happens if you don't specify these parameters?
dfs.name.dir specifies the path of the directory in the Namenode's local file system where HDFS metadata is stored, and
dfs.data.dir specifies the path of the directory in a Datanode's local file system where HDFS file blocks are stored. These
parameters are specified in the hdfs-site.xml config file on all nodes in the cluster, including master and slave nodes.
If these parameters are not specified, the Namenode's metadata and the Datanodes' block information get stored
in /tmp under a directory named after the Hadoop user. This is not a safe place: when nodes are restarted, data will be lost,
and it is critical if the Namenode is restarted, as formatting information will be lost.

What is file system checking utility FSCK used for? What kind of
information does it show? Can FSCK show information about files
which are open for writing by a client?
The filesystem checking utility fsck is used to check and display the health of the file system and the files and blocks in it. When
used with a path (bin/hadoop fsck <path> -files -blocks -locations -racks) it recursively shows the health of all files under
the path, and when used with /, it checks the entire file system. By default, fsck ignores files still open for writing
by a client. To list such files, run fsck with the -openforwrite option.


fsck checks the file system, prints a dot for each healthy file it finds, and prints a message for the files that are less
than healthy, including the ones which have over-replicated blocks, under-replicated blocks, mis-replicated blocks,
corrupt blocks and missing replicas.

What are the important configuration files that need to be updated/edited to set up a fully distributed mode of a Hadoop 1.x cluster (Apache distribution)?
The configuration files that need to be updated to set up a fully distributed mode of Hadoop are:

Hadoop-env.sh

Core-site.xml

Hdfs-site.xml

Mapred-site.xml

Masters

Slaves

These files can be found in your Hadoop conf directory. If Hadoop daemons are started individually using
bin/hadoop-daemon.sh start <daemon-name>, then the masters and slaves files need not be
updated and can be empty. This way of starting daemons requires the command to be issued on the appropriate nodes to
start the appropriate daemons. If Hadoop daemons are started using bin/start-dfs.sh and bin/start-mapred.sh, then the
masters and slaves configuration files on the namenode machine need to be updated:
Masters - IP address/hostname of the node where the secondary namenode will run.
Slaves - IP addresses/hostnames of the nodes where the datanodes (and eventually the task trackers) will run.
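For illustration, the two properties most commonly set in core-site.xml and mapred-site.xml for a fully distributed 1.x cluster look roughly like this (hostnames and ports are placeholders):
<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:9000</value>   <!-- where clients find the NameNode -->
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value>        <!-- where task trackers find the JobTracker -->
</property>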

FAQs For Hadoop HDFS


What is Big Data?
Big Data is nothing but an assortment of data so huge and complex that it becomes very tedious to capture, store,
process, retrieve and analyze it with the help of on-hand database management tools or traditional data processing
techniques.

Can you give some examples of Big Data?
There are many real-life examples of Big Data: Facebook is generating 500+ terabytes of data per day, NYSE (New York
Stock Exchange) generates about 1 terabyte of new trade data per day, and a jet airline collects 10 terabytes of sensor
data for every 30 minutes of flying time. All these are day-to-day examples of Big Data!

Can you give a detailed overview about the Big Data being generated by Facebook?


As of December 31, 2012, there were 1.06 billion monthly active users on Facebook and 680 million mobile users. On
average, 3.2 billion likes and comments are posted every day on Facebook, and 72% of the web audience is on Facebook.
And why not! There are so many activities going on on Facebook, from wall posts, sharing images and videos, to writing
comments and liking posts. In fact, Facebook started using Hadoop in mid-2009 and was one of the initial users
of Hadoop.

What are the four characteristics of Big Data?


According to IBM, the four characteristics of Big Data are: Volume: Facebook generating 500+ terabytes of data per
day. Velocity: analyzing 2 million records each day to identify the reason for losses. Variety: images, audio, video,
sensor data, log files, etc. Veracity: biases, noise and abnormality in data.

How Big is Big Data?
With time, data volume is growing exponentially. Earlier we used to talk about megabytes or gigabytes. But the time has
arrived when we talk about data volume in terms of terabytes, petabytes and even zettabytes! Global data volume was
around 1.8 ZB in 2011 and is expected to be 7.9 ZB in 2015. It is also said that global information doubles
every two years!

How is analysis of Big Data useful for organizations?


Effective analysis of Big Data provides a lot of business advantage as organizations will learn which areas to focus
on and which areas are less important. Big data analysis provides some early key indicators that can prevent the
company from a huge loss or help in grasping a great opportunity with open hands! A precise analysis of Big Data
helps in decision making! For instance, nowadays people rely so much on Facebook and Twitter before buying any
product or service. All thanks to the Big Data explosion.
Who are Data Scientists?
Data scientists are soon replacing business analysts or data analysts. Data scientists are experts who find solutions
to analyze data. Just as we have web analysts, we have data scientists who have good business insight into how to handle a
business challenge. Sharp data scientists are not only involved in dealing with business problems, but also in choosing the
relevant issues that can bring value addition to the organization.

What is Hadoop?
Hadoop is a framework that allows for distributed processing of large data sets across clusters of commodity
computers using a simple programming model.

Why do we need Hadoop?


Every day a large amount of unstructured data is getting dumped into our machines. The major challenge is not to
store large data sets in our systems but to retrieve and analyze this big data in organizations, especially when the data is present
on different machines at different locations. In this situation the necessity for Hadoop arises. Hadoop has the ability to
analyze data present on different machines at different locations very quickly and in a very cost-effective way. It
uses the concept of MapReduce, which enables it to divide a query into small parts and process them in parallel.
This is also known as parallel computing.
What are some of the characteristics of Hadoop framework?
The Hadoop framework is written in Java. It is designed to solve problems that involve analyzing large data (e.g. petabytes).
The programming model is based on Google's MapReduce and the infrastructure is based on Google's distributed file
system (GFS). Hadoop handles large file/data throughput and supports data-intensive distributed
applications. Hadoop is scalable, as more nodes can easily be added to it.
Give a brief overview of Hadoop history.
In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS
papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node
Hadoop cluster and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for
Hadoop.
Give examples of some companies that are using Hadoop?
A lot of companies use Hadoop, such as Cloudera, EMC, MapR, Hortonworks, Amazon, Facebook,
eBay, Twitter, Google and so on.
What is the basic difference between traditional RDBMS and Hadoop?
A traditional RDBMS is used for transactional systems to report and archive data, whereas Hadoop is an approach
to store huge amounts of data in a distributed file system and process it. An RDBMS will be useful when you want to
seek one record from big data, whereas Hadoop will be useful when you want big data in one shot and will perform
analysis on it later.
What is structured and unstructured data?
Structured data is data that is easily identifiable because it is organized in a structure. The most common form of
structured data is a database, where specific information is stored in tables, that is, in rows and columns. Unstructured
data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email,
logs and random text. It is not in the form of rows and columns.

What are the core components of Hadoop?
Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and
MapReduce is used to process such large data sets.

Now, let's get cracking with the hard stuff:


What is HDFS?
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of
commodity hardware.
What are the key features of HDFS?
HDFS is highly fault-tolerant, with high throughput, suitable for applications with large data sets, streaming access to
file system data and can be built out of commodity hardware.
What is Fault Tolerance?
Suppose you have a file stored in a system, and due to some technical problem that file gets destroyed. Then there
is no chance of getting back the data present in that file. To avoid such situations, Hadoop has introduced the feature
of fault tolerance in HDFS. In Hadoop, when we store a file, it automatically gets replicated at two other locations also.
So even if one or two of the systems collapse, the file is still available on the third system.
Replication causes data redundancy, then why is it pursued in HDFS?
HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at
any time. Thus, to make the entire system highly fault-tolerant, HDFS replicates and stores data in different places.
Any data on HDFS gets stored in at least 3 different locations. So, even if one of them is corrupted and another is
unavailable for some time for any reason, data can still be accessed from the third one. Hence, there is no chance
of losing the data. This replication factor helps us to attain the Hadoop feature called fault tolerance.
Since the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be
replicated on the other two?
Since there are 3 replicas, when we send the MapReduce programs, calculations will be done only on the original data.
The master node will know which node exactly has that particular data. If one of the nodes is not responding,
it is assumed to have failed. Only then will the required calculation be done on the second replica.
What is throughput? How does HDFS get a good throughput?
Throughput is the amount of work done in a unit time. It describes how fast the data is getting accessed from the
system and it is usually used to measure performance of the system. In HDFS, when we want to perform a task or an
action, then the work is divided and shared among different systems. So all the systems will be executing the tasks
assigned to them independently and in parallel. So the work will be completed in a very short period of time. In this
way, the HDFS gives good throughput. By reading data in parallel, we decrease the actual time to read data
tremendously.
What is streaming access?


As HDFS works on the principle of Write Once, Read many, the feature of streaming access is extremely important
in HDFS. HDFS focuses not so much on storing the data but how to retrieve it at the fastest possible speed, especially
while analyzing logs. In HDFS, reading the complete data is more important than the time taken to fetch a single
record from the data.
What is a commodity hardware? Does commodity hardware include RAM?
Commodity hardware is a non-expensive system which is not of high quality or high-availability. Hadoop can be
installed on any average commodity hardware. We don't need supercomputers or high-end hardware to work on
Hadoop. Yes, Commodity hardware includes RAM because there will be some services which will be running on
RAM.
What is a Namenode?
The Namenode is the master node on which the job tracker runs, and it holds the metadata. It maintains and manages the
blocks which are present on the datanodes. It is a high-availability machine and the single point of failure in HDFS.
Is the Namenode also commodity hardware?
No. The Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure
in HDFS. The Namenode has to be a high-availability machine.
What is metadata?
Metadata is the information about the data stored in datanodes such as location of the file, size of the file and so on.
What is a Datanode?
Datanodes are the slaves which are deployed on each machine and provide the actual storage. These are responsible
for serving read and write requests for the clients.
Why do we use HDFS for applications having large data sets and not when there are lot of small files?
HDFS is more suitable for a large amount of data in a single file than for small amounts of data spread across
multiple files. This is because the Namenode is a very expensive, high-performance system, so it is not prudent to occupy
space in the Namenode with the unnecessary amount of metadata generated for multiple small files. When
there is a large amount of data in a single file, the Namenode occupies less space. Hence, for optimized
performance, HDFS supports large data sets instead of multiple small files.
What is a daemon?
A daemon is a process or service that runs in the background. In general, we use this word in the UNIX environment. The
equivalent of a daemon in Windows is a service, and in DOS it is a TSR.


What is a job tracker?
The job tracker is a daemon that runs on the namenode for submitting and tracking MapReduce jobs in Hadoop. It assigns
tasks to the different task trackers. In a Hadoop cluster, there will be only one job tracker but many task trackers. It
is the single point of failure for the Hadoop MapReduce service: if the job tracker goes down, all the running jobs are
halted. It receives heartbeats from the task trackers, based on which the job tracker decides whether an assigned task is
completed or not.
What is a task tracker?
The task tracker is also a daemon, and it runs on the datanodes. Task trackers manage the execution of individual tasks on
the slave nodes. When a client submits a job, the job tracker will initialize the job, divide the work and assign it to
different task trackers to perform MapReduce tasks. While performing this action, the task tracker
simultaneously communicates with the job tracker by sending heartbeats. If the job tracker does not receive a heartbeat
from a task tracker within the specified time, it will assume that the task tracker has crashed and assign its tasks to another
task tracker in the cluster.
Is the Namenode machine the same as the datanode machine in terms of hardware?
It depends upon the cluster you are trying to create. The Hadoop VM can be there on the same machine or on another
machine. For instance, in a single node cluster, there is only one machine, whereas in the development or in a testing
environment, Namenode and datanodes are on different machines.

What is a heartbeat in HDFS?


A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode, and a task tracker sends
its heartbeats to the job tracker. If the Namenode or job tracker does not receive a heartbeat, they will conclude that there
is some problem with the datanode, or that the task tracker is unable to perform the assigned task.
Are Namenode and job tracker on the same host?
No. In a production environment, the Namenode is on a separate host and the job tracker is on a separate host.
What is a block in HDFS?
A block is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in
contrast to the block size of 8,192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which
are stored as independent units. HDFS blocks are large compared to disk blocks, particularly to minimize the cost
of seeks.
If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size?
No, not at all! 64 MB is just a unit where the data will be stored. In this particular situation, only 50 MB will be consumed
by an HDFS block and 14 MB will be free to store something else. It is the master node that does data allocation in an
efficient manner.
What are the benefits of block transfer?
A file can be larger than any single disk in the network. There's nothing that requires the blocks from a file to be stored
on the same disk, so they can take advantage of any of the disks in the cluster. Making the unit of abstraction a block
rather than a file simplifies the storage subsystem. Blocks provide fault tolerance and availability. To insure against
corrupted blocks and disk and machine failure, each block is replicated to a small number of physically separate
machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is
transparent to the client.
If we want to copy 10 blocks from one machine to another, but another machine can copy only 8.5 blocks,
can the blocks be broken at the time of replication?
In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the Master node
will figure out what the actual amount of space required is, how many blocks are being used, how much space is
available, and it will allocate the blocks accordingly.
How is indexing done in HDFS?
Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on
storing the last part of the data, which will say where the next part of the data will be. In fact, this is the basis of HDFS.

If a datanode is full, how is it identified?


When data is stored in a datanode, the metadata of that data is stored in the Namenode. So the Namenode can
identify when a datanode is full.
If datanodes increase, then do we need to upgrade Namenode?
While installing the Hadoop system, Namenode is determined based on the size of the clusters. Most of the time, we
do not need to upgrade the Namenode because it does not store the actual data, but just the metadata, so such a
requirement rarely arises.
Are job tracker and task trackers present in separate machines?


Yes, job tracker and task tracker are present in different machines. The reason is job tracker is a single point of failure
for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
When we send data to a node, do we allow settling-in time before sending more data to that node?
Yes, we do.
Does hadoop always require digital data to process?
Yes. Hadoop always requires digital data to be processed.
On what basis will the Namenode decide which datanode to write on?
As the Namenode has the metadata (information) related to all the data nodes, it knows which datanode is free.
Doesn't Google have its very own version of DFS?
Yes, Google has its own DFS, known as the Google File System (GFS), developed by Google Inc. for its own use.

Who is a user in HDFS?
A user is someone like you or me, who has some query or who needs some kind of data.
Is client the end user in HDFS?
No. A client is an application which runs on your machine and is used to interact with the Namenode (job tracker) or a
datanode (task tracker).
What is the communication channel between client and namenode/datanode?
The mode of communication is SSH.
What is a rack?
Rack is a storage area with all the datanodes put together. These datanodes can be physically located at different
places. Rack is a physical collection of datanodes which are stored at a single location. There can be multiple racks
in a single location.
On what basis data will be stored on a rack?
When the client is ready to load a file into the cluster, the content of the file will be divided into blocks. Now the client
consults the Namenode and gets 3 datanodes for every block of the file which indicates where the block should be
stored. While placing the datanodes, the key rule followed is: for every block of data, two copies will exist in one rack and
the third copy in a different rack. This rule is known as the Replica Placement Policy.


Do we need to place the 2nd and 3rd replicas in rack 2 only?
Yes, this is to avoid datanode failure.
What if rack 2 and datanode fails?
If both rack 2 and the datanode present in rack 1 fail, there is no chance of getting the data back. In order to avoid such
situations, we need to replicate the data more times instead of replicating it only thrice. This can be done by
changing the value of the replication factor, which is set to 3 by default.
What is a Secondary Namenode? Is it a substitute to the Namenode?
The secondary Namenode constantly reads the data from the RAM of the Namenode and writes it into the hard disk
or the file system. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system goes
down.
What is the difference between Gen1 and Gen2 Hadoop with regards to the Namenode?
In Gen 1 Hadoop, Namenode is the single point of failure. In Gen 2 Hadoop, we have what is known as Active and
Passive Namenode structure. If the active Namenode fails, the passive Namenode takes over.
What is MapReduce?
MapReduce is the heart of Hadoop and consists of two parts: map and reduce. Maps and reduces are programs
for processing data. Map processes the data first to give some intermediate output, which is further processed by
Reduce to generate the final output. Thus, MapReduce allows for distributed processing of the map and reduce
operations.
Can you explain how do map and reduce work?
The Namenode takes the input, divides it into parts and assigns them to datanodes. These datanodes process the tasks
assigned to them, produce key-value pairs and return the intermediate output to the reducer. The reducer collects
the key-value pairs from all the datanodes, combines them and generates the final output.
What is Key value pair in HDFS?
A key-value pair is the intermediate data generated by maps and sent to reduces for generating the final output.

What is the difference between the MapReduce engine and the HDFS cluster?
The HDFS cluster is the name given to the whole configuration of master and slaves where data is stored. The MapReduce
engine is the programming module which is used to retrieve and analyze data.
Is map like a pointer?


No, Map is not like a pointer.
Do we require two servers for the Namenode and the datanodes?
Yes, we need two different servers for the Namenode and the datanodes. This is because the Namenode requires a highly
configured system, as it stores information about the location details of all the files stored on different datanodes,
whereas the datanodes only require low-configuration systems.
Why are the number of splits equal to the number of maps?
The number of maps is equal to the number of input splits because we want the key and value pairs of all the input
splits.
Is a job split into maps?
No, a job is not split into maps. A split is created for the file. The file is placed on datanodes in blocks. For each split, a
map is needed.
Which are the two types of writes in HDFS?
There are two types of writes in HDFS: posted and non-posted write. Posted Write is when we write it and forget about
it, without worrying about the acknowledgement. It is similar to our traditional Indian post. In a Non-posted
Write, we wait for the acknowledgement. It is similar to today's courier services. Naturally, a non-posted write is
more expensive than the posted write. It is much more expensive, though both writes are asynchronous.
Why Reading is done in parallel and Writing is not in HDFS?
Reading is done in parallel because by doing so we can access the data quickly. But we do not perform the write operation
in parallel, because it might result in data inconsistency.
For example, if you have a file and two nodes are trying to write data into the file in parallel, then the first node does not
know what the second node has written and vice versa. This makes it ambiguous which data should be stored and
accessed.
Can Hadoop be compared to NOSQL database like Cassandra?
Though NoSQL is the closest technology that can be compared to Hadoop, it has its own pros and cons. There is no
DFS in NoSQL. Hadoop is not a database; it's a filesystem (HDFS) plus a distributed programming framework
(MapReduce).

FAQs For Hadoop Cluster


Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:


1. standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode

What are the features of Stand alone (local) mode?


In stand-alone mode there are no daemons; everything runs in a single JVM. It has no DFS and uses the local file
system. Stand-alone mode is suitable only for running MapReduce programs during development. It is one of the
least used environments.

What are the features of Pseudo mode?


Pseudo mode is used both for development and in the QA environment. In the Pseudo mode all the daemons run on
the same machine.

Can we call VMs as pseudos?


No, VMs are not pseudos, because a VM is something different and pseudo mode is very specific to Hadoop.

What are the features of Fully Distributed mode?


Fully distributed mode is used in the production environment, where we have n number of machines forming a
Hadoop cluster. Hadoop daemons run on a cluster of machines. There is one host on which the Namenode runs,
other hosts on which datanodes run, and machines on which the task trackers run. We
have separate masters and separate slaves in this distribution.

Does Hadoop follow the UNIX pattern?


Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a conf directory, as in the case of UNIX.

In which directory is Hadoop installed?


Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.

What are the port numbers of Namenode, job tracker and task tracker?
The port number for the Namenode web UI is 50070, for the job tracker it is 50030 and for the task tracker it is 50060.

What is the Hadoop-core configuration?


Hadoop core used to be configured by two XML files:
1. hadoop-default.xml
2. hadoop-site.xml
These files are written in XML format. We have certain properties in these XML files, which consist of a name and a value.
But these files do not exist any more.

What are the Hadoop configuration files at present?


There are 3 configuration files in Hadoop:
1. core-site.xml


2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory.

How do you exit the Vi editor?

To exit the Vi editor, press ESC, type :q and then press Enter.

What is a spill factor with respect to the RAM?


The spill factor is the size after which your files move to a temporary file. The Hadoop temp directory is used for this.

Is fs.mapr.working.dir a single directory?


Yes, fs.mapr.working.dir is just one directory.

Which are the three main hdfs-site.xml properties?


The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the metadata will be stored and where DFS is located, on disk or
on a remote location.
2. dfs.data.dir, which gives you the location where the data is going to be stored.
3. fs.checkpoint.dir, which is for the secondary Namenode.

How to come out of the insert mode?


To come out of insert mode, press ESC. Then, to leave the editor, type :q (if you have not written anything) or :wq (if you have
written anything in the file) and press Enter.

What is Cloudera and why it is used?


Cloudera is a distribution of Hadoop. It is also the name of the user created by default on the Cloudera VM. Cloudera packages
Apache Hadoop and is used for data processing.

What happens if you get a 'connection refused' Java exception when you type hadoop fsck /?
It could mean that the Namenode is not working on your VM.

We are using Ubuntu operating system with Cloudera, but from where we can download
Hadoop or does it come by default with Ubuntu?
Hadoop does not come by default with Ubuntu; you have to download it from Cloudera or from Edureka's dropbox and
then run it on your system. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu
or Red Hat. There are installation steps available at the Cloudera site or in Edureka's dropbox. You can go either
way.

What does jps command do?


This command checks whether your Namenode, datanode, task tracker, job tracker, etc are working or not.

How can I restart Namenode?


1. Click on stop-all.sh and then click on start-all.sh, OR
2. Write sudo hdfs (press Enter), su-hdfs (press Enter), /etc/init.d/ha (press Enter) and then
/etc/init.d/hadoop-0.20-namenode start (press Enter).

What is the full form of fsck?

Full form of fsck is File System Check.

How can we check whether Namenode is working or not?


To check whether Namenode is working or not, use the command /etc/init.d/hadoop-0.20-namenode status or as
simple as jps.

What does mapred.job.tracker do?


The mapred.job.tracker property tells you which of your nodes is acting as the job tracker.

What does /etc /init.d do?


/etc/init.d specifies where daemons (services) are placed, and it is also used to see the status of these daemons. It is very Linux-specific
and has nothing to do with Hadoop.

How can we look for the Namenode in the browser?


To look for the Namenode in the browser, you don't use localhost:8021; the port number to look for
the Namenode in the browser is 50070.

How to change from SU to Cloudera?


To change from SU to Cloudera just type exit.

Which files are used by the startup and shutdown commands?


The slaves and masters files are used by the startup and the shutdown commands.

What do slaves consist of?


Slaves consist of a list of hosts, one per line, that host datanode and task tracker servers.

What do masters consist of?


Masters contain a list of hosts, one per line, that are to host secondary namenode servers.

What does hadoop-env.sh do?


hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.

Can we have multiple entries in the master files?


Yes, we can have multiple entries in the Master files.

Where is hadoop-env.sh file present?


hadoop-env.sh file is present in the conf location.

In Hadoop_PID_DIR, what does PID stand for?


PID stands for Process ID.

What does /var/hadoop/pids do?


It stores the PID.

What does hadoop-metrics.properties file do?


hadoop-metrics.properties is used for Reporting purposes. It controls the reporting for Hadoop. The default status
is not to report.

What are the network requirements for Hadoop?


The Hadoop core uses the secure shell (SSH) to launch the server processes on the slave nodes. It requires a password-less
SSH connection between the master and all the slaves and the secondary machines.

Why do we need a password-less SSH in Fully Distributed environment?


We need password-less SSH in a fully distributed environment because when the cluster is live and running in fully
distributed mode, the communication is too frequent. The job tracker should be able to send a task to a task
tracker quickly.

Does this lead to security issues?


No, not at all. A Hadoop cluster is an isolated cluster and generally has nothing to do with the internet. It has a
different kind of configuration, so we needn't worry about that kind of security breach, for instance, someone
hacking through the internet, and so on. Hadoop has a very secure way of connecting to other machines to fetch and
process data.

On which port does SSH work?


SSH works on Port No. 22, though it can be configured. 22 is the default Port number.

Can you tell us more about SSH?


SSH is nothing but a secure shell communication, it is a kind of a protocol that works on a Port No. 22, and when you
do an SSH, what you really require is a password.

Why password is needed in SSH localhost?


A password is required in SSH for security and in situations where password-less communication is not set up.

Do we need to give a password, even if the key is added in SSH?


Yes, a password is still required even if the key is added in SSH.

What if a Namenode has no data?


If a Namenode has no data it is not a Namenode. Practically, Namenode will have some data.

What happens to job tracker when Namenode is down?


When the Namenode is down, your cluster is OFF; this is because the Namenode is the single point of failure in HDFS.

What happens to a Namenode, when job tracker is down?


When the job tracker is down, it will not be functional but the Namenode will still be present. So the cluster is accessible if
the Namenode is working, even if the job tracker is not working.

Can you give us some more details about SSH communication between Masters and the
Slaves?
SSH is a password-less secure communication where data packets are sent across the slave. It has some format into
which data is sent across. SSH is not only between masters and slaves but also between two hosts.

What is formatting of the DFS?


Just like we do for Windows, DFS is formatted for proper structuring. It is not usually done as it formats the Namenode
too.

Does the HDFS client decide the input split or Namenode?


No, the Client does not decide. It is already specified in one of the configurations through which input split is already
configured.

In Cloudera there is already a cluster, but if I want to form a cluster on Ubuntu can we do it?
Yes, you can go ahead with this! There are installation steps for creating a new cluster. You can uninstall your present
cluster and install the new cluster.

Can we create a Hadoop cluster from scratch?


Yes we can do that also once we are familiar with the Hadoop environment.

Can we use Windows for Hadoop?


Actually, Red Hat Linux or Ubuntu are the best Operating Systems for Hadoop. Windows is not used frequently for
installing Hadoop as there are many support problems attached with Windows. Thus, Windows is not a preferred
environment for Hadoop.

FAQs For Hadoop MapReduce


What is MapReduce?
It is a framework or a programming model that is used for processing large data sets over clusters of computers using
distributed programming.

What are maps and reduces?


Maps and reduces are the two phases of solving a query in HDFS. Map is responsible for reading data from the input
location and, based on the input type, generating key-value pairs, that is, an intermediate output on the local machine.
The reducer is responsible for processing the intermediate output received from the mapper and generating the final output.

What are the four basic parameters of a mapper?


The four basic parameters of a mapper are LongWritable, Text, Text and IntWritable. The first two represent the input
parameters and the second two represent the intermediate output parameters.

What are the four basic parameters of a reducer?
The four basic parameters of a reducer are Text, IntWritable, Text and IntWritable. The first two represent the intermediate
output parameters and the second two represent the final output parameters.
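A minimal word-count sketch using the old org.apache.hadoop.mapred API illustrates these four types in both roles (the class names here are illustrative, not from the original text):
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {
    // Mapper<LongWritable, Text, Text, IntWritable>: byte offset + line in, (word, 1) out
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;   // skip empty tokens from leading whitespace
                word.set(token);
                output.collect(word, ONE);       // emit the intermediate (Text, IntWritable) pair
            }
        }
    }

    // Reducer<Text, IntWritable, Text, IntWritable>: sums the counts for each word
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();      // aggregate the counts for one key
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}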

What do the master class and the output class do?


Master is defined to update the Master or the job tracker and the output class is defined to write data onto the output
location.

What is the input type/format in MapReduce by default?


By default, the input type in MapReduce is text.

Is it mandatory to set input and output type/format in MapReduce?


No, it is not mandatory to set the input and output type/format in MapReduce. By default, the cluster takes the input
and the output type as text.

What does the text input format do?


In the text input format, each line creates a line object identified by its byte offset, a hexadecimal number. The key is that line
object (the offset) and the value is the whole line of text. This is how the data gets processed by a mapper: the mapper
will receive the key as a LongWritable parameter and the value as a Text parameter.

What does job conf class do?


MapReduce needs to logically separate different jobs running on the same cluster. The JobConf class helps to do job-level
settings, such as declaring a job in a real environment. It is recommended that the job name be descriptive
and represent the type of job that is being executed.

What does conf.setMapperClass do?

conf.setMapperClass sets the mapper class and everything related to the map job, such as reading the data and
generating a key-value pair out of the mapper.
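A hedged driver sketch using the old JobConf API, assuming the WordCount mapper and reducer sketched earlier; input and output paths are placeholders:
// inside the driver's main method; imports come from org.apache.hadoop.mapred.*,
// org.apache.hadoop.fs.Path and org.apache.hadoop.io.*
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");                        // descriptive job name, as recommended above
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(WordCount.Map.class);            // conf.setMapperClass wires in the map logic
conf.setReducerClass(WordCount.Reduce.class);
FileInputFormat.setInputPaths(conf, new Path("/user/data/input"));
FileOutputFormat.setOutputPath(conf, new Path("/user/data/output"));
JobClient.runJob(conf);                              // submit the job and wait for completion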

What do sorting and shuffling do?

Sorting and shuffling are responsible for creating a unique key and a list of values. Bringing similar keys to one
location is known as sorting, and the process by which the intermediate output of the mapper is sorted and sent
across to the reducers is known as shuffling.

What does a split do?

Before the data is transferred from its hard disk location to the map method, there is a phase or method called the split
method. The split method pulls a block of data from HDFS into the framework. The Split class does not write anything; it
reads data from the block and passes it to the mapper. By default, the split is taken care of by the framework, the split size
is equal to the block size, and splits are used to divide a block into a bunch of pieces of work.

How can we change the split size if our commodity hardware has less storage space?
If our commodity hardware has less storage space, we can change the split size by writing the custom splitter. There
is a feature of customization in Hadoop which can be called from the main method.

What does a MapReduce partitioner do?


A MapReduce partitioner makes sure that all the values of a single key go to the same reducer, thus allowing even
distribution of the map output over the reducers. It redirects the mapper output to the reducer by determining which
reducer is responsible for a particular key.
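A hedged sketch of a custom partitioner in the old mapred API; it hashes the key so that every occurrence of a key lands on the same reducer, which is essentially what the default HashPartitioner does (the class name is illustrative):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class WordPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }               // no per-job configuration needed here
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // mask off the sign bit so the partition index is always non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
// wired into the driver with: conf.setPartitionerClass(WordPartitioner.class);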

How is Hadoop different from other data processing tools?


In Hadoop, based upon your requirements, you can increase or decrease the number of mappers without bothering
about the volume of data to be processed. This is the beauty of parallel processing, in contrast to the other data
processing tools available.

Can we rename the output file?


Yes, we can rename the output file by implementing a multiple-output-format class.

Why we cannot do aggregation (addition) in a mapper? Why we require reducer for that?
We cannot do aggregation (addition) in a mapper because sorting is not done in a mapper; sorting happens only on
the reducer side. Mapper initialization depends on each input split, so while doing aggregation we would lose the
value of the previous instance: for each input split a new mapper gets initialized, and thus we do not keep track of the
previous row's value.

What is Streaming?
Streaming is a feature of the Hadoop framework that allows us to write MapReduce programs in any programming language that can accept standard input and produce standard output. It could be Perl, Python or Ruby and need not necessarily be Java. However, deeper customization of MapReduce can only be done using Java and not any other programming language.

What is a Combiner?
A Combiner is a mini reducer that performs the local reduce task. It receives the input from the mapper on a
particular node and sends the output to the reducer. Combiners help in enhancing the efficiency of MapReduce by
reducing the quantum of data that is required to be sent to the reducers.
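Tying the earlier illustrative mapper and reducer sketches together, a minimal driver that also registers the reducer as a combiner could look like this (TokenMapper and SumReducer are the assumed classes from the sketches above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);    // assumed mapper from the earlier sketch
        job.setCombinerClass(SumReducer.class);   // the combiner: a local, mini reducer
        job.setReducerClass(SumReducer.class);    // the same class also serves as the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}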

What is the difference between an HDFS Block and Input Split?


HDFS Block is the physical division of the data and Input Split is the logical division of the data.

What happens in a textinputformat?


In textinputformat, each line in the text file is a record. The key is the byte offset of the line and the value is the content of the line. For instance, key: LongWritable, value: Text.

What do you know about keyvaluetextinputformat?


In keyvaluetextinputformat, each line in the text file is a record, and the line is divided at the first occurrence of the separator character. Everything before the separator is the key and everything after the separator is the value. For instance, key: Text, value: Text.
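A sketch of wiring this up with the Hadoop 2.x mapreduce API; the tab separator and the property name are assumptions that vary between Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Property name as used in Hadoop 2.x; older releases use a different key.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
        Job job = Job.getInstance(conf);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... mapper/reducer and paths would be configured here as usual.
    }
}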

What do you know about Sequencefileinputformat?


SequenceFileInputFormat is an input format for reading sequence files, in which the key and value are user defined. A sequence file is a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.


What do you know about Nlineoutputformat?


Despite the name, the class that does this is NLineInputFormat: it splits the input so that each split contains exactly N lines, and therefore each mapper receives N lines of input.
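Assuming the class actually meant here is NLineInputFormat, a minimal configuration sketch (the value of 100 lines per split is illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);   // each mapper gets 100 input lines
        // ... the rest of the job setup would follow here.
    }
}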

FAQs For Hadoop PIG


Can you give us some examples of how Hadoop is used in a real-time environment?
Let us assume that we have an exam consisting of 10 multiple-choice questions and 20 students appear for that exam. Every student will attempt each question. For each question and each answer option, a key will be generated, so we have a set of key-value pairs for all the questions and all the answer options for every student. Based on the options that the students have selected, you have to analyze and find out how many students have answered correctly. This isn't an easy task. Here Hadoop comes into the picture! Hadoop helps you solve these problems quickly and without much effort. You may also take the case of how many students have wrongly attempted a particular question.

What is BloomMapFile used for?


The BloomMapFile is a class that extends MapFile, so its functionality is similar to MapFile. BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.

What is PIG?
Pig is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs.

What is the difference between logical and physical plans?


Pig undergoes some steps when a Pig Latin Script is converted into MapReduce jobs. After performing the basic
parsing and semantic checking, it produces a logical plan. The logical plan describes the logical operators that have
to be executed by Pig during execution. After this, Pig produces a physical plan. The physical plan describes the
physical operators that are needed to execute the script.

Does ILLUSTRATE run MR job?


No, ILLUSTRATE will not run any MR job; it pulls the internal (sample) data. On the console, ILLUSTRATE will not do any job; it just shows the output of each stage and not the final output.

Is the keyword DEFINE like a function name?


Yes, the keyword DEFINE is like a function name. Once you have registered the jar, you have to define it. Whatever logic you have written in your Java program, you have an exported jar which you register. The compiler will then check for the function in the exported jar; when the function is not present in the library, it looks into your jar.

Is the keyword FUNCTIONAL a User Defined Function (UDF)?



No, the keyword FUNCTIONAL is not a User Defined Function (UDF). While using a UDF, we have to override some functions; certainly we have to do our job with the help of those functions only. But the keyword FUNCTIONAL is a built-in, i.e. pre-defined, function; therefore it does not work as a UDF.

Why do we need MapReduce during Pig programming?


Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, the Pig compiler converts the program into MapReduce jobs. Here, MapReduce acts as the execution engine.

Are there any problems which can only be solved by MapReduce and cannot be solved by Pig? In which kind of scenarios will MR jobs be more useful than Pig?
Let us take a scenario where we want to count the population in two cities. We have a data set and a sensor list of different cities, and we want to count the population of two cities, say Bangalore and Noida, using one MapReduce job. To do that, the key for Bangalore has to be treated as equivalent to the key for Noida, so that the population data of these two cities reaches the same reducer. The idea is to instruct the MapReduce program: whenever you find a record with the city name Bangalore or the city name Noida, create a common alias name for the two cities, so that they share a common key and get passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a key per city, the framework treats every distinct city as a different key. Hence we need a customized partitioner: MapReduce provides the provision to write your own partitioner and state that if the city is Bangalore or Noida, then return the same hash code/partition. However, we cannot plug a custom partitioner into Pig; Pig does not let us direct the execution engine to customize the partitioner at that level. In such scenarios, MapReduce works better than Pig. A sketch of such a partitioner is shown below.
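A minimal sketch of the custom partitioner described above (the class name, the city strings and the value type are illustrative assumptions):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes both "bangalore" and "noida" records to the same reducer by
// mapping them to a common alias before hashing.
public class CityPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String city = key.toString().toLowerCase();
        if (city.equals("bangalore") || city.equals("noida")) {
            city = "bangalore-noida";   // common alias -> common partition
        }
        return (city.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In the driver, it would be registered with job.setPartitionerClass(CityPartitioner.class).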

Does Pig give any warning when there is a type mismatch or missing field?
No, Pig will not show any warning if there is a missing field or a mismatch, and if you assume that Pig gives such a warning, it is difficult to find in the log file. If any mismatch is found, Pig assumes a null value.

What does co-group do in Pig?


Co-group groups each data set by a particular (common) field and then returns a set of records containing two separate bags. The first bag consists of the records of the first data set with the common field value, and the second bag consists of the records of the second data set with the common field value.

Can we say cogroup is a group of more than 1 data set?


Cogroup can be applied to a single data set, but in the case of more than one data set, cogroup groups all the data sets and joins them based on the common field. Hence, we can say that cogroup is both a group of more than one data set and a join of those data sets.

What does FOREACH do?


FOREACH is used to apply transformations to the data and to generate new data items. The name itself indicates that the respective action will be performed for each element of a data bag.



Syntax: FOREACH bagname GENERATE expression1, expression2, ...;
The meaning of this statement is that the expressions mentioned after GENERATE will be applied to the current record of the data bag.

What is a bag?
A bag is one of the data models present in Pig. It is an unordered collection of tuples with possible duplicates. Bags are used to store collections while grouping. The size of a bag is limited only by the size of the local disk: when a bag grows too large, Pig spills it to the local disk and keeps only part of the bag in memory, so there is no necessity for the complete bag to fit into memory. We represent bags with {}.

Real BIG DATA Use Cases


Big Data Exploration
Big Data exploration deals with challenges faced by large organizations, such as information being stored in different systems and needing access to this data to complete day-to-day tasks. Big Data exploration allows you to analyse data and gain valuable insights from it.
Enhanced 360 Customer Views
Enhancing existing customer views helps to gain a complete understanding of customers, addressing questions like why they buy, how they prefer to shop, why they change, what they'll buy next, and what features make them recommend a company to others.
Security/Intelligence Extension
Enhancing cyber security and intelligence analysis platforms with Big Data technologies to process and analyze new data types from social media, emails, sensors and telcos helps reduce risks, detect fraud and monitor cyber security in real time, to significantly improve intelligence, security and law enforcement insights.
Operations Analysis
Operations analysis is about using Big Data technologies to enable a new generation of applications that analyze large volumes of multi-structured data, such as machine and operational data, to improve business. This data can include anything from IT machines to sensors, meters and GPS devices, and it requires complex analysis and correlation across different types of data sets.
Data Warehouse Modernization
Big Data needs to be integrated with data warehouse capabilities to increase operational efficiency. Getting rid of
rarely accessed or old data from warehouse and application databases can be done using information integration
software and tools.


Companies and their Big Data Applications:


Guangdong Mobiles:
A popular mobile group in China, Guangdong uses Hadoop to remove data access bottlenecks and uncover customer usage patterns for precise and targeted market promotions, and Hadoop HBase for automatically splitting data tables across nodes to expand data storage.
Red Sox:
The World Series champs come across huge volumes of structured and unstructured data related to the game, such as data on the weather, the opponent team and pre-game promotions. Big Data allows them to make forecasts about the game and decide how to allocate resources based on expected variations in the upcoming game.
Nokia:
Big Data has helped Nokia make effective use of their data to understand and improve users' experience with their products. The company leverages data processing and complex analyses to build maps with predictive traffic and layered elevation models. Nokia uses Cloudera's Hadoop platform and Hadoop components like HBase, HDFS, Sqoop and Scribe for the above application.
Huawei:
The Huawei OceanStor N8000-Hadoop Big Data solution is developed on an advanced clustered architecture and enterprise-level storage capability, integrated with the Hadoop computing framework. This innovative combination helps enterprises get real-time analysis and processing results from exhaustive data computation and analysis, improves decision-making and efficiency, makes management easier and reduces the cost of networking.
SAS:
SAS has combined with Hadoop to help data scientists transform Big Data into bigger insights. As a result, SAS has come up with an environment that provides a visual and interactive experience, making it easier to gain insights and explore new trends. Its potent analytical algorithms extract valuable insights from the data, while the in-memory technology allows faster access to data.
CERN:
Big Data plays a vital part at CERN, home of the Large Hadron Collider, as it collects an unbelievable amount of data: 40 million pictures per second from its 100-megapixel cameras, which gives out 1 petabyte of data per second. The data from these cameras needs to be analysed. The lab is experimenting with ways to place more data from its experiments in both relational databases and data stores based on NoSQL technologies, such as Hadoop and Dynamo on Amazon's S3 cloud storage service.



Buzzdata:
Buzzdata is working on a Big Data project where it needs to combine all the sources and integrate them in a safe location. This creates a great place for journalists to connect and normalize public data.
Department of Defence:
The Department of Defense (DoD) has invested approximately $250 million in harnessing and utilizing colossal amounts of data to come up with a system that can exercise control, make autonomous decisions and assist analysts in providing support to operations. The department plans to increase its analytical abilities 100-fold, to extract information from texts in any language, with an equivalent increase in the number of objects, activities and events that analysts can analyze.
Defence Advanced Research Projects Agency (DARPA):
DARPA intends to invest approximately $25 million to improve computational techniques and software tools for
analyzing large amounts of semi-structured and unstructured data.
National Institutes of Health:
With 200 terabytes of data contained in the 1000 Genomes Project, it is all set to be a prime example of Big Data. The datasets are so massive that very few researchers have the computational power to analyze them.

Big Data Application Examples in different Industries:


Retail/Consumer:
1. Market Basket Analysis and Pricing Optimization
2. Merchandizing and market analysis
3. Supply-chain management and analytics
4. Behavior-based targeting
5. Market and consumer segmentations

Finances & Frauds Services:
1. Customer segmentation
2. Compliance and regulatory reporting
3. Risk analysis and management
4. Fraud detection and security analytics
5. Medical insurance fraud
6. CRM
7. Credit risk, scoring and analysis
8. Trade surveillance and abnormal trading pattern analysis

Health & Life Sciences:
1. Clinical trials data analysis
2. Disease pattern analysis
3. Patient care quality analysis



4. Drug development analysis

Telecommunications:
1. Price optimization
2. Customer churn prevention
3. Call detail record (CDR) analysis
4. Network performance and optimization
5. Mobile user location analysis

Enterprise Data Warehouse:
1. Enhance EDW by offloading processing and storage
2. Pre-processing hub before getting to EDW

Gaming:
1. Behavioral Analytics

High Tech:
1. Optimize Funnel Conversion
2. Predictive Support
3. Predict Security Threats
4. Device Analytics

FACEBOOK
Facebook today is a world-wide phenomenon that has caught up with young and old alike. Launched in 2004 by a
bunch of Harvard University students, it was least expected to be such a rage. In a span of just a decade, how did it
manage this giant leap?
With around 1.23 billion users and counting, Facebook definitely has an upper hand over other social media websites.
What is the reason behind this success? This blog is an attempt to answer some of these queries.
It is quite evident that the existence of a durable storage system and high technological expertise has contributed to supporting various kinds of user data, like managing messages, applications, personal information, etc., without which all of it would have come to a staggering halt. So what does a website do when its user count exceeds the number of cars in the world? How does it manage such massive data?

Data Centre: The Crux of Facebook


Facebook's data center is spread across an area of 300,000 sq ft with cutting-edge servers and huge memory banks; it has data spread over 23 million ft of fiber optic cables. Their systems are designed to run data at the speed of light, making sure that once a user logs into his profile, everything works fast. With 30 MW of electricity, they have to make sure that they're never out of power. The warehouse stores up to 300 PB of Hive data, with an incoming daily rate of 600 TB.



Every computer is cooled by a heat sink not bigger than a matchbox, but for Facebook's computers, the picture is evidently bigger: spread over a huge field, there are cooling systems and fans that help balance the temperature of these systems. As the count increases, trucks of storage systems keep pouring in on a daily basis and employees are now losing count of them.

Hadoop & Cassandra: The Technology Wizards


The use of Big Data has evolved, and for Facebook's existence Big Data is crucial. A platform as big as this requires a number of technologies that enable it to solve problems and store massive data. Hadoop is one of the many Big Data technologies employed at Facebook, but on its own it is insufficient for a company that is growing every minute of the day. Hadoop is a highly scalable open-source framework that uses clusters of low-cost servers to solve problems. One of the other technologies used and preferred is Cassandra.
Apache Cassandra was initially developed at Facebook to power their Inbox Search feature by two proficient Indians, Avinash Lakshman and Prashant Malik, the former being a co-author of Amazon's Dynamo paper and the latter a techie. It is an open-source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.
Cassandra offers robust support for clusters spanning multiple data centers; hence, Cassandra aims to run on top of an infrastructure of hundreds of nodes. Failures do occur at some point of time, but the manner in which Cassandra manages them makes it possible for anyone to rely on this service.
Facebook, along with other social media websites, avoids using MySQL for this due to the complexity of getting good results. Cassandra has overpowered the rest and has proved its capability in terms of getting quick results. Facebook had originally developed Cassandra to solve the problem of inbox search and to be fast and reliable in handling read and write requests at the same time. Facebook is a platform that instantly helps you connect to people far and near, and for this it requires a system that performs and matches the brand.

WHAT IS HADOOP
So, What exactly is Hadoop?
It is truly said that necessity is the mother of all inventions, and Hadoop is amongst the finest inventions in the world of Big Data! Hadoop had to be developed sooner or later, as there was an acute need for a framework that could handle and process Big Data efficiently.
Technically speaking, Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license and is therefore generally known as Apache Hadoop. Hadoop has been developed based on a paper originally written by Google on the MapReduce system and applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And the charming yellow elephant you see is basically named after Doug's son's toy elephant!


Hadoop Ecosystem:
Once you are familiar with what Hadoop is, let's probe into its ecosystem. The Hadoop ecosystem is nothing but the various components that make Hadoop so powerful, among which HDFS and MapReduce are the core components!

1. HDFS:
The Hadoop Distributed File System (HDFS) is a very robust feature of Apache Hadoop. HDFS is designed to amass gigantic amounts of data unfailingly, to transfer the data at an amazing speed among nodes, and to enable the system to continue working smoothly even if any of the nodes fails to function. HDFS is very competent in storing data for programs, handling its allocation, serving the data for processing and holding the final outcomes. In fact, HDFS manages around 40 petabytes of data at Yahoo! The key components of HDFS are the NameNode, DataNodes and Secondary NameNode.

2. MapReduce:
It all started with Google applying the concept of functional programming to solve the problem of how to manage large amounts of data on the internet. Google named it the MapReduce system, and it was described in a paper published by Google. With the ever-increasing amount of data generated on the web, MapReduce was created in 2004, and Yahoo stepped in to develop Hadoop in order to implement the MapReduce technique. The function of MapReduce is to help in searching and indexing the large quantity of web pages in a matter of a few seconds or even a fraction of a second. The key components of MapReduce are the JobTracker, TaskTrackers and JobHistoryServer.

3. Apache Pig:
Apache Pig is another component of Hadoop, which is used to evaluate huge data sets using a high-level language. In fact, Pig was initiated with the idea of creating and executing commands on Big Data sets. The basic attribute of Pig programs is parallelization, which helps them manage large data sets. Apache Pig consists of a compiler that generates a series of MapReduce programs and a Pig Latin language layer that facilitates SQL-like queries to be run on distributed data in Hadoop.


4. Apache Hive:
As the name suggests, Hive is Hadoop's data warehouse system. It enables quick data summarization for Hadoop, handles queries and evaluates huge data sets that are located in Hadoop's file systems, and also maintains full support for map/reduce. Another striking feature of Apache Hive is that it provides indexes, such as bitmap indexes, in order to speed up queries. Apache Hive was originally developed by Facebook, but now it is developed and used by other companies too, including Netflix.

5. Apache HCatalog
Apache HCatalog is another important component of Apache Hadoop, which provides a table and storage management service for data created with the help of Apache Hadoop. HCatalog offers features like a shared schema and data type mechanism, a table abstraction for users, and smooth functioning across other components of Hadoop such as Pig, MapReduce, Streaming and Hive.

6. Apache HBase
HBase is short for Hadoop DataBase. HBase is a distributed, column-oriented database that uses HDFS for storage. On one hand it manages batch-style computations using MapReduce, and on the other hand it handles point queries (random reads). The key components of Apache HBase are the HBase Master and the RegionServers.

7. Apache Zookeeper
Apache ZooKeeper is another significant part of the Hadoop ecosystem. Its major function is to keep a record of configuration information and naming, and to provide distributed synchronization and group services, which are immensely crucial for various distributed systems. In fact, HBase depends upon ZooKeeper for its functioning.

WHY HADOOP
Hadoop can be contagious: its implementation in one organization can lead to another one elsewhere. Thanks to Hadoop being robust and cost-effective, handling humongous data seems much easier now. The ability to include HIVE in an EMR workflow is yet another awesome point; it's incredibly easy to boot up a cluster, install HIVE, and be doing simple SQL analytics in no time. Let's take a look at why Hadoop can be so incredible.

Key features that answer Why Hadoop?


1. Flexible:
As it is a known fact that only 20% of the data in organizations is structured and the rest is all unstructured, it is very crucial to manage the unstructured data, which otherwise goes unattended. Hadoop manages different types of Big Data, whether structured or unstructured, encoded or formatted, or any other type of data, and makes it useful for the decision-making process. Moreover, Hadoop is simple, relevant and schema-less! Though Hadoop generally supports Java programming, any programming language can be used in Hadoop with the help of the MapReduce technique.



Though Hadoop works best on Windows and Linux, it can also work on other operating systems like BSD and OS
X.

2. Scalable
Hadoop is a scalable platform, in the sense that new nodes can be easily added to the system as and when required, without altering the data formats, how data is loaded, how programs are written, or even modifying the existing applications. Hadoop is an open-source platform and runs on industry-standard hardware. Moreover, Hadoop is also fault tolerant: even if a node gets lost or goes out of service, the system automatically reallocates work to another location of the data and continues processing as if nothing had happened!

3. Building more efficient data economy:


Hadoop has revolutionized the processing and analysis of Big Data across the world. Till now, organizations were worrying about how to manage the non-stop data overflowing into their systems. Hadoop is more like a dam, harnessing the flow of an unlimited amount of data and generating a lot of power in the form of relevant information.
Hadoop has changed the economics of storing and evaluating data entirely!

4. Robust Ecosystem:
Hadoop has a very robust and rich ecosystem that is well suited to meet the analytical needs of developers, web start-ups and other organizations. The Hadoop ecosystem consists of various related projects such as MapReduce, Hive, HBase, ZooKeeper, HCatalog and Apache Pig, which make Hadoop very competent to deliver a broad spectrum of services.

5. Hadoop is getting more Real-Time!


Did you ever wonder how to stream information into a cluster and analyze it in real time? Hadoop has the answer for it. Yes, Hadoop's competencies are getting more and more real-time. Hadoop also provides a standard approach to a wide set of APIs for Big Data analytics comprising MapReduce, query languages and database access, and so on.

6. Cost Effective:
Loaded with such great features, the icing on the cake is that Hadoop generates cost benefits by bringing massively parallel computing to commodity servers, resulting in a substantial reduction in the cost per terabyte of storage, which in turn makes it reasonable to model all your data. The basic idea behind Hadoop is to perform cost-effective analysis of the data present across the world wide web!

7. Upcoming Technologies using Hadoop:


While reinforcing its capabilities, Hadoop is leading to phenomenal technical advancements. For instance, HBase will soon become a vital platform for blob stores (Binary Large Objects) and for lightweight OLTP (Online Transaction Processing). Hadoop has also begun serving as a strong foundation for new-school graph and NoSQL databases, and for better versions of relational databases.

8. Hadoop is getting cloudy!



Hadoop is getting cloudier! In fact, cloud computing and Hadoop are coming together in several organizations to manage Big Data. Hadoop will become one of the most required apps for cloud computing. This is evident from the number of Hadoop clusters offered by cloud vendors in various businesses. Thus, Hadoop will reside in the cloud soon!
Now you know why Hadoop is gaining so much popularity!
The importance of Hadoop is evident from the fact that there are many global MNCs that are using Hadoop and consider it an integral part of their functioning. It is a misconception that social media companies alone use Hadoop. In fact, many other industries now use Hadoop to manage BIG DATA!

Use Cases Of Hadoop


It was Yahoo! Inc. that developed the world's biggest application of Hadoop on February 19, 2008. In fact, if you've heard of the Yahoo! Search Webmap, it is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and generates data that is now extensively used in every query of Yahoo! Web search.
Facebook has over 1.3 billion active users, and it is Hadoop that brings respite to Facebook in storing and managing data of such magnitude. Hadoop helps Facebook keep track of all the profiles stored in it, along with the related data such as posts, comments, images, videos, and so on.
LinkedIn manages over 1 billion personalized recommendations every week, all thanks to Hadoop and its MapReduce and HDFS features!
Hadoop is at its best when it comes to analyzing Big Data. This is why companies like Rackspace use Hadoop.



Hadoop plays an equally competent role in analyzing huge volumes of data generated by scientifically driven companies like Spadac.com.
Hadoop is a great framework for advertising companies as well: it keeps good track of the millions of clicks on ads and how users are responding to the ads posted by the big ad agencies!

HYPE Behind BIG DATA


Big Data!
We come across data in every possible form, whether through social media sites, sensor networks, digital images or videos, cellphone GPS signals, purchase transaction records, web logs, medical records, archives, military surveillance, e-commerce or complex scientific research, and it amounts to some quintillion bytes of data! This data is what we call BIG DATA!
Big Data is nothing but an assortment of such huge and complex data that it becomes very tedious to capture, store, process, retrieve and analyze it with on-hand database management tools or traditional data processing techniques. In fact, the concept of BIG DATA may vary from company to company depending upon its size, capacity, competence, human resources, techniques and so on. For some companies it may be a cumbersome job to manage a few gigabytes, while for others it may be some terabytes creating a hassle in the entire organization.

The Four Vs Of Big Data


1. Volume: BIG DATA is clearly determined by its volume. It could amount to hundreds of terabytes or even petabytes of information. For instance, 15 terabytes of Facebook posts or 400 billion annual medical records could mean Big Data!
2. Velocity: Velocity means the rate at which data is flowing into the companies. Big Data requires fast processing, and the time factor plays a very crucial role in several organizations. For instance, processing 2 million records at the share market or evaluating the results of millions of students who applied for competitive exams could mean Big Data!
3. Variety: Big Data may not belong to a specific format. It could be in any form such as structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D models, etc. New research shows that a substantial amount of an organization's data is not numeric; however, such data is equally important for the decision-making process. So, organizations need to think beyond stock records, documents, personnel files, finances, etc.
4. Veracity: Veracity refers to the uncertainty of the data available. Data can sometimes get messy and may be difficult to trust. With many forms of big data, quality and accuracy are difficult to control, as with Twitter posts with hashtags, abbreviations, typos and colloquial speech. But big data and analytics technology now permit working with these types of data. The volumes often make up for the lack of quality or accuracy. Due to the uncertainty of data, 1 in 3 business leaders don't trust the information they use to make decisions.

Big Data Opportunities


Why is it important to harness Big Data?
Data has never been as crucial as it is today. In fact, we can see a transition from the old saying "Customer is King" to "Data is King"! This is because, for efficient decision making, it is very important to analyze the right amount and the right type of data. Healthcare, banking, the public sector, pharmaceuticals or IT: all need to look beyond the concrete data stored in their databases and study the intangible data in the form of sensors, images, weblogs, etc. In fact, what sets smart organizations apart from others is their ability to scan data effectively to allocate resources properly, increase productivity and inspire innovation!
Why Big Data analysis is crucial:
1. Just like labor and capital, data has become one of the factors of production in almost all the industries.
2. Big data can unveil some really useful and crucial information which can change decision making process entirely
to a more fruitful one.
3. Big data makes customer segmentation easier and more visible, enabling the companies to focus on more profitable
and loyal customers.
4. Big data can be an important criterion for deciding upon the next line of products and services required by future customers. Thus, companies can follow a proactive approach at every step.
5. The way in which big data is explored and used can directly impact the growth and development of organizations and give tough competition to others in the row! Data-driven strategies are soon becoming the latest trend at the management level!
How to Harness Big Data?
As the name suggests, it is not an easy task to capture, store, process and analyze big data. Optimizing big data is a daunting affair that requires a robust infrastructure and state-of-the-art technology which should take care of the privacy, security, intellectual property and even liability issues related to big data. Big data will help you answer those questions that have been lingering for a long time! It is not the amount of big data that matters most; it is what you are able to do with it that draws the line between the achievers and the losers.
Some Recent Technologies:
Companies are relying on the following technologies to do big data analysis:
1. Speedy and efficient processors.
2. Modern storage and processing technologies, especially for unstructured data
3. Robust server processing capacities



4. Cloud computing
5. Clustering, high connectivity, parallel processing, MPP
6. Apache Hadoop/ Hadoop Big Data

Apache MapReduce & HDFS

Introduction to Apache MapReduce and HDFS


Apache Hadoop originated from Google's whitepapers:
1. Apache HDFS is derived from GFS (the Google File System).
2. Apache MapReduce is derived from Google MapReduce.
3. Apache HBase is derived from Google BigTable.
Though Google only provided the whitepapers, without any implementation, around 90-95% of the architecture presented in these whitepapers is applied in these three Java-based Apache projects.
HDFS and MapReduce are the two major components of Hadoop, where HDFS is from the infrastructural point of view and MapReduce is from the programming aspect. Though HDFS is at present a subproject of Apache Hadoop, it was originally developed as an infrastructure for the Apache Nutch web search engine project.
To understand the magic behind the scalability of Hadoop from a one-node cluster to a thousand-node cluster



(Yahoo! has a 4,500-node cluster managing 40 petabytes of enterprise data), we need to first understand Hadoop's file system, that is, HDFS (the Hadoop Distributed File System).

What is HDFS (Hadoop Distributed File System)?


HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Though it has many similarities with existing traditional distributed file systems, there are noticeable differences between them. Let's look into some of the assumptions and goals/objectives behind HDFS, which also form some of the striking features of this incredible file system!

Assumptions and Goals/Objectives behind HDFS:


1. Large Data Sets:
It is assumed that HDFS always needs to work with large data sets. It would be an underplay if HDFS were deployed to process several small data sets of a few megabytes or even a few gigabytes. The architecture of HDFS is designed in such a way that it is the best fit to store and retrieve huge amounts of data. What is required is high aggregate data bandwidth and the scalability to spread out from a single-node cluster to a hundred- or a thousand-node cluster. The acid test is that HDFS should be able to manage tens of millions of files in a single instance.

2. Write Once, Read Many Model:


HDFS follows the write-once, read-many approach for its files and applications. It assumes that a file in HDFS, once written, will not be modified, though it can be accessed any number of times (future versions of Hadoop may support modification too)! At present, HDFS strictly has one writer at any time. This assumption enables high-throughput data access and also simplifies data coherency issues. A web crawler or a MapReduce application is best suited for HDFS.

3. Streaming Data Access:


As HDFS works on the principle of 'write once, read many', the feature of streaming data access is extremely important in HDFS, because HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data set is more important than the time taken to fetch a single record from the data. HDFS overlooks a few POSIX requirements in order to implement streaming data access.

4. Commodity Hardware:
HDFS (Hadoop Distributed File System) assumes that the cluster(s) will run on common hardware, that is, inexpensive, ordinary machines rather than high-availability systems. A great feature of Hadoop is that it can be installed on any average commodity hardware; we don't need supercomputers or high-end hardware to work on Hadoop. This leads to an overall cost reduction to a great extent.

5. Data Replication and Fault Tolerance:




HDFS works on the assumption that hardware is bound to fail at some point of time or other, which disrupts the smooth and quick processing of large volumes of data. To overcome this obstacle, in HDFS the files are divided into large blocks of data and each block is stored on three nodes: two on the same rack and one on a different rack for fault tolerance. A block is the unit of data stored on a DataNode. Though the default block size is 64 MB and the replication factor is three, both are configurable per file. This redundancy enables robustness, fault detection, quick recovery and scalability, eliminates the need for RAID storage on hosts, and brings the merits of data locality.
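As a hedged illustration of the 'configurable per file' point, the HDFS Java API lets a client choose a replication factor and block size when a file is created; the path and sizes below are purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerFileSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(
                new Path("/data/example.txt"),   // illustrative path
                true,                            // overwrite if it already exists
                4096,                            // I/O buffer size in bytes
                (short) 3,                       // replication factor for this file
                64L * 1024 * 1024);              // block size for this file (64 MB)
        out.writeUTF("hello hdfs");              // write something so a block is allocated
        out.close();
        fs.close();
    }
}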

6. High Throughput:
Throughput is the amount of work done in unit time. It describes how fast data is being accessed from the system, and it is usually used to measure the performance of the system. In Hadoop HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute the tasks assigned to them independently and in parallel, and the work is completed in a very short period of time. In this way, Apache HDFS gives good throughput: by reading data in parallel, we decrease the actual time to read data tremendously.

7. Moving Computation is better than Moving Data:


Hadoop HDFS works on the principle that if a computation is done by an application near the data it operates on, it is much more efficient than when done far off, particularly when there are large data sets. The major advantage is a reduction in network congestion and increased overall throughput of the system. The assumption is that it is often better to locate the computation closer to where the data is located rather than moving the data to the application space. To facilitate this, Apache HDFS provides interfaces for applications to relocate themselves nearer to where the data is located.

8. File System Namespace:


A traditional hierarchical file organization is followed by HDFS, where any user or application can create directories and store files inside these directories. Thus, HDFS's file system namespace hierarchy is similar to most other existing file systems, where one can create and delete files, relocate a file from one directory to another, or even rename a file. In general, HDFS does not support hard links or soft links, though these can be implemented if the need arises.
Thus, HDFS works on these assumptions and goals in order to help the user access or process large data sets within an incredibly short period of time!
