This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on the History of Hadoop.
1. IBM and ________ have announced a major initiative to use Hadoop to support university
courses in distributed computer programming.
a) Google Latitude
b) Android (operating system)
c) Google Variations
d) Google
Answer:d
Explanation:Google and IBM announced a joint university initiative to address internet-scale computing challenges.
2. Point out the correct statement :
a) Hadoop is an ideal environment for extracting and transforming small volumes of data
b) Hadoop stores data in HDFS and supports data compression/decompression
c) The Giraph framework is less useful than a MapReduce job to solve graph and machine learning problems
d) None of the mentioned
Answer:b
Explanation:Data compression can be achieved using compression algorithms like bzip2, gzip,
LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.
3. What license is Hadoop distributed under ?
a) Apache License 2.0
b) Mozilla Public License
c) Shareware
d) Commercial
Answer:a
Explanation:Hadoop is Open Source, released under Apache 2 license.
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional
Hadoop cluster using a live CD.
a) OpenOffice.org
b) OpenSolaris
c) GNU
d) Linux

Answer:b
Explanation: The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.
5. Which of the following genres does Hadoop produce ?
a) Distributed file system
b) JAX-RS
c) Java Message Service
d) Relational Database Management System
Answer:a
Explanation: The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications.
6. What was Hadoop written in ?
a) Java (software platform)
b) Perl
c) Java (programming language)
d) Lua (programming language)
Answer:c
Explanation: The Hadoop framework itself is mostly written in the Java programming language,
with some native code in C and command line utilities written as shell-scripts.
7. Which of the following platforms does Hadoop run on ?
a) Bare metal
b) Debian
c) Cross-platform
d) Unix-like
Answer:c
Explanation:Hadoop supports cross-platform operating systems.
8. Hadoop achieves reliability by replicating the data across multiple hosts, and hence does not
require ________ storage on hosts.
a) RAID
b) Standard RAID levels
c) ZFS
d) Operating system

Answer:a
Explanation:With the default replication value, 3, data is stored on three nodes: two on the same
rack, and one on a different rack.
9. Above the file systems comes the ________ engine, which consists of one Job Tracker, to
which client applications submit MapReduce jobs.
a) MapReduce
b) Google
c) Functional programming
d) Facebook
Answer:a
Explanation:The MapReduce engine is used to distribute work around a cluster.

10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and
matrix operations.
a) Machine learning
b) Pattern recognition
c) Statistical classification
d) Artificial intelligence
Answer:a
Explanation: The Apache Mahout project's goal is to build a scalable machine learning tool.
1. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
Answer:d
Explanation:Adding security to Hadoop is challenging because all the interactions do not follow the classic client-server pattern.
2. Point out the correct statement :
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real-time data
c) In the Hadoop programming framework, output files are divided into lines or records

d) None of the mentioned


Answer:b
Explanation:Hadoop batch-processes data distributed over a number of computers ranging in the hundreds and thousands.
3. According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop ?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Answer:a
Explanation:Data warehousing integrated with Hadoop would give a better understanding of the data.
4. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
Answer:a
Explanation:To use Hive with HBase you'll typically want to launch two clusters, one to run HBase and the other to run Hive.
5. Point out the wrong statement :
a) Hadoop's processing capabilities are huge and its real advantage lies in the ability to process terabytes & petabytes of data
b) Hadoop uses a programming model called MapReduce; all programs should conform to this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned
Answer:c
Explanation: The programming model, MapReduce, used by Hadoop is simple to write and test.
6. What was Hadoop named after?
a) Creator Doug Cutting's favorite circus act
b) Cutting's high school rock band
c) The toy elephant of Cutting's son
d) A sound Cutting's laptop made during Hadoop's development
Answer:c
Explanation:Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
7. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
Answer:b
Explanation:Apache Hadoop is an open-source software framework for distributed storage and
distributed processing of Big Data on clusters of commodity hardware.
8. __________ can best be described as a programming model used to develop Hadoop-based
applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
Answer:a
Explanation:MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm.
9. __________ has the world's largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
Answer:c
Explanation:Facebook has many Hadoop clusters; the largest among them is the one used for data warehousing.
10. Facebook Tackles Big Data With _______ based on Hadoop.
a) Project Prism

b) Prism
c) Project Big
d) Project Data
Answer:a
Explanation:Prism automatically replicates and moves data wherever it's needed across a vast network of computing facilities.
Hadoop Questions and Answers Hadoop Ecosystem
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop
Ecosystem.
1. ________ is a platform for constructing data flows for extract, transform, and load (ETL)
processing and analysis of large datasets.
a) Pig Latin
b) Oozie
c) Pig
d) Hive
Answer:c
Explanation:Apache Pig is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs.
2. Point out the correct statement :
a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to
querying data
b) Hive is a relational database with SQL support
c) Pig is a relational database with SQL support
d) All of the mentioned
Answer:a
Explanation:Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible
file systems.
3. _________ hides the limitations of Java behind a powerful and concise Clojure API for
Cascading.
a) Scalding
b) HCatalog
c) Cascalog

d) All of the mentioned


Answer:c
Explanation:Cascalog also adds Logic Programming concepts inspired by Datalog. Hence the
name Cascalog is a contraction of Cascading and Datalog.
4. Hive also supports custom extensions written in :
a) C#
b) Java
c) C
d) C++
Answer:b
Explanation:Hive also supports custom extensions written in Java, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats.
5. Point out the wrong statement :
a) Elastic MapReduce (EMR) is Facebook's packaged Hadoop offering
b) Amazon Web Services Elastic MapReduce (EMR) is Amazon's packaged Hadoop offering
c) Scalding is a Scala API on top of Cascading that removes most Java boilerplate
d) All of the mentioned
Answer:a
Explanation:Rather than building Hadoop deployments manually on EC2 (Elastic Compute
Cloud) clusters, users can spin up fully configured Hadoop installations using simple invocation
commands, either through the AWS Web Console or through command-line tools.
6. ________ is the most popular high-level Java API in the Hadoop ecosystem.
a) Scalding
b) HCatalog
c) Cascalog
d) Cascading
Answer:d
Explanation:Cascading hides many of the complexities of MapReduce programming behind
more intuitive pipes and data flow abstractions.
7. ___________ is a general-purpose computing model and runtime system for distributed data analytics.
a) Mapreduce

b) Drill
c) Oozie
d) None of the mentioned
Answer:a
Explanation:Mapreduce provides a flexible and scalable foundation for analytics, from
traditional reporting to leading-edge machine learning algorithms.
8. The Pig Latin scripting language is not only a higher-level data flow language but also has
operators similar to :
a) SQL
b) JSON
c) XML
d) All of the mentioned
Answer:a
Explanation:Pig Latin, in essence, is designed to fill the gap between the declarative style of SQL
and the low-level procedural style of MapReduce.
9. _______ jobs are optimized for scalability but not latency.
a) Mapreduce
b) Drill
c) Oozie
d) Hive
Answer:d
Explanation:Hive Queries are translated to MapReduce jobs to exploit the scalability of
MapReduce.

10. ______ is a framework for performing remote procedure calls and data serialization.
a) Drill
b) BigTop
c) Avro
d) Chukwa
Answer:c
Explanation:In the context of Hadoop, Avro can be used to pass data from one program or
language to another.

Hadoop Questions and Answers Introduction to Mapreduce


This set of Multiple Choice Questions & Answers (MCQs) focuses on MapReduce.
1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the
JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
Answer:c
Explanation:The TaskTracker receives the information necessary for execution of a Task from the JobTracker, executes the Task, and sends the results back to the JobTracker.
2. Point out the correct statement :
a) MapReduce tries to place the data and the compute as close as possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned
Answer:a
Explanation:This feature of MapReduce is called Data Locality.
3. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Answer:a
Explanation:Map Task in MapReduce is performed using the Map() function.
4. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer

d) All of the mentioned


Answer:a
Explanation:Reduce function collates the work and resolves the results.
5. Point out the wrong statement :
a) A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner
b) The MapReduce framework operates exclusively on pairs
c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
d) None of the mentioned
Answer:d
Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
6. Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in :
a) Java
b) C
c) C#
d) None of the mentioned
Answer:a
Explanation:Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non-JNI based).
7. ________ is a utility which allows users to create and run jobs with any executables as the
mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer:b
Explanation:Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.

8. __________ maps input key/value pairs to a set of intermediate key/value pairs.


a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned
Answer:a
Explanation:Maps are the individual tasks that transform input records into intermediate records.
9. The number of maps is usually driven by the total size of :
a) inputs
b) outputs
c) tasks
d) None of the mentioned
Answer:a
Explanation:Total size of inputs means total number of blocks of the input files.
10. _________ is the default Partitioner for partitioning key space.
a) HashPar
b) Partitioner
c) HashPartitioner
d) None of the mentioned
Answer:c
Explanation: The default partitioner in Hadoop is the HashPartitioner, which has a getPartition() method that decides which partition a given key/value pair goes to.
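For reference, a minimal sketch of a partitioner in the classic org.apache.hadoop.mapred API that routes keys the same way HashPartitioner does (the class name HashLikePartitioner is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class HashLikePartitioner implements Partitioner<Text, IntWritable> {
    // Same idea as HashPartitioner: mask off the sign bit, then take the
    // modulo of the number of reducers so every key maps to one partition.
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public void configure(JobConf job) {
        // No configuration needed for this sketch.
    }
}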
Hadoop Questions and Answers Analyzing Data with Hadoop
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Analyzing Data
with Hadoop.
1. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned

Answer:b
Explanation:Mapper implementations override the JobConfigurable.configure method to initialize themselves.
2. Point out the correct statement :
a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by
the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value)
format
d) All of the mentioned
Answer: d
Explanation:Reporters can be used to set application-level status messages and update Counters.
3. Input to the _______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
Answer:a
Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.
4. The right number of reduces seems to be :
a) 0.90
b) 0.80
c) 0.36
d) 0.95
Answer:d
Explanation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (number of nodes * mapred.tasktracker.reduce.tasks.maximum).
5. Point out the wrong statement :
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the

same key) in sort stage


Answer:a
Explanation:Reducer has 3 primary phases: shuffle, sort and reduce.
6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer:d
Explanation: The output of the reduce task is typically written to the FileSystem. The output of
the Reducer is not sorted.
7. Which of the following phases occur simultaneously ?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
Answer:a
Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.
8. Mapper and Reducer implementations can use the ________ to report progress or just indicate
that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer:c
Explanation:Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
9. __________ is a generalization of the facility provided by the MapReduce framework to
collect data output by the Mapper or the Reducer
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned

Answer:b
Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop
framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Answer:b
Explanation:JobConf represents a MapReduce job configuration.
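As an illustration, a minimal sketch of describing a job through JobConf in the classic mapred API, using the bundled IdentityMapper and IdentityReducer so the example stays self-contained (input and output paths come from the command line):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityJobDriver.class);
        conf.setJobName("identity-job");
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setNumReduceTasks(2);                 // reduce count is set by the user
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);                    // submit the job and wait
    }
}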
Hadoop Questions and Answers Scaling out in Hadoop
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Scaling out in
Hadoop.
1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in
the nodes.
a) NoSQL
b) NewSQL
c) SQL
d) All of the mentioned
Answer:a
Explanation: NoSQL systems make the most sense whenever the application is based on data
with varying data types and the data can be stored in key-value notation.
2. Point out the correct statement :
a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
b) HDFS runs on a small cluster of commodity-class nodes
c) NEWSQL is frequently the collection point for big data
d) None of the mentioned
Answer:a
Explanation:Hadoop, together with a relational data warehouse, can form a very effective data warehouse infrastructure.

3. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record
values with schema applied on read based on:
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer:a
Explanation:Other means of tagging the values also can be used.
4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional
Hadoop deployments
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
Answer:b
Explanation:Enterprise data protection and security options, including file system auditing and data-at-rest encryption to address compliance requirements, are also provided by the Isilon solution.
5. Point out the wrong statement :
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and
highly efficient storage platform
b) Isilon's native HDFS integration means you can avoid the need to invest in a separate Hadoop
infrastructure
c) NoSQL systems do provide high latency access and accommodate less concurrent users
d) None of the mentioned
Answer:c
Explanation:NoSQL systems do provide low latency access and accommodate many concurrent
users.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to :
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned

Answer:a
Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to
increase performance (scale-out) but even they require node configuration with elements of scale
up.
7. Which is the most popular NoSQL database for scalable big data store with Hadoop ?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
Answer:a
Explanation:HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables (billions of rows by millions of columns) on clusters built with commodity hardware.
8. The ___________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Answer:c
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
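For illustration, a minimal sketch of using the classic DistributedCache API to ship a jar and a native library to the map/reduce tasks; the HDFS paths are placeholders and must already exist:

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    public static void configureCache(JobConf conf) throws Exception {
        // Add a jar to the classpath of the map/reduce tasks (path is illustrative).
        DistributedCache.addFileToClassPath(new Path("/lib/myjob.jar"), conf);
        // Ship a native library and symlink it into the task's working directory.
        DistributedCache.addCacheFile(new URI("/native/libfoo.so#libfoo.so"), conf);
        DistributedCache.createSymlink(conf);
    }
}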
9. HBase provides ___________ like capabilities on top of Hadoop and HDFS.
a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
Answer:c
Explanation: Google Bigtable leverages the distributed data storage provided by the Google File
System.
10. _______ refers to incremental costs with no major impact on solution design, performance
and complexity.
a) Scale-out

b) Scale-down
c) Scale-up
d) None of the mentioned
Answer:c
Explanation:Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part of a cluster does not require additional network switches.
Hadoop Questions and Answers Hadoop Streaming
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop
Streaming.
1. Streaming supports streaming command options as well as _________ command options.
a) generic
b) tool
c) library
d) task
Answer:a
Explanation:Place the generic options before the streaming options, otherwise the command will
fail.
2. Point out the correct statement :
a) You can specify any executable as the mapper and/or the reducer
b) You cannot supply a Java class as the mapper and/or the reducer
c) The class you supply for the output format should return key/value pairs of Text class
d) All of the mentioned
Answer:a
Explanation:If you do not specify an input format class, the TextInputFormat is used as the
default.
3. Which of the following Hadoop streaming command option parameter is required ?
a) output directoryname
b) mapper executable
c) input directoryname
d) All of the mentioned

Answer:d
Explanation:The required parameters specify the input and output locations and the mapper and reducer executables.
4. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Answer:c
Explanation:An environment variable is set using the -cmdenv option.
5. Point out the wrong statement :
a) Hadoop has a library package called Aggregate
b) Aggregate allows you to define a mapper plugin class that is expected to generate
aggregatable items for each input key/value pair of the mappers
c) To use Aggregate, simply specify -mapper aggregate
d) None of the mentioned
Answer:c
Explanation:To use Aggregate, simply specify -reducer aggregate.
6. The ________ option allows you to copy jars locally to the current working directory of tasks
and automatically unjar the files.
a) archives
b) files
c) task
d) None of the mentioned
Answer:a
Explanation:The -archives option is also a generic option.
7. ______________ class allows the Map/Reduce framework to partition the map outputs based
on certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
d) None of the mentioned

Answer:b
Explanation: The primary key is used for partitioning, and the combination of the primary and
secondary keys is used for sorting.
8. Which of the following class provides a subset of features provided by the Unix/GNU Sort ?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
d) All of the mentioned
Answer:c
Explanation:Hadoop has a library class, KeyFieldBasedComparator, that is useful for many
applications.
9. Which of the following class is provided by Aggregate package ?
a) Map
b) Reducer
c) Reduce
d) None of the mentioned
Answer:b
Explanation:Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as sum, max, min and so on over a
sequence of values.
10.Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ______ utility.
a) Copy
b) Cut
c) Paste
d) Move
Answer:b
Explanation: The map function defined in the class treats each input key/value pair as a list of
fields.
Hadoop Questions and Answers Introduction to HDFS
This set of Multiple Choice Questions & Answers (MCQs) focuses on Hadoop Filesystem
HDFS.

1. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
Answer:b
Explanation:All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.
2. Point out the correct statement :
a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
tolerance
d) None of the mentioned
Answer:a
Explanation:There can be any number of DataNodes in a Hadoop Cluster.
3. HDFS works in a __________ fashion.
a) master-worker
b) master-slave
c) worker/slave.
d) All of the mentioned
Answer:a
Explanation:The NameNode serves as the master and each DataNode serves as a worker/slave.
4. ________ NameNode is used when the Primary NameNode goes down.
a) Rack
b) Data
c) Secondary
d) None of the mentioned
Answer:c
Explanation:Secondary namenode is used for all time availability and reliability.
5. Point out the wrong statement :
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file

level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
Answer:d
Explanation: The NameNode, not the DataNode, is aware of the files to which the blocks stored on a DataNode belong.
6. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
Answer:a
Explanation:HDFS can be used for storing archive data since it is cheaper, as HDFS allows storing the data on low-cost commodity hardware while ensuring a high degree of fault-tolerance.
7. The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Answer:d
Explanation:Data is replicated across different DataNodes to ensure a high degree of fault-tolerance.
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
Answer:a
Explanation: A DataNode stores data in the Hadoop File System. A functional filesystem has more than one DataNode, with data replicated across them.

9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) HDFS Shell
b) FS Shell
c) DFS Shell
d) None of the mentioned
Answer:b
Explanation: The File System (FS) shell includes various shell-like commands that directly
interact with the Hadoop Distributed File System (HDFS).
10. HDFS is implemented in _____________ programming language.
a) C++
b) Java
c) Scala
d) None of the mentioned
Answer:b
Explanation:HDFS is implemented in Java and any computer which can run Java can host a
NameNode/DataNode on it.
Hadoop Questions and Answers Java Interface
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Java Interface.
1. In order to read any file in HDFS, an instance of __________ is required.
a) filesystem
b) datastream
c) outstream
d) inputstream
Answer:a
Explanation:The FileSystem's open() method returns an FSDataInputStream, which is used to read data from the file.
2. Point out the correct statement :
a) The framework groups Reducer inputs by keys
b) The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are
merged
c) Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate
keys are grouped, these can be used in conjunction to simulate secondary sort on values
d) All of the mentioned

Answer:d
Explanation:If equivalence rules for keys while grouping the intermediates are different from
those for grouping keys before reduction, then one may specify a Comparator.
3. ______________ provides a method to copy bytes from an input stream to any other stream in Hadoop.
a) IOUtils
b) Utils
c) IUtils
d) All of the mentioned
Answer:a
Explanation:The IOUtils class provides static utility methods for I/O; copyBytes() copies data from one stream to another.
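Putting the last two questions together, a minimal sketch of reading an HDFS file: obtain a FileSystem instance, open the file, and copy its bytes to standard output with IOUtils.copyBytes():

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);        // FileSystem instance
        InputStream in = null;
        try {
            in = fs.open(new Path(args[0]));         // returns an FSDataInputStream
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}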
4. _____________ is used to read data from byte buffers.
a) write()
b) read()
c) readwrite()
d) All of the mentioned
Answer:b
Explanation:The readFully() method can also be used instead of the read() method.
5. Point out the wrong statement :
a) The framework calls reduce method for each pair in the grouped inputs
b) The output of the Reducer is re-sorted
c) reduce method reduces values for a given key
d) None of the mentioned
Answer:b
Explanation: The output of the Reducer is not re-sorted.
6. Interface ____________ reduces a set of intermediate values which share a key to a smaller
set of values.
a) Mapper
b) Reducer
c) Writable
d) Readable
Answer:b
Explanation:Reducer implementations can access the JobConf for the job.

7. Reducer is input the grouped output of a :


a) Mapper
b) Reducer
c) Writable
d) Readable
Answer:a
Explanation:In the shuffle phase, the framework fetches the relevant partition of the output of all the Mappers, via HTTP, for each Reducer.
8. The output of the reduce task is typically written to the FileSystem via :
a) OutputCollector
b) InputCollector
c) OutputCollect
d) All of the mentioned
Answer:a
Explanation:In reduce phase the reduce(Object, Iterator, OutputCollector, Reporter) method is
called for each pair in the grouped inputs.
9. Applications can use the _________ provided to report progress or just indicate that they are
alive.
a) Collector
b) Reporter
c) Dashboard
d) None of the mentioned
Answer:b
Explanation:In scenarios where the application takes a significant amount of time to process
individual key/value pairs, this is crucial since the framework might assume that the task has
timed-out and kill that task.
10. Which of the following parameters is used to collect keys and combined values ?
a) key
b) values
c) reporter
d) output

Answer:d
Explanation:The output parameter collects keys and combined values, while the reporter parameter provides the facility to report progress.
Hadoop Questions and Answers Data Flow
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Data Flow.
1. ________ is a programming model designed for processing large volumes of data in parallel
by dividing the work into a set of independent tasks.
a) Hive
b) MapReduce
c) Pig
d) Lucene
Answer:b
Explanation:MapReduce is the heart of Hadoop.
2. Point out the correct statement :
a) Data locality means movement of algorithm to the data instead of data to algorithm
b) When the processing is done on the data, the algorithm is moved across the Action Nodes rather than data to the algorithm
c) Moving Computation is more expensive than Moving Data
d) None of the mentioned
Answer:a
Explanation:Data flow framework possesses the feature of data locality.
3. The daemons associated with the MapReduce phase are ________ and task-trackers.
a) job-tracker
b) map-tracker
c) reduce-tracker
d) All of the mentioned
Answer:a
Explanation:Map-Reduce jobs are submitted on job-tracker.
4. The JobTracker pushes work out to available _______ nodes in the cluster, striving to keep the
work as close to the data as possible
a) DataNodes
b) TaskTracker
c) ActionNodes

d) All of the mentioned


Answer:b
Explanation:A heartbeat is sent from the TaskTracker to the JobTracker every few minutes so the JobTracker can check whether the node is dead or alive.
5. Point out the wrong statement :
a) The map function in Hadoop MapReduce has the following general form: map: (K1, V1) -> list(K2, V2)
b) The reduce function in Hadoop MapReduce has the following general form: reduce: (K2, list(V2)) -> list(K3, V3)
c) MapReduce has a complex model of data processing: inputs and outputs for the map and
reduce functions are key-value pairs
d) None of the mentioned
Answer:c
Explanation:MapReduce is relatively simple model to implement in Hadoop.
6. InputFormat class calls the ________ function and computes splits for each file and then sends
them to the jobtracker.
a) puts
b) gets
c) getSplits
d) All of the mentioned
Answer:c
Explanation:The jobtracker uses the splits' storage locations to schedule map tasks to process them on the tasktrackers.
7. On a tasktracker, the map task passes the split to the createRecordReader() method on
InputFormat to obtain a _________ for that split.
a) InputReader
b) RecordReader
c) OutputReader
d) None of the mentioned
Answer:b
Explanation: The RecordReader loads data from its source and converts it into key-value pairs suitable for reading by the mapper.

8. The default InputFormat is __________, which treats each line of input as a new value; the associated key is the byte offset of the line.
a) TextFormat
b) TextInputFormat
c) InputFormat
d) All of the mentioned
Answer:b
Explanation:A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs.
9. __________ controls the partitioning of the keys of the intermediate map-outputs.
a) Collector
b) Partitioner
c) InputFormat
d) None of the mentioned
Answer:b
Explanation: The output of the mapper is sent to the partitioner.
10. Output of the mapper is first written on the local disk for sorting and _________ process.
a) shuffling
b) secondary sorting
c) forking
d) reducing
Answer:a
Explanation:All values corresponding to the same key will go to the same reducer.
Hadoop Questions and Answers Hadoop Archives
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop
Archives.
1. _________ is the name of the archive you would like to create.
a) archive
b) archiveName
c) Name
d) None of the mentioned

Answer:b
Explanation: The name should have a *.har extension.
2. Point out the correct statement :
a) A Hadoop archive maps to a file system directory
b) Hadoop archives are special format archives
c) A Hadoop archive always has a *.har extension
d) All of the mentioned
Answer:d
Explanation:A Hadoop archive directory contains metadata (in the form of _index and
_masterindex) and data (part-*) files.
3. Using Hadoop Archives in __________ is as easy as specifying a different input filesystem
than the default file system.
a) Hive
b) Pig
c) MapReduce
d) All of the mentioned
Answer:c
Explanation:Since Hadoop Archives are exposed as a file system, MapReduce is able to use all the logical input files in Hadoop Archives as input.
4. The __________ guarantees that excess resources taken from a queue will be restored to it
within N minutes of its need for them.
a) capacitor
b) scheduler
c) datanode
d) None of the mentioned
Answer:b
Explanation:Free resources can be allocated to any queue beyond its guaranteed capacity.
5. Point out the wrong statement :
a) The Hadoop archive exposes itself as a file system layer
b) Hadoop archives are immutable
c) Archive renames, deletes and creates return an error
d) None of the mentioned

Answer:d
Explanation:All the fs shell commands in the archives work but with a different URI.
6. _________ is a pluggable Map/Reduce scheduler for Hadoop which provides a way to share
large clusters.
a) Flow Scheduler
b) Data Scheduler
c) Capacity Scheduler
d) None of the mentioned
Answer:c
Explanation: The Capacity Scheduler supports multiple queues, where a job is submitted to a queue.
7. Which of the following parameter describes destination directory which would contain the
archive ?
a) -archiveName
b) <source>
c) <destination>
d) None of the mentioned
Answer:c
Explanation: -archiveName is the name of the archive to be created.
8. _________ identifies filesystem pathnames which work as usual with regular expressions.
a) -archiveName <name>
b) <source>
c) <destination>
d) None of the mentioned
Answer:b
Explanation: <source> identifies filesystem pathnames which work as usual with regular expressions, while <destination> identifies the destination directory which would contain the archive.
9. __________ is the parent argument used to specify the relative path to which the files should
be archived to
a) -archiveName <name>
b) -p <parent_path>
c) <destination>
d) <source>

Answer:b
Explanation: The hadoop archive command creates a Hadoop archive, a file that contains other
files.
10. Which of the following is a valid syntax for hadoop archive ?
a)
hadooparchive [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
<source>
<destination>
b)
hadooparch [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
<source>
<destination>
c)
hadoop [ Generic Options ] archive
-archiveName <name>
[-p <parent>]
<source>
<destination>
d) None of the mentioned
Answer:c
Explanation: The Hadoop archiving tool can be invoked using the following command format:
hadoop archive -archiveName <name> -p <parent> <source> <destination>

Hadoop Questions and Answers Hadoop I/O


This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop I/O.
1. Hadoop comes with a set of ________ for data I/O.
a) methods
b) commands
c) classes
d) None of the mentioned
Answer:d
Explanation:Hadoop I/O consists of primitives for serialization and deserialization.
2. Point out the correct statement :
a) The sequence file also can contain a secondary key-value list that can be used as file
Metadata
b) SequenceFile formats share a header that contains some information which allows the reader
to recognize is format
c) There are Key and Value Class Names that allow the reader to instantiate those classes, via
reflection, for reading
d) All of the mentioned
Answer:d
Explanation:In contrast with other persistent key-value data structures like B-Trees, you can't seek to a specified key to edit, add or remove it.
3. Apache Hadoop's ___________ provides a persistent data structure for binary key-value pairs.
a) GetFile
b) SequenceFile
c) Putfile
d) All of the mentioned
Answer:b
Explanation:SequenceFile is append-only.
4. How many formats of SequenceFile are present in Hadoop I/O ?
a) 2
b) 3
c) 4
d) 5

Answer:b
Explanation:SequenceFile has 3 available formats: an uncompressed format, a record-compressed format and a block-compressed format.
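A minimal sketch of writing key-value records to a SequenceFile in the uncompressed format, using the classic createWriter() signature (the output path comes from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        SequenceFile.Writer writer = null;
        try {
            // Key class and value class are recorded in the file header.
            writer = SequenceFile.createWriter(fs, conf, path,
                    IntWritable.class, Text.class);
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
    }
}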
5. Point out the wrong statement :
a) The data file contains all the key, value records but key N + 1 must be greater than or equal to
the key N
b) Sequence file is a kind of hadoop file based data structure
c) Map file type is splittable as it contains a sync point after several records
d) None of the mentioned
Answer:c
Explanation:Map file is again a kind of hadoop file based data structure and it differs from a
sequence file in a matter of the order.
6. Which of the following format is more compression-aggressive ?
a) Partition Compressed
b) Record Compressed
c) Block-Compressed
d) Uncompressed
Answer:c
Explanation:In the block-compressed format, both keys and values are collected in blocks and compressed together, which makes it the most compression-aggressive of the SequenceFile formats.
7. The __________ is a directory that contains two SequenceFile.
a) ReduceFile
b) MapperFile
c) MapFile
d) None of the mentioned
Answer:c
Explanation:The two SequenceFiles are the data file (/data) and the index file (/index).
8. The ______ file is populated with the key and a LongWritable that contains the starting byte
position of the record.
a) Array
b) Index
c) Immutable

d) All of the mentioned


Answer:b
Explanation:The index does not contain all the keys but just a fraction of the keys.
9. The _________ uses just the value field, append(value), and the key is a LongWritable that contains the record number, count + 1.
a) SetFile
b) ArrayFile
c) BloomMapFile
d) None of the mentioned
Answer:b
Explanation: The SetFile, instead of append(key, value), uses just the key field, append(key), and the value is always the NullWritable instance.
10. The ____________ data file format is based on the Avro serialization framework, which was primarily created for Hadoop.
a) Oozie
b) Avro
c) cTakes
d) Lucene
Answer:b
Explanation:Avro is a splittable data format with a metadata section at the beginning and then a sequence of Avro-serialized objects.
Hadoop Questions and Answers Compression
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Compression.
1. The _________ codec from Google provides modest compression ratios.
a) Snapcheck
b) Snappy
c) FileCompress
d) None of the mentioned
Answer:b
Explanation:Snappy has fast compression and decompression speeds.

2. Point out the correct statement :


a) Snappy is licensed under the GNU Public License (GPL)
b) BgCIK needs to create an index when it compresses a file
c) The Snappy codec is integrated into Hadoop Common, a set of common utilities that supports
other Hadoop subprojects
d) None of the mentioned
Answer:c
Explanation:You can use Snappy as an add-on for more recent versions of Hadoop that do not yet
provide Snappy codec support.
3. Which of the following compression is similar to Snappy compression ?
a) LZO
b) Bzip2
c) Gzip
d) All of the mentioned
Answer:a
Explanation:LZO is only really desirable if you need to compress text files.
4. Which of the following supports splittable compression ?
a) LZO
b) Bzip2
c) Gzip
d) All of the mentioned
Answer:a
Explanation:LZO enables the parallel processing of compressed text file splits by your
MapReduce jobs.
5. Point out the wrong statement :
a) From a usability standpoint, LZO and Gzip are similar.
b) Bzip2 generates a better compression ratio than does Gzip, but its much slower
c) Gzip is a compression utility that was adopted by the GNU project
d) None of the mentioned
Answer:a
Explanation:From a usability standpoint, Bzip2 and Gzip are similar.

6. Which of the following is the slowest compression technique ?


a) LZO
b) Bzip2
c) Gzip
d) All of the mentioned
Answer:b
Explanation:Of all the available compression codecs in Hadoop, Bzip2 is by far the slowest.
7. Gzip (short for GNU zip) generates compressed files that have a _________ extension.
a) .gzip
b) .gz
c) .gzp
d) .g
Answer:b
Explanation:You can use the gunzip command to decompress files that were created by a number
of compression utilities, including Gzip.
8. Which of the following is based on the DEFLATE algorithm ?
a) LZO
b) Bzip2
c) Gzip
d) All of the mentioned
Answer:c
Explanation:gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and
Huffman Coding.
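As a sketch of how a codec is wired in (classic mapred API), the following turns on gzip compression for a job's output files, so each part file gets a .gz extension:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class GzipOutputConfig {
    // Compress the job's output files with the gzip codec.
    public static void enableGzipOutput(JobConf conf) {
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
    }
}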
9. __________ typically compresses files to within 10% to 15% of the best available techniques.
a) LZO
b) Bzip2
c) Gzip
d) All of the mentioned
Answer:b
Explanation:bzip2 is a freely available, patent-free, high-quality data compressor.
10. The LZO compression format is composed of approximately __________ blocks of
compressed data.

a) 128k
b) 256k
c) 24k
d) 36k
Answer:b
Explanation:LZO was designed with speed in mind: it decompresses about twice as fast as gzip, meaning it's fast enough to keep up with hard drive read speeds.
Hadoop Questions and Answers Data Integrity
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Data Integrity.
1. The HDFS client software implements __________ checking on the contents of HDFS files.
a) metastore
b) parity
c) checksum
d) None of the mentioned
Answer:c
Explanation:When a client creates an HDFS file, it computes a checksum of each block of the
file and stores these checksums in a separate hidden file in the same HDFS namespace.
2. Point out the correct statement :
a) The HDFS architecture is compatible with data rebalancing schemes
b) Datablocks support storing a copy of data at a particular instant of time.
c) HDFS currently support snapshots.
d) None of the mentioned
Answer:a
Explanation:A scheme might automatically move data from one DataNode to another if the free
space on a DataNode falls below a certain threshold.
3. The ___________ machine is a single point of failure for an HDFS cluster.
a) DataNode
b) NameNode
c) ActionNode
d) All of the mentioned

Answer:b
Explanation:If the NameNode machine fails, manual intervention is necessary. Currently,
automatic restart and failover of the NameNode software to another machine is not supported.
4. The ____________ and the EditLog are central data structures of HDFS.
a) DsImage
b) FsImage
c) FsImages
d) All of the mentioned
Answer:b
Explanation:A corruption of these files can cause the HDFS instance to be non-functional
5. Point out the wrong statement :
a) HDFS is designed to support small files only
b) Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously
c) NameNode can be configured to support maintaining multiple copies of the FsImage and
EditLog
d) None of the mentioned
Answer:a
Explanation:HDFS is designed to support very large files.
6. __________ support storing a copy of data at a particular instant of time.
a) Data Image
b) Datanots
c) Snapshots
d) All of the mentioned
Answer:c
Explanation:One usage of the snapshot feature may be to roll back a corrupted HDFS instance to
a previously known good point in time.
7. Automatic restart and ____________ of the NameNode software to another machine is not
supported.
a) failover
b) end
c) scalability

d) All of the mentioned


Answer:a
Explanation:If the NameNode machine fails, manual intervention is necessary.
8. HDFS, by default, replicates each data block _____ times on different nodes and on at least
____ racks.
a) 3,2
b) 1,2
c) 2,3
d) All of the mentioned
Answer:a
Explanation:HDFS has a simple yet robust architecture that was explicitly designed for data
reliability in the face of faults and failures in disks, nodes and networks.
9. _________ stores its metadata on multiple disks that typically include a non-local file server.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
Answer:b
Explanation:HDFS tolerates failures of storage servers (called DataNodes) and its disks
10. The HDFS file system is temporarily unavailable whenever the HDFS ________ is down.
a) DataNode
b) NameNode
c) ActionNode
d) None of the mentioned
Answer:b
Explanation:When the HDFS NameNode is restarted it recovers its metadata.
Hadoop Questions and Answers Serialization
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Serialization.
1. Apache _______ is a serialization framework that produces data in a compact binary format.
a) Oozie
b) Impala

c) kafka
d) Avro
Answer:d
Explanation:Apache Avro doesn't require proxy objects or code generation.
2. Point out the correct statement :
a) Apache Avro is a framework that allows you to serialize data in a format that has a schema
built in
b) The serialized data is in a compact binary format that doesn't require proxy objects or code
generation
c) Including schemas with the Avro messages allows any application to deserialize the data
d) All of the mentioned
Answer:d
Explanation:Instead of using generated proxy libraries and strong typing, Avro relies heavily on
the schemas that are sent along with the serialized data.
3. Avro schemas describe the format of the message and are defined using :
a) JSON
b) XML
c) JS
d) All of the mentioned
Answer:a
Explanation: The JSON schema content is put into a file.
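For illustration, a minimal sketch of an Avro schema written as JSON and parsed from Java; the record name and fields are made up for the example:

import org.apache.avro.Schema;

public class SchemaDemo {
    public static void main(String[] args) {
        // A simple record schema defined in JSON (field names are illustrative).
        String schemaJson =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        System.out.println(schema.toString(true));   // pretty-print the schema
    }
}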
4. The ____________ is an iterator which reads through the file and returns objects using the
next() method.
a) DatReader
b) DatumReader
c) DatumRead
d) None of the mentioned
Answer:b
Explanation:DatumReader reads the content through the DataFileReader implementation.
5. Point out the wrong statement :
a) Java code is used to deserialize the contents of the file into objects
b) Avro allows you to use complex data structures within Hadoop MapReduce jobs

c) The m2e plug-in automatically downloads the newly added JAR files and their dependencies
d) None of the mentioned
Answer:d
Explanation:A unit test is useful because you can make assertions to verify that the values of the
deserialized object are the same as the original values.
6. The ____________ class extends and implements several Hadoop-supplied interfaces.
a) AvroReducer
b) Mapper
c) AvroMapper
d) None of the mentioned
Answer:c
Explanation:AvroMapper is used to provide the ability to collect or map data.
7. ____________ class accepts the values that the ModelCountMapper object has collected.
a) AvroReducer
b) Mapper
c) AvroMapper
d) None of the mentioned
Answer:a
Explanation:AvroReducer summarizes them by looping through the values.
8. The ________ method in the ModelCountReducer class reduces the values the mapper
collects into a derived value
a) count
b) add
c) reduce
d) All of the mentioned
Answer:c
Explanation:In some cases, it can be a simple sum of the values.
9. Which of the following works well with Avro ?
a) Lucene
b) kafka
c) MapReduce

d) None of the mentioned


Answer:c
Explanation:You can use Avro and MapReduce together to process many items serialized with Avro's small binary format.
10. __________ tools is used to generate proxy objects in Java to easily work with the objects.
a) Lucene
b) kafka
c) MapReduce
d) Avro
Answer:d
Explanation:Avro serialization includes the schema with it in JSON format which allows
you to have different versions of objects.
Hadoop Questions and Answers Avro-1
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Avro-1.
1. Avro schemas are defined with _____
a) JSON
b) XML
c) JAVA
d) All of the mentioned
Answer:a
Explanation:JSON implementation facilitates implementation in languages that already have
JSON libraries.
2. Point out the correct statement :
a) Avro provides functionality similar to systems such as Thrift
b) When Avro is used in RPC, the client and server exchange data in the connection handshake
c) Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Java
Foundation
d) None of the mentioned
Answer:a
Explanation:Avro differs from these systems in the fundamental aspects like untagged data.

3. __________ facilitates construction of generic data-processing systems and languages.


a) Untagged data
b) Dynamic typing
c) No manually-assigned field IDs
d) All of the mentioned
Answer:b
Explanation:Avro does not require that code be generated.
4. With ______ we can store data and read it easily with various programming languages
a) Thrift
b) Protocol Buffers
c) Avro
d) None of the mentioned
Answer:c
Explanation:Avro is optimized to minimize the disk space needed by our data and it is flexible.
5. Point out the wrong statement :
a) Apache Avro is a data serialization system
b) Avro provides simple integration with dynamic languages
c) Avro provides rich data structures
d) All of the mentioned
Answer:d
Explanation: Code generation is not required to read or write data files nor to use or implement
RPC protocols in Avro.
6. ________ are a way of encoding structured data in an efficient yet extensible format.
a) Thrift
b) Protocol Buffers
c) Avro
d) None of the mentioned
Answer:b
Explanation:Google uses Protocol Buffers for almost all of its internal RPC protocols and file
formats.
7. Thrift resolves possible conflicts through _________ of the field.
a) Name

b) Static number
c) UID
d) None of the mentioned
Answer:b
Explanation:Avro resolves possible conflicts through the name of the field.
8. Avro is said to be the future _______ layer of Hadoop.
a) RMC
b) RPC
c) RDC
d) All of the mentioned
Answer:b
Explanation:When Avro is used in RPC, the client and server exchange schemas in the
connection handshake.
9. When using reflection to automatically build our schemas without code generation, we need to
configure Avro using :
a) AvroJob.Reflect(jConf);
b) AvroJob.setReflect(jConf);
c) Job.setReflect(jConf);
d) None of the mentioned

Answer:c
Explanation:For strongly typed languages like Java, it also provides a generation code layer,
including RPC services code generation.
10. We can declare the schema of our data either in a ______ file.
a) JSON
b) XML
c) SQL
d) R
Answer:a
Explanation:The schema can be declared in a JSON file, through an IDL, or simply through Java beans by using reflection-based schema building.

Hadoop Questions and Answers Avro-2


This set of Interview Questions and Answers focuses on Avro.
1. Which of the following is a primitive data type in Avro ?
a) null
b) boolean
c) float
d) All of the mentioned
Answer:d
Explanation:Primitive type names are also defined type names.
2. Point out the correct statement :
a) Records use the type name record and support three attributes
b) Enum are represented using JSON arrays
c) Avro data is always serialized with its schema
d) All of the mentioned
Answer:a
Explanation:A record is encoded by encoding the values of its fields in the order that they are
declared.
3. Avro supports ______ kinds of complex types.
a) 3
b) 4
c) 6
d) 7
Answer:c
Explanation:Avro supports six kinds of complex types: records, enums, arrays, maps, unions and
fixed.
4.________ are encoded as a series of blocks.
a) Arrays
b) Enum
c) Unions
d) Maps

Answer:a
Explanation:Each block of an array consists of a long count value, followed by that many array items. A block with count zero indicates the end of the array. Each item is encoded per the array's item schema.
5. Point out the wrong statement :
a) Record, enums and fixed are named types
b) Unions may immediately contain other unions
c) A namespace is a dot-separated sequence of such names
d) All of the mentioned
Answer:b
Explanation:Unions may not immediately contain other unions.
6. ________ instances are encoded using the number of bytes declared in the schema.
a) Fixed
b) Enum
c) Unions
d) Maps
Answer:a
Explanation:Except for unions, the JSON encoding is the same as is used to encode field default
values.
7. ________ permits data written by one system to be efficiently sorted by another system.
a) Complex Data type
b) Order
c) Sort Order
d) All of the mentioned
Answer:c
Explanation:Avro binary-encoded data can be efficiently ordered without deserializing it to
objects.
8. _____________ are used between blocks to permit efficient splitting of files for MapReduce
processing.
a) Codec
b) Data Marker
c) Synchronization markers

d) All of the mentioned


Answer:c
Explanation:Avro includes a simple object container file format.
9. The __________ codec uses Google's Snappy compression library.
a) null
b) snappy
c) deflate
d) None of the mentioned
Answer:b
Explanation:Snappy is a compression library developed at Google, and, like many technologies
that come from Google, Snappy was designed to be fast.
10. Avro messages are framed as a list of _________
a) buffers
b) frames
c) rows
d) None of the mentioned
Answer:b
Explanation:Framing is a layer between messages and the transport. It exists to optimize certain
operations.
Hadoop Questions and Answers Metrics in HBase
This set of Interview Questions & Answers focuses on HBase.
1. _______ can change the maximum number of cells of a column family.
a) set
b) reset
c) alter
d) select
Answer:c
Explanation:Alter is the command used to make changes to an existing table.
2. Point out the correct statement :
a) You can add a column family to a table using the method addColumn()
b) Using alter, you can also create a column family

c) Using disable-all, you can truncate a column family


d) None of the mentioned
Answer:a
Explanation:Columns can also be added through HBaseAdmin.
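A minimal sketch (assuming the older HBaseAdmin client API) of adding a column family to an existing table; the table and family names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class AddColumnFamily {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Roughly equivalent to the shell command: alter 'employee', {NAME => 'contact'}
            admin.addColumn("employee", new HColumnDescriptor("contact"));
        } finally {
            admin.close();
        }
    }
}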
3. Which of the following is not a table scope operator ?
a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE
c) MAX_FILESIZE
d) All of the mentioned
Answer:a
Explanation:Using alter, you can set and remove table scope operators such as MAX_FILESIZE,
READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.
4. You can delete a column family from a table using the method _________ of the HBaseAdmin
class.
a) delColumn()
b) removeColumn()
c) deleteColumn()
d) All of the mentioned
Answer:c
Explanation:Alter command also can be used to delete a column family.
5. Point out the wrong statement :
a) To read data from an HBase table, use the get() method of the HTable class
b) You can retrieve data from the HBase table using the get() method of the HTable class
c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row ids,
or scan an entire table or a subset of rows
d) None of the mentioned
Answer:d
Explanation:You can retrieve HBase table data using the add method variants in the Get class.
6. __________ class adds HBase configuration files to its object.
a) Configuration
b) Collector
c) Component

d) None of the mentioned


Answer:a
Explanation:You can create a configuration object using the create() method of the
HbaseConfiguration class.
7. The ________ class provides the getValue() method to read the values from its instance.
a) Get
b) Result
c) Put
d) Value
Answer:b
Explanation:Get the result by passing your Get class instance to the get method of the HTable
class. This method returns the Result class object, which holds the requested result.
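A minimal sketch (older HTable client API) of reading one row and pulling a single cell out of the Result with getValue(); the table, row and column names are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadRow {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "employee");
        try {
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            // Read the value stored in column family "personal", qualifier "name".
            byte[] value = result.getValue(Bytes.toBytes("personal"),
                                           Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}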
8. ________ communicate with the client and handle data-related operations.
a) Master Server
b) Region Server
c) Htable
d) All of the mentioned
Answer:b
Explanation:Region Servers handle read and write requests for all the regions under them.
9. _________ is the main configuration file of HBase.
a) hbase.xml
b) hbase-site.xml
c) hbase-site-conf.xml
d) None of the mentioned
Answer:b
Explanation:Set the data directory to an appropriate location by opening the HBase home folder
in /usr/local/HBase.
10. HBase uses the _______ File System to store its data.
a) Hive
b) Imphala
c) Hadoop

d) Scala
Answer:c
Explanation: The data storage will be in the form of regions (tables). These regions will be split
up and stored in region servers.
Hadoop Questions and Answers MapReduce Development-2
This set of Questions & Answers focuses on Hadoop MapReduce.
1. The Mapper implementation processes one line at a time via _________ method.
a) map
b) reduce
c) mapper
d) reducer
Answer:a
Explanation: The Mapper outputs are sorted and then partitioned per Reducer.
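For illustration, here is a minimal word-count style mapper using the classic org.apache.hadoop.mapred API (JobConf, OutputCollector, Reporter) that these questions refer to; class and field names are illustrative. The framework calls map() once per input line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Word-count style mapper: map() is invoked once per line of input.
public class LineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, ONE);   // emit intermediate key/value pairs
        }
    }
}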
2. Point out the correct statement :
a) Mapper maps input key/value pairs to a set of intermediate key/value pairs
b) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
c) Mapper and Reducer interfaces form the core of the job
d) None of the mentioned
Answer:d
Explanation: The transformed intermediate records do not need to be of the same type as the
input records.
3. The Hadoop MapReduce framework spawns one map task for each __________ generated by
the InputFormat for the job.
a) OutputSplit
b) InputSplit
c) InputSplitStream
d) All of the mentioned
Answer:b
Explanation:Mapper implementations are passed the JobConf for the job via the
JobConfigurable.configure(JobConf) method and override it to initialize themselves.

4. Users can control which keys (and hence records) go to which Reducer by implementing a
custom :
a) Partitioner
b) OutputSplit
c) Reporter
d) All of the mentioned
Answer:a
Explanation:Users can control the grouping by specifying a Comparator via
JobConf.setOutputKeyComparatorClass(Class).
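A hedged sketch of such a custom Partitioner with the old-style mapred interface; the first-letter routing rule is purely illustrative. It would be registered on the job with JobConf.setPartitionerClass(FirstLetterPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Partitioner;

// Routes keys to reducers by the first character of the key, so all keys that
// start with the same character land in the same partition (and hence reducer).
public class FirstLetterPartitioner extends MapReduceBase
        implements Partitioner<Text, IntWritable> {

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}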
5. Point out the wrong statement :
a) The Mapper outputs are sorted and then partitioned per Reducer
b) The total number of partitions is the same as the number of reduce tasks for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value)
format
d) None of the mentioned
Answer:d
Explanation:All intermediate values associated with a given output key are subsequently grouped
by the framework, and passed to the Reducer(s) to determine the final output.
6. Applications can use the ____________ to report progress and set application-level status
messages
a) Partitioner
b) OutputSplit
c) Reporter
d) All of the mentioned
Answer:c
Explanation:Reporter is also used to update Counters, or just indicate that they are alive.
7. The right level of parallelism for maps seems to be around _________ maps per-node
a) 1-10
b) 10-100
c) 100-150
d) 150-200

Answer:b
Explanation:Task setup takes a while, so it is best if the maps take at least a minute to execute.
8. The number of reduces for the job is set by the user via :
a) JobConf.setNumTasks(int)
b) JobConf.setNumReduceTasks(int)
c) JobConf.setNumMapTasks(int)
d) All of the mentioned
Answer:b
Explanation:Reducer has 3 primary phases: shuffle, sort and reduce.
9. The framework groups Reducer inputs by key in _________ stage.
a) sort
b) shuffle
c) reduce
d) None of the mentioned
Answer:a
Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.
10. The output of the reduce task is typically written to the FileSystem via _____________ .
a) OutputCollector.collect
b) OutputCollector.get
c) OutputCollector.receive
d) OutputCollector.put
Answer:a
Explanation: The output of the Reducer is not sorted.
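To make the shuffle/sort/reduce flow concrete, here is a minimal summing reducer in the same old-style API, writing its output through OutputCollector.collect; names are illustrative.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sums all values that the framework has grouped under the same key
// and writes the total to the FileSystem via OutputCollector.collect.
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}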
Hadoop Questions and Answers MapReduce Features-1
This set of Hadoop Questions & Answers for freshers focuses on MapReduce Features.
1. Which of the following is the default Partitioner for Mapreduce ?
a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
d) None of the mentioned

Answer:c
Explanation: The total number of partitions is the same as the number of reduce tasks for the job.
2. Point out the correct statement :
a) The right number of reduces seems to be 0.95 or 1.75
b) Increasing the number of reduces increases the framework overhead
c) With 0.95 all of the reduces can launch immediately and start transferring map outputs as the
maps finish
d) All of the mentioned
Answer:c
Explanation:With 1.75 the faster nodes will finish their first round of reduces and launch a
second wave of reduces doing a much better job of load balancing.
3. Which of the following partitions the key space ?
a) Partitioner
b) Compactor
c) Collector
d) All of the mentioned
Answer:a
Explanation:Partitioner controls the partitioning of the keys of the intermediate map-outputs.
4. ____________ is a generalization of the facility provided by the MapReduce framework to
collect data output by the Mapper or the Reducer
a) OutputCompactor
b) OutputCollector
c) InputCollector
d) All of the mentioned
Answer:b
Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
5. Point out the wrong statement :
a) It is legal to set the number of reduce-tasks to zero if no reduction is desired
b) The outputs of the map-tasks go directly to the FileSystem
c) The Mapreduce framework does not sort the map-outputs before writing them out to the
FileSystem

d) None of the mentioned


Answer:d
Explanation:Outputs of the map-tasks go directly to the FileSystem, into the output path set by
setOutputPath(Path).
6. __________ is the primary interface for a user to describe a MapReduce job to the Hadoop
framework for execution.
a) JobConfig
b) JobConf
c) JobConfiguration
d) All of the mentioned
Answer:b
Explanation:JobConf is typically used to specify the Mapper, combiner (if any), Partitioner,
Reducer, InputFormat, OutputFormat and OutputCommitter implementations.
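A short sketch of describing a job through JobConf, reusing the LineMapper and SumReducer classes sketched earlier; the input/output paths and the reduce-task count are hypothetical placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(LineMapper.class);      // mapper sketched earlier
        conf.setCombinerClass(SumReducer.class);    // combiner is optional
        conf.setReducerClass(SumReducer.class);     // reducer sketched earlier
        conf.setNumReduceTasks(2);                  // JobConf.setNumReduceTasks(int)

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path("/user/example/input"));   // hypothetical path
        FileOutputFormat.setOutputPath(conf, new Path("/user/example/output")); // hypothetical path

        JobClient.runJob(conf);   // submits the job and waits for completion
    }
}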
7. The ___________ executes the Mapper/Reducer task as a child process in a separate JVM.
a) JobTracker
b) TaskTracker
c) TaskScheduler
d) None of the mentioned
Answer:b
Explanation: The child-task inherits the environment of the parent TaskTracker.
8. Maximum virtual memory of the launched child-task is specified using :
a) mapv
b) mapred
c) mapvim
d) All of the mentioned
Answer:b
Explanation:Admins can also specify the maximum virtual memory of the launched child-task,
and any sub-process it launches recursively, using mapred.
9. Which of the following parameter is the threshold for the accounting and serialization
buffers ?
a) io.sort.spill.percent
b) io.sort.record.percent

c) io.sort.mb
d) None of the mentioned
Answer:a
Explanation:When percentage of either buffer has filled, their contents will be spilled to disk in
the background.
10. ______________ is the percentage of memory relative to the maximum heap size in which map
outputs may be retained during the reduce.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
Answer:b
Explanation:When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.
Hadoop Questions and Answers MapReduce Features-2
This set of Interview Questions & Answers focuses on MapReduce.
1. ____________ specifies the number of segments on disk to be merged at the same time.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percent
c) mapred.inmem.merge.threshold
d) io.sort.factor
Answer:d
Explanation:io.sort.factor limits the number of open files and compression codecs during the
merge.
2. Point out the correct statement :
a) The number of sorted map outputs fetched into memory before being merged to disk
b) The memory threshold for fetched map outputs before an in-memory merge is finished
c) The percentage of memory relative to the maximum heapsize in which map outputs may not
be retained during the reduce
d) None of the mentioned

Answer:a
Explanation:When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.
3. Map output larger than ___ percent of the memory allocated to copying map outputs is written directly to disk.
a) 10
b) 15
c) 25
d) 35
Answer:c
Explanation:Map output will be written directly to disk without first staging through memory.
4. Jobs can enable task JVMs to be reused by specifying the job configuration :
a) mapred.job.recycle.jvm.num.tasks
b) mapissue.job.reuse.jvm.num.tasks
c) mapred.job.reuse.jvm.num.tasks
d) All of the mentioned
Answer:c
Explanation:Setting mapred.job.reuse.jvm.num.tasks to a value greater than 1 (or to -1 for no
limit) allows a task JVM to be reused for multiple tasks of the same job.
5. Point out the wrong statement :
a) The task tracker has local directory to create localized cache and localized job
b) The task tracker can define multiple local directories
c) The Job tracker cannot define multiple local directories
d) None of the mentioned
Answer:d
Explanation:When the job starts, task tracker creates a localized job directory relative to the local
directory specified in the configuration.
6. During the execution of a streaming job, the names of the _______ parameters are
transformed.
a) vmap
b) mapvim
c) mapreduce
d) mapred

Answer:d
Explanation:To get the values in a streaming job's mapper/reducer, use the parameter names with
the underscores (for example, mapred.job.id becomes mapred_job_id).
7. The standard output (stdout) and error (stderr) streams of the task are read by the TaskTracker
and logged to :
a) ${HADOOP_LOG_DIR}/user
b) ${HADOOP_LOG_DIR}/userlogs
c) ${HADOOP_LOG_DIR}/logs
d) None of the mentioned
Answer:b
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
8. ____________ is the primary interface by which user-job interacts with the JobTracker.
a) JobConf
b) JobClient
c) JobServer
d) All of the mentioned
Answer:b
Explanation:JobClient provides facilities to submit jobs, track their progress, access component-task reports and logs, get the MapReduce cluster's status information, and so on.
9. The _____________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.
a) DistributedLog
b) DistributedCache
c) DistributedJars
d) None of the mentioned
Answer:b
Explanation:Cached libraries can be loaded via System.loadLibrary or System.load.
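A hedged sketch of staging a side file and an extra jar through DistributedCache before job submission; the paths are hypothetical and the static helpers shown are from the classic org.apache.hadoop.filecache.DistributedCache API.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetupExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheSetupExample.class);

        // Ship a lookup file to every task's local working area (hypothetical path)
        DistributedCache.addCacheFile(new URI("/user/example/lookup.dat"), conf);

        // Make an extra jar available on the task classpath (hypothetical path)
        DistributedCache.addFileToClassPath(new Path("/user/example/lib/extra.jar"), conf);

        // Tasks can later locate the localized copies via
        // DistributedCache.getLocalCacheFiles(conf).
    }
}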
10. __________ is used to filter log files from the output directory listing.
a) OutputLog
b) OutputLogFilter
c) DistributedLog
d) DistributedJars



Answer:b
Explanation:Users can view the history log summary for a specified directory using the following
command: $ bin/hadoop job -history output-dir.
Hadoop Questions and Answers Hadoop Configuration
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Hadoop
Configuration.
1. Which of the following class provides access to configuration parameters ?
a) Config
b) Configuration
c) OutputConfig
d) None of the mentioned
Answer:b
Explanation:Configurations are specified by resources.
2. Point out the correct statement :
a) Configuration parameters may be declared static
b) Unless explicitly turned off, Hadoop by default specifies two resources
c) Configuration class provides access to configuration parameters
d) None of the mentioned
Answer:a
Explanation:Once a resource declares a value final, no subsequently-loaded resource can alter
that value.
3. ___________ gives site-specific configuration for a given hadoop installation.
a) core-default.xml
b) core-site.xml
c) coredefault.xml
d) All of the mentioned
Answer:b
Explanation:core-default.xml contains the read-only defaults for Hadoop; core-site.xml holds the site-specific configuration that overrides them.
4. Administrators typically define parameters as final in __________ for values that user
applications may not alter.

a) core-default.xml
b) core-site.xml
c) coredefault.xml
d) All of the mentioned
Answer:b
Explanation:Value strings are first processed for variable expansion.
5. Point out the wrong statement :
a) addDeprecations adds a set of deprecated keys to the global deprecations
b) Configuration parameters cannot be declared final
c) addDeprecations method is lockless
d) None of the mentioned
Answer:b
Explanation:Configuration parameters may be declared final.
6. _________ method clears all keys from the configuration.
a) clear
b) addResource
c) getClass
d) None of the mentioned
Answer:a
Explanation:getClass is used to get the value of the name property as a Class.
7. ________ method adds the deprecated key to the global deprecation map.
a) addDeprecits
b) addDeprecation
c) keyDeprecation
d) None of the mentioned
Answer:b
Explanation:addDeprecation does not override any existing entries in the deprecation map.
8. ________ checks whether the given key is deprecated.
a) isDeprecated
b) setDeprecated
c) isDeprecatedif

d) All of the mentioned


Answer:a
Explanation:Method returns true if the key is deprecated and false otherwise.
9. _________ is useful for iterating the properties when all deprecated properties for currently set
properties need to be present.
a) addResource
b) setDeprecatedProperties
c) addDefaultResource
d) None of the mentioned
Answer:b
Explanation:setDeprecatedProperties sets all deprecated properties that are not currently set but
have a corresponding new property that is set.
10. Which of the following adds a configuration resource ?
a) addResource
b) setDeprecatedProperties
c) addDefaultResource
d) addResource
Answer:d
Explanation: The properties of this resource will override properties of previously added
resources, unless they were marked final.
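As a small illustration of the Configuration API discussed above, the sketch below loads an extra resource and reads a property; the extra resource path and the example.buffer.kb key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfigurationExample {
    public static void main(String[] args) {
        // Loads core-default.xml and core-site.xml from the classpath by default
        Configuration conf = new Configuration();

        // addResource: later resources override earlier ones unless a value is marked final
        conf.addResource(new Path("/etc/hadoop/conf/my-site.xml"));   // hypothetical file

        // Read a value, falling back to a default if the key is unset
        String fsUri = conf.get("fs.defaultFS", "file:///");
        System.out.println("fs.defaultFS = " + fsUri);

        // Programmatic overrides are also possible
        conf.setInt("example.buffer.kb", 64);   // hypothetical property
    }
}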
Hadoop Questions and Answers Security
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Security.
1. For running hadoop service daemons in Hadoop in secure mode, ___________ principals are
required.
a) SSL
b) Kerberos
c) SSH
d) None of the mentioned
Answer:b
Explanation:Each service reads authentication information saved in a keytab file with appropriate
permissions.
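For client code, a keytab-based login commonly looks roughly like the following sketch; the principal and keytab path are placeholders, and UserGroupInformation is Hadoop's client-side security entry point.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enable Kerberos authentication for this client
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab instead of an interactive kinit
        // (principal and keytab path below are placeholders)
        UserGroupInformation.loginUserFromKeytab(
                "hdfs/node1.example.com@EXAMPLE.COM",
                "/etc/security/keytabs/hdfs.keytab");

        System.out.println("Logged in as: "
                + UserGroupInformation.getCurrentUser().getUserName());
    }
}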

2. Point out the correct statement :


a) Hadoop does have the definition of group by itself
b) MapReduce JobHistory server run as same user such as mapred
c) SSO environment is managed using Kerberos with LDAP for Hadoop in secure mode
d) None of the mentioned
Answer:c
Explanation:You can change a way of mapping by specifying the name of mapping provider as a
value of hadoop.security.group.mapping.
3. The simplest way to do authentication is using _________ command of Kerberos.
a) auth
b) kinit
c) authorize
d) All of the mentioned
Answer:b
Explanation:HTTP web-consoles should be served by a principal different from the RPC one.
4. Data transfer between Web-console and clients are protected by using :
a) SSL
b) Kerberos
c) SSH
d) None of the mentioned
Answer:a
Explanation:AES offers the greatest cryptographic strength and the best performance.
5. Point out the wrong statement :
a) Data transfer protocol of DataNode does not use the RPC framework of Hadoop
b) Apache Oozie which access the services of Hadoop on behalf of end users need to be able to
impersonate end users
c) DataNode must authenticate itself by using privileged ports which are specified by
dfs.datanode.address and dfs.datanode.http.address
d) None of the mentioned
Answer:d
Explanation:Authentication is based on the assumption that the attacker won't be able to get root
privileges.

6. In order to turn on RPC authentication in Hadoop, set the value of the
hadoop.security.authentication property to :
a) zero
b) kerberos
c) false
d) None of the mentioned
Answer:b
Explanation:Security settings need to be modified properly for robustness.
7. The __________ provides a proxy between the web applications exported by an application
and an end user.
a) ProxyServer
b) WebAppProxy
c) WebProxy
d) None of the mentioned
Answer:b
Explanation:If security is enabled it will warn users before accessing a potentially unsafe web
application. Authentication and authorization using the proxy is handled just like any other
privileged web application.
8. ___________ is used by the YARN framework to define how any container is launched and
controlled.
a) Container
b) ContainerExecutor
c) Executor
d) All of the mentioned
Answer:b
Explanation: The container process has the same Unix user as the NodeManager.
9. The ____________ requires that paths including and leading up to the directories specified in
yarn.nodemanager.local-dirs be set with appropriate (755) permissions.
a) TaskController
b) LinuxTaskController
c) LinuxController
d) None of the mentioned

Answer:b
Explanation:LinuxTaskController keeps track of all paths and directories on the DataNode.
10. The configuration file must be owned by the user running :
a) DataManager
b) NodeManager
c) ValidationManager
d) None of the mentioned
Answer:b
Explanation:To recap, local file-system permissions need to be modified.
Hadoop Questions and Answers MapReduce Job-1
This set of Hadoop Interview Questions & Answers for freshers focuses on MapReduce Job.
1. __________ storage is a solution to decouple growing storage capacity from compute
capacity.
a) DataNode
b) Archival
c) Policy
d) None of the mentioned
Answer:b
Explanation:Nodes with higher density and less expensive storage with low compute power are
becoming available.
2. Point out the correct statement :
a) When there is enough space, block replicas are stored according to the storage type list
b) One_SSD is used for storing all replicas in SSD
c) Hot policy is useful only for single replica blocks
d) All of the mentioned
Answer:a
Explanation: The first phase of Heterogeneous Storage changed the DataNode storage model from a
single storage to a collection of storages.
3. ___________ is added for supporting writing single replica files in memory.
a) ROM_DISK
b) ARCHIVE
c) RAM_DISK

d) All of the mentioned


Answer:c
Explanation:DISK is the default storage type.
4. Which of the following has high storage density ?
a) ROM_DISK
b) ARCHIVE
c) RAM_DISK
d) All of the mentioned
Answer:b
Explanation:Little compute power is added for supporting archival storage.
5. Point out the wrong statement :
a) A Storage policy consists of the Policy ID
b) The storage policy can be specified using the dfsadmin -setStoragePolicy command
c) dfs.storage.policy.enabled is used for enabling/disabling the storage policy feature
d) None of the mentioned
Answer:d
Explanation: The effective storage policy can be retrieved by the dfsadmin -getStoragePolicy
command.
6. Which of the following storage policy is used for both storage and compute ?
a) Hot
b) Cold
c) Warm
d) All_SSD
Answer:a
Explanation:When a block is hot, all replicas are stored in DISK.
7. Which of the following is only for storage with limited compute ?
a) Hot
b) Cold
c) Warm
d) All_SSD

Answer:b
Explanation:When a block is cold, all replicas are stored in ARCHIVE.
8. When a block is warm, some of its replicas are stored in DISK and the remaining replicas are
stored in :
a) ROM_DISK
b) ARCHIVE
c) RAM_DISK
d) All of the mentioned
Answer:b
Explanation:Warm storage policy is partially hot and partially cold.
9. ____________ is used for storing one of the replicas in SSD.
a) Hot
b) Lazy_Persist
c) One_SSD
d) All_SSD
Answer:c
Explanation: The remaining replicas are stored in DISK.
10. ___________ is used for writing blocks with single replica in memory.
a) Hot
b) Lazy_Persist
c) One_SSD
d) All_SSD
Answer:b
Explanation: The replica is first written in RAM_DISK and then it is lazily persisted in DISK.
Hadoop Questions and Answers MapReduce Job-2
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on MapReduce Job-2.
1. _________ is a data migration tool added for archiving data.
a) Mover
b) Hiver
c) Serde

d) None of the mentioned


Answer:a
Explanation:Mover periodically scans the files in HDFS to check if the block placement satisfies
the storage policy.
2. Point out the correct statement :
a) Mover is not similar to Balancer
b) hdfs dfsadmin -setStoragePolicy puts a storage policy to a file or a directory.
c) addCacheArchive add archives to be localized
d) None of the mentioned
Answer:c
Explanation:addArchiveToClassPath(Path archive) adds an archive path to the current set of
classpath entries.
3. Which of the following is used to list out the storage policies ?
a) hdfs storagepolicies
b) hdfs storage
c) hd storagepolicies
d) All of the mentioned
Answer:a
Explanation:The hdfs storagepolicies command takes no arguments and lists out all the storage policies.
4. Which of the following statement can be used get the storage policy of a file or a directory ?
a) hdfs dfsadmin -getStoragePolicy path
b) hdfs dfsadmin -setStoragePolicy path policyName
c) hdfs dfsadmin -listStoragePolicy path policyName
d) All of the mentioned
Answer:a
Explanation: The path argument of -getStoragePolicy refers to either a directory or a file.
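Besides the command line, storage policies can also be set from Java. The sketch below assumes a Hadoop release (2.6 or later) that exposes FileSystem#setStoragePolicy; the path and the choice of the COLD policy are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoragePolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path archiveDir = new Path("/data/archive");   // hypothetical directory

        // Roughly equivalent to setting the policy from the command line
        // (hdfs dfsadmin -setStoragePolicy / hdfs storagepolicies) for this path
        fs.setStoragePolicy(archiveDir, "COLD");

        fs.close();
    }
}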
Hadoop Questions and Answers Task Execution
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Task
Execution.
1. Which of the following node is responsible for executing a Task assigned to it by the
JobTracker ?

a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
Answer:c
Explanation:TaskTracker receives the information necessary for execution of a Task from
JobTracker, Executes the Task, and Sends the Results back to JobTracker.
2. Point out the correct statement :
a) MapReduce tries to place the data and the compute as close as possible
b) Map Task in MapReduce is performed using the Mapper() function.
c) Reduce Task in MapReduce is performed using the Map() function.
d) All of the mentioned
Answer:a
Explanation:This feature of MapReduce is Data Locality.
3. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Answer:a
Explanation:Map Task in MapReduce is performed using the Map() function.
4. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Answer:a
Explanation:Reduce function collates the work and resolves the results.
5. Point out the wrong statement :
a) A MapReduce job usually splits the input data-set into independent chunks which are

processed by the map tasks in a completely parallel manner


b) The MapReduce framework operates exclusively on <key, value> pairs
c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
d) None of the mentioned
Answer:d
Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
6. Although the Hadoop framework is implemented in Java, MapReduce applications need not
be written in :
a) Java
b) C
c) C#
d) None of the mentioned
Answer:a
Explanation:Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce
applications (not based on JNI).
7. ________ is a utility which allows users to create and run jobs with any executable as the
mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer:b
Explanation:Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.
8. __________ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned

Answer:a
Explanation:Maps are the individual tasks that transform input records into intermediate records.
9. The number of maps is usually driven by the total size of :
a) inputs
b) outputs
c) tasks
d) None of the mentioned
Answer:a
Explanation:Total size of inputs means total number of blocks of the input files.
10. Running a ___________ program involves running mapping tasks on many or all of the
nodes in our cluster.
a) MapReduce
b) Map
c) Reducer
d) All of the mentioned
Answer:a
Explanation: In some applications, component tasks need to create and/or write to side-files,
which differ from the actual job-output files.
Hadoop Questions and Answers YARN-1
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on YARN-1.
1. ________ is the architectural center of Hadoop that allows multiple data processing engines.
a) YARN
b) Hive
c) Incubator
d) Chuckwa
Answer:a
Explanation:YARN is the prerequisite for Enterprise Hadoop, providing resource management
and a central platform to deliver consistent operations, security, and data governance tools across
Hadoop clusters.
2. Point out the correct statement :
a) YARN also extends the power of Hadoop to incumbent and new technologies found within the
data center

b) YARN is the central point of investment for Hortonworks within the Apache community
c) YARN enhances a Hadoop compute cluster in many ways
d) All of the mentioned
Answer:d
Explanation:YARN provides ISVs and developers a consistent framework for writing data access
applications that run IN Hadoop.
3. YARN's dynamic allocation of cluster resources improves utilization over more static _______
rules used in early versions of Hadoop.
a) Hive
b) MapReduce
c) Impala
d) All of the mentioned
Answer:b
Explanation:Multi-tenant data processing improves an enterprise's return on its Hadoop
investments.
4. The __________ is a framework-specific entity that negotiates resources from the
ResourceManager
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Answer:c
Explanation:Each ApplicationMaster has responsibility for negotiating appropriate resource
containers from the Scheduler.
5. Point out the wrong statement :
a) From the system perspective, the ApplicationMaster runs as a normal container.
b) The ResourceManager is the per-machine slave, which is responsible for launching the
applications containers
c) The NodeManager is the per-machine slave, which is responsible for launching the
applications containers, monitoring their resource usage
d) None of the mentioned

Answer:b
Explanation:ResourceManager has a scheduler, which is responsible for allocating resources to
the various applications running in the cluster, according to constraints such as queue capacities
and user limits.
6. Apache Hadoop YARN stands for :
a) Yet Another Reserve Negotiator
b) Yet Another Resource Network
c) Yet Another Resource Negotiator
d) All of the mentioned
Answer:c
Explanation:YARN is a cluster management technology.
7. MapReduce has undergone a complete overhaul in hadoop :
a) 0.21
b) 0.23
c) 0.24
d) 0.26
Answer:b
Explanation: The fundamental idea of MRv2 is to split up the two major functionalities of the
JobTracker.
8. The ____________ is the ultimate authority that arbitrates resources among all the
applications in the system.
a) NodeManager
b) ResourceManager
c) ApplicationMaster
d) All of the mentioned
Answer:b
Explanation: The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework.
9. The __________ is responsible for allocating resources to the various running applications
subject to familiar constraints of capacities, queues etc.
a) Manager
b) Master
c) Scheduler

d) None of the mentioned


Answer:c
Explanation: The Scheduler is a pure scheduler in the sense that it performs no monitoring or
tracking of status for the application.
10. The CapacityScheduler supports _____________ queues to allow for more predictable
sharing of cluster resources.
a) Networked
b) Hierarchical
c) Partition
d) None of the mentioned
Answer:b
Explanation: The Scheduler has a pluggable policy plug-in, which is responsible for partitioning
the cluster resources among the various queues, applications etc.
Hadoop Questions and Answers YARN-2
This set of Hadoop Question Bank focuses on YARN.
1. Yarn commands are invoked by the ________ script.
a) hive
b) bin
c) hadoop
d) home
Answer:b
Explanation:Running the yarn script without any arguments prints the description for all
commands.
2. Point out the correct statement :
a) Each queue has strict ACLs which controls which users can submit applications to individual
queues
b) Hierarchy of queues is supported to ensure resources are shared among the sub-queues of an
organization
c) Queues are allocated a fraction of the capacity of the grid in the sense that a certain capacity of
resources will be at their disposal
d) All of the mentioned

Answer:d
Explanation:All applications submitted to a queue will have access to the capacity allocated to
the queue.
3. The queue definitions and properties such as ________, ACLs can be changed, at runtime.
a) tolerant
b) capacity
c) speed
d) All of the mentioned
Answer:b
Explanation:Administrators can add additional queues at runtime, but queues cannot be deleted
at runtime.
4. The CapacityScheduler has a pre-defined queue called :
a) domain
b) root
c) rear
d) All of the mentioned
Answer:b
Explanation:All queues in the system are children of the root queue.
5. Point out the wrong statement :
a) The multiple of the queue capacity which can be configured to allow a single user to acquire
more resources
b) Changing queue properties and adding new queues is very simple
c) Queues cannot be deleted, only addition of new queues is supported
d) None of the mentioned
Answer:d
Explanation:You need to edit conf/capacity-scheduler.xml and run yarn rmadmin -refreshQueues
for changing queue properties.
6. The updated queue configuration should be a valid one i.e. queue-capacity at each level should
be equal to :
a) 50%
b) 75%
c) 100%

d) 0%
Answer:c
Explanation:Queues cannot be deleted, only addition of new queues is supported.
7. Users can bundle their Yarn code in a _________ file and execute it using jar command.
a) java
b) jar
c) C code
d) xml
Answer:b
Explanation:Usage: yarn jar <jar> [mainClass] args...
8. Which of the following command is used to dump the log container ?
a) logs
b) log
c) dump
d) All of the mentioned
Answer:a
Explanation:Usage: yarn logs -applicationId <application ID>.
9. __________ will clear the RMStateStore and is useful if past applications are no longer
needed.
a) -format-state
b) -form-state-store
c) -format-state-store
d) None of the mentioned
Answer:c
Explanation:-format-state-store formats the RMStateStore.
10. Which of the following command runs ResourceManager admin client ?
a) proxyserver
b) run
c) admin
d) rmadmin

Answer:d
Explanation:proxyserver command starts the web proxy server.
Hadoop Questions and Answers Mapreduce Types
This set of Hadoop Questions & Answers for experienced focuses on MapReduce Types.
1. ___________ generates keys of type LongWritable and values of type Text.
a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) None of the mentioned
Answer:b
Explanation:If K2 and K3 are the same, you don't need to call setMapOutputKeyClass().
2. Point out the correct statement :
a) The reduce input must have the same types as the map output, although the reduce output
types may be different again
b) The map input key and value types (K1 and V1) are different from the map output types
c) The partition function operates on the intermediate key
d) All of the mentioned
Answer:d
Explanation:In practice, the partition is determined solely by the key (the value is ignored).
3. In _____________, the default job is similar, but not identical, to the Java equivalent.
a) Mapreduce
b) Streaming
c) Orchestration
d) All of the mentioned
Answer:b
Explanation:MapReduce Types and Formats MapReduce has a simple model of data processing.
4. An input _________ is a chunk of the input that is processed by a single map.
a) textformat
b) split
c) datanode
d) All of the mentioned

Answer:b
Explanation:Each split is divided into records, and the map processes each record (a key-value
pair) in turn.
5. Point out the wrong statement :
a) If V2 and V3 are the same, you only need to use setOutputValueClass()
b) The overall effect of Streaming job is to perform a sort of the input
c) A Streaming application can control the separator that is used when a key-value pair is turned
into a series of bytes and sent to the map or reduce process over standard input
d) None of the mentioned
Answer:d
Explanation:If a combine function is used then it is the same form as the reduce function, except
its output types are the intermediate key and value types (K2 and V2), so they can feed the
reduce function.
6. An ___________ is responsible for creating the input splits, and dividing them into records.
a) TextOutputFormat
b) TextInputFormat
c) OutputInputFormat
d) InputFormat
Answer:d
Explanation:As a MapReduce application writer, you don't need to deal with InputSplits directly,
as they are created by an InputFormat.
7. ______________ is another implementation of the MapRunnable interface that runs mappers
concurrently in a configurable number of threads.
a) MultithreadedRunner
b) MultithreadedMap
c) MultithreadedMapRunner
d) SinglethreadedMapRunner
Answer:c
Explanation:A RecordReader is little more than an iterator over records, and the map task uses
one to generate record key-value pairs, which it passes to the map function.
8. Which of the following is the only way of running mappers ?
a) MapReducer
b) MapRunner

c) MapRed
d) All of the mentioned
Answer:b
Explanation:Having calculated the splits, the client sends them to the jobtracker.
9. _________ is the base class for all implementations of InputFormat that use files as their data
source .
a) FileTextFormat
b) FileInputFormat
c) FileOutputFormat
d) None of the mentioned
Answer:b
Explanation:FileInputFormat provides implementation for generating splits for the input files.
10. Which of the following method add a path or paths to the list of inputs ?
a) setInputPaths()
b) addInputPath()
c) setInput()
d) None of the mentioned
Answer:b
Explanation:FileInputFormat offers four static convenience methods for setting a JobConf's
input paths.
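Roughly, those convenience methods are used like this in the old mapred API; the paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathsExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(InputPathsExample.class);

        // Replace the whole list of inputs in one call
        FileInputFormat.setInputPaths(conf, new Path("/data/2014"), new Path("/data/2015"));

        // Or build the list up incrementally
        FileInputFormat.addInputPath(conf, new Path("/data/2016"));

        // String variants accept comma-separated paths
        FileInputFormat.addInputPaths(conf, "/data/2017,/data/2018");
    }
}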
Hadoop Questions and Answers Mapreduce Formats-1
This set of Hadoop Interview Questions & Answers for experienced focuses on MapReduce
Formats.
1. The split size is normally the size of an ________ block, which is appropriate for most
applications.
a) generic
b) task
c) library
d) HDFS

Answer:d
Explanation:FileInputFormat splits only large files (here, large means larger than an HDFS
block).
2. Point out the correct statement :
a) The minimum split size is usually 1 byte, although some formats have a lower bound on the
split size
b) Applications may impose a minimum split size.
c) The maximum split size defaults to the maximum value that can be represented by a Java long
type
d) All of the mentioned
Answer:a
Explanation: The maximum split size has an effect only when it is less than the block size,
forcing splits to be smaller than a block.
3. Which of the following Hadoop streaming command option parameter is required ?
a) output directoryname
b) mapper executable
c) input directoryname
d) All of the mentioned
Answer:d
Explanation:The input location, output location, and mapper executable are all required streaming parameters.
4. To set an environment variable in a streaming command use:
a) -cmden EXAMPLE_DIR=/home/example/dictionaries/
b) -cmdev EXAMPLE_DIR=/home/example/dictionaries/
c) -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
d) -cmenv EXAMPLE_DIR=/home/example/dictionaries/
Answer:c
Explanation:Environment variables are set using the -cmdenv option.
5. Point out the wrong statement :
a) Hadoop works better with a small number of large files than a large number of small files
b) CombineFileInputFormat is designed to work well with small files
c) CombineFileInputFormat does not compromise the speed at which it can process the input in a
typical MapReduce job

d) None of the mentioned


Answer:c
Explanation:If the file is very small (small means significantly smaller than an HDFS block)
and there are a lot of them, then each map task will process very little input, and there will be a
lot of them (one per file), each of which imposes extra bookkeeping overhead.
6. The ________ option allows you to copy jars locally to the current working directory of tasks
and automatically unjar the files.
a) archives
b) files
c) task
d) None of the mentioned
Answer:a
Explanation:The -archives option is also a generic option.
7. ______________ class allows the Map/Reduce framework to partition the map outputs based
on certain key fields, not the whole keys.
a) KeyFieldPartitioner
b) KeyFieldBasedPartitioner
c) KeyFieldBased
d) None of the mentioned
Answer:b
Explanation: The primary key is used for partitioning, and the combination of the primary and
secondary keys is used for sorting.
8. Which of the following class provides a subset of features provided by the Unix/GNU Sort ?
a) KeyFieldBased
b) KeyFieldComparator
c) KeyFieldBasedComparator
d) All of the mentioned
Answer:c
Explanation:Hadoop has a library class, KeyFieldBasedComparator, that is useful for many
applications.
9. Which of the following class is provided by Aggregate package ?
a) Map

b) Reducer
c) Reduce
d) None of the mentioned
Answer:b
Explanation:Aggregate provides a special reducer class and a special combiner class, and a list of
simple aggregators that perform aggregations such as sum, max, min and so on over a
sequence of values.
10.Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that
effectively allows you to process text data like the unix ______ utility.
a) Copy
b) Cut
c) Paste
d) Move
Answer:b
Explanation: The map function defined in the class treats each input key/value pair as a list of
fields.
Hadoop Questions and Answers Mapreduce Formats-2
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Mapreduce
Formats-2.
1. ___________ takes node and rack locality into account when deciding which blocks to place
in the same split
a) CombineFileOutputFormat
b) CombineFileInputFormat
c) TextFileInputFormat
d) None of the mentioned
Answer:b
Explanation:CombineFileInputFormat does not compromise the speed at which it can process the
input in a typical MapReduce job.
2. Point out the correct statement :
a) With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable
number of lines of input
b) StreamXmlRecordReader, the page elements can be interpreted as records for processing by a
mapper

c) The number depends on the size of the split and the length of the lines.
d) All of the mentioned
Answer:d
Explanation:Large XML documents that are composed of a series of records can be broken
into these records using simple string or regular-expression matching to find start and end tags of
records.
3. The key, a ____________, is the byte offset within the file of the beginning of the line.
a) LongReadable
b) LongWritable
c) LongWritable
d) All of the mentioned
Answer:b
Explanation: The value is the contents of the line, excluding any line terminators (newline,
carriage return), and is packaged as a Text object.
4. _________ is the output produced by TextOutputFormat, Hadoop's default OutputFormat.
a) KeyValueTextInputFormat
b) KeyValueTextOutputFormat
c) FileValueTextInputFormat
d) All of the mentioned
Answer:b
Explanation:To interpret such files correctly, KeyValueTextInputFormat is appropriate.
5. Point out the wrong statement :
a) Hadoop's sequence file format stores sequences of binary key-value pairs
b) SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the
sequence file's keys and values as opaque binary objects
c) SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that retrieves the
sequence file's keys and values as opaque binary objects.
d) None of the mentioned
Answer:c
Explanation:SequenceFileAsBinaryInputFormat is used for reading keys, values from
SequenceFiles in binary (raw) format.

6. __________ is a variant of SequenceFileInputFormat that converts the sequence file's keys
and values to Text objects.
a) SequenceFile
b) SequenceFileAsTextInputFormat
c) SequenceAsTextInputFormat
d) All of the mentioned
Answer:b
Explanation:With multiple reducers, records will be allocated evenly across reduce tasks, with all
records that share the same key being processed by the same reduce task.
7. __________ class allows you to specify the InputFormat and Mapper to use on a per-path
basis.
a) MultipleOutputs
b) MultipleInputs
c) SingleInputs
d) None of the mentioned
Answer:b
Explanation:One might be tab-separated plain text, the other a binary sequence file. Even if they
are in the same format, they may have different representations, and therefore need to be parsed
differently.
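A hedged per-path configuration sketch follows; the paths are hypothetical, and IdentityMapper stands in for the distinct Mapper implementations a real job would supply for each format.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultipleInputsExample {
    public static void configure(JobConf conf) {
        // One path holds tab-separated text, the other a binary sequence file;
        // each gets its own InputFormat (and, in real use, its own Mapper class).
        MultipleInputs.addInputPath(conf, new Path("/data/text"),       // hypothetical path
                KeyValueTextInputFormat.class, IdentityMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/sequence"),   // hypothetical path
                SequenceFileInputFormat.class, IdentityMapper.class);
    }
}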
8. ___________ is an input format for reading data from a relational database, using JDBC.
a) DBInput
b) DBInputFormat
c) DBInpFormat
d) All of the mentioned
Answer:b
Explanation:DBInputFormat is the most frequently used format for reading data.
9. Which of the following is the default output format ?
a) TextFormat
b) TextOutput
c) TextOutputFormat
d) None of the mentioned

Answer:c
Explanation:TextOutputFormat keys and values may be of any type.
10. Which of the following writes MapFiles as output ?
a) DBInpFormat
b) MapFileOutputFormat
c) SequenceFileAsBinaryOutputFormat
d) None of the mentioned
Answer:c
Explanation:SequenceFileAsBinaryOutputFormat writes keys and values in raw binary format
into a SequenceFile container.
Hadoop Questions and Answers Hadoop Cluster-1
This set of Questions and Answers focuses on Hadoop Cluster
1. Mapper implementations are passed the JobConf for the job via the ________ method
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
Answer:b
Explanation:The JobConfigurable.configure method is overridden so that Mapper implementations can initialize themselves.
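A sketch of that initialization hook: the mapper below reads a hypothetical job parameter in configure() and uses it in map().

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Initializes itself from the JobConf in configure(), then filters short lines.
public class ConfigurableMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, NullWritable> {

    private int minLength;

    @Override
    public void configure(JobConf job) {
        // "example.min.line.length" is a hypothetical job parameter
        minLength = job.getInt("example.min.line.length", 10);
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> output, Reporter reporter)
            throws IOException {
        if (value.getLength() >= minLength) {
            output.collect(value, NullWritable.get());
        }
    }
}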
2. Point out the correct statement :
a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by
the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value)
format
d) All of the mentioned
Answer: d
Explanation:Reporters can be used to set application-level status messages and update Counters.
3. Input to the _______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle

d) All of the mentioned


Answer:a
Explanation:In Shuffle phase the framework fetches the relevant partition of the output of all the
mappers, via HTTP.
4. The right number of reduces seems to be :
a) 0.90
b) 0.80
c) 0.36
d) 0.95
Answer:d
Explanation: The right number of reduces seems to be 0.95 or 1.75.
5. Point out the wrong statement :
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load
balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the
same key) in sort stage
Answer:a
Explanation:Reducer has 3 primary phases: shuffle, sort and reduce.
6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer:d
Explanation: The output of the reduce task is typically written to the FileSystem. The output of
the Reducer is not sorted.
7. Which of the following phases occur simultaneously ?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map

d) All of the mentioned


Answer:a
Explanation: The shuffle and sort phases occur simultaneously; while map-outputs are being
fetched they are merged.
8. Mapper and Reducer implementations can use the ________ to report progress or just indicate
that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer:c
Explanation:Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
9. __________ is a generalization of the facility provided by the MapReduce framework to
collect data output by the Mapper or the Reducer
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer:b
Explanation:Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop
framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Answer:b
Explanation:JobConf represents a MapReduce job configuration.
Hadoop Questions and Answers Hadoop Cluster-2

This set of Hadoop assessment questions focuses on Hadoop Cluster.


1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in
the nodes.
a) NoSQL
b) NewSQL
c) SQL
d) All of the mentioned
Answer:a
Explanation: NoSQL systems make the most sense whenever the application is based on data
with varying data types and the data can be stored in key-value notation.
2. Point out the correct statement :
a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
b) HDFS runs on a small cluster of commodity-class nodes
c) NEWSQL is frequently the collection point for big data
d) None of the mentioned
Answer:a
Explanation:Hadoop together with a relational data warehouse, they can form very effective data
warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64MB to 256 MB block sizes of delimited record
values with schema applied on read based on:
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer:a
Explanation:Other means of tagging the values also can be used.
4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional
Hadoop deployments
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned

Answer:b
Explanation:Enterprise data protection and security options, including file-system auditing and
data-at-rest encryption to address compliance requirements, are also provided by the Isilon solution.
5. Point out the wrong statement :
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and
highly efficient storage platform
b) Isilon's native HDFS integration means you can avoid the need to invest in a separate Hadoop
infrastructure
c) NoSQL systems do provide high latency access and accommodate less concurrent users
d) None of the mentioned
Answer:c
Explanation:NoSQL systems do provide low latency access and accommodate many concurrent
users.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to :
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned
Answer:a
Explanation:HDFS and NoSQL file systems focus almost exclusively on adding nodes to
increase performance (scale-out) but even they require node configuration with elements of scale
up.
7. Which is the most popular NoSQL database for scalable big data store with Hadoop ?
a) Hbase
b) MongoDB
c) Cassandra
d) None of the mentioned
Answer:a
Explanation:HBase is the Hadoop database: a distributed, scalable big data store that lets you
host very large tables (billions of rows by millions of columns) on clusters built
with commodity hardware.
8. The ___________ can also be used to distribute both jars and native libraries for use in the
map and/or reduce tasks.

a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Answer:c
Explanation: The child-jvm always has its current working directory added to the
java.library.path and LD_LIBRARY_PATH.
9. HBase provides ___________ like capabilities on top of Hadoop and HDFS.
a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
Answer:c
Explanation: Google Bigtable leverages the distributed data storage provided by the Google File
System.
10. _______ refers to incremental costs with no major impact on solution design, performance
and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
d) None of the mentioned
Answer:c
Explanation:Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part of a
cluster does not require additional network switches.
Hadoop Questions and Answers HDFS Maintenance
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on HDFS
Maintenance.
1. Which of the following is a common hadoop maintenance issue ?
a) Lack of tools
b) Lack of configuration management
c) Lack of web interface
d) None of the mentioned

Answer:b
Explanation:Without a centralized configuration management framework, you end up with a
number of issues that can cascade just as your usage picks up.
2. Point out the correct statement :
a) RAID is turned off by default
b) Hadoop is designed to be a highly redundant distributed system
c) Hadoop has a networked configuration system
d) None of the mentioned
Answer:b
Explanation:Hadoop deployment is sometimes difficult to implement.
3. ___________ mode allows you to suppress alerts for a host, service, role, or even the entire
cluster.
a) Safe
b) Maintenance
c) Secure
d) All of the mentioned
Answer:b
Explanation:Maintenance mode can be useful when you need to take actions in your cluster and
do not want to see the alerts that will be generated due to those actions.
4. Which of the following is a configuration management system ?
a) Alex
b) Puppet
c) Acem
d) None of the mentioned
Answer:b
Explanation:Administrators may use configuration management systems such as Puppet and
Chef to manage processes.
5. Point out the wrong statement :
a) If you set the HBase service into maintenance mode, then its roles (HBase Master and all
Region Servers) are put into effective maintenance mode
b) If you set a host into maintenance mode, then any roles running on that host are put into
effective maintenance mode
c) Putting a component into maintenance mode prevent events from being logged

d) None of the mentioned


Answer:c
Explanation:Maintenance mode only suppresses the alerts that those events would otherwise
generate.
6. Which of the following is a common reason to restart hadoop process ?
a) Upgrade Hadoop
b) React to incidents
c) Remove worker nodes
d) All of the mentioned
Answer:d
Explanation: The most common reason administrators restart Hadoop processes is to enact
configuration changes.
7. __________ Manager's Service feature monitors dozens of service health and performance
metrics about the services and role instances running on your cluster.
a) Microsoft
b) Cloudera
c) Amazon
d) None of the mentioned
Answer:b
Explanation:Manager's Service feature presents health and performance data in a variety of
formats.
8. Which of the tab shows all the role instances that have been instantiated for this service ?
a) Service
b) Status
c) Instance
d) All of the mentioned
Answer:c
Explanation: The Instances page displays the results of the configuration validation checks it
performs for all the role instances for this service.
9. __________ is a standard Java API for monitoring and managing applications.
a) JVX
b) JVM

c) JMX
d) None of the mentioned
Answer:c
Explanation:Hadoop includes several managed beans (MBeans), which expose Hadoop metrics
to JMX-aware applications.
10. NameNode is monitored and upgraded in a __________ transition.
a) safemode
b) securemode
c) servicemode
d) None of the mentioned
Answer:a
Explanation: The HDFS service has some unique functions that may result in additional
information on its Status and Instances pages.
Hadoop Questions and Answers Monitoring HDFS
This set of Hadoop Multiple Choice Questions & Answers (MCQs) focuses on Monitoring
HDFS.
1. For YARN, the ___________ Manager UI provides host and port information.
a) Data Node
b) NameNode
c) Resource
d) Replication
Answer:c
Explanation:All the metadata related to HDFS including the information about data nodes, files
stored on HDFS, and Replication, etc. are stored and maintained on the NameNode.
2. Point out the correct statement :
a) The Hadoop framework publishes the job flow status to an internally running web server on
the master nodes of the Hadoop cluster
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
tolerance
d) None of the mentioned

Answer:a
Explanation: The web interface for the Hadoop Distributed File System (HDFS) shows
information about the NameNode itself.
3. For ________, the HBase Master UI provides information about the HBase Master uptime.
a) HBase
b) Oozie
c) Kafka
d) All of the mentioned
Answer:a
Explanation:HBase Master UI provides information about the number of live, dead and
transitional servers, logs, ZooKeeper information, debug dumps, and thread stacks.
4. ________ NameNode is used when the Primary NameNode goes down.
a) Rack
b) Data
c) Secondary
d) None of the mentioned
Answer:c
Explanation:Secondary namenode is used for all time availability and reliability.
5. Point out the wrong statement :
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
Answer:d
Explanation:NameNode is aware of the files to which the blocks stored on it belong to.
6. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned

Answer:a
Explanation:HDFS can be used for storing archive data since it is cheaper, as HDFS allows
storing the data on low-cost commodity hardware while ensuring a high degree of fault tolerance.
7. The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
Answer:d
Explanation:Data is replicated across different DataNodes to ensure a high degree of fault tolerance.
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
Answer:a
Explanation: A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has
more than one DataNode, with data replicated across them.
9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) HDFS Shell
b) FS Shell
c) DFS Shell
d) None of the mentioned
Answer:b
Explanation: The File System (FS) shell includes various shell-like commands that directly
interact with the Hadoop Distributed File System.
10. During start up, the ___________ loads the file system state from the fsimage and the edits
log file.
a) DataNode
b) NameNode
c) ActionNode

d) None of the mentioned


Answer:b
Explanation:HDFS is implemented in Java, so any computer that can run Java can host a
NameNode/DataNode.
