OBJECTIVES: 3003
To understand the competitive advantages of big data analytics
To understand the big data frameworks
To learn data analysis methods
To learn stream computing
To gain knowledge on Hadoop related tools such as HBase, Cassandra, Pig, and Hive for big
data analytics.
Unit III
Part A
1. What is classification?
Classification is a general process related to categorization, the process in which ideas
and objects are recognized, differentiated, and understood.
A classification system is an approach to accomplishing classification.
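A minimal sketch of a classifier may help make this concrete. The data and the 1-nearest-neighbour rule below are illustrative assumptions, not from the text: each object is a feature vector, and it is assigned the label of the most similar known object.

```python
from math import dist

# Toy labeled data: (feature vector, class label). Values are made up.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]

def classify(point):
    """1-nearest-neighbour: assign the label of the closest training point."""
    return min(training, key=lambda t: dist(t[0], point))[1]

print(classify((1.1, 0.9)))  # near the "A" examples
print(classify((5.0, 5.0)))  # near the "B" examples
```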
2. What is clustering?
Clustering can be considered the most important unsupervised learning problem; as with every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
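This idea can be sketched with a toy run of k-means, one common clustering algorithm (the one-dimensional data and the parameter choices below are made up for illustration):

```python
import random

random.seed(0)  # make the illustrative run reproducible

def kmeans(points, k, iters=20):
    """A minimal k-means sketch: alternate assignment and centroid update."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]   # two obvious groups
print(kmeans(data, 2))                     # centroids near 1.0 and 10.0
```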
3. What are the different types of regression models?
Linear Regression. It is one of the most widely known modeling techniques.
Logistic Regression
Polynomial Regression
Stepwise Regression
Ridge Regression
Lasso Regression.
Elastic Net Regression
4. What is the difference between correlation and regression?
Correlation and linear regression are not the same. Correlation quantifies the degree to
which two variables are related. Correlation does not fit a line through the data points. You
are simply computing a correlation coefficient (r) that tells you how much one variable tends to
change when the other one does.
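A small worked sketch of this distinction (the data are invented): the correlation coefficient r measures how strongly x and y move together, while regression goes further and fits a line through the points.

```python
from statistics import mean, pstdev

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]           # roughly y = 2x

n = len(x)
mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
r = cov / (pstdev(x) * pstdev(y))       # correlation: strength of the relation
slope = cov / pstdev(x) ** 2            # regression: the fitted line's slope
intercept = my - slope * mx

print(round(r, 3))                      # close to 1: strong linear relation
print(round(slope, 2), round(intercept, 2))
```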
5. What is rule mining?
Association rule mining is a procedure which is meant to find frequent patterns,
correlations, associations, or causal structures from data sets found in various kinds of databases
such as relational databases, transactional databases, and other forms of data repositories.
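A toy sketch of the two standard measures behind association rules, support and confidence (the basket data are invented for illustration):

```python
# Toy market-basket transactions, the classic setting for rule mining.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) for the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "milk"}))        # 2 of 4 baskets
print(confidence({"bread"}, {"milk"}))   # 2 of the 3 bread baskets
```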
8. What is regression?
• Predicts the quantity or probability of an outcome
• What is the likelihood of heart attack, given age, weight, …?
• What is the expected profit a customer will generate?
• What is the forecasted price of a stock?
• Algorithms: Logistic, Linear, Polynomial, Transform
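As an illustrative sketch of the heart-attack bullet above: logistic regression turns a linear score into a probability. The coefficients below are hypothetical, made up for illustration, not fitted to any data.

```python
from math import exp

def logistic(z):
    """Squash a linear score into a probability in (0, 1)."""
    return 1 / (1 + exp(-z))

# Hypothetical coefficients: the linear score rises with age and weight,
# and logistic() converts it into a predicted risk.
def heart_attack_risk(age, weight_kg):
    z = -10.0 + 0.08 * age + 0.04 * weight_kg
    return logistic(z)

print(round(heart_attack_risk(40, 70), 3))   # younger, lighter: lower risk
print(round(heart_attack_risk(70, 100), 3))  # older, heavier: higher risk
```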
9. What is a Real Time Analytics Platform (RTAP)?
Real Time Analytics Platform (RTAP) analyzes data, correlates and predicts outcomes on
a real time basis. The platform enables enterprises to track things in real time on a worldwide
basis and helps in timely decision making. The platform enables us to build a range of
powerful analytic applications.
Unit V
Part A
1. What is NoSQL?
A NoSQL (originally referring to "non SQL" or "non relational") database provides a
mechanism for storage and retrieval of data that is modeled in means other than the tabular
relations used in relational databases. NoSQL databases are increasingly used in big data and
real-time web applications.
2. Why do we need NoSQL?
A relational database may require vertical and, sometimes, horizontal expansion of servers
to expand as data or processing requirements grow. An alternative, more cloud-friendly approach
is to employ NoSQL. NoSQL is a whole new way of thinking about a database: it is not a
relational database.
3. What is HBase?
HBase is an open-source, non-relational, distributed database modeled after Google's
Bigtable and is written in Java. It is developed as part of Apache Software Foundation's Apache
Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing Bigtable-like capabilities for Hadoop.
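To illustrate the Bigtable-like data model (this is a toy in-memory sketch, not the real HBase API): rows are addressed by a row key, cells by "family:qualifier" columns, and each cell keeps timestamped versions.

```python
import time
from collections import defaultdict

class ToyHBaseTable:
    """Toy sketch of HBase's data model: row key -> 'family:qualifier'
    -> list of (timestamp, value) versions. Assumes puts arrive in
    timestamp order, so the newest version sits at the front."""
    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self.rows[row].setdefault(column, []).insert(0, (ts, value))

    def get(self, row, column):
        """Return the newest version, as a default HBase get would."""
        return self.rows[row][column][0][1]

t = ToyHBaseTable()
t.put("user1", "info:name", "Alice", ts=1)
t.put("user1", "info:name", "Alicia", ts=2)   # a newer version
print(t.get("user1", "info:name"))
```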
4. What is the difference between HBase and Hive?
Despite providing SQL functionality, Hive does not provide interactive querying yet - it
only runs batch processes on Hadoop. Apache HBase is a NoSQL key/value store which runs on
top of HDFS. Unlike Hive, HBase runs its operations in real time on its own database rather
than as MapReduce jobs.
5. What is the difference between Pig and Hive?
Depending on the purpose and type of data, you can choose either the Hive Hadoop
component or the Pig Hadoop component based on the following differences: 1) The Hive
Hadoop component is used mainly by data analysts, whereas the Pig Hadoop component is
generally used by researchers and programmers.
6. What is Pig in Hadoop?
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data
workers to write complex data transformations without knowing Java. Pig's simple SQL-like
scripting language is called Pig Latin, and appeals to developers already familiar with scripting
languages and SQL.
7. What is Apache Pig?
Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.
Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which
makes MapReduce programming high level, similar to that of SQL for relational database
management systems.
8. What are Pig, Hive, and HBase?
PIG is used for data transformation tasks. If you have a file and want to extract useful
information from it, join two files, or perform any other transformation, use PIG. HIVE is used
to query these files by defining a “virtual” table and running SQL-like queries on those
tables. HBase is a full-fledged NoSQL database.
9. What is Cassandra Client?
cassandra-client is a Node.js CQL 2 driver for Apache Cassandra 0.8 and later. CQL is a
query language for Apache Cassandra. You use it in much the same way you would use SQL
for a relational database. The Cassandra documentation can help you learn the syntax.
10. List the types of built-in operators in Hive.
Hive provides built-in operators of four types: relational operators, arithmetic operators,
logical operators, and operators on complex types.
Unit 1
Part B
1. Explain the structure of big data.
As you read about big data, you will come across a lot of discussion on the concept of data
being structured, unstructured, semi-structured, or even multi-structured. Big data is often
described as unstructured and traditional data as structured. The lines aren’t as clean as such labels
suggest, however. Let’s explore these three types of data structure from a layman’s perspective.
Highly technical details are out of scope for this book. Most traditional data sources are fully in
the structured realm. This means traditional data sources come in a clear, predefined format that is
specified in detail. There is no variation from the defined formats on a day-to-day or update-to-
update basis. For a stock trade, the first field received might be a date in a MM/DD/YYYY format.
Next might be an account number in a 12-digit numeric format. Next might be a stock symbol that
is a three- to five-digit character field. And so on. Every piece of information included is known
ahead of time, comes in a specified format, and occurs in a specified order. This makes it easy to
work with.
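A short sketch of why such fully structured records are easy to process: every field has a known position and format, so validation and parsing are trivial. The exact record below is invented, following the hypothetical stock-trade layout described above.

```python
# A hypothetical fully structured stock-trade record: date in MM/DD/YYYY,
# a 12-digit account number, a 3-to-5-character symbol, then a price.
record = "06/15/2023 000123456789 IBM 150.25"

date, account, symbol, price = record.split()
assert len(account) == 12 and account.isdigit()   # format is known ahead of time
assert 3 <= len(symbol) <= 5

print(date, symbol, float(price))
```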
Unstructured data sources are those that you have little or no control over. You are going
to get what you get. Text data, video data, and audio data all fall into this classification. A picture
has a format of individual pixels set up in rows, but how those pixels fit together to create the
picture seen by an observer is going to vary substantially in each case. There are sources of big
data that are truly unstructured such as those preceding. However, most data is at least semi-
structured. Semi-structured data has a logical flow and format to it that can be understood, but the
format is not user-friendly. Sometimes semi-structured data is referred to as multi-structured data.
There can be a lot of noise or unnecessary data intermixed with the nuggets of high value in
such a feed. Reading semi-structured data to analyze it isn’t as simple as specifying a fixed file
format. To read semi-structured data, it is necessary to employ complex rules that dynamically
determine how to proceed after reading each piece of information. Web logs are a perfect example
of semi-structured data. Web logs are pretty ugly when you look at them; however, each piece of
information does, in fact, serve a purpose of some sort. Whether any given piece of a web log
serves your purposes is another question. See Figure 1.1 for an example of a raw web log.
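A sketch of what those “complex rules” look like in practice: parsing one web-log line (the line below is made up, in the common log format) takes a regular expression rather than a fixed-width layout, yet each piece still carries meaning.

```python
import re

# A made-up line in the common web-log format: semi-structured data that
# needs pattern rules, not fixed field positions, to read.
line = '192.168.1.9 - - [15/Jun/2023:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 5123'

pattern = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]+" (\d{3}) (\d+)')
ip, when, method, path, status, size = pattern.match(line).groups()

print(ip, method, path, status)
```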
2. Explain matrix-vector multiplication by MapReduce.
If n = 100, we do not want to use a DFS or MapReduce for this calculation. But this sort of
calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n
is in the tens of billions. Let us first assume that n is large, but not so large that vector v cannot
fit in main memory and thus be available to every Map task. The matrix M and the vector v each
will be stored in a file of the DFS. We assume that the row-column coordinates of each matrix
element will be discoverable, either from its position in the file, or because it is stored with explicit
coordinates, as a triple (i, j, mij). We also assume the position of element vj in the vector v will be
discoverable in the analogous way.
The Map Function: The Map function is written to apply to one element of M. However, if v is
not already read into main memory at the compute node executing a Map task, then v is first read,
in its entirety, and subsequently will be available to all applications of the Map function performed
at this Map task. Each Map task will operate on a chunk of the matrix M. From each matrix element
mij it produces the key-value pair (i, mij vj). Thus, all terms of the sum that make up the component
xi of the matrix-vector product will get the same key, i.
The Reduce Function: The Reduce function simply sums all the values associated with a given
key i. The result will be the pair (i, xi).
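The Map and Reduce functions described above can be sketched in a single process (a simulation for illustration, not a real Hadoop job), with M stored as (i, j, mij) triples and v small enough to share with every Map task:

```python
from collections import defaultdict

# x = Mv for a 2x2 example: M as explicit (i, j, m_ij) triples, v in memory.
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [10.0, 20.0]

def map_fn(i, j, mij):
    # Each matrix element contributes one term of component x_i,
    # keyed by the row index i.
    return (i, mij * v[j])

def reduce_fn(pairs):
    # Sum all values sharing the same key i, yielding (i, x_i).
    sums = defaultdict(float)
    for i, term in pairs:
        sums[i] += term
    return sorted(sums.items())

print(reduce_fn(map_fn(*t) for t in M))
```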
2.3.2 If the Vector v Cannot Fit in Main Memory
However, it is possible that the vector v is so large that it will not fit in its entirety in main
memory. It is not required that v fit in main memory at a compute node, but if it does not, then
there will be a very large number of disk accesses as we move pieces of the vector into main
memory to multiply components by elements of the matrix. Thus, as an alternative, we can divide
the matrix into vertical stripes of equal width and divide the vector into an equal number of
horizontal stripes, of the same height. Our goal is to use enough stripes so that the portion of the
vector in one stripe can fit conveniently into main memory at a compute node. Figure 2.4
suggests what the partition looks like if the matrix and vector are each divided into five stripes.
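The striping idea can be sketched as follows (a single-process illustration with two stripes rather than five; the matrix and vector are made up):

```python
# Split M into vertical stripes and v into matching horizontal stripes,
# so each task needs only its own small slice of v in memory.
M = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
v = [1, 1, 1, 1]
stripes = 2
w = len(v) // stripes                      # width of each stripe

partials = []
for s in range(stripes):
    v_s = v[s * w:(s + 1) * w]             # the slice of v this task loads
    M_s = [row[s * w:(s + 1) * w] for row in M]
    partials.append([sum(m * x for m, x in zip(row, v_s)) for row in M_s])

# Summing the per-stripe partial results gives the full product Mv.
x = [sum(p[i] for p in partials) for i in range(len(M))]
print(x)
```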