Section 1, Understanding Microsoft big data solutions, provides an overview of the principles
and benefits of big data solutions, and the differences between these and the more traditional
database systems. It includes general guidance for planning and designing big data solutions,
exploring topics such as defining the goals and locating data sources in more depth. It will help
you decide where, when, and how you might benefit from adopting a big data solution.
This section also discusses Azure HDInsight, and its place within the comprehensive Microsoft
data platform.
Section 2, Designing big data solutions using HDInsight, contains guidance for designing
solutions to meet the typical batch processing use cases inherent in big data processing. Even if
you choose not to use HDInsight as the platform for your own solution, you will find the
information in this section useful.
Section 3, Implementing big data solutions using HDInsight, explores a range of topics such as
the options and techniques for loading data into an HDInsight cluster, the tools you can use in
HDInsight to process data in a cluster, and the ways you can transfer the results from HDInsight
into analytical and visualization tools to generate reports and charts, or export the results into
existing data stores such as databases, data warehouses, and enterprise BI systems. This section
also contains useful information to help you automate all or part of the process, and to manage
and monitor your solutions.
The guide concentrates on the Azure HDInsight service, but much of the information is equally
applicable to big data solutions built on any platform, and with any Hadoop-based framework.
This guide is based on the version 3.0 (March 2014) release of HDInsight on Azure, but also includes
some of the preview features that are available in later versions. Earlier and later releases of HDInsight
may differ from the version described in this guide. For more information, see What's new in the
Hadoop cluster versions provided by HDInsight? To sign up for the Azure service, go to HDInsight
service home page.
Executives, information officers, and technology managers. The discussion of the principles
and benefits of Hadoop-based big data solutions, defining the goals for solutions, and
identifying analysis requirements in section 1, Understanding Microsoft big data solutions, of
this guide demonstrates where, when, and how a big data solution would benefit the
organization.
Architects and system designers. The exploration of the typical use cases and scenarios for big
data batch processing solutions in section 2, Designing big data solutions using HDInsight, of
this guide provides valuable assistance in designing systems that will produce the desired
results.
Developers and database administrators. The explanation of topics such as loading, querying,
and manipulating data; transferring the results into analytical and visualization tools; exporting
the results into existing data stores and enterprise BI systems; and automating solutions in
section 3, Implementing big data solutions using HDInsight, of this guide will help developers
and DBAs to get started implementing and working with big data solutions.
other sources and combined with your own data to help you better understand your customers, your
users, and your business; and to help you plan for the future.
Table of Contents
Understanding Microsoft big data solutions
    What is big data?
    What is Microsoft HDInsight?
    Planning a big data solution
Designing big data solutions using HDInsight
    Use case 1: Iterative exploration
    Use case 2: Data warehouse on demand
    Use case 3: ETL automation
    Use case 4: BI integration
    Scenario 1: Iterative exploration
    Scenario 2: Data warehouse on demand
    Scenario 3: ETL automation
    Scenario 4: BI integration
Implementing big data solutions using HDInsight
    Collecting and loading data into HDInsight
    Processing, querying, and transforming data using HDInsight
    Consuming and visualizing data from HDInsight
    Building end-to-end solutions using HDInsight
Appendix A - Tools and technologies reference
Big data is not a stand-alone technology, or just a new type of data querying mechanism. It is a significant
part of the Microsoft Business Intelligence (BI) and Analytics product range, and a vital component of
the Microsoft data platform. Figure 1 shows an overview of the Microsoft data platform and enterprise
BI product range, and the roles big data and HDInsight play within this.
Figure 1 - The role of big data and HDInsight within the Microsoft Data Platform
The figure does not include all of Microsoft's data-related products, and it doesn't attempt to show
physical data flows. For example, data can be ingested into HDInsight without going through an
integration process, and a data store could be the data source for another process. Instead, the figure
illustrates as layers the applications, services, tools, and frameworks that work together to allow you to
capture data, store it, process it, and visualize the information it contains. Notice that the big data
technologies span both the Integration and Data stores layers.
Microsoft implements Hadoop-based big data solutions using the Hortonworks Data Platform (HDP),
which is built on open source components in conjunction with Hortonworks. HDP is 100%
compatible with Apache Hadoop and with open source community distributions. All of
the components are tested in typical scenarios to ensure that they work together correctly, and that
there are no versioning or compatibility issues. Developments are fed back into the community through
Hortonworks to maintain compatibility and to support the open source effort.
Microsoft and Hortonworks offer three distinct solutions based on HDP:
HDInsight. This is a cloud-hosted service available to Azure subscribers that uses Azure clusters
to run HDP, and integrates with Azure storage. For more information about HDInsight see What
is Microsoft HDInsight? and the HDInsight page on the Azure website.
Hortonworks Data Platform (HDP) for Windows. This is a complete package that you can install
on Windows Server to build your own fully-configurable big data clusters based on Hadoop. It
can be installed on physical on-premises hardware, or in virtual machines in the cloud. For more
information see Microsoft Server and Cloud Platform on the Microsoft website and Hortonworks
Data Platform.
Microsoft Analytics Platform System. This is a combination of the massively parallel processing
(MPP) engine in Microsoft Parallel Data Warehouse (PDW) with Hadoop-based big data
technologies. It uses the HDP to provide an on-premises solution that contains a region for
Hadoop-based processing, together with PolyBase, a connectivity mechanism that integrates
the MPP engine with HDP, Cloudera, and remote Hadoop-based services such as HDInsight. It
allows data in Hadoop to be queried and combined with on-premises relational data, and data
to be moved into and out of Hadoop. For more information see Microsoft Analytics Platform
System.
Simple iterative querying and visualization. You may simply want to load some unstructured
data into HDInsight, combine it with data from external sources such as Azure Marketplace, and
then analyze and visualize the results using Microsoft Excel and Power View. In this case, data
from the data source will flow into HDInsight where queries and transformations generate the
required result. This result flows through an ODBC connector or directly from Azure blob
storage into a visualization tool such as Excel, where it is combined with data loaded directly by
Excel from Azure Marketplace.
Handling streaming data and exposing it through SharePoint. In this case streaming data
collected from device sensors is fed through Microsoft StreamInsight or Azure Intelligent
Systems Service for categorization and filtering, and can be used to display real-time values on a
dashboard or to trigger changes in a process. The data is then transferred into an Azure
HDInsight cluster for use in historical analysis. The output from queries that are run as periodic
batch jobs in HDInsight is integrated at the corporate data model level with a data warehouse,
and ultimately delivered to users through SharePoint libraries and web parts, making it
available for use in reports, and in data analysis and visualization tools such as Excel.
Exposing data as a business data source for an existing data warehouse system. This might be
to produce a specific set of management reports on a regular basis. Semi-structured or
unstructured data is loaded into HDInsight, queried and transformed within HDInsight,
validated and cleansed using Data Quality Services, and stored in your data warehouse tables
ready for use in reports. You may also use Master Data Services to ensure consistency between
data representations of business elements across your organization.
These are just three examples of the countless permutations and capabilities of the Microsoft data
platform and HDInsight. Your own requirements will differ, but the combination of services and tools
makes it possible to implement almost any kind of big data solution using the elements of the platform.
You will see many examples of the way that these applications, tools, and services work together in this
guide.
The topic How do big data solutions work? explores the mechanisms that Hadoop-based solutions can
use to analyze data.
Big data solutions can help you to discover information that you didn't know existed, complement your
existing knowledge about your business and your customers, and boost competitiveness. By using the
cloud as the data store and HDInsight as the query mechanism you benefit from very affordable storage
costs (at the time of writing, 1TB of Azure storage costs less than $40 per month), and the flexibility and
elasticity of the pay-as-you-go model where you only pay for the resources you use.
More information
For an overview and description of Microsoft big data see Microsoft Server and Cloud Platform.
For more information about HDInsight see the HDInsight page on the Azure website.
Documentation for HDInsight is available on the Tutorials and Guides page.
To sign up for Azure services go to the HDInsight Service page.
The page Get started using Azure HDInsight will help you begin working with HDInsight.
The official site for the Apache Hadoop framework and tools is the Apache Hadoop website.
You can download the free eBook Introducing Microsoft Azure HDInsight from the Microsoft Press
Blog.
There are also many popular blogs that cover big data and HDInsight topics:
Hortonworks: http://hortonworks.com/blog/
Volume: Big data solutions typically store and query hundreds of terabytes of data, and the
total volume is probably growing by ten times every five years. Storage must be able to manage
this volume, be easily expandable, and work efficiently across distributed systems. Processing
systems must be scalable to handle increasing volumes of data, typically by scaling out across
multiple machines.
Variety: New data frequently does not match any existing data schema. It may also be
semi-structured or unstructured data. This means that applying schemas to the data before or
during storage is no longer a practical proposition.
Velocity: Data is being collected at an increasing rate from many new types of devices, from a
fast-growing number of users, and from an increasing number of devices and applications per
user. The design and implementation of storage must be able to manage this efficiently, and
processing systems must be able to return results within an acceptable timeframe.
The quintessential aspect of big data is not the data itself; it's the ability to discover useful information
hidden in the data. Big data is not just Hadoop; solutions may use traditional data management
systems such as relational databases and other types of data store. It's really all about the analytics
that a big data solution can empower.
This section of the guide explores some of the basic features of big data solutions. If you are not familiar
with the concepts of big data, when it is useful, and how it works, you will find the following topics
helpful:
Finding hidden insights in large stores of data. For example, organizations want to know how
their products and services are perceived in the market, what customers think of the
organization, whether advertising campaigns are working, and which facets of the organization
are (or are not) achieving their aims. Organizations typically collect data that is useful for
generating business intelligence (BI) reports and for providing input to management decisions.
However, they are increasingly implementing mechanisms that collect other types of data such
as sentiment data (emails, comments from web site feedback mechanisms, and tweets that
are related to the organization's products and services), click-through data, information from
sensors in users' devices (such as location data), and website log files.
Extracting vital management information. The vast repositories of data often contain useful,
and even vital information that can be used for product and service planning, coordinating
advertising campaigns, improving customer service, or as an input to reporting systems. This
information is also very useful for predictive analysis such as estimating future profitability in a
financial scenario, or for an insurance company to predict the possibility of accidents and
claims. Big data solutions allow you to store and extract all this information, even if you don't
know when or how you will use the data at the time you are collecting it.
Successful organizations typically measure performance by discovering the customer value that each
part of their operation generates. Big data solutions provide a way to help you discover value, which
often cannot be measured just through traditional business methods such as cost and revenue
analysis.
Volume: Big data solutions are designed and built to store and process hundreds of terabytes,
or even petabytes of data in a way that can dramatically reduce storage cost, while still being
able to generate BI and comprehensive reports.
Variety: Organizations often collect unstructured data, which is not in a format that suits
relational database systems. Some data, such as web server logs and responses to
questionnaires, may be preprocessed into the traditional row and column format. However,
data such as emails, tweets, and web site comments or feedback, are semi-structured or even
unstructured data. Deciding how to store this data using traditional database systems is
problematic, and may result in loss of useful information if the data must be constricted to a
specific schema when it is stored.
Big data solutions typically target scenarios where there is a huge volume of unstructured or
semi-structured data that must be stored and queried to extract business intelligence.
Typically, the majority of data currently stored in big data solutions is unstructured or semi-structured.
Velocity: The rate at which data arrives may make storage in an enterprise data warehouse
problematic, especially where formal preparation processes such as examining, conforming,
cleansing, and transforming the data must be accomplished before it is loaded into the data
warehouse tables.
The combination of all these factors means that, in some circumstances, a big data batch processing
solution may be a more practical proposition than a traditional relational database system. However, as
big data solutions have continued to evolve it has become clear that they can also be used in a
fundamentally different context: to quickly get insights into data, and to provide a platform for further
investigation in a way that just isn't possible with traditional data storage, management, and querying
tools.
Figure 1 demonstrates how you might go from a semi-intuitive guess at the kind of information that
might be hidden in the data, to a process that incorporates that information into your business domain.
As an example, you may want to explore the postings by users of a social website to discover what they
are saying about your company and its products or services. Using a traditional BI system would mean
waiting for the database architect and administrator to update the schemas and models, cleanse and
import the data, and design suitable reports. But it's unlikely that you'll know beforehand if the data is
actually capable of providing any useful information, or how you might go about discovering it. By using
a big data solution you can investigate the data by asking any questions that may seem relevant. If you
find one or more that provide the information you need you can refine the queries, automate the
process, and incorporate it into your existing BI systems.
Big data solutions aren't all about business topics such as customer sentiment or web server log file
analysis. They have many diverse uses around the world and across all types of applications. Police
forces are using big data techniques to predict crime patterns, researchers are using them to explore
the human genome, particle physicists are using them to search for information about the structure of
matter, and astronomers are using them to plot the entire universe. Perhaps the last of these really is
a "big" big data solution!
Figure 2 - Some differences between relational databases and big data batch processing solutions
Modern data warehouse systems typically use high speed fiber networks, in-memory caching, and
indexes to minimize data transfer delays. However, in a big data solution only the results of the
distributed query processing are passed across the cluster network to the node that will assemble them
into a final results set. Under ideal conditions, performance during the initial stages of the query is
limited only by the speed and capacity of connectivity to the co-located disk subsystem, and this initial
processing occurs in parallel across all of the cluster nodes.
The servers in a cluster are typically co-located in the same datacenter and connected over a
low-latency, high-bandwidth network. However, big data solutions can work well even without a high
capacity network, and the servers can be more widely distributed, because the volume of data moved
over the network is much less than in a traditional relational database cluster.
The ability to work with highly distributed data and simple file formats also opens up opportunities for
more efficient and more comprehensive data collection. For example, services and applications can
store data in any of the predefined distributed locations without needing to preprocess it or execute
queries that can absorb processing capacity. Data is simply appended to the files in the data store. Any
processing required on the data is done when it is queried, without affecting the original data and
risking losing valuable information.
Queries to extract information in a big data solution are typically batch operations that, depending on
the data volume and query complexity, may take some time to return a final result. However, when you
consider the volumes of data that big data solutions can handle, the fact that queries run as multiple
tasks on distributed servers does offer a level of performance that may not be achievable by other
methods. While it is possible to perform real-time queries, typically you will run the query and store the
results for use within your existing BI tools and analytics systems. This means that, unlike most SQL
queries used with relational databases, big data queries are typically not executed repeatedly as part of
an application's execution, and so batch operation is not a major disadvantage.
Big data systems are also designed to be highly resilient against failure of storage, networks, and
processing. The distributed processing and replicated storage model is fault-tolerant, and allows easy
re-execution of individual stages of the process. The capability for easy scaling of resources also helps to
resolve operational and performance issues.
The following table summarizes the major differences between a big data solution and existing
relational database systems.
Feature          | Relational database          | Big data solution
Data structure   | Structured                   | Semi-structured and unstructured
Data integrity   | High; transactional updates  | Lower; data is typically appended, not updated
Schema           | Static; required on write    | Dynamic; applied when the data is read
Storage volume   | Gigabytes to terabytes       | Terabytes to petabytes
Scalability      | Limited or none              | Scales out across many servers
Economics        | Higher cost per terabyte     | Low-cost storage on commodity hardware
The Hadoop kernel, or core package, containing the Hadoop distributed file system (HDFS), the
map/reduce framework, and common routines and utilities.
A runtime resource manager that allocates tasks, and executes queries (such as map/reduce
jobs) and other applications. This is usually implemented through the YARN framework,
although other resource managers such as Mesos are available.
Other resources, tools, and utilities that run under the control of the resource manager to
support tasks such as managing data and running queries or other jobs on the data.
Notice that map/reduce is just one application that you can run on a Hadoop cluster. Several query,
management, and other types of applications are available or under development. Examples are:
Lasr: An in-memory analytics processor for tasks that are not well suited to map/reduce
processing.
Reef: A query mechanism designed to implement iterative algorithms for graph analytics and
machine learning.
Storm: A distributed real-time computation system for processing fast, large streams of data.
In addition there are many other open source components and tools that can be used with Hadoop. The
Apache Hadoop website lists the following:
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast, general-use compute engine with a simple and expressive programming model.
Tez: A generalized data-flow programming framework for executing both batch and interactive
tasks.
A list of commonly used tools and frameworks for big data projects based on Hadoop can be found in
Appendix A - Tools and technologies reference. Some of these tools are not supported on HDInsight
for more details see What is Microsoft HDInsight?
Figure 1 shows an overview of a typical Hadoop-based big data mechanism.
The cluster
In Hadoop, a cluster of servers stores the data using HDFS, and processes it. Each member server in the
cluster is called a data node, and contains an HDFS data store and a query execution engine. The cluster
is managed by a server called the name node that has knowledge of all the cluster servers and the parts
of the data files stored on each one. The name node server does not store any of the data to be
processed, but is responsible for storing vital metadata about the cluster and the location of each block
of the source data, directing clients to the other cluster members, and keeping track of the state of each
one by communicating with a software agent running on each server.
To store incoming data, the name node server directs the client to the appropriate data node server.
The name node also manages replication of data files across all the other cluster members that
communicate with each other to replicate the data. The data is divided into blocks, and three copies of
each block are stored across the cluster servers in order to provide resilience against failure and data
loss (the block size and the number of replicated copies are configurable for the cluster).
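The name node's bookkeeping can be pictured with a small sketch. This is illustrative Python only: the round-robin placement policy, block size, and node names are invented for the example, and real HDFS placement is rack-aware and configurable.

```python
# A toy sketch of how a name node might track block placement.
# BLOCK_SIZE and REPLICAS stand in for cluster configuration values.

BLOCK_SIZE = 4          # bytes per block (tiny, for illustration)
REPLICAS = 3            # copies of each block kept in the cluster

def place_blocks(data: bytes, data_nodes: list[str]) -> dict[int, list[str]]:
    """Split data into blocks and record which nodes hold each copy."""
    placement = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, _ in enumerate(blocks):
        # Simplistic round-robin placement: the next REPLICAS nodes in turn.
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)]
            for r in range(REPLICAS)
        ]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
metadata = place_blocks(b"ABCDEFGHIJ", nodes)
# Each of the three blocks is stored on three distinct nodes.
for block_id, replicas in metadata.items():
    print(block_id, replicas)
```

Only this metadata lives on the name node; the blocks themselves reside on the data nodes, which is why losing the name node's metadata is more serious than losing any single data node.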
Key/value stores. These are data stores that hold data as a series of key/value pairs. The value
may be a single data item or a complex data structure. There is no fixed schema for the data,
and so these types of data store are ideal for unstructured data. An example of a key/value
store is Azure table storage, where each row has a key and a property bag containing one or
more values. Key/value stores can be persistent or volatile.
Document stores. These are data stores optimized to hold structured, semi-structured, and
unstructured data items such as JSON objects, XML documents, and binary data. They are
usually indexed stores.
Block stores. These are typically non-indexed stores that hold binary data, which can be a
representation of data in any format. For example, the data could represent JSON objects or it
could just be a binary data stream. An example of a block store is Azure blob storage, where
each item is identified by a blob name within a virtual container structure.
Wide column or column family data stores. These are data stores that do use a schema, but the
schema can contain families of columns rather than just single columns. They are ideally suited
to storing semi-structured data, where some columns can be predefined but others are capable
of storing differing elements of unstructured data. HBase running on HDFS is an example. HBase
is discussed in more detail in the topic Specifying the infrastructure in this guide.
Graph data stores. These are data stores that hold the relationships between objects. They are
less common than the other types of data store, many still being experimental, and they tend to
have specialist uses.
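To make the first and fourth of these store types concrete, here is a hypothetical in-memory sketch in Python. The dictionaries only illustrate the shape of the data; the keys, families, and values are invented, and this is not the API of Azure table storage or HBase.

```python
# Key/value store: each key maps to a "property bag" whose shape can
# differ from row to row, so no fixed schema is required.
kv_store = {
    ("orders", "row-001"): {"Product": "Widget", "Quantity": 5},
    ("orders", "row-002"): {"Product": "Gadget", "Quantity": 2, "Gift": True},
}

# Wide column (column family) layout: the families are predefined, but
# the columns inside each family can vary from row to row.
wide_column = {
    "row-001": {
        "core":  {"Product": "Widget"},           # predefined columns
        "extra": {"Campaign": "spring-sale"},     # free-form columns
    },
    "row-002": {
        "core":  {"Product": "Gadget"},
        "extra": {"Referrer": "newsletter"},
    },
}

print(kv_store[("orders", "row-002")]["Gift"])
print(wide_column["row-001"]["extra"])
```

Notice that neither structure forces every row to carry the same columns, which is what makes these stores a natural fit for semi-structured data.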
NoSQL storage is typically much cheaper than relational storage, and usually supports a write-once
model in which data can only be appended. To update data in these stores you must drop and
recreate the relevant file, or maintain delta files and implement mechanisms to conflate the data. This
limitation maximizes throughput; storage implementations are usually measured by throughput rather
than capacity because this is usually the most significant factor for both storage and query efficiency.
Modern data management techniques such as Event Sourcing, Command Query Responsibility
Segregation (CQRS), and other related patterns do not encourage updates to data. Instead, new data is added
and milestone records are used to fix the current state of the data at intervals. This approach provides
better performance and maintains the history of changes to the data. For more information about
CQRS and Event Sourcing see the patterns & practices guide CQRS Journey.
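A minimal sketch of the append-only idea, assuming a simple running-total state and invented event values: new records are only ever appended, and a milestone periodically captures the current state so that reads replay just the newer events.

```python
events = []          # the append-only log; never updated in place
milestones = {}      # snapshots of state, keyed by log position

def append(amount):
    events.append(amount)

def take_milestone():
    # Fix the current state at this point in the log.
    state = 0
    for amount in events:
        state += amount
    milestones[len(events)] = state

def current_state():
    # Start from the latest milestone, then replay only newer events.
    if milestones:
        pos = max(milestones)
        state = milestones[pos]
    else:
        pos, state = 0, 0
    for amount in events[pos:]:
        state += amount
    return state

append(10)
append(-3)
take_milestone()         # snapshot taken after two events
append(5)
print(current_state())
```

Because nothing is overwritten, the full history of changes is preserved, which is exactly the property the patterns above rely on.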
order record, the name of the item ordered, and the quantity ordered. If this data was stored in a
relational database, a query of the following form could be used to generate a summary of the total
number of each item sold:
SQL
SELECT ProductName, SUM(Quantity) FROM OrderDetails GROUP BY ProductName
The equivalent using a big data solution requires a Map and a Reduce component. The Map component
running on each node operates on a subset, or chunk, of the data. It transforms each order line into a
name/value pair where the name is the product name, and the value is the quantity from that order
line. Note that in this example the Map component does not sum the quantity for each product, it
simply transforms the data into a list.
Next, the framework shuffles and sorts all of the lists generated by the Map component instances into a
single list, and executes the Reduce component with this list as the input. The Reduce component sums
the totals for each product, and outputs the results. Figure 2 shows a schematic overview of the process.
Figure 2 - A high level view of the map/reduce process for storing data and extracting information
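The same flow can be followed in plain Python. The order lines, product names, and chunk boundaries below are invented sample data, and the sorted merge stands in for the framework's shuffle and sort step.

```python
from itertools import groupby
from operator import itemgetter

chunks = [  # each chunk is the subset of order lines one data node holds
    [("Widget", 2), ("Gadget", 1)],
    [("Widget", 5), ("Sprocket", 3)],
]

def map_chunk(order_lines):
    # Transform each order line into a name/value pair; no summing here.
    return [(product, quantity) for product, quantity in order_lines]

# Shuffle and sort: merge all mapper output into one list ordered by key.
intermediate = sorted(
    (pair for chunk in chunks for pair in map_chunk(chunk)),
    key=itemgetter(0),
)

def reduce_pairs(pairs):
    # Sum the quantities for each product.
    return {product: sum(q for _, q in group)
            for product, group in groupby(pairs, key=itemgetter(0))}

print(reduce_pairs(intermediate))
# {'Gadget': 1, 'Sprocket': 3, 'Widget': 7}
```

The Map step stays deliberately dumb so it can run independently on every node; all cross-chunk aggregation is deferred to the Reduce step.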
Depending on the configuration of the query job, there may be more than one Reduce component
instance running. The output from each Map component instance is stored in a buffer on disk, and the
component exits. The content of the buffer is then sorted, and passed to one or more Reduce
component instances. Intermediate results are stored in the buffer until the final Reduce component
instance combines them all.
In some cases the process might include an additional component called a Combiner that runs on each
data node as part of the Map process, and performs a reduce-type operation on this part of the
data each time the map process runs. It may also run as part of the reduce phase, and again when large
datasets are being merged.
In the example shown here, a Combiner could sum the values for each product so that the output is
smaller, which can reduce network load and memory requirements, with a subsequent increase in
overall query efficiency. Often, as in this example, the Combiner and the Reduce components would be
identical.
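Continuing the sketch, a Combiner can be modeled as a per-chunk sum that runs alongside the Map step, so that fewer name/value pairs are passed on to the Reduce step. The sample order lines are invented.

```python
from collections import Counter

chunks = [
    [("Widget", 2), ("Gadget", 1), ("Widget", 1)],
    [("Widget", 5), ("Sprocket", 3)],
]

def map_with_combiner(order_lines):
    # The combiner sums quantities per product, but only within this
    # one chunk; it is the same logic the final reduce will apply.
    local = Counter()
    for product, quantity in order_lines:
        local[product] += quantity
    return local   # the first chunk's 3 pairs collapse to 2

# The reduce step then merges the already-compacted per-chunk counts.
totals = Counter()
for chunk in chunks:
    totals.update(map_with_combiner(chunk))

print(dict(totals))
# {'Widget': 8, 'Gadget': 1, 'Sprocket': 3}
```

This is why the Combiner and Reduce components are often identical: both just sum values per key, differing only in how much of the data they see.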
Performing a map/reduce operation involves several stages such as partitioning the input data, reading
and writing data, and shuffling and sorting the intermediate results. Some of these operations are
quite complex. However, they are typically the same every time, irrespective of the actual data and
the query. The great thing with a map/reduce framework such as Hadoop is that you usually need to
create only the Map and Reduce components. The framework does the rest.
Although the core Hadoop engine requires the Map and Reduce components it executes to be written in
Java, you can use other techniques to create them in the background without writing Java code. For
example you can use tools named Hive and Pig that are included in most big data frameworks to write
queries in a SQL-like or a high-level language. You can also use the Hadoop streaming API to execute
components written in other languages; see Hadoop Streaming on the Apache website for more
details.
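As a rough sketch of the streaming approach, the mapper and reducer can be ordinary Python programs that exchange tab-separated key/value lines. They are written here as functions over line iterables so the flow is visible end to end; in a real streaming job each would read sys.stdin and print its output, and the comma-separated input format here is an invented example.

```python
def mapper(lines):
    # Each input line is an order line: "<product>,<quantity>".
    for line in lines:
        product, quantity = line.strip().split(",")
        yield f"{product}\t{quantity}"

def reducer(sorted_lines):
    # Hadoop delivers mapper output to the reducer sorted by key, so
    # all lines for one product arrive together.
    current, total = None, 0
    for line in sorted_lines:
        product, quantity = line.split("\t")
        if product != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = product, 0
        total += int(quantity)
    if current is not None:
        yield f"{current}\t{total}"

raw = ["Widget,2", "Gadget,1", "Widget,5"]
shuffled = sorted(mapper(raw))        # stands in for Hadoop's sort
print(list(reducer(shuffled)))
# ['Gadget\t1', 'Widget\t7']
```

Because the contract is just lines of text on standard input and output, the same pair of scripts could be submitted to a streaming job without Java code.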
More information
The official site for Apache big data solutions and tools is the Apache Hadoop website.
For a detailed description of the MapReduce framework and programming model, see MapReduce.org.
Data storage
Big data solutions typically store data as a series of files located within a folder structure on disk.
However, in HDInsight these files are stored in Azure blob storage. HDInsight supports the standard
Hadoop file system commands and processes by using a fully HDFS-compliant layer over Azure blob
storage. As far as Hadoop is concerned, storage operates in exactly the same way as when using a
physical HDFS implementation. The advantages are that you can access storage using standard Azure
blob storage techniques as well as through the HDFS layer, and the data can be persisted when the
cluster is decommissioned.
HDInsight also offers the option to create a cluster that hosts the HBase open source data management
system. HBase is a NoSQL wide-column data store implemented as a distributed system that provides data
processing and storage over multiple nodes in a Hadoop cluster. It provides a random, real-time,
read/write data store designed to host tables that can contain billions of rows and millions of columns.
For more information about how HDInsight uses blob storage, and the optional use of HBase, see
Data storage in the topic Specifying the infrastructure.
Data processing
HDInsight supports many of the Hadoop query, transformation, and analysis tools, and you can install
some additional tools and utilities on an HDInsight cluster if required. Examples of the tools and utilities
commonly used with Hadoop-based solutions such as HDInsight are:
Hive, which allows you to overlay a schema onto the data when you need to run a query, and
use a SQL-like language called HiveQL for these queries. For example, you can use the CREATE
TABLE command to build a table by splitting the text strings in the data using delimiters or at
specific character locations, and then execute SELECT statements to extract the required data.
Pig, which allows you to create schemas and execute queries by writing scripts in a high-level
language called Pig Latin. Pig Latin is a procedural language that processes relations by
performing multiple interrelated data transformations that are explicitly encoded as data flow
sequences.
Map/reduce using components written in Java, and executed directly by the Hadoop
framework. As an alternative you can use the Hadoop streaming interface to execute map and
reduce components written in other languages such as C# and F#.
Mahout is a machine learning library, which allows you to perform data mining queries that
examine data files to extract specific types of information. For example, it supports
recommendation mining (finding users' preferences from their behavior), clustering (grouping
documents with similar topic content), and classification (assigning new documents to a
category based on existing categorization).
Storm is a distributed real-time computation system for processing fast, large streams of data. It
allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data
items using a user-defined number of parallel tasks. It can be used for real-time analytics, online
machine learning, continuous computation, distributed RPC, ETL, and more.
At the time of writing, Mahout and Storm were not supported on HDInsight. For more information
about the query and analysis tools in HDInsight, see Processing, querying, and transforming data using
HDInsight.
An ODBC driver that can be used to connect any ODBC-enabled consumer (such as a database,
or visualization tools such as Excel) with the data in Hive tables.
A Linq To Hive implementation that allows LINQ queries to be executed over the data in
HDInsight.
HCatalog, which is used in conjunction with queries, such as those that use Hive and Pig, to
abstract the physical paths to storage and make it easier to manage data and queries as a
solution evolves.
Sqoop, which can be used to import and export relational data to and from HDInsight.
Oozie, which provides a mechanism for automating workflows and operations. It supports
sequential and parallel workflow processes, and is extremely flexible.
More information about these and other tools and utilities is available in subsequent sections of this
guide, and in Appendix A - Tools and technologies reference.
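To illustrate the schema-on-read approach that Hive takes, the following Python sketch applies column names and types to raw delimited text only at query time; the stored data itself carries no schema. The column names and data are purely hypothetical stand-ins for what a CREATE TABLE statement and a SELECT query would express in HiveQL.

```python
import csv
import io

# Raw tab-delimited data; no schema is applied when it is stored.
raw = "2014-03-01\tapple\t3\n2014-03-01\tpear\t2\n2014-03-02\tapple\t5\n"

# The "CREATE TABLE" step: overlay column names and types at query time.
schema = [("sale_date", str), ("product", str), ("quantity", int)]

def read_table(text):
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# The "SELECT SUM(quantity) ... WHERE product = 'apple'" step.
total = sum(r["quantity"] for r in read_table(raw) if r["product"] == "apple")
print(total)  # 8
```

The key point is that the same raw file could be given a different schema by a different query, which is what distinguishes this model from loading data into a relational table.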
Interaction with an HDInsight cluster falls into two broad categories:
Cluster management. This includes tasks such as creating and deleting clusters, and obtaining
runtime monitoring information.
Job execution. This includes uploading data and jobs, executing jobs, and downloading or
accessing the results.
Cluster management makes use of Apache Zookeeper (which is used internally to manage some aspects
of HDInsight) and some features of the Ambari cluster monitoring framework.
The PowerShell cmdlets for Azure can be used to access blob storage to upload data to an HDInsight
cluster, as well as performing administrative tasks related to managing your subscription and services.
The PowerShell cmdlets for HDInsight allow full access to and management of almost all features of
HDInsight.
SDKs are available for use in creating applications that perform management and job submission for
HDInsight. The SDKs contain APIs that include classes for accessing storage, using HCatalog, automating
tasks with Oozie, and accessing monitoring information through Ambari. The .NET SDK also contains a
map/reduce implementation that uses the streaming interface to allow you to write queries in .NET
languages.
In addition, there is a cross-platform command-line interface available that allows you to access
HDInsight from different client platforms, and a management pack for Microsoft System Center.
For more information about administration tools and techniques for HDInsight, see Building end-to-end
solutions using HDInsight and Appendix A - Tools and technologies reference.
More information
For an overview and description of HDInsight see Microsoft Big Data.
To sign up for the Azure HDInsight service, go to the Azure HDInsight Service page.
For more information about using HDInsight, a good place to start is the TechNet library. You can see a
list of articles related to HDInsight by searching the library using this URL:
http://social.technet.microsoft.com/Search/en-US?query=hadoop.
The official support forum for HDInsight is at http://social.msdn.microsoft.com/Forums/en-US/hdinsight/threads.
Decide if big data is the appropriate solution. There are some tasks and scenarios for which big
data batch-processing solutions based on Hadoop are ideally suited, while other scenarios may
be better accomplished using a more traditional data management mechanism such as a
relational database. For more details, see Is big data the right solution?
Determine the analytical goals and source data. Before you start any data analysis project, it is
useful to be clear about what you hope to achieve from it. You may have a specific question
that you need to answer in order to make a critical business decision, in which case you must
identify data that may help you determine the answer, where it can be obtained from, and if
there are any costs associated with procuring it. Alternatively, you may already have some data
that you want to explore to try to discern useful trends and patterns. Either way, understanding
your goals will help you design and implement a solution that best supports those goals. For
more details, see Determining analytical goals and Identifying source data.
Design the architecture. While every data analysis scenario is different, and your requirements
will vary, there are some basic use cases and models that are best suited to specific scenarios.
For example, your requirements may involve a data analysis process followed by data cleansing
and validation, perhaps as a workflow of tasks, before transferring the results to another
system. This may form the basis for a mechanism that, for example, changes the behavior of an
application based on user preferences and patterns of behavior collected as they use the
application. For more details of the core use cases and models, see Designing big data solutions
using HDInsight.
Specify the infrastructure and cluster configuration. This involves choosing the appropriate big
data software, or subscribing to an online service such as HDInsight. You will also need to
determine the appropriate cluster size and storage requirements, consider whether you will need to
delete and recreate the cluster as part of your management process, and ensure that your chosen
solution will meet SLAs and business operational requirements. For more details, see Specifying
the infrastructure.
Obtain the data and submit it to the cluster. During this stage you decide how you will collect
the data you have identified as the source, and how you will load it into your big data solution
for processing. Often you will store the data in its raw format to avoid losing any useful
contextual information it contains, though you may choose to do some pre-processing before
storing it to remove duplication or to simplify it in some other way. For more details, see
Collecting and loading data into HDInsight.
Process the data. After you have started to collect and store the data, the next stage is to
develop the processing solutions you will use to extract the information you need. You can
usually use Hive and Pig queries, or other processing tools, for even quite complex data
extraction. In a few rare circumstances you may need to create custom map/reduce
components to perform more complex queries against the data. For more details, see
Processing, querying, and transforming data using HDInsight.
Evaluate the results. Probably the most important step of all is to ensure that you are getting
the results you expected, and that these results make sense. Complex queries can be hard to
write, and difficult to get right the first time. It's easy to make assumptions or miss edge cases
that can skew the results quite considerably. Of course, it may be that you don't know what the
expected result actually is (after all, the whole point of big data is to discover hidden
information from the data) but you should make every effort to validate the results before
making business decisions based on them. In many cases, a business user who is familiar
enough with the business context can perform the role of a data steward and review the results
to verify that they are meaningful, accurate, and useful.
Tune the solution. At this stage, if the solution you have created is working correctly and the
results are valuable, you should decide whether you will repeat it in the future; perhaps with
new data you collect over time. If so, you should tune the solution by reviewing the log files it
creates, the processing techniques you use, and the implementation of the queries to ensure
that they are executing in the most efficient way. It's possible to fine-tune big data solutions to
improve performance, reduce network load, and minimize the processing time by adjusting
some parameters of the query and the execution platform, or by compressing the data that is
transferred over the network.
Visualize and analyze the results. Once you are satisfied that the solution is working correctly
and efficiently, you can plan and implement the analysis and visualization approach you require.
This may be loading the data directly into an application such as Microsoft Excel, or exporting it
into a database or enterprise BI system for further analysis, reporting, charting, and more. For
more details, see Consuming and visualizing data from HDInsight.
Automate and manage the solution. At this point it will be clear if the solution should become
part of your organization's business management infrastructure, complementing the other
sources of information that you use to plan and monitor business performance and strategy. If
this is the case, you should consider how you might automate and manage some or all of the
solution to provide predictable behavior, and perhaps so that it is executed on a schedule. For
more details, see Building end-to-end solutions using HDInsight.
Note that, in many ways, data analysis is an iterative process, and you should take this approach when
building a big data batch processing solution. In particular, given the large volumes of data and
correspondingly long processing times typically involved in big data analysis, it can be useful to start by
implementing a proof of concept iteration in which a small subset of the source data is used to validate
the processing steps and results before proceeding with a full analysis. This enables you to test your big
data processing design on a small cluster, or even on a single-node on-premises cluster, before scaling
out to accommodate production level data volumes.
It's easy to run queries that extract data, but it's vitally important that you make every effort to validate
the results before using them as the basis for business decisions. If possible you should try to
cross-reference the results with other sources of similar information.
Where will the source data come from? Perhaps you already have the data that contains the
information you need, but you can't analyze it with your existing tools. Or is there a source of
data you think will be useful, but you don't yet know how to collect it, store it, and analyze it?
What is the format of the data? Is it highly structured, in which case you may be able to load it
into your existing database or data warehouse and process it there? Or is it semi-structured or
unstructured, in which case a Hadoop-based mechanism such as HDInsight that is optimized for
textual discovery, categorization, and predictive analysis will be more suitable?
What are the delivery and quality characteristics of the data? Is there a huge volume? Does it
arrive as a stream or in batches? Is it of high quality, or will you need to perform some type of
data cleansing and validation of the content?
Do you want to combine the results with data from other sources? If so, do you know where
this data will come from, how much it will cost if you have to purchase it, and how reliable this
data is?
Do you want to integrate with an existing BI system? Will you need to load the data into an
existing database or data warehouse, or will you just analyze it and visualize the results
separately?
The answers to these questions will help you decide whether a Hadoop-based big data solution such as
HDInsight is appropriate, but keep in mind that modern data management systems such as Microsoft
SQL Server and the Microsoft Analytics Platform System (APS) are designed to offer high performance
for huge volumes of data, so your decision should not focus solely on data volume.
As you saw earlier in this guide, Hadoop-based solutions are primarily suited to situations where:
You have very large volumes of data to store and process, and these volumes are beyond the
capabilities of traditional relational database systems.
The data is in a semi-structured or unstructured format, often as text files or binary files.
The data is not well categorized; for example, similar items are described using different
terminology such as a variation in city, country, or region names, and there is no obvious key
value.
The data arrives rapidly as a stream, or in large batches that cannot be processed in real time,
and so must be stored efficiently for processing later as a batch operation.
The data cannot easily be processed into a format that suits existing database schemas without
risking loss of information.
You need to execute complex batch jobs on a very large scale, so that running the queries in
parallel is necessary.
You want to be able to easily scale the system up or down on demand, or have it running only
when required for specific processing tasks and close it down altogether at other times.
You don't actually know how the data might be useful, but you suspect that it will be, either
now or in the future.
In general you should consider adopting a Hadoop-based solution such as HDInsight only when your
requirements match several of the points listed above, and not just one or two. Existing database
systems can achieve many of the tasks in the list, but a batch processing solution based on Hadoop may
be a better choice when several of the factors are relevant to your own requirements.
Historical analysis and reporting, which is concerned with summarizing data to make sense of
what happened in the past. For example, a business might summarize sales transactions by
fiscal quarter and sales region, and use the results to create a report for shareholders.
Additionally, business analysts within the organization might explore the aggregated data by
drilling down into individual months to determine periods of high and low sales revenue, or
drilling down into cities to find out if there are marked differences in sales volumes across
geographic locations. The results of this analysis can help to inform business decisions, such as
when to conduct sales promotions or where to open a new store.
Predictive analysis and reporting, which is concerned with detecting data patterns and trends
to determine what's likely to happen in the future. For example, a business might use statistics
from historical sales data and apply it to known customer profile information to predict which
customers are most likely to respond to a direct-mail campaign, or which products a particular
customer is likely to want to purchase. This analysis can help improve the cost-effectiveness of a
direct-mail campaign, or increase sales while building closer customer relationships through
relevant targeted recommendations.
Both kinds of analysis and reporting involve taking source data, applying an analytical model to that
data, and using the output to inform business decision making. In the case of historical analysis and
reporting, the model is usually designed to summarize and aggregate a large volume of data to
determine meaningful business measures, for example the total sales revenue aggregated by various
aspects of the business, such as fiscal period and sales region.
For predictive analysis the model is usually based on a statistical algorithm that categorizes clusters of
similar data, or that correlates data attributes (which may influence one another) with related causes
and trends, for example classifying customers based on demographic attributes, or identifying a
relationship between customer age and the purchase of specific products.
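As a simple illustration of the kind of statistic such a model might use, the following Python sketch computes the Pearson correlation between customer age and purchase counts for a hypothetical sample; real predictive models are considerably more sophisticated, but they often start from measures like this.

```python
from math import sqrt

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of the
    # standard deviations of the two attributes.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ages      = [22, 35, 41, 58, 63]   # hypothetical customer ages
purchases = [1, 3, 4, 6, 7]        # purchases of one product category
r = pearson(ages, purchases)
assert r > 0.9  # a strong positive relationship in this toy sample
```

A value of r near +1 or -1 suggests the attribute may be worth feeding into a classification or recommendation model; a value near 0 suggests it carries little predictive signal on its own.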
Databases are the core of most organizations' data processing, and in most cases the purpose is simply
to run the operation by, for example, storing and manipulating data to manage stock and create
invoices. However, analytics and reporting is one of the fastest growing sectors in business IT as
managers strive to learn more about their organization.
Analytical goals
Although every project has its own specific requirements, big data projects generally fall into one of the
following categories:
One-time analysis for a specific business decision. For example, a company planning to expand
by opening a new physical store might use big data techniques to analyze demographic data for
a shortlist of proposed store sites in order to determine the location that is likely to result in the
highest revenue for the store. Alternatively, a charity planning to build water supply
infrastructure in a drought-stricken area might use a combination of geographic, geological,
health, and demographic statistics to identify the best locations.
Open "blue sky" exploration of interesting data. Sometimes the goal of big data analysis is
simply to find out what you don't already know from the available data. For example, a business
might be aware that customers are using Twitter to discuss its products and services, and want
to explore the tweets to determine if any patterns or trends can be found that relate to brand
visibility or customer sentiment. There may be no specific business decision that needs to be
made based on the data, but gaining a better understanding of how customers perceive the
business might inform decision-making in the future.
Ongoing reporting and BI. In some cases a big data solution will be used to support ongoing
reporting and analytics, either in isolation or integrated with an existing enterprise BI solution.
For example, a real estate business that already has a BI solution, which enables analysis and
reporting of its own property transactions across time periods, property types, and locations,
might extend it to include demographic and population statistics data from external sources.
In many respects, data analysis is an iterative process. It is not uncommon for an initial project based on
open exploration of data to uncover trends or patterns that form the basis for a new project to support
a specific business decision, or to extend an existing BI solution.
The results of the analysis are typically consumed and visualized in the following ways:
Custom application interfaces. For example, a custom application might display the data as a
chart, or generate a set of product recommendations for a customer.
Business performance dashboards. For example, you could use the PerformancePoint Services
component of SharePoint Server to display key performance indicators (KPIs) as scorecards, and
display summarized business metrics in a SharePoint Server site.
Reporting solutions such as SQL Server Reporting Services. For example, business reports can
be generated in a variety of formats and distributed automatically by email, or viewed on
demand through a web browser.
Analytical tools such as Excel. Information workers can explore analytical data models through
PivotTables and charts. Business analysts can use advanced Excel capabilities such as Power
Query, Power Pivot, Power View, and Power Map to create their own personal data models and
visualizations, or use add-ins to apply predictive models to data and view the results in Excel.
Internal business data from existing applications or BI solutions. Often this data is historic in
nature or includes demographic profile information that the business gathered from its
customers. For example, you might use historic sales records to correlate customer attributes
with purchasing patterns, and then use this information to support targeted advertising or
predictive modeling of future product plans.
Log files. Applications or infrastructure services often generate log data that can be useful for
analysis and decision making with regard to managing IT reliability and scalability. Additionally,
in some cases, combining log data with business data can reveal useful insights into how IT
services support the business. For example, you might use log files generated by Internet
Information Services (IIS) to assess network bandwidth utilization, or to correlate web site
traffic with sales transactions in an ecommerce application.
Sensors. Increased automation in almost every aspect of life has led to a growth in the amount
of data recorded by electronic sensors (often referred to as the Internet of Things). For
example, RFID tags in smart cards are now routinely used to track passenger progress through
mass transit infrastructure, sensors in plant machinery generate huge quantities of data in
production lines, and smart metering provides detailed views of energy usage. This type of data
is often well suited to highly dynamic analysis and real-time reporting.
Social media. The massive popularity of social media services such as Facebook, Twitter, and
others is a major factor in the growth of data volumes on the Internet. Many social media
services provide application programming interfaces (APIs) that you can use to query the data
shared by users of these services, and consume this data for analysis. For example, a business
might use Twitter's query API to find tweets that mention the name of the company or its
products, and analyze the data to determine how customers feel about the company's brand.
Data feeds. Many web sites and services provide data as a feed that can be consumed by client
applications and analytical solutions. Common feed formats include RSS, ATOM, and industry
defined XML formats; and the data sources themselves include blogs, news services, weather
forecasts, and financial markets data.
Governments and special interest groups. Many government organizations and special interest
groups publish data that can be used for analysis. For example, the UK government publishes
over 9000 downloadable datasets including statistics on population, crime, government
spending, health, and more, in a variety of formats. Similarly, the US government provides
census data and other statistics as downloadable datasets or in dBASE format on CD-ROM.
Additionally, many international organizations provide data free of charge. For example, the
United Nations makes statistical data available through its own website and in Azure
Marketplace.
Commercial data providers. There are many organizations that sell data commercially,
including geographical data, historical weather data, economic indicators, and others. Azure
Marketplace provides a central service through which you can locate and purchase
subscriptions to many of these data sources.
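As a small, concrete illustration of the log file source described above, the following Python sketch totals the bytes sent from a fragment of a W3C extended log of the kind IIS produces. The field layout shown is an assumption: the actual columns depend on the server's logging configuration, which is why the code reads them from the #Fields directive rather than hard-coding positions.

```python
# A minimal W3C extended log fragment; the #Fields directive defines the
# column order, which varies with the server's logging settings.
log = """#Fields: date time cs-uri-stem sc-status sc-bytes
2014-03-01 10:15:02 /products/list 200 5120
2014-03-01 10:15:07 /cart/add 200 1024
2014-03-01 10:15:09 /images/logo.png 304 0
"""

fields, bytes_sent = [], 0
for line in log.splitlines():
    if line.startswith("#Fields:"):
        fields = line.split()[1:]          # column names for later rows
    elif line and not line.startswith("#"):
        record = dict(zip(fields, line.split()))
        bytes_sent += int(record["sc-bytes"])

print(bytes_sent)  # 6144
```

At big data scale the same per-line logic would run inside a map function across many log files, with a reduce step aggregating the per-file totals.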
Just because data is available doesn't mean it is useful, or that the effort of using it will be viable. Think
about the value the analysis can add to your business before you devote inordinate time and effort to
collecting and analyzing data.
When planning data sources to use in your big data solution, consider the following factors:
Availability. How easy is it to find and obtain the data? You may have a specific analytical goal
in mind, but if the data required to support the analysis is difficult (or impossible) to find you
may waste valuable time trying to obtain it. When planning a big data project it can be useful to
define a schedule that allows sufficient time to research what data is available. If the data
cannot be found after an agreed deadline you may need to revise the analytical goals.
Format. In what format is the data available, and how can it be consumed? Some data is
available in standard formats and can be downloaded over a network or Internet API. In other
cases the data may be available only as a real-time stream that you must capture and structure
for analysis. Later in the process you will consider tools and techniques for consuming the data
from its source and ingesting it into your cluster, but even during this early stage you should
identify the format and connectivity options for the data sources you want to use.
Relevance. Is the data relevant to the analytical goals? You may have identified a potential data
source and already be planning how you will consume it and ingest it into the analytical process.
However, you should first examine the data source carefully to ensure the data it contains is
relevant to the analysis you intend to perform.
Cost. You may determine the availability of a relevant dataset, only to discover that the cost of
obtaining the data outweighs the potential business benefit of using it. This can be particularly
true if the analytical goal is to augment an enterprise BI solution with external data on an
ongoing basis, and the external data is only available through a commercial data provider.
Infrastructure options
A managed service such as HDInsight running on Azure is a good choice over a self-installed
framework when:
You want a solution that is easy to initialize and configure, and where you do not need to
install any software yourself.
You want to get started quickly by avoiding the time it takes to set up the servers and
deploy the framework components to each one.
You want to be able to quickly and easily decommission a cluster and then initialize it again
without paying for the intermediate time when you don't need to use it.
You require the solution to be running for only a specific period of time.
You require the solution to be available for ongoing analysis, but the workload will vary
sometimes requiring a cluster with many nodes and sometimes not requiring any service at
all.
You want to avoid the cost in terms of capital expenditure, skills development, and the time
it takes to provision, configure, and manage an on-premises solution.
In contrast, installing and managing your own cluster is likely to be the better choice when:
The majority of the data is stored or generated within your on-premises network.
You require ongoing services with a predictable and constant level of scalability.
You have the necessary technical capability and budget to provision, configure, and manage
your own cluster.
The data you plan to analyze must remain on your own servers for compliance or
confidentiality reasons.
A pre-configured hardware appliance that supports big data connectivity, such as Microsoft
Analytics Platform System with PolyBase, is a good choice when:
You want a solution that provides predictable scalability, easy implementation, technical
support, and that can be deployed on-premises without requiring deep knowledge of big
data systems in order to set it up.
You want existing database administrators and developers to be able to seamlessly work
with big data without needing to learn new languages and techniques.
You want to be able to grow into affordable data storage space, and provide opportunities
for bursting by expanding into the cloud when required, while still maintaining corporate
services on a traditional relational system.
Choosing a big data platform that is hosted in the cloud allows you to change the number of servers in
the cluster (effectively scaling out or scaling in your solution) without incurring the cost of new
hardware or having existing hardware underused.
Data storage
When you create an HDInsight cluster, you have the option to create one of two types:
Hadoop cluster. This type of cluster combines an HDFS-compatible storage mechanism with the
Hadoop core engine and a range of additional tools and utilities. It is designed for performing
the usual Hadoop operations such as executing queries and transformations on data. This is the
type of cluster that you will see in use throughout this guide.
HBase cluster. This type of cluster, which was in preview at the time this guide was written,
contains a fully configured installation of the HBase database system. It is designed for use as
either a standalone cloud-hosted NoSQL database or, more typically, for use in conjunction with
a Hadoop cluster.
The primary data store used by HDInsight for both types of cluster is Azure blob storage, which provides
scalable, durable, and highly available storage (for more information see Introduction to Microsoft Azure
Storage). Using Azure blob storage means that both types of cluster can offer high scalability for storing
vast amounts of data, and high performance for reading and writing data, including the capture of
streaming data. For more details see Azure Storage Scalability and Performance Targets.
HBase provides close integration with Hadoop through base classes for connecting Hadoop map/reduce
jobs with data in HBase tables; an easy to use Java API for client access; adapters for popular
frameworks such as map/reduce, Hive, and Pig; access through a REST interface; and integration with
the Hadoop metrics subsystem.
HBase can be accessed directly by client programs and utilities to upload and access data. It can also be
accessed using storage drivers, or in discrete code, from within the queries and transformations you
execute on a Hadoop cluster. There is also a Thrift API available that provides a lightweight REST
interface for HBase.
HBase is resource-intensive and will attempt to use as much memory as is available on the cluster. You
should not use an HBase cluster for processing data and running queries, with the possible exception of
minor tasks where low latency is not a requirement. Instead, HBase is typically installed on a separate
cluster, and queried from the cluster containing your Hadoop-based big data solution.
For more information about HBase, see the official Apache HBase project website and HBase: Bigtable-like structured storage for Hadoop HDFS on the Hadoop wiki site.
Why Azure blob storage?
HDInsight is designed to transfer data very quickly between blob storage and the cluster, for both
Hadoop and HBase clusters. Azure datacenters provide extremely fast, high bandwidth connectivity
between storage and the virtual machines that make up an HDInsight cluster.
Using Azure blob storage provides several advantages:
Running costs are minimized because you can decommission a Hadoop cluster when it is not
performing queries; data in Azure blob storage is persisted when the cluster is deleted, and you
can build a new cluster on the existing source data in blob storage. You do not have to upload
the data again over the Internet when you recreate a cluster that uses the same data. However,
although it is possible, deleting and recreating an HBase cluster is not typically a recommended
strategy.
Data storage costs can be minimized because Azure blob storage is considerably cheaper than
many other types of data store (1 TB of locally-redundant storage currently costs around $25
per month). Blob storage can be used to store large volumes of data (up to 500 TB at the time
this guide was written) without being concerned about scaling out storage in a cluster, or
changing the scaling in response to changes in storage requirements.
Data in Azure blob storage is replicated across three locations in the datacenter, so it provides a
similar level of redundancy to protect against data loss as an HDFS cluster. Storage can be
locally-redundant (replicas are in the same datacenter), globally-redundant (replicated locally
and in a different region), or read-only globally-redundant. See Introduction to Microsoft Azure
Storage for more details.
Data stored in Azure blob storage can be accessed by and shared with other applications and
services, whereas data stored in HDFS can only be accessed by HDFS-aware applications that
have access to the cluster storage. Azure storage offers import/export features that are useful
for quickly and easily transferring data in and out of Azure blob storage.
The high speed flat network in the datacenter provides fast access between the virtual machines
in the cluster and blob storage, so data movement is very efficient. Tests carried out by the
Azure team indicate that blob storage provides near identical performance to HDFS when
reading data, and equal or better write performance.
Azure blob storage may throttle data transfers if the workload reaches the bandwidth limits of the
storage service or exceeds the scalability targets. One solution is to use additional storage accounts. For
more information, see the blog post Maximizing HDInsight throughput to Azure Blob Storage on MSDN.
For more information about the use of blob storage instead of HDFS for data storage see Use Azure Blob
storage with HDInsight.
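To illustrate how a cluster addresses data in blob storage through the HDFS-compatible driver, the following Python sketch assembles a wasb:// style URI of the form HDInsight uses. The storage account and container names are hypothetical.

```python
def wasb_uri(account, container, path):
    """Build a wasb:// URI for a blob accessed as HDFS-style storage by HDInsight."""
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, path.lstrip("/"))

# Hypothetical storage account and container names.
print(wasb_uri("mystorageacct", "mycontainer", "/data/input/logs.txt"))
```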
Combining Hadoop and HBase clusters
For most of your solutions you will use an HDInsight Hadoop-based cluster. However, there are
circumstances where you might combine both a Hadoop and an HBase cluster in the same solution, or
use an HBase cluster on its own. Some of the common configurations are:
Use just a Hadoop cluster. Source data can be loaded directly into Azure blob storage or stored
using the HDFS-compatible storage drivers in Hadoop. Data processing, such as queries and
transformations, execute on this cluster and access the data in Azure blob storage using the
HDFS-compatible storage drivers.
Use a combination of a Hadoop and an HBase cluster (or more than one HBase cluster if
required). Data is stored using HBase, and optionally through the HDFS driver in the Hadoop
cluster as well. Source data, especially high volumes of streaming data such as that from sensors
or devices, can be loaded directly into HBase. Data processing takes place on the Hadoop
cluster, but the processes can access the data stored in the HBase cluster and store results
there.
Use just an HBase cluster. This is typically the choice if you require only a high capacity, high
performance storage and retrieval mechanism that will be accessed directly from client
applications, and you do not require Hadoop-based processing to take place.
cluster is running and when it is not running or has been deleted. However, the HBase cluster
must be running.
Hadoop automatically partitions the data and allocates the jobs to the data nodes in the cluster.
Some queries may not take advantage of all the nodes in the cluster. This may be the case with
smaller volumes of data, or where the data format prevents partitioning (as is the case for some
types of compressed data).
Operations such as Hive queries that must sort the results may limit the number of nodes that
Hadoop uses for the reduce phase, meaning that adding more nodes will not reduce query
execution time.
If the volume of data you will process is increasing, ensure that the cluster size you choose can
cope with this. Alternatively, plan to increase the cluster size at specific intervals to manage the
growth. Typically you will need to delete and recreate the cluster to change the number of
nodes, but you can do this for a Hadoop cluster without the need to upload the data again
because it is held in Azure blob storage.
Use the performance data exposed by the cluster to determine if increasing the size is likely to
improve query execution speed. Use historical data on performance for similar types of jobs to
estimate the required cluster size for new jobs. For more information about monitoring jobs,
see Building end-to-end solutions using HDInsight.
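As a rough illustration of using historical performance data to size a cluster, the following Python sketch scales a new job from a similar completed one. It assumes near-linear scaling, which, as noted above, does not always hold for small data volumes or sort-heavy queries; treat the result only as a starting estimate.

```python
def estimate_nodes(historical_gb, historical_nodes, historical_minutes,
                   new_gb, target_minutes):
    """Estimate the node count for a new job from a similar historical job."""
    # GB processed per node-minute on the historical job.
    throughput = historical_gb / (historical_nodes * historical_minutes)
    needed = new_gb / (throughput * target_minutes)
    return max(1, int(-(-needed // 1)))  # round up, at least one node

# A job that processed 200 GB on 4 nodes in 50 minutes;
# we now want to process 600 GB in around 60 minutes.
print(estimate_nodes(200, 4, 50, 600, 60))
```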
Storage requirements
By default HDInsight creates a new storage container in Azure blob storage when you create a new
cluster. However, it's possible to use a combination of different storage accounts with an HDInsight
cluster. You might want to use more than one storage account in the following circumstances:
When the amount of data is likely to exceed the storage capacity of a single blob storage
container.
When the rate of access to the blob container might exceed the threshold where throttling will
occur.
When you want to make data you have already uploaded to a blob container available to the
cluster.
When you want to isolate different parts of the storage for reasons of security, or to simplify
administration.
For details of the different approaches for using storage accounts with HDInsight see Cluster and storage
initialization in the section Collecting and loading data into HDInsight of this guide. For details of storage
capacity and bandwidth limits see Azure Storage Scalability and Performance Targets.
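One way to apply multiple storage accounts, when the goal is to stay below per-account bandwidth and throttling limits, is to spread output across accounts in round-robin fashion. The following Python sketch shows the idea; the account and container names are hypothetical.

```python
import itertools

# Hypothetical storage account names associated with the cluster.
accounts = ["storacct1", "storacct2", "storacct3"]
_cycle = itertools.cycle(accounts)

def next_output_path(container, blob_name):
    """Assign each output blob to the next storage account in round-robin order."""
    account = next(_cycle)
    return "wasb://{0}@{1}.blob.core.windows.net/{2}".format(
        container, account, blob_name)

paths = [next_output_path("results", "part-{0:05d}".format(i)) for i in range(4)]
for p in paths:
    print(p)
```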
Maintaining cluster data
There may be cases where you want to be able to decommission and delete a cluster, and then recreate
it later with exactly the same configuration and data. HDInsight stores the cluster data in blob storage,
and you can create a new cluster over existing blob containers so the data they contain is available to
the cluster. Typically you will delete and recreate a cluster in the following circumstances:
You want to minimize runtime costs by deploying a cluster only when it is required. This may be
because the jobs it executes run only at specific scheduled times, or run on demand when
specific requirements arise.
You want to change the size of the cluster, but retain the data and metadata so that it is
available in the new cluster.
This applies only with a Hadoop-based HDInsight cluster. See Using Hadoop and HBase clusters for
information about using an HBase cluster. For details of how you can maintain the data when recreating
a cluster in HDInsight see Cluster and storage initialization in the section Collecting and loading data into
HDInsight of this guide.
Investigate the SLAs offered by your big data solution provider because these will ultimately
limit the level of availability and reliability you can offer.
Consider if a cluster should be used for a single process, for one or a subset of customers, or for
a specific limited workload in order to maintain performance and availability. Sharing a cluster
across multiple different workloads can make it more difficult to predict and control demand,
and may affect your ability to meet SLAs.
Consider how you will manage backing up the data and the cluster information to protect
against loss in the event of a failure.
Choose an operating location, cluster size, and other aspects of the cluster so that sufficient
infrastructure and network resources are available to meet requirements.
Implement robust management and monitoring strategies to ensure you maintain the required
SLAs and meet business requirements.
Hadoop-based big data solutions open up new opportunities for converting data into information. They
can also be used to extend existing information systems to provide additional insights through analytics
and data visualization. Every organization is different, and so there is no definitive list of the ways you
can use these types of solution as part of your own business processes.
However, there are four general use cases and corresponding models, described below, that are
appropriate for the typical batch processing workloads on an HDInsight cluster. Understanding these use
cases will help you to start making decisions on how best to integrate HDInsight with your organization,
and with your existing BI systems and tools.
By incorporating additional applications that run under the YARN resource manager, HDInsight can be
used to perform real-time processing of streaming data. However, this topic is outside the scope of the
guide.
Handling data that you cannot process using existing systems, perhaps by performing complex
calculations and transformations that are beyond the capabilities of existing systems to
complete in a reasonable time.
Collecting feedback from customers through email, web pages, or external sources such as
social media sites, then analyzing it to get a picture of customer sentiment for your products.
Combining information with other data, such as demographic data that indicates population
density and characteristics in each city where your products are sold.
Dumping data from your existing information systems into HDInsight so that you can work with
it without interrupting other business processes or risking corruption of the original data.
Trying out new ideas and validating processes before implementing them within the live
system.
Combining your data with datasets available from Azure Marketplace or other commercial data sources
can reveal useful information that might otherwise remain hidden in your data.
Data sources
The input data for this model typically includes the following:
Social data, log files, sensors, and applications that generate data files.
Datasets obtained from Azure Marketplace and other commercial data providers.
Internal data extracted from databases or data warehouses for experimentation and one-off
analysis.
Streaming data that is captured, filtered, and pre-processed through a suitable tool or
framework (see Collecting and loading data into HDInsight).
Notice that, as well as externally obtained data, you might process data from within your organization's
existing database or data warehouse. HDInsight is an ideal solution when you want to perform offline
exploration of existing data in a sandbox. For example, you may join several datasets from your data
warehouse to create large datasets that act as the source for some experimental investigation, or to test
new analysis techniques. This avoids the risk of interrupting existing systems, affecting performance of
your data warehouse system, or accidentally corrupting the core data.
The capability to store schema-less data, and apply a schema only when processing the data, may also
simplify the task of combining information from different systems because you do not need to apply a
schema beforehand, as you would in a traditional data warehouse.
Often you need to perform more than one query on the data to get the results into the form you need.
It's not unusual to base queries on the results of a preceding query; for example, using one query to
select and transform the required data and remove redundancy, a second query to summarize the data
returned from the first query, and a third query to format the output as required. This iterative
approach enables you to start with a large volume of complex and difficult to analyze data, and get it
into a structure that you can consume directly from an analytical tool such as Excel, or use as input to a
managed BI solution.
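The three-pass approach described above can be simulated in miniature. In the following Python sketch, a first pass selects fields and removes duplicates, a second summarizes, and a third formats delimited rows ready for a tool such as Excel; in a real solution each pass would be a Hive or Pig job, and the sample records are invented.

```python
# Invented sample records standing in for raw source data.
raw = [
    {"city": "Seattle", "product": "A", "qty": 3},
    {"city": "Seattle", "product": "A", "qty": 3},   # an exact duplicate
    {"city": "Seattle", "product": "B", "qty": 5},
    {"city": "London",  "product": "A", "qty": 2},
]

# Pass 1: select the required fields and remove exact duplicates.
selected = [dict(t) for t in {tuple(sorted(r.items())) for r in raw}]

# Pass 2: summarize quantity by city.
summary = {}
for r in selected:
    summary[r["city"]] = summary.get(r["city"], 0) + r["qty"]

# Pass 3: format as tab-delimited rows for an analytical tool.
report = ["{0}\t{1}".format(city, qty) for city, qty in sorted(summary.items())]
print(report)
```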
Output targets
The results from your exploration processes can be visualized using any of the wide range of tools that
are available for analyzing data, combining it with other datasets, and generating reports. Typical
examples for the iterative exploration model are:
Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map.
You will see more details of these tools in Consuming and visualizing data from HDInsight.
Considerations
There are some important points to consider when choosing the iterative exploration model:
Combine the output with other data to generate comparisons or to augment the
information.
You will usually choose this model when you do not want to persist the results of the query
after analysis, or after the required reports have been generated. It is typically used for one-off
analysis tasks where the results are discarded after use, and so differs from the other models
described in this guide, in which the results are stored and reused.
Very large datasets are likely to preclude the use of an interactive approach due to the time
taken for the queries to run. However, after the queries are complete you can connect to the
cluster and work interactively with the data to perform different types of analysis or
visualization.
Data arriving as a stream, such as the output from sensors on an automated production line or
the data generated by GPS sensors in mobile devices, requires additional considerations. A
typical technique is to capture the data using a stream processing technology such as Storm or
StreamInsight and persist it, then process it in batches or at regular intervals. The stream
capture technology may perform some pre-processing, and might also power a real-time
visualization or rudimentary analysis tool, as well as feeding it into an HDInsight cluster. A
common technique is micro-batch processing, where incoming data is persisted in small
increments, allowing near real-time processing by the big data solution.
You are not limited to running a single query on the source data. You can follow an iterative
pattern in which the data is passed through the cluster multiple times, each pass refining the
data until it is suitably prepared for use in your analytical tool. For example, a large
unstructured file might be processed using a Pig script to generate a smaller, more structured
output file. This output could then be used as the input for a Hive query that returns aggregated
data in tabular form.
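The micro-batch technique mentioned above can be sketched as a small buffer that persists incoming records in fixed-size increments rather than one at a time. The following Python sketch shows the pattern; a real stream capture component (such as Storm) would replace the placeholder sink.

```python
class MicroBatcher:
    """Buffer incoming records and persist them in small batches."""

    def __init__(self, batch_size, sink):
        self.batch_size = batch_size
        self.sink = sink          # callable that persists one batch
        self.buffer = []

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Persist any buffered records, e.g. at the end of an interval."""
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer = []

batches = []                                   # stands in for durable storage
batcher = MicroBatcher(batch_size=3, sink=batches.append)
for reading in range(7):                       # seven simulated stream readings
    batcher.add(reading)
batcher.flush()
print(batches)
```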
If you need to store vast amounts of data, irrespective of the format of that data, an on-premises
Hadoop-based solution can reduce administration overhead and save money by minimizing the need for
the high performance database servers and storage clusters used by traditional relational database
systems. Alternatively, you may choose to use a cloud hosted Hadoop-based solution such as HDInsight
in order to reduce the administration overhead and running costs compared to on-premises
deployment.
This model is also suitable for use as a data store where you do not need to implement the typical data
warehouse capabilities. For example, you may just want to minimize storage cost when saving large
tabular format data files for use in the future, large text files such as email archives or data that you
must keep for legal or regulatory reasons but you do not need to process, or for storing large quantities
of binary data such as images or documents. In this case you simply load the data into the storage
associated with the cluster, without creating Hive tables for it.
You might, as an alternative, choose to use just an HBase cluster in this model. HBase can be accessed
directly from client applications through the Java APIs and the REST interface. You can load data directly
into HBase and query it using the built-in mechanisms. For information about HBase see Data storage
in the topic Specifying the infrastructure.
An example of applying this use case and model can be found in Scenario 2: Data warehouse on
demand.
Storing data in a way that allows you to minimize storage cost by taking advantage of cloud-based
storage systems, and minimizing runtime cost by initiating a cluster to perform processing only
when required.
Exposing both the source data in raw form, and the results of queries executed over this data in
the familiar row and column format, to a wide range of data analysis tools. The processed
results can use a range of data types that includes both primitive types (including timestamps)
and complex types such as arrays, maps, and structures.
Storing schemas (or, to be precise, metadata) for tables that are populated by the queries you
execute, and partitioning the data in tables based on a clustered index so that each has a
separate metadata definition and can be handled separately.
Creating views based on tables, and creating functions for use in both tables and queries.
Creating a robust data repository for very large quantities of data that is relatively low cost to
maintain compared to traditional relational database systems and appliances, where you do not
need the additional capabilities of these types of systems.
Consuming the results directly in business applications through interactive analytical tools such
as Excel, or in corporate reporting platforms such as SQL Server Reporting Services.
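To give a concrete flavor of the table definitions described above, the following Python sketch assembles a HiveQL statement with primitive and complex column types and a partition column. The table and column names are illustrative only, not part of any real schema.

```python
def create_table_ddl(name, columns, partition_cols):
    """Assemble a partitioned Hive CREATE TABLE statement from column lists."""
    cols = ",\n  ".join("{0} {1}".format(c, t) for c, t in columns)
    parts = ", ".join("{0} {1}".format(c, t) for c, t in partition_cols)
    return ("CREATE TABLE {0} (\n  {1}\n)\n"
            "PARTITIONED BY ({2})\n"
            "STORED AS TEXTFILE;").format(name, cols, parts)

ddl = create_table_ddl(
    "sensor_readings",
    [("device_id", "STRING"),
     ("reading_time", "TIMESTAMP"),          # a primitive type
     ("metrics", "MAP<STRING, DOUBLE>")],    # a complex type
    [("reading_date", "STRING")])
print(ddl)
```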
Data sources
Data sources for this model are typically data collected from internal and external business processes.
However, it may also include reference data and datasets obtained from other sources that can be
matched on a key to existing data in your data store so that it can be used to augment the results of
analysis and reporting processes. Some examples are:
Datasets obtained from Azure Marketplace and other commercial data providers.
If you adopt this model simply as a commodity data store rather than a data warehouse, you might also
load data from other sources such as social media data, log files, and sensors; or streaming data that is
captured, filtered, and processed through a suitable tool or framework (see Collecting and loading data
into HDInsight).
Output targets
The main intention of this model is to provide the equivalent to a data warehouse system based on the
traditional relational database model, and expose it as Hive tables. You can use these tables in a variety
of ways, such as:
Combining the datasets for analysis, and using the result to generate reports and business
information.
Generating ancillary information such as related items or recommendation lists for use in
applications and websites.
Providing external access to the results through web applications, web services, and other
services.
Powering information systems such as SharePoint server through web parts and the Business
Data Connector (BDC).
If you adopt this model simply as a commodity data store rather than a data warehouse, you might use
the data you store as an input for any of the models described in this section of the guide.
The data in an HDInsight data warehouse can be analyzed and visualized directly using any tools that can
consume Hive tables. Typical examples are:
Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map
You can find more details of these tools in the topic Consuming and visualizing data from HDInsight. For
a discussion of using SQL Server Analysis Services see Corporate Data Model Level Integration in the
topic Use case 4: BI integration. You can also download a case study that describes using SQL Server
Analysis Services with Hive.
Considerations
There are some important points to consider when choosing the data warehouse on demand model:
Create a central point for analysis and reporting by multiple users and tools.
Host your data in the cloud to benefit from reliability and elasticity, to minimize cost, and to
reduce administration overhead.
Store both externally collected data and data generated by internal tools and processes.
Define tables that have the familiar row and column format, with a range of data types for
the columns that includes both primitive types (including timestamps) and complex types
such as arrays, maps, and structures.
Load data from storage into tables, save data to storage from tables, and populate tables
from the results of running a query.
Create indexes for tables, and partition tables based on a clustered index so that each has a
separate metadata definition and can be handled separately.
Rename, alter and drop tables, and modify columns in a table as required.
Create views based on tables, and create functions for use in both tables and queries.
The main limitation of Hive tables is that you cannot create constraints such as foreign key
relationships that are automatically managed. For more details of how to work with Hive
tables, see Hive Data Definition Language on the Apache Hive website.
You can store the Hive queries and views within HDInsight so that they can be used to extract
data on demand in much the same way as the stored procedures in a relational database.
However, to minimize response times you will probably need to pre-process the data where
possible using queries within your solution, and store these intermediate results in order to
reduce the time-consuming overhead of complex queries. Incoming data may be processed by
any type of query, not just Hive, to cleanse and validate the data before converting it to table
format.
You can use the Hive ODBC connector in SQL Server with HDInsight to create linked servers. This
allows you to write Transact-SQL queries that join tables in a SQL Server database to tables
stored in an HDInsight data warehouse.
If you want to be able to delete and restore the cluster, as is typically the case for this model,
there are additional considerations when creating a cluster. See Cluster and storage
initialization for more information.
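The linked server technique mentioned above produces Transact-SQL of a recognizable shape: an OPENQUERY call passes a HiveQL statement through the linked server, and its result set is joined to a local table. The following Python sketch composes such a statement; the linked server, table, and column names are hypothetical.

```python
def hive_join_query(linked_server, hive_query, sql_table, join_col):
    """Compose a T-SQL statement joining a local table to a Hive result set."""
    return (
        "SELECT s.*, h.*\n"
        "FROM {0} AS s\n"
        "JOIN OPENQUERY({1}, '{2}') AS h\n"
        "  ON s.{3} = h.{3};"
    ).format(sql_table, linked_server, hive_query, join_col)

# Hypothetical linked server name and tables.
tsql = hive_join_query(
    "HDINSIGHT_HIVE",
    "SELECT product_id, total_sales FROM sales_summary",
    "dbo.Products", "product_id")
print(tsql)
```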
There is often some confusion between the terms ETL and ELT. ETL, as used here, is generally the
better known, and describes performing a transformation on incoming data before loading it into a data
warehouse. ELT is the process of loading it into the data warehouse in raw form and then transforming it
afterwards. Because the Azure blob storage used by HDInsight can store schema-less data, storing the
raw data is not an issue (it might be when the target is a relational data store). The data is then
extracted from blob storage, transformed, and the results are loaded back into blob storage. See ETL or
ELT or both? on the Microsoft OLAP blog for a more complete discussion of this topic.
Extracting and transforming data before you load it into your existing databases or analytical
tools.
Performing categorization and restructuring of data, and for extracting summary results to
remove duplication and redundancy.
Preparing data so that it is in the appropriate format and has appropriate content to power
other applications or services.
Data sources
Data sources for this model are typically external data that can be matched on a key to existing data in
your data store so that it can be used to augment the results of analysis and reporting processes. Some
examples are:
Social media data, log files, sensors, and applications that generate data files.
Datasets obtained from Azure Marketplace and other commercial data providers.
Streaming data captured, filtered, and processed through a suitable tool or framework (see
Collecting and loading data into HDInsight).
Output targets
This model is designed to generate output that is in the appropriate format for the target data store.
Common types of data store are:
Applications or services that require data to be processed into specific formats, or as files that
contain specific types of information structure.
You may decide to use this model even when you don't actually want to keep the results of the big data
query. You can load it into your database, generate the reports and analyses you require, and then
delete the data from the database. You may need to do this every time if the source data changes
between reporting cycles in a way that makes simply appending new data inappropriate.
Considerations
There are some important points to consider when choosing the ETL automation model:
Load stream data or large volumes of semi-structured or unstructured data from external
sources into an existing database or information system.
Cleanse, transform, and validate the data before loading it; perhaps by using more than one
transformation pass through the cluster.
Power other applications that require specific types of data, such as using an analysis of
previous behavioral information to apply personalization to an application or service.
When the output is in tabular format, such as that generated by Hive, the data import process
can use the Hive ODBC driver or LINQ to Hive. Alternatively, you can use Sqoop (which is
included in the Hadoop distribution installed by HDInsight) to connect a relational database
such as SQL Server or Azure SQL Database to your HDInsight data store and export the results of
a query into your database. If you are using Microsoft Analytical Platform System (APS) you can
access the data in HDInsight using PolyBase, which acts as a bridge between APS and HDInsight
so that it becomes just another data source available for use in queries and processes in APS.
Some other connectors for accessing Hive data are available from Couchbase, Jaspersoft, and
Tableau Software.
If the target for the data is not a database, you can generate a file in the appropriate format
within the query. This might be tab delimited format, fixed width columns, some other format
for loading into Excel or a third-party application, or even for loading into Azure storage through
a custom data access layer that you create. Azure table storage can be used to store table
formatted data using a key to identify each row. Azure blob storage is more suitable for storing
compressed or binary data generated from the HDInsight query if you want to store it for reuse.
If the intention is to regularly update the target table or data store as the source data changes
you will probably choose to use an automated mechanism to execute the query and data
import processes. However, if it is a one-off operation you may decide to execute it interactively
only when required.
If you need to execute several operations on the data as part of the ETL process you should
consider how you manage these. If they are controlled by an external program, rather than as a
workflow within the solution, you will need to decide whether some can be executed in parallel,
and you must be able to detect when each job has completed. Using a workflow mechanism
such as Oozie within Hadoop may be easier than trying to orchestrate several operations using
external scripts or custom programs. See Workflow and job orchestration for more information
about Oozie.
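The orchestration concern in the last point can be sketched in miniature: two independent jobs run in parallel, their completion is detected, and only then does a dependent job run. In the Python sketch below each "job" is just a placeholder function; a real solution would submit Hadoop jobs or, as suggested above, delegate this to an Oozie workflow.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(name, results):
    """Placeholder for submitting a job and waiting for it to finish."""
    results.append(name)
    return name

results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    # These two jobs have no dependency on each other, so run them in parallel.
    futures = [pool.submit(run_job, n, results) for n in ("cleanse", "validate")]
    for f in futures:
        f.result()          # detect completion of each parallel job
# Only after both have finished does the dependent job run.
run_job("summarize", results)
print(results)
```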
The information in this section will help you to understand how you can integrate HDInsight with an
enterprise BI system. However, a complete discussion of enterprise BI systems is beyond the scope of
this guide.
Figure 2 - Three levels of integration for big data with an enterprise BI system
The integration levels shown in Figure 2 are:
Report level integration. Data from HDInsight is used in reporting and analytical tools to
augment data from corporate BI sources, enabling the creation of reports that include data
from corporate BI sources as well as from HDInsight, and also enabling individual users to
combine data from both solutions into consolidated analyses. This level of integration is
typically used for creating mashups, exploring datasets to discover possible queries that can find
hidden information, and for generating one-off reports and visualizations.
Corporate data model level integration. HDInsight is used to process data that is not present in
the corporate data warehouse, and the results of this processing are then added to corporate
data models where they can be combined with data from the data warehouse and used in
multiple corporate reports and analysis tools. This level of integration is typically used for
exposing the data in specific formats to information systems, and for use in reporting and
visualization tools.
Data warehouse level integration. HDInsight is used to prepare data for inclusion in the
corporate data warehouse. The data that has been loaded is then available throughout the
entire enterprise BI solution. This level of integration is typically used to create standalone
tables on the same database hardware as the enterprise data warehouse, which provides a
single source of enterprise data for analysis, or to incorporate the data into a dimensional
schema and populate dimension and fact tables for full integration into the BI solution.
An example of applying this use case and model can be found in Scenario 4: BI integration.
The following sections describe the three integration levels in more detail to help you understand the
implications of your choice. They also contain guidelines for implementing each one. However, keep in
mind that you don't have to use the same integration level for all of your processes. You can use a
different approach for each dataset that you extract from HDInsight, depending on the scenario and the
requirements for that dataset.
Use the Power Query add-in to download the output files generated in the cluster and open
them in Excel, or import them into a database for reporting.
Create Hive tables in the cluster and consume them directly from Excel (including using Power
Query, Power Pivot, Power View, and Power Map) or from SQL Server Reporting Services (SSRS)
by using the Hive ODBC driver.
Download the required data as a delimited file from the cluster's Azure blob storage container,
perhaps by using PowerShell, and open it in Excel or another data analysis and visualization
tool.
By integrating data from your big data solution into corporate data models you can accomplish both of
these aims, and use the data as the basis for enterprise reporting and analytics. Integrating the output
from HDInsight with your corporate data models allows you to use tools such as SQL Server Analysis
Services (SSAS) to analyze the data and present it in a format that is easy to use in reports, or for
performing deeper analysis.
You can use the following techniques to integrate the results into a corporate data model:
Create Hive tables in the cluster and consume them directly from an SSAS tabular model by using
the Hive ODBC driver. SSAS in tabular mode supports the creation of data models from multiple
data sources and includes an OLE DB provider for ODBC, which can be used as a wrapper
around the Hive ODBC driver.
Create Hive tables in the cluster and then create a linked server in the instance of the SQL
Server database source used by an SSAS multidimensional data model so that the Hive tables
can be queried through the linked server and imported into the data model. SSAS in
multidimensional mode can only use a single OLE DB data source, and the OLE DB provider for
ODBC is not supported.
Use Sqoop or SQL Server Integration Services (SSIS) to copy the data from the cluster to a SQL
Server database engine instance that can then be used as a source for an SSAS tabular or
multidimensional data model.
Note that you must choose between multidimensional and tabular mode when you install
SQL Server Analysis Services, though you can install two instances if you need both modes.
When installed in tabular mode, SSAS supports the creation of data models that include data from
multiple diverse sources, including ODBC-based data sources such as Hive tables.
When installed in multidimensional mode, SSAS data models cannot be based on ODBC sources due to
some restrictions in the designers for multidimensional database objects. To use Hive tables as a source
for a multidimensional SSAS model, you must either extract the data from Hive into a suitable source for
the multidimensional model (such as a SQL Server database), or use the Hive ODBC driver to define a
linked server in a SQL Server instance that provides pass-through access to the Hive tables, and then
use the SQL Server instance as the data source for the multidimensional model. You can download a
case study that describes how to create a multidimensional SSAS model that uses a linked server in a
SQL Server instance to access data in Hive tables.
queried through HDInsight just like any other business data source, and consolidating the data from all
sources into an enterprise dimensional model.
In addition, as with other data sources, it's likely that the data import process from the cluster into the
database tables will occur on a schedule so that the data is as up to date as possible. The schedule will
depend on the time taken to execute the queries and perform ETL tasks prior to loading the results into
the database tables.
You can use the following techniques to integrate data from HDInsight into an enterprise data
warehouse:
Use Sqoop to copy data directly into database tables. These might be tables in a SQL Server data
warehouse that you do not want to integrate into the dimensional model of the data
warehouse. Alternatively, they may be tables in a staging database where the data can be
validated, cleansed, and conformed to the dimensional model of the data warehouse before
being loaded into the fact and dimension tables. Any firewalls located between the cluster and
the target database must be configured to allow the database protocols that Sqoop uses.
Use PolyBase for SQL Server to copy data directly into database tables. PolyBase is a component
of the Microsoft Analytics Platform System (APS) and is available only on APS appliances (see
PolyBase on the SQL Server website). These might be tables in a SQL Server data warehouse that
you do not want to integrate into the dimensional model of the data warehouse.
Alternatively, they may be tables in a staging database where the data can be validated,
cleansed, and conformed to the dimensional model of the data warehouse before being loaded
into the fact and dimension tables. Any firewalls located between the cluster and the target
database must be configured to allow the database protocols that PolyBase uses.
Create an SSIS package that reads the output file from the cluster, or uses the Hive ODBC driver
to extract the data, and then validates, cleanses, and transforms it before loading it into the fact
and dimension tables in the data warehouse.
Create a Linked Server in SQL Server that links to Hive tables in HDInsight through the Hive
ODBC Driver. You can then execute SQL queries that extract the data from HDInsight. However,
you must be aware of some issues such as compatible data types and some language syntax
limitations. For more information see How to create a SQL Server Linked Server to HDInsight
HIVE using Microsoft Hive ODBC Driver.
When reading data from HDInsight you must open port 1000 on the cluster; you can do this using the
management portal. For more information see Configure the Windows Firewall to Allow SQL Server
Access.
You have an existing enterprise data warehouse and BI system that you want to augment with
data from outside your organization.
You want to explore new ways to combine data in order to provide better insight into history
and to predict future trends.
You want to give users more opportunities for self-service reporting and analysis that combines
managed business data and big data from other sources.
Data sources
The input data can be almost anything, but for the BI integration model it typically includes the
following:
Social media data, log files, sensor data, and the output from applications that generate data
files.
Datasets obtained from Azure Marketplace and other commercial data providers.
Streaming data captured, filtered, and processed through a suitable tool or framework (see
Collecting and loading data into HDInsight).
Output targets
The results from your HDInsight queries can be visualized using any of the wide range of tools that are
available for analyzing data, combining it with other datasets, and generating reports. Typical examples
for the BI integration model are:
Interactive analytical tools such as Excel, Power Query, Power Pivot, Power View, and Power
Map.
You will see more details of these tools in Consuming and visualizing data from HDInsight.
Considerations
There are some important points to consider when choosing the BI integration model:
ETL processes in a data warehouse usually execute on a scheduled basis to add new data to the
warehouse. If you intend to integrate the results from HDInsight into your data warehouse so
that the information stored there is updated, you must consider how you will automate and
schedule the tasks of executing the query and importing the results.
You must ensure that data imported from your HDInsight solution contains valid values,
especially where there are typically multiple common possibilities (such as in street addresses
and city names). You may need to use a data cleansing mechanism such as Data Quality Services
to force such values to the correct leading value.
Most data warehouse implementations use slowly changing dimensions to manage the history
of values that change over time. Different versions of the same dimension member have the
same alternate key but unique surrogate keys, and so you must ensure that data imported into
the data warehouse tables uses the correct surrogate key value. This means that you must
either:
Use some complex logic to match the business key in the source data (which will typically
be the alternate key) with the correct surrogate key in the data model when you join the
tables. If you simply join on the alternate key, some loss of data accuracy may occur
because the alternate key is not guaranteed to be unique.
Load the data into the data warehouse and conform it to the dimensional data model,
including setting the correct surrogate key values.
One of the difficult tasks in full data warehouse integration is matching rows imported from a big data
solution to the correct dimension members in the data warehouse dimension tables. You must use a
combination of the alternate key and the point in time to which the imported row relates in order to
look up the correct surrogate key. This key can differ based on the date when changes were made to the
original entity. For example, a product may have more than one surrogate key over its lifetime, and the
imported data must match the correct version of this key.
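As a concrete sketch of this lookup, the following Python fragment models a type 2 slowly changing dimension and resolves the surrogate key from the alternate (business) key plus the event date. It is purely illustrative; the keys, column names, and validity dates are invented:

```python
from datetime import date

# Hypothetical slowly changing dimension: each version of a product shares the
# same alternate (business) key but has its own surrogate key and validity period.
dim_product = [
    {"sk": 1, "alt_key": "BK-1001", "valid_from": date(2012, 1, 1), "valid_to": date(2013, 6, 30)},
    {"sk": 7, "alt_key": "BK-1001", "valid_from": date(2013, 7, 1), "valid_to": date(9999, 12, 31)},
]

def lookup_surrogate_key(alt_key, event_date):
    """Find the dimension version that was current when the imported row occurred."""
    for row in dim_product:
        if row["alt_key"] == alt_key and row["valid_from"] <= event_date <= row["valid_to"]:
            return row["sk"]
    return None  # unmatched rows need separate handling (e.g. an 'unknown' member)

print(lookup_surrogate_key("BK-1001", date(2013, 3, 15)))  # 1 (version current in March 2013)
print(lookup_surrogate_key("BK-1001", date(2014, 1, 10)))  # 7 (the later version)
```

In a real warehouse this lookup is typically performed in the ETL layer (for example, in an SSIS lookup transformation) rather than in application code, but the matching rule is the same.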
[Table: summary of integration levels (None, Report Level, and so on) with their typical scenarios and considerations]
Although there is no specific business decision under consideration, the customer services managers
believe that some analysis of the tweets sent by customers may reveal important information about
how they perceive the airline and the issues that matter to customers. The kinds of question the team
expects to answer are:
Of these topics, if any, is it possible to get a realistic view of which are the most important?
Does the process provide valid and useful information? If not, can it be refined to produce
more accurate and useful results?
If the results are valid and useful, can the process be made repeatable?
The source data and results will be retained in Azure blob storage for visualization and further
exploration in Excel after the investigation is complete.
If you are just experimenting with data to see if it is useful, you probably won't want to spend
inordinate amounts of time and resources building a complex or automated data ingestion mechanism.
Often it's easier and quicker to just use a simple PowerShell script. For details of other options for
ingesting data see Collecting and loading data into HDInsight.
Explore: the analysts explore the data to determine what potentially useful information it
contains.
Refine: when some potentially useful data is found, the data processing steps used to query
the data are refined to maximize the analytical value of the results.
Stabilize: when a data processing solution that produces useful analytical results has been
identified, it is stabilized to make it robust and repeatable.
Although many big data solutions will be developed using the stages described here, it's not
mandatory. You may know exactly what information you want from the data, and how to extract it.
Alternatively, if you don't intend to repeat the process, there's no point in refining or stabilizing it.
An external table was used so that the table can be dropped without deleting the data, and recreated as
the analysis continues.
The analysts hypothesized that the use of Twitter to communicate with the company is significant, and
that the volume of tweets that mention the company is growing. They therefore used the following
query to determine the daily volume and trend of tweets.
HiveQL
SELECT PubDate, COUNT(*) TweetCount FROM Tweets GROUP BY PubDate SORT BY PubDate;
PubDate	TweetCount
4/16/2013	1964
4/17/2013	2009
4/18/2013	2058
4/19/2013	2107
4/20/2013	2160
4/21/2013	2215
4/22/2013	2274
These results seem to validate the hypothesis that the volume of tweets is growing. It may be worth
refining this query to include a larger set of source data that spans a longer time period, and potentially
include other aggregations in the results such as the number of distinct authors that tweeted each day.
However, while this analytical approach might reveal some information about the importance of Twitter
as a channel for customer communication, it doesn't provide any information about the specific topics
that concern customers. To determine what's important to the airline's customers, the analysts must
look more closely at the actual contents of the tweets.
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) { sum += val.get(); }
      result.set(sum);
      context.write(key, result);
    }
  }
  ...
}
For information about writing Java map/reduce code for HDInsight see Develop Java MapReduce
programs for HDInsight. For more details of the Java classes used when creating map/reduce functions
see Understanding MapReduce.
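As a rough illustration of what the Java job above does, the map, shuffle, and reduce phases of a word count can be modeled in a few lines of Python. This is a conceptual sketch only, not code that runs on HDInsight:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # The Hadoop framework groups all values emitted for the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts for each word, as IntSumReducer does above.
    return {word: sum(counts) for word, counts in grouped.items()}

tweets = ["my flight was delayed", "delayed again my bags are lost"]
word_counts = reduce_phase(shuffle(map_phase(tweets)))
print(word_counts["delayed"])  # 2
```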
The Java code, compiled to a .jar file, can be executed using the following PowerShell script.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$jarFile = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/example/jars/hadoop-mapreduce-examples.jar"
$input = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/tweets"
$output = "wasbs://$containerName@$storageAccountName.blob.core.windows.net/twitterdata/words"
$jobDef = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName "wordcount" -Arguments $input, $output
$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Write-Host "Map/Reduce job submitted..."
Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError
For more information about running map/reduce jobs in HDInsight see Building custom clients in the
topic Processing, querying, and transforming data using HDInsight.
The job generates a file named part-r-00000 containing the total number of instances of each word in
the source data. An extract from the results is shown here.
Partial output from map/reduce job
http://twitter.com/<user_name>/statuses/12347297	1
http://twitter.com/<user_name>/statuses/12347149	1
in	1408
in-flight	1057
incredible	541
is	704
it	352
it!	352
job	1056
journey	1057
just	352
later	352
lost	704
lots	352
lousy	515
love	1408
lugage?	352
luggage	352
made	1056
...
Unfortunately, these results are not particularly useful in trying to identify the most common topics
discussed in the tweets because the words are not ordered by frequency, and the list includes words
derived from Twitter names and other fields that are not actually a part of the tweeted messages.
The analysts therefore created a Pig Latin script that counts the words in the tweets, sorts them
in descending order of occurrences, and stores the first 100 results in the /twitterdata/wordcounts
folder.
Pig Latin (WordCount.pig)
-- load tweets.
Tweets = LOAD '/twitterdata/tweets' AS (date, id, author, tweet);
-- split tweet into words.
TweetWords = FOREACH Tweets GENERATE FLATTEN(TOKENIZE(tweet)) AS word;
-- clean words by removing punctuation.
CleanWords = FOREACH TweetWords GENERATE LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0))
as word;
-- filter text to eliminate empty strings.
FilteredWords = FILTER CleanWords BY word != '';
-- group by word.
GroupedWords = GROUP FilteredWords BY (word);
-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group, COUNT(FilteredWords) as count;
-- sort by count.
SortedWords = ORDER CountedWords BY count DESC;
-- limit results to the top 100.
Top100Words = LIMIT SortedWords 100;
-- store the results as a file.
STORE Top100Words INTO '/twitterdata/wordcounts';
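For readers less familiar with Pig Latin, the transformation steps in the script can be sketched in Python. This is an illustrative equivalent only; the Pig script above is what actually runs on the cluster:

```python
import re
from collections import Counter

def top_words(tweets, n=100):
    words = []
    for tweet in tweets:
        for token in tweet.split():                # TOKENIZE + FLATTEN
            # LOWER(REGEX_EXTRACT(word, '[a-zA-Z]*', 0)): keep leading letters only.
            word = re.match(r"[a-zA-Z]*", token).group(0).lower()
            if word:                               # FILTER out empty strings
                words.append(word)
    # GROUP + COUNT, ORDER ... DESC, LIMIT n
    return Counter(words).most_common(n)

sample = ["My flight was delayed!", "Delayed again, lost my bags..."]
print(top_words(sample, 3))
```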
This script is saved as WordCount.pig, uploaded to Azure storage, and executed in HDInsight using the
following Windows PowerShell script.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
$localfolder = "D:\Data\Scripts"
$destfolder = "twitterdata/scripts"
$scriptFile = "WordCount.pig"
$outputFolder = "twitterdata/wordcounts"
$outputFile = "part-r-00000"
# Upload Pig Latin script to Azure.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
When the script has completed successfully, the results are stored in a file named part-r-00000 in the
/twitterdata/wordcounts folder. This can be downloaded and viewed using the Hadoop cat command.
The following is an extract of the results.
Extract from /twitterdata/wordcounts/part-r-00000
my	3437
delayed	2749
flight	2749
to	2407
entertainment	2064
the	2063
a	2061
delay	1720
of	1719
bags	1718
These results show that the word count approach has the potential to reveal some insights. For
example, the high number of occurrences of delayed and delay are likely to be relevant in determining
common customer concerns. However, the solution needs to be modified to restrict the output to
include only significant words, which will improve its usefulness. To accomplish this the analysts decided
to refine it to produce accurate and meaningful insights into the most common words used by
customers when communicating with the airline by Twitter. This is described in Phase 2: Refining the
solution.
You can obtain lists of noise words from various sources such as TextFixer and Armand Brahaj's blog. If
you have installed SQL Server you can start with the list of noise words that are included in the
Resource database. For more information, see Configure and Manage Stopwords and Stoplists for
Full-Text Search. In addition, you may find the N-gram datasets available from Google useful. These
contain lists of words and phrases with their observed frequency counts.
With the noise words file in place the analysts modified the WordCount.pig script to use a LEFT OUTER
JOIN matching the words in the tweets with the words in the noise list, and to store the result in the file
named noisewordcounts. Only words with no matching entry in the noise words file are now included in
the aggregated results. The modified section of the script is shown below.
Pig Latin (FilterNoiseWords.pig)
...
-- load the noise words file.
NoiseWords = LOAD '/twitterdata/noisewords.txt' AS noiseword:chararray;
-- join the noise words file using a left outer join.
JoinedWords = JOIN FilteredWords BY word LEFT OUTER, NoiseWords BY noiseword USING
'replicated';
-- filter the joined words so that only words with
-- no matching noise word remain.
UsefulWords = FILTER JoinedWords BY noiseword IS NULL;
...
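The LEFT OUTER JOIN followed by the IS NULL filter is effectively an anti-join: only words with no matching entry in the noise list survive. A minimal Python sketch of the same idea, using an invented sample noise list:

```python
# Invented sample noise list; the real list is loaded from /twitterdata/noisewords.txt.
noise_words = {"my", "to", "the", "a", "of", "in", "is", "it"}

def remove_noise(words):
    # Keep only words with no matching entry in the noise list (the anti-join).
    return [w for w in words if w not in noise_words]

print(remove_noise(["my", "delayed", "flight", "to", "seattle"]))
```

The USING 'replicated' hint tells Pig to load the small noise words file into memory on each mapper, which is analogous to the in-memory set lookup shown here.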
Partial results from this version of the script are shown below.
Extract from /twitterdata/noisewordcounts/part-r-00000
delayed
entertainment
delay
bags
service
time
vacation
food
wifi
connection
seattle
bag
2749
2064
1720
1718
1718
1375
1031
1030
1030
688
688
687
These results are more useful than the previous output, which included the noise words. However, the
analysts have noticed that semantically equivalent words are counted separately. For example, in the
results shown above, delayed and delay both indicate that customers are concerned about delays, while
bags and bag both indicate concerns about baggage.
return returnCommons;
}
return null;
}
}
For information about creating UDFs for use in HDInsight scripts see User-defined functions.
This function was compiled, packaged as WordDistanceUDF.jar, and saved on the HDInsight cluster.
Next, the analysts modified the Pig Latin script that generates a list of all non-noise word combinations
in the tweet source data to use the function to calculate the Jaro distance between each combination of
words generated by the script. This modified section of the script is shown here.
Pig Latin (MatchWords.pig)
-- register custom jar.
REGISTER WordDistanceUDF.jar;
...
...
-- sort by count.
SortedWords = ORDER WordList BY word;
-- create a duplicate set.
SortedWords2 = FOREACH SortedWords GENERATE word AS word:chararray;
-- cross join to create every combination of pairs.
CrossWords = CROSS SortedWords, SortedWords2;
-- find the Jaro distance.
WordDistances = FOREACH CrossWords GENERATE
SortedWords::word as word1:chararray,
SortedWords2::word as word2:chararray,
WordDistanceUDF.WordDistance(SortedWords::word, SortedWords2::word) AS
jarodistance:double;
-- filter out word pairs with jaro distance less than 0.9.
MatchedWords = FILTER WordDistances BY jarodistance >= 0.9;
-- store the results as a file.
STORE MatchedWords INTO '/twitterdata/matchedwords';
Notice that the script filters the results to include only word combinations with a Jaro distance value of
0.9 or higher.
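The Jaro scores in the output below can be reproduced with a standard Jaro similarity function. The following Python implementation is an illustrative stand-in for the Java WordDistanceUDF (whose exact floating-point rounding may differ slightly):

```python
def jaro(s1, s2):
    """Plain Jaro similarity; an illustrative stand-in for the WordDistanceUDF logic."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    # Count characters that match within the allowed window.
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions among the matched characters.
    k = t = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    return (m / len1 + m / len2 + (m - t // 2) / m) / 3

print(jaro("bag", "bags"))       # ~0.9167, as in the output below
print(jaro("delay", "delayed"))  # ~0.9048
print(jaro("seat", "seats"))     # ~0.9333
```

Filtering at 0.9 keeps close variants such as bag/bags while rejecting weaker matches.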
The results include a row for every word matched to itself with a Jaro score of 1.0, and two rows for
each combination of words with a score of 0.9 or above (one row for each possible word order). Some of
the results in the output file are shown below.
bag	bag	1.0
bag	bags	0.9166666666666665
baggage	baggage	1.0
bagage	bagage	1.0
bags	bag	0.9166666666666665
bags	bags	1.0
delay	delay	1.0
delay	delayed	0.9047619047619047
delay	delays	0.9444444444444444
delayed	delay	0.9047619047619047
delayed	delayed	1.0
delays	delay	0.9444444444444444
delays	delays	1.0
seat	seat	1.0
seat	seats	0.9333333333333333
seated	seated	1.0
seats	seat	0.9333333333333333
seats	seats	1.0
Close examination of these results reveals that, while the code has successfully matched some words
appropriately (for example, bag/bags, delay/delays, delay/delayed, and seat/seats), it has failed to
match some others (for example, delays/delayed, bag/baggage, bagage/baggage, and seat/seated).
The analysts experimented with the Jaro value used to filter the results, lowering it to achieve more
matches. However, in doing so they found that the number of false positives increased. For example,
lowering the filter score to 0.85 matched delays to delayed and seats to seated, but also matched
seated to seattle.
The results obtained, and the attempts to improve them by adjusting the matching algorithm, reveal
just how difficult it is to infer semantics and sentiment from free-form text. In the end the analysts
realized that it would require some type of human intervention, in the form of a manually maintained
synonyms list.
bag	bag
bag	bags
bag	bagage
bag	baggage
bag	luggage
bag	lugage
delay	delay
delay	delayed
delay	delays
drink	drink
drink	drinks
drink	drinking
drink	beverage
drink	beverages
...	...
The first column in this file contains the list of leading values that should be used to aggregate the
results. The second column contains synonyms that should be converted to the leading values for
aggregation.
With this synonyms file in place, the Pig Latin script used to count the words in the tweet contents was
modified to use a LEFT OUTER JOIN between the words in the source tweets (after filtering out the
noise words) and the words in the synonyms file to find the leading values for each matched word. A
UNION clause is then used to combine the matched words with words that are not present in the
synonyms file, and the results are saved into a file named synonymcounts. The modified section of the
Pig Latin script is shown here.
Pig Latin (CountSynonymns.pig)
...
-- Match synonyms.
Synonyms = LOAD '/twitterdata/synonyms.txt' AS (leadingvalue:chararray,
synonym:chararray);
WordsAndSynonyms = JOIN UsefulWords BY word LEFT OUTER, Synonyms BY synonym USING
'replicated';
UnmatchedWords = FILTER WordsAndSynonyms BY synonym IS NULL;
UnmatchedWordList = FOREACH UnmatchedWords GENERATE word;
MatchedWords = FILTER WordsAndSynonyms BY synonym IS NOT NULL;
MatchedWordList = FOREACH MatchedWords GENERATE leadingvalue as word;
AllWords = UNION MatchedWordList, UnmatchedWordList;
-- group by word.
GroupedWords = GROUP AllWords BY (word);
-- count mentions per group.
CountedWords = FOREACH GroupedWords GENERATE group as word, COUNT(AllWords) as count;
...
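The join-and-union logic above amounts to replacing each word with its leading value when one exists, and leaving it unchanged otherwise. A Python sketch using an invented subset of the synonyms list:

```python
from collections import Counter

# Leading-value map built from the synonyms file: synonym -> leading value
# (an invented subset for illustration).
synonyms = {"bags": "bag", "bagage": "bag", "baggage": "bag",
            "delayed": "delay", "delays": "delay",
            "seats": "seat", "seated": "seat"}

def count_leading_values(words):
    # Words with a synonym entry are mapped to their leading value (the JOIN);
    # unmatched words pass through unchanged (the UNION of the two branches).
    return Counter(synonyms.get(w, w) for w in words)

print(count_leading_values(["seat", "seats", "seated", "delayed", "wifi"]))
```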
delay	4812
bag	3777
seat	2404
service	1718
movie	1376
vacation	1375
time	1375
entertainment	1032
food	1030
wifi	1030
connection	688
seattle	688
drink	687
In these results, semantically equivalent words are combined into a single leading value for aggregation.
For example, the counts for seat, seats, and seated are combined as a single count for seat. This makes
the results more useful in terms of identifying topics that are important to customers. For example, it is
apparent from these results that the top three subjects that customers have tweeted about are delays,
bags, and seats.
include the dependencies on hard-coded file paths, the data schemas in the Pig Latin scripts, and the
data ingestion process.
With the current solution, any changes to the location or format of the tweets.txt, noisewords.txt, or
synonyms.txt files would break the current scripts. Such changes are particularly likely to occur if the
solution gains acceptance among users and Hive tables are created on top of the files to provide a more
convenient query interface.
To execute this script with HCatalog, the following command was executed in the Hadoop command
window on the cluster.
Command Line
%HCATALOG_HOME%\bin\hcat.py -f C:\Scripts\CreateTables.hcatalog
After the script had successfully created the new tables, the noisewords.txt and synonyms.txt files were
moved into the folders used by the corresponding tables.
This dependency on hard-coded paths makes the data processing solution vulnerable to changes in the
way that data is stored. One of the major advantages of using HCatalog is that you can relocate and
redefine data as required without breaking all the scripts and code that accesses the files. For example,
if at a later date an administrator modifies the Tweets table to partition the data, the code to load the
source data would no longer work. Additionally, if the source data was modified to use a different
format or schema, the script would need to be modified accordingly.
To eliminate the dependency, the analysts modified the WordCount.pig script to use HCatalog classes to
load and save data in Hive tables instead of accessing the source files directly. The modified sections of
the script are shown below.
Pig Latin (GetTopWords.pig)
-- load tweets using HCatalog.
Tweets = LOAD 'Tweets' USING org.apache.hcatalog.pig.HCatLoader();
...
-- load the noise words file using HCatalog.
NoiseWords = LOAD 'NoiseWords' USING org.apache.hcatalog.pig.HCatLoader();
...
-- Match synonyms using data loaded through HCatalog
Synonyms = LOAD 'Synonyms' USING org.apache.hcatalog.pig.HCatLoader();
...
-- store the results as a file using HCatalog
STORE Top100Words INTO 'TopWords' USING org.apache.hcatalog.pig.HCatStorer();
The script no longer includes any hard-coded paths to data files or schemas for the data as it is loaded or
stored, and instead uses HCatalog to reference the Hive tables created previously. The results of the
script are stored in the TopWords table, and can be viewed by executing a HiveQL query such as the
following example.
HiveQL
SELECT * FROM TopWords;
adding a Hive staging table and preprocessing the data. However, the team needs to consider that the
data in other columns may be useful in the future.
To complete the examination of the data the team next explored how the results would be used, as
described in the next section, "Consuming the results."
Figure 1 - Using the Data Connection Wizard to access a Hive table from Excel
For details of how to consume the output from HDInsight jobs in Excel, see Built-in data connectivity in
the topic Consuming and visualizing data from HDInsight.
After the data has been imported into a worksheet, the analysts can use the full capabilities of Excel to
explore and visualize it, as shown in Figure 2.
Introduction to A. Datum
This scenario is based on a fictional company named A. Datum, which conducts research into tornadoes
and other weather-related phenomena in the United States. In the scenario, data analysts at A. Datum
want to use HDInsight as a central repository for historical tornado data in order to analyze and visualize
previous tornadoes, and to try to identify trends in terms of geographical locations and times.
How you can define and create a data warehouse containing a database and Hive tables in
HDInsight.
How you can automate the loading of data into the tables in the data warehouse.
How you can define queries to extract the data from the data warehouse.
How you can view and analyze the data, and generate compelling visualizations using a range of
tools.
Figure 1 - Using HDInsight as a data warehouse for analysis, reporting, and as a business data source
Unlike a traditional relational database, HDInsight allows you to manage the lifetime and storage of
tables and indexes (metadata) separately from the data that populates the tables. A Hive table is simply
a definition that is applied over a folder containing data, and this separation of schema and data is what
enables one of the primary differences between Hadoop-based big data batch processing solutions and
relational databases: you apply a schema when the data is read, rather than when it is written.
In this scenario you'll see how the capability to use a schema on read approach provides an advantage
for organizations that need a data warehousing capability where data can be continuously collected, but
analysis and reporting is carried out only occasionally.
Creating a database
When planning the HDInsight data warehouse, the data analysts at A. Datum needed to consider ways
to ensure that the data and Hive tables can be easily recreated in the event of the HDInsight cluster
being released and re-provisioned. This might happen for a number of reasons, including temporarily
decommissioning the cluster to save costs during periods of non-use, and releasing the cluster in order
to create a new one with more nodes in order to scale out the data warehouse.
A new cluster can be created over one or more existing Azure blob storage containers that hold the
data, but the Hive (and other) metadata is stored separately in an Azure SQL Database instance. To be
able to recreate this metadata, the analysts identified two possible approaches:
Save a HiveQL script that can be used to recreate EXTERNAL tables based on the data persisted
in Azure blob storage.
Specify an existing Azure SQL Database instance to host the Hive metadata store when the
cluster is created.
Using a HiveQL script to recreate tables after releasing and re-provisioning a cluster is an effective
approach when the data warehouse will contain only a few tables and other objects. The script can be
executed to recreate the tables over the existing data when the cluster is re-provisioned. However,
selecting an existing SQL Database instance (which you maintain separately from the cluster) to be used
as the Hive metadata store is also very easy, and removes the need to rerun the scripts. You can back up
this database using the built-in tools, or export the data so that you can recreate the database if
required.
See Cluster and storage initialization for more details of using existing storage accounts and a separate
Azure SQL Database instance to restore a cluster.
Creating a logical database in HDInsight is a useful way to provide separation between the contents of
the database and other items located in the same cluster; for example, to ensure a logical separation
from Hive tables used for other analytical processes. To do this the data analysts created a dedicated
database for the data warehouse by using the following HiveQL statement.
HiveQL
CREATE DATABASE DW LOCATION '/DW/database';
This statement creates a folder named /DW/database as the default folder for all objects created in the
DW database.
Creating tables
The tornado data includes the code for the state where the tornado occurred, as well as the date and
time of the tornado. The data analysts want to be able to display the full state name in reports, and so
created a table for state names using the following HiveQL statement.
HiveQL
CREATE EXTERNAL TABLE DW.States (StateCode STRING, StateName STRING)
STORED AS SEQUENCEFILE;
Notice that the table is stored in the default location. For the DW database this is the /DW/database
folder, and so this is where a new folder named States is created. The table is formatted as a Sequence
File. Tables in this format typically provide faster performance than tables in which data is stored as text.
You must use EXTERNAL tables if you want the data to be persisted when you delete a table definition
or when you recreate a cluster. Storing the data in SEQUENCEFILE format is also a good idea as it can
improve performance. You might also consider using the ORC file format, which provides a highly
efficient way to store Hive data and can improve performance when reading, writing, and processing
data. See ORC File Format for more information.
The data analysts also want to be able to aggregate data by temporal hierarchies (year, month, and day)
and create reports that show month and day names. While many client applications that are used to
analyze and report data support this kind of functionality, the analysts want to be able to generate
reports without relying on specific client application capabilities.
To support date-based hierarchies and reporting, the analysts created a table containing various date
attributes that can be used as a lookup table for date codes in the tornado data. The creation of a date
table like this is a common pattern in relational data warehouses.
HiveQL
CREATE EXTERNAL TABLE DW.Dates
(DateCode STRING, CalendarDate STRING, DayOfMonth INT, MonthOfYear INT,
Year INT, DayOfWeek INT, WeekDay STRING, Month STRING)
STORED AS SEQUENCEFILE;
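Populating a date table like this is easy to script. The following Python sketch (illustrative only; the DayOfWeek numbering shown is an assumption, not taken from the scenario) generates tab-delimited rows in the column order of the DW.Dates table:

```python
from datetime import date, timedelta

def date_dimension_rows(start, end):
    # One row per calendar day: DateCode, CalendarDate, DayOfMonth, MonthOfYear,
    # Year, DayOfWeek, WeekDay, Month - matching the DW.Dates table definition.
    d = start
    while d <= end:
        yield "\t".join([
            d.strftime("%Y-%m-%d"),   # DateCode
            d.strftime("%m/%d/%Y"),   # CalendarDate
            str(d.day),               # DayOfMonth
            str(d.month),             # MonthOfYear
            str(d.year),              # Year
            str(d.isoweekday()),      # DayOfWeek (ISO numbering: Monday=1; an assumption)
            d.strftime("%A"),         # WeekDay
            d.strftime("%B"),         # Month
        ])
        d += timedelta(days=1)

for row in date_dimension_rows(date(1932, 1, 1), date(1932, 1, 2)):
    print(row)
```

The resulting file can be staged into Azure blob storage and loaded into the Dates table like any other delimited data.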
Finally, the data analysts created a table for the tornado data itself. Since this table is likely to be large,
and many queries will filter by year, they decided to partition the table on a Year column, as shown in
the following HiveQL statement.
HiveQL
CREATE EXTERNAL TABLE DW.Tornadoes
(DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
LengthMiles DOUBLE, WidthYards DOUBLE)
PARTITIONED BY (Year INT) STORED AS SEQUENCEFILE;
Next, the analysts needed to upload the data for the data warehouse. This is described in the next
section, "Loading data into the data warehouse."
[Figure: sample rows from the data warehouse tables. The Dates table contains rows such as DateCode 1932-01-02, CalendarDate 01/02/1932, Year 1932, WeekDay Saturday, Month January. The States table maps state codes to names, for example AL to Alabama, AK to Alaska, and AZ to Arizona. The Tornadoes table contains rows such as DateCode 1934-01-18, StateCode OK, EventTime 02:20, together with property loss, start and end latitude/longitude, length, and width values.]
A staging table is dropped and recreated before each load. For example, the following script is used to drop and recreate a staging table named StagedTornadoes for the tornadoes data.
HiveQL (CreateStagedTornadoes.q)
DROP TABLE DW.StagedTornadoes;
CREATE TABLE DW.StagedTornadoes
(DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
LengthMiles DOUBLE, WidthYards DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/staging/tornadoes';
Notice that the staging table is an INTERNAL table. When it is dropped, any staged data left over from a
previous load operation is deleted and a new, empty /staging/tornadoes folder is created ready for new
data files, which can simply be copied into the folder. Similar scripts were created for the StagedDates
and StagedStates tables.
In addition to the scripts used to create the staging tables, the data load process requires scripts to
insert the staged data into the data warehouse tables. For example, the following script is used to load
the staged tornadoes data.
HiveQL (StagingScripts\LoadStagedTornadoes.q)
SET mapreduce.map.output.compress=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM DW.StagedTornadoes s
INSERT INTO TABLE DW.Tornadoes PARTITION (Year)
SELECT s.DateCode, s.StateCode, s.EventTime, s.Category, s.Injuries, s.Fatalities,
s.PropertyLoss, s.CropLoss, s.StartLatitude, s.StartLongitude, s.EndLatitude,
s.EndLongitude, s.LengthMiles, s.WidthYards, SUBSTR(s.DateCode, 1, 4) Year;
Notice that the script includes some configuration settings to enable compression of the query output
(which will be inserted into the data warehouse table). Additionally, the script for the tornadoes data
includes an option to enable dynamic partitions and a function to generate the appropriate partitioning
key value for Year. Similar scripts, without the partitioning functionality, were created for the states and
dates data.
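The dynamic partitioning relies on deriving a Year value from each row's DateCode. As a minimal sketch (Python is used here purely for illustration; the real derivation is the SUBSTR call in the HiveQL above, and the function name is hypothetical), the partition key computation is:

```python
def year_partition(date_code):
    # Mirror of SUBSTR(DateCode, 1, 4): the first four characters of a
    # 'yyyy-mm-dd' DateCode string give the Year partition key.
    return int(date_code[:4])

# Hive routes each inserted row to the partition for its derived Year.
```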
The scripts to create staging tables and load staged data were then uploaded to the /staging/scripts
folder so that they can be used whenever new data is available for loading into the data warehouse.
Loading data
With the Hive table definition scripts in place, the data analysts could now implement a solution to
automate the data load process. It is possible to create a custom application to load the data using the
.NET SDK for HDInsight, but a simple approach using Windows PowerShell scripts was chosen for this
scenario. A PowerShell script was created for each staging table, including the following script that is
used to stage and load tornadoes data.
After setting some initial variables to identify the cluster, storage account, blob container, and the local
folder where the source data is stored, the script performs the following three tasks:
1. Runs the HiveQL script to drop and recreate the staging table.
2. Uploads the source data file to the staging table folder.
3. Runs the HiveQL script to load the data from the staging table into the data warehouse table.
Two similar scripts, LoadDates.ps1 and LoadStates.ps1, are run to load the dates and states into the data
warehouse. Whenever new data is available for any of the data warehouse tables, the data analysts can
run the appropriate PowerShell script to automate the data load process for that table.
Now that the data warehouse is complete, the analysts can explore how to analyze the data. This is
discussed in the next section, "Analyzing data from the data warehouse."
Layer 1: Accumulated property and crop damage costs by state, shown as a stacked column
chart.
Layer 2: Average tornado category by latitude and longitude shown as a heat map.
Use the .NET Library for Avro to serialize data for processing in HDInsight.
Use the classes in the .NET API for Hadoop WebClient package to upload files to Azure storage.
Use an Oozie workflow to define an ETL process that includes Pig, Hive, and Sqoop tasks.
Use the classes in the .NET API for Hadoop WebClient package to automate execution of an
Oozie workflow.
The geographical position of the car (its latitude and longitude coordinates).
The sensors used in this scenario are deliberately simplistic. Real racecars include hundreds of sensors
emitting thousands of telemetry readings at sub-second intervals.
For the purpose of the example, to make it repeatable if you want to experiment with the code
yourself, the source data is provided in a file named Lap.csv. The example console application reads
this file to generate the source data for analysis.
The application captures the sensor readings as objects based on the following classes. Note that the
Position property of the GpsReading class is based on the Location struct.
C# (Program.cs in RaceTracker project)
[DataContract]
internal struct Location
{
[DataMember]
public double lat { get; set; }
[DataMember]
public double lon { get; set; }
}
[DataContract(Name = "GpsReading", Namespace = "CarSensors")]
internal class GpsReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }
[DataMember(Name = "Position")]
public Location Position { get; set; }
[DataMember(Name = "Speed")]
public double Speed { get; set; }
}
[DataContract(Name = "EngineReading", Namespace = "CarSensors")]
internal class EngineReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }
[DataMember(Name = "Revs")]
public double Revs { get; set; }
[DataMember(Name="OilTemp")]
public double OilTemp { get; set; }
}
[DataContract(Name = "BrakeReading", Namespace = "CarSensors")]
internal class BrakeReading
{
[DataMember(Name = "Time")]
public string Time { get; set; }
[DataMember(Name = "BrakeTemp")]
public double BrakeTemp { get; set; }
}
As the application captures the telemetry data, each sensor reading object is added to a List as defined
in the following code.
C# (Program.cs in RaceTracker project)
static List<GpsReading> GpsReadings = new List<GpsReading>();
static List<EngineReading> EngineReadings = new List<EngineReading>();
static List<BrakeReading> BrakeReadings = new List<BrakeReading>();
As part of the ETL processing workflow in HDInsight, the captured readings must be filtered to remove
any null values caused by sensor transmission problems. At the end of the processing the data must be
restructured to a tabular format that matches the following Azure SQL Database table definition.
Transact-SQL (Create LapData Table.sql)
CREATE TABLE [LapData]
(
[LapTime] [varchar](25) NOT NULL PRIMARY KEY CLUSTERED,
[Lat] [float] NOT NULL,
[Lon] [float] NOT NULL,
[Speed] [float] NOT NULL,
[Revs] [float] NOT NULL,
[OilTemp] [float] NOT NULL,
[BrakeTemp] [float] NOT NULL,
);
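As a conceptual sketch of this filtering and restructuring step (Python, with hypothetical names; the actual work is done later by Pig and Hive in the ETL workflow), dropping null readings and flattening the nested Position into lat/lon columns looks like this for the GPS data:

```python
def to_gps_rows(gps_readings):
    """Drop readings with null fields caused by sensor transmission
    problems, and flatten the nested Position into lat/lon columns."""
    rows = []
    for r in gps_readings:
        if r.get("Time") is None or r.get("Position") is None:
            continue  # filter out incomplete readings
        rows.append((r["Time"], r["Position"]["lat"],
                     r["Position"]["lon"], r["Speed"]))
    return rows
```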
The workflow and its individual components are described in the next section, "The ETL workflow."
Serializing the sensor reading objects as files, and uploading them to Azure storage.
Filtering the data to remove readings that contain null values, and restructuring it into tabular
format.
Loading the combined sensor readings data into the table in Azure SQL Database.
Figure 1 - The ETL workflow required to load racecar telemetry data into Azure SQL Database
The team wants to integrate these tasks into the existing console application so that, after a test lap, the
telemetry data is loaded into the database for later analysis.
Similar code is used to serialize the engine and brake sensor data into files in the bin/debug folder of the
solution.
Uploading the files to Azure storage
After the data for each sensor has been serialized to a file, the program must upload the files to the
Azure blob storage container used by the HDInsight cluster. To accomplish this the developer imported
the Microsoft .NET API for Hadoop WebClient package and added using statements that reference the
Microsoft.Hadoop.WebHDFS and Microsoft.Hadoop.WebHDFS.Adapters namespaces. The developer
can then use the WebHDFSClient class to connect to Azure storage and upload the files. The following
code shows how this technique is used to upload the file containing the GPS sensor readings.
C# (Program.cs in RaceTracker project)
// Get Azure storage settings from App.Config.
var hdInsightUser = ConfigurationManager.AppSettings["HDInsightUser"];
var storageKey = ConfigurationManager.AppSettings["StorageKey"];
var storageName = ConfigurationManager.AppSettings["StorageName"];
var containerName = ConfigurationManager.AppSettings["ContainerName"];
var destFolder = ConfigurationManager.AppSettings["InputDir"];
// Upload GPS data.
var hdfsClient = new WebHDFSClient(
hdInsightUser,
new BlobStorageAdapter(storageName, storageKey, containerName, false));
Console.WriteLine("Uploading GPS data...");
await hdfsClient.CreateFile(gpsFile, destFolder + "gps.avro");
Notice that the settings used by the WebHDFSClient object are retrieved from the App.Config file. These
settings include the credentials required to connect to the Azure storage account used by HDInsight and
the path for the folder to which the files should be uploaded. In this scenario the InputDir configuration
setting has the value /racecar/source/, so the GPS data file will be saved as /racecar/source/gps.avro.
Note that the Pig Latin script uses the AvroStorage load function to load the data file. This load function
enables Pig to read the schema and data from the Avro file, with the result that the script can use the
properties of the serialized objects to refer to the data structures in the file. For example, the script
filters the data based on the Position and Time properties of the objects that were serialized. The script
then uses the FLATTEN function to extract the Lat and Lon values from the Position property, and stores
the resulting data (which now consists of regular rows and columns) in the /racecar/gps folder using the
default tab-delimited text file format.
Similar Pig Latin scripts named engine.pig and brake.pig were created to process the engine and brake
data files.
Combining the readings into a single table
The Pig scripts that process the three Avro-format source files restructure the data for each sensor and
store it in tab-delimited files. To combine the data in these files the developers decided to use Hive
because of the simplicity it provides when querying tabular data structures. The first stage in this
process was to create a script that builds Hive tables over the output files generated by Pig. For
example, the following HiveQL code defines a table over the filtered GPS data.
HiveQL (createtables.hql)
CREATE TABLE gps (laptime STRING, lat DOUBLE, lon DOUBLE, speed FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/racecar/gps';
Similar code was used for the tables that will hold the engine and brake data.
The script also defines the schema for a table named lap that will store the combined data. This script
contains the following HiveQL code, which references a currently empty folder.
HiveQL (createtables.hql)
CREATE TABLE lap
(laptime STRING, lat DOUBLE, lon DOUBLE, speed FLOAT, revs FLOAT, oiltemp FLOAT,
braketemp FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/racecar/lap';
To combine the data from the three sensors and load it into the lap table, the developers used the
following HiveQL statement.
HiveQL (loadlaptable.hql)
FROM gps LEFT OUTER JOIN engine
ON (gps.laptime = engine.laptime) LEFT OUTER JOIN brake
ON (gps.laptime = brake.laptime)
INSERT INTO TABLE lap
SELECT gps.*, engine.revs, engine.oiltemp, brake.braketemp;
This code joins the data in the three tables based on a common time value (so that each row contains all
of the readings for a specific time), and inserts all fields from the gps table, the revs and oiltemp fields
from the engine table, and the braketemp field from the brake table, into the lap table.
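The effect of the left outer joins can be sketched outside Hive. The following Python fragment (illustrative only; the names are hypothetical and the real join is the HiveQL above) keys the engine and brake rows by laptime and keeps every gps row, producing None where Hive would produce NULL:

```python
def combine_lap(gps, engine, brake):
    # Index engine and brake rows by their shared laptime key.
    engine_by_time = {e["laptime"]: e for e in engine}
    brake_by_time = {b["laptime"]: b for b in brake}
    lap = []
    for g in gps:  # left outer join: every gps row is retained
        e = engine_by_time.get(g["laptime"], {})
        b = brake_by_time.get(g["laptime"], {})
        lap.append((g["laptime"], g["lat"], g["lon"], g["speed"],
                    e.get("revs"), e.get("oiltemp"), b.get("braketemp")))
    return lap
```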
Now that all of the tasks for the workflow have been defined, they can be combined into a workflow
definition. This is described in Encapsulating the ETL tasks in an Oozie workflow.
application, and reuse them regularly. The workflow must execute these tasks in the correct order and,
where appropriate, wait until each one completes before starting the next one.
A workflow defined in Oozie can fulfil these requirements, and enable automation of the entire process.
Figure 1 shows an overview of the Oozie workflow that the developers implemented.
Note that, in addition to the tasks described earlier, a new task has been added that drops any existing
Hive tables before processing the data. Because the Hive tables are INTERNAL, dropping them cleans up
any data left by previous uploads. This task uses the following HiveQL code.
HiveQL (droptables.hql)
DROP TABLE gps;
DROP TABLE engine;
DROP TABLE brake;
DROP TABLE lap;
The workflow includes a fork, enabling the three Pig tasks that filter the individual data files to be
executed in parallel. A join is then used to ensure that the next phase of the workflow doesn't start until
all three Pig jobs have finished.
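The fork/join pattern can be sketched in general-purpose code. The following Python fragment (a conceptual analogy only, not how Oozie is implemented) runs the forked tasks concurrently and blocks at the join until every one has completed:

```python
from concurrent.futures import ThreadPoolExecutor

def fork_join(tasks, next_phase):
    """Run the forked tasks in parallel, then invoke the next phase
    only once every task has completed (the join)."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]  # fork
        results = [f.result() for f in futures]          # join waits for all
    return next_phase(results)
```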
If any of the tasks should fail, the workflow executes the kill task. This generates a message containing
details of the error, abandons any subsequent tasks, and halts the workflow. As long as there are no
errors, the workflow ends after the Sqoop task that loads the data into Azure SQL Database has
completed.
When executed, the workflow currently exits with an error. This is due to a fault in Oozie and is not an
error in the scripts. For more information see A CoordActionUpdateXCommand gets queued for all
workflows even if they were not launched by a coordinator.
<fork name="CleanseData">
<path start="FilterGps" />
<path start="FilterEngine" />
<path start="FilterBrake" />
</fork>
<action name="FilterGps">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<action name="FilterEngine">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<action name="FilterBrake">
...
<ok to="CombineData"/>
<error to="fail"/>
</action>
<join name="CombineData" to="CreateTables" />
<action name="CreateTables">
...
<ok to="LoadLapTable"/>
<error to="fail"/>
</action>
<action name="LoadLapTable">
...
<ok to="TransferData"/>
<error to="fail"/>
</action>
<action name="TransferData">
...
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name="end"/>
</workflow-app>
Each action in the workflow is of a particular type, indicated by the first child element of the <action>
element. For example, the following code shows the DropTables action, which uses Hive.
hPDL (workflow.xml)
...
<action name="DropTables">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>droptables.hql</script>
</hive>
<ok to="CleanseData"/>
<error to="fail"/>
</action>
...
The DropTables action references the script droptables.hql, which contains the HiveQL code to drop any
existing Hive tables. All the script files are stored in the same folder as the workflow.xml file. This folder
also contains files used by the workflow to determine configuration settings for specific execution
environments; for example, the hive-default.xml file referenced by all Hive actions contains the
environment settings for Hive.
The FilterGps action, shown in the following code, is a Pig action that references the gps.pig script. This
script contains the Pig Latin code to process the GPS data.
hPDL (workflow.xml)
...
<action name="FilterGps">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<script>gps.pig</script>
</pig>
<ok to="CombineData"/>
<error to="fail"/>
</action>
...
The FilterEngine and FilterBrake actions are similar to the FilterGps action, but specify the appropriate
value for the <script> element.
After the three filter actions have completed, following the <join> element in the workflow file, the
CreateTables action generates the new internal Hive tables over the data, and the LoadLapTable action
combines the data into the lap table. These are both Hive actions, defined as shown in the following
code.
hPDL (workflow.xml)
...
<action name="CreateTables">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>createtables.hql</script>
</hive>
<ok to="LoadLapTable"/>
<error to="fail"/>
</action>
<action name="LoadLapTable">
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>loadlaptable.hql</script>
</hive>
<ok to="TransferData"/>
<error to="fail"/>
</action>
...
The final action is the TransferData action. This is a Sqoop action, defined as shown in the following
code.
hPDL (workflow.xml)
...
<action name="TransferData">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<arg>export</arg>
<arg>--connect</arg>
<arg>${connectionString}</arg>
<arg>--table</arg>
<arg>${targetSqlTable}</arg>
<arg>--export-dir</arg>
<arg>${outputDir}</arg>
<arg>--input-fields-terminated-by</arg>
<arg>\t</arg>
<arg>--input-null-non-string</arg>
<arg>\\N</arg>
</sqoop>
<ok to="end"/>
<error to="fail"/>
</action>
...
Several of the values used by the actions in this workflow are parameters that are set in the job
configuration, and are populated when the workflow executes. The syntax ${...} denotes a parameter
that is populated at runtime. For example, the TransferData action includes an argument for the
connection string to be used when connecting to Azure SQL database. The value for this argument is
passed to the workflow as a parameter named connectionString. When running the Oozie workflow
from a command line, the parameter values can be specified in a job.properties file as shown in the
following example.
job.properties file
nameNode=wasbs://container-name@mystore.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/racecar/oozieworkflow/
outputDir=/racecar/lap/
connectionString=jdbc:sqlserver://server-name.database.windows.net:1433;
database=database-name;user=user-name@server-name;password=password;encrypt=true;
trustServerCertificate=true;loginTimeout=30;
targetSqlTable=LapData
The ability to abstract settings in a separate file makes the ETL workflow more flexible. It can be easily
adapted to handle future changes in the environment, such as a requirement to use alternative folder
locations or a different Azure SQL Database instance.
The job.properties file may contain sensitive information such as database connection strings and
credentials (as in the example above). This file is uploaded to the cluster and so cannot easily be
encrypted. Ensure you properly protect this file when it is stored outside of the cluster, such as on
client machines that will initiate the workflow, by applying appropriate file permissions and computer
security practices.
When the Oozie job starts, the command line interface displays the unique ID assigned to the job. The
administrators can then view the progress of the job by using the browser on the HDInsight cluster to
display job status at http://localhost:11000/oozie/v0/job/the_unique_job_id?show=log.
With the Oozie workflow definition complete, the next stage is to automate its execution. This is
described in the next section, "Automating the ETL workflow."
await hdfsClient.DeleteDirectory(workflowDir);
foreach (var file in workflowLocalDir.GetFiles())
{
await hdfsClient.CreateFile(file.FullName, workflowDir + file.Name);
}
Notice that the code begins by deleting the workflow directory if it already exists in Azure blob storage,
and then uploads each file from the local OozieWorkflow folder.
string id = json.id;
await client.StartJob(id);
Console.WriteLine("Oozie job started");
Console.WriteLine("View workflow progress at " + clusterAddress + "/oozie/v0/job/" +
id + "?show=log");
This code retrieves the parameters for the Oozie job from the App.Config file for the application, and
initiates the job on the HDInsight cluster. When the job is submitted, its ID is retrieved and the
application displays a message such as:
View workflow progress at https://mycluster.azurehdinsight.net/oozie/v0/job/job_id?show=log
Users can then browse to the URL indicated by the application to view the progress of the Oozie job as it
performs the ETL workflow tasks.
The final stage is to explore how the data in SQL Database can be used. An example is shown in the next
section, "Analyzing the loaded data."
Scenario 4: BI integration
This scenario explores ways in which big data batch processing with HDInsight can be integrated into a
business intelligence (BI) solution in a corporate environment. The emphasis in this scenario is on the
challenges and techniques associated with integrating data from HDInsight into a BI ecosystem based on
Microsoft SQL Server and Office technologies. This includes integration at the report, corporate data
model, and data warehouse levels of an enterprise BI solution, as well as how insights from big data
analysis in HDInsight can be shared in a self-service BI solution built on Office 365 and Power BI.
The scenario includes and demonstrates:
Collaborative self-service BI
The data ingestion and processing elements of the example used in this scenario have been deliberately
kept simple in order to focus on the integration techniques. In a real-world solution the challenge of
obtaining the source data, loading it to the HDInsight cluster, and using map/reduce code, Pig, or Hive to
process it before consuming it in a BI infrastructure are likely to be more complex than described in this
scenario.
customer profile data. The high-level architecture of the Adventure Works BI solution is shown in Figure
1.
The ability to analyze the log data and summarize website activity over time would help the business to
measure the amount of data transferred during web requests, and potentially correlate web activity
with sales transactions to better understand trends and patterns in e-commerce sales. However, the
large volume of log data that must be processed in order to extract these insights has prevented the
company from attempting to include the log data in the enterprise data warehouse.
The company has recently decided to use HDInsight to process and summarize the log data so that it can
be reduced to a more manageable volume, and integrated into the enterprise BI ecosystem. The
developers will integrate the results of the processing at all three levels of their existing BI system, as
shown in Figure 6, and also enable self-service BI through Power BI for Office 365.
Figure 6 - The three levels for integration of the results into the existing BI system.
This Hive table defines a schema for the log file, making it possible to use a query that filters the rows in
order to load the required data into a permanent table for analysis. Notice that the staging table is
based on the /data folder but it is not defined as EXTERNAL, so dropping the staging table after the
required rows have been loaded into the permanent table will delete the source files that are no longer
required.
When designing the permanent table for analytical queries, the BI developer has decided to partition
the data by year and month to improve query performance when extracting data. To achieve this, a
second Hive statement is used to define a partitioned table; notice the PARTITIONED BY clause near
the end of the following script. This instructs Hive to add two columns named year and month to the
table, and to partition the data loaded into the table based on the values inserted into these columns.
HiveQL
DROP TABLE iis_log;
CREATE TABLE iis_log
(logdate STRING, logtime STRING, c_ip STRING, cs_username STRING, s_ip STRING,
s_port STRING, cs_method STRING, cs_uri_stem STRING, cs_uri_query STRING,
sc_status STRING, sc_bytes INT, cs_bytes INT, time_taken INT,
cs_User_Agent STRING, cs_Referrer STRING)
PARTITIONED BY (year INT, month INT)
STORED AS SEQUENCEFILE;
The Hive scripts to create the tables are saved as text files in a local folder named scripts.
Storing the data in SEQUENCEFILE format can improve performance. You might also consider using the
ORC file format, which provides a highly efficient way to store Hive data and can improve performance
when reading, writing, and processing data. See ORC File Format for more information.
Next, the following Hive script is created to load data from the log_staging table into the iis_log table.
This script takes the values from the columns in the log_staging Hive table, calculates the values for the
year and month of each row, and inserts these rows into the partitioned iis_log Hive table.
HiveQL
SET mapred.output.compression.type=BLOCK;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM log_staging s
INSERT INTO TABLE iis_log PARTITION (year, month)
SELECT s.logdate, s.logtime, s.c_ip, s.cs_username, s.s_ip, s.s_port, s.cs_method,
s.cs_uri_stem, s.cs_uri_query, s.sc_status, s.sc_bytes, s.cs_bytes,
s.time_taken, s.cs_User_Agent, s.cs_Referrer,
SUBSTR(s.logdate, 1, 4) year, SUBSTR(s.logdate, 6, 2) month
WHERE SUBSTR(s.logdate, 1, 1) <> '#';
The source log data includes a number of header rows that are prefixed with the # character, which
could cause errors or add unnecessary complexity when summarizing the data. To resolve this, the HiveQL
statement shown above includes a WHERE clause that ignores rows starting with "#" so that they are
not loaded into the permanent table.
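The same header-row test is easy to express in any language. A minimal Python sketch (illustrative only; the function name is hypothetical) of the filter that the WHERE clause applies:

```python
def data_rows(log_lines):
    # Keep only rows that do not start with '#', the same test the
    # HiveQL WHERE clause applies with SUBSTR(logdate, 1, 1) <> '#'.
    return [line for line in log_lines if not line.startswith("#")]
```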
To maximize performance the script includes statements that specify the output from the query should
be compressed. The code also sets the dynamic partition mode to nonstrict, enabling rows to be
dynamically inserted into the appropriate partitions based on the values of the partition columns.
When data is added to the iis_log table the index can be updated using the following HiveQL statement.
HiveQL
ALTER INDEX idx_logdate ON iis_log REBUILD;
The scripts to load the iis_log table and build the index are also saved in the local scripts folder.
Tests revealed that indexing the tables provided only a small improvement in performance of queries,
and that building the index took longer than the time saved when running the query. However, the
results depend on factors such as the volume of source data, and so you should experiment to see if
indexing is a useful optimization technique in your scenario.
$destfolder = "scripts"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
               -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName `
                              -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
# Run scripts to create Hive tables.
write-host "Creating Hive tables..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/CreateTables.txt"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError
# Upload data to staging table.
$localfolder = "$thisfolder\iislogs"
$destfolder = "data"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName `
               -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach($file in $files)
{
  $fileName = "$localFolder\$file"
  $blobName = "$destfolder/$file"
  write-host "copying $fileName to $blobName"
  Set-AzureStorageBlobContent -File $fileName -Container $containerName `
                              -Blob $blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
# Run scripts to load Hive tables.
write-host "Loading Hive table..."
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/scripts/LoadTables.txt"
It uploads the contents of the local scripts folder to the /scripts folder in HDInsight.
It runs the CreateTables.txt Hive script to drop and recreate the log_staging and iis_log tables
(any previously uploaded data will be deleted because both are internal tables).
It uploads the contents of the local iislogs folder to the /data folder in HDInsight (thereby
loading the source data into the staging table).
It runs the LoadTables.txt Hive script to load the data from the log_staging table into the iis_log
table and create an index (note that the text data in the staging table is implicitly converted to
SEQUENCEFILE format as it is inserted into the iis_log table).
Where a more restricted dataset is required the BI developer or business user can use a script that
selects on the year and month columns, and transforms the data as required. For example, the following
script extracts just the data for the first quarter of 2012, aggregates the number of hits for each day (the
logdate column contains the date in the form yyyy-mm-dd), and returns a dataset with two columns: the
date and the total number of page hits.
HiveQL
SELECT logdate, COUNT(*) pagehits FROM iis_log
WHERE year = 2012 AND month < 4
GROUP BY logdate
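The aggregation performed by this grouped query can be sketched in plain code. The following Python fragment (illustrative only; the function name and row shape are hypothetical) counts page hits per day for the first quarter of 2012:

```python
from collections import Counter

def q1_page_hits(rows):
    """Count page hits per logdate for January-March 2012,
    mirroring the grouped HiveQL query. Each row is a
    (logdate, year, month) tuple."""
    hits = Counter()
    for logdate, year, month in rows:
        if year == 2012 and month < 4:
            hits[logdate] += 1
    return dict(hits)
```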
server log data in HDInsight) and the number of items sold (available from the enterprise data
warehouse). Since only the business analysts require this combined data, there is no need at this stage
to integrate the web log data from HDInsight into the entire enterprise BI solution. Instead, a business
analyst can use PowerPivot to create a personal data model in Excel specifically for this mashup analysis.
The ODBC connection to the HDInsight cluster is typically defined in a data source name (DSN) on the
local computer, which makes it easy to define a connection for programs that will access data in the
cluster. The DSN encapsulates a connection string such as this:
Connection string
DRIVER={Microsoft Hive ODBC Driver};Host=<cluster_name>.azurehdinsight.net;Port=443;
Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;
UID=UserName;PWD=Password;DefaultStringColumnLength=4000
After the connection has been defined and tested, the business analyst uses the following HiveQL query
to create a new table named Page Hits that contains aggregated log data from HDInsight.
HiveQL
SELECT logdate, COUNT(*) hits FROM iis_log GROUP BY logdate
This query returns a single row for each distinct date that has log entries, along with a count of the
number of page hits that were recorded on that date. The logdate values in the underlying Hive table
are defined as text, but the yyyy-mm-dd format of the text values means that the business analyst can
simply change the data type for the column in the PowerPivot table to Date; making it possible to create
a relationship that joins the logdate column in the Page Hits table to the Date column in the Date table,
as shown in Figure 2.
Figure 3 - Using Power View in Excel to analyze data from HDInsight and the data warehouse
By integrating IIS log data from HDInsight with enterprise BI data at the report level, business analysts
can create mashup reports and analyses without impacting the BI infrastructure used for corporate
reporting. However, after using this report-level integration to explore the possibilities of using IIS log
data to increase understanding of the business, it has become apparent that the log data could be useful
to a wider audience of users than just business analysts, and for a wider range of business processes.
This can be achieved through corporate data model integration, discussed in the next section.
Scorecards and dashboards for Adventure Works are currently based on the SSAS corporate data model,
which is also used to support formal reports and analytical business processes. The corporate data
model is implemented as an SSAS database in tabular mode, so the process to add a table for the IIS log
data is similar to the one used to import the results of a HiveQL query into a PowerPivot model. A BI
developer uses SQL Server Data Tools to add an ODBC connection to the HDInsight cluster and create a
new table named Page Hits based on the following query.
HiveQL
SELECT logdate, SUM(sc_bytes) sc_bytes, SUM(cs_bytes) cs_bytes, COUNT(*) pagehits
FROM iis_log GROUP BY logdate
Notice that this query includes more columns than the one previously used in the personal data model,
making it useful for more kinds of analysis by a wider audience.
The fact that Adventure Works is using SSAS in tabular mode makes it possible to connect to an ODBC
source such as Hive. If SSAS had been installed in multidimensional mode, the developer would have
had to either extract the data from HDInsight into an OLE DB compliant data source, or base the data
model on a linked SQL Server database that has a remote server connection over ODBC to the Hive
tables.
After the Page Hits table has been created and the data imported into the model, the data type of the
logdate column is changed to Date and a relationship is created with the Date table in the same way as
in the PowerPivot data model discussed in Report level integration. However, one significant difference
between PowerPivot models and SSAS tabular models is that SSAS does not create implicit aggregated
measures from numeric columns in the same way as PowerPivot does.
The Page Hits table contains a row for each date, with the total bytes sent and received, and the total
number of page hits for that date. The BI developer created explicit measures to aggregate the sc_bytes,
cs_bytes, and pagehits values across multiple dates based on Data Analysis Expressions (DAX) formulas,
as shown in Figure 1.
The requirement to match the product code to the surrogate key for the appropriate version of the
product makes integration at the report or corporate data model levels problematic. It is possible to
perform complex lookups to find the appropriate surrogate key for an alternate key at any level at the
time a specific page hit occurred (assuming both the surrogate and alternate keys for the products are
included in the data model or report dataset). However, it is more practical to integrate the IIS log data
into the dimensional model of the data warehouse so that the relationship with the product dimension
(and the date dimension) is present throughout the entire enterprise BI stack.
The most problematic task in achieving integration with BI systems at the data warehouse level is
typically matching the keys in the source data with the correct surrogate key in the data warehouse
tables, where changes to the existing data over time prompt the use of an alternate key.
The table will be loaded with new log data on a regular schedule as part of the ETL process for the data
warehouse. In common with most data warehouse ETL processes, the solution at Adventure Works
makes use of staging tables as an interim store for new data, making it easier to coordinate data loads
into multiple tables and perform lookups for surrogate key values. A staging table is created in a
separate staging schema using the following Transact-SQL statement.
Transact-SQL
CREATE TABLE staging.IISLog([LogDate] nvarchar(50) NOT NULL,
[ProductID] nvarchar(50) NOT NULL, [BytesSent] decimal NULL,
[BytesReceived] decimal NULL, [PageHits] int NULL);
Good practice when regularly loading data into a data warehouse is to minimize the amount of data
extracted from each data source so that only data that has been inserted or modified since the last
refresh cycle is included. This minimizes extraction and load times, and reduces the impact of the ETL
process on network bandwidth and storage utilization. There are many common techniques you can use
to restrict extractions to only modified data, and some data sources support change tracking or change
data capture (CDC) capabilities to simplify this.
In the absence of support in Hive tables for restricting extractions to only modified data, the BI
developers at Adventure Works have decided to use a common pattern that is often referred to as a
high water mark technique. In this pattern the highest log date value that has been loaded into the data
warehouse is recorded, and used as a filter boundary for the next extraction. To facilitate this, the
following Transact-SQL statement is used to create an extraction log table and initialize it with a default
value.
Transact-SQL
CREATE TABLE staging.highwater([ExtractDate] datetime DEFAULT GETDATE(),
[HighValue] nvarchar(200));
INSERT INTO staging.highwater (HighValue) VALUES ('0000-00-00');
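The high water mark logic itself is straightforward. The following Python sketch illustrates the pattern with in-memory rows; the function and sample data are illustrative only, since at Adventure Works the filtering is performed by the HiveQL extraction query using the recorded boundary value.

Python

```python
# Minimal sketch of the high water mark pattern: extract only rows whose
# logdate is greater than the boundary recorded by the previous run.
# The yyyy-mm-dd text format sorts correctly as a string.
def extract_new_rows(rows, high_value):
    """Return rows newer than the recorded high value, plus the new
    high value to record for the next extraction."""
    new_rows = [r for r in rows if r["logdate"] > high_value]
    new_high = max((r["logdate"] for r in new_rows), default=high_value)
    return new_rows, new_high

rows = [
    {"logdate": "2012-01-01", "pagehits": 120},
    {"logdate": "2012-01-02", "pagehits": 145},
]
# First run: everything after the initial '0000-00-00' boundary is extracted.
extracted, high = extract_new_rows(rows, "0000-00-00")
# Second run with no new data extracts nothing.
extracted2, _ = extract_new_rows(rows, high)
```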
There are several options for transferring the log data from HDInsight into the data warehouse, including:
Using Sqoop to export the data from HDInsight and push it to SQL Server.
Using PolyBase to combine the data in an HDInsight cluster with a Microsoft Analytics Platform
System (APS) database (PolyBase is available only in APS appliances).
Using SSIS to extract the data from HDInsight and load it into SQL Server.
In the case of the Adventure Works scenario, the data warehouse is hosted on an on-premises server
that cannot be accessed from outside the corporate firewall. Since the HDInsight cluster is hosted
externally in Azure, the use of Sqoop to push the data to SQL Server is not a viable option. In addition,
the SQL Server instance used to host the data warehouse is running SQL Server 2012 Enterprise Edition,
not APS, and PolyBase cannot be used in this scenario.
The most appropriate option, therefore, is to use SSIS to implement a package that transfers data from
HDInsight to the staging table. Since SSIS is already used as the ETL platform for loading data from other
business sources to the data warehouse, this option also reduces the development and management
challenges for implementing an ETL solution to extract data from HDInsight.
The document Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS) contains a
wealth of useful information about using SSIS with HDInsight.
The control flow for an SSIS package to extract the IIS log data from HDInsight is shown in Figure 1.
This expression consists of a HiveQL query to extract the required data, combined with the value of the
HighWaterMark SSIS variable to filter the data being extracted so that only rows with a logdate value
greater than the highest ones already in the data warehouse are included.
Based on the log files for the Adventure Works e-commerce site, the cs_uri_query values in the web
server log file contain either the value - (for requests with no query string) or productid=product-code
(where product-code is the product code for the requested product). The HiveQL query includes a
regular expression that parses the cs_uri_query value and removes the text productid=. The Hive
query therefore generates a results set that includes a productid column, which contains either the
product code value or -.
The query string example in this scenario is deliberately simplistic in order to reduce complexity. In a
real-world solution, parsing query strings in a web server log may require a significantly more complex
expression, and may even require a user-defined Java function.
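The parsing logic can be illustrated outside Hive. The following Python sketch shows the equivalent of the simple expression described above; the pattern and the sample product code are illustrative only.

Python

```python
import re

# Sketch of the kind of expression the Hive query uses to strip the
# 'productid=' prefix; the pattern is a deliberate simplification.
def parse_product_id(cs_uri_query):
    """Return the product code from a query string, or '-' when absent."""
    match = re.match(r"productid=(.+)", cs_uri_query)
    return match.group(1) if match else "-"

parse_product_id("productid=BK-M82S-44")  # request for a product page
parse_product_id("-")                     # request with no query string
```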
After the data has been extracted it flows to the Data Type Conversion transformation, which converts
the logdate and productid values to 50-character Unicode strings. The rows then flow to the Staging
Table destination, which loads them into the staging.IISLog table.
The code inserts rows from the staging.IISLog table into the dbo.FactIISLog table, looking up the
appropriate dimension keys for the date and product dimensions. The surrogate key for the date
dimension is an integer value derived from the year, month, and day. The LogDateKey value extracted
from HDInsight is a string in the format YYYY-MM-DD. SQL Server can implicitly convert values in this
format to the Date data type, so a join can be made to the FullDateAlternateKey column in the
DimDate table to find the appropriate surrogate DateKey value. The DimProduct dimension table
includes a row for None (with the ProductAlternateKey value -), and one or more rows for each
product.
Each product row has a unique ProductKey value (the surrogate key) and an alternate key that matches
the product code extracted by the Hive query. However, because Product is a slowly changing
dimension there may be multiple rows with the same ProductAlternateKey value, each representing the
same product at a different point in time. When loading the product data, the appropriate surrogate key
for the version of the product that was current when the web page was requested must be looked up
based on the alternate key and the start and end date values associated with the dimension record, so
the join for the DimProduct table in the Transact-SQL code includes a clause to check for a StartDate
value that is before the log date, and an EndDate value that is either after the log date or null (for the
record representing the current version of the product member).
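The lookup logic described above can be sketched as follows. This Python example is illustrative only; in the Adventure Works solution the equivalent logic is implemented as a join clause in the Transact-SQL load code.

Python

```python
from datetime import date

# Sketch of the slowly changing dimension lookup: find the surrogate key
# for the version of the product that was current on the log date.
# The dimension rows are illustrative; a None end date marks the current version.
def lookup_product_key(dim_rows, alternate_key, log_date):
    for row in dim_rows:
        if (row["alternate_key"] == alternate_key
                and row["start_date"] <= log_date
                and (row["end_date"] is None or row["end_date"] > log_date)):
            return row["product_key"]
    return None

dim_product = [
    {"product_key": 101, "alternate_key": "BK-1",
     "start_date": date(2010, 1, 1), "end_date": date(2011, 6, 30)},
    {"product_key": 205, "alternate_key": "BK-1",
     "start_date": date(2011, 6, 30), "end_date": None},
]
lookup_product_key(dim_product, "BK-1", date(2011, 1, 15))  # earlier version
lookup_product_key(dim_product, "BK-1", date(2012, 3, 1))   # current version
```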
has been conformed to the dimensional model of the data warehouse, and this deep integration enables
business users to intuitively aggregate IIS activity across dates and products. For example, Figure 4
shows how a user can create a Power View visualization in Excel that includes sales and page view
information for product categories and individual products.
Figure 4 - A Power View report showing data that is integrated in the data warehouse
The inclusion of IIS server log data from HDInsight in the data warehouse enables it to be used easily
throughout the entire BI ecosystem, in managed corporate reports and self-service BI scenarios. For
example, a business user can use Report Builder to create a report that includes web site activity as well
as sales revenue from a single dataset, as shown in Figure 5.
Figure 5 - A self-service report based on a data warehouse that contains data from HDInsight
Collaborative self-service BI
In addition to the enterprise BI solution at Adventure Works, described in Scenario 4: BI integration,
business analysts use Excel and SharePoint Server to create and share their own analytical models. This
self-service BI approach has become increasingly useful at Adventure Works because it makes it easier
for business analysts to rapidly develop custom reports that combine internal and external data, without
over-burdening the IT department with requests for changes to the data warehouse. The company has
therefore added the Power BI service to its corporate Office 365 subscription, and encourages business
analysts to use it to share insights gained from their analysis.
means that the business analysts can store analytical datasets as files in Azure blob storage without
relying on the HDInsight cluster remaining available to service Hive queries.
For example, a senior business analyst can use the following Pig script to generate a result set that is
saved as a file in Azure blob storage.
Pig Latin
Logs = LOAD '/data' USING PigStorage(' ')
    AS (log_date, log_time, c_ip, cs_username, s_ip, s_port, cs_method, cs_uri_stem,
        cs_uri_query, sc_status, sc_bytes:int, cs_bytes:int, time_taken:int,
        cs_user_agent, cs_referrer);
CleanLogs = FILTER Logs BY SUBSTRING(log_date, 0, 1) != '#';
GroupedLogs = GROUP CleanLogs BY log_date;
GroupedTotals = FOREACH GroupedLogs GENERATE group, COUNT(CleanLogs) AS page_hits,
    SUM(CleanLogs.sc_bytes) AS bytes_received, SUM(CleanLogs.cs_bytes) AS bytes_sent;
DailyTotals = FOREACH GroupedTotals GENERATE FLATTEN(group) AS log_date, page_hits,
    bytes_received, bytes_sent;
SortedDailyTotals = ORDER DailyTotals BY log_date ASC;
STORE SortedDailyTotals INTO '/webtraffic';
Running this Pig script produces a file named part-r-00000 in the /webtraffic folder in the Azure blob
storage container used by the HDInsight cluster. The file contains the date, total page hits, and total
bytes received and sent, for each day. This file will be persisted even if the HDInsight cluster is
deactivated.
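Because the STORE statement uses the default PigStorage serializer for output, the part file contains tab-separated fields. The following Python sketch shows how a downstream process might read it; the parsing function and sample line are illustrative.

Python

```python
# Sketch: read the tab-delimited part file produced by the Pig script.
# The sample line mirrors the columns the script generates:
# log_date, page_hits, bytes_received, bytes_sent.
def parse_webtraffic(lines):
    totals = []
    for line in lines:
        log_date, page_hits, bytes_received, bytes_sent = (
            line.rstrip("\n").split("\t"))
        totals.append({
            "log_date": log_date,
            "page_hits": int(page_hits),
            "bytes_received": int(bytes_received),
            "bytes_sent": int(bytes_sent),
        })
    return totals

sample = ["2012-01-01\t120\t524288\t1048576\n"]
daily = parse_webtraffic(sample)
```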
Figure 4 - A Power View report based on data imported from a shared query
To share the insights gained from the data, business users can publish Excel workbooks that contain
PowerPivot data models and Power View visualizations as reports in a Power BI site, as shown in Figure
5.
This section is divided into convenient areas that make it easier to understand the challenges, options,
solutions, and considerations for each stage. It describes and demonstrates the individual tasks that are
part of typical end-to-end big data solutions.
The following sections demonstrate the three main stages of the process, followed by an exploration of
how you can combine and automate them to build a comprehensive managed solution. The sections
are:
Obtaining the data and submitting it to the cluster. During this stage you decide how you
will collect the data you have identified as the source, and how you will get it into your big
data solution for processing. Often you will store the data in its raw format to avoid losing
any useful contextual information it contains, though you may choose to do some pre-processing before storing it to remove duplication or to simplify it in some other way. You
must also make several decisions about how and when you will initialize a cluster and the
associated storage. For more details, see Collecting and loading data into HDInsight.
Processing the data. After you have started to collect and store the data, the next stage is
to develop the processing solutions you will use to extract the information you need. While
you can usually use Hive and Pig queries for even quite complex data extraction, you will
occasionally need to create map/reduce components to perform more complex queries
against the data. For more details, see Processing, querying, and transforming data using
HDInsight.
Visualizing and analyzing the results. Once you are satisfied that the solution is working
correctly and efficiently, you can plan and implement the analysis and visualization
approach you require. This may be loading the data directly into an application such as
Microsoft Excel, or exporting it into a database or enterprise BI system for further analysis,
reporting, charting, and more. For more details, see Consuming and visualizing data from
HDInsight.
Building an automated end-to-end solution. At this point it will become clear whether the
solution should become part of your organization's business management infrastructure,
complementing the other sources of information that you use to plan and monitor business
performance and strategy. If this is the case you should consider how you might automate
and manage some or all of the solution to provide predictable behavior, and perhaps so
that it is executed on a schedule. For more details, see Building end-to-end solutions using
HDInsight.
Security is also a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include securing
data while in transit over the network, securing data in storage, and authenticating and authorizing
users who have access to the tools and utilities you use as part of your process. For more details of how
you can maximize security of your HDInsight solutions, see the topic Security in the section Building end-to-end solutions using HDInsight.
Considerations
When planning how you will obtain the source data for your big data solution, consider the following:
You may need to load data from a range of different data sources such as websites, RSS feeds,
clickstreams, custom applications and APIs, relational databases, and more. It's vital to ensure
that you can submit this data efficiently and accurately to cluster storage, including performing
any preprocessing that may be required to capture the data and convert it into a suitable form.
In some cases, such as when the data source is an internal business application or database,
extracting the data into a file in a form that can be consumed by your solution is relatively
straightforward. In the case of external data obtained from sources such as governments and
commercial data providers, the data is often available for download in a suitable format.
However, in other cases you may need to extract data through a web service or other API,
perhaps by making a REST call or using code.
You may need to stage data before submitting it to a big data cluster for processing. For
example, you may want to persist streaming data so that it can be processed in batches, or
collect data from more than one data source and combine the datasets before loading this into
the cluster. Staging is also useful when combining data from multiple sources that have
different formats and velocity (rate of arrival).
Dedicated tools are available for handling specific types of data such as relational or server log
data. See Choosing tools and technologies for more information.
More information
For more information about HDInsight, see the Microsoft Azure HDInsight web page.
For a guide to uploading data to HDInsight, and some of the tools available to help, see Upload data to
HDInsight on the HDInsight website.
For more details of how HDInsight uses Azure blob storage, see Use Microsoft Azure Blob storage with
HDInsight on the HDInsight website.
For information about creating a cluster using scripts or code see Custom cluster management clients.
example, you can store parts of your data in separate storage accounts to help protect and isolate
sensitive information, or use different storage accounts to stage data as part of your ingestion process.
You can also reduce runtime costs by creating the storage account and loading the data before you
create the cluster. Additionally, using non-linked storage accounts can help to maximize security by
isolating data for different users or tenants and allowing each one to manage their own storage account
and upload the data to it themselves, before you process the data in your HDInsight cluster.
Considerations
Keep in mind the following when deciding how and when you will create storage accounts for a cluster:
The main advantage of allowing HDInsight to create one or more storage accounts that are
automatically linked to the cluster during the creation process is that you do not need to specify
the storage account credentials, such as the storage account name and key, when you access
the data in a query or transformation process running on your HDInsight cluster. HDInsight
automatically stores the required credentials within its configuration. However, you will need to
obtain the storage key when you want to upload data to the storage account and access the
results.
The main advantage of using non-linked storage accounts and containers is the flexibility this
provides in choosing the storage account to use with each job. However, you must specify the
target storage account name and key within your query or transformation when you access
data stored in accounts that are not linked to the cluster.
You can specify the storage accounts that are linked to the cluster only when you create the
cluster. You cannot add or remove linked accounts after a cluster has been created. If you need
more than one storage account to be linked to your cluster, you must specify them all as part of
the cluster creation operation.
You can create the storage accounts before or after you create the cluster. Typically you will use
this capability to minimize cluster runtime cost by creating the storage accounts (or using
existing storage accounts) and loading the data before you create the cluster.
If you store parts of your data in different storage accounts, perhaps to separate sensitive data
such as personally identifiable information (PII) and account information from non-sensitive
data, you can create a cluster that uses just a subset of these as the linked accounts. This allows
you to isolate and protect parts of the data while avoiding the need to specify storage account
credentials in queries and transformations. Be aware, however, that code running in HDInsight
will have full access to all of the data in a linked account because the account name and key are
stored in the cluster configuration.
If you do not specify the storage account and path to the data when you submit a job, HDInsight
will use the default container. If you intend to use accounts and containers other than the
default, or delete and then recreate a cluster over the same data, specify the full path of the
account and container in all queries and transformation processes that you will execute on your
HDInsight cluster. This ensures that each job accesses the correct container, and prevents errors
if you subsequently delete and recreate the cluster with different default containers. The full
path and name of a container is in the form wasbs://[container-name]@[storage-account-name].blob.core.windows.net.
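As a simple illustration, the full path can be constructed from the container and storage account names. This Python helper and its placeholder names are illustrative only.

Python

```python
# Sketch: build the full path of a blob container so that jobs do not
# depend on the cluster's default container. Names are placeholders.
def container_path(container_name, storage_account_name):
    return f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net"

path = container_path("mydata", "mystorageacct")
```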
Any storage accounts associated with an HDInsight cluster should be in the same data center as
the cluster, and must not be in an affinity group. Using a container in a storage account in a
different datacenter will result in delays as data is transmitted between datacenters, and you
will be billed for these data transfers.
For more information see Use Azure Blob storage with HDInsight, Provision Hadoop clusters in
HDInsight, and Using an HDInsight Cluster with Alternate Storage Accounts and Metastores.
If you want to retain the schema definitions of Hive tables and the HCatalog metadata, you
must specify an existing SQL Database instance when you create the cluster for the first time. If
you allow HDInsight to create the database, it will be deleted when you delete the cluster.
The data for Hive tables you create in the cluster is retained only if you specify the EXTERNAL
option when you create the tables.
You can back up and restore a SQL Database instance, and export or import the data, using the
tools provided by the Azure management portal or through scripting using the REST interface
for SQL Database.
Ensure you set the required configuration properties for a cluster when you create it. You can
change some properties at runtime for individual jobs (see Configuring and debugging solutions
for details), but you cannot change the properties of an existing cluster. See Custom cluster
management clients for information about automating the creation of clusters and setting
cluster properties.
Consider if you can avoid the need to upload large volumes of data as a discrete operation
before you can begin processing it. For example, you might be able to append data to existing
files in the cluster, or upload it in small batches on a schedule.
If possible, choose or create a utility that can upload data in parallel using multiple threads to
reduce upload time, and that can resume uploads that are interrupted by temporary network
connectivity. Some utilities may be able to split the data into small blocks and upload multiple
blocks or small files in parallel; and then combine them into larger files after they have been
uploaded.
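The split, parallel upload, and recombine approach can be sketched as follows. In this Python example the upload_block function is a stand-in for a real transfer call, and the block size is deliberately tiny for illustration.

Python

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the split/upload-in-parallel/recombine approach.
def split_into_blocks(data, block_size):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def upload_block(block):
    return block  # placeholder for the actual network transfer

def parallel_upload(data, block_size=4, workers=3):
    blocks = split_into_blocks(data, block_size)
    # pool.map preserves block order, so the result can be recombined.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        uploaded = list(pool.map(upload_block, blocks))
    return b"".join(uploaded)  # recombine into the larger file

payload = b"0123456789abcdef"
assert parallel_upload(payload) == payload
```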
Bottlenecks when loading data are often caused by lack of network bandwidth. Adding more
threads may not improve throughput, and can cause additional latency due to the opening and
closing of the connection for each item. In many cases, reusing the connection (which avoids
the TCP ramp-up) is more important.
You can often reduce upload time considerably by compressing the data. If the data at the
destination should be uncompressed, consider compressing it before uploading it and then
decompressing it within the datacenter. Alternatively, you can use one of the HDInsight-compatible compression codecs so that the data in compressed form can be read directly by
HDInsight. This can also improve the efficiency and reduce the running time of jobs that use
large volumes of data. Compression may be done as a discrete operation before you upload the
data, or within a custom utility as part of the upload process. For more details see Pre-processing and serializing the data.
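As a simple illustration of the potential saving, the following Python sketch compresses a block of repetitive log text with gzip; real log data and compression ratios will vary.

Python

```python
import gzip

# Sketch: compress repetitive log data before upload. Text-based log
# files typically compress well; the sample data here is illustrative.
log_data = ("2012-01-01 GET /default.aspx 200\n" * 1000).encode("utf-8")
compressed = gzip.compress(log_data)
ratio = len(compressed) / len(log_data)
# Decompressing within the datacenter restores the original bytes.
assert gzip.decompress(compressed) == log_data
```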
Consider if you can reduce the volume of data to upload by pre-processing it. For example, you
may be able to remove null values or empty rows, consolidate some parts of the data, or strip
out unnecessary columns and values. This should be done in staging, and you should ensure
that you keep a copy of the original data in case the information it contains is required
elsewhere or later. What may seem unnecessary today may turn out to be useful tomorrow.
Choose efficient transfer protocols for uploading data, and ensure that the process can resume
an interrupted upload from the point where it failed. For example, some tools such as Aspera,
Signiant, and File Catalyst use UDP for the data transfer, with TCP working in parallel to validate
the uploaded data packages by ensuring each one is complete and has not been corrupted
during the process.
If one instance of the uploader tool does not meet the performance criteria, consider using
multiple instances to scale out and increase the upload velocity if the tool can support this.
Tools such as Flume, Storm, Kafka, and Samza can scale to multiple instances. SSIS can also be
scaled out, as described in the presentation Scaling Out SSIS with Parallelism (note that you will
require additional licenses for this). Each instance of the uploader you choose might create a
separate file or set of files that can be processed as a batch, or could be combined into fewer
files or a single file by a process running on the cluster servers.
Ensure that you measure the performance of upload processes to ensure that the steps you
take to maximize performance are appropriate to different types and varying volumes of data.
What works for one type of upload process may not provide optimum performance for other
upload processes with different types of data. This is particularly the case when using
serialization or compression. Balance the effects of the processes you use to maximize upload
performance with the impact these have on subsequent query and transformation processing
jobs within the cluster.
Reliability
Data uploads must be reliable to ensure that the data is accurately represented in the cluster. For
example, you might need to validate the uploaded data before processing it. Transient failures or errors
that might occur during the upload process must be prevented from corrupting the data.
However, keep in mind that validation extends beyond just comparing the uploaded data with the
original files. For example, you may extend data validation to ensure that the original source data does
not contain values that are obviously inaccurate or invalid, and that there are no temporal or logical
gaps in the data that should be included.
To ensure reliability, and to be able to track and resolve faults, you will also need to monitor the process.
Using logs to record upload success and failure, and capturing any available error messages, provides a
way to ensure the process is working as expected and to locate issues that may affect reliability.
Considerations for reliability
Consider the following reliability factors when designing your data ingestion processes:
Choose a technology or create an upload tool that can handle transient connectivity and
transmission failures, and can properly resume the process when the problem clears. Many of
the APIs exposed by Azure and HDInsight, and SDKs such as the Azure Storage client libraries,
include transient fault handling management. If you are building custom tools that do not use
these libraries or APIs, you can include this capability using a framework such as the Transient
Fault Handling Application Block.
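The retry behavior such libraries provide can be sketched as follows. This Python example uses a simulated flaky operation; the function names and delay values are illustrative and are not part of any of the libraries mentioned.

Python

```python
import time

# Minimal sketch of transient fault handling: retry with exponential
# backoff, in the spirit of the Transient Fault Handling Application Block.
def with_retries(operation, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # the fault persisted; surface it to the caller
            time.sleep(base_delay * (2 ** (attempt - 1)))

# Simulated upload that fails twice with a transient error, then succeeds.
attempts = {"count": 0}
def flaky_upload():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient network failure")
    return "uploaded"

result = with_retries(flaky_upload)
```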
Monitor upload processes so that failures are detected early and can be fixed before they have
an impact on the reliability of the solution and the accuracy or timeliness of the results. Also
ensure you log all upload operations, including both successes and failures, and any error
information that is available. This is invaluable when trying to trace problems. Some tools, such
as Flume and CloudBerry, can generate log files. AzCopy provides a command line option to log
the upload or download status. You can also enable the built-in monitoring and logging for
many Azure features such as storage, and use the APIs they expose to generate logs. If you are
building a custom data upload utility, you should ensure it can be configured to log all
operations.
Implement linear tracking where possible by recording each stage involved in a process so that
the root cause of failures can be identified by tracing the issue back to its original source.
Consider validating the data after it has been uploaded to ensure consistency, integrity, and
accuracy of the results and to detect any loss or corruption that may have occurred during the
transmission to cluster storage. You might also consider validating the data before you upload
it, although a large volume of data arriving at high velocity may make this impossible. Common
types of validation include counting the number of rows or records, checking for values that
exceed specific minimum or maximum values, and comparing the overall totals for numeric
fields. You may also apply more in-depth approaches such as using a data dictionary to ensure
relevant values meet business rules and constraints, or cross-referencing fields to ensure that
matching values are present in the corresponding reference tables.
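The common validation types listed above can be sketched as follows. This Python example checks row counts, value ranges, and numeric totals against in-memory sample rows; the function and thresholds are illustrative only.

Python

```python
# Sketch of post-upload checks: row counts, min/max range checks,
# and comparing overall totals for numeric fields.
def validate_upload(source_rows, uploaded_rows, max_bytes=10**9):
    errors = []
    if len(uploaded_rows) != len(source_rows):
        errors.append("row count mismatch")
    if any(r["sc_bytes"] < 0 or r["sc_bytes"] > max_bytes
           for r in uploaded_rows):
        errors.append("sc_bytes out of range")
    if (sum(r["sc_bytes"] for r in uploaded_rows)
            != sum(r["sc_bytes"] for r in source_rows)):
        errors.append("sc_bytes total mismatch")
    return errors

source = [{"sc_bytes": 1024}, {"sc_bytes": 2048}]
validate_upload(source, source)      # no errors
validate_upload(source, source[:1])  # detects loss during transmission
```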
Pre-processing data
You may want to perform some pre-processing on the source data before you load it into the cluster.
For example, you may decide to pre-process the data in order to simplify queries or transformations, to
improve performance, or to ensure accuracy of the results. Pre-processing might also be required to
cleanse and validate the data before uploading it, to serialize or compress the data, or to improve
upload efficiency by removing irrelevant or unnecessary rows or values.
Considerations for pre-processing data
Consider the following pre-processing factors when designing your data ingestion processes:
Before you implement a mechanism to pre-process the source data, consider if this processing
could be better handled within your cluster as part of a query, transformation, or workflow.
Many of the data preparation tasks may not be practical, or even possible, when you have very
large volumes of data. They are more likely to be possible when you stream data from your data
sources, or extract it in small blocks on a regular basis. Where you have large volumes of data to
process you will probably perform these preprocessing tasks within your big data solution as the
initial steps in a series of transformations and queries.
You may need to handle data that arrives as a stream. You may choose to convert and buffer
the incoming data so that it can be processed in batches, or consider a real-time stream
processing technology such as Storm (see the section Overview of Storm in the topic Data
processing tools and techniques) or StreamInsight.
You may need to format individual parts of the data by, for example, combining fields in an
address, removing duplicates, converting numeric values to their text representation, or
changing date strings to standard numerical date values.
You may want to perform some automated data validation and cleansing by using a technology
such as SQL Server Data Quality Services before submitting the data to cluster storage. For
example, you might need to convert different versions of the same value into a single leading
value (such as changing NY and Big Apple into New York).
If reference data you need to combine with the source data is not already available as an
appropriately formatted file, you can prepare it for upload and processing using a tool such as
Excel to extract a relatively small volume of tabular data from a data source, reformat it as
required, and save it as a delimited text file. Excel supports a range of data sources, including
relational databases, XML documents, OData feeds, and the Azure Data Market. You can also
use Excel to import a table of data from any website, including an RSS feed. In addition to the
standard Excel data import capabilities, you can use add-ins such as Power Query to import and
transform data from a wide range of sources.
Be careful when removing information from the data; if possible keep a copy of the original
files. You may subsequently find the fields you removed are useful as you refine queries, or if
you use the data for a different analytical task.
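Several of the pre-processing tasks listed above — combining fields, removing duplicates, standardizing date values, and mapping variants to a single leading value — can be sketched as one pass over the source records. The field names and the leading-value mapping here are hypothetical.

```python
import datetime

# Hypothetical mapping of variant values to a single leading value.
LEADING_VALUES = {"NY": "New York", "Big Apple": "New York"}

def preprocess(records):
    """Combine address fields, normalize city names and dates,
    and drop duplicate rows."""
    seen = set()
    cleaned = []
    for rec in records:
        city = LEADING_VALUES.get(rec["city"], rec["city"])
        # Change a date string into a standard numerical date value.
        date = datetime.date.fromisoformat(rec["date"])
        address = f"{rec['street']}, {city}"
        key = (address, date)
        if key in seen:  # remove duplicates
            continue
        seen.add(key)
        cleaned.append({"address": address, "date": date.toordinal()})
    return cleaned

records = [
    {"street": "1 Main St", "city": "NY", "date": "2014-03-01"},
    {"street": "1 Main St", "city": "Big Apple", "date": "2014-03-01"},
    {"street": "9 High St", "city": "Seattle", "date": "2014-03-02"},
]
result = preprocess(records)
```

The second record is a duplicate of the first once the leading-value mapping is applied, so only two records remain after pre-processing.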
Optimized Row Columnar (ORC). This provides a highly efficient way to store Hive data in a way
that was designed to overcome limitations of the other Hive file formats. The ORC format can
improve performance when Hive is reading, writing, and processing data. See ORC File Format
for more information.
Compression can improve the performance of data processing on the cluster by reducing I/O and
network usage for each node in the cluster as it loads the data from storage into memory. However,
compression does increase the processing overhead for each node, and so it cannot be guaranteed to
reduce execution time. Compression is typically carried out using one of the standard algorithms for
which a compression codec is installed by default in Hadoop.
You can combine serialization and compression to achieve optimum performance when you use Avro
because, in addition to serializing the data, you can specify a codec that will compress it.
Tools for Avro serialization and compression
An SDK is available from NuGet that contains classes to help you work with Avro from programs and
tools you create using .NET languages. For more information see Serialize data with the Avro Library on
the Azure website and Apache Avro Documentation on the Apache website. A simple example of using
the Microsoft library for Avro is included in this guide; see Serializing data with the Microsoft .NET
Library for Avro.
To compress the source data if you are not using Avro or another utility that supports compression, you
can usually use the tools provided by the codec supplier. For example, the downloadable libraries for
both GZip and BZip2 include tools that can help you apply compression. For more details see the
distribution sources for GZip and BZip2 on SourceForge.
You can also use the classes in the .NET Framework to perform GZip and DEFLATE compression on your
source files, perhaps by writing command line utilities that are executed as part of an automated upload
and processing sequence. For more details see the GZipStream Class and DeflateStream Class reference
sections on MSDN.
Another alternative is to create a query job that is configured to write output in compressed form using
one of the built-in codecs, and then execute the job against existing uncompressed data in storage so
that it selects all or some part of the source data and writes it back to storage in compressed form. For
an example of using Hive to do this see the Microsoft White Paper Compression in Hadoop.
Compression libraries available in HDInsight
The following table shows the class name of the codecs provided with HDInsight when this guide was
written. The table shows the standard file extension for files compressed with the codec, and whether
the codec supports split file compression and decompression.
Format    Codec                                        Extension   Splittable
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec   .deflate    No
GZip      org.apache.hadoop.io.compress.GzipCodec      .gz         No
BZip2     org.apache.hadoop.io.compress.BZip2Codec     .bz2        Yes
Snappy    org.apache.hadoop.io.compress.SnappyCodec    .snappy     Yes
A codec that supports splittable compression and decompression allows HDInsight to decompress the
data in parallel across multiple mapper and node instances, which typically provides better
performance. However, splittable codecs are less efficient at runtime, so there is a trade-off in
efficiency between the two types.
There is also a difference in the size reduction (compression rate) that each codec can achieve. For the
same data, BZip2 tends to produce a smaller file than GZip but takes longer to perform the
decompression. The Snappy codec works best with container data formats such as Sequence Files or
Avro Data Files. It is fast and typically provides a good compression ratio.
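The difference in compression rate between codecs is easy to observe. This sketch uses Python's standard gzip and bz2 modules (which implement the same algorithms as the Hadoop GZip and BZip2 codecs) on a deliberately repetitive sample payload; actual ratios and timings depend entirely on your data.

```python
import bz2
import gzip

# A highly compressible sample payload; real gains depend on the data.
data = b"station,temperature\n" * 10_000

gz = gzip.compress(data)
bz = bz2.compress(data)

# Both codecs round-trip the data; size and speed differ by codec.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
ratio_gz = len(gz) / len(data)
ratio_bz = len(bz) / len(data)
```

Measuring both ratios on a representative sample of your own data is a quick way to choose between a smaller file (typically BZip2) and faster decompression (typically GZip).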
Considerations for serialization and compression
Consider the following points when you are deciding whether to compress the source data:
Compression may not produce any improvement in performance with small files. However, with
very large files (for example, files over 100 GB) compression is likely to provide dramatic
improvement. The gains in performance also depend on the contents of the file and the level of
compression that was achieved.
When optimizing a job, enable compression within the process using the configuration settings
to compress the output of the mappers and the reducers before experimenting with
compression of the source data. Compression within the job stages often provides a more
substantial gain in performance compared to compressing the source data.
Consider using a splittable algorithm for very large files so that they can be decompressed in
parallel by multiple tasks.
Ensure that the format you choose is compatible with the processing tools you intend to use.
For example, ensure the format is compatible with Hive and Pig if you intend to use these to
query your data.
Use the default file extension for the files if possible. This allows HDInsight to detect the file
type and automatically apply the correct decompression algorithm. If you use a different file
extension you must set the io.compression.codec property for the job to indicate the codec
used.
If you are serializing the source data using Avro, you can apply a codec to the process so that
the serialized data is also compressed.
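The relationship between file extension and codec described above can be sketched as a simple lookup based on the table earlier in this topic. This is an illustrative helper, not part of any HDInsight API; if no extension matches, the io.compression.codec property must be set on the job instead.

```python
# Standard file extensions mapped to Hadoop codec class names
# (taken from the codec table earlier in this topic).
CODECS = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",
    ".snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}

def codec_for(filename):
    """Return the codec class implied by a file's extension, or None
    if the codec must be specified explicitly for the job."""
    for ext, codec in CODECS.items():
        if filename.endswith(ext):
            return codec
    return None

codec = codec_for("logs/2014-03-01.bz2")
```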
For more information about using the compression codecs in Hadoop see the documentation for the
CompressionCodecFactory and CompressionCodec classes on the Apache website.
A UI-based tool such as CloudBerry Explorer, Microsoft Azure Storage Explorer, or Server
Explorer in Visual Studio. For a useful list of third party tools for uploading data to HDInsight
interactively see Upload data for Hadoop jobs in HDInsight on the Azure website.
PowerShell commands that take advantage of the PowerShell cmdlets for Azure. This capability
is useful if you are just experimenting or working on a proof of concept.
The hadoop dfs -copyFromLocal [source] [destination] command at the Hadoop command line
using a remote desktop connection.
A command line tool such as AzCopy if you need to upload large files.
Consider how you will handle very large volumes of data. While small volumes of data can be copied
into storage interactively, you will need to choose or build a more robust mechanism capable of
handling large files when you move beyond the experimentation stage.
Microsoft StreamInsight. This is a complex event processing (CEP) engine with a framework API
for building applications that consume and process event streams. It can be run on-premises or
in a virtual machine. For more information about developing StreamInsight applications, see
Microsoft StreamInsight on MSDN.
Apache Storm. This is an open-source framework that can run on a Hadoop cluster to capture
streaming data. It uses other Hadoop-related technologies such as Zookeeper to manage the
data ingestion process. See the section Overview of Storm in the topic Data processing tools
and techniques and Apache Storm on the Hortonworks website for more information.
Other open source frameworks such as Kafka and Samza. These frameworks provide
capabilities to capture streaming data and process it in real time, including persisting the data
or messages as files for batch processing when required.
A custom event or stream capture solution that feeds the data into the cluster data store in real
time or in batches. The interval should be based on the frequency that related query jobs will be
instantiated. You could use the Reactive Extensions (Rx) library to implement a real-time stream
capture utility.
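The buffering approach described above — accumulating incoming events so they can be processed in batches — can be sketched as a simple generator. This is a minimal illustration of the idea, not a substitute for a stream processing technology such as Storm or StreamInsight.

```python
def batch_stream(events, batch_size):
    """Buffer a stream of events into fixed-size batches for upload."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each yielded batch could be written as one file and uploaded
# to cluster storage for batch processing.
batches = list(batch_stream(range(10), batch_size=4))
```

A production version would typically also flush on a time interval, so that a slow stream does not delay data indefinitely.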
As an alternative to using Flume, you can use SSIS to implement an automated batch upload solution.
For more details of using SQL Server Integration Services (SSIS) see Scenario 4: BI integration and
Appendix A - Tools and technologies reference.
You can also use the AzCopy utility in scripts to automate uploading data to HDInsight. For more details
see AzCopy Uploading/Downloading files for Windows Azure Blobs on the Azure storage team blog. In
addition, a library called Casablanca can be used to access Azure storage from native C++ code. For more
details see Announcing Casablanca, a Native Library to Access the Cloud From C++.
Considerations
Consider the following factors when designing your automated data ingestion processes:
Consider how much effort is required to create an automated upload solution, and balance this
with the advantages it provides. If you are simply experimenting with data in an iterative
scenario, you may not need an automated solution. Creating automated processes to upload
data is probably worthwhile only when you will repeat the operation on a regular basis, or when
you need to integrate big data processing into a business application.
When creating custom tools or scripts to upload data to a cluster, consider including the ability
to accept command-line parameters so that the tools can be used in a range of automation
processes.
Consider how you will protect the data, the cluster, and the solution as a whole from
inappropriate use of custom upload tools and applications. It may be possible to set permissions
on tools, files, folders, and other resources to restrict access to only authorized users.
PowerShell is a good solution for uploading data files in scenarios where users are exploring
data iteratively and need a simple, repeatable way to upload source data for processing. You
can also use PowerShell as part of an automated processing solution in which data is uploaded
automatically by a scheduled operating system task or SQL Server Integration Services package.
.NET Framework code that uses the .NET SDK for HDInsight can be used to upload data for
processing by HDInsight jobs. This may be a better choice than using PowerShell for large
volumes of data.
In addition to the HDInsight-specific APIs for uploading data to the cluster, the more general
Azure Storage API offers greater flexibility by allowing you to upload data directly to Azure blob
storage as files, or write data directly to blobs in an Azure blob storage container. This enables
you to build client applications that capture real-time data and write it directly to a blob for
processing in HDInsight without first storing the data in local files.
Other tools and frameworks are available that can help you to build data ingestion mechanisms.
For example, Falcon provides an automatable system for data replication, data lifecycle
management (such as data eviction), data lineage and tracing, and process coordination and
scheduling based on a declarative programming model.
More information
For information about creating end-to-end automated solutions that include automated upload stages,
see Building end-to-end solutions using HDInsight.
For more details of the tools and technologies available for automating upload processes see Appendix
A - Tools and technologies reference.
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the Codeplex website.
Uploading data with Windows PowerShell
The Azure module for Windows PowerShell includes a range of cmdlets that you can use to work with
Azure services programmatically, including Azure storage. You can run PowerShell scripts interactively in
a Windows command line window or in a PowerShell-specific command line console. Additionally, you
can edit and run PowerShell scripts in the Windows PowerShell Interactive Scripting Environment (ISE),
which provides IntelliSense and other user interface enhancements that make it easier to write
PowerShell code. You can schedule the execution of PowerShell scripts using Windows Scheduler, SQL
Server Agent, or other tools as described in Building end-to-end solutions using HDInsight.
Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
To upload data files to the Azure blob store, you can use the Set-AzureStorageBlobContent cmdlet, as
shown in the following code example.
Windows PowerShell
# Azure subscription-specific variables.
$storageAccountName = "storage-account-name"
$containerName = "container-name"
# Find the local folder where this PowerShell script is stored.
$currentLocation = Get-Location
$thisfolder = Split-Path -Parent $currentLocation
# Upload files in data subfolder to Azure.
$localFolder = "$thisfolder\data"
$destfolder = "data"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$files = Get-ChildItem $localFolder
foreach ($file in $files)
{
$fileName = "$localFolder\$file"
$blobName = "$destfolder/$file"
write-host "copying $fileName to $blobName"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob
$blobName -Context $blobContext -Force
}
write-host "All files in $localFolder uploaded to $containerName!"
Note that the code uses the New-AzureStorageContext cmdlet to create a context for the Azure storage
account where the files are to be uploaded. This context requires the access key for the storage account,
which is obtained using the Get-AzureStorageKey cmdlet. Authentication to obtain the key is based on
the credentials or certificate used to connect the local PowerShell environment with the Azure
subscription.
The code shown above also iterates over all of the files to be uploaded and uses the Set-AzureStorageBlobContent cmdlet to upload each one in turn. It does this in order to store each one in a
specific path that includes the destination folder name. If all of the files you need to upload are in a
folder structure that is the same as the required target paths, you could use the following code to
upload all of the files in one operation instead of iterating over them in your PowerShell script.
Windows PowerShell
cd [root-data-folder]
ls -Recurse -Path $localFolder | Set-AzureStorageBlobContent -Container
$containerName -Context $blobContext
C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
namespace DataUploader
{
    class Program
    {
        static void Main(string[] args)
        {
            UploadFiles().Wait();
            Console.WriteLine("Upload complete!");
            Console.WriteLine("Press a key to end");
            Console.Read();
        }

        private static async Task UploadFiles()
        {
            var localDir = new DirectoryInfo(@".\data");
            var hdInsightUser = "user-name";
            var storageName = "storage-account-name";
            var storageKey = "storage-account-key";
            var containerName = "container-name";
            var blobDir = "/data/";
            var hdfsClient = new WebHDFSClient(hdInsightUser,
                new BlobStorageAdapter(storageName, storageKey, containerName, false));
            await hdfsClient.DeleteDirectory(blobDir);
            foreach (var file in localDir.GetFiles())
            {
                Console.WriteLine("Uploading " + file.Name + " to " + blobDir + file.Name + " ...");
                await hdfsClient.CreateFile(file.FullName, blobDir + file.Name);
            }
        }
    }
}
Note that the code uses the DeleteDirectory method to delete all existing blobs in the specified path,
and then uses the CreateFile method to upload each file in the local data folder. All of the methods
provided by the WebHDFSClient class are asynchronous, enabling you to upload large volumes of data
to Azure without blocking the client application.
Uploading data with the Azure Storage SDK
The .NET Azure Storage Client, part of the Azure Storage library available from NuGet, offers a flexible
mechanism for uploading data to the Azure blob store as files or writing data directly to blobs in an
Azure blob storage container, including writing streams of data directly to Azure storage without first
storing the data in local files.
The following code example shows how you can use the .NET Azure Storage Client to write data in a
stream directly to a blob in Azure storage. The example is deliberately kept simple by including the
credentials in the code so that you can copy and paste it while you are experimenting with HDInsight. In
a production system you must protect credentials, as described in Securing credentials in scripts and
applications in the Security section of this guide.
C#
using System;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;
namespace AzureBlobClient
{
    class Program
    {
        const string AZURE_STORE_CONN_STR = "DefaultEndpointsProtocol=https;"
            + "AccountName=storage-account-name;AccountKey=storage-account-key";

        static void Main(string[] args)
        {
            Stream Observations = GetData();
            CloudStorageAccount storageAccount =
                CloudStorageAccount.Parse(AZURE_STORE_CONN_STR);
            CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
            CloudBlobContainer container = blobClient.GetContainerReference("container-name");
            var blob = container.GetBlockBlobReference("data/weather.txt");
            blob.UploadFromStreamAsync(Observations).Wait();
            Console.WriteLine("Data Uploaded!");
            Console.WriteLine("Press a key to end");
            Console.Read();
        }

        static Stream GetData()
        {
            // code to retrieve data as a stream
        }
    }
}
For information about the features of the Azure Storage Client libraries, see What's new for Microsoft
Azure Storage at TechEd 2014.
Serializing data with the Microsoft .NET Library for Avro
The .NET Library for Avro is a component of the .NET SDK for HDInsight that you can use to serialize and
deserialize data using the Avro serialization format. Avro enables you to include schema metadata in a
data file, and is widely used in Hadoop (including in HDInsight) as a language-neutral means of
exchanging complex data structures between operations.
For example, consider a weather monitoring application that records meteorological observations. In
the application, each observation can be represented as an object with properties that contain the
specific data values for the observation. These properties might be simple values such as the date, the
time, the wind speed, and the temperature. However, some values might be complex structures such as
the geo-coded location of the monitoring station, which contains longitude and latitude coordinates.
The following code example shows how a list of weather observations in this complex data structure can
be serialized in Avro format and uploaded to Azure storage. The example is deliberately kept simple by
including the credentials in the code so that you can copy and paste it while you are experimenting with
HDInsight. In a production system you must protect credentials, as described in Securing credentials in
scripts and applications in the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Runtime.Serialization;
using System.Configuration;
using Microsoft.Hadoop.Avro.Container;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
namespace AvroClient
{
// Class representing a weather observation.
[DataContract(Name = "Observation", Namespace = "WeatherData")]
internal class Observation
{
[DataMember(Name = "obs_date")]
public DateTime Date { get; set; }
[DataMember(Name = "obs_time")]
The class Observation used to represent a weather observation, and the struct GeoLocation used to
represent a geographical location, include metadata to describe the schema. This schema information is
included in the serialized file that is uploaded to Azure storage, enabling an HDInsight process such as a
Pig job to deserialize the data into an appropriate data structure. Notice also that the data is
compressed using the Deflate codec as it is serialized, reducing the size of the file to be uploaded.
Using the Azure management portal to create and delete the cluster interactively. For more
information see Manage Hadoop clusters in HDInsight using the Azure Management Portal.
Using Windows PowerShell scripts to automate provisioning and deletion of clusters. For more
information see Automating cluster management with PowerShell.
Using the SDK for HDInsight to integrate cluster management into a .NET Framework
application. For more information see Automating cluster management in a .NET application.
The correct approach to cluster provisioning depends on the specific business requirements and
constraints, but the following table describes typical approaches in relation to the common big data use
cases and models discussed in this guide.
Use case: Iterative data exploration
Creating and deleting the cluster manually when required through the Azure management portal may be acceptable for data exploration scenarios where data processing and analysis is performed interactively on an occasional basis by a dedicated team of data analysts. However, if the analysis is more frequent the analysts might benefit from creating a simple script or command line utility to automate the process of creating and deleting the cluster.

Use case: Data warehouse on demand
Data warehouses built on HDInsight are usually based on Hive tables, and the cluster must be running to service Hive queries. If the data warehouse is queried directly by users and applications, you may need to keep the cluster running continually. However, if the data warehouse is used only as a data source for analytical data models (for example, in SQL Server Analysis Services or PowerPivot workbooks) or for cached reports, you can create the cluster on demand to enable new data to be processed, refresh the dependent data models and reports, and then delete the cluster.

Use case: ETL automation
When HDInsight is used to filter and shape data in an ETL process, the destination of the transformed data is usually another data store such as a SQL Server database. Depending on the frequency of the ETL cycle, you may choose to include provisioning and deletion of the cluster in the ETL process itself. In this case, cluster creation and deletion are likely to be automated along with data ingestion, job execution, and the data transfer tasks of the ETL workflow.

Use case: BI integration
Considerations
When planning how you will create a cluster for your solution, also consider the following points:
As part of the cluster provision process you may also need to create or manage storage
accounts. Often you will do this only once, and use the storage account each time you run your
automated solution. For more information see Cluster and storage initialization.
You should set all the properties for your cluster when you create it, using the techniques
described in this section of the guide. This ensures that the configuration is fixed in the cluster
definition, and will be reapplied to any virtual servers that make up the cluster if they are
automatically restarted after a failure or an upgrade. Virtual server management within the
datacenter may occur at any time, and you cannot control this. If you edit the configuration files
directly, any changes will be lost when a server restarts. However, you can change some cluster
properties for individual jobs; see Configuring and debugging solutions for details.
Be careful how and when you delete a cluster as part of an automated solution. You may need
to implement a task that backs up the data and/or the metadata first. Ensure tools that allow
users to delete clusters perform user authentication and authorization to protect against
accidental and malicious use.
More information
For information about creating end-to-end automated solutions that include automated cluster
management stages, see Building end-to-end solutions using HDInsight.
For more details of the tools and technologies available for automating cluster management see
Appendix A - Tools and technologies reference.
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.
The topic Provision HDInsight clusters on the Azure website shows several ways that you can provision a
cluster.
Automating cluster management with PowerShell
You can use Windows PowerShell to create an HDInsight cluster by executing PowerShell commands
interactively, or by creating a PowerShell script that can be executed when required.
Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
Creating a cluster with the default configuration
When using PowerShell to create an HDInsight cluster, you use the New-AzureHDInsightCluster cmdlet
and specify the following configuration settings to create a cluster with the default settings for Hadoop
services:
If you do not intend to use an existing Azure storage account, you can create a new one with a globally
unique name using the New-AzureStorageAccount cmdlet, and then create a new blob container with
the New-AzureStorageContainer cmdlet. Many Azure services require a globally unique name. You can
determine if a specific name is already in use by an Azure service by using the Test-AzureName cmdlet.
The following code example creates an Azure storage account and an HDInsight cluster in the Southeast
Asia region (note that each command should be on a single, unbroken line). The example is deliberately
kept simple by including the credentials in the script so that you can copy and paste the code while you
are experimenting with HDInsight. In a production system you must protect credentials, as described in
Securing credentials in scripts and applications in the Security section of this guide.
Windows PowerShell
$storageAccountName = "unique-storage-account-name"
$containerName = "container-name"
$clusterName = "unique-cluster-name"
$userName = "user-name"
$password = ConvertTo-SecureString "password" -AsPlainText -Force
$location = "Southeast Asia"
$clusterNodes = 4
# Create a storage account.
Write-Host "Creating storage account..."
New-AzureStorageAccount -StorageAccountName $storageAccountName -Location $location
$storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
# Create a Blob storage container.
Write-Host "Creating container..."
New-AzureStorageContainer -Name $containerName -Context $destContext
# Create a cluster.
Write-Host "Creating HDInsight cluster..."
$credential = New-Object System.Management.Automation.PSCredential ($userName,
$password)
New-AzureHDInsightCluster -Name $clusterName -Location $location
-DefaultStorageAccountName "$storageAccountName.blob.core.windows.net"
-DefaultStorageAccountKey $storageAccountKey -DefaultStorageContainerName
$containerName
-ClusterSizeInNodes $clusterNodes -Credential $credential -Version 3.0
Write-Host "Finished!"
Notice that this script uses the ConvertTo-SecureString cmdlet to encrypt the password in memory.
The password and the user name are passed to the New-Object cmdlet to create a PSCredential object
for the cluster credentials. Notice also that the access key for the storage account is obtained using the
Get-AzureStorageKey cmdlet.
Add-AzureHDInsightStorage: Specify an additional storage account that the cluster can use.
Add-AzureHDInsightMetastore: Specify a custom Azure SQL Database instance to host Hive and
Oozie metadata.
After you have added the required configuration settings, you can pass the cluster configuration variable
returned by New-AzureHDInsightClusterConfig to the New-AzureHDInsightCluster cmdlet to create the
cluster.
You can also specify a folder to store shared libraries and upload these so that they are available for use
in HDInsight jobs. Examples include UDFs for Hive and Pig, or custom SerDe components for use in Avro.
For more information see the section Create cluster with custom Hadoop configuration values and
shared libraries in the topic Microsoft .NET SDK For Hadoop on the CodePlex website.
For more information about using PowerShell to manage an HDInsight cluster see the HDInsight
PowerShell Cmdlets Reference Documentation.
Deleting a cluster
When you have finished with the cluster you can use the Remove-AzureHDInsightCluster cmdlet to
delete it. If you are also finished with the storage account, you can delete it after the cluster has been
deleted by using the Remove-AzureStorageAccount cmdlet.
The following code example shows a PowerShell script to delete an HDInsight cluster and the storage
account it was using.
Windows PowerShell
$storageAccountName = "storage-account-name"
$clusterName = "cluster-name"
# Delete HDInsight cluster.
Write-Host "Deleting $clusterName HDInsight cluster..."
Remove-AzureHDInsightCluster -Name $clusterName
# Delete storage account.
Write-Host "Deleting $storageAccountName storage account..."
Remove-AzureStorageAccount -StorageAccountName $storageAccountName
Use the makecert command in a Visual Studio command line to create a certificate and upload
it to your subscription in the Azure management portal as described in Create and Upload a
Management Certificate for Azure.
After you have created and installed your certificate, it will be stored in the Personal certificate store on
your computer. You can view the details by using the certmgr.msc console.
To create a cluster programmatically, you must create an instance of the ClusterCreateParameters class,
specifying the following information:
After you have created the initial ClusterCreateParameters class, you can optionally customize the
default HDInsight configuration settings by using the following properties:
HiveMetastore: Specify a custom Azure SQL Database instance in which to store Hive metadata.
OozieMetastore: Specify a custom Azure SQL Database instance in which to store Oozie
metadata.
When you are ready to create the cluster, you must use a locally stored Azure management certificate
to create an HDInsightCertificateCredential object and then use this object with the HDInsightClient
static class to connect to Azure and create a client object based on the IHDInsightClient interface. The
IHDInsightClient interface provides the CreateCluster method that you can use to create an HDInsight
cluster synchronously, and a CreateClusterAsync method you can use to create the cluster
asynchronously.
The following code example shows a simple console application that creates an HDInsight cluster using
an existing Azure storage account and container. The example is deliberately kept simple by including
the credentials in the code so that you can copy and paste it while you are experimenting with
HDInsight. In a production system you must protect credentials, as described in Securing credentials in
scripts and applications in the Security section of this guide.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.WindowsAzure.Management.HDInsight.ClusterProvisioning;
namespace ClusterMgmt
{
class Program
{
static void Main(string[] args)
{
string subscriptionId = "subscription-id";
string certFriendlyName = "certificate-friendly-name";
string clusterName = "unique-cluster-name";
string storageAccountName = "storage-account-name";
string storageAccountKey = "storage-account-key";
string containerName = "container-name";
string userName = "user-name";
string password = "password";
string location = "Southeast Asia";
int clusterSize = 4;
// Get the certificate object from certificate store
// using the friendly name to identify it.
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly);
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
.First(item => item.FriendlyName == certFriendlyName);
// Create an HDInsightClient object.
HDInsightCertificateCredential creds = new HDInsightCertificateCredential(new
Guid(subscriptionId), cert);
var client = HDInsightClient.Connect(creds);
// Supply cluster information.
ClusterCreateParameters clusterInfo = new ClusterCreateParameters()
{
Name = clusterName,
Location = location,
DefaultStorageAccountName = storageAccountName + ".blob.core.windows.net",
DefaultStorageAccountKey = storageAccountKey,
DefaultStorageContainer = containerName,
UserName = userName,
Password = password,
ClusterSizeInNodes = clusterSize,
Version = "3.0"
};
// Create the cluster.
Console.WriteLine("Creating the HDInsight cluster ...");
ClusterDetails cluster = client.CreateCluster(clusterInfo);
}
}
}
Note that this example uses a pre-existing Azure storage account and container, which must be hosted
in the same geographical region as the cluster (in this case, Southeast Asia).
To delete a cluster you can use the DeleteCluster method of the HDInsightClient class.
For more information about using the .NET SDK for HDInsight to provision and delete HDInsight clusters
see HDInsight SDK Reference Documentation.
Storm is a real-time data processing framework that is designed to handle streaming data.
These technologies can be used for a wide variety of tasks, and many of them can be easily combined
into multi-step workflows by using Oozie.
HBase is a database management system that can provide scalability for storing vast amounts of data,
support for real-time querying, consistent reads and writes, automatic and configurable sharding of
tables, and high reliability with automatic failover. For more information see Data storage in the topic
Specifying the infrastructure.
Meaningful. The values in the results, when combined and analyzed, relate to one another in a
meaningful way.
Accurate. The results appear to be correct, or are within the expected range.
Useful. The results are applicable to the business decision they will support, and provide
relevant metrics that help inform the decision making process.
You will often need to employ the services of a business user who intimately understands the business
context for the data to perform the role of a data steward and sanity check the results to determine
whether or not they fall within expected parameters. It may not be possible to validate all of the source
data for a query, especially if it is collected from external sources such as social media sites. However,
depending on the complexity of the processing, you might decide to select a number of data inputs for
spot-checking and trace them through the process to ensure that they produce the expected outcome.
When you are planning to use HDInsight to perform predictive analysis, it can be useful to evaluate the
process against known values. For example, if your goal is to use demographic and historical sales data
to determine the likely revenue for a proposed retail store, you can validate the processing model by
using appropriate source data to predict revenue for an existing store and compare the resulting
prediction to the actual revenue value. If the results of the data processing you have implemented vary
significantly from the actual revenue, then it seems unlikely that the results for the proposed store will
be reliable.
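This kind of back-testing reduces to a simple comparison. The sketch below assumes a hypothetical 15 percent tolerance and made-up revenue figures; the appropriate threshold depends on your business context.

```python
def within_tolerance(predicted, actual, tolerance=0.15):
    """Return True when the prediction falls within the given relative
    tolerance of the known actual value (hypothetical 15% default)."""
    if actual == 0:
        return predicted == 0
    return abs(predicted - actual) / abs(actual) <= tolerance

# Validate the model against an existing store whose revenue is known.
print(within_tolerance(predicted=1_080_000, actual=1_000_000))  # True
print(within_tolerance(predicted=2_000_000, actual=1_000_000))  # False
```

A prediction that fails this check for a store with known revenue suggests the model should not be trusted for the proposed store either.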
Considerations
Consider the following points when designing and developing data processing solutions:
Big data frameworks offer a huge range of tools that you can use with the Hadoop core engine,
and choosing the most appropriate can be difficult. Azure HDInsight simplifies the process
because all of the tools it includes are guaranteed to be compatible and work correctly
together. This doesn't mean you can't incorporate other tools and frameworks in your solution.
Of the query and transformation applications, Hive is the most popular. However, many
HDInsight processing solutions are actually incremental in nature; they consist of multiple
queries, each operating on the output of the previous one. These queries may use different
query applications. For example, you might first use a custom map/reduce job to summarize a
large volume of unstructured data, and then create a Pig script to restructure and group the
data values produced by the initial map/reduce job. Finally, you might create Hive tables based
on the output of the Pig script so that client applications such as Excel can easily consume the
results.
If you decide to use a resource-intensive application such as HBase or Storm, you should
consider running it on a separate cluster from your Hadoop-based big data batch processing
solution to avoid contention and consequent loss of performance for the application and your
solution as a whole.
The challenges don't end with simply writing and running a job. As in any data processing
scenario, it's vitally important to check that the results generated by queries are realistic, valid,
and useful before you invest a lot of time and effort (and cost) in developing and extending your
solution. A common use of HDInsight is simply to experiment with data to see if it can offer
insights into previously undiscovered information. As with any investigational or experimental
process, you need to be convinced that each stage is producing results that are both valid
(otherwise you gain nothing from the answers) and useful (in order to justify the cost and
effort).
Unless you are simply experimenting with data to find the appropriate questions to ask, you will
want to automate some or all of the tasks and be able to run the solution from a remote
computer. For more information see Building custom clients and Building end-to-end solutions
using HDInsight.
Security is a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include
securing data while in transit over the network, securing data in storage, and authenticating and
authorizing users who have access to the tools and utilities you use as part of your process. For
more details of how you can maximize security of your HDInsight solutions see the topic
Security in the section Building end-to-end solutions using HDInsight.
More information
For more information about HDInsight, see the Microsoft Azure HDInsight web page.
A central point for TechNet articles about HDInsight is HDInsight Services For Windows.
For examples of how you can use HDInsight, see the following tutorials on the HDInsight website:
Overview of Hive
Overview of Pig
User-defined functions
Overview of HCatalog
Overview of Mahout
Overview of Storm
Overview of Hive
Hive is an abstraction layer over the Hadoop query engine that provides a query language called HiveQL,
which is syntactically very similar to SQL and supports the ability to create tables of data that can be
accessed remotely through an ODBC connection.
In effect, Hive enables you to create an interface to your data that can be used in a similar way to a
traditional relational database. Business users can use familiar tools such as Excel and SQL Server
Reporting Services to consume data from HDInsight in a similar way as they would from a database
system such as SQL Server. Installing the ODBC driver for Hive on a client computer enables users to
connect to an HDInsight cluster and submit HiveQL queries that return data to an Excel worksheet, or to
any other client that can consume results through ODBC. HiveQL also allows you to plug in custom
mappers and reducers to perform more sophisticated processing.
Hive is a good choice for data processing when:
You want to process large volumes of immutable data to perform summarization, ad hoc
queries, and analysis.
The source data has some identifiable structure, and can easily be mapped to a tabular schema.
You want to create a layer of tables through which business users can easily query source data,
and data generated by previously executed map/reduce jobs or Pig scripts.
You want to experiment with different schemas for the table format of the output.
The processing you need to perform can be expressed effectively as HiveQL queries.
The latest versions of HDInsight incorporate a technology called Tez, part of the Stinger initiative for
Hadoop, that vastly increases the performance of Hive. For more details see Stinger: Interactive Query
for Hive on Hortonworks website.
If you are not familiar with Hive, a basic introduction to using HiveQL can be found in the topic
Processing data with Hive. You can also experiment with Hive by executing HiveQL statements in the
Hive Editor page of the HDInsight management portal. See Monitoring and logging for more details.
Overview of Pig
Pig is a query interface that provides a workflow semantic for processing data in HDInsight. Pig enables
you to perform complex processing of your source data to generate output that is useful for analysis and
reporting.
Pig statements are expressed in a language named Pig Latin, and generally involve defining relations
that contain data, either loaded from a source file or as the result of a Pig Latin expression on an existing
relation. Relations can be thought of as result sets, and can be based on a schema (which you define in
the Pig Latin statement used to create the relation) or can be completely unstructured.
Pig is a good choice when you need to:
Restructure source data by defining columns, grouping values, or converting columns to rows.
Perform data transformations such as merging and filtering data sets, and applying functions to
all or subsets of records.
If you are not familiar with Pig, a basic introduction to using Pig Latin can be found in the topic
Processing data with Pig.
Custom map/reduce code is a good choice when:
You want to process data that is completely unstructured by parsing it and using custom logic in
order to obtain structured information from it.
You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive
without resorting to creating a UDF. For example, you might need to use an external geocoding
service to convert latitude and longitude coordinates or IP addresses in the source data to
geographical location names.
You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components.
You can do this using the Hadoop streaming interface.
If you are not familiar with writing map/reduce components, a basic introduction and information about
using Hadoop streaming can be found in the topic Writing map/reduce code.
User-defined functions
Developers often find that they reuse the same code in several locations, and the typical way to
optimize this is to create a user-defined function (UDF) that can be imported into other projects when
required. Often a series of UDFs that accomplish related functions are packaged together in a library so
that the library can be imported into a project. Hive and Pig can take advantage of any of the UDFs it
contains. For more information see User-defined functions.
Overview of HCatalog
Technologies such as Hive, Pig, and custom map/reduce code can be used to process data in an
HDInsight cluster. In each case you use code to project a schema onto data that is stored in a particular
location, and then apply the required logic to filter, transform, summarize, or otherwise process the
data to generate the required results.
The code must load the source data from wherever it is stored, and convert it from its current format to
the required schema. This means that each script must include assumptions about the location and
format of the source data. These assumptions create dependencies that can cause your scripts to break
if an administrator chooses to change the location, format, or schema of the source data.
Additionally, each processing interface (Hive, Pig, or custom map/reduce) requires its own definition of
the source data, and so complex data processes that involve multiple steps in different interfaces
require consistent definitions of the data to be maintained across all of the scripts.
HCatalog provides a tabular abstraction layer that helps unify the way that data is interpreted across
processing interfaces, and provides a consistent way for data to be loaded and stored, regardless of the
specific processing interface being used. This abstraction exposes a relational view over the data,
including support for partitions.
The following factors will help you decide whether to incorporate HCatalog in your HDInsight solution:
It makes it easy to abstract the data storage location, format, and schema from the code used
to process it.
It minimizes fragile dependencies between scripts in complex data processing solutions where
the same data is processed by multiple tasks.
It enables notification of data availability, making it easier to write applications that perform
multiple jobs.
It is easy to incorporate into solutions that include Hive and Pig scripts, requiring very little extra
code. However, if you use only Hive scripts and queries, or you are creating a one-shot solution
for experimentation purposes and do not intend to use it again, HCatalog is unlikely to provide
any benefit.
Files in JSON, SequenceFile, CSV, and RC format can be read and written by default, and a
custom serializer/deserializer component (SerDe) can be used to read and write files in other
formats (see SerDe on the Apache wiki for more details).
Additional effort is required to use HCatalog in custom map/reduce components because you
must create your own custom load and store functions.
For more information, see Unifying and stabilizing jobs with HCatalog.
Overview of Mahout
Mahout is a data mining library that you can use to examine data files in order to extract specific
types of information. It provides an implementation of several machine learning algorithms, and is
typically used with source data files containing relationships between the items of interest in a data
processing solution. For example, it can use a data file containing the similarities between different
movies and TV shows to create a list of recommendations for customers based on items they have
already viewed or purchased. The source data could be obtained from a third party, or generated and
updated by your application based on purchases made by other customers.
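Mahout runs such algorithms at scale across the cluster, but the core idea of item-based recommendation can be sketched in miniature. The similarity scores and item names below are invented for illustration.

```python
# Hypothetical item-to-item similarity scores, of the kind a Mahout
# job might compute from co-viewing data.
similarity = {
    ("MovieA", "MovieB"): 0.9,
    ("MovieA", "MovieC"): 0.2,
    ("MovieB", "MovieC"): 0.6,
}

def sim(a, b):
    # Similarity is symmetric, so look the pair up in either order.
    return similarity.get((a, b), similarity.get((b, a), 0.0))

def recommend(viewed, catalog):
    """Rank unviewed items by their strongest similarity to any viewed item."""
    candidates = [item for item in catalog if item not in viewed]
    return sorted(candidates,
                  key=lambda c: max(sim(c, v) for v in viewed),
                  reverse=True)

print(recommend({"MovieA"}, ["MovieA", "MovieB", "MovieC"]))
# ['MovieB', 'MovieC']
```

A viewer of MovieA is recommended MovieB (similarity 0.9) ahead of MovieC (0.2).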
Mahout queries are typically executed as a separate process, perhaps based on a schedule, to update
the results. These results are usually stored as a file within the cluster storage, though they may be
exported to a database or to visualization tools. Mahout can also be executed as part of a workflow.
However, it is a batch-based process that may take some time to execute with large source datasets.
Mahout is a good choice when you need to:
Apply clustering algorithms to group documents or data items that contain similar content.
Apply recommendation mining algorithms to discover users' preferences from their behavior.
Apply classification algorithms to assign new documents or data items to a category based on
the existing categorizations.
Perform frequent data mining operations based on the most recent data.
Storm is a good choice when you need to:
Handle huge volumes of data or messages that arrive at a very high rate.
Filter and sort incoming stream data for storing in separate files, repositories, or database
tables.
Examine the data stream in real time, perhaps to raise alerts for out-of-band values or specific
combinations of events, before analyzing it later using one of the batch-oriented query
mechanisms such as Hive or Pig.
For more information, see the Tutorial on the Storm documentation website.
In addition to its more usual use as a querying mechanism, Hive can be used to create a simple data
warehouse containing table definitions applied to data that you have already processed into the
appropriate format. Azure storage is relatively inexpensive, and so this is a good way to create a
commodity storage system when you have huge volumes of data. An example of this can be found in
Scenario 2: Data warehouse on demand.
Creating tables with Hive
You create tables by using the HiveQL CREATE TABLE statement, which in its simplest form looks similar
to the equivalent statement in Transact-SQL. You specify the schema in the form of a series of column
names and types, and the type of delimiter that Hive will use to delineate each column value as it parses
the data. You can also specify the format for the files in which the table data will be stored if you do not
want to use the default format (where data files are delimited by an ASCII code 1 (Octal \001) character,
equivalent to Ctrl + A). For example, the following code creates a table named mytable and specifies
that the data files for the table should be tab-delimited.
HiveQL
CREATE TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
You can also create a table and populate it as one operation by using a CREATE TABLE statement that
includes a SELECT statement to query an existing table, as described later in this topic.
Hive supports a sufficiently wide range of data types to suit almost any requirement. The primitive data
types you can use for columns in a Hive table are TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT,
DOUBLE, STRING, BINARY, DATE, TIMESTAMP, CHAR, VARCHAR, DECIMAL (though the last five of these
are not available in older versions of Hive). In addition to these primitive types you can define columns
as ARRAY, MAP, STRUCT, and UNIONTYPE. For more information see Hive Data Types in the Apache Hive
language manual.
Managing Hive table data location and lifetime
Hive tables are simply metadata definitions imposed on data in underlying files. By default, Hive stores
table data in the user/hive/warehouse/table_name path in storage (the default path is defined in the
configuration property hive.metastore.warehouse.dir), so the previous code sample will create the
table metadata definition and an empty folder at user/hive/warehouse/mytable. When you delete the
table by executing the DROP TABLE statement, Hive will delete the metadata definition from the Hive
database and it will also remove the user/hive/warehouse/mytable folder and its contents.
Table and column names are case-sensitive so, for example, the table named MyTable is not the same
as the table mytable.
However, you can specify an alternative path for a table by including the LOCATION clause in the
CREATE TABLE statement. The ability to specify a non-default location for the table data is useful when
you want to enable other applications or users to access the files outside of Hive. This allows data to be
loaded into a Hive table simply by copying data files of the appropriate format into the folder, or
downloaded directly from storage. When the table is queried using Hive, the schema defined in its
metadata is automatically applied to the data in the files.
An additional benefit of specifying the location is that this makes it easy to create a table for data that
already exists in that location (perhaps the output from a previously executed map/reduce job or Pig
script). After creating the table, the existing data in the folder can be retrieved immediately with a
HiveQL query.
However, one consideration for using managed tables is that, when the table is deleted, the folder it
references will also be deleted, even if it already contained other data files when the table was created.
If you want to manage the lifetime of the folder containing the data files separately from the lifetime of
the table, you must use the EXTERNAL keyword in the CREATE TABLE statement to indicate that the
folder will be managed externally from Hive, as shown in the following code sample.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';
In HDInsight the location shown in this example corresponds to wasbs://[container-name]@[storage-account-name].blob.core.windows.net/mydata/mytable in Azure storage.
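The mapping between a cluster-relative LOCATION and the underlying Azure blob URI is mechanical, as this sketch shows (the container and account names are placeholders):

```python
def wasbs_uri(container, account, path):
    """Build the Azure blob storage URI that corresponds to a
    cluster-relative HDInsight path such as '/mydata/mytable'."""
    return "wasbs://{0}@{1}.blob.core.windows.net{2}".format(container, account, path)

print(wasbs_uri("mycontainer", "myaccount", "/mydata/mytable"))
# wasbs://mycontainer@myaccount.blob.core.windows.net/mydata/mytable
```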
This ability to manage the lifetime of the table data separately from the metadata definition of the table
means that you can create several tables and views over the same data, but each can have a different
schema. For example, you may want to include fewer columns in one table definition to reduce the
network load when you transfer the data to a specific analysis tool, but have all of the columns available
for another tool.
As a general guide you should:
Use INTERNAL tables (the default, commonly referred to as managed tables) when you want
Hive to manage the lifetime of the table or when the data in the table is temporary; for
example, when you are running experimental or one-off queries over the source data.
Use INTERNAL tables and also specify the LOCATION for the data files when you want to access
the data files from outside of Hive; for example, if you want to upload the data for the table
directly into the Azure storage location.
Use EXTERNAL tables when you want to manage the lifetime of the data, when data is used by
processes other than Hive, or if the data files must be preserved when the table is dropped.
Note, however, that you cannot use EXTERNAL tables when you implicitly create the table by
executing a SELECT query against an existing table.
Use the LOAD statement when you need to create a table from the results of a map/reduce job
or a Pig script. These scripts generate log and status files as well as the output file when they
execute, and using the LOAD method enables you to easily add the output data to a table
without having to deal with the additional files that you do not want to include in the table.
Alternatively you can move the output file to a different location before you create a Hive table
over it.
Use the INSERT statement when you want to load data from an existing table into a different
table. A common use of this approach is to upload source data into a staging table that matches
the format of the source data (for example, tab-delimited text). Then, after verifying the staged
data, compress and load it into a table for analysis, which may be in a different format such as a
SEQUENCE FILE.
Use a SELECT query in a CREATE TABLE statement to generate the table dynamically when you
just want simplicity and flexibility. You do not need to know the column details to create the
table, and you do not need to change the statement when the source data or the SELECT
statement changes. You cannot, however, create an EXTERNAL or partitioned table this way
and so you cannot control the data lifetime of the new table separately from the metadata
definition.
To compress data as you insert it from one table to another, you must set some Hive parameters to
specify that the results of the query should be compressed, and specify the compression algorithm to be
used. The raw data for the table is in TextFile format, which is the default storage format. However,
compression may mean that Hadoop will not be able to split the file into chunks/blocks and run multiple
map tasks in parallel, which can result in under-utilization of the cluster resources by preventing
multiple map tasks from running concurrently. The recommended practice is to insert data into another
table, which is stored in SequenceFile format. Hadoop can split data in SequenceFile format and
distribute it across multiple map jobs.
For example, the following HiveQL statements load compressed data from a staging table into another
table.
HiveQL
CREATE TABLE raw (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
CREATE TABLE raw_sequence (line STRING)
STORED AS SEQUENCEFILE;
LOAD DATA LOCAL INPATH '/path/file.gz' INTO TABLE raw;
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK; -- NONE/RECORD/BLOCK (see below)
INSERT OVERWRITE TABLE raw_sequence SELECT * FROM raw;
The value for io.seqfile.compression.type determines how the compression is performed. The options
are NONE, RECORD, and BLOCK. RECORD compresses each value individually, while BLOCK buffers up
1MB (by default) before beginning compression.
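The trade-off between RECORD and BLOCK compression can be illustrated with ordinary zlib, outside Hadoop entirely: compressing many small values one at a time leaves the compressor little redundancy to exploit, while buffering them into a block compresses far better. This is an analogy, not the SequenceFile implementation itself.

```python
import zlib

# A thousand small, repetitive "records".
records = [("row-%04d\tsome repetitive value" % i).encode("ascii")
           for i in range(1000)]

# RECORD-style: compress each value individually.
record_style = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: buffer the records together, then compress once.
block_style = len(zlib.compress(b"\n".join(records)))

print(block_style < record_style)  # True: the block compresses much smaller
```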
For more information about creating tables with Hive, see Hive Data Definition Language on the Apache
Hive site. For a more detailed description of using Hive see Hive Tutorial.
Partitioning the data
Advanced options when creating a table include the ability to partition, skew, and cluster the data
across multiple files and folders:
You can use the PARTITIONED BY clause to create a subfolder for each distinct value in a
specified column (for example, to store a file of daily data for each date in a separate folder).
You can use the SKEWED BY clause to create separate files for rows where a specified column
value is in a list of specified values. Rows with values not listed are stored together in a separate
single file.
You can use the CLUSTERED BY clause to distribute data across a specified number of subfolders
(described as buckets) based on hashes of the values of specified columns.
When you partition a table, the partitioning columns are not included in the main table schema section
of the CREATE TABLE statement. Instead, they must be included in a separate PARTITIONED BY clause.
The partitioning columns can, however, still be referenced in SELECT queries. For example, the following
HiveQL statement creates a table in which the data is partitioned by a string value named partcol1.
HiveQL
CREATE EXTERNAL TABLE mytable (col1 STRING, col2 INT)
PARTITIONED BY (partcol1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/mydata/mytable';
When data is loaded into the table, subfolders are created for each partition column value. For example,
you could load the following data into the table.
col1      col2   partcol1
ValueA1   1      A
ValueA2   2      A
ValueB1   3      B
ValueB2   4      B
After this data has been loaded into the table, the /mydata/mytable folder will contain a subfolder
named partcol1=A and a subfolder named partcol1=B, and each subfolder will contain the data files for
the values in the corresponding partitions.
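The name=value folder convention is simple enough to sketch; this hypothetical helper mirrors the layout Hive produces for the table above.

```python
def partition_folder(table_root, column, value):
    # Hive stores each partition in a subfolder named column=value.
    return "{0}/{1}={2}".format(table_root, column, value)

rows = [("ValueA1", "A"), ("ValueA2", "A"), ("ValueB1", "B"), ("ValueB2", "B")]
folders = sorted({partition_folder("/mydata/mytable", "partcol1", part)
                  for _, part in rows})
print(folders)
# ['/mydata/mytable/partcol1=A', '/mydata/mytable/partcol1=B']
```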
When you need to load data into a partitioned table you must include the partitioning column values. If
you are loading a single partition at a time, and you know the partitioning value, you can specify explicit
partitioning values as shown in the following HiveQL INSERT statement.
HiveQL
FROM staging_table s
INSERT INTO mytable PARTITION(partcol1='A')
SELECT s.col1, s.col2
WHERE s.col3 = 'A';
Alternatively, you can use dynamic partition allocation so that Hive creates new partitions as required by
the values being inserted. To use this approach you must enable the non-strict option for the dynamic
partition mode, as shown in the following code sample.
HiveQL
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
FROM staging_table s
INSERT INTO mytable PARTITION(partcol1)
SELECT s.col1, s.col2, s.col3;
When designing an overall data processing solution with HDInsight, you may choose to perform complex
processing logic in custom map/reduce components or Pig scripts and then create a layer of Hive tables
over the results of the earlier processing, which can be queried by business users who are familiar with
basic SQL syntax. However, you can use Hive for all processing, in which case some queries may require
logic that is not possible to define in standard HiveQL functions.
In addition to common SQL semantics, HiveQL supports the use of:
Custom map/reduce scripts embedded in a query through the MAP and REDUCE clauses.
Custom user-defined functions (UDFs) that are implemented in Java, or that call Java functions
available in the existing installed libraries. UDFs are discussed in more detail in the topic User-defined functions.
XPath functions for parsing XML data using XPath. See Hive and XML File Processing for more
information.
This extensibility enables you to use HiveQL to perform complex transformations on data as it is queried.
To help you decide on the right approach, consider the following guidelines:
If the source data must be extensively transformed using complex logic before being consumed
by business users, consider using custom map/reduce components or Pig scripts to perform
most of the processing, and create a layer of Hive tables over the results to make them easily
accessible from client applications.
If the source data is already in an appropriate structure for querying and only a few specific but
complex transforms are required, consider using map/reduce scripts embedded in HiveQL
queries to generate the required results.
If queries will be created mostly by business users, but some complex logic is still regularly
required to generate specific values or aggregations, consider encapsulating that logic in custom
UDFs because these will be simpler for business users to include in their HiveQL queries than a
custom map/reduce script.
For more information about selecting data from Hive tables, see Language Manual Select on the Apache
Hive website. For some useful tips on using the SET command to configure headers and directory
recursion in Hive see Useful Hive settings.
Data
Value1  1
Value2  3
Value3  2
Value1  4
Value3  6
Value1  2
Value2  8
Value2  5
You could process the data in the source file with the following simple Pig Latin script.
Pig Latin
A = LOAD '/mydata/sourcedata.txt' USING PigStorage('\t') AS (col1, col2:long);
B = GROUP A BY col1;
C = FOREACH B GENERATE group, SUM(A.col2) as total;
D = ORDER C BY total;
STORE D INTO '/mydata/results';
This script loads the tab-delimited data into a relation named A imposing a schema that consists of two
columns: col1, which uses the default byte array data type, and col2, which is a long integer. The script
then creates a relation named B in which the rows in A are grouped by col1, and then creates a relation
named C in which the col2 value is aggregated for each group in B.
After the data has been aggregated, the script creates a relation named D in which the data is sorted
based on the total that has been generated. The relation D is then stored as a file in the /mydata/results
folder, which contains the following text.
Data
Value1 7
Value3 8
Value2 16
For more information about Pig Latin syntax, see Pig Latin Reference Manual 2 on the Apache Pig
website. For a more detailed description of using Pig see Pig Tutorial.
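The grouping, aggregation, and sorting performed by this Pig script can be illustrated outside Hadoop. The following Python sketch (for illustration only; it is not part of the guide's samples) reproduces the same totals from the sample data shown above.

```python
from collections import defaultdict

# Sample rows matching the tab-delimited source file shown above.
rows = [("Value1", 1), ("Value2", 3), ("Value3", 2), ("Value1", 4),
        ("Value3", 6), ("Value1", 2), ("Value2", 8), ("Value2", 5)]

# GROUP BY col1, then SUM(col2) for each group.
totals = defaultdict(int)
for col1, col2 in rows:
    totals[col1] += col2

# ORDER BY total (ascending, as in the Pig script).
results = sorted(totals.items(), key=lambda kv: kv[1])
print(results)  # [('Value1', 7), ('Value3', 8), ('Value2', 16)]
```

The ascending sort on the aggregated totals explains why Value1 appears first in the stored results even though Value2 has the largest total.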
The map function splits the contents of the text input into an array of strings using anything that is not
an alphabetic character as a word delimiter. Each string in the array is then used as the key of a new
key/value pair with the value set to 1.
Each key/value pair generated by the map function is passed to the reduce function, which sums the
values in key/value pairs that have the same key. Working together, the map and reduce functions
determine the total number of times each unique word appeared in the source data, as shown here.
Data
Aardvark	2
About	7
Above	12
Action	3
...
For more information about writing map/reduce code, see MapReduce Tutorial on the Apache Hadoop
website and Develop Java MapReduce programs for HDInsight on the Azure website.
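As a rough model of this flow (a sketch for illustration only, not the Java code that the MapReduce Tutorial describes), the map and reduce phases can be expressed as two small Python functions:

```python
import re
from itertools import groupby

def map_fn(text):
    # Use anything that is not an alphabetic character as a word
    # delimiter, and emit a (word, 1) pair for each word found.
    for word in re.split(r"[^A-Za-z]+", text):
        if word:
            yield (word, 1)

def reduce_fn(key, values):
    # Sum the values of all key/value pairs that share the same key.
    return (key, sum(values))

# sorted() stands in for the shuffle/sort Hadoop performs between phases.
pairs = sorted(map_fn("the cat and the hat"))
counts = [reduce_fn(k, (v for _, v in g))
          for k, g in groupby(pairs, key=lambda kv: kv[0])]
print(counts)  # [('and', 1), ('cat', 1), ('hat', 1), ('the', 2)]
```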
Using Hadoop streaming
The Hadoop core within HDInsight supports a technology called Hadoop Streaming that allows you to
interact with the map/reduce process and run your own code outside of the Hadoop core as a separate
executable process. Figure 1 shows a high-level overview of the way that streaming works.
Using the streaming interface does have a minor impact on performance. The additional movement of
the data over the streaming interface can marginally increase query execution time. Streaming tends to
be used mostly to enable the creation of map and reduce components in languages other than Java. It is
quite popular when using Python, and also enables the use of .NET languages such as C# and F# with
HDInsight.
The Azure SDK contains a series of classes that make it easier to use the streaming interface from .NET
code. For more information see Microsoft .NET SDK For Hadoop on CodePlex.
For more details see Hadoop Streaming on the Apache website. For information about writing HDInsight
map/reduce jobs in languages other than Java, see Develop C# Hadoop streaming programs for
HDInsight and Hadoop Streaming Alternatives.
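For example, a minimal streaming mapper and reducer pair exchanges tab-separated key/value lines over standard input and output. The following Python sketch is illustrative only (the word-count logic and names are assumptions, not code from the SDK):

```python
import re
from itertools import groupby

def mapper(lines):
    # Streaming mapper: read raw text lines and write one
    # "word<TAB>1" line per word found.
    for line in lines:
        for word in re.split(r"[^A-Za-z]+", line):
            if word:
                yield f"{word}\t1"

def reducer(lines):
    # Streaming reducer: Hadoop sorts the mapper output by key, so
    # identical words arrive on adjacent lines and can be summed.
    pairs = [line.split("\t") for line in lines]
    for word, group in groupby(pairs, key=lambda p: p[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# sorted() stands in for the shuffle/sort step between the two phases.
output = list(reducer(sorted(mapper(["the cat sat", "the cat"]))))
print(output)  # ['cat\t2', 'sat\t1', 'the\t2']
```

In an actual streaming job each function would be a separate executable reading `sys.stdin`, launched by the Hadoop Streaming interface shown in Figure 1.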
User-defined functions
You can create user-defined functions (UDFs) and libraries of UDFs for use with HDInsight queries and
transformations. Typically, the UDFs are written in Java and they can be referenced and used in a Hive or
Pig script, or (less commonly) in custom map/reduce code. You can write UDFs in Python for use with Pig,
but the techniques are different from those described in this topic.
UDFs can be used not only to centralize code for reuse, but also to perform tasks that are difficult (or
even impossible) in the Hive and Pig scripting languages. For example, a UDF could perform complex
validation of values, concatenation of column values based on complex conditions or formats,
aggregation of rows, replacement of specific values with nulls to prevent errors when processing bad
records, and much more.
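As an indication of the kind of logic such a UDF might encapsulate, the following Python sketch replaces unparsable numeric values with nulls (Hive UDFs themselves are normally written in Java; this function and its name are purely illustrative):

```python
def clean_numeric(value):
    # Replace values that cannot be parsed as integers with None
    # (null), so bad records do not cause errors during processing.
    try:
        return int(value.strip())
    except (ValueError, AttributeError):
        return None

cleaned = [clean_numeric(v) for v in ["42", " 7 ", "oops", None]]
print(cleaned)  # [42, 7, None, None]
```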
The topics covered here are:
import org.apache.hadoop.hive.ql.exec.UDF;
public final class your-udf-name extends UDF {
public Text evaluate(final Text s) {
// Implementation here.
// Return the result.
}
}
The body of the UDF, the evaluate function, accepts one or more string values. Hive passes the values of
columns in the dataset to these parameters at runtime, and the UDF generates a result. This might be a
text string that is returned within the dataset, or a Boolean value if the UDF performs a comparison test
against the values in the parameters. The arguments must be types that Hive can serialize.
After you create and compile the UDF, you upload it to HDInsight at the start of the Hive session using a
script with the following command.
Hadoop command
add jar /path/your-udf-name.jar;
Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as
described in Automating cluster management with PowerShell. Then you must register it using a
command such as the following.
Hive script
CREATE TEMPORARY FUNCTION local-function-name
AS 'package-name.function-name';
You can then use the UDF in your Hive query or transformation. For example, if the UDF returns a text
string value you can use it as shown in the following code to replace the value in the specified column of
the dataset with the value generated by the UDF.
Hive script
SELECT your-udf-name(column-name) FROM your-data
If the UDF performs a simple task such as reversing the characters in a string, the result would be a
dataset where the value in the specified column of every row would have its contents reversed.
However, the registration only makes the UDF available for the current session, and you will need to re-register it each time you connect to Hive.
For more information about creating and using standard UDFs in Hive, see HivePlugins on the Apache
website. For more information about creating different types of Hive UDF, see User Defined Functions in
Hive, Three Little Hive UDFs: Part 1, Three Little Hive UDFs: Part 2, and Three Little Hive UDFs: Part 3 on
the Oracle website.
After you create and compile the UDF, you upload it to HDInsight at the start of the Pig session with the
following command.
Hadoop command
add jar /path/your-udf-name.jar;
Alternatively, you can upload the UDF to a shared library folder when you create the cluster, as
described in Automating cluster management with PowerShell. The REGISTER command at the start of a
Pig script will then make the UDF available and you can use it in your Pig queries and transformations.
For example, if the UDF returns the lower-cased equivalent of the input string, you can use it as shown
in this query to generate a list of lower-cased equivalents of the text strings in the first column of the
input data.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FOREACH A GENERATE your-udf-name.function-name(column1);
DUMP B;
A second type of UDF in Pig is a filter function that you can use to filter data. A filter function must
extend the class FilterFunc, accept one or more values as Tuple instances, and return a Boolean value.
The UDF can then be used to filter rows based on values in a specified column of the dataset. For
example, if a UDF named IsShortString returns true for any input value less than five characters in
length, you could use the following script to remove any rows where the first column has a value less
than five characters.
Pig script
REGISTER 'your-udf-name.jar';
A = LOAD 'your-data' AS (column1: chararray, column2: int);
B = FILTER A BY NOT IsShortString(column1);
DUMP B;
For more information about creating and using UDFs in Pig, see the Pig UDF Manual on the Apache Pig
website.
In the example, Hive scripts create the metadata definition for two tables that have different schemas.
The table named mydata is created over some source data uploaded as a file to HDInsight, and this
defines a Hive location for the table data (step 1 in Figure 2). Next, a Pig script reads the data defined in
the mydata table, summarizes it, and stores it back in the second table named mysummary (steps 2 and
3). However, in reality, the Pig script does not access the Hive tables (which are just metadata
definitions). It must access the source data file in storage, and write the summarized result back into
storage, as shown by the dotted arrow in Figure 2.
In Hive, the path or location of these two files is (by default) denoted by the name used when the tables
were created. For example, the following HiveQL shows the definition of the two Hive tables in this
scenario, and the code that loads a data file into the mydata table.
HiveQL
CREATE TABLE mydata (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/mydata/data.txt' INTO TABLE mydata;
CREATE TABLE mysummary (col1 STRING, col2 BIGINT);
Hive scripts can now access the data simply by using the table names mydata and mysummary. Users
can use HiveQL to query the Hive table without needing to know anything about the underlying file
location or the data format of that file.
However, the Pig script that will group and aggregate the data, and store the results in the mysummary
table, must know both the location and the data format of the files. Without HCatalog, the script must
specify the full path to the mydata table source file, and be aware of the source schema in order to
apply an appropriate schema (which must be defined in the script). In addition, after the processing is
complete, the Pig script must specify the location associated with the mysummary table when storing
the result back in storage, as shown in the following code sample.
Pig Latin
A = LOAD '/mydata/data.txt'
USING PigStorage('\t') AS (col1, col2:long);
...
...
...
STORE X INTO '/mysummary/data.txt';
The file locations, the source format, and the schema are hard-coded in the script, creating some
potentially problematic dependencies. For example, if an administrator moves the data files, or changes
the format by adding a column, the Pig script will fail.
Using HCatalog removes these dependencies by enabling Pig to use the Hive metadata that defines the
tables. To use HCatalog with Pig you must specify the -useHCatalog parameter, and the path to the
HCatalog installation files must be registered as an environment variable named HCAT_HOME. For
example, you could use the following Hadoop command line statements to launch the Grunt interface
with HCatalog enabled.
Command Line to start Pig
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
Pig -useHCatalog
With the HCatalog support loaded you can now use the HCatLoader and HCatStorer objects in the Pig
script, enabling you to access the data through the Hive metadata instead of requiring direct access to
the data file storage.
Pig Latin
A = LOAD 'mydata'
USING org.apache.hcatalog.pig.HCatLoader();
...
...
...
STORE X INTO 'mysummary'
USING org.apache.hcatalog.pig.HCatStorer();
The script stores the summarized data in the location denoted for the mysummary table defined in
Hive, and so it can be queried using HiveQL as shown in the following example.
HiveQL
SELECT * FROM mysummary;
HCatalog also exposes notification events that you can use by other tools such as Oozie to detect when
certain storage events occur.
For more information see HCatalog in the Apache Hive Confluence Spaces.
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.compression.type=BLOCK;
SET hive.exec.compress.intermediate=true;
Alternatively, when executing a Hadoop command, you can use the -D parameter to set property values,
as shown here.
Command Line
hadoop [COMMAND] -D property=value
The space between the -D and the property name can be omitted. For more details of the options you
can use in a job command, see the Apache Hadoop Commands Manual.
For information about how you can set the configuration of a cluster for all jobs, rather than configuring
each job at runtime, see Custom cluster management clients.
Debugging and testing
Debugging and testing a Hadoop-based solution is more difficult than a typical local application.
Executable applications that run on the local development machine, and web applications that run on a
local development web server such as the one built into Visual Studio, can easily be run in debug mode
within the integrated development environment. Developers use this technique to step through the
code as it executes, view variable values and call stacks, monitor procedure calls, and much more.
None of these functions are available when running code in a remote cluster. However, there are some
debugging techniques you can apply. This section contains information that will help you to understand
how to go about debugging and testing your solutions:
Writing out significant values or messages during execution. You can add extra statements or
instructions to your scripts or components to display the values of variables, export datasets,
write messages, or increment counters at significant points during the execution.
Obtaining debugging information from log files. You can monitor log files and standard error
files for evidence that will help you locate failures or problems.
Using a single-node local cluster for testing and debugging. Running the solution in a local or
remote single-node cluster can help to isolate issues with parallel execution of mappers and
reducers.
Hadoop jobs may fail for reasons other than an error in the scripts or code. The two primary reasons are
timeouts and unhandled errors due to bad input data. By default, Hadoop will abandon a job if it does
not report its status or perform I/O activity every ten minutes. Typically most jobs will do this
automatically, but some processor-intensive tasks may take longer.
If a job fails due to bad input data that the map and reduce components cannot handle, you can instruct
Hadoop to skip bad records. While this may affect the validity of the output, skipping small volumes of
the input data may be acceptable. For more information, see the section Skipping Bad Records in the
MapReduce Tutorial on the Apache website.
If you are using Hive you might be able to split a complex script into separate simpler scripts,
and display the intermediate datasets to help locate the source of the error.
Dump messages and/or intermediate datasets generated by the script to disk before
executing the next command. This can indicate where the error occurs, and provide a
sample of the data for you to examine.
Call methods of the EvalFunc class that is the base class for most evaluation functions in
Pig. You can use this approach to generate heartbeat and progress messages that prevent
timeouts during execution, and to write to the standard log file. See Class EvalFunc<T> on
the Pig website for more information.
If you are using custom map and reduce components you can write debugging messages to the
standard output file from the mapper and then aggregate them in the reducer, or generate
status messages from within the components. See the Reporter class on the Apache website for
more information.
If the problem arises only occasionally, or on only one node, it may be due to bad input data. If
skipping bad records is not appropriate, add code to your mapper class that validates the input
data and reports any errors encountered when attempting to parse or manipulate it. Write the
details and an extract of the data that caused the problem to the standard error file.
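Such a validation step might look like the following Python sketch (the record format, function name, and log destination are illustrative assumptions, not code from the guide):

```python
import io
import sys

def parse_record(line, error_log=sys.stderr):
    # Validate a tab-delimited "key<TAB>number" record. On failure,
    # write the details and an extract of the bad data to the
    # standard error file and return None so the caller can skip it.
    try:
        key, value = line.rstrip("\n").split("\t")
        return (key, int(value))
    except ValueError as e:
        error_log.write(f"Bad record skipped ({e}): {line[:80]!r}\n")
        return None

# A dry run capturing the error stream in memory.
log = io.StringIO()
records = [parse_record(l, log) for l in ["a\t3", "broken-line", "b\tNaN"]]
print(records)  # [('a', 3), None, None]
```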
You can kill a task while it is operating and view the call stack and other useful information such
as deadlocks. To kill a job, execute the command kill -QUIT [job_id]. The job ID can be found in
the Hadoop YARN Status portal. The debugging information is written to the standard output
(stdout) file.
For information about the Hadoop YARN Status portal, see Monitoring and logging.
Obtaining debugging information from log files
The core Hadoop engine in HDInsight generates a range of information in log files, counters, and status
messages that is useful for debugging and testing the performance of your solutions. Much of this
information is accessible through the Hadoop YARN Status portal.
The following list contains some suggestions to help you obtain debugging information from HDInsight:
Use the Applications section of the Hadoop YARN Status portal to view the status of jobs. Select
FAILED in the menu to see failed jobs. Select the History link for a job to see more details. In the
details page are menu links to show the job counters, and details of each map and reduce task.
The Task Details page shows the errors, and provides links to the log files and the values of
custom and built-in counters.
View the history, job configuration, syslog, and other log files. The Tools section of the Hadoop
YARN Status portal contains a link that opens the log files folder where you can view the logs,
and also a link to view the current configuration of the cluster. In addition, see Monitoring and
logging in the section Building end-to-end solutions using HDInsight.
Run a debug information script automatically to analyze the contents of the standard error,
standard output, job configuration, and syslog files. For more information about running debug
scripts, see How to Debug Map/Reduce Programs and Debugging in the MapReduce Tutorial on
the Apache website.
entries are uploaded each day, and must be parsed to load the data they contain into a Hive table. A
workflow to accomplish this might consist of the following steps:
1. Insert data from the files located in the /uploads folder into the Hive table.
2. Delete the source files, which are no longer required.
This workflow is relatively simple, but could become more complex when other required tasks are
added. For example:
1. If there are no files in the /uploads folder, go to step 5.
2. Insert data from the files into the Hive table.
3. Delete the source files, which are no longer required.
4. Send an email message to an operator indicating success, and stop.
5. Send an email message to an operator indicating failure.
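The branching workflow above can be sketched as a simple script in which each external action is a placeholder callable (all names here are illustrative, not part of the guide's samples):

```python
def run_upload_workflow(list_uploads, load_into_hive, delete_files, notify):
    # Steps 1-5 of the workflow described above, with each external
    # action passed in as a callable so the control flow stays explicit.
    files = list_uploads()                       # step 1: check /uploads
    if not files:
        notify("failure: no files in /uploads")  # step 5: failure email
        return False
    load_into_hive(files)                        # step 2: insert into Hive table
    delete_files(files)                          # step 3: delete source files
    notify("success")                            # step 4: success email
    return True

# A dry run with stub actions that just record what happened.
events = []
ok = run_upload_workflow(
    list_uploads=lambda: ["upload1.log", "upload2.log"],
    load_into_hive=lambda f: events.append(("load", f)),
    delete_files=lambda f: events.append(("delete", f)),
    notify=lambda msg: events.append(("notify", msg)),
)
print(ok, events[-1])  # True ('notify', 'success')
```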
Implementing these kinds of workflows is possible in a range of ways. For example, you could:
Use the Oozie framework that is installed with HDInsight, and PowerShell or the Oozie client in
the HDInsight .NET SDK to execute it. This is a good option when:
You want to execute workflows from within a program running on a client computer.
You are familiar with .NET and prepared to write programs that use the .NET Framework.
For more information see Use Oozie with HDInsight. If you are not familiar with Oozie, see the
next section, "Creating workflows with Oozie," for an overview of how it can be used. A
demonstration of using an Oozie workflow can also be found in the topic Scenario 3: ETL
automation.
Use SQL Server Integration Services (SSIS) or a similar integration framework. This
is a good option when:
You have SQL Server installed and are experienced with writing SSIS workflows.
Use the Cascading abstraction layer software. This is a good choice when:
You want to execute complex data processing workflows written in any language that runs
on the Java virtual machine.
You have complex multi-level workflows that you need to combine into a single task.
You want to control the execution of the map and reduce phases of jobs directly in code.
Create a custom application or script that executes the tasks as a workflow. This is a good
option when:
You need a fairly simple workflow that can be expressed using your chosen programming or
scripting language.
You want to run scripts on a schedule, perhaps driven by Windows Scheduled Tasks.
You are prepared to use a Remote Desktop connection to communicate with the cluster to
administer the processes.
Third party workflow frameworks such as Hamake or Azkaban are also available and are a good option
when you are familiar with these tools, or if they offer a capability you need that is not available in other
tools. However, they are not currently supported on HDInsight.
More information
Oozie workflows can be executed using the Oozie time-based coordinator, or by using the classes in the
HDInsight SDK. The topic Use Oozie with HDInsight on the Azure website describes how you can use
Oozie, and the topic Use time-based Oozie Coordinator with HDInsight extends this to show time-based
coordination of a workflow.
For information about automating an entire solution see Building end-to-end solutions using HDInsight.
</configuration>
<script>script.q</script>
<param>INPUT_TABLE=HiveSampleTable</param>
<param>OUTPUT=/results/sampledata</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
<kill name="fail">
<message>Hive failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="end"/>
</workflow-app>
The action itself is a Hive job defined in a HiveQL script file named script.q, with two parameters named
INPUT_TABLE and OUTPUT. The code in script.q is shown in the following example.
HiveQL
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM ${INPUT_TABLE}
The script file is stored in the same folder as the workflow.xml hPDL file, along with a standard
configuration file for Hive jobs named hive-default.xml.
A configuration file named job.properties is stored on the local file system of the computer on which
the Oozie client tools are installed. This file, shown in the following example, contains the settings that
will be used to execute the job.
job.properties
nameNode=wasb://my_container@my_asv_account.blob.core.windows.net
jobTracker=jobtrackerhost:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=/workflowfiles/
To initiate the workflow, the following command is executed on the computer where the Oozie client
tools are installed.
Command Line
oozie job -oozie http://localhost:11000/oozie/ -config c:\scripts\job.properties run
When Oozie starts the workflow it returns a job ID in the format 0000001-123456789123456-oozie-hdp-W. You can check the status of a job by opening a Remote Desktop connection to the cluster and using a
web browser to navigate to http://localhost:11000/oozie/v0/job/job-id?show=log.
You can also initiate an Oozie job by using Windows PowerShell or the .NET SDK for HDInsight. For more
details see Initiating an Oozie workflow with PowerShell and Initiating an Oozie workflow from a .NET
application.
For more information about Oozie see Apache Oozie Workflow Scheduler for Hadoop. An example of
using Oozie can be found in Scenario 3: ETL automation.
Typically, you will use the simplest of these approaches that can provide the results you require. For
example, it may be that you can achieve these results by using just Hive, but for more complex scenarios
you may need to use Pig or even write your own map and reduce components. You may also decide,
after experimenting with Hive or Pig, that custom map and reduce components can provide better
performance by allowing you to fine tune and optimize the processing.
The following table shows some of the more general suggestions that will help you make the
appropriate choice of query technology depending on the requirements of your task.
Requirement / Appropriate technologies
[Table content lost: the rows matched typical task requirements to appropriate technologies, including Hive, Hadoop Streaming, HCatalog, Storm, and Mahout.]
For more information about these tools and technologies see Data processing tools and techniques.
Before you use PowerShell to work with HDInsight you must configure the PowerShell environment to
connect to your Azure subscription. To do this you must first download and install the Azure PowerShell
module, which is available through the Microsoft Web Platform Installer. For more details see How to
install and configure Azure PowerShell.
The following examples demonstrate some common scenarios for submitting and running jobs using
PowerShell:
Use the makecert command in a Visual Studio command line to create a certificate and upload
it to your subscription in the Azure management portal as described in Create and Upload a
Management Certificate for Azure.
After you have created and installed your certificate, it will be stored in the Personal certificate store on
your computer. You can view the details in the certmgr.msc console.
The following examples demonstrate some common scenarios for submitting and running jobs using
.NET Framework code:
Due to page width limitations we have broken some of the commands in the code above across several
lines for clarity. In your code each command must be on a single, unbroken line.
The Azure PowerShell module also provides the New-AzureHDInsightStreamingMapReduceJobDefinition cmdlet, which you can use to execute map/reduce
jobs that are implemented in .NET assemblies and that use the Hadoop Streaming API. This cmdlet
enables you to specify discrete .NET executables for the mapper and reducer to be used in the job.
After defining the job you can initiate it with the Start-AzureHDInsightJob cmdlet, wait for it to complete with
the Wait-AzureHDInsightJob cmdlet, and retrieve the completion status with the Get-AzureHDInsightJobOutput cmdlet.
The following code example shows a PowerShell script that executes a Hive job based on hard-coded
HiveQL code in the PowerShell script. A Query parameter is used to specify the HiveQL code to be
executed, and in this example some of the code is generated dynamically based on a PowerShell
variable.
Windows PowerShell
$clusterName = "cluster-name"
$tableFolder = "/data/mytable"
$hiveQL = "CREATE TABLE mytable (id INT, val STRING)"
$hiveQL += " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
$hiveQL += " STORED AS TEXTFILE"
$hiveQL += " LOCATION '$tableFolder';"
As an alternative to hard-coding HiveQL or Pig Latin code in a PowerShell script, you can use the File
parameter to reference a file in Azure storage that contains the Pig Latin or HiveQL code to be executed.
In the following code example a PowerShell script uploads a Pig Latin code file that is stored locally in
the same folder as the PowerShell script, and then uses it to execute a Pig job.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
# Find the folder where this PowerShell script is saved
$localfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
$destfolder = "data/scripts"
$scriptFile = "ProcessData.pig"
# Upload Pig Latin script to Azure Storage
<property>
<name>TableName</name>
<value>$tableName</value>
</property>
<property>
<name>TableFolder</name>
<value>$tableFolder</value>
</property>
<property>
<name>user.name</name>
<value>$clusterUser</value>
</property>
<property>
<name>oozie.wf.application.path</name>
<value>$ooziePath</value>
</property>
</configuration>
"@
# Create Oozie job.
$clusterUriCreateJob = "https://$clusterName.azurehdinsight.net:443/oozie/v2/jobs"
$response = Invoke-RestMethod -Method Post
-Uri $clusterUriCreateJob
-Credential $creds
-Body $OoziePayload
-ContentType "application/xml" -OutVariable $OozieJobName
$jsonResponse = ConvertFrom-Json(ConvertTo-Json -InputObject $response)
$oozieJobId = $jsonResponse[0].("id")
Write-Host "Oozie job id is $oozieJobId..."
# Start Oozie job.
Write-Host "Starting the Oozie job $oozieJobId..."
$clusterUriStartJob = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/" +
$oozieJobId + "?action=start"
$response = Invoke-RestMethod -Method Put -Uri $clusterUriStartJob -Credential $creds
| Format-Table -HideTableHeaders
# Get job status.
Write-Host "Waiting until the job metadata is populated in the Oozie metastore..."
Start-Sleep -Seconds 10
Write-Host "Getting job status and waiting for the job to complete..."
$clusterUriGetJobStatus = "https://$clusterName.azurehdinsight.net:443/oozie/v2/job/"
+ $oozieJobId + "?show=info"
$response = Invoke-RestMethod -Method Get -Uri $clusterUriGetJobStatus -Credential
$creds
$jsonResponse = ConvertFrom-Json (ConvertTo-Json -InputObject $response)
$JobStatus = $jsonResponse[0].("status")
while($JobStatus -notmatch "SUCCEEDED|KILLED")
{
Due to page width limitations we have broken some of the commands in the code above across several
lines for clarity. In your code each command must be on a single, unbroken line.
The request to submit the Oozie job consists of an XML configuration document that contains the
properties to be used by the workflow. These properties include configuration settings for Oozie, and
any parameters that are required by actions defined in the Oozie job. In this example the properties
include parameters named TableName and TableFolder that are used by the following action in the
workflow.xml file uploaded in the oozieworkflow folder.
Partial Oozie Workflow XML
<action name='CreateTable'>
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>CreateTable.q</script>
<param>TABLE_NAME=${TableName}</param>
<param>LOCATION=${TableFolder}</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
This action passes the parameter values to the CreateTable.q file, also in the oozieworkflow folder,
which is shown in the following code example.
HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;
namespace MRClient
{
class Program
{
static void Main(string[] args)
{
// Azure variables.
string subscriptionID = "subscription-id";
string certFriendlyName = "certificate-friendly-name";
string clusterName = "cluster-name";
// Define the MapReduce job.
Notice the variables required to configure the Hadoop client. These include the unique ID of the
subscription in which the cluster is defined (which you can view in the Azure management portal), the
friendly name of the Azure management certificate to be loaded (which you can view in certmgr.msc),
and the name of your HDInsight cluster.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;
namespace HiveClient
{
class Program
{
static void Main(string[] args)
{
// Azure variables.
string subscriptionID = "subscription-id";
string certFriendlyName = "certificate-friendly-name";
string clusterName = "cluster-name";
string hiveQL = @"CREATE TABLE mytable (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/data/mytable';";
// Define the Hive job.
HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
{
JobName = "Create Table",
StatusFolder = "/CreateTableStatus",
Query = hiveQL
};
// Get the certificate object from certificate store
// using the friendly name to identify it.
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly);
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
.First(item => item.FriendlyName == certFriendlyName);
JobSubmissionCertificateCredential creds = new
JobSubmissionCertificateCredential(
new Guid(subscriptionID), cert, clusterName);
// Create a hadoop client to connect to HDInsight.
var jobClient = JobSubmissionClientFactory.Connect(creds);
// Run the Hive job.
JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);
// Wait for the job to complete.
Console.Write("Job running...");
JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
while (jobInProgress.StatusCode != JobStatusCode.Completed
&& jobInProgress.StatusCode != JobStatusCode.Failed)
{
Console.Write(".");
jobInProgress = jobClient.GetJob(jobInProgress.JobId);
Thread.Sleep(TimeSpan.FromSeconds(10));
}
// Job is complete
Console.WriteLine("!");
Console.WriteLine("Job complete!");
Console.WriteLine("Press a key to end.");
Console.Read();
}
}
}
Notice the variables required to configure the Hadoop client. These include the unique ID of the
subscription in which the cluster is defined (which you can view in the Azure management portal), the
friendly name of the Azure management certificate to be loaded (which you can view in certmgr.msc),
and the name of your HDInsight cluster.
In previous example, the HiveQL command to be executed was specified as the Query parameter of the
HiveJobCreateParameters object. A similar approach is used to specify the Pig Latin statements to be
executed when using the PigJobCreateParameters class. Alternatively, you can use the File property to
specify a file in Azure storage that contains the HiveQL or Pig Latin code to be executed. The following
code example shows how to submit a Pig job that executes the Pig Latin code in a file that already exists
in Azure storage.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;
namespace PigClient
{
class Program
{
static void Main(string[] args)
{
// Azure variables.
string subscriptionID = "subscription-id";
string certFriendlyName = "certificate-friendly-name";
string clusterName = "cluster-name";
// Define the Pig job.
PigJobCreateParameters pigJobDefinition = new PigJobCreateParameters()
{
StatusFolder = "/PigJobStatus",
File = "/weather/scripts/SummarizeWeather.pig"
};
// Get the certificate and create the job submission credentials
// in the same way as in the previous Hive example.
X509Store store = new X509Store();
store.Open(OpenFlags.ReadOnly);
X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
.First(item => item.FriendlyName == certFriendlyName);
JobSubmissionCertificateCredential creds = new
JobSubmissionCertificateCredential(
new Guid(subscriptionID), cert, clusterName);
// Create a hadoop client to connect to HDInsight.
var jobClient = JobSubmissionClientFactory.Connect(creds);
// Run the Pig job, then wait for it to complete
// using the same polling loop as in the Hive example.
JobCreationResults jobResults = jobClient.CreatePigJob(pigJobDefinition);
You can combine this approach with any of the data upload techniques described in Uploading data with
the Microsoft .NET Framework to build a client application that uploads source data and the Pig Latin or
HiveQL code files required to process it, and then submits a job to initiate processing.
C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
using Microsoft.Hadoop.WebClient;
using Microsoft.Hadoop.WebClient.OozieClient;
using Microsoft.Hadoop.WebClient.OozieClient.Contracts;
using Newtonsoft.Json;
namespace OozieClient
{
class Program
{
const string hdInsightUser = "user-name";
const string hdInsightPassword = "password";
const string hdInsightCluster = "cluster-name";
const string azureStore = "storage-account-name";
const string azureStoreKey = "storage-account-key";
const string azureStoreContainer = "container-name";
const string workflowDir = "/data/oozieworkflow/";
const string inputPath = "/data/source/";
const string outputPath = "/data/output/";
static void Main(string[] args)
{
try
{
UploadWorkflowFiles().Wait();
CreateAndExecuteOozieJob().Wait();
}
catch (Exception ex)
{
Console.WriteLine(ex.Message);
}
finally
{
Console.WriteLine("Press a key to end");
Console.Read();
}
}
private static async Task UploadWorkflowFiles()
{
try
{
var workflowLocalDir = new DirectoryInfo(@".\oozieworkflow");
var hdfsClient = new WebHDFSClient(hdInsightUser,
new BlobStorageAdapter(azureStore, azureStoreKey, azureStoreContainer,
false));
Console.WriteLine("Uploading workflow files...");
await hdfsClient.DeleteDirectory(workflowDir);
Notice that an OozieJobProperties object contains the properties to be used by the workflow. These
properties include configuration settings for Oozie as well as any parameters that are required by
actions defined in the Oozie job. In this example the properties include parameters named TableName
and TableFolder, which are used by the following action in the workflow.xml file that is uploaded in the
oozieworkflow folder.
Partial Oozie Workflow XML
<action name='CreateTable'>
<hive xmlns="uri:oozie:hive-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>default</value>
</property>
<property>
<name>oozie.hive.defaults</name>
<value>hive-default.xml</value>
</property>
</configuration>
<script>CreateTable.q</script>
<param>TABLE_NAME=${TableName}</param>
<param>LOCATION=${TableFolder}</param>
</hive>
<ok to="end"/>
<error to="fail"/>
</action>
This action passes the parameter values to the CreateTable.q file. This file is also in the oozieworkflow
folder, and is shown in the following code example.
HiveQL
DROP TABLE IF EXISTS ${TABLE_NAME};
CREATE EXTERNAL TABLE ${TABLE_NAME} (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '${LOCATION}';
Figure 1 represents a map that shows the various routes through which data can be delivered from an
HDInsight cluster to client applications. The client application destinations shown on the map include
Excel and SQL Server Reporting Services (SSRS). You can also create custom applications, or use common
enterprise BI data visualization technologies such as PerformancePoint Services in SharePoint Server to
consume data from HDInsight, directly or indirectly, using the same interfaces.
The topics and technologies discussed in this section of the guide are:
Microsoft Excel
Security is also a fundamental concern in all computing scenarios, and big data processing is no
exception. Security considerations apply during all stages of a big data process, and include securing
data while in transit over the network, securing data in storage, and authenticating and authorizing
users who have access to the tools and utilities you use as part of your process. For more details of how
you can maximize the security of your HDInsight solutions, see the topic Security in the section Building end-to-end solutions using HDInsight.
More information
For more information about HDInsight, see Microsoft Azure HDInsight.
For more information about the tools and add-ins for Excel see Power BI for Office 365.
For more information about the HDInsight .NET SDKs see HDInsight SDK Reference Documentation on
MSDN.
Microsoft Excel
Excel is one of the most widely used data manipulation and visualization applications in the world, and is
commonly used as a tool for interactive data analysis and reporting. It supports comprehensive data
import and connectivity options that include built-in data connectivity to a wide range of data sources,
and the availability of add-ins such as Power Query, Power View, PowerPivot, and Power Map.
Additionally, Power BI for Office 365 provides a cloud-based platform for sharing data and reports in
Excel workbooks across the enterprise.
Excel includes a range of analytical tools and visualizations that you can apply to tables of data in one or
more worksheets within a workbook. All Excel 2013 workbooks encapsulate a data model in which you
can define tables of data and relationships between them. These data models make it easier to slice
and dice data in PivotTables and PivotCharts, and to create Power View visualizations.
Office 2013 and Office 365 ProPlus are available in 32-bit and 64-bit versions. If you plan to use Excel to
build data models and perform analysis of big data processing results, the 64-bit version of Office is
recommended because of its ability to handle larger volumes of data.
Excel is especially useful when you want to add value and insight by augmenting the results of your data
analysis with external data. For example, you may perform an analysis of social media sentiment data by
geographical region in HDInsight, and consume the results in Excel. This geographically oriented data
can be enhanced by subscribing to a demographic dataset in the Azure Marketplace. The socioeconomic and population data may provide an insight into why your organization is more popular in
some locations than in others.
As well as datasets that you can download and use to augment your results, Azure Marketplace includes
a number of data services that you can use for data validation (for example, verifying that telephone
numbers and postal codes are valid) and for data transformation (for example, looking up the country or
region, state, and city for a particular IP address or longitude/latitude value).
The following topics describe the tools and techniques you can use to import and visualize data using
Excel:
Power Query
PowerPivot
When you install the 64-bit version of the Hive ODBC driver it also installs the 32-bit version, so you will be able to use it to connect to Hive from both 64-bit and 32-bit applications.
You can simplify the process of connecting to HDInsight by using the Data Sources (ODBC)
administrative tool to create a data source name (DSN) that encapsulates the ODBC connection
information, as shown in Figure 1. Creating a DSN makes it easier for business users with limited
experience of configuring data connections to import data from Hive tables that are defined in
HDInsight. If you set up both 32-bit and 64-bit DSNs using the same name, client applications will
automatically use the appropriate one.
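A DSN encapsulates connection settings that can otherwise be supplied in an explicit connection string. As an illustrative sketch only (the exact keyword names depend on the version of the Hive ODBC driver you have installed, so verify them against the driver's documentation), a DSN-less connection string for an HDInsight cluster takes a form similar to the following, where the host, user name, and password values are placeholders:

```text
Driver={Microsoft Hive ODBC Driver};Host=mycluster.azurehdinsight.net;
Port=443;HiveServerType=2;AuthMech=6;UID=user-name;PWD=password
```

Explicit connection strings like this are useful in tools and servers where a matching DSN may not be available, as noted in the considerations later in this section.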
Iterative data exploration: Built-in data connectivity in Excel is a suitable choice when the results of the data processing can be encapsulated in a Hive table, or a query with simple joins can be encapsulated in a Hive view, and the volume of data is sufficiently small to support interactive connectivity with tolerable response times.
Data warehouse on demand: When HDInsight is used to create a basic data warehouse containing Hive tables, business users can use the built-in data connectivity in Excel to consume data from those tables for analysis and reporting. However, for complex data models that require multiple related tables and queries with complex joins, PowerPivot may be a better choice.
ETL automation: Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. While Excel may be used to consume the data from the relational data source after it has been transferred from HDInsight, it is unlikely that an Excel workbook would be the direct target for the ETL process.
BI integration: Importing data from a Hive table and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. However, in self-service analysis scenarios, advanced users such as business analysts may require a more comprehensive data modeling solution such as that offered by PowerPivot, and can benefit from the ability to share queries, data models, and reports with Power BI for Office 365.
Install both 32-bit and 64-bit Hive ODBC Drivers and create 32-bit and 64-bit ODBC DSNs with
the same name. This enables 32-bit and 64-bit clients to use the same connection string when
connecting to Hive.
Importing data into a table in a worksheet makes it possible to filter the data, use data bars and
conditional formatting, and create charts. Tables in worksheets are automatically included in
the workbook data model. However, if you need to define relationships between multiple
tables, or create custom columns and aggregations, it may be more efficient to import data
directly into a PowerPivot data model.
Imported data can be refreshed from the original data source. When importing data from Hive
tables you will be able to refresh the tables only while the HDInsight cluster is running.
Power Query
The Power Query add-in enhances Excel by providing a comprehensive interface for querying a wide
range of data sources. It can also be used to perform data enhancements such as cleansing data by
replacing values, and combining data sets from different sources. Power Query includes a data source
provider for HDInsight, which enables users to browse the folders in Azure blob storage that are
associated with an HDInsight cluster. You can download the Power Query add-in from the Office
website.
By connecting directly to blob storage, users can import data from files such as those generated by
map/reduce jobs and Pig scripts, and import the underlying data files associated with Hive tables, even
if the cluster is not running or has been deleted. This enables organizations to consume the results of
HDInsight processing, while significantly reducing costs if no further HDInsight processing is required.
Keeping an HDInsight cluster running when it is not executing queries just so that you can access the
data incurs charges to your Azure account. If you are not using the cluster, you can close it down but still
be able to access the data at any time using Power Query in Excel, or any other tool that can access
Azure blob storage.
With the Power Query add-in installed, you can use the From Other Sources option on the Power Query
tab on the Excel ribbon to import data from HDInsight. You must specify the account name and key of
the Azure blob store, not the HDInsight cluster itself. After connecting to the Azure blob store you can
select a data file, convert its contents to a table by specifying the appropriate delimiter, and modify the
data types of the columns before importing it into a worksheet, as shown in Figure 1.
The imported data can be added to the workbook data model, or analyzed directly in the worksheet.
The following table describes specific considerations for using Power Query in the HDInsight use cases
and models described in this guide.
Iterative data exploration: Power Query is a good choice when HDInsight data processing techniques such as map/reduce jobs or Pig scripts generate files that contain the results to be analyzed or reported. The HDInsight cluster can be deleted after the processing is complete, leaving the results in Azure blob storage ready to be consumed by business users in Excel. With the addition of a Power BI for Office 365 subscription, queries that return data from files in Azure blob storage can be shared, making big data processing results discoverable by other Excel users in the enterprise through the Online Search feature.
Data warehouse on demand:
ETL automation: Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. It is unlikely that Power Query would be used to consume data files from the blob storage associated with the HDInsight cluster, though it may be used to consume data from the relational data store loaded by the ETL process.
BI integration: Importing data from a file in Azure blob storage and combining it with data from a BI data source (such as a relational data warehouse or corporate data model) is an effective way to accomplish report-level integration with an enterprise BI solution. Additionally, users can import the datasets retrieved by Power Query into a PowerPivot data model, and publish workbooks containing data models and Power View visualizations to Power BI for Office 365 in a self-service BI scenario.
Ensure that the big data processing jobs you use to generate data for analysis store their output
in appropriately named folders. This makes it easier for Power Query users to find output files
with generic names such as part-r-00000.
You can apply filters and sophisticated transformations to data in Power Query queries while
importing the output file from a big data processing job. However, you should generally try to
perform as much as possible of the required filtering and shaping within the big data processing
job itself in order to simplify the query that Excel users need to create.
Ensure that Power Query users are familiar with the schema of output files generated by big
data processing jobs. Output files generally do not include column headers.
When a big data processing job generates multiple output files you can use multiple Power
Query queries to combine the data.
PowerPivot
The growing awareness of the value of decisions based on proven data, combined with advances in data
analysis tools and techniques, has resulted in an increased demand for versatile analytical data models
that support ad-hoc analysis (the self-service approach).
PowerPivot is an Excel-based technology in Office 2013 Professional Plus and Office 365 ProPlus that
enables advanced users to create complex data models that include hierarchies for drill-up/drill-down
aggregations, Data Analysis Expressions (DAX) calculated measures, key performance indicators
(KPIs), and other features not available in basic data models. PowerPivot is also available as an add-in for
previous releases of Excel. PowerPivot uses xVelocity compression technology to support in-memory
data models that enable high-performance analysis, even with extremely large volumes of data.
You can create and edit PowerPivot data models by using the PowerPivot for Excel interface, which is
accessed from the PowerPivot tab of the Excel ribbon. Through this interface you can enhance tables
that have been added to the workbook data model by other users or processes. You can also import
multiple tables from one or more data sources into the data model and define relationships between
them. Figure 1 shows a data model in PowerPivot for Excel.
PowerPivot brings many of the capabilities of enterprise BI to Excel, enabling business analysts to create
personal data models for sophisticated self-service data analysis. Users can share PowerPivot workbooks
through SharePoint Server, where they can be viewed interactively in a browser, enabling teams of
analysts to collaborate on data analysis and reporting.
The following table describes specific considerations for using PowerPivot in the HDInsight use cases and
models described in this guide.
Iterative data exploration:
Data warehouse on demand:
ETL automation: Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. In this scenario, PowerPivot may be used to consume data from the relational data store loaded by the ETL process.
BI integration:
When importing data into tables in the PowerPivot data model, minimize the size of the
workbook document by using filters to remove rows and columns that are not required.
If the PowerPivot workbook includes data from Hive tables, and you plan to share it in
SharePoint Server or in a Power BI site, use an explicit ODBC connection string instead of a DSN.
This will enable the PowerPivot data model to be refreshed when stored on a server where the
DSN is not available.
Hide any numeric columns for which PowerPivot automatically generates aggregated measures.
This ensures that they do not appear as dimension attributes in the PivotTable Fields and
Power View Fields panes.
Power View
Power View is a data visualization technology that enables interactive, graphical exploration of data in a
data model. Power View is available as a component of SQL Server 2012 Reporting Services when
integrated with SharePoint Server, but is also available in Excel 2013 Professional Plus and Office 365
ProPlus. Using Power View you can create interactive visualizations that make it easy to explore
relationships and trends in the data. Figure 2 shows how Power View can be used to visualize the
weather data in the data model described in the topic PowerPivot.
Use Power View when you need to explore data using a range of data visualizations. Power
View is particularly effective when you want to explore relationships between data in multiple
tables in a PowerPivot data model, but can also be used to visualize data in a single worksheet.
Use Power Map when you want to show changes in geographically-related data values over
time. Your data must include at least one geographic field, and must also include a temporal
field if you want to visualize changes to data over time.
Use native PivotCharts and conditional formatting when you want to create data visualizations
in workbooks that will be opened in versions of Excel that do not support Power View or Power
Map.
Figure 4 - Using Power BI Q&A to query a data model using natural language
Power BI for Office 365 is a great choice when you want to empower business users to create and share
their own queries, data models, and reports. Users with the necessary skills can use HDInsight to process
data (for example, by using Pig or Hive scripts), and then import the data directly into Excel data models
using the Hive ODBC Driver or Power Query. The reports generated from these data models can then be
published in a Power BI site where other business users can view them.
Alternatively, you can publish the results of big data processing to the general business user population
through shared queries that you have created with Power Query. Business users can then engage in self-service data modeling and analysis simply by discovering and consuming the big data processing results you have shared, without requiring any knowledge of how the results were generated, or even where
the results are stored.
The following table describes specific considerations for using Power BI in the HDInsight use cases and
models described in this guide.
Iterative data exploration: In an iterative data exploration scenario, users can use Power Query or the Hive ODBC Driver to consume the results of each data processing iteration in Excel, and then use native Excel charting, Power View, or Power Map to visualize the data. Power BI makes it easier for multiple analysts to collaborate by sharing queries and reports in a Power BI site.
Data warehouse on demand:
ETL automation: Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. In most cases, the ETL process loads the data into a relational data store for analysis. However, it would be possible to use HDInsight to filter and shape data into tabular formats, and then use Power Query or the Hive ODBC Driver to import the data into a PowerPivot data model that can provide a source for reports and interactive analysis in a Power BI site.
BI integration: Power BI provides a platform for sharing queries, data models, and reports. By sharing queries that obtain HDInsight output files from Azure storage, organizations can use Power BI to make big data processing results discoverable for self-service BI.
Use Power Query to share queries that retrieve data from HDInsight output files. This makes big
data processing results discoverable by business users who may have difficulty creating their
own queries.
Foster a culture of data stewardship in which users take responsibility for the queries they
define and share. Encourage users to document their queries and to monitor their usage in their
My Power BI site.
Use the Synonyms feature in PowerPivot to specify alternative terms for tables and columns in
your data model. This will improve the ability of the Power BI Q&A feature to interpret natural
language queries.
(Table: a summary of the Excel data analysis technologies discussed in this section, showing the capabilities of PowerPivot, Power View, Power Query, Power Map, and Power BI sites, such as support for multiple-table data models and for defining synonyms for natural-language queries, and the Excel editions in which each is available. Items marked * are available only in specific editions, as add-ins, or with a Power BI for Office 365 subscription.)
Figure 1 - Using Excel and Office 365 technologies to analyze big data processing results
The options discussed here assume that you want to use Excel to consume and visualize data directly
from HDInsight, or from the Azure blob storage it uses. However, in many scenarios the results of
HDInsight processing are transferred to a database (for example, a data warehouse implemented in SQL
Server) or an analytical data model (for example, a SQL Server Analysis Services cube). You can use
native Excel data connectivity, PowerPivot, and Power Query to consume data from practically any data
source, and then use native visualization tools, Power View, Power Map, and Power BI sites as described
in this section of the guide.
While many businesses rely on reports created by BI specialists, an increasing number of organizations
are empowering business users to create their own self-service reports. To support this scenario,
business users can use Report Builder (shown in Figure 2). This is a simplified report authoring tool that
is installed on demand from a report server. To further simplify self-service reporting you can have BI
professionals create and publish shared data sources and datasets that can be easily referenced in
Report Builder, reducing the need for business users to configure connections or write queries.
Iterative data exploration: For one-time analysis and data exploration, Excel is generally a more appropriate tool than Reporting Services because it requires less in the way of infrastructure configuration and provides a more dynamic user interface for interactive data exploration.
Data warehouse on demand:
ETL automation: Most ETL scenarios are designed to transform big data into a suitable structure and volume for storage in a relational data source for further analysis and querying. In this scenario, Reporting Services may be used to consume data from the relational data store loaded by the ETL process.
BI integration: You can use Reporting Services to integrate data from HDInsight with enterprise BI data at the report level by creating reports that display data from multiple data sources. For example, you could use an ODBC data source to connect to HDInsight and query Hive tables, and an OLE DB data source to connect to a SQL Server data warehouse. However, in an enterprise BI scenario that combines corporate data and big data in formal reports, better integration can generally be achieved by integrating at the data warehouse or corporate data model level, and by using a single data source in Reporting Services to connect to the integrated data.
When creating a data source for Hive tables, use an explicit ODBC connection string in
preference to a DSN. This ensures that the data source does not depend on a DSN on the report
server.
Consider increasing the default timeout value for datasets that query Hive tables. Hive queries
over ODBC can take a considerable amount of time.
Consider using report snapshots, or cached datasets and reports, to improve performance by
reducing the number of times that queries are submitted to HDInsight.
The guidance provided here assumes that you want to use SQL Server Reporting Services to consume
and visualize data directly from HDInsight. However, in many scenarios the results of HDInsight
processing are transferred to a database (for example, a data warehouse implemented in SQL Server) or
an analytical data model (for example, a SQL Server Analysis Services cube). You can use Reporting
Services to consume and visualize data from practically any data source.
must either import the data into SQL Server or configure a linked server to pass through the query and
results.
Tabular mode
When installed in Tabular mode, SSAS can be used to host tabular data models that are based on the
xVelocity in-memory analytics engine. These models use the same technology and design as PowerPivot
data models in Excel but can be scaled to handle much larger volumes of data, and they can be secured
using enterprise-level role-based security. You create tabular data models in the Visual Studio-based SQL
Server Data Tools development environment, and you can choose to create the data model from scratch
or import an existing PowerPivot for Excel workbook.
Because Tabular models support ODBC data sources, you can easily include data from Hive tables in the
data model. You can use HiveQL queries to pre-process the data as it is imported in order to create the
schema that best suits your analytical and reporting goals. You can then use the modeling capabilities of
SQL Server Analysis Services to create relationships, hierarchies, and other custom model elements to
support the analysis that users need to perform.
Figure 1 shows a Tabular model for the weather data used in this section of the guide.
Iterative data exploration: If the results of HDInsight data processing can be encapsulated in Hive tables, and multiple users must perform consistent analysis and reporting of the results, an SSAS tabular model is an easy way to create a corporate data model for analysis. However, if the analysis will only be performed by a small group of specialist users, you can probably achieve this by using PowerPivot in Excel.
Data warehouse on demand:
ETL automation: If the target of the HDInsight-based ETL process is a relational database, you might build a tabular data model based on the tables in the database in order to enable enterprise-level analysis and reporting.
BI integration: If the enterprise BI solution already uses tabular SSAS models, you can add Hive tables to these models and create any necessary relationships to support integrated analysis of corporate BI data and big data results from HDInsight. However, if the corporate BI data warehouse is based on a dimensional model that includes surrogate keys and slowly changing dimensions, it can be difficult to define relationships between tables in the two data sources. In this case, integration at the data warehouse level may be a better solution.
Multidimensional mode
As an alternative to Tabular mode, SSAS can be installed in Multidimensional mode. Multidimensional
mode provides support for a more established online analytical processing (OLAP) approach to cube
creation, and is the only mode supported by releases of SSAS prior to SQL Server 2012. Additionally, if
you plan to use SSAS data mining functionality, you must install SSAS in Multidimensional mode.
Multidimensional mode includes some features that are not supported or are difficult to implement in
Tabular data models, such as the ability to aggregate semi-additive measures across accounting
dimensions and use international translations in the cube definition. However, although
Multidimensional data models can be built on OLE DB data sources, some restrictions in the way cube
elements are implemented mean that you cannot use an ODBC data source. Therefore, there is no way
to directly connect dimensions or measure groups in the data model to Hive tables in an HDInsight
cluster.
To use HDInsight data as a source for a Multidimensional data model in SSAS you must either transfer
the data from HDInsight to a relational database system such as SQL Server, or define a linked server in
another SQL Server instance that can act as a proxy and pass queries through to Hive tables in HDInsight.
The use of linked servers to access Hive tables from SQL Server, and techniques for transferring data
from HDInsight to SQL Server, are described in the topic SQL Server database.
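As a brief sketch of the transfer option, a linked server of the kind described in the topic Linked servers can be used to materialize the results of a Hive query in a local SQL Server table, which can then act as an OLE DB source for the Multidimensional data model. The linked server name HDINSIGHT and the table names here are illustrative:

```sql
-- Copy the rows returned by a pass-through Hive query into a
-- local SQL Server table. The linked server name (HDINSIGHT) and
-- the table names are illustrative placeholders.
SELECT *
INTO dbo.LocalObservations
FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
```

Because the data is copied rather than queried in place, the HDInsight cluster only needs to be available while the transfer runs, not every time the cube is processed.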
Figure 2 shows a Multidimensional data model in Visual Studio. This data model is based on views in a
SQL Server database, which are in turn based on queries against a linked server that references Hive
tables in HDInsight.
Iterative data exploration: For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel.
Data warehouse on demand:
ETL automation: If the target of the HDInsight-based ETL process is a relational database that can be accessed through an OLE DB connection, you might build a multidimensional data model based on the tables in the database to enable enterprise-level analysis and reporting.
BI integration: If the enterprise BI solution already uses multidimensional SSAS models that you want to extend to include data from HDInsight, you should integrate the data at the data warehouse level and base the data model on the relational data warehouse.
You cannot use ODBC data sources in a Multidimensional SSAS database. If you must include
Hive tables in a Multidimensional model, consider defining a linked server in a SQL Server
instance and adding SQL Server views that query the Hive tables to the SSAS data source.
If you are including data from Hive tables in a Tabular data model, use an explicit ODBC
connection string instead of a DSN. This will enable the data model to be refreshed when stored
on a server where the DSN is not available.
Consider the life cycle of the HDInsight cluster when scheduling data model processing. When the model is processed, it refreshes partitions from the original data sources, and so you should ensure that the HDInsight cluster and its Hive tables will be available when partitions based on them are processed.
Linked servers
Linked servers are server-level connection definitions in a SQL Server instance that enable queries in the
local SQL Server engine to reference tables in remote servers. You can use the ODBC driver for Hive to
create a linked server in a SQL Server instance that references an HDInsight cluster, enabling you to
execute Transact-SQL queries that reference Hive tables.
To create a linked server you can either use the graphical tools in SQL Server Management Studio or the
sp_addlinkedserver system stored procedure, as shown in the following code.
Transact-SQL
EXEC master.dbo.sp_addlinkedserver
@server = N'HDINSIGHT', @srvproduct=N'Hive',
@provider=N'MSDASQL', @datasrc=N'HiveDSN',
@provstr=N'Provider=MSDASQL.1;Persist Security Info=True;User ID=UserName;
Password=P@ssw0rd;'
After you have defined the linked server you can use the Transact-SQL OpenQuery function to execute
pass-through queries against the Hive tables in the HDInsight data source, as shown in the following
code.
Transact-SQL
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
Note that using four-part distributed query syntax, instead of a pass-through OpenQuery statement, is not always possible because the syntax of HiveQL differs from Transact-SQL in several ways.
By using a linked server you can create views in a SQL Server database that act as pass-through queries
against Hive tables, as shown in Figure 1. These views can then be queried by analytical tools that
connect to the SQL Server database.
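For example, the following Transact-SQL code creates such a view. This is a sketch that assumes the HDINSIGHT linked server and the Observations Hive table used in the earlier examples; the view name is illustrative.

Transact-SQL
```sql
-- Create a view that wraps a pass-through query against a Hive table.
-- Analytical tools can then query dbo.HiveObservations like any local object.
CREATE VIEW dbo.HiveObservations
AS
SELECT * FROM OpenQuery(HDINSIGHT, 'SELECT * FROM Observations');
```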
Considerations
Iterative data exploration: For one-time analysis, or analysis by a small group of users, the requirement to use a relational database such as SQL Server as a proxy or interim host for the HDInsight results means that this approach involves more effort than using a tabular data model or just analyzing the data in Excel.
Data warehouse on demand: Depending on the volume of data in the data warehouse, and the frequency of queries against the Hive tables, using a linked server with a Hive-based data warehouse might make it easier to support a wide range of client applications. A linked server is a suitable solution for populating data models on a regular basis when they are processed during out-of-hours periods, or for periodically refreshing cached datasets for Reporting Services. However, the performance of pass-through queries over an ODBC connection may not be sufficient to meet your users' expectations for interactive querying and reporting directly in client applications such as Excel.
ETL automation: Generally, the target of the ETL processes is a relational database, making a linked server that references Hive tables unnecessary.
BI integration: If the ratio of Hive tables to data warehouse tables is small, or they are relatively rarely queried, a linked server might be a suitable way to integrate data at the data warehouse level. However, if there are many Hive tables or if the data in the Hive tables must be tightly integrated into a dimensional data warehouse schema, it may be more effective to transfer the data from HDInsight to local tables in the data warehouse.
PolyBase
PolyBase is a data integration technology in the Microsoft Analytics Platform System (APS) that enables
data in an HDInsight cluster to be queried as native tables in a relational data warehouse that is
implemented in SQL Server Parallel Data Warehouse (PDW). SQL Server PDW is an edition of SQL Server
that is only available pre-installed in an APS appliance, and it uses a massively parallel processing (MPP)
architecture to implement highly scalable data warehouse solutions.
PolyBase enables parallel data movement between SQL Server and HDInsight, and supports standard
Transact-SQL semantics such as GROUP BY and JOIN clauses that reference large volumes of data in
HDInsight. This enables APS to provide an enterprise-scale data warehouse solution that combines
relational data in data warehouse tables with data in an HDInsight cluster.
The following table describes specific considerations for using PolyBase in the HDInsight use cases and
models described in this guide.
Use case / Considerations
Iterative data exploration: For one-time analysis, or analysis by a small group of users, the requirement to use an APS appliance may be cost-prohibitive, unless such an appliance is already present in the organization.
Data warehouse on demand: If the volume of data and the number of query requests are extremely high, using an APS appliance as a data warehouse platform that includes HDInsight data through PolyBase might be the most cost-effective way to achieve the levels of performance and scalability your data warehousing solution requires.
ETL automation: Generally, the target of the ETL process is a relational database, making PolyBase integration with HDInsight unnecessary.
BI integration: If your enterprise BI solution already uses an APS appliance, or the combined scalability and performance requirements for enterprise BI and big data analysis are extremely high, the combination of SQL Server PDW with PolyBase in a single APS appliance might be a suitable solution. However, note that PolyBase does not inherently integrate HDInsight data into a dimensional data warehouse schema. If you need to include big data in dimension members that use surrogate keys, or you need to support slowly changing dimensions, you may need to transfer the data from HDInsight into the data warehouse tables instead.
Sqoop
Sqoop is a Hadoop technology included in HDInsight. It is designed to make it easy to transfer data
between Hadoop clusters and relational databases. You can use Sqoop to export data from HDInsight
data files to SQL Server database tables by specifying the location of the data files to be exported, and a
JDBC connection string for the target SQL Server instance. For example, you could run the following
command on an HDInsight server to copy the data in the /hive/warehouse/observations path to the
observations table in an Azure SQL Database named mydb located in a server named jkty65.
Sqoop command
sqoop export --connect "jdbc:sqlserver://jkty65.database.windows.net:1433;
database=mydb;user=username@jkty65;password=Pa$$w0rd;
logintimeout=30;"
--table observations
--export-dir /hive/warehouse/observations
Sqoop is generally a good solution for transferring data from HDInsight to Azure SQL Database servers,
or to instances of SQL Server that are hosted in virtual machines running in Azure, but it can present
connectivity challenges when used with on-premises database servers. A key requirement is that
network connectivity can be successfully established between the HDInsight cluster where the Sqoop
command is executed and the target SQL Server instance. When used with HDInsight this means that
the SQL Server instance must be accessible from the Azure service where the cluster is running, which
may not be permitted by security policies in organizations where the target SQL Server instance is
hosted in an on-premises data center.
In many cases you can enable secure connectivity between virtual machines in Azure and on-premises
servers by creating a virtual network in Azure. However, at the time of writing it was not possible to add
the virtual machines in an HDInsight cluster to an Azure virtual network, so this approach cannot be
used to enable Sqoop to communicate with an on-premises server hosting SQL Server without traversing
the corporate firewall.
You can use Sqoop interactively from the Hadoop command line, or you can use one of the following
techniques to initiate a Sqoop job:
Implement a custom application that uses the .NET SDK for HDInsight to submit a Sqoop job.
The following table describes specific considerations for using Sqoop in the HDInsight use cases and
models described in this guide.
Use case / Considerations
Iterative data exploration: For one-time analysis, or analysis by a small group of users, using Sqoop is a simple way to transfer the results of data processing to SQL Database or a SQL Server instance for reporting or analysis.
Data warehouse on demand: When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database using Sqoop unnecessary.
ETL automation: Generally, the target of the ETL processes is a relational database, and Sqoop may be the mechanism that is used to load the transformed data into the target database.
BI integration: When you want to integrate the results of HDInsight processing with an enterprise BI solution at the data warehouse level you can use Sqoop to transfer data from HDInsight into the data warehouse tables, or (more commonly) into staging tables from where it will be loaded into the data warehouse. However, if network connectivity between HDInsight and the target database is not possible you may need to consider an alternative technique to transfer the data, such as SQL Server Integration Services.
Considerations
Iterative data exploration: For one-time analysis, or analysis by a small group of users, SSIS can provide a simple way to transfer the results of data processing to SQL Server for reporting or analysis.
Data warehouse on demand: When using HDInsight as a data warehouse for big data analysis, the data is generally accessed directly in Hive tables, making transfer to a database through SSIS unnecessary.
ETL automation: Generally, the target of the ETL processes is a relational database. In some cases the ETL process in HDInsight might transform the data into a suitable structure, and then SSIS can be used to complete the process by transferring the transformed data to SQL Server.
BI integration: In an enterprise BI solution you can use an SSIS package to transfer the data from HDInsight into a staging table, and perhaps use a different SSIS package to load the staged data in synchronization with data from other corporate sources.
When using Sqoop to transfer data between HDInsight and SQL Server, consider the effect of
firewalls between the HDInsight cluster in Azure and the SQL Server database server.
When using SSIS to transfer data from Hive tables, use an explicit ODBC connection string
instead of a DSN. This enables the SSIS package to run on a server where the DSN is not
available.
When using SSIS to transfer data from Hive tables, specify the DefaultStringColumnLength
parameter in the ODBC connection string. The default value for this setting is 32767, which
results in SSIS treating all strings as DT_TEXT or DT_NTEXT data type values. For optimal
performance, limit strings to 4000 characters or less so that SSIS automatically treats them as
DT_STR or DT_WSTR data type values.
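For example, a DSN-less connection string that applies this setting might look like the following. This is a sketch; the driver name, cluster address, schema, and credentials are placeholders that depend on your own environment and the version of the Hive ODBC driver you have installed.

ODBC connection string
```
Driver={Microsoft Hive ODBC Driver};Host=cluster-name.azurehdinsight.net;Port=443;Schema=default;DefaultStringColumnLength=4000;UID=user-name;PWD=password
```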
When using SSIS to work with Hive ODBC sources, set the ValidateExternalMetadata property
of the ODBC data source component to False. This prevents Visual Studio from validating the
metadata until you open the data source component, reducing the frequency with which the
Visual Studio environment becomes unresponsive while waiting for data from the HDInsight
cluster.
Considerations
Iterative data exploration: For one-time analysis or iterative exploration of data, PowerShell provides a flexible, easy to use scripting framework that you can use to upload data and scripts, initiate jobs, and consume the results.
Data warehouse on demand: Data warehouses are usually queried by reporting clients such as Excel or SQL Server Reporting Services. However, PowerShell can be useful as a tool to quickly test queries.
ETL automation: The target of the ETL processes is typically a relational database. While you may use PowerShell to upload source data to Azure blob storage and to initiate the HDInsight jobs that encapsulate the ETL process, it's unlikely that PowerShell would be an appropriate tool to consume the results.
BI integration: In an enterprise BI solution, users generally use established tools such as Excel or SQL Server Reporting Services to visualize data. However, in a similar way to the data warehouse scenario, you may use PowerShell to test queries against Hive tables.
You can run PowerShell scripts interactively in a Windows command line window or in a
PowerShell-specific command line console. Additionally, you can edit and run PowerShell scripts
in the Windows PowerShell Interactive Scripting Environment (ISE), which provides IntelliSense
and other user interface enhancements that make it easier to write PowerShell code.
You can schedule the execution of PowerShell scripts using Windows Scheduler, SQL Server
Agent, or other tools as described in Building end-to-end solutions using HDInsight.
Before you use PowerShell to work with HDInsight you must configure the PowerShell
environment to connect to your Azure subscription. To do this you must first download and
install the Azure PowerShell module, which is available through the Web Platform Installer. For
more details see How to install and configure Azure PowerShell.
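For example, after installing the module you can connect interactively and select the subscription that contains your cluster. This is a sketch using the Azure PowerShell cmdlets of this era; the subscription name is a placeholder.

Windows PowerShell
```powershell
# Sign in to Azure and download the subscription details for the account.
Add-AzureAccount

# List the available subscriptions, then select the one that contains the cluster.
Get-AzureSubscription
Select-AzureSubscription -SubscriptionName "subscription-name"
```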
Considerations
Iterative data exploration: For one-time analysis or iterative exploration of data, writing a custom client application may be an inefficient way to consume the data unless the team analyzing the data has existing .NET development skills and plans to implement a custom client for a future big data processing solution.
Data warehouse on demand: In some cases a big data solution consists of a data warehouse based on HDInsight and a custom application that consumes data from the data warehouse. For example, the goal of a big data project might be to incorporate data from an HDInsight-based data warehouse into an ASP.NET web application. In this case, using the .NET Framework libraries for HDInsight is an appropriate choice.
ETL automation: The target of the ETL processes is typically a relational database. You might use a custom .NET application to upload source data to Azure blob storage and initiate the HDInsight jobs that encapsulate the ETL process.
BI integration: In an enterprise BI solution, users generally use established tools such as Excel or SQL Server Reporting Services to visualize data. However, you may use the .NET libraries for HDInsight to integrate big data into a custom BI application or business process.
More information
For information on using PowerShell with HDInsight see HDInsight PowerShell Cmdlets Reference
Documentation.
For information on using the HDInsight SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.
Figure 1 shows how the results of this query are displayed in the Windows PowerShell ISE.
The Set-AzureStorageBlobContent cmdlet is used to copy local files to an Azure storage container. The
Set-AzureStorageBlobContent and Get-AzureStorageBlobContent cmdlets are often used together
when working with HDInsight to upload source data and scripts to Azure before initiating a data
processing job, and then to download the output of the job.
As an example, the following PowerShell code uses the Set-AzureStorageBlobContent cmdlet to upload a Pig Latin script named SummarizeWeather.pig, which is then invoked using the New-AzureHDInsightPigJobDefinition and Start-AzureHDInsightJob cmdlets. The output file generated by the job is downloaded using the Get-AzureStorageBlobContent cmdlet, and its contents are displayed using the cat command.
Windows PowerShell
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
# Find the folder where this script is saved.
$localfolder = Split-Path -parent $MyInvocation.MyCommand.Definition
$destfolder = "weather/scripts"
$scriptFile = "SummarizeWeather.pig"
$outputFolder = "weather/output"
$outputFile = "part-r-00000"
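The remainder of the script can be sketched as follows. This is an illustrative outline based on the cmdlets named above; the storage key retrieval assumes the classic Azure PowerShell module, and parameter values reuse the variables declared earlier.

Windows PowerShell
```powershell
# Create a storage context for the cluster's storage account.
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
                                   -StorageAccountKey $storageKey

# Upload the Pig Latin script to the container.
Set-AzureStorageBlobContent -File "$localfolder\$scriptFile" -Container $containerName `
                            -Blob "$destfolder/$scriptFile" -Context $context -Force

# Define and start the Pig job, then wait for it to complete.
$jobDef = New-AzureHDInsightPigJobDefinition -File "wasb:///$destfolder/$scriptFile"
$pigJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $pigJob -WaitTimeoutInSeconds 3600

# Download the output file and display its contents.
Get-AzureStorageBlobContent -Blob "$outputFolder/$outputFile" -Container $containerName `
                            -Destination "$localfolder\$outputFile" -Context $context -Force
cat "$localfolder\$outputFile"
```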
The SummarizeWeather.pig script in this example generates the average wind speed and the highest temperature for each date in the source data, and stores the results in the /weather/output folder, as shown in the following code example.
Pig Latin (SummarizeWeather.pig)
Weather = LOAD '/weather/data' USING PigStorage(',') AS (obs_date:chararray,
obs_time:chararray, weekday:chararray, windspeed:float, temp:float);
GroupedWeather = GROUP Weather BY obs_date;
AggWeather = FOREACH GroupedWeather GENERATE group, AVG(Weather.windspeed) AS
avg_windspeed, MAX(Weather.temp) AS high_temp;
DailyWeather = FOREACH AggWeather GENERATE FLATTEN(group) AS obs_date, avg_windspeed,
high_temp;
SortedWeather = ORDER DailyWeather BY obs_date ASC;
STORE SortedWeather INTO '/weather/output';
Figure 1 shows how the results of this script are displayed in the Windows PowerShell ISE.
Note that the script must include the name of the output file to be downloaded. In most cases, Pig jobs
generate files in the format part-r-0000x. Some map/reduce operations may create files with the format
part-m-0000x, and Hive jobs that insert data into new tables generate numeric filenames such as
000000_0. In most cases you will need to determine the specific filename(s) generated by your data
processing job before writing PowerShell code to download the output.
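One way to discover the output filenames is to list the blobs in the output folder. The following sketch assumes the variables declared in the earlier script and the classic Azure PowerShell cmdlets for retrieving the storage key.

Windows PowerShell
```powershell
# Build a storage context, then list the blobs generated by the job
# to discover the output filename(s).
$storageKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$context = New-AzureStorageContext -StorageAccountName $storageAccountName `
                                   -StorageAccountKey $storageKey
Get-AzureStorageBlob -Container $containerName -Context $context |
    Where-Object { $_.Name -like "weather/output/*" } |
    Select-Object Name, Length
```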
The contents of downloaded files can be displayed in the console using the cat command, as in the
example above, or you could open a file containing delimited text results in Excel.
using System;
using System.Threading.Tasks;
using System.Data;
using System.Data.Odbc;
using System.Data.Common;
namespace HiveClient
{
class Program
{
static void Main(string[] args)
{
GetData();
Console.WriteLine("----------------------------------------------");
Console.WriteLine("Press a key to end");
Console.Read();
}
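The GetData method called above is not shown in this extract. A minimal sketch, assuming a DSN named HiveDSN for the Hive ODBC driver and a Hive table named Observations (both placeholders for your own environment), might look like this:

C#
```csharp
static void GetData()
{
    // Connect to the Hive ODBC data source. The DSN name, user name,
    // and password are placeholders for your own environment.
    var connectionString = "DSN=HiveDSN;UID=user-name;PWD=password";
    using (var connection = new OdbcConnection(connectionString))
    {
        connection.Open();
        var command = new OdbcCommand("SELECT * FROM Observations", connection);
        using (var reader = command.ExecuteReader())
        {
            // Write each row of the result set to the console.
            while (reader.Read())
            {
                for (int i = 0; i < reader.FieldCount; i++)
                {
                    Console.Write(reader[i] + "\t");
                }
                Console.WriteLine();
            }
        }
    }
}
```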
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.Hive;
namespace LinqToHiveClient
{
class Program
{
static void Main(string[] args)
{
var db = new HiveDatabase(
webHCatUri: new Uri("https://mycluster.azurehdinsight.net"),
username: "user-name", password: "password",
azureStorageAccount: "storage-account-name.blob.core.windows.net",
azureStorageKey: "storage-account-key");
var q = from x in
(from o in db.Weather
select new { o.obs_date, temp = o.temperature })
group x by x.obs_date into g
select new { obs_date = g.Key, temp = g.Average(t => t.temp)};
q.ExecuteQuery().Wait();
var results = q.ToList();
foreach (var r in results)
{
Console.WriteLine(r.obs_date.ToShortDateString() + ": "
+ r.temp.ToString("#00.00"));
}
Console.WriteLine("---------------------------------");
Console.WriteLine("Press a key to end");
Console.Read();
}
}
public class HiveDatabase : HiveConnection
{
public HiveDatabase(Uri webHCatUri, string username, string password,
string azureStorageAccount, string azureStorageKey)
: base(webHCatUri, username, password,
azureStorageAccount, azureStorageKey) { }
public HiveTable<WeatherRow> Weather
{
get
{
return this.GetTable<WeatherRow>("Weather");
}
}
}
public class WeatherRow : HiveRow
{
public DateTime obs_date { get; set; }
public string obs_time { get; set; }
public string day { get; set; }
public float wind_speed { get; set; }
public float temperature { get; set; }
}
}
Notice that the code includes a class that inherits from HiveConnection, which provides an abstraction
for the Hive data source. This class contains a collection of tables that can be queried (in this case, a
single table named Weather). The table contains a collection of objects that represent the rows of data
in the table, each of which is implemented as a class that inherits from HiveRow. In this case, each row
from the Weather table contains the following fields:
obs_date
obs_time
day
wind_speed
temperature
The query in this example groups the data by obs_date and returns the average temperature value for
each date. The output from this example code is shown in Figure 1.
In a production application you should avoid embedding credentials in code, and instead protect the credentials, as described in Securing credentials in scripts and applications in the Security section of this guide.
C#
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;
namespace BlobClient
{
class Program
{
static void Main(string[] args)
{
GetResult();
Console.WriteLine("--------------------------");
Console.WriteLine("Press a key to end");
Console.Read();
}
static async void GetResult()
{
var hdInsightUser = "user-name";
var storageName = "storage-account-name";
var storageKey = "storage-account-key";
var containerName = "container-name";
var outputFile = "/weather/output/part-r-00000";
// Get the contents of the output file.
var hdfsClient = new WebHDFSClient(hdInsightUser,
new BlobStorageAdapter(storageName, storageKey, containerName, false));
await hdfsClient.OpenFile(outputFile)
.ContinueWith(r => r.Result.Content.ReadAsStringAsync()
.ContinueWith(c => Console.WriteLine(c.Result.ToString())));
}
}
}
Figure 1 - Output retrieved using the OpenFile method of the WebHDFSClient class
For more information about using the .NET SDK see HDInsight SDK Reference Documentation and the
incubator projects on the CodePlex website.
Using the Windows Azure Storage Library
In some cases you may want to download the output files generated by HDInsight jobs so that they can
be opened in client applications such as Excel. You can use the CloudBlockBlob class in the Windows
Azure Storage package to download the contents of the blob to a file.
The following example shows how you can use the Windows Azure Storage package in an application to
download the contents of a blob to a file.
C#
using System;
using System.Text;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;
using System.IO;
namespace BlobDownloader
{
class Program
{
const string AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;"
+ "AccountName=storage-account-name;AccountKey=storage-account-key";
static void Main(string[] args)
{
CloudStorageAccount storageAccount = CloudStorageAccount.Parse
(AZURE_STORAGE_CONNECTION_STRING);
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
CloudBlobContainer container = blobClient.GetContainerReference("container-name");
CloudBlockBlob blob = container.GetBlockBlobReference("weather/output/part-r-00000");
var fileStream = File.OpenWrite(@".\results.txt");
using (fileStream)
{
blob.DownloadToStream(fileStream);
}
Console.WriteLine("Results downloaded to " + fileStream.Name);
Console.WriteLine("Press a key to end");
Console.Read();
}
}
}
For more information about using the classes in the Windows Azure Storage package see How to use
Blob Storage from .NET.
The automation and orchestration of these tasks must be planned carefully to create an overall solution
that performs efficiently and can be easily integrated into business practices. The more complex your big
data processing requirements, the more important it is to plan the coordination of all the moving
parts in the solution to achieve the required results in as efficient and error-free a way as possible.
This section of the guide focuses on building end-to-end solutions that minimize the need for operator
or administrator intervention, maximize the security of the process and the data, and provide sufficient
information to be able to monitor solutions. This section is divided into two distinct topic areas:
Designing end-to-end solutions. This includes planning the solution to meet the requirements of
dependencies, constraints, and consistency; protecting the application, the data, and the
cluster; and implementing scheduling for the overall process and the individual tasks.
Monitoring and logging. This includes monitoring the cluster itself and the individual tasks,
auditing operations, and accessing log files.
More information
For more information about HDInsight see the Microsoft Azure HDInsight web page.
See Collecting and loading data into HDInsight for more details and considerations for provisioning a
cluster and storage, and uploading data to a big data solution such as HDInsight.
See Processing, querying, and transforming data using HDInsight for more details and considerations for
processing big data with HDInsight.
See Consuming and visualizing data from HDInsight for more details and considerations for consuming
the output of big data processing jobs.
See Appendix A - Tools and technologies reference for information about the many tools, frameworks,
utilities, and technologies you can adopt to help automate an end-to-end solution.
Data ingestion: source data is loaded to Azure storage, ready for processing. For details of how
you can automate individual tasks for data ingestion see Custom data upload clients in the
section Collecting and loading data into HDInsight.
Cluster provisioning: When the data is ready to be processed, a cluster is provisioned. For
details of how you can automate cluster provisioning see Custom cluster management clients in
the section Collecting and loading data into HDInsight.
Job submission and management: One or more jobs are executed on the cluster to process the
data and generate the required output. For details of how you can automate individual tasks for
submitting and managing jobs see Building custom clients in the section Processing, querying,
and transforming data using HDInsight.
Data consumption: The job output is retrieved from HDInsight, either directly by a client
application or through data transfer to a permanent data store. For details of how you can
automate data consumption tasks see Building custom clients in the section Consuming and
visualizing data from HDInsight.
Cluster deletion: The cluster is deleted when it is no longer required to process data or service
Hive queries. For details of how you can delete a cluster see Custom cluster management clients
in the section Collecting and loading data into HDInsight.
Data visualization: The retrieved results are visualized and analyzed, or used in a business
application. For details of tools for visualizing and analyzing the results see the section
Consuming and visualizing data from HDInsight.
However, before beginning to design an automated solution, it is sensible to start by identifying the
dependencies and constraints in your specific data processing scenario, and considering the
requirements for each stage in the overall solution. For example, you must consider how to coordinate
the automation of these operations as a whole, as well as planning the scheduling of each discrete task.
This section includes the following topics related to designing automated end-to-end solutions:
Security
Considerations
Consider the following points when designing and implementing end-to-end solutions around HDInsight:
Analyze the requirements for the solution before you start to implement automation. Consider
factors such as how the data will be collected, the rate at which it arrives, the timeliness of the
results, the need for quick access to aggregated results, and the consequent impact of the
speed of processing each batch. All of these factors will influence the processes and
technologies you choose, the batch size for each process, and the overall scheduling for the
solution.
Automating a solution can help to minimize errors for tasks that are repeated regularly, and by
setting permissions on the client-side applications that initiate jobs and access the data you can
also limit access so that only your authorized users can execute them. Automation is likely to be
necessary for all types of solutions except those where you are just experimenting with data
and processes.
The individual tasks in your solutions will have specific dependencies and constraints that you
must accommodate to achieve the best overall data processing workflow. Typically these
dependencies are time based and affect how you orchestrate and schedule the tasks and
processes. Not only must they execute in the correct order, but you may also need to ensure
that specific tasks will be completed before the next one begins. See Workflow dependencies
and constraints for more information.
Consider if you need to automate the creation of storage accounts to hold the cluster data, and
decide when this should occur. HDInsight can automatically create one or more linked storage
accounts for the data as part of the cluster provisioning process. Alternatively, you can
automate the creation of linked storage accounts before you create a cluster, and non-linked
storage accounts before or after you create a cluster. For example, you might automate
creating a new storage account, loading the data, creating a cluster that uses the new storage
account, and then executing a job. For more information about linked and non-linked storage
accounts see Cluster and storage initialization in the section Collecting and loading data into
HDInsight.
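For example, you might create a storage account and container from PowerShell before provisioning the cluster. This is a sketch using the classic Azure PowerShell cmdlets; the account name, container name, and location are placeholders.

Windows PowerShell
```powershell
# Create a new storage account in the region you will use for the cluster.
New-AzureStorageAccount -StorageAccountName "storage-account-name" -Location "North Europe"

# Create a container in the new account to hold the cluster data.
$storageKey = (Get-AzureStorageKey -StorageAccountName "storage-account-name").Primary
$context = New-AzureStorageContext -StorageAccountName "storage-account-name" `
                                   -StorageAccountKey $storageKey
New-AzureStorageContainer -Name "container-name" -Context $context
```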
Consider the end-to-end security of your solution. You must protect the data from unauthorized
access and tampering when it is in storage and on the wire, and secure the cluster as a whole to
prevent unauthorized access. See Security for more details.
The individual tasks in an end-to-end solution typically have dependencies and constraints that you must take into account when planning the solution. Typical dependencies and constraints include:
Minimum latency tolerance in consuming systems. How up-to-date does the data need to be in reports, data models, and applications that consume the data processing results?
Volatility of source data. How frequently does the source data get updated or added to?
Data source dependencies. Are there data processing tasks for which data from one source
cannot be processed until data from another source is available?
Duration of processing tasks. How long does it typically take to complete each task in the
workflow?
Resource contention for existing workloads. To what degree can data processing operations
degrade the performance and scalability of systems that are in use for ongoing business
processes?
Cost. What is the budget for the employee time and infrastructure resources used to process
the data?
An example scenario
As an example, consider a scenario in which business analysts want to use an Excel report in an Office
365 Power BI site to visualize web server activity for an online retail site. The data in Excel is in a
PivotTable, which is based on a connection to Azure SQL Database. The web server log data must be
processed using a Pig job in HDInsight, and then loaded into SQL Database using Sqoop. The business
analysts want to be able to view daily page activity for each day the site is operational, up to and
including the previous day.
To plan a workflow for this requirement, you might consider the following questions:
How up-to-date does the data need to be in reports, data models, and applications that consume the data processing results?
The requirement is that the Excel workbook includes all data up to and including the previous day's web server logs, so a solution is required that refreshes the workbook as soon as possible after the last log entry of the day has been processed. In a 24x7 system this means that the data must be processed daily, just after midnight.
How frequently does the source data get updated or added to?
This depends on how active the website is. Many large online retailers handle thousands of requests per second, so the log files may grow extremely quickly.
Are there data processing tasks for which data from one source cannot be processed until data from another source is available?
If analysis in Excel is limited to just the website activity, there are no dependencies between data sources. However, if the web server log data must be combined with sales data from another source, that data must be available before processing can be completed.
How long does it typically take to complete each task in the workflow?
You will need to test samples of data to determine this. Based on the high volatility of the source data, and the requirement to include log entries right up to midnight, you might find that it takes a significantly long time to upload a single daily log file and process it with HDInsight before the Excel workbook can be refreshed with the latest data. You might therefore decide that a better approach is to use hourly log files and perform multiple uploads during the day, or capture the log data in real time using a tool such as Flume and write it directly to Azure storage. You could also process the data periodically during the day to reduce the volume of data to be processed at midnight, enabling the Excel workbook to be refreshed within a smaller time period.
To what degree can data processing operations degrade the performance and scalability of systems that are in use for ongoing business processes?
There may be some impact on the web servers as the log files are read, and you should test the resource utilization overhead this causes.
What is the budget for the employee time and infrastructure resources used to process the
data?
The process can be fully automated, which minimizes human resource costs. The main
running cost is the HDInsight cluster, and you can mitigate this by only provisioning the
cluster when it is needed to perform the data processing jobs. For example, you could
design a workflow in which log files are uploaded to Azure storage on an hourly basis, but
the HDInsight cluster is only provisioned at midnight when the last log file has been
uploaded, and then deleted when the data has been processed. If processing the logs for
the entire day takes too long to refresh the Excel workbook in a timely fashion, you could
automate provisioning of the cluster and processing of the data twice per day.
In the previous example, based on measurements you make by experimenting with each stage of the
process, you might design a workflow in which:
1. The web servers are configured to create a new log each hour.
2. On an hourly schedule the log files for the previous hour are uploaded to Azure storage. For
the purposes of this example, uploading an hourly log file takes between three and five
minutes.
3. At noon each day an HDInsight cluster is provisioned and the log files for the day so far are
processed. The results are then transferred to Azure SQL Database, and the cluster and the
log files that have been processed are deleted. For the purposes of the example, this takes
between five and ten minutes.
4. The remaining logs for the day are uploaded on an hourly schedule.
5. At midnight an HDInsight cluster is provisioned and the log files for the day so far are
processed. The results are then transferred to SQL Database and the cluster and the log files
that have been processed are deleted. For the purposes of the example, this takes between
five and ten minutes.
6. Fifteen minutes later the data model in the Excel workbook is refreshed to include the new
data that was added to the SQL Database tables during the two data processing activities
during the day.
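Using the Azure PowerShell cmdlets available for the version of HDInsight described in this guide, the midnight stage of this workflow might be sketched as follows. The cluster, storage, script, and table names are hypothetical placeholders, and error handling is omitted for brevity.

```powershell
# Sketch: provision a cluster, process the day's logs with Pig, export the
# results to Azure SQL Database with Sqoop, then delete the cluster.
# All names, keys, and connection strings are illustrative placeholders.
New-AzureHDInsightCluster -Name "weblogs-cluster" -Location "North Europe" `
  -DefaultStorageAccountName "logstore.blob.core.windows.net" `
  -DefaultStorageAccountKey $storageKey `
  -DefaultStorageContainerName "weblogs" `
  -Credential $clusterCred -ClusterSizeInNodes 4

# Run the Pig job that summarizes page activity from the uploaded log files.
$pigJob = New-AzureHDInsightPigJobDefinition -File "wasbs:///scripts/summarize-logs.pig"
Start-AzureHDInsightJob -Cluster "weblogs-cluster" -JobDefinition $pigJob |
  Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600

# Transfer the summarized results to the SQL Database table used by Excel.
$sqoopJob = New-AzureHDInsightSqoopJobDefinition `
  -Command "export --connect $sqlConnectionString --table PageActivity --export-dir /results/pageactivity"
Start-AzureHDInsightJob -Cluster "weblogs-cluster" -JobDefinition $sqoopJob |
  Wait-AzureHDInsightJob -WaitTimeoutInSeconds 3600

# Remove the cluster so that charges accrue only while processing.
Remove-AzureHDInsightCluster -Name "weblogs-cluster"
```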
The specific dependencies and constraints in each big data processing scenario can vary significantly.
However, spending the time upfront to consider how you will accommodate them will help you plan and
implement an effective end-to-end solution.
DTExec, and other resources, but not HDInsight itself because these credentials will be provided in the
scripts or code.
Windows Task Scheduler enables you to specify Windows credentials for each scheduled task, and the
SQL Server Agent enables you to define proxies that encapsulate credentials with access to specific
subsystems for individual job steps. For more information about SQL Server Agent proxies and
subsystems see Implementing SQL Server Agent Security.
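As a minimal sketch, a Windows scheduled task that runs an upload script under a specific service account might be registered like this. The script path and account name are hypothetical:

```powershell
# Register a daily task that runs the upload script under specific Windows
# credentials, so the credentials are not embedded in the script itself.
$action  = New-ScheduledTaskAction -Execute "powershell.exe" `
             -Argument "-File C:\BigData\UploadLogs.ps1"
$trigger = New-ScheduledTaskTrigger -Daily -At "00:05"

Register-ScheduledTask -TaskName "UploadWebLogs" -Action $action -Trigger $trigger `
  -User "CONTOSO\svc-bigdata" -Password "P@ssw0rd-placeholder"
```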
Task parameterization
Avoid hard-coding variable elements in your big data tasks. This may include file locations, Azure service
names, Azure storage access keys, and connection strings. Instead, design scripts, custom applications,
and SSIS packages to use parameters or encrypted configuration settings files to assign these values
dynamically. This can improve security, as well as maximizing reuse, minimizing development effort, and
reducing the chance of errors caused by multiple versions that might have subtle differences. See
Securing credentials in scripts and applications in the Security section of this guide for more
information.
When using SQL Server 2012 Integration Services or later, you can define project-level parameters and
connection strings that can be set using environment variables for a package deployed in an SSIS
catalog. For example, you could create an SSIS package that encapsulates your big data process and
deploy it to the SSIS catalog on a SQL Server instance. You can then define named environments (for
example Test or Production), and set default parameter values to be used when the package is run
in the context of a particular environment. When you schedule an SSIS package to be run using a SQL
Server Agent job you can specify the environment to be used.
If you use project-level parameters in an SSIS project, ensure that you set the Sensitive option for any
parameters that must be encrypted and stored securely. For more information see Integration Services
(SSIS) Parameters.
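For example, a package deployed to the SSIS catalog can be run from the command line with an environment reference. The installation path, server name, package path, and environment reference ID below are hypothetical:

```powershell
# Sketch: run a catalog-deployed package using a named environment reference,
# so that environment-specific parameter values are applied at run time.
& "C:\Program Files\Microsoft SQL Server\110\DTS\Binn\DTExec.exe" `
  /ISServer "\SSISDB\BigData\LogProcessing\ProcessLogs.dtsx" `
  /Server "SQLSERVER01" `
  /EnvReference 20
```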
Data consistency
Partial failures in a data processing workflow can lead to inconsistent results. In many cases, analysis
based on inconsistent data can be more harmful to a business than no analysis at all.
When using SSIS to coordinate big data processes, use the control flow checkpoint feature to support
restarting the package at the point of failure.
Consider adding custom fields to enable lineage tracking of all data that flows through the process. For
example, add a field to all source data with a unique batch identifier that can be used to identify data
that was ingested by a particular instance of the workflow process. You can then use this identifier to
reverse all changes that were introduced by a failed instance of the workflow process.
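One way to implement this, sketched here in PowerShell with illustrative file names, is to generate a batch identifier per workflow run and stamp it onto every source record before upload:

```powershell
# Generate a unique batch ID for this run of the workflow.
$batchId = [Guid]::NewGuid().ToString()

# Prefix each log record with the batch ID so that downstream tables can be
# filtered (or rolled back) by the run that introduced the data.
Get-Content "C:\Logs\site.log" |
  ForEach-Object { "$batchId`t$_" } |
  Set-Content "C:\Staging\site-$((Get-Date).ToString('yyyyMMdd')).log"
```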
Exception handling and logging
In any complex workflow, errors or unexpected events can cause exceptions that prevent the workflow
from completing successfully. When an error occurs in a complex workflow, it can be difficult to
determine what went wrong.
Most developers are familiar with common exception handling techniques, and you should ensure that
you apply these to all custom code in your solution. This includes custom .NET applications, PowerShell
scripts, map/reduce components, and Transact-SQL scripts. Implementing comprehensive logging
functionality for both successful and unsuccessful operations in all custom scripts and applications helps
to create a source of troubleshooting information in the event of a failure, as well as generating useful
monitoring data.
If you use PowerShell or custom .NET code to manage job submission and Oozie workflows, capture the
job output returned to the client and include it in your logs. This helps centralize the logged information,
making it easier to find issues that would otherwise require you to examine separate logs in the
HDInsight cluster (which may have been deleted at the end of a partially successful workflow).
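With the HDInsight PowerShell cmdlets available at the time of writing, capturing job output for a central log might look like the following sketch. The cluster name, query, and log file path are hypothetical:

```powershell
# Submit a Hive job, wait for completion, and append its output and error
# streams to a local log file for centralized troubleshooting.
$jobDef = New-AzureHDInsightHiveJobDefinition -Query "SELECT COUNT(*) FROM weblogs;"
$job    = Start-AzureHDInsightJob -Cluster "weblogs-cluster" -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $job -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster "weblogs-cluster" -JobId $job.JobId -StandardOutput |
  Out-File -Append "C:\Logs\hdinsight-jobs.log"
Get-AzureHDInsightJobOutput -Cluster "weblogs-cluster" -JobId $job.JobId -StandardError |
  Out-File -Append "C:\Logs\hdinsight-jobs.log"
```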
If you use SSIS packages to coordinate big data processing tasks, take advantage of the native logging
capabilities in SSIS to record details of package execution, errors, and parameter values. You can also
take advantage of the detailed log reports that are generated for packages deployed in an SSIS catalog.
Running a Sqoop job to transfer data between HDInsight and a relational database.
Running an XMLA command in SQL Server Analysis Services (SSAS) to process a data model.
PowerShell is often the easiest way to automate individual tasks or sub-processes, and can be a good
choice for relatively simple end-to-end processes that have minimal steps and few conditional branches.
However, the dependency on one or more script files can make it fragile for complex solutions.
The following table shows how PowerShell can be used to automate an end-to-end solution for each of
the big data use cases and models discussed in this guide.
Use case
Considerations
Iterative data exploration
Data warehouse on demand
In this scenario you can use a PowerShell script to upload new data files to Azure storage,
provision the cluster, recreate Hive tables, refresh reports that are built on them, and then
delete the cluster.
ETL automation
In a simple ETL solution you can encapsulate the jobs that filter and shape the data in an
Oozie workflow, which can be initiated from a PowerShell script. If the source and/or
destination of the ETL process is a relational database that is accessible from the
HDInsight cluster, you can use Sqoop actions in the Oozie workflow. Otherwise you can
use PowerShell to upload source files and download output files, or to run an SSIS
package using the DTExec.exe command line tool.
BI integration
Considerations
Iterative data exploration
practice you may want to implement a custom application that integrates the analytical
processing into a business process.
Data warehouse on demand
If the data warehouse is specifically designed to help analysts examine data that is
generated by a custom business application, you might integrate the process of uploading
new data, provisioning a cluster, recreating Hive tables, refreshing reports, and deleting
the cluster into the application using classes and interfaces from the .NET SDK for
HDInsight.
ETL automation
As in the data warehouse on demand scenario, if the ETL process is designed to take the
output from a particular application and process it for analysis you could manage the
entire ETL process from the application itself.
BI integration
In this scenario there is generally an existing established ETL coordination solution based
on SSIS, and the processing of big data with HDInsight can be added to this solution.
Some of the processing tasks may be automated using custom SSIS tasks, which are
implemented using .NET code.
Data Flow. Data flow tasks encapsulate the transfer of data from one source to another, with
the ability to apply complex transformations and data cleansing logic as the data is transferred.
Execute SQL. You can use Execute SQL tasks to run SQL commands in relational databases. For
example, after using Sqoop to transfer the output of a big data processing job to a staging table
you could use an Execute SQL task to load the staged data into a production table.
File System. You can use a File System task to manipulate files on the local file system. For
example, you could use a File System task to prepare files for upload to Azure storage.
Execute Process. You can use an Execute Process task to run a command, such as a custom
command line utility or a PowerShell script.
Analysis Services Processing. You can use an Analysis Services Processing task to process
(refresh) an SSAS data model. For example, after completing a job that creates Hive tables over
new data you could process any SSAS data models that are based on those tables to refresh the
data in the model.
Send Mail. You can use a Send Mail task to send a notification email to an operator when a
workflow is complete, or when a task in the workflow fails.
Additionally, you can use a Script task or create a custom task using .NET code to perform custom
actions.
SSIS control flows use precedence constraints to implement conditional branching, enabling you to
create complex workflows that handle exceptions or perform actions based on variable conditions. SSIS
also provides native logging support, making it easier to troubleshoot errors in the workflow.
The following table shows how SSIS can be used to automate an end-to-end solution for each of the big
data use cases and models discussed in this guide.
Use case
Considerations
Iterative data exploration
Data warehouse on demand
SSIS is designed to coordinate the transfer of data from one store to another, and can be
used effectively for large volumes of data that require transformation and cleansing before
being loaded into the target data warehouse. When the target is a Hive database in
HDInsight, you can use SSIS Execute Process tasks to run command line applications or
PowerShell scripts that provision HDInsight, load data to Azure storage, and create Hive
tables. You can then use an Analysis Services Processing task to process any SSAS data
models that are based on the data warehouse.
ETL automation
Although SSIS itself can be used to perform many ETL tasks, when the data must be
shaped and filtered using big data processing techniques in HDInsight you can use SSIS
to coordinate scripts and commands that provision the cluster, perform the data
processing jobs, export the output to a target data store, and delete the cluster.
BI integration
See SQL Server Integration Services in the MSDN Library for information about how to use SQL Server
Integration Services (SSIS) to automate and coordinate tasks.
Interactive: The operation is started on demand by a human operator. For example, a user
might run a PowerShell script to provision a cluster.
Scheduled: The operation is started automatically at a specified time. For example, the
Windows Task Scheduler application could be used to run a PowerShell script or custom tool
automatically at midnight to upload daily log files to Azure storage.
Triggered: The operation is started automatically by an event, or by the completion (or failure)
of another operation. For example, you could implement a custom Windows service that
monitors a local folder. When a new file is created, the service automatically uploads it to Azure
storage.
After the initial process or task has been started, it can start each sub-process automatically.
Alternatively, you can start them on a scheduled basis that allows sufficient time for all dependent subprocesses to complete.
This topic discusses two tools for scheduling operations in automated solutions:
Windows Task Scheduler. You can use the Task Scheduler administrative tool (or the
schtasks.exe command line program) to configure one-time or recurring commands, and specify
a wide range of additional properties and behavior for each task. You can use this tool to trigger
a command at specific times, when a specific event is written to the Windows event log, or in
response to other system actions. Commands you can schedule with the Windows Task
Scheduler include:
SQL Server Agent. The SQL Server Agent is a commonly used automation tool for SQL Server
related tasks. You can use it to create multistep jobs that can then be scheduled to run at
specific times. The types of step you can use include the following:
SQL Server Analysis Services (SSAS) Command steps; for example, to process an SSAS data
model.
SQL Server Integration Services (SSIS) Package steps to run SSIS packages.
SQL Server Agent offers greater flexibility and manageability than Windows Task Scheduler, but it
requires a SQL Server instance. If you are already planning to use SQL Server, and particularly SSIS, in
your solution then SQL Server Agent is generally the best way to automate scheduled execution of tasks.
However, Windows Task Scheduler offers an effective alternative when SQL Server is not available.
You may also be able to use the Azure Scheduler service in your Azure cloud service applications to
execute some types of tasks. Azure Scheduler can make HTTP requests to other services, and monitor
the outcome of these requests. It is unlikely to be used for initiating on-premises applications and tasks.
However, you might find it useful for accessing an HDInsight cluster directly to perform operations such
as transferring data and performing management tasks within the cluster; many of these tasks expose a
REST API that Azure Scheduler could access. For more information see Azure Scheduler on the Azure
website.
Scheduling data refresh in consumers
You can use the Windows Task Scheduler and SQL Server Agent to schedule execution of an SSIS
package, console application, or PowerShell script. However, reports and data models that consume the
output of the processing workflow might need to be refreshed on their own schedule. You can process
SSAS data models in an SSIS control flow by using a SQL Server Agent job or by using PowerShell to run
an XMLA command in the SSAS server, but PowerPivot data models that are stored in Excel workbooks
cannot be processed using this technique, and must be refreshed separately. Similarly, the refreshing
of SQL Server Reporting Services (SSRS) reports that make use of caching or snapshots must be managed
separately.
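As an illustration of processing an SSAS data model from PowerShell, the Analysis Management Objects (AMO) library can be used to issue a full process. The server and database names here are hypothetical:

```powershell
# Load AMO and fully process an SSAS database to refresh its data.
[Reflection.Assembly]::LoadWithPartialName("Microsoft.AnalysisServices") | Out-Null

$server = New-Object Microsoft.AnalysisServices.Server
$server.Connect("SSASSERVER01")
$database = $server.Databases.FindByName("WebActivityModel")
$database.Process([Microsoft.AnalysisServices.ProcessType]::ProcessFull)
$server.Disconnect()
```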
Scheduled data refresh for PowerPivot data models in SharePoint Server
PowerPivot workbooks that are shared in an on-premises SharePoint Server site can be refreshed
interactively on-demand, or the workbook owner can define a schedule for automatic data refresh. In a
regularly occurring big data process, the owners of shared workbooks are the data stewards for the data
models they contain. As such, they must take responsibility for scheduling data refresh at the earliest
possible time after updated data is available in order to keep the data models (and reports based on
them) up to date.
For data refresh to be successful, the SharePoint Server administrator must have enabled an unattended
service account for the PowerPivot service and this account must have access to all data sources in the
workbook, as well as all required system rights for Kerberos delegation. The SharePoint administrator
can also specify a range of business hours during which automatic scheduled refresh can occur.
For more information see PowerPivot Data Refresh with SharePoint 2013.
Security
It is vital to consider how you can maximize security for all the applications and services you build and
use. This is particularly the case with distributed applications and services, such as big data solutions,
that move data over public networks and store data outside the corporate network.
Typical areas of concern for security in these types of applications are:
Ensure you properly protect the cluster by using passwords of appropriate complexity.
Ensure you protect your Azure storage keys and keep them secret. If a malicious user obtains
the storage key, he or she will be able to directly access the cluster data held in blob storage.
Protect credentials, connection strings, and other sensitive information when you need to use
them in your scripts or application code. See Securing credentials in scripts and applications for
more information.
If you enable remote desktop access to the cluster, use a suitably strong password and
configure the access to expire as soon as possible after you finish using it. Remote desktop users
do not have administrative level permissions on the cluster, but it is still possible to access and
modify the core Hadoop system, and read data and the contents of configuration files (which
contain security information and settings) through a remote desktop connection.
Consider whether protecting your clusters by using a custom or third-party gatekeeper
implementation, one that can authenticate multiple users with different credentials, would be
appropriate for your scenario.
Use local security policies and features, such as file permissions and execution rights, for tools
or scripts that transmit, store, and process the data.
Storing credentials and storage keys in plain text in scripts or configuration files leaves the cluster
itself, and the data in Azure storage, open to anyone who has access to these scripts or configuration files.
In production systems, and at any time when you are not just experimenting with HDInsight using test
data, you should consider how you will protect credentials, connections strings, and other sensitive
information in scripts and configuration files. Some solutions are:
Prompt the user to enter the required credentials as the script or application executes. This is a
common approach in interactive scenarios, but it is obviously not appropriate for automated
solutions where the script or application may run in unattended mode on a schedule, or in
response to a trigger event.
Store the required credentials in encrypted form in the configuration file. This approach is
typically used in .NET applications where sections of the configuration file can be encrypted
using the methods exposed by the .NET framework. See Encrypting Configuration Information
Using Protected Configuration for more information. You must ensure that only authorized
users can execute the application by protecting it using local security policies and features, such
as file permissions and execution rights.
Store the required credentials in a text file, a repository, a database, or Windows Registry in
encrypted form using the Data Protection API (DPAPI). This approach is typically used in
Windows PowerShell scripts. You must ensure that only authorized users can execute the script
by protecting it using local security policies and features, such as file permissions and execution
rights.
The article Working with Passwords, Secure Strings and Credentials in Windows PowerShell on the
TechNet wiki includes some useful examples of the techniques you can use.
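As a minimal sketch of the DPAPI approach, with an illustrative file path and account name:

```powershell
# One-time setup, run interactively: encrypt the password with DPAPI for the
# current user and store it in a file.
Read-Host "Enter password" -AsSecureString |
  ConvertFrom-SecureString |
  Set-Content "C:\Secure\service-password.txt"

# In the unattended script: rebuild the credential from the encrypted file.
# Only the same user account on the same machine can decrypt it.
$secure = Get-Content "C:\Secure\service-password.txt" | ConvertTo-SecureString
$cred   = New-Object System.Management.Automation.PSCredential("CONTOSO\svc-bigdata", $secure)
```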
Securing data passing over the network
HDInsight uses several protocols for communication between the cluster nodes, and between the
cluster and clients, including RPC, TCP/IP, and HTTP. Consider the following when deciding how to
secure data that passes across the network:
Use a secure protocol for all connections over the Internet to the cluster and to your Azure
storage account. Consider using Secure Socket Layer (SSL) for the connection to your storage
account to protect the data on the wire (supported and recommended for Azure storage). Use
SSL or Transport Layer Security (TLS), or other secure protocols, where appropriate when
communicating with the cluster from client-side tools and utilities, and keep in mind that some
tools may not support SSL or may require you to specifically configure them to use SSL. When
accessing Azure storage from client-side tools and utilities, use the wasbs secure protocol (you
must specify the full path to a file when you use the wasbs protocol).
Consider if you need to encrypt data in storage and on the wire. This is not trivial, and may
involve writing custom components to carry out the encryption. If you create custom
components, use proven libraries of encryption algorithms to carry out the encryption process.
Note that the encryption keys must be available within your custom components running in
Azure, which may leave them vulnerable.
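For reference, a full wasbs path to a blob takes the following form; the account and container names are hypothetical:

```powershell
# Secure (SSL) access to a specific file in Azure blob storage from HDInsight.
$logFile = "wasbs://weblogs@logstore.blob.core.windows.net/2014/06/01/site.log"
```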
Securing data in storage
Consider the following when deciding how to secure data in storage:
Do not store data that is not associated with your HDInsight processing jobs in the storage
accounts linked to a cluster. HDInsight has full access to all of the containers in linked storage
accounts because the account names and keys are stored in the cluster configuration. See
Cluster and storage initialization for details of how you can isolate parts of your data by using
separate storage accounts.
If you use non-linked storage accounts in an HDInsight job by specifying the storage key for
these accounts in the job files, the HDInsight job will have full access to all of the containers and
blobs in that account. Ensure that these non-linked storage accounts do not contain data that
must be kept private from HDInsight, and that the containers do not have public access
permission. See Using an HDInsight Cluster with Alternate Storage Accounts and Metastores
and Use Additional Storage Accounts with HDInsight Hive for more information.
Consider if using Shared Access Signatures (SAS) to provide access to data in Azure storage
would be an advantage in your scenario. SAS can provide fine-grained, controlled, and
time-limited access to data for clients. For more details see Create and Use a Shared Access
Signature.
Consider if you need to employ monitoring processes that can detect inappropriate access to
the data, and can alert operators to possible security breaches. Ensure that you have a process
in place to lock down access in this case, detect the scope of the security breach, and ensure
validation and integrity of the data afterwards. Hadoop can log access to data. Azure blob
storage also has a built-in monitoring capability; for more information see How To Monitor a
Storage Account and Azure Storage Account Monitoring and Logging.
Consider encrypting sensitive data, sensitive parts of the data, or even whole folders and
subfolders. This may include splitting data into separate files, such as dividing credit card
information into different files that contain the card number and the related card-holder
information. Azure blob storage does not have a built-in encryption feature, and so you will
need to encrypt the data using encryption libraries and custom code, or with third-party tools.
If you are handling sensitive data that must be encrypted you will need to write custom
serializer and deserializer classes and install these in the cluster for use as the SerDe parameter
in Hive statements, or create custom map/reduce components that can manage the
serialization and deserialization. See the Apache Hive Develop Guide for more information
about creating a custom SerDe. However, consider that the additional processing required
for encryption imposes a trade-off between security and performance.
The YARN portal provides information about each node in the cluster, the applications (jobs) that are
executing or have finished, job scheduler details, current configuration of the cluster, and access to
log files and metrics.
The portal also exposes a set of metrics that indicate in great detail the status and performance of each
job. These metrics can be used to monitor and fine tune jobs, and to locate errors and issues with your
solutions.
Considerations
When implementing monitoring and logging for your solutions, consider the following points:
As with any remote service or application, managing and monitoring its operation may appear
to be more difficult than for a locally installed equivalent. However, remote management and
monitoring technologies are widely available, and are an accepted part of almost all
administration tasks. In many cases the extension of these technologies to cloud-hosted
services and applications is almost seamless.
Establish a monitoring and logging strategy that can provide useful information for detecting
issues early, debugging problematic jobs and processes, and for use in planning. For example, as
well as collecting runtime data and events, consider measuring overall performance, cluster
load, and other factors that will be useful in planning for data growth and future requirements.
The YARN portal in HDInsight, accessible remotely, can provide a wide range of information
about performance and events for jobs and for the cluster as a whole.
Configure logging and manage the log files for all parts of the process, not just the jobs within
Hadoop. For example, monitor and log data ingestion and data export where the tools support
this, or consider changing to a tool that can provide the required support for logging and
monitoring. Many tools and services, such as SSIS and Azure storage, will need to be configured
to provide an appropriate level of logging.
Consider maintaining data lineage tracking by adding an identifier to each log entry, or through
other techniques. This allows you to trace back the original source of the data and the
operation, and follow it through each stage to understand its consistency and validity.
Consider how you can collect logs from the cluster, or from more than one cluster, and collate
them for purposes such as auditing, monitoring, planning, and alerting. You might use a custom
solution to access and download the log files on a regular basis, and combine and analyze them
to provide a dashboard-like display with additional capabilities for alerting for security or failure
detection. Such utilities could be created using PowerShell, the HDInsight SDKs, or code that
accesses the Azure Service Management API.
Consider if a monitoring solution or service would be a useful benefit. A management pack for
HDInsight is available for use with Microsoft System Center (see the Microsoft Download Center
for more details). In addition, you can use third-party tools such as Chukwa and Ganglia to
collect and centralize logs. Many companies offer services to monitor Hadoop-based big data
solutions; some examples are Centerity, Compuware APM, Sematext SPM, and Zettaset
Orchestrator.
The following table illustrates how monitoring and logging considerations apply to each of the use cases
and models described in this guide.
Use case
Considerations
Iterative data exploration
In this model you are typically experimenting with data and do not have a long-term plan
for its use, or for the techniques you will discover for finding useful information in the data.
Therefore, monitoring is not likely to be a significant concern when using this model.
However, you may need to use the logging features of HDInsight to help discover the
optimum techniques for processing the data as you refine your investigation, and to debug
jobs.
Data warehouse on demand
In this model you are likely to have established a regular process for uploading,
processing, and consuming data. Therefore, you should consider implementing a
monitoring and logging strategy that can detect issues early and assist in resolving them.
Typically, if you intend to delete and recreate the cluster on a regular basis, this will
require a custom solution using tools that run on the cluster or on-premises rather than
using a commercial monitoring service.
ETL automation
In this model you may be performing scheduled data transfer operations, and so it is vital
to establish a robust monitoring and logging mechanism to detect errors and to measure
performance.
BI integration
This model is usually part of an organization's core business functions, and so it is vital to
design a strategy that incorporates robust monitoring and logging features, and that can
detect failures early as well as providing ongoing data for forward planning. Monitoring for
security purposes, alerting, and auditing are likely to be important business requirements
in this model.
Data consumption: Tools, APIs, SDKs, and technologies commonly used for extracting and
consuming the results from Hadoop-based solutions.
Data ingestion: Tools, APIs, SDKs, and technologies commonly used for extracting data from
data sources and loading them into Hadoop-based solutions.
Data processing
Job submission
Management
Workflow
The tools, APIs, SDKs, and technologies are listed in alphabetical order below.
Ambari
A solution for provisioning, managing, and monitoring Hadoop clusters using an intuitive, easy-to-use
Hadoop management web UI backed by REST APIs.
Usage notes: Only the monitoring endpoint was available in HDInsight at the time this guide was
written.
For more info, see Ambari.
Aspera
A tool for high-performance transfer and synchronization of files and data sets of virtually any size, with
full access control, privacy, and security. Provides maximum transfer speed under variable network
conditions.
Usage notes:
Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered
when using only TCP.
Avro
A data serialization system that supports rich data structures, a compact, fast, binary data format, a
container file to store persistent data, remote procedure calls (RPC), and simple integration with
dynamic languages. Can be used with the client tools in the .NET SDK for Azure.
Usage notes:
AZCopy
A command-line utility designed for high performance that can upload and download Azure storage
blobs and files. Can be scripted for automation. Offers a number of functions to filter and manipulate
content. Provides resume and logging functions.
Usage notes:
Transfers to and from an Azure datacenter will be constrained by the connection bandwidth
available.
Azkaban
A framework for creating workflows that access Hadoop. Designed to overcome the problem of
interdependencies between tasks.
Usage notes:
Uses a web server to schedule and manage jobs, an executor server to submit jobs to Hadoop,
and either an internal H2 database or a separate MySQL database to store job details.
Azure Intelligent Systems Service
A cloud-based service that can be used to collect data from a wide range of devices and applications, apply rules that define automated actions on the data, and connect the data to business applications and clients for analysis.
Azure storage
Exposes storage resources through a REST API that can be called by any language that can make HTTP/HTTPS requests.
Usage notes:
Provides programming libraries for several popular languages that simplify many tasks by handling synchronous and asynchronous invocation, batching of operations, exception management, automatic retries, operational behavior, and more.
Libraries are currently available for .NET, Java, and C++. Others will be available over time.
Azure Storage Explorer
A free GUI-based tool for viewing, uploading, and managing data in Azure blob storage. Can be used to view multiple storage accounts at the same time in separate tab pages.
Azure SQL Database
A platform-as-a-service (PaaS) relational database solution in Azure that offers a minimal configuration,
low maintenance solution for applications and business processes that require a relational database
with support for SQL Server semantics and client interfaces.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is
required, and decommission it after data processing is complete. If you want the results of the big data
processing to remain available in relational format for client applications to consume, you can transfer
the output generated by HDInsight into a relational database. Azure SQL Database is a good choice for
this when you want the data to remain in the cloud, and you do not want to incur the overhead of
configuring and managing a physical server or virtual machine running the SQL Server database engine.
For more info, see Azure SQL Database.
Casablanca
A project to develop support for writing native-code REST for Azure, with integration in Visual Studio.
Provides a consistent and powerful model for composing asynchronous operations based on C++ 11
features.
Usage notes:
Provides support for accessing REST services from native code on Windows Vista, Windows 7,
and Windows 8 through asynchronous C++ bindings to HTTP, JSON, and URIs.
Includes libraries for accessing Azure blob storage from native clients.
Cascading
A data processing API and processing query planner for defining, sharing, and executing data processing
workflows. Adds an abstraction layer over the Hadoop API to simplify development, job creation, and
scheduling.
Usage notes:
Can be deployed on a single node to efficiently test code and process local files before being deployed on a cluster, or in a distributed mode that uses Hadoop.
Uses a metaphor of pipes (data streams) and filters (data operations) that can be assembled to
split, merge, group, or join streams of data while applying operations to each data record or
groups of records.
Cerebrata Azure Management Studio
A comprehensive environment for managing Azure-hosted applications. Can be used to access Azure
storage, Azure log files, and manage the life cycle of applications. Provides a dashboard-style UI.
Usage notes:
Connects through a publishing file and enables use of groups and profiles for managing users
and resources.
Provides full control of storage accounts, including Azure blobs and containers.
Provides management capabilities for many types of Azure service including SQL Database.
Chef
An automation platform that transforms infrastructure into code. Allows you to automate configuration,
deployment and scaling for on-premises, cloud-hosted, and hybrid applications.
Usage notes: Available as a free open source version, and an enterprise version that includes additional
management features such as a portal, authentication and authorization management, and support for
multi-tenancy. Also available as a hosted service.
For more info, see Chef.
Chukwa
An open source data collection system for monitoring large distributed systems, built on top of HDFS
and map/reduce. Also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing
results.
Its primary components include:
Collectors that receive data from the agent and write it to stable storage.
Hadoop Infrastructure Care Center, a web-portal style interface for displaying data.
CloudBerry Explorer
A free GUI-based file manager and explorer for browsing and accessing Azure storage.
Usage notes:
Also available as a paid-for Professional version that adds encryption, compression, multi-threaded data transfer, file comparison, and FTP/SFTP support.
CloudXplorer
An easy-to-use GUI-based explorer for browsing and accessing Azure storage. Has a wide range of
features for managing storage and transferring data, including access to compressed files. Supports
auto-resume for file transfers.
Usage notes:
No logging features.
Cross-platform Command Line Interface (X-plat CLI)
An open source command line interface for developers and IT administrators to develop, deploy, and manage Azure applications. Supports management tasks on Windows, Linux, and Mac OS X. Commands can be extended using Node.js.
Usage notes:
Can be used to manage almost all features of Azure including accounts, storage, databases,
virtual machines, websites, networks, and mobile services.
You must add the path to the command line PATH list.
For more info, see Cross-platform Command Line Interface (X-plat CLI).
D3.js
A high performance JavaScript library for manipulating documents based on data. It supports large
datasets and dynamic behaviors for interaction and animation, and can be used to generate attractive
and interactive output for reporting, dashboards, and any data visualization task. Based on web
standards such as HTML, SVG and CSS to expose the full capabilities of modern browsers.
Usage notes:
Allows you to bind arbitrary data to a Document Object Model (DOM) and then apply data-driven transformations to the document, such as generating an HTML table from an array of numbers and using the same data to create an interactive SVG bar chart with smooth transitions and interaction.
Provides a powerful declarative approach for selecting nodes and can operate on arbitrary sets
of nodes called selections.
Falcon
A framework for simplifying data management and pipeline processing that enables automated
movement and processing of datasets for ingestion, pipelines, disaster recovery, and data retention.
Runs on one server in the cluster and is accessed through the command-line interface or the REST API.
Usage notes:
Replicates HDFS files and Hive tables between different clusters for disaster recovery and multi-cluster data discovery scenarios.
Automatically manages the complex logic of late data handling and retries.
Uses higher-level data abstractions (Clusters, Feeds, and Processes) enabling separation of
business logic from application logic.
Transparently coordinates and schedules data workflows using the existing Hadoop services
such as Oozie.
FileCatalyst
A client-server based file transfer system that supports common and secure protocols (UDP, FTP, FTPS,
HTTP, HTTPS), encryption, bandwidth management, monitoring, and logging.
Usage notes:
FileCatalyst Direct features are available by installing the FileCatalyst Server and one of the
client-side options.
Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered
when using only TCP.
Flume
A distributed, robust, and fault tolerant tool for efficiently collecting, aggregating, and moving large
amounts of log file data. Has a simple and flexible architecture based on streaming data flows and with a
tunable reliability mechanism. The simple extensible data model allows for automation using Java code.
Usage notes:
Includes several plugins to support various sources, channels, sinks and serializers. Well
supported third party plugins are also available.
You must manually configure SSL for each agent. Configuration can be complex and requires
knowledge of the infrastructure.
Provides a monitoring API that supports custom and third party tools.
Ganglia
A scalable distributed monitoring system that can be used to monitor computing clusters. It is based on
a hierarchical design targeted at federations of clusters, and uses common technologies such as XML for
data representation, XDR for compact, portable data transport, and RRDtool for data storage and
visualization.
Comprises the monitoring core, a web interface, an execution environment, a Python client, a command
line interface, and RSS capabilities.
For more info, see Ganglia.
Hadoop command line
Provides access to Hadoop to execute the standard Hadoop commands. Supports scripting for managing Hadoop jobs and shows the status of commands and jobs.
Usage notes:
You must create scripts or batch files for operations you want to automate.
Hamake
A workflow framework based on directed acyclic graph (DAG) principles for scheduling and managing
sequences of jobs by defining datasets and ensuring that each is kept up to date by executing Hadoop
jobs.
Usage notes:
Generalizes the programming model for complex tasks through dataflow programming and
incremental processing.
Workflows are defined in XML and can include iterative steps and asynchronous operations over
more than one input dataset.
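The DAG idea can be illustrated with a small dependency resolver: each dataset declares the datasets it is derived from, and jobs run in an order that keeps every dataset up to date. This is a conceptual sketch in Python (not Hamake's XML syntax), and the dataset names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical datasets: each maps to the datasets it is built from.
dependencies = {
    "raw_logs": set(),                     # ingested directly, no prerequisites
    "clean_logs": {"raw_logs"},            # produced by a cleansing job
    "daily_summary": {"clean_logs"},       # produced by an aggregation job
    "report": {"daily_summary", "clean_logs"},
}

# A valid execution order guarantees every input exists before a job runs.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```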
HCatalog
Provides a tabular abstraction layer that helps unify the way that data is interpreted across processing
interfaces, and provides a consistent way for data to be loaded and stored, regardless of the specific
processing interface being used. This abstraction exposes a relational view over the data, including
support for partitions.
Usage notes:
Easy to incorporate into solutions. Files in JSON, SequenceFile, CSV, and RC format can be read
and written by default, and a custom SerDe can be used to read and write files in other formats.
Enables notification of data availability, making it easier to write applications that perform
multiple jobs.
Additional effort is required in custom map/reduce components because custom load and store
functions must be created.
HDInsight SDK
The HDInsight SDKs provide the capability to create clients that can manage the cluster, and execute
jobs in the cluster. Available for .NET development and other languages such as Node.js. WebHDFS
client is a .NET wrapper for interacting with WebHDFS compliant end-points in Hadoop and Azure
HDInsight. WebHCat is the REST API for HCatalog, a table and storage management layer for Hadoop.
Can be used for a wide range of tasks including:
Creating and submitting map/reduce, Pig, Hive, Sqoop, and Oozie jobs.
For more info, see HDInsight SDK and Microsoft .NET SDK For Hadoop.
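As an illustration of the WebHDFS end-points mentioned above, the sketch below builds a WebHDFS REST URI of the standard http://&lt;host&gt;/webhdfs/v1/&lt;path&gt;?op=... form; the cluster host name and file path are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, **params):
    """Build a WebHDFS REST URI of the form
    http://<host>/webhdfs/v1/<path>?op=<OP>&... (no request is sent here)."""
    query = urlencode({"op": op, **params})
    return f"http://{host}/webhdfs/v1/{path.lstrip('/')}?{query}"

# Hypothetical cluster endpoint and file path.
url = webhdfs_url("headnode.example.com:50070", "/example/data/out.txt",
                  "OPEN", offset=0)
print(url)
```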
Hive
An abstraction layer over the Hadoop query engine that provides a query language called HiveQL, which
is syntactically very similar to SQL and supports the ability to create tables of data that can be accessed
remotely through an ODBC connection. Hive enables you to create an interface to your data that can be
used in a similar way to a traditional relational database.
Usage notes:
Data can be consumed from Hive tables using tools such as Excel and SQL Server Reporting Services, or through the ODBC driver for Hive.
HiveQL allows you to plug in custom mappers and reducers to perform more sophisticated processing.
A good choice for processes such as summarization, ad hoc queries, and analysis on data that
has some identifiable structure; and for creating a layer of tables through which users can easily
query the source data, and data generated by previously executed jobs.
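To illustrate how close HiveQL is to SQL, the sketch below composes a typical summarization query of the kind you might submit through the ODBC driver; the table and column names are hypothetical.

```python
# Hypothetical table and column names; the HiveQL clauses themselves
# (SELECT, GROUP BY, ORDER BY, LIMIT) work as they do in SQL.
table = "weblogs"
query = (
    f"SELECT client_ip, COUNT(*) AS requests "
    f"FROM {table} "
    f"GROUP BY client_ip "
    f"ORDER BY requests DESC "
    f"LIMIT 10"
)
print(query)
```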
Kafka
A distributed, partitioned, replicated service with the functionality of a messaging system. Stores data as
logs across servers in a cluster and exposes the data through consumers to implement common
messaging patterns such as queuing and publish-subscribe.
Usage notes:
Uses the concepts of topics that are fed to Kafka by producers. The data is stored in the
distributed cluster servers, each of which is referred to as a broker, and accessed by consumers.
Data is exposed over TCP, and clients are available in a range of languages.
Data lifetime is configurable, and the system is fault tolerant through the use of replicated copies.
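The log-and-offset model described above can be sketched in a few lines: a topic is an append-only list, and because each consumer group tracks its own offset, publish-subscribe falls out naturally. This is a toy in-memory illustration, not the Kafka client API.

```python
from collections import defaultdict

class MiniLog:
    """Toy sketch of Kafka's model: a topic is an append-only log, and each
    consumer group tracks its own read offset. Not the Kafka API."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> ordered list of messages
        self.offsets = defaultdict(int)   # (group, topic) -> next offset

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        offset = self.offsets[(group, topic)]
        messages = self.topics[topic][offset:]
        self.offsets[(group, topic)] = len(self.topics[topic])
        return messages

log = MiniLog()
log.produce("clicks", "page1")
log.produce("clicks", "page2")

# Two independent groups each see the full stream (publish-subscribe).
a = log.consume("analytics", "clicks")
b = log.consume("audit", "clicks")
print(a, b)
```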
Knox
A system that provides a single point of authentication and access for Hadoop services in a cluster.
Simplifies Hadoop security for users who access the cluster data and execute jobs, and for operators
who control access and manage the cluster.
Usage notes:
Delivers users a single cluster end-point that aggregates capabilities for data and jobs.
LINQ to Hive
A technology that supports authoring Hive queries using Language-Integrated Query (LINQ). The LINQ is
compiled to Hive and then executed on the Hadoop cluster.
Usage notes: The LINQ code can be executed within a client application or as a user-defined function
(UDF) within a Hive query.
For more info, see LINQ to Hive.
Mahout
A scalable machine learning and data mining library used to examine data files to extract specific types
of information. It provides an implementation of several machine learning algorithms, and is typically
used with source data files containing relationships between the items of interest in a data processing
solution.
Usage notes:
A good choice for grouping documents or data items that contain similar content;
recommendation mining to discover users' preferences from their behavior; assigning new
documents or data items to a category based on the existing categorizations; and performing
frequent data mining operations based on the most recent data.
Management portal
The Azure Management portal can be used to configure and manage clusters, execute HiveQL
commands against the cluster, browse the file system, and view cluster activity. It shows a range of
settings and information about the cluster, and a list of the linked resources such as storage accounts. It
also provides the ability to connect to the cluster through RDP.
Provides rudimentary monitoring features including:
Accumulated, maximum, and minimum data for containers in the storage accounts.
A list of jobs that have executed and some basic information about each one.
For more info, see Get started using Hadoop 2.2 in HDInsight.
Map/reduce
Map/reduce code consists of two functions: a mapper and a reducer. The mapper is run in parallel on
multiple cluster nodes, each node applying it to its own subset of the data. The output from the mapper
function on each node is then passed to the reducer function, which collates and summarizes the results
of the mapper function.
Usage notes:
A good choice for processing completely unstructured data by parsing it and using custom logic
to obtain structured information from it; for performing complex tasks that are difficult (or
impossible) to express in Pig or Hive without resorting to creating a UDF; for refining and
exerting full control over the query execution process, such as using a combiner in the map
phase to reduce the size of the map process output.
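The mapper/reducer split described above can be sketched with a simple word count, the canonical map/reduce example. In Hadoop the map calls run in parallel across nodes and a shuffle/sort phase groups the pairs by key; the sort-and-group step here simulates that.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (key, value) pair for each word, as a streaming mapper would.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Collate the values for one key; here, summing the counts.
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase (parallel across nodes in Hadoop), then sort-and-group stands
# in for the shuffle/sort that precedes the reduce phase.
pairs = sorted(p for line in lines for p in mapper(line))
results = dict(reducer(k, (v for _, v in g))
               for k, g in groupby(pairs, key=itemgetter(0)))
print(results)
```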
Microsoft Excel
One of the most commonly used data analysis and visualization tools in BI scenarios. It includes native
functionality for importing data from a wide range of sources, including HDInsight (via the Hive ODBC
driver) and relational databases such as SQL Server. Excel also provides native data visualization tools,
including tables, charts, conditional formatting, slicers, and timelines.
Usage notes: After HDInsight has been used to process data, the results can be consumed and visualized
in Excel. Excel can consume output from HDInsight jobs directly from Hive tables in the HDInsight cluster
or by importing output files from Azure storage, or through an intermediary querying and data modeling
technology such as SQL Server Analysis Services.
For more info, see Microsoft Excel.
A set of modules for Node.js that can be used to manage many features of Azure.
Includes separate modules for:
Core management
Compute management
Oozie
A tool that enables you to create repeatable, dynamic workflows for tasks to be performed in a Hadoop
cluster. Actions encapsulated in an Oozie workflow can include Sqoop transfers, map/reduce jobs, Pig
jobs, Hive jobs, and HDFS commands.
Usage notes:
Defining an Oozie workflow requires familiarity with the XML-based syntax used to define the directed acyclic graph (DAG) for the workflow actions.
You can initiate Oozie workflows from the Hadoop command line, a PowerShell script, a custom
.NET application, or any client that can submit an HTTP request to the Oozie REST API.
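As a sketch of the XML-based DAG syntax, the code below assembles a minimal workflow definition with a single (hypothetical) action; a real workflow also needs the action's type-specific configuration, such as job-tracker, name-node, and script elements.

```python
import xml.etree.ElementTree as ET

# Minimal Oozie-style workflow skeleton with one hypothetical action.
wf = ET.Element("workflow-app",
                {"name": "example-wf", "xmlns": "uri:oozie:workflow:0.2"})
ET.SubElement(wf, "start", {"to": "hive-step"})
action = ET.SubElement(wf, "action", {"name": "hive-step"})
ET.SubElement(action, "ok", {"to": "end"})      # transition on success
ET.SubElement(action, "error", {"to": "fail"})  # transition on failure
kill = ET.SubElement(wf, "kill", {"name": "fail"})
ET.SubElement(kill, "message").text = "Workflow failed"
ET.SubElement(wf, "end", {"name": "end"})

xml_text = ET.tostring(wf, encoding="unicode")
print(xml_text)
```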
Phoenix
A client-embedded JDBC driver designed to perform low latency queries over data stored in Apache
HBase. It compiles standard SQL queries into a series of HBase scans, and orchestrates the running of
those scans to produce standard JDBC result sets. It also supports client-side batching and rollback.
Usage notes:
Supports all common SQL query statement clauses including SELECT, FROM, WHERE, GROUP BY,
HAVING, ORDER BY, and more.
Supports a full set of DML commands and DDL commands including table creation and versioned
incremental table alteration.
Allows columns to be defined dynamically at query time. Metadata for tables is stored in an
HBase table and versioned so that snapshot queries over prior versions automatically use the
correct schema.
Pig
A high-level data-flow language and execution framework for parallel computation that provides a
workflow semantic for processing data in HDInsight. Supports complex processing of the source data to
generate output that is useful for analysis and reporting. Pig statements generally involve defining
relations that contain data. Relations can be thought of as result sets, and can be based on a schema or
can be completely unstructured.
Usage notes: A good choice for restructuring data by defining columns, grouping values, or converting
columns to rows; transforming data such as merging and filtering data sets, and applying functions to all
or subsets of records; and as a sequence of operations that is often a logical way to approach many
map/reduce tasks.
For more info, see Pig.
Power BI
A service for Office 365 that builds on the data modeling and visualization capabilities of PowerPivot,
Power Query, Power View, and Power Map to create a cloud-based collaborative platform for self-service BI. Provides a platform for users to share the insights they have found when analyzing and
visualizing the output generated by HDInsight, and to make the results of big data processing
discoverable for other, less technically proficient, users in the enterprise.
Usage notes:
Users can share queries created with Power Query to make data discoverable across the
enterprise through Online Search. Data visualizations created with Power View can be published
as reports in a Power BI site, and viewed in a browser or through the Power BI Windows Store
app. Data models created with PowerPivot can be published to a Power BI site and used as a
source for natural language queries using the Power BI Q&A feature.
By defining queries and data models that include the results of big data processing, users become data stewards and can publish the data in a way that abstracts the complexities of consuming and modeling data from HDInsight.
Power Map
An add-in for Excel that is available to Office 365 enterprise-level subscribers. Power Map enables users
to create animated tours that show changes in geographically-related data values over time, overlaid on
a map.
Usage notes: When the results of big data processing include geographical and temporal fields you can
import the results into an Excel worksheet or data model and visualize them using Power Map.
For more info, see Power Map.
Power Query
An add-in for Excel that you can use to define, save, and share queries. Queries can be used to retrieve,
filter, and shape data from a wide range of data sources. You can import the results of queries into
worksheets or into a workbook data model, which can then be refined using PowerPivot.
Usage notes: You can use Power Query to consume the results of big data processing in HDInsight by
defining a query that reads files from the Azure blob storage location that holds the output of big data
processing jobs. This enables Excel users to consume and visualize the results of big data processing,
even after the HDInsight cluster has been decommissioned.
For more info, see Power Query.
Power View
An add-in for Excel that enables users to explore data models by creating interactive data visualizations.
It is also available as a SharePoint Server application service when SQL Server Reporting Services is
installed in SharePoint-Integrated mode, enabling users to create data visualizations from PowerPivot
workbooks and Analysis Services data models in a web browser.
Usage notes: After the results of a big data processing job have been imported into a worksheet or data
model in Excel you can use Power View to explore the data visually. With Power View you can create a
set of related interactive data visualizations, including column and bar charts, pie charts, line charts, and
maps.
For more info, see Power View.
PowerPivot
A data modeling add-in for Excel that can be used to define tabular data models for slice and dice
analysis and visualization in Excel. You can use PowerPivot to combine data from multiple sources into a
tabular data model that defines relationships between data tables, hierarchies for drill-up/down
aggregation, and calculated fields and measures.
Usage notes: In a big data scenario you can use PowerPivot to import a result set generated by
HDInsight as a table into a data model, and then combine that table with data from other sources to
create a model for mash-up analysis and reporting.
For more info, see PowerPivot.
PowerShell
A powerful scripting language and environment designed to manage infrastructure and perform a wide
range of operations. Can be used to implement almost any manual or automated scenario. A good
choice for automating big data processing when there is no requirement to build a custom user
interface or integrate with an existing application. Additional packages of cmdlets and functionality are
available for Azure and HDInsight. The PowerShell interactive scripting environment (ISE) also provides a
useful client environment for testing and exploring.
When working with HDInsight it can be used to perform a wide range of tasks.
Usage notes:
It supports SSL and includes commands for logging and monitoring actions.
Installed with Windows, though early versions do not offer the full performance and range of operations.
For optimum performance and capabilities, all systems should run the latest version.
Very well formed and powerful language, but has a reasonably high learning curve for new
adopters.
You can schedule PowerShell scripts to run automatically, or initiate them on-demand.
Puppet
Automates repetitive tasks, such as deploying applications and managing infrastructure, both on-premises and in the cloud.
Usage notes: The Enterprise version can automate tasks at any stage of the IT infrastructure lifecycle,
including: discovery, provisioning, OS and application configuration management, orchestration, and
reporting.
For more info, see Puppet.
Reactive Extensions (Rx)
A library that can be used to compose asynchronous and event-based programs using observable
collections and LINQ-style query operators. Can be used to create stream-processing solutions for
capturing, storing, processing, and uploading data. Supports multiple asynchronous data streams from
different sources.
Usage notes:
Can be used to address very complex streaming and processing scenarios, but all parts of the
solution must be created using code.
Requires a high level of knowledge and coding experience, although plenty of documentation
and samples are available.
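The observable-collection idea can be sketched as a minimal pipeline of LINQ-style operators; this is a conceptual illustration in Python, not the Rx API, which also handles asynchronous streams, schedulers, and error propagation.

```python
class Observable:
    """Minimal sketch of the observable / LINQ-style operator idea;
    the real library is far richer and handles async event streams."""
    def __init__(self, items):
        self.items = items

    def where(self, pred):       # like LINQ Where / a filter operator
        return Observable(i for i in self.items if pred(i))

    def select(self, fn):        # like LINQ Select / a map operator
        return Observable(fn(i) for i in self.items)

    def subscribe(self, on_next):
        for item in self.items:  # push each item to the observer callback
            on_next(item)

received = []
pipeline = Observable(range(10)).where(lambda n: n % 2 == 0) \
                                .select(lambda n: n * n)
pipeline.subscribe(received.append)
print(received)
```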
Remote Desktop connection
Allows you to remotely connect to the head node of the HDInsight cluster and gain access to the
configuration and command line tools for the underlying HDP as well as the YARN and NameNode status
portals.
Usage notes:
You must specify a validity period after which the connection is automatically disabled.
Not recommended for use in production applications but is useful for experimentation and one-off jobs, and for accessing Hadoop files and configuration on the cluster.
REST APIs
Utilities such as cURL, which is available for a wide range of platforms including Windows.
Samza
A distributed stream processing framework that uses Kafka for messaging and Hadoop YARN to provide
fault tolerance, processor isolation, security, and resource management.
Usage notes:
Provides a very simple callback-based API comparable to map/reduce for processing messages in
the stream.
Pluggable architecture allows use with many other messaging systems and environments.
Signiant
A system that uses managers and agents to automate media technology and file-based transfers and
workflows. Can be integrated with existing IT infrastructure to enable highly efficient file-based
workflows.
Solr
A highly reliable, scalable, and fault tolerant enterprise search platform from the Apache Lucene project
that provides powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic
clustering, database integration, rich document (such as Word and PDF) handling, and geospatial search.
Usage notes:
Includes distributed indexing, load-balanced querying, replication, and automated failover and
recovery.
REST-like HTTP/XML and JSON APIs make it easy to use from virtually any programming
language.
Wide ranging customization is possible using external configuration and an extensive plugin
architecture.
SQL Server Analysis Services (SSAS)
A component of SQL Server that enables enterprise-level data modeling to support BI. SSAS can be
deployed in multidimensional or tabular mode; and in either mode can be used to define a dimensional
model of the business to support reporting, interactive analysis, and key performance indicator (KPI)
visualization through dashboards and scorecards.
Usage notes: SSAS is commonly used in enterprise BI solutions where large volumes of data in a data
warehouse are pre-aggregated in a data model to support BI applications and reports. As organizations
start to integrate the results of big data processing into their enterprise BI ecosystem, SSAS provides a
way to combine traditional BI data from an enterprise data warehouse with new dimensions and
measures that are based on the results generated by HDInsight data processing jobs.
For more info, see SQL Server Analysis Services (SSAS).
SQL Server database engine
The core component of SQL Server that provides an enterprise-scale database engine to support online transaction processing (OLTP) and data warehouse workloads. You can install SQL Server on an on-premises server (physical or virtual) or in a virtual machine in Azure.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is
required, and decommission it after data processing is complete. If you want the results of the big data
processing to remain available in relational format for client applications to consume, you must transfer
the output generated by HDInsight into a relational database. The SQL Server database engine is a good
choice for this when you want to have full control over server and database engine configuration, or
when you want to combine the big data processing results with data that is already stored in a SQL
Server database.
For more info, see SQL Server Database Engine.
SQL Server Data Quality Services (DQS)
A SQL Server instance feature that consists of knowledge base databases containing rules for data
domain cleansing and matching, and a client tool that enables you to build a knowledge base and use it
to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and
de-duplication of your data.
Usage notes:
The DQS Cleansing component can be used to cleanse data as it passes through a SQL Server
Integration Services (SSIS) data flow. A similar DQS Matching component is available on
CodePlex to support data deduplication in a data flow.
Master Data Services (MDS) can make use of a DQS knowledge base to find duplicate business
entity records that have been imported into an MDS model.
DQS can use cloud-based reference data services provided by reference data providers to
cleanse data, for example by verifying parts of mailing addresses.
For more info, see SQL Server Data Quality Services (DQS).
SQL Server Integration Services (SSIS)
A component of SQL Server that can be used to coordinate workflows that consist of automated tasks.
SSIS workflows are defined in packages, which can be deployed and managed in an SSIS Catalog on an
instance of the SQL Server database engine.
SSIS packages can encapsulate complex workflows that consist of multiple tasks and conditional
branching. In particular, SSIS packages can include data flow tasks that perform full ETL processes to
transfer data from one data store to another while applying transformations and data cleaning logic
during the workflow.
Usage notes:
Although SSIS is often primarily used as a platform for implementing data transfer solutions, in a big data scenario it can also be used to coordinate the various disparate tasks required to ingest, process, and consume data using HDInsight.
SSIS packages are created using the SQL Server Data Tools for Business Intelligence (SSDT-BI)
add-in for Visual Studio, which provides a graphical package design interface.
Completed packages can be deployed to an SSIS Catalog in SQL Server 2012 or later instances, or
they can be deployed as files.
Package execution can be automated using SQL Server Agent jobs, or you can run them from the
command line using the DTExec.exe utility.
To use SSIS in a big data solution, you require at least one instance of SQL Server.
SQL Server Reporting Services (SSRS)
A component of SQL Server that provides a platform for creating, publishing, and distributing reports. SSRS can be deployed in native mode, where reports are viewed and managed in a Report Manager website, or in SharePoint-Integrated mode, where reports are viewed and managed in a SharePoint Server document library.
Usage notes: When big data analysis is incorporated into enterprise business operations, it is common
to include the results in formal reports. Report developers can create reports that consume big data
processing results directly from Hive tables (via the Hive ODBC Driver) or from intermediary data models
or databases, and publish those reports to a report server for on-demand viewing or automated
distribution via email subscriptions.
For more info, see SQL Server Reporting Services (SSRS).
Sqoop
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. It automates most of this process, relying on the database to describe the schema for the data to be imported, and uses map/reduce to import and export the data in order to provide parallel operation and fault tolerance.
Storm
A distributed real-time computation system that provides a set of general primitives. It is simple, and
can be used with any programming language. Supports a high-level language called Trident that provides
exactly-once processing, transactional data store persistence, and a set of common stream analytics
operations.
Usage notes:
Tools are available to create and manage the processing topology and configuration.
Supports logging that can be viewed through the Storm web UI, and a reliability API that allows custom tools and third party services to provide performance monitoring. Some third party tools support full real-time monitoring.
StreamInsight
A component of SQL Server that can be used to perform real-time analytics on streaming and other
types of data. Supports using the Observable/Observer pattern and an Input/Output adaptor model with
LINQ processing capabilities and an administrative GUI. Could be used to capture events to a local file
for batch upload to the cluster, or write the event data directly to the cluster storage. Code could
append events to an existing file, create a new file for each event, or create a new file based on
temporal windows in the event stream.
Usage notes:
Events are implemented as classes or structs, and the properties defined for the event class
provide the data values for visualization and analysis.
Monitoring information is obtained by using the diagnostic views API, which requires the
Management Web Service to be enabled and connected.
Provides a complex event processing (CEP) solution out of the box, including debugging tools.
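The temporal windows mentioned above can be illustrated with a small conceptual sketch in plain Python. It is NOT the StreamInsight API (StreamInsight queries are expressed in LINQ over event streams); it only shows what a tumbling (fixed, non-overlapping) window computation does, using hypothetical event data:

```python
# Conceptual sketch (plain Python, NOT the StreamInsight API): counting
# events per tumbling window, the kind of temporal grouping a StreamInsight
# LINQ query expresses over an event stream.
from collections import defaultdict

# Hypothetical events as (timestamp_seconds, payload) pairs; in StreamInsight
# these would be instances of an event class or struct.
events = [(1, "a"), (2, "b"), (6, "c"), (7, "d"), (12, "e")]

def tumbling_window_counts(events, window_size):
    """Group events into fixed, non-overlapping windows and count per window."""
    windows = defaultdict(int)
    for timestamp, _payload in events:
        window_start = (timestamp // window_size) * window_size
        windows[window_start] += 1
    return dict(windows)

print(tumbling_window_counts(events, 5))  # {0: 2, 5: 2, 10: 1}
```

The same windowing logic could drive the file-creation strategies described above, for example starting a new capture file each time the stream crosses into a new window.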
System Center management pack for HDInsight
Simplifies the monitoring process for HDInsight by providing capabilities to discover, monitor, and
manage HDInsight clusters deployed on an Analytics Platform System (APS) appliance or in Azure.
Provides views for proactive monitoring alerts, health and performance dashboards, and performance
metrics for Hadoop at the cluster and node level.
Usage notes:
Includes a custom diagram view that has detailed knowledge about cluster structure and the
health states of host components and cluster services.
Provides context-sensitive tasks to stop or start a host component, a cluster service, or all cluster
services at once.
For more info, see System Center management pack for HDInsight.
Server Explorer in Visual Studio
A feature available in all except the Express versions of Visual Studio. Provides a GUI-based explorer for
Azure features, including storage, with facilities to upload, view, and download files.
Usage notes:
Also provides access and management features for SQL Database, useful when using a custom
metastore with an HDInsight cluster.
More information
For more details about pre-processing and loading data, and the considerations you should be aware of,
see the section Collecting and loading data into HDInsight of this guide.
For more details about processing the data using queries and transformations, and the considerations
you should be aware of, see the section Processing, querying, and transforming data using HDInsight of
this guide.
For more details about consuming and visualizing the results, and the considerations you should be
aware of, see the section Consuming and visualizing data from HDInsight of this guide.
For more details about automating and managing solutions, and the considerations you should be aware
of, see the section Building end-to-end solutions using HDInsight of this guide.
Copyright
This document is provided as-is. Information and views expressed in this document, including URL and
other Internet web site references, may change without notice.
Some examples depicted herein are provided for illustration only and are fictitious. No real association or
connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft
product. You may copy and use this document for your internal, reference purposes.
© 2014 Microsoft. All rights reserved.
Microsoft, Bing, Bing logo, C++, Excel, HDInsight, MSDN, Office 365, SQL Azure, Visual Studio, Windows,
and Windows PowerShell are trademarks of the Microsoft group of companies. All other trademarks are
property of their respective owners.