Beruflich Dokumente
Kultur Dokumente
OpenStack Sahara
Essentials
pl
C o m m u n i t y
Omar Khedher
E x p e r i e n c e
D i s t i l l e d
OpenStack Sahara
Essentials
$ 39.99 US
25.99 UK
P U B L I S H I N G
Sa
m
ee
Omar Khedher
Preface
OpenStack, the ultimate cloud computing operating system, keeps growing and
gaining more popularity around the globe. One of the main reasons of OpenStack's
success is the collaboration of several big enterprises and companies worldwide.
Within every new release, the OpenStack community brings a new incubated
project to the cloud computing open source world. Lately, big data has also taken
a very important role in the OpenStack journey. Within its broad definition of the
complexity of data management and its value extraction, the big-data business faces
several challenges that need to be tackled. With the growth of the concept of cloud
paradigm in the last decade, the big-data world can also be offered as a service.
Specifically, the OpenStack community has taken on such a challenge to turn it
into a very unique opportunity: Big Data as a Service. The Sahara project makes
provisioning a complete elastic Hadoop cluster a very seamless operation with no
need for touching the underlying infrastructure. Running on OpenStack, Sahara
becomes a very mature project that supports Hadoop and Spark, the open source
in-memory computing framework. That becomes a very good deal to find a parallel
world about Big Data and Data Processing in Sahara named Elastic Data Processing.
Sahara, formerly known as Savanna, has become a very attractive project, mature
and supporting several big data providers.
In this book, we will explore the main motivation of using Sahara and how it
interacts with other services of OpenStack. The main motivation of using Sahara is
the facilities exposed from a central dashboard to manage big-data infrastructure and
simplify data-processing tasks. We will walk through the installation and integration
of Sahara OpenStack, launch clusters, execute sample jobs, explore more functions,
and troubleshoot some common errors. By the end of this book, you should not only
understand how Sahara operates and functions within the OpenStack ecosystem but
also realize its major use cases of cluster and workload management.
Preface
Understand the success of big data processing when it is combined with the
cloud computing paradigm
Learn how OpenStack can offer a unique big data management solution
[1]
Velocity: The processing speed of data and the rate at which data arrives
[2]
Chapter 1
[3]
The Hadoop framework lets data volumes increase while controlling the processing
time. Without diving into the Hadoop technology stack, which is out of the scope of
this book, it might be important to examine a few tools available under the umbrella
of the Hadoop project and within its ecosystem:
Apache Spark is another amazing alternative to process large amounts of data that
a typical MapReduce cannot provide. Typically, Spark can run on top of Hadoop
or standalone. Hadoop uses HDFS as its default file system. It is designed as a
distributed file system that provides a high throughput access to application data.
The big data tools (Hadoop/Spark) sound very promising. On the other hand, while
launching a project on a terabyte-scale, it might go quickly into a petabyte-scale. A
traditional solution is found by adding more clusters. However, operational teams
may face more difficulties with manual deployment, change management and
most importantly, performance scaling. Ideally, when actively working on a live
production setup, users should not experience any sort of service disruption. Adding
then an elasticity flavor to the Hadoop infrastructure in a scalable way is imperative.
How can you achieve this? An innovative idea is using the cloud.
[4]
Chapter 1
Some of the most recent functional programming languages are Scala and
R. Scala can be used to develop applications that interact with Hadoop
and Spark. R language has become very popular for data analysis, data
processing, and descriptive statistics. Integration of Hadoop with R is
ongoing; RHadoop is one of the R open source projects that exposes a
rich collection of packages to help the analysis of data with Hadoop. To
read more about RHadoop, visit the official GitHub project page found at
https://github.com/RevolutionAnalytics/RHadoop/wiki
[5]
We have used a few acronyms: EC2, S3 and EMR. From high-level architecture,
EMR sits on top of EC2 and S3. It uses EC2 for processing and S3 for storage. The
main purpose of EMR is to process data in the cloud without managing your own
infrastructure. As described briefly in the following diagram, data is being pulled
from S3 and is going to automatically spin up an EC2 cluster within a certain size.
The results will be piped back to S3. The hallmark of Hadoop in the cloud is zero
touch infrastructure. What you need to do is just specify what kind of job you
intend to run, the location of the data, and from where to pick up the results.
[6]
Chapter 1
Swift: The object storage management service. Any form of data in Swift is
stored in a redundant, scalable, distributed object storage using a cluster
of servers.
The awesomeness of OpenStack comes not only from its modular architecture
but also the contribution of its large community by developing and integrating a
new project in nearly every new OpenStack release. Within the Icehouse release,
OpenStack contributors turned on the light to meet the big data world: the Elastic
Data Processing service. That becomes even more amazing to see a cloud service
similar to EMR in Amazon running by OpenStack.
[7]
Well, it is time to open the curtains and explore the marriage of one of the most
popular big data programs, Hadoop, with one of the most successful cloud operating
system OpenStack: Sahara. As shown in the next diagram of the OpenStack IaaS
(short for Infrastructure as a Service) layering schema, Sahara can be expressed as
an optional service that sits on top of the base components of OpenStack. It can be
enabled or activated when running a private cloud based on OpenStack.
More details on Sahara integration in a running OpenStack environment
will be discussed in Chapter 2, Integrating OpenStack Sahara.
[8]
Chapter 1
The Sahara project was named Savanna and has been renamed
due to trademark issues.
Sahara in OpenStack
The main reason the Sahara project was born is the need for agile access to big
data. By moving big data to the cloud, we can capture many benefits for the user
experience in this case:
[9]
To be able to create a Hadoop cluster, Sahara will need to retrieve and register virtual
machine images in its own image registry by contacting Glance. Nova is also another
essential OpenStack core component to provision and launch virtual machines for
the Hadoop cluster. Additionally, Heat can be used by Sahara in order to automate
the deployment of a Hadoop cluster, which will be covered in a later chapter.
In OpenStack within the Juno release, it is possible to instruct
Sahara to use block storage as nodes backend.
[ 10 ]
Chapter 1
[ 11 ]
REST API: Every client request initiated from the dashboard will be
translated to a REST API call.
Auth: Like any other OpenStack service, Sahara must authenticate against
the authentication service Keystone. This also includes client and user
authorization to use the Sahara service.
[ 12 ]
Chapter 1
Vendor Plugins: The vendor plugins sit in the middle of the Sahara
architecture that exposes the type of cluster to be launched. Vendors such as
Cloudera and Apache Ambari provide their distributions in Sahara so users
can configure and launch a Hadoop based on their plugin mechanism.
Summary
In this chapter, you explored the factors behind the success of the emerging
technology of data processing and analysis using cloud computing technology. You
learned how OpenStack can be a great opportunity to offer the needed scalable and
elastic big data on-demand infrastructure. It can be also useful to execute on-demand
Elastic Data Processing tasks.
The first chapter exposed the new OpenStack incubated project called Sahara: a
rapid, auto-deploy, and scalable solution for Hadoop and Spark clusters. An overall
view of the Sahara architecture has been discussed for a fast-paced understanding of
the platform and how it works in an OpenStack private cloud environment.
Now it is time to get things running and discover how such an amazing big data
management solution can be used by installing OpenStack and integrating Sahara,
which will be the topic of the next chapter.
[ 13 ]
www.PacktPub.com
Stay Connected: