So welcome back.

So we left off in the previous video talking about the new execution frameworks like YARN, Tez, and Spark, and how they handle more complex directed acyclic graphs of tasks and use advanced features like in-memory caching of data and things like that. So let's go a little bit more into detail.
So what we're gonna look at in this video is the layout of some of these new frameworks, that is, how they fit into the Hadoop environment essentially, and some of the optimization strategies they've been using. All of these frameworks are pretty complex, and we'll go into more detail in some of the upcoming modules; especially for Spark, we have a separate module that goes into that. We'll look at a couple of examples in Tez and Spark to show how things get better by using these frameworks.
So if you're looking at where this fits into the Hadoop framework, you have the HDFS layer at the bottom, which is essentially the storage layer. Sitting on top of that is YARN, which is essentially the basic execution engine in next-generation Hadoop. There are some applications that fit right on top of YARN; HBase is one example, and there are lots of other applications that just work through YARN to let you do things. There are other applications, like Pig and Hive, that can use the newer optimized engines like Tez, which itself works through YARN. So you could use applications there, and some of these applications have backends that work either with Tez or with Spark going through YARN. Either way, there are lots of options.
Now, the good news is, if you are an end user and you've gotten used to programming with Hive, you don't really have to change anything, because the backend implementation is the only thing that changes; as far as Hive is concerned, all the commands and everything look the same. Now, one other thing to mention here is that Spark can actually run without YARN if needed. It can run directly on HDFS, and in fact it can run on other forms of storage also. And Spark has a rich application base that we will look into in future modules.
So first, let's go to the basic YARN layer. It does support the classic MapReduce framework that you're used to, so if applications have been written that way and you're comfortable with that, they'll work on YARN. But the nice thing is there's a rich set of applications on top of YARN, like Storm and Impala, that work through YARN, so if you're used to those applications, you don't have to change anything either. Of course, you can write your own user-developed application that runs through YARN. You will have to write the client, the application master, and all the management of the tasks and things like that, but you can build your own application that way. The other thing that YARN enables
is frameworks like Tez and Spark that sit on top of it. And they talk to YARN for
the resource requirements, but other than that they have their own mechanics and are self-supporting applications. So we'll start off by looking at Tez.
The main component there is essentially that it can handle dataflow graphs. There is an expressive API that allows you to do this, and the framework is integrated with YARN as I mentioned. What it lets you do is customize the application logic, so it doesn't necessarily need to fit the MapReduce framework. It lets you customize data formats, so there is no restriction like the key-value pairs in the MapReduce framework. And similarly, with data transfer there's some customization. So in the original MapReduce framework, if you had a complex graph of tasks, you would end up with a bunch of MapReduce jobs that would essentially write to disk after each job, you would end up with serialization of tasks in some cases, and there would be synchronization overhead. With Tez you can get around that, and the whole thing can be distilled into one job; we'll see an example in the next slide.
So what this does is essentially give you lots of resource efficiency improvements. For one, you're not doing as many HDFS writes and not reading as much HDFS data, because you've simplified the job. But also, there are things like reusing resources where possible: a lot of times when you have to spin up a container on YARN, there is a time delay involved, because it takes time to start things up. With Tez, you can reuse containers, so basically you're not paying those startup costs again. And you can cache data in some cases. All that makes things much faster.
Now, as I mentioned, this improves resource usage efficiency. And things like Pig and Hive can already use Tez, so if you already use those tools you're not changing anything; we are just changing the backend execution engine, which is a very simple change, and then you get the performance benefits.
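Just to make that concrete, here is a minimal sketch of what that switch typically looks like in a Hive session, assuming your Hive build ships with Tez support (the property name is the standard Hive setting, but check your distribution's documentation):

    -- switch this session's execution engine (the classic MapReduce engine is "mr")
    set hive.execution.engine=tez;
    -- existing HiveQL queries then run unchanged on Tez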
So let's look at a simple Hive on Tez example. We are just doing a select from a table a, joining with a second table b based on the id, then joining with c based on the itemid, and then we are grouping things.
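To give a rough idea of the shape of that query, here is a reconstruction from the description on the slide (the exact column list and aggregates aren't shown here, so treat them as hypothetical):

    SELECT a.vendor, COUNT(*), AVG(c.cost)
    FROM a
    JOIN b ON (a.id = b.id)
    JOIN c ON (a.itemid = c.itemid)
    GROUP BY a.vendor;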
Now, if you had gone and written this in the original Hive MapReduce implementation, it would have turned into a bunch of MapReduce jobs. You would have a job for the select of a.vendor. You would have a job for the join of a and c, selecting the cost. Then you're basically also gonna have another join at the end, where you take the data from b and a and join them, and there's a selection task for table b. So you can see this has ended up in a lot of MapReduce jobs, which are also writing to HDFS in all these intermediate steps. So you can see the inefficiencies.
Now, if you write the same thing in Tez, what you're gonna end up with is that the intermediate map steps are gone. You're not writing to disk, and you're reusing some of the containers and data, so you have a simplified graph, and you end up with a lot better performance. So this is one example of essentially how Tez works, and as I mentioned, these newer frameworks have a rich set of features and lots of advantages to using them.
Now, another example is Spark. This, again, is an advanced DAG execution engine. The nice thing is it's very flexible: a lot of the functions like maps, filters, joins, and group-bys, etc., are easy to handle. It can handle cyclic data flows, which is very important, because if you have an iterative graph algorithm, like in machine learning cases or stream processing, it's gonna be much more efficient, unlike other engines where you could do it but it's gonna be tedious and inefficient, and you might have a lot of intermediate data spills. Now, Spark will also keep track of data produced during operations, so this allows for storing working data sets in memory, and it's pretty graceful about spilling over to disk if it runs out of memory. So data can be shared across DAGs, can be shared between iterations, and can be reused. And this makes it much faster than MapReduce or even some other DAG engines, because of the in-memory computing essentially.
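As a small illustration of that style of API, here is a minimal PySpark sketch (not from the lecture; the input path and thresholds are made up) that builds a chain of transformations, caches the intermediate result in memory, and reuses it across two actions:

    from pyspark import SparkContext

    sc = SparkContext(appName="WordLengths")

    # hypothetical input file; each transformation only builds up the DAG
    words = sc.textFile("hdfs:///tmp/words.txt").flatMap(lambda line: line.split())
    long_words = words.filter(lambda w: len(w) > 3).cache()  # keep this RDD in memory

    # both actions below reuse the cached data instead of re-reading from HDFS
    print(long_words.count())
    counts = long_words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.take(5))

    sc.stop()

Whether this runs on YARN, standalone, or locally is decided by the cluster configuration or the submit options, not by the code itself.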
The other nice thing with Spark is that the functionality can be accessed from Java, Scala, Python, and R, and these are all high-level languages that a lot of people are used to. So you get the big data processing advantages while writing things in high-level languages, and you'll see an example of this next. The other nice thing is there's a rich suite of existing libraries that can handle things like graph processing, machine learning, and streaming applications, all kinds of data-intensive processing, and you can use them in isolation or in combination. So it's very generalized, which is very powerful, especially when you are writing in high-level languages. So let's look at a simple
example of logistic regression. And if you're not familiar
with Python that's okay. I'm just gonna point out
a few things in this example. So this example, as I mentioned,
is written in Python. All we are doing is an iterative
machine learning algorithm, which is going through the same data set
of points through a MapReduce process, and it's basically trying to
find an optimal gradient. So what we are doing,
if you look at the code there, is an iterative process which loops
over several iterations and has a similar MapReduce process operating
on the same set of data essentially. So what this means is you
could cache these data points in RAM across iterations, and that would
essentially make things a lot faster. So as you can see, there is the cache() statement in the first line, where we are loading the data, and that keeps the data in RAM across iterations. And then the other thing to see is how easy it is to define the map and reduce functions; these are shown in red on the slide. So in a high-level language that's a lot easier to do.
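The slide itself isn't reproduced here, so here is a sketch in the same spirit, based on the standard logistic regression example from the Spark documentation; the input path, feature count, and line format are assumptions:

    import numpy as np
    from pyspark import SparkContext

    sc = SparkContext(appName="LogisticRegression")
    D = 10            # number of features (assumed)
    ITERATIONS = 100

    def parse_point(line):
        # assume each line is "label f1 f2 ... fD" with the label in {-1, +1}
        values = np.array([float(x) for x in line.split()])
        return values[0], values[1:]

    # read and parse once, then cache in RAM so every iteration reuses the in-memory points
    points = sc.textFile("hdfs:///tmp/lr_data.txt").map(parse_point).cache()

    w = np.random.rand(D)  # initial weight vector
    for _ in range(ITERATIONS):
        # map: per-point gradient contribution; reduce: sum the contributions
        gradient = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p[0] * p[1].dot(w))) - 1.0) * p[0] * p[1]
        ).reduce(lambda a, b: a + b)
        w -= gradient

    print("Final weights:", w)
    sc.stop()

The cache() call is the key line here: without it, each iteration would go back to HDFS and re-parse the same points.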
So we are looking at two things essentially: in-memory computing and the ease of expressing your data processing algorithm. So if you run this code, and this is a standard example from the Spark site, it runs about 100x faster in Spark when compared to MapReduce, so there's a huge advantage in using memory. And as I mentioned, there'll be a much longer Spark module coming up later in this class.
