
Welcome back. In the previous module we learned about the basic components of a Hadoop cluster: HDFS, and the MapReduce and YARN frameworks. In this lesson, which is broken up into three videos, you're going to learn about the major concepts and components involved in executing applications on Hadoop clusters. We'll also look into some technical issues that might come up, including some on the system side. Primarily, this is about the execution environment. We'll focus on the MapReduce/YARN environment to start with, and then we'll see how some of the newer environments work. So in this first video, we'll look at the basic execution environments in Hadoop, the limitations of what is called the classic MapReduce framework, and then how newer frameworks like YARN, Tez, and Spark complement classic MapReduce with much higher performance.
So, as we discussed in the HDFS module, files are broken up into blocks, and these blocks are stored on different disks, spread out over many nodes. You can see this in this picture. The advantage of this is that we are pooling the data resources across hundreds or even thousands of nodes, and that gives us lots of throughput: basically, we're adding up the performance of the disks on all the nodes. Each node will have blocks corresponding to different files. In this case, one of the files is split over n nodes, with its blocks shown in green. The other blocks, in yellow, could correspond to other files, or they could be replica blocks for the file we have.

Now, to get good performance while processing data that is spread out like this, we have to make sure the compute parts of the processing are performed on the nodes that already hold the data. So ideally you would put task one on node one, where block one is, task two on block two on node two, and so on, up to task n on node n. This is readily scalable because each task works on a piece of data that is local to its node, so you're not moving data around. And just as we pooled the disk performance, we can pool the compute performance of all the CPUs on the nodes, so you get much better performance.
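To make the idea of data locality concrete, here is a minimal sketch of locality-aware task placement. This is not Hadoop's actual scheduler, which is far more involved; the block-to-replica map and the node names are hypothetical.

```python
# Toy locality-aware placement: prefer a free node that already holds the
# block (data-local); otherwise fall back to a remote read over the network.
# Block and node names are hypothetical.

block_locations = {
    "block1": ["node1", "node4"],   # nodes holding replicas of block1
    "block2": ["node2", "node5"],
    "block3": ["node3", "node1"],
}

def place_task(block, free_nodes):
    """Return (node, placement_kind) for the task processing this block."""
    for node in block_locations.get(block, []):
        if node in free_nodes:
            return node, "data-local"
    return next(iter(free_nodes)), "remote-read"

free = {"node1", "node2", "node3"}
for block in block_locations:
    node, kind = place_task(block, free)
    free.discard(node)
    print(f"{block} -> {node} ({kind})")
```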
Now, in the original Hadoop architecture, all applications were executed via what is called the MapReduce framework. What the framework lets you do is run these tasks close to the data without having to worry about how that's done: the framework knows where the data chunks are stored, and it gets each task close to its data. It also manages the execution, monitoring the tasks and restarting them if they fail, and things like that. The key point is that all applications had to fit the MapReduce approach. So you had to have map tasks, independent of each other, running on these distributed data chunks, and a shuffle process that would feed their output into the reduce tasks.
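To make the programming model concrete, here is a tiny word-count sketch of the map/shuffle/reduce pattern. This is not Hadoop's actual API: in a real job the mapper and reducer run as separate tasks on different nodes and the framework performs the shuffle, which this sketch simulates with a local sort.

```python
# Minimal sketch of the MapReduce word-count pattern (illustrative only).
from itertools import groupby
from operator import itemgetter

def mapper(chunk):
    """Map: runs independently on one data chunk, emits (word, 1) pairs."""
    for line in chunk.splitlines():
        for word in line.split():
            yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce: receives all counts for one word after the shuffle."""
    return (word, sum(counts))

# Two chunks, as if stored as separate HDFS blocks on different nodes.
chunks = ["the quick brown fox", "the lazy dog and the fox"]

# Map phase over each chunk, then a sort to simulate the shuffle.
pairs = sorted(kv for chunk in chunks for kv in mapper(chunk))

# Reduce phase: group by key and aggregate.
for word, group in groupby(pairs, key=itemgetter(0)):
    print(reducer(word, (count for _, count in group)))
```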
This works well for a lot of data processing. But what if your application doesn't fit this particular paradigm, or it can be done in this paradigm but not efficiently? We'll look at some examples of this later in the module, and the solution is what we're going to discuss a little bit later. One case where this might happen is interactive data exploration, where you might be looking at the same set of data over and over again. In the traditional framework, this would mean loading the data from disk every time you look at it, which is obviously not efficient; if the data could be stored in memory, you would obviously get much better performance. The other case where this might happen is if you have an iterative data processing algorithm, which happens in some machine learning algorithms.
So this is where the next-generation frameworks like YARN, Tez, and Spark come into play. They support complex directed acyclic graphs (DAGs) of tasks; by acyclic we mean there is essentially no looping in the graph. And some of these tools, like Tez and Spark, and Spark especially, will let you cache data in memory, which gets you much better performance. We'll look at this in more detail, and with some examples, in the next video.
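As a small preview, here is a minimal PySpark sketch (the input path and application name are hypothetical) showing both ideas: the chained transformations form a DAG that Spark plans as a whole, and cache() keeps the result in memory, so repeated passes, as in interactive exploration or iterative algorithms, don't reload the data from disk each time.

```python
# Minimal PySpark sketch (illustrative; path and app name are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()
sc = spark.sparkContext

# Chained transformations build a DAG; nothing runs until an action is called.
words = (sc.textFile("hdfs:///data/sample.txt")   # hypothetical input path
           .flatMap(lambda line: line.split())
           .map(lambda w: (w.lower(), 1))
           .reduceByKey(lambda a, b: a + b))

# Keep the result in memory across actions instead of recomputing it
# from disk on every pass, as classic MapReduce would.
words.cache()

print(words.count())   # first action: reads from disk, fills the cache
print(words.take(5))   # later actions: served from memory

spark.stop()
```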
