In the previous module we learned about the basic components of a Hadoop cluster:
HDFS, MapReduce, and the YARN framework. In this lesson, which is broken up into three videos, you're going to learn about the major concepts and components for executing applications on Hadoop clusters. We'll also look into some other technical issues that might come up, some of them on the system side. Primarily this is about the execution environment, so we'll focus on the MapReduce/YARN environment to start with, but then we'll see how some of the newer environments work. In this first video, we'll look at the basic execution environments in Hadoop, the limitations of what is called the classic MapReduce framework, and then how newer frameworks like YARN, Tez, and Spark complement classic MapReduce with much higher performance.

As we discussed in the HDFS module, files are broken up into blocks, and these blocks are stored on different disks spread out over many nodes, as you can see in this picture. The advantage of this is that we are pooling the data resources across hundreds or even thousands of nodes, and that gives us a lot of throughput: we're essentially adding up the performance of the disks on all the nodes. Each node will have blocks corresponding to different files. In this case, one of the files is split over n nodes, shown in green in the blocks. The other blocks, in yellow, could correspond to other files, or they could be replica blocks for the file we have.

Now, to get good performance while processing data that is spread out like this, we have to make sure the compute parts of the processing are performed on the nodes that already hold the data. So ideally you would put task one on block one on node one, task two on block two on node two, and so on, up to task n on node n. This is readily scalable because each task is working on a piece of data that's local to its node, so you're not moving data around. And just as we pooled the disk performance, we can pool the compute performance of all the CPUs on the nodes, so you get much better performance.

In the original Hadoop architecture, all applications were executed via what is called the MapReduce framework. What the framework lets you do is run these tasks close to the data without having to worry about how it's done. The framework knows where the data chunks are stored, so it places each task close to its data, and it also manages the execution, monitoring the tasks and restarting them if they fail, and things like that. The key here was that all applications had to fit the MapReduce approach: you had to have these map tasks, independent of each other, running on the distributed data chunks, and then a shuffle process that would feed their output into the reduce process. This works well for a lot of data processing.
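To make the paradigm concrete, here is a minimal word-count sketch written against Hadoop Streaming, which lets you supply the map and reduce steps as ordinary scripts that read from stdin and write tab-separated key/value lines to stdout. The file names mapper.py and reducer.py are just illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emits one (word, 1) pair per word, tab-separated,
# which is the line format Hadoop Streaming expects on stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- the shuffle delivers lines sorted by key, so all counts
# for a given word arrive together; we just sum runs of identical keys.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You would submit these with the Hadoop Streaming jar, pointing it at input and output directories in HDFS. The framework then runs many copies of the mapper, one per input split and close to where the blocks live, and the shuffle sorts the intermediate pairs by key and delivers each key's values to a single reducer.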
But what if your application doesn't fit this particular paradigm, or it can be done in this paradigm but not efficiently? We'll look at some examples of this later in this module, and the solution is what we're going to discuss a little bit later.

One case where this might happen is interactive data exploration, where you might be looking at the same set of data over and over again. In the traditional framework, this would mean loading the data from disk every time you want to look at it, which is obviously not efficient. If it could be stored in memory, you would get much better performance. The other case where this might happen is if you have an iterative data processing algorithm, and this comes up in some machine learning algorithms.

This is where the next-generation frameworks like YARN, Tez, and Spark come into play. They support complex directed acyclic graph (DAG) tasks; by acyclic we mean there is essentially no looping in the graph. Also, some of these tools, Tez and especially Spark, will let you cache data in memory, which gets you much better performance, as the sketch below illustrates. We'll look at this in more detail in the next video, with some examples to clarify things.
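To make that concrete, here is a minimal PySpark sketch; the input path and the thresholds are made up for illustration. Each transformation adds a step to a directed acyclic graph that Spark only executes when an action runs, and cache() asks Spark to keep the intermediate result in memory so repeated passes don't re-read it from disk.

```python
# A minimal PySpark sketch of a DAG plus in-memory caching.
# The HDFS path and thresholds below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-and-cache").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/logs")

# Each transformation adds a node to the DAG; nothing runs yet.
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.cache()  # ask Spark to keep this dataset in memory once computed

# An interactive or iterative workload can now reuse the cached data;
# only the first action pays the cost of reading from HDFS.
for threshold in (10, 100, 1000):
    print(threshold, counts.filter(lambda kv: kv[1] > threshold).count())

spark.stop()
```

Under classic MapReduce, each of those three passes would be a separate job re-reading the input from disk; here, only the first action pays that cost.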