talking about the new execution frameworks like YARN, Tez, and Spark, and how they can run more complex acyclic graphs of tasks and use advanced features like in-memory caching of data and things like that. So let's go a little bit more into detail. What we're going to look at in this video is the layout of some of these new frameworks, how they fit into the Hadoop environment essentially, and some of the optimization strategies they've been using. All of these frameworks are pretty complex, and to go into more detail we're going to use some of the upcoming modules; for Spark especially, we have a separate module that goes into that. We'll look at a couple of examples in Tez and Spark to show how things get better by using these frameworks.

So if you're looking at where this fits into the Hadoop framework: you have the HDFS layer at the bottom, which is the storage layer essentially. Sitting on top of that is YARN, which is essentially the basic execution engine in next-generation Hadoop. There are some applications that sit right on top of YARN; HBase is an example, and there are lots of other applications that just work through YARN to let you do things. There are other applications, like Pig and Hive, that can use the optimized newer engines like Tez, which itself works through YARN. So you could use applications there, and some of these applications have backends that work either with Tez or with Spark, going through YARN. Either way, there are lots of options. Now, the good news is, if you're an end user and you've gotten used to programming with Hive, you don't really have to change anything, because the backend implementation is the only thing that changes; as far as Hive is concerned, all the commands and everything look the same. One other thing to mention here is that Spark can actually run without YARN if needed. It can run directly on HDFS, and in fact it can run on other forms of storage also. And Spark has a rich application base that we'll look into in future modules.

So first, let's go to the basic YARN setup. It does support the classic MapReduce framework that you're used to, so if your applications have been written that way and you're comfortable with that, they'll work on YARN. But the nice thing is there's a rich set of applications on top of YARN, things like Storm and Impala that work through YARN, so if you're used to those applications, you don't have to change anything either. Of course, you can write your own user-developed application that runs through YARN. You will have to write the client, the application master, and all the management of the tasks and things like that, but you can build an application that way. The other thing that YARN enables is frameworks like Tez and Spark that sit on top of it. They talk to YARN for their resource requirements, but other than that they have their own mechanics and are self-supporting applications.

So we'll start off by looking at Tez. The main feature there is essentially that it can handle dataflow graphs, and there's an expressive API that allows you to define them; the framework is integrated with YARN, as I mentioned. What it lets you do is customize the application logic, so it doesn't necessarily need to fit the MapReduce framework. It lets you customize data formats, so there's no restriction like the key-value pairs of the MapReduce framework, and similarly there's some customization possible in how data is transferred. In the original MapReduce framework, if you had a complex graph of tasks, you would end up with a whole chain of MapReduce jobs.
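Just to make that concrete, here's a minimal sketch, not from the lecture, of what a two-stage computation looks like as classic MapReduce. It's written in Python with the mrjob library, and the job itself, a word count followed by a second pass that picks the most common word, is just an illustrative stand-in:

```python
# A two-step MapReduce job using the mrjob library (an assumed choice for
# illustration). Each MRStep is a separate MapReduce pass over the data.
from mrjob.job import MRJob
from mrjob.step import MRStep

class MostCommonWord(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max),
        ]

    def mapper_get_words(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_count_words(self, word, counts):
        # Step 1's output is materialized in full before step 2 can begin.
        yield None, (sum(counts), word)

    def reducer_find_max(self, _, count_word_pairs):
        # Step 2 is a second, separate pass that reads step 1's output back in.
        yield max(count_word_pairs)

if __name__ == "__main__":
    MostCommonWord.run()
```

When this runs on a Hadoop cluster, each MRStep becomes its own MapReduce job, and the first job's output is written out in full before the second one starts.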
Each of those jobs would essentially write to disk when it finished, you would end up with serialization of tasks in some cases, and there would be synchronization barriers between the jobs. With Tez you can get around that: the whole thing gets distilled into one job, and we'll see an example in the next slide. What this gives you is a lot of resource-efficiency improvements. For one, you're not doing as many HDFS writes and reads, because you've simplified the job. But there are also things like reusing resources where possible. A lot of times, when you have to spin up a container on YARN, there's a time delay involved, because it takes time to start things up. With Tez, you can reuse containers, so essentially you're not paying those startup costs, and you can cache data in some cases too. All of that makes things much faster. Now, as I mentioned, this improves resource-usage efficiency, but things like Pig and Hive already use Tez, so if you already use those tools you're not changing anything. We're just changing the backend execution engine, which is a very simple change, and then you get the performance benefits.

So let's look at a simple Hive-on-Tez example. We're just doing a select from a table a, joining a second table b on id, joining a third table c on itemid, and then grouping things. If you had written this in the original Hive MapReduce implementation, it would have turned into a bunch of MapReduce jobs: a job for selecting a.vendor, a job for joining a and c and selecting the cost, another join at the end where you take the data from b and a and join them, and a selection task on table b. So you can see this ended up as a lot of MapReduce jobs, which are also writing to HDFS at all of these intermediate steps; you can see the inefficiencies. Now, if you run the same query on Tez, the intermediate map steps are gone, you're not writing to disk, and you're reusing some of the containers and data, so you have a simplified graph and you end up with a lot better performance. This is one example of essentially how Tez works, and as I mentioned, these newer frameworks have a rich set of features and there are lots of advantages to using them.

Now, another example is Spark. This, again, is an advanced DAG execution engine. The nice thing is that it's very flexible: a lot of the common operations, like maps, filters, joins, group-bys, and so on, are easy to handle. It can also handle cyclic data flows, which is very important, because if you have an iterative algorithm, as in many machine learning, graph, or stream processing cases, it's going to be much more efficient, unlike other engines where you could do it, but it's going to be tedious and inefficient, and you might have a lot of intermediate data spills. Now, Spark will also keep track of the data produced during operations, which allows it to store working data sets in memory, and it's pretty graceful about spilling over to disk if it runs out of memory. So data can be shared across DAGs, can be shared between iterations, and can be reused. And this makes it much faster than MapReduce, or even some other DAG engines, because of the in-memory computing essentially.
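Here's a minimal sketch of what that caching looks like in practice. This is an illustration rather than something from the lecture; the input path and the filter strings are placeholders:

```python
# A minimal PySpark sketch of in-memory caching with graceful spill to disk.
# The input path and filter conditions are assumed, illustrative values.
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="CachingSketch")

lines = sc.textFile("hdfs:///logs/app.log")   # placeholder input path
errors = lines.filter(lambda line: "ERROR" in line)

# Keep the working set in memory, spilling to disk only if it doesn't fit.
errors.persist(StorageLevel.MEMORY_AND_DISK)

# Both actions below reuse the cached data instead of re-reading from HDFS.
print(errors.count())
print(errors.filter(lambda line: "timeout" in line).count())
```

The point is that the second action reuses the data the first one already pulled into memory, which is exactly the sharing of data across DAGs and iterations just described.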
The other nice thing with Spark is that the functionality can be accessed from Java, Scala, Python, and R, and these are all high-level languages that a lot of people are used to. So you get the big data processing advantages while writing things in a high-level language, and you'll see an example of this next. The other nice thing is that there's a rich suite of existing libraries that can handle things like graph processing, machine learning, and streaming applications, all kinds of data-intensive processing, and you can use them in isolation or in combination. So it's very generalized, which is very powerful, especially when you're writing in high-level languages.

So let's look at a simple example: logistic regression. If you're not familiar with Python, that's okay; I'm just going to point out a few things in this example. This example, as I mentioned, is written in Python. All we're doing is an iterative machine learning algorithm that goes through the same data set of points with a map-and-reduce process, basically computing a gradient at each step to optimize the model. So what we have, if you look at the code, is an iterative process that loops over several iterations and runs a similar map-and-reduce computation on the same set of data each time. What this means is that you can cache these data points in RAM across iterations, and that essentially makes things a lot faster. As you can see, there's a cache() statement on the first line, where we load the data, and that keeps the data in RAM across iterations. The other thing to notice is how easy it is to define the map and reduce functions; these are shown in red on the slide. In a high-level language, that's a lot easier to do. So we're looking at two things essentially: in-memory computing, and the ease of expressing your data-processing algorithm. If you run this code (and this is a standard example from the Spark site), it runs about 100x faster on Spark compared to Hadoop MapReduce, so there's a huge advantage in using memory. And as I mentioned, there'll be a much longer Spark module coming up later in this class.
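For reference, here's a version of that standard logistic regression example from the Spark site, filled out with assumed details (the input path, the number of features, and the parsing function) so that it's self-contained. The cache() call and the map and reduce lambdas are the pieces being pointed out above:

```python
# A sketch of the standard Spark logistic regression example, with assumed
# details (input path, feature count, input format) filled in so it runs.
from math import exp

import numpy as np
from pyspark import SparkContext

D = 10           # number of features (assumed)
ITERATIONS = 5   # number of gradient steps (assumed)

sc = SparkContext(appName="LogisticRegressionSketch")

def parse_point(line):
    # Assumed input format: a +1/-1 label followed by D feature values.
    values = [float(v) for v in line.split()]
    return values[0], np.array(values[1:])

# cache() keeps the parsed points in RAM across all the iterations below.
points = sc.textFile("hdfs:///data/points.txt").map(parse_point).cache()

w = np.random.ranf(size=D)  # initial weight vector

for i in range(ITERATIONS):
    # The map and reduce steps are plain Python lambdas; each iteration
    # re-reads the cached points from memory, not from HDFS.
    gradient = points.map(
        lambda p: (1.0 / (1.0 + exp(-p[0] * w.dot(p[1]))) - 1.0) * p[0] * p[1]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final weights: %s" % w)
```

Because the points RDD is cached, only the first iteration pays the cost of reading and parsing the input; every later iteration works entirely against memory, which is where the roughly 100x speedup over disk-based MapReduce comes from.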