Contents

1 Example
2 Dataflow
  2.1 Input reader
  2.2 Map function
  2.3 Partition function
  2.4 Comparison function
  2.5 Reduce function
  2.6 Output writer
3 Distribution and reliability
4 Uses
5 Implementations
6 References
7 External links
  7.1 Papers

MapReduce is a software framework introduced by Google to support parallel computations over large (multiple petabyte[1]) data sets on clusters of computers. The framework is largely inspired by the map and reduce functions commonly used in functional programming,[2] although the actual semantics of the framework are not the same.[3] MapReduce implementations have been written in C++, Java, Python and other languages.
Example
The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:
map(String name, String document):
    // key: document name
    // value: document contents
    for each word w in document:
        EmitIntermediate(w, 1);

reduce(String word, Iterator partialCounts):
    // key: a word
    // values: a list of aggregated partial counts
    int result = 0;
    for each v in partialCounts:
        result += ParseInt(v);
    Emit(result);
Here, each document is split into words, and each word is counted with an initial value of "1" by the Map function, using the word as the result key. The framework groups together all the pairs with the same key and feeds them to the same call to Reduce, so that function only needs to sum all of its input values to find the total appearances of that word.
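The pseudocode above can be exercised with a minimal, single-process Python sketch. Note this is only an illustration of the dataflow (the function names and the in-memory grouping step are ours, not part of any MapReduce API):

```python
from collections import defaultdict

def map_fn(name, document):
    """Emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    """Sum the partial counts for a single word."""
    return (word, sum(partial_counts))

def run_job(documents):
    """Group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for name, doc in documents.items():
        for key, value in map_fn(name, doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = run_job({"d1": "a rose is a rose", "d2": "a daisy"})
print(counts)  # {'a': 3, 'rose': 2, 'is': 1, 'daisy': 1}
```

In a real framework the grouping ("shuffle") step is distributed across machines; here it is simulated with a dictionary.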
Dataflow
The frozen part of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
an input reader
a Map function
a partition function
a compare function
a Reduce function
an output writer
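Of these hot spots, the partition function is the one that routes each intermediate key to one of the reduce tasks; the usual default is a hash of the key modulo the number of reducers. A minimal sketch (the function name and signature are illustrative, not any framework's actual API):

```python
def default_partition(key, num_reducers):
    # Every map task must apply the same deterministic hash so that
    # all values for a given key land at the same reduce task.
    # (Python's built-in hash() is stable only within one process;
    # a real framework would use a stable cross-machine hash.)
    return hash(key) % num_reducers

# All occurrences of the same key map to the same reducer index.
r1 = default_partition("apple", 4)
r2 = default_partition("apple", 4)
assert r1 == r2 and 0 <= r1 < 4
```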
Uses
MapReduce is useful in a wide range of applications, including "distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation...". Most significantly, when MapReduce was finished, it was used to completely regenerate Google's index of the World Wide Web, replacing the old ad hoc programs that updated the index and ran the various analyses.[4]

MapReduce's stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reduce tasks.

David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared-nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is.[5] They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation".[5] In their view, MapReduce advocates promote the tool without paying attention to years of academic and commercial database research and real-world use[citation needed]. MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.[6]
Implementations
The Google MapReduce framework is implemented in C++ with interfaces in Python and Java.
The Hadoop project is a free open-source Java MapReduce implementation.
Greenplum is a commercial MapReduce implementation, with support for Python, Perl, SQL and other languages.
Phoenix [1] is a shared-memory implementation of MapReduce implemented in C.
MapReduce has also been implemented for the Cell Broadband Engine, also in C.[2]
MapReduce has been implemented on NVIDIA GPUs (graphics processors) using CUDA [3].
Qt Concurrent is a simplified version of the framework, implemented in C++, used for distributing a task between multiple processor cores.
CouchDB uses a MapReduce framework for defining views over distributed documents.
Skynet is an open-source Ruby implementation of Google's MapReduce framework.
Disco is an open-source MapReduce implementation by Nokia. Its core is written in Erlang and jobs are normally written in Python.
Aster Data Systems nCluster In-Database MapReduce implements MapReduce inside the database.
References
Specific references:

1. ^ Google spotlights data center inner workings | Tech news blog - CNET News.com
2. ^ "Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages." - "MapReduce: Simplified Data Processing on Large Clusters", by Jeffrey Dean and Sanjay Ghemawat; from Google Labs
3. ^ "Google's MapReduce Programming Model -- Revisited" - paper by Ralf Lammel; from Microsoft
4. ^ "How Google Works". baselinemag.com. "As of October, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days, according to a presentation by Dean. Among other things, these batch routines analyze the latest Web pages and update Google's indexes."
5. ^ a b David DeWitt; Michael Stonebraker. "MapReduce: A major step backwards". databasecolumn.com. Retrieved on 2008-08-27.
6. ^ David DeWitt; Michael Stonebraker. "MapReduce II". databasecolumn.com. Retrieved on 2008-08-27.

General references:
Dean, Jeffrey & Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". Retrieved Apr. 6, 2005.
MapReduce: A major step backwards

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of the N nodes, for a total of N * M files: F(i,j), 1 <= i <= N, 1 <= j <= M. The key thing to observe is that all map instances use the same hash function; hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, R(j), 1 <= j <= M. The input for each reduce instance R(j) consists of the files F(i,j), 1 <= i <= N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language; hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

We now turn to the five concerns we have with this computing paradigm.

1. MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.
Schemas are good.
Separation of the schema from the application is good.
High-level access languages are good.
MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but the programmer must also find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.

During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:
By stating what you want - rather than presenting an algorithm for how to get it (relational view)
By presenting an algorithm for data access (Codasyl view)
The result is now ancient history, but the entire world saw the value of high-level languages, and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly, nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.

2. MapReduce is a poor implementation

All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search. MapReduce has no indexes and therefore has only brute force as a processing option.
It will be creamed whenever an index is the better access mechanism.

One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built, including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata. In summary, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, making the execution time for the computation the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run.
With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.

3. MapReduce is not novel

The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7] Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash-based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.
Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years -- exactly the techniques that the MapReduce crowd claims to have invented.

While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.

4. MapReduce is missing features

All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:
Bulk loader -- to transform input data in files into a desired format and load it into a DBMS
Indexing -- as noted above
Updates -- to change the data in the database
Transactions -- to support parallel update and recovery from failures during update
Integrity constraints -- to help keep garbage out of the database
Referential integrity -- again, to help keep garbage out of the database
Views -- so the schema can change without having to rewrite the application program
In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.

5. MapReduce is incompatible with the DBMS tools

A modern SQL DBMS has available all of the following classes of tools:
Report writers (e.g., Crystal Reports) to prepare reports for human visualization
Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses
Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets
Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another
Database design tools (e.g., Embarcadero) to assist the user in constructing a database
MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.

In Summary

It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.

We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally, we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.
MapReduce II
By David DeWitt on January 25, 2008

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

Last week's MapReduce post attracted tens of thousands of readers and generated many comments, almost all of them attacking our critique. Just to let you know, we don't hold a personal grudge against MapReduce. MapReduce didn't kill our dog, steal our car, or try and date our daughters.

Our motivations for writing about MapReduce stem from MapReduce being increasingly seen as the most advanced and/or only way to analyze massive datasets. Advocates promote the tool without seemingly paying attention to years of academic and commercial database research and real-world use. The point of our initial post was to say that there are striking similarities between MapReduce and a fairly primitive parallel database system. As such, MapReduce can be significantly improved by learning from the parallel database community.

So, hold off on your comments for just a few minutes, as we will spend the rest of this post addressing four specific topics brought up repeatedly by those who commented on our previous blog:

1. MapReduce is not a database system, so don't judge it as one
2. MapReduce has excellent scalability; the proof is Google's use
3. MapReduce is cheap and databases are expensive
4. We are the old guard trying to defend our turf/legacy from the young turks
Feedback No. 1: MapReduce is not a database system, so don't judge it as one

It's not that we don't understand this viewpoint. We are not claiming that MapReduce is a database system. What we are saying is that, like a DBMS + SQL + analysis tools, MapReduce can be and is being used to analyze and perform computations on massive datasets. So we aren't judging apples and oranges. We are judging two approaches to analyzing massive amounts of information, even for less structured information.

To illustrate our point, assume that you have two very large files of facts. The first file contains structured records of the form:

    Rankings (pageURL, pageRank)

Records in the second file have the form:

    UserVisits (sourceIPAddr, destinationURL, date, adRevenue)

Someone might ask, "What IP address generated the most ad revenue during the week of January 15th to the 22nd, and what was the average page rank of the pages visited?"
This question is a little tricky to answer in MapReduce because it consumes two data sets rather than one, and it requires a "join" of the two datasets to find pairs of Rankings and UserVisits records that have matching values for pageURL and destinationURL. In fact, it appears to require three MapReduce phases, as noted below.

Phase 1

This phase filters out UserVisits records that are outside the desired date range and then "joins" the qualifying records with records from the Rankings file.
Map program: The map program scans through UserVisits and Rankings records. Each UserVisits record is filtered on the date range specification. Qualifying records are emitted with composite keys of the form <destinationURL, T1>, where T1 indicates that it is a UserVisits record. Rankings records are emitted with composite keys of the form <pageURL, T2> (T2 is a tag indicating it is a Rankings record). Output records are repartitioned using a user-supplied partitioning function that hashes only on the URL portion of the composite key.

Reduce program: The input to the reduce program is a single sorted run of records in URL order. For each unique URL, the program splits the incoming records into two sets (one for Rankings records and one for UserVisits records) using the tag component of the composite key. To complete the join, reduce finds all matching pairs of records of the two sets. Output records are in the form Temp1 (sourceIPAddr, pageURL, pageRank, adRevenue).
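The tag-based join inside the Phase 1 reduce can be sketched in Python. The Temp1 record layout and the T1/T2 tags follow the description above; the function name and record shapes are otherwise illustrative:

```python
def phase1_reduce(url, tagged_records):
    """Join Rankings and UserVisits records that share one URL.

    tagged_records: iterable of (tag, record) pairs, where tag is
    'T1' for a UserVisits record (sourceIPAddr, adRevenue) and
    'T2' for a Rankings record (its pageRank value).
    """
    visits, rankings = [], []
    for tag, rec in tagged_records:
        (visits if tag == "T1" else rankings).append(rec)
    # Emit one Temp1 record per matching (visit, ranking) pair.
    for src_ip, ad_revenue in visits:
        for page_rank in rankings:
            yield (src_ip, url, page_rank, ad_revenue)

out = list(phase1_reduce("u.com", [("T1", ("1.2.3.4", 10.0)),
                                   ("T2", 0.5)]))
# out == [('1.2.3.4', 'u.com', 0.5, 10.0)]
```

The two nested `for` loops are exactly the nested-loops behavior discussed next: every "inner" record is revisited for each "outer" record.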
The reduce program must be capable of handling the case in which one or both of these sets with the same URL are too large to fit into memory and must be materialized on disk. Since access to these sets is through an iterator, a straightforward implementation will result in what is termed a nested-loops join. This join algorithm is known to have very poor I/O characteristics, as the "inner" set is scanned once for each record of the "outer" set.

Phase 2

This phase computes the total ad revenue and average page rank for each source IP address.
Map program: Scan Temp1 using the identity function on sourceIPAddr.

Reduce program: The reduce program makes a linear pass over the data. For each sourceIPAddr, it sums the ad revenue and computes the average page rank, retaining the sourceIPAddr with the maximum total ad revenue. Each reduce worker then outputs a single record of the form Temp2 (sourceIPAddr, total_adRevenue, average_pageRank).
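The per-IP aggregation inside the Phase 2 reduce can be sketched as follows (the worker-level retention of the maximum-revenue IP is omitted for brevity; names and record shapes are illustrative):

```python
def phase2_reduce(src_ip, temp1_records):
    """Sum ad revenue and average page rank for one source IP."""
    total_revenue = 0.0
    rank_sum = 0.0
    n = 0
    for _src, _url, page_rank, ad_revenue in temp1_records:
        total_revenue += ad_revenue
        rank_sum += page_rank
        n += 1
    return (src_ip, total_revenue, rank_sum / n)

rec = phase2_reduce("1.2.3.4",
                    [("1.2.3.4", "a.com", 0.5, 10.0),
                     ("1.2.3.4", "b.com", 1.5, 2.0)])
# rec == ('1.2.3.4', 12.0, 1.0)
```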
Phase 3
Map program: The program uses a single map worker that scans Temp2 and outputs the record with the maximum value for total_adRevenue.
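Phase 3 is simply a maximum over the (small) Temp2 file, which a single worker can compute in one pass. A sketch using the Temp2 layout described above:

```python
def phase3_map(temp2_records):
    # A single map worker scans all of Temp2 and keeps the record
    # with the largest total_adRevenue (field index 1).
    return max(temp2_records, key=lambda rec: rec[1])

best = phase3_map([("1.1.1.1", 12.0, 1.0),
                   ("2.2.2.2", 50.0, 0.4),
                   ("3.3.3.3", 7.5, 0.9)])
# best == ('2.2.2.2', 50.0, 0.4)
```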
We realize that portions of the processing steps described above are handled automatically by the MapReduce infrastructure (e.g., sorting and partitioning the records). Although we have not written this program, we estimate that the custom parts of the code (i.e., the map() and reduce() functions) would require substantially more code than the two fairly simple SQL statements that do the same:

Q1:

    Select sourceIPAddr, avg(pageRank) as avgPR, sum(adRevenue) as adTotal
    into Temp
    from Rankings, UserVisits
    where Rankings.pageURL = UserVisits.destinationURL
      and date > "Jan 14" and date < "Jan 23"
    group by sourceIPAddr

Q2:

    Select sourceIPAddr, adTotal, avgPR
    from Temp
    where adTotal = max(adTotal)

No matter what you think of SQL, eight lines of code is almost certainly easier to write and debug than the programming required for MapReduce. We believe that MapReduce advocates should consider the advantages that layering a high-level language like SQL could provide to users of MapReduce. Apparently we're not alone in this assessment, as efforts such as PigLatin and Sawzall appear to be promising steps in this direction.

We also firmly believe that augmenting the input files with a schema would provide the basis for improving the overall performance of MapReduce applications by allowing B-trees to be created on the input data sets and techniques like hash partitioning to be applied. These are technologies in widespread practice in today's parallel DBMSs, of which there are quite a number on the market, including ones from IBM, Teradata, Netezza, Greenplum, Oracle, and Vertica. All of these should be able to execute this program with the same or better scalability and performance than MapReduce. Here's how these capabilities could benefit MapReduce:

1. Indexing. The filter (date > "Jan 14" and date < "Jan 23") condition can be executed by using a B-tree index on the date attribute of the UserVisits table, avoiding a sequential scan of the entire table.

2. Data movement.
When you load files into a distributed file system prior to running MapReduce, data items are typically assigned to blocks/partitions in sequential order. As records are loaded into a table in a parallel database system, it is standard practice to apply a hash function to an attribute value to determine which node the record should be stored on (the same basic idea as is used to determine which reduce worker should get an output record from a map instance). For example, records being loaded into the Rankings and UserVisits tables might be mapped to a node by hashing on the pageURL and destinationURL attributes, respectively. If loaded this way, the join of Rankings and UserVisits in Q1 above would be performed completely locally, with absolutely no data movement between nodes.

Furthermore, as result records from the join are materialized, they will be pipelined directly into a local aggregate computation without first being written to disk. This local aggregate operator partially computes the two aggregates (sum and average) concurrently (what is called a combiner in MapReduce terminology). These partial aggregates are then repartitioned by hashing on the sourceIPAddr to produce the final results for Q1.

It is certainly the case that you could do the same thing in MapReduce by using hashing to map records to chunks of the file and then modifying the MapReduce program to exploit the knowledge of how the data was loaded. But in a database, physical data independence happens automatically. When Q1 is "compiled," the query optimizer will extract partitioning information about the two tables from the schema. It will then generate the correct query plan based on this partitioning information (e.g., maybe Rankings is hash partitioned on pageURL but UserVisits is hash partitioned on sourceIPAddr). This happens transparently to any user (modulo changes in response time) who submits a query involving a join of the two tables.

3. Column representation. Many queries access only a subset of the fields of the input files. A column store does not need to read the fields that are not used.

4. Push, not pull. MapReduce relies on the materialization of the output files from the map phase on disk for fault tolerance. Parallel database systems push the intermediate files directly to the receiving (i.e., reduce) nodes, avoiding writing the intermediate results and then reading them back as they are pulled by the reduce computation.
This materialization gives MapReduce far superior fault tolerance, but at the expense of additional I/Os.

In general, we expect these mechanisms to provide about a factor of 10 to 100 performance advantage, depending on the selectivity of the query, the width of the input records to the map computation, and the size of the output files from the map phase. As such, we believe that 10 to 100 parallel database nodes can do the work of 1,000 MapReduce nodes.

To further illustrate our point, suppose you have a more general filter, F, a more general group-by function, G, and a more general reduce function, R. PostgreSQL (an open source, free DBMS) allows the following SQL query over a table T:

    Select R(T)
    from T
    where F(T)
    group by G(T)
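The shape of this generalized query -- filter with F, group with G, reduce each group with R -- can be mirrored by a short Python pipeline. Here F, G, and R are arbitrary user-supplied callables, standing in for the user-defined functions of an extended SQL engine; the names and the example data are ours:

```python
from itertools import groupby

def run_query(table, F, G, R):
    """Select R(group) from table where F(row) group by G(row)."""
    rows = [t for t in table if F(t)]
    rows.sort(key=G)  # groupby requires input sorted by the key
    return {key: R(list(group)) for key, group in groupby(rows, key=G)}

# Example: total salary per department for employees earning < 100.
table = [("shoes", 40), ("shoes", 60), ("toys", 30), ("toys", 200)]
result = run_query(table,
                   F=lambda t: t[1] < 100,
                   G=lambda t: t[0],
                   R=lambda g: sum(s for _, s in g))
# result == {'shoes': 100, 'toys': 30}
```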
F, R, and G can be written in a general-purpose language like C or C++. A SQL engine, extended with user-defined functions and aggregates, has nearly -- if not all -- the generality of MapReduce. As such, we claim that most things that are possible in MapReduce are also possible in a SQL engine. Hence, it is exactly appropriate to compare the two approaches. We are working on a more complete paper that demonstrates the relative performance and relative programming effort of the two approaches, so stay tuned.

Feedback No. 2: MapReduce has excellent scalability; the proof is Google's use

Many readers took offense at our comment about scaling and asserted that since Google runs MapReduce programs on 1,000s (perhaps 10s of 1,000s) of nodes, it must scale well. Having started benchmarking database systems 25 years ago (yes, in 1983), we believe in a more scientific approach toward evaluating the scalability of any system for data-intensive applications.

Consider the following scenario. Assume that you have a 1 TB data set that has been partitioned across 100 nodes of a cluster (each node will have about 10 GB of data). Further assume that some MapReduce computation runs in 5 minutes if 100 nodes are used for both the map and reduce phases. Now scale the dataset to 10 TB, partition it over 1,000 nodes, and run the same MapReduce computation using those 1,000 nodes. If the performance of MapReduce scales linearly, it will execute the same computation on 10x the amount of data using 10x more hardware in the same 5 minutes. Linear scaleup is the gold standard for measuring the scalability of data-intensive applications. As far as we are aware, there are no published papers that study the scalability of MapReduce in a controlled, scientific fashion. MapReduce may indeed scale linearly, but we have not seen published evidence of this.

Feedback No. 3: MapReduce is cheap and databases are expensive

Every organization has a "build" versus "buy" decision, and we don't question the decision by Google to roll its own data analysis solution. We also don't intend to defend DBMS pricing by the commercial vendors. What we wanted to point out is that we believe it is possible to build a version of MapReduce with more functionality and better performance. Pig is an excellent step in this direction. Also, we want to mention that there are several open source (i.e., free) DBMSs, including PostgreSQL, MySQL, Ingres, and BerkeleyDB. Several of the aforementioned parallel DBMS companies have increased the scale of these open source systems by adding parallel computing extensions.

A number of individuals also commented that SQL and the relational data model are too restrictive. Indeed, the relational data model might very well be the wrong data model for the types of datasets that MapReduce applications are targeting. However, there is considerable ground between the relational data model and no data model at all. The point we were trying to make is that developers writing business applications have benefited significantly from the notion of organizing data in the database according to a data model and accessing that data through a declarative query language. We don't care what that language or model is. Pig, for example, employs a nested relational model, which gives developers more flexibility than a traditional 1NF model allows.
Feedback No. 4: We are the old guard trying to defend our turf/legacy from the young turks Since both of us are among the "gray beards" and have been on this earth about 2 Giga-seconds, we have seen a lot of ideas come and go. We are constantly struck by the following two observations:
How insular computer science is. The propagation of ideas from sub-discipline to sub-discipline is very slow and sketchy. Most of us are content to do our own thing, rather than learn what other sub-disciplines have to offer.

How little knowledge is passed from generation to generation. In a recent paper entitled "What Goes Around Comes Around" (M. Stonebraker and J. Hellerstein, Readings in Database Systems, 4th edition, MIT Press, 2004), one of us noted that many current database ideas were tried a quarter of a century ago and discarded. However, such knowledge does not seem to be passed down from the "gray beards" to the "young turks." The young turks and the gray beards usually aren't, and shouldn't be, adversaries.
Thanks for stopping by the "pasture" and reading this post. We look forward to reading your feedback, comments and alternative viewpoints.
The Basics
The first step in building a parallel program is identifying sets of tasks that can run concurrently and/or partitions of data that can be processed concurrently. Sometimes this is just not possible. Consider a Fibonacci function:
F(k+2) = F(k) + F(k+1)
A function to compute this, based on the form above, cannot be "parallelized" because each computed value depends on previously computed values. A more common situation is having a large amount of consistent data which must be processed. If the data can be decomposed into equal-size partitions, we can devise a parallel solution. Consider a huge array which can be broken up into sub-arrays. If the same processing is required for each array element, with no dependencies in the computations and no communication required between tasks, we have an ideal parallel computing opportunity. A common implementation technique for this is called master/worker.

The MASTER:

initializes the array and splits it up according to the number of available WORKERS
sends each WORKER its subarray
receives the results from each WORKER

The WORKER:

receives the subarray from the MASTER
performs processing on the subarray
returns results to MASTER
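The master/worker scheme described above can be sketched in plain Java (the class and method names here are our own, for illustration only): the master splits an array among a pool of worker threads, each worker squares its sub-array, and the master gathers the results back in order.

```java
import java.util.*;
import java.util.concurrent.*;

public class MasterWorker {
    // WORKER: processes its subarray independently (here: squares each element)
    static int[] process(int[] subarray) {
        int[] result = new int[subarray.length];
        for (int i = 0; i < subarray.length; i++) result[i] = subarray[i] * subarray[i];
        return result;
    }

    // MASTER: splits the array, hands each piece to a worker, gathers the results
    static int[] run(int[] data, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        int chunk = data.length / workers;
        List<Future<int[]>> futures = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            int from = w * chunk;
            int to = (w == workers - 1) ? data.length : from + chunk;
            int[] sub = Arrays.copyOfRange(data, from, to);
            futures.add(pool.submit(() -> process(sub)));
        }
        int[] out = new int[data.length];
        int pos = 0;
        for (Future<int[]> f : futures)
            for (int v : f.get()) out[pos++] = v;   // receive results in order
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        int[] result = run(new int[]{1, 2, 3, 4, 5, 6}, 3);
        System.out.println(Arrays.toString(result)); // [1, 4, 9, 16, 25, 36]
    }
}
```

Because no worker depends on another's output, this is the "ideal parallel computing opportunity" mentioned above: the only coordination is the split and the final gather.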
This model implements static load balancing which is commonly used if all tasks are performing the same amount of work on identical machines. In general, load balancing refers to techniques which try to spread tasks among the processors in a parallel system to avoid some processors being idle while others have tasks queueing up for execution.
A static load balancer allocates tasks to processors before execution begins and takes no account of the current load. Dynamic algorithms are more flexible, though more computationally expensive, and give some consideration to the current load before allocating a new task to a processor. As an example of the MASTER/WORKER technique, consider one of the methods for approximating pi. The first step is to inscribe a circle inside a square. The area of the square is As = (2r)^2 = 4r^2, where r is the radius of the circle. The area of the circle is Ac = pi * r^2. So:
pi  = Ac / r^2
As  = 4r^2
r^2 = As / 4
pi  = 4 * Ac / As
The reason for all this algebraic manipulation is that we can parallelize the method in the following way:

1. Randomly generate points in the square.
2. Count the number of generated points that are both in the circle and in the square.
3. r = the number of points in the circle divided by the number of points in the square.
4. pi = 4 * r
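The four steps above fit in a few lines of Java. In this sketch (class and method names are ours), two independent "workers" each generate points and count hits, and their counts are summed in a final reduce-like step; a real parallel version would run countHits() on separate processors. For simplicity the sketch samples the unit square and a quarter circle, which gives the same Ac/As ratio as the full figure.

```java
import java.util.Random;

public class PiEstimate {
    // One "worker": generate points in the unit square, count hits inside the quarter circle
    static long countHits(long points, long seed) {
        Random rnd = new Random(seed);
        long hits = 0;
        for (long i = 0; i < points; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        long points = 1_000_000;
        // Two independent workers; summing their counts is the "reduce" step
        long hits = countHits(points, 1) + countHits(points, 2);
        double pi = 4.0 * hits / (2 * points);   // pi = 4 * (points in circle / points in square)
        System.out.println(pi);                  // prints an approximation close to 3.14
    }
}
```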
What is MapReduce?
Now that we have seen some basic examples of parallel programming, we can look at the MapReduce programming model. This model derives from the map and reduce combinators from a functional language like Lisp. In Lisp, a map takes as input a function and a sequence of values. It then applies the function to each value in the sequence. A reduce combines all the elements of a sequence using a binary operation. For example, it can use "+" to add up all the elements in the sequence. MapReduce is inspired by these concepts. It was developed within Google as a mechanism for processing large amounts of raw data, for example, crawled documents or web request logs. This data is so large, it
must be distributed across thousands of machines in order to be processed in a reasonable time. This distribution implies parallel computing since the same computations are performed on each CPU, but with a different dataset. MapReduce is an abstraction that allows Google engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing and fault tolerance. Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the reduce function. The reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. [1] Consider the problem of counting the number of occurrences of each word in a large collection of documents:
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result)); [1]
The map function emits each word plus an associated count of occurrences ("1" in this example). The reduce function sums together all the counts emitted for a particular word.
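The same word count can be expressed in ordinary single-machine Java, which makes the three phases -- map, group-by-key, and reduce -- easy to see. This is a sketch with our own class and method names, not the Google or Hadoop API:

```java
import java.util.*;

public class WordCount {
    // "Map": emit a (word, 1) pair for every word in every document
    static List<Map.Entry<String, Integer>> map(List<String> documents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : documents)
            for (String word : doc.split("\\s+"))
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        return pairs;
    }

    // Group by key, then "Reduce": sum the partial counts for each word
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("to be or not to be", "be happy");
        System.out.println(reduce(map(docs))); // {be=3, happy=1, not=1, or=1, to=2}
    }
}
```

In a real MapReduce run, the framework performs the grouping (the shuffle) and invokes reduce once per distinct key; here the TreeMap plays both roles.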
key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.

5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. If the amount of intermediate data is too large to fit in memory, an external sort is used.

6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key encountered, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.

7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code. After successful completion, the output of the MapReduce execution is available in the R output files. [1]

To detect failure, the master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Completed map tasks are re-executed when failure occurs because their output is stored on the local disk(s) of the failed machine and is therefore inaccessible. Completed reduce tasks do not need to be re-executed since their output is stored in a global file system.
MapReduce Examples
Here are a few simple examples of interesting programs that can be easily expressed as MapReduce computations.

Distributed Grep: The map function emits a line if it matches a given pattern. The reduce function is an identity function that just copies the supplied intermediate data to the output.

Count of URL Access Frequency: The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Reverse Web-Link Graph: The map function outputs <target, source> pairs for each link to a target URL found in a page named "source". The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>.

Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of <word, frequency> pairs. The map function emits a <hostname, term vector> pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final <hostname, term vector> pair.

Inverted Index: The map function parses each document, and emits a sequence of <word, document ID> pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a <word, list(document ID)> pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions. [1]
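As a concrete illustration of the inverted index example, here is a minimal single-machine sketch in plain Java (class and method names are ours; a real MapReduce job would distribute the map and reduce phases across machines):

```java
import java.util.*;

public class InvertedIndex {
    // Map phase: emit <word, docId> for each word in each document.
    // Reduce phase: collect the sorted list of document IDs per word.
    static Map<String, List<String>> build(Map<String, String> docs) {
        Map<String, TreeSet<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet())        // map: parse each document
            for (String word : doc.getValue().split("\\s+"))
                grouped.computeIfAbsent(word, k -> new TreeSet<>()).add(doc.getKey());
        Map<String, List<String>> index = new TreeMap<>();           // reduce: sorted id list per word
        for (Map.Entry<String, TreeSet<String>> e : grouped.entrySet())
            index.put(e.getKey(), new ArrayList<>(e.getValue()));
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("d1", "map reduce");
        docs.put("d2", "map shuffle");
        System.out.println(build(docs)); // {map=[d1, d2], reduce=[d1], shuffle=[d2]}
    }
}
```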
Sponsored by:
About MapReduce
MapReduce is a programming model specifically implemented for processing large data sets. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google (see "MapReduce: Simplified data processing on large clusters"). At its core, MapReduce is a combination of two functions -- map() and reduce(), as its name would suggest. A quick look at a sample Java program will help you get your bearings in MapReduce. This application implements a very simple version of the MapReduce framework, but isn't built on Hadoop. The simple, abstracted program will illustrate the core parts of the MapReduce framework and the terminology associated with it. The application creates some strings, counts the number of characters in each string, and finally sums them up to show the total number of characters altogether. Listing 1 contains the program's Main class.
Listing 1 just instantiates a class called MyMapReduce, which is shown in Listing 2. Listing 2. MyMapReduce.java
import java.util.*;

public class MyMapReduce ...
Download complete Listing 2

As you see, the crux of the class lies in just four functions:
The init() method creates some dummy data (just 30 strings). This data serves as the input data for the program. Note that in the real world, this input could be gigabytes, terabytes, or petabytes of data!

The step1ConvertIntoBuckets() method segments the input data. In this example, the data is divided into five smaller chunks and put inside an ArrayList named buckets. You can see that the method takes a list, which contains all of the input data, and another int value, numberOfBuckets. This value has been hardcoded to five; if you divide 30 strings into five buckets, each bucket will have six strings. Each bucket in turn is represented as an ArrayList. These array lists are finally put into another list and returned. So, at the end of the function, you have an array list with five buckets (array lists) of six strings each. These buckets can be put in memory (as in this case), saved to disk, or put onto different nodes in a cluster!

step2RunMapFunctionForAllBuckets() is the next method invoked from init(). This method internally creates five threads (because there are five buckets -- the idea is to start a thread for each bucket). The class responsible for threading is StartThread, which is implemented as an inner class. Each thread processes each bucket and puts the individual result in another array list named intermediateresults. All the computation and threading takes place within the same JVM, and the whole process runs on a single machine.
If the buckets were on different machines, a master should be monitoring them to know when the computation is over, whether there are any failures in processing on any of the nodes, and so on. It would be great if the master could perform the computations on different nodes, rather than bringing the data from all nodes to the master itself and executing it there.

The step3RunReduceFunctionForAllBuckets() method collates the results from intermediateresults, sums them up, and gives you the final output. Note that intermediateresults needs to combine the results from the parallel processing explained in the previous bullet point. The exciting part is that this process also can happen concurrently!
Assume that, in some way, the data is divided into smaller chunks and is inserted into buckets. You have a total of 10 buckets now, with 10,000 elements of data within each of them. (Don't bother worrying about who exactly does the dividing at the moment.)

Apply a function named map(), which in turn executes your search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the result (of processing of each bucket) in another set of buckets, called result buckets. Note that there may be more than one result bucket.

Apply a function named reduce() on each of these result buckets. This function iterates through the result buckets, takes in each value, and then performs some kind of processing, if needed. The processing may either aggregate the individual values or apply some kind of business logic on the aggregated or individual values. This functionality once again takes place concurrently.

Finally, you will get the result you expected.
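These steps can be sketched in plain Java. The sketch below uses hypothetical method names that loosely mirror the article's step1/step2/step3 structure (it is not the article's actual listing): it buckets a list of strings, counts characters per bucket with one thread per bucket, and reduces the per-bucket counts to a total.

```java
import java.util.*;
import java.util.concurrent.*;

public class BucketCount {
    // Step 1: split the input strings into numberOfBuckets roughly equal buckets
    static List<List<String>> toBuckets(List<String> data, int numberOfBuckets) {
        List<List<String>> buckets = new ArrayList<>();
        int per = (int) Math.ceil((double) data.size() / numberOfBuckets);
        for (int i = 0; i < data.size(); i += per)
            buckets.add(data.subList(i, Math.min(i + per, data.size())));
        return buckets;
    }

    // Step 2 (map): count characters in each bucket, one thread per bucket
    static List<Integer> mapAll(List<List<String>> buckets) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
        List<Future<Integer>> partial = new ArrayList<>();
        for (List<String> bucket : buckets)
            partial.add(pool.submit(() -> bucket.stream().mapToInt(String::length).sum()));
        List<Integer> intermediateResults = new ArrayList<>();
        for (Future<Integer> f : partial) intermediateResults.add(f.get());
        pool.shutdown();
        return intermediateResults;
    }

    // Step 3 (reduce): sum the per-bucket results into the final answer
    static int reduce(List<Integer> intermediateResults) {
        return intermediateResults.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) throws Exception {
        List<String> data = Arrays.asList("alpha", "beta", "gamma", "delta", "epsilon", "zeta");
        System.out.println(reduce(mapAll(toBuckets(data, 3)))); // 30
    }
}
```

Everything here runs in one JVM; the point of MapReduce is that the buckets, the map threads, and even the reduce step could live on different machines.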
These four steps are very simple but there is so much power in them! Let's look at the details.
A key difference is that the computation moves to the place where the data resides. If you write a C++ or Java program to process data on multiple threads, the program fetches data from a data source (typically a remote database server) and is usually executed on the machine where your application is running. In MapReduce implementations, the computation happens on the distributed nodes.
Apache Hadoop
Hadoop is an open source implementation of the MapReduce programming model. Hadoop relies not on Google File System (GFS), but on its own Hadoop Distributed File System (HDFS). HDFS replicates data blocks in a reliable manner and places them on different nodes; computation is then performed by Hadoop on these nodes. HDFS is similar to other filesystems, but is designed to be highly fault tolerant. This distributed filesystem does not require any high-end hardware, but can run on commodity computers and software; it is also scalable, which is one of the primary design goals for the implementation. HDFS is independent of any specific hardware or software platform, and is hence easily portable across heterogeneous systems. If you've worked with clustered Java EE applications, you're probably familiar with the concepts of a master instance that manages other instances of the application server (called slaves) in a network deployment architecture. These master instances may be called deployment managers (if you're using WebSphere), manager servers (with WebLogic) or admin servers (with Tomcat). It is the responsibility of the master server instance to delegate various responsibilities to slave application server instances, to listen for handshaking signals from each instance so as to decide which are alive and which are dead, to do IP multicasting whenever required for synchronization of serializable sessions and data, and other similar tasks. The master stores the metadata and relevant port information of the slaves and works in a collaborative manner so that the end user feels as if there is only one instance. HDFS works more or less in a similar way. In the HDFS architecture, the master is called a NameNode and the slaves are called DataNodes. There is only a single NameNode in HDFS, whereas there are many DataNodes across the cluster, usually one per node. 
HDFS allocates a namespace (similar to a package in Java, a tablespace in Oracle, or a namespace in C++) for storing user data. A file might be split into one or more data blocks, and these data blocks are kept in a set of DataNodes. The NameNode holds the necessary metadata: how files map to blocks and which blocks are stored on which DataNodes. Note that not all requests to DataNodes need to pass through the NameNode. All of the filesystem's client requests for reading and writing are processed directly by the DataNodes, whereas namespace operations like the opening, closing, and renaming of directories are performed by the NameNode. The NameNode is responsible for issuing instructions to DataNodes for data block creation, replication, and deletion. A typical deployment of HDFS has a dedicated machine that runs only the NameNode. Each of the other
machines in the cluster typically runs one instance of the DataNode software, though the architecture does allow you to run multiple DataNodes on the same machine. The NameNode is concerned with metadata repository and control, but otherwise never handles user data. The NameNode uses a special kind of log, named EditLog, for the persistence of metadata.
Deploying Hadoop
Though Hadoop is a pure Java implementation, you can use it in two different ways. You can either take advantage of a streaming API provided with it or use Hadoop pipes. The latter option allows you to build Hadoop apps with C++; this article will focus on the former. Hadoop's main design goal is to provide storage and communication on lots of homogeneous commodity machines. The implementers selected Linux as their initial platform for development and testing; hence, if you're working with Hadoop on Windows, you will have to install separate software to mimic the shell environment. Hadoop can run in three different ways, depending on how the processes are distributed:
Standalone mode: This is the default mode provided with Hadoop. Everything is run as a single Java process.

Pseudo-distributed mode: Here, Hadoop is configured to run on a single machine, with different Hadoop daemons run as different Java processes.

Fully distributed or cluster mode: Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker. There is exactly one NameNode in each cluster, which manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, used for periodic handshaking with the NameNode for fault tolerance. The rest of the machines within the cluster act as both DataNodes and TaskTrackers. The DataNode holds the system data; each data node manages its own locally scoped storage, or its local hard disk. The TaskTrackers carry out map and reduce operations.
import java.util.StringTokenizer;
import java.io.*;
import java.net.*;
import java.util.regex.MatchResult;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
The first set of imports is for the standard Java classes, and the second set is for the MapReduce implementation. The EchoOhce class begins by extending org.apache.hadoop.conf.Configured and implementing the interface org.apache.hadoop.util.Tool, as you can see in Listing 4. Listing 4. Extending Configured, implementing Tool
public class EchoOhce extends Configured implements Tool { //..your code goes here }
The Configured class is responsible for delivering the configuration parameters specified in certain XML files. This is done when the programmer invokes the getConf() method of this class. This method returns an instance of org.apache.hadoop.conf.Configuration, which is basically a holder for the resources specified as name-value pairs in XML data. Each resource is named by either a String or by an org.apache.hadoop.fs.Path instance. By default, the two resources loaded in order from the classpath are:
hadoop-default.xml: This file contains read-only defaults for Hadoop, like global properties, logging properties, I/O properties, filesystem properties, and the like. If you want to use your own values for any of these properties, you can override them in hadoop-site.xml. hadoop-site.xml: Here you can override the values in hadoop-default.xml that do not meet your specific objectives.
Please note that applications may add as many additional resources as needed; these are loaded in order from the classpath. You can find out more from the Hadoop API documentation for the addResource() and addFinalResource() methods. addFinalResource() lets you declare a resource final, so that subsequently loaded resources cannot alter its value.
You might have noticed that the code implements an interface named Tool. This interface supports a variety of methods to handle generic command-line options. The interface forces the programmer to write a method, run(), that takes in String arrays as parameters and returns an int. The integer returned will determine whether the execution has been successful or not. Once you've implemented the run() method in your class, you can write your main() method, as in Listing 5. Listing 5. main() method
public static void main(String[] args) throws Exception { int res = ToolRunner.run(new Configuration(), new EchoOhce(), args); System.exit(res); }
The org.apache.hadoop.util.ToolRunner class invokes the run() method implemented in the EchoOhce class. The ToolRunner utility helps to run classes that implement the Tool interface. With this facility, developers can avoid writing a custom handler to process various input options.
Map: Includes functionality for processing input key-value pairs to generate output key-value pairs.

Reduce: Includes functionality for collecting output from parallel map processing and outputting that collected data.
Figure 1 illustrates how the sample app will work. (Figure 1. Map and Reduce in action.) First, take a look at the Map class in Listing 6. Listing 6. Map class
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text inputText = new Text();
    private Text reverseText = new Text();

    public void map(LongWritable key, Text inputs, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String inputString = inputs.toString();
        int length = inputString.length();
        StringBuffer reverse = new StringBuffer();
        for (int i = length - 1; i >= 0; i--) {
            reverse.append(inputString.charAt(i));
        }
        // Emit <input, reversed input> as the key-value pair
        // (the loop body and emit were cut off in the original listing)
        inputText.set(inputString);
        reverseText.set(reverse.toString());
        output.collect(inputText, reverseText);
    }
}
As mentioned earlier, the EchoOhce application must take an input string, reverse it, and return a keyvalue pair with input and reverse strings together. First, it gets the parameters for the map() function -namely, the inputs and the output. From the inputs, it gets the input String. The application uses the simple Java API to find the reverse of this String, then creates a key-value pair by setting the input String and the reverse String. You end up with an OutputCollector instance, which contains the result of this processing. Assume that this is one result obtained from one execution of the map() function on one of the nodes. Obviously, you'll need to combine all such outputs. This is exactly what the reduce() method of the Reduce class, shown in Listing 7, will do. Listing 7. Reduce.reduce()
public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}
The MapReduce framework knows how many OutputCollectors there are and which are to be combined for the final result. The reduce() method actually does the grunt work. Finally, to complete EchoOhce's Main class, you need to set the values for your configuration. Basically, these values inform the MapReduce framework about the types of the output keys and values, the names of the Map and Reduce classes, and so on. The complete run() method is shown in Listing 8. Listing 8. run()
public int run(String[] args) throws Exception { JobConf conf = new JobConf(getConf(), EchoOhce.class); conf.setJobName("EchoOhce"); ...
As you can see in the listing, you must first create a Configuration instance; org.apache.hadoop.mapred.JobConf extends from the Configuration class. JobConf has the primary responsibility of sending your map and reduce implementations to the Hadoop framework for execution. Once the JobConf instance has been given the appropriate values for your MapReduce implementation, you invoke the most important method, named runJob(), on the org.apache.hadoop.mapred.JobClient class, by passing in the JobConf instance. JobClient internally communicates with the org.apache.hadoop.mapred.JobTracker class, and provides facilities for submission of jobs, tracking progress, accessing the progress or logs, or getting cluster status. That should give you a good sense of how EchoOhce, a sample MapReduce application, works. We'll conclude with instructions for installing the relevant software and running the application.
(Note the /cygdrive prefix. This is how Cygwin maps your Windows directory to a Unix-style directory format.) 7. Start Cygwin by choosing Start > All Programs > Cygwin > Cygwin Bash Shell. 8. In Hadoop, communication between different processes across different machines is achieved through SSH, so the next important step is to get sshd running. If you're using SSH for the first time, please note that sshd needs a config file to run, which is generated by the following command:
9. ssh-host-config
When you enter this, you will get a prompt asking for the value of CYGWIN. Enter ntsec tty. If you are prompted about whether privilege separation should be used, answer no. If asked for your consent to install SSH as a service, answer yes. Once this has been set up, start the sshd service by typing:
10. /usr/sbin/sshd
If you're asked for a passphrase to SSH to the localhost, press Ctrl-C and enter:
14. ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
15. Try running the example programs available at the Hadoop site. If all of the above steps have gone as they should, you should get the expected output. 16. Now it's time to create the input data for the EchoOhce application:
17. echo "Hello" >> word1 echo "World" >> word2 echo "Goodbye" >> word3 echo "JavaWorld" >> word4
18. Next, you need to put the files you created in Step 10 into HDFS after creating a directory. Note that you do not need to create any partitions for HDFS. It comes as part of the Hadoop installation, and all you need to do is execute the following commands:
19. bin/hadoop dfs -mkdir words bin/hadoop dfs -put word1 words/ bin/hadoop dfs -put word2 words/ bin/hadoop dfs -put word3 words/ bin/hadoop dfs -put word4 words/
20. Next, create a JAR file for the sample application. As an easy and extensible approach, create two environment variables on your machine, HADOOP_HOME and HADOOP_VERSION. (For the sample under consideration, the values will be D:\Hadoop and 0.17.1, respectively.) Now you can create EchoOhce.jar with the following commands:
21. mkdir EchoOhce_classes
    javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d EchoOhce_classes EchoOhce.java
    jar -cvf EchoOhce.jar -C EchoOhce_classes/ .
22. Finally, it's time to see the output. Run the application with the following command:
You will see an output screen with details like the following:
24. 08/07/18 11:14:45 INFO streaming.StreamJob: map 0% reduce 0%
    08/07/18 11:14:52 INFO streaming.StreamJob: map 40% reduce 0%
    08/07/18 11:14:53 INFO streaming.StreamJob: map 80% reduce 0%
    08/07/18 11:14:54 INFO streaming.StreamJob: map 100% reduce 0%
    08/07/18 11:15:03 INFO streaming.StreamJob: map 100% reduce 100%
    08/07/18 11:15:03 INFO streaming.StreamJob: Job complete: job_20080718003_0007
    08/07/18 11:15:03 INFO streaming.StreamJob: Output: result
Now go to the result directory and look in the file named result. It should contain the following:
25. Hello olleH
    World dlroW
    Goodbye eybdooG
    JavaWorld dlroWavaJ
Listing 9. hadoop-site.xml
4. <?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>MACH1:8000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>MACH2:8000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>8001</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>8002</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>8003</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>8004</value>
  </property>
</configuration>
5. Open the file named masters under the conf directory. Here you need to add the master NameNode and JobTracker names, as shown in Listing 10. (If there are existing entries, please replace them with those shown in the listing). Listing 10. Adding NameNode and JobTracker names
6. MACH1 MACH2
7. Open the file named slaves under the conf directory. This is where you put the names of DataNodes, as shown in Listing 11. (Again, if there are existing entries in this file, please replace them.) Listing 11. Adding DataNode names
8. MACH3 MACH4
9. Now you're ready to go, and it's time to start the Hadoop cluster. Log on to each node, accepting the defaults. Log into your NameNode as follows:
10. ssh MACH1
(Note that you can stop this later with the stop-dfs.sh command.) 13. Start the JobTracker exactly as above, with the following commands:
14. ssh MACH2
    cd /hadoop0.17.1/
    bin/start-mapred.sh
(Again, this can be stopped later by the corresponding stop-mapred.sh command.) You can now execute the EchoOhce application as described in the previous section, in the same way. The difference is that now the program will be executed across a cluster of DataNodes. You can confirm this by going to the Web interface provided with Hadoop. Point your browser to http://localhost:8002. (The default is actually port 50070; to see why you'd need to use port 8002 here, take a closer look at Listing 9.) You should see a frame similar to the one in Figure 2, showing the details of the NameNode and all jobs managed by it. (Figure 2. Hadoop Web interface, showing the number of nodes and their status.) This Web interface will provide many details to browse through, showing you the full statistics of your application. Hadoop comes with several different Web interfaces by default; you can see their default URLs in hadoop-default.xml. For example, in this sample application, http://localhost:8003 will show you JobTracker statistics. (The default port is 50030.)
In conclusion
In this article, we've presented the fundamentals of MapReduce programming with the open source Hadoop framework. This excellent framework accelerates the processing of large amounts of data through distributed processes, delivering very fast responses. It can be adopted and customized to meet various development requirements and can be scaled by increasing the number of nodes available for
processing. The extensibility and simplicity of the framework are the key differentiators that make it a promising tool for data processing.