
Source: http://www.thecloudavenue.com/2012/10/is-java-prerequisite-for-getting.html (on the necessity of Java programming in Hadoop)
How does it work?
You can't have a conversation about Big Data for very long without running into the
elephant in the room: Hadoop. This open source software platform managed by the Apache
Software Foundation has proven to be very helpful in storing and managing vast amounts
of data cheaply and efficiently.
But what exactly is Hadoop, and what makes it so special? Basically, it's a way of storing
enormous data sets across distributed clusters of servers and then running "distributed"
analysis applications in each cluster.
It's designed to be robust, in that your Big Data applications will continue to run even when
individual servers or clusters fail. And it's also designed to be efficient, because it
doesn't require your applications to shuttle huge volumes of data across your network.
Here's how Apache formally describes it:

The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly available service on top of a cluster of computers, each of which
may be prone to failures.
Look deeper, though, and there's even more magic at work. Hadoop is almost completely
modular, which means that you can swap out almost any of its components for a different
software tool. That makes the architecture incredibly flexible, as well as robust and
efficient.

Hadoop Distributed File System (HDFS)
If you remember nothing else about Hadoop, keep this in mind: It has two main parts - a
data processing framework and a distributed file system for data storage. There's more to
it than that, of course, but those two components really make things go.
The distributed file system is that far-flung array of storage clusters noted above - i.e., the
Hadoop component that holds the actual data. By default, Hadoop uses the cleverly
named Hadoop Distributed File System (HDFS), although it can use other file systems as
well.
HDFS is like the bucket of the Hadoop system: You dump in your data and it sits there all
nice and cozy until you want to do something with it, whether that's running an analysis on
it within Hadoop or capturing and exporting a set of data to another tool and performing
the analysis there.
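To make that concrete, here's a minimal sketch (in Java, against Hadoop's FileSystem API) of dumping a small file into HDFS and reading it back. The NameNode address and the /demo/hello.txt path are placeholders for illustration, not values from any real cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; replace with your cluster's.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // "Dump" data into the bucket: create a file and write a line.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hadoop\n".getBytes(StandardCharsets.UTF_8));
        }

        // Pull it back out later, e.g. to export it to another tool.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}

The same round trip can be done from the command line with hadoop fs -put and hadoop fs -cat.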
Data Processing Framework & MapReduce
The data processing framework is the tool used to work with the data itself. By default, this
is the Java-based system known as MapReduce. You hear more about MapReduce than the
HDFS side of Hadoop for two reasons:
* It's the tool that actually gets data processed.
* It tends to drive people slightly crazy when they work with it.
In a "normal" relational database, data is found and analyzed using queries, based on the
industry-standard Structured Query Language (SQL). Non-relational databases use queries,
too; they're just not constrained to use only SQL, but can use other query languages to pull
information out of data stores. Hence, the term NoSQL.
But Hadoop is not really a database: It stores data and you can pull data out of it, but there
are no queries involved - SQL or otherwise. Hadoop is more of a data warehousing system -
so it needs a system like MapReduce to actually process the data.
MapReduce runs as a series of jobs, with each job essentially a separate Java application
that goes out into the data and starts pulling out information as needed. Using MapReduce
instead of a query gives data seekers a lot of power and flexibility, but also adds a lot of
complexity.
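To give a feel for what one of those Java applications looks like, here is a minimal sketch of the map and reduce halves of the classic word-count job, written against the org.apache.hadoop.mapreduce API; it simply counts how often each word appears in the input.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs next to the data and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all the counts for one word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

That translation from "count the words" into a map step that emits (word, 1) pairs and a reduce step that adds them up is exactly the mental shift that takes getting used to.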
There are tools to make this easier: Hadoop includes Hive, another Apache project, which
converts an SQL-like query language into MapReduce jobs, for instance. But MapReduce's
complexity and its limitation to one-job-at-a-time batch processing tend to result in
Hadoop being used more often for data warehousing than for data analysis.
(See also Hadoop Adoption Accelerates, But Not For Data Analytics.)
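As a rough sketch of what that looks like from the developer's side, the snippet below runs a Hive query from Java over JDBC. The HiveServer2 address (localhost:10000) and the "words" table are assumptions made for illustration, not anything from this article; Hive turns the SQL-like query into MapReduce jobs behind the scenes.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (needs the Hive JDBC jar on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this query into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT word, COUNT(*) AS n FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("n"));
            }
        }
    }
}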
Scattered Across the Cluster
There is another element of Hadoop that makes it unique: All of the functions described act
as distributed systems, not the more typical centralized systems seen in traditional
databases.
In a database that uses multiple machines, the work tends to be divided out: all of the data
sits on one or more machines, and all of the data processing software is housed on another
server (or set of servers).
On a Hadoop cluster, the data within HDFS and the MapReduce system are housed on
every machine in the cluster. This has two benefits: it adds redundancy to the system in
case one machine in the cluster goes down, and it brings the data processing software into
the same machines where data is stored, which speeds information retrieval.
Like we said: Robust and efficient.
When a request for information comes in, MapReduce uses two components: a Job Tracker
that sits on the Hadoop master node, and Task Trackers that sit on each node within
the Hadoop network.
The process is fairly linear: for the Map part, the Job Tracker divides the computing job
into defined pieces and ships those pieces out to the Task Trackers on the cluster machines
where the needed data is stored. Once the map tasks have run, their output is Reduced back
toward the central node of the Hadoop cluster, where it is combined with the results from
all of the cluster's other machines.
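As a rough sketch of how such a job is handed to the cluster, the driver below configures and submits a word-count job using the TokenCounterMapper and IntSumReducer classes that ship with Hadoop; the input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Stock map and reduce classes shipped with Hadoop; the framework splits
        // the input, runs map tasks near the blocks, then merges results in reduce.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths; point these at real input/output locations.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));

        // Submits the job to the cluster and blocks until all tasks finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Everything after waitForCompletion - splitting the input, placing map tasks near the data, shuffling results to the reducers - is handled by the framework.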
HDFS is distributed in a similar fashion. A single Name Node tracks where data is housed in
the cluster of servers, known as Data Nodes. Data is stored in data blocks on the Data
Nodes. HDFS replicates those data blocks, usually 128MB in size, and distributes them so
that each block is replicated across multiple nodes in the cluster.
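As a small illustration of that layout, the sketch below asks the Name Node (through the FileSystem API) where the blocks of one file live; the path is a placeholder, and the code assumes it runs with the cluster's configuration on its classpath. Each block reports the Data Nodes holding a replica of it.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        // Uses the default filesystem from the cluster configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));

        // One entry per block; each block lists the Data Nodes holding a replica.
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                + ", length " + block.getLength()
                + ", replicas on " + String.join(", ", block.getHosts()));
        }
    }
}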
This distribution style gives Hadoop another big advantage: Since data and processing live
on the same servers in the cluster, every time you add a new machine to the cluster, your
system gains the space of the hard drive and the power of the new processor.
Kit Your Hadoop
As mentioned earlier, users of Hadoop don't have to stick with just HDFS or MapReduce.
For its Elastic Compute Cloud solutions, Amazon Web Services has adapted its own S3 file
system for Hadoop. DataStax' Brisk is a Hadoop distribution that replaces HDFS with
Apache Cassandra's CassandraFS.
To get around MapReduce's first-in-first-out limitations, the Cascading framework gives
developers an easier way to define and run jobs, along with more flexibility in scheduling them.
Hadoop is not always a complete, out-of-the-box solution for every Big Data task.
MapReduce, as noted, is enough of a pressure point that many Hadoop users prefer to use the
framework only for its capability to store lots of data quickly and cheaply.
But Hadoop is still the best, most widely used system for managing large amounts of data
quickly when you don't have the time or the money to store it in a relational database.
That's why Hadoop is likely to remain the elephant in the Big Data room for some time to
come.

Is Hadoop now easy to use? If not, what assistance do most users
need?
From the perspective of a developer, Hadoop is relatively straightforward. It is possible to
learn the basics of MapReduce in a day and the advanced APIs in a week. The hardest part
of MR programming is understanding how to translate algorithms and standard processing
techniques into their MR equivalents. For this reason, or out of laziness (I mean that in a
good way), many (and I mean *many*) users use higher-level languages like Apache
Hive or Apache Pig.

Administration tasks - installation, configuration, maintenance, capacity planning,
monitoring - are definitely non-trivial. Hadoop is a distributed system, and it is an order of
magnitude more complicated than systems that run on a single machine, like web and
application servers, relational databases, and the like.

The short list of skills required to work with Hadoop:
* Java knowledge (intermediate to advanced as a developer, JVM architecture and tuning as
an admin, both ideal).
* Intermediate Linux knowledge, mostly for administrative tasks (memory management, IO
subsystem, process management, kernel tuning, user resource controls, metric collection
and monitoring).
* Basic understanding of distributed systems.
* Basic understanding of filesystem design.

Most people need:
* Training for developers, admins, or analytics users to get them up to speed.
* Support to keep systems up and running for mission critical clusters.
* Professional services to reduce risk of building systems that use Hadoop by augmenting
internal teams.
* Management tools and systems to greatly simplify administrative tasks.

Hadoop is still non-trivial in real production systems. It's not uncommon for companies to
staff dedicated people to manage large clusters.

Full disclosure: I work for Cloudera, which provides training, support, management tools, and
an open source distribution of software that includes Apache Hadoop.
