
Hadoop was originally developed to address the big data needs of Web and media companies, but today it's being used around the world to address a wider set of data needs, big and small, in practically every industry. When the Apache Hadoop project was initially released, it had two primary components:

1. A storage component called HDFS (Hadoop Distributed File System) that works on low-cost, commodity hardware.

2. A resource management and processing component called MapReduce.

With these two Hadoop components, HDFS and MapReduce, in mind, let's take a quick look at how Hadoop
addresses three business scenarios that aren't necessarily related to big data:

1. Data staging: Corporate data is growing, and it’s going to grow even faster. It’s just getting too expensive to
extend and maintain a data warehouse.

2. Data processing: Organizations are having so much trouble processing and analyzing normal data that they
can’t even think about dealing with big data.

3. Data archiving: Businesses must keep their data for seven years for compliance reasons, but would like to store
and analyze decades of data – without breaking the bank (or the server).

Do any of these scenarios ring a bell? If so, Hadoop may be able to help.
Data staging

Today, many organizations have a traditional data warehouse setup that looks something like this:

•Application data, such as ERP or CRM, is captured in one or more relational databases.
•ETL tools then extract, transform and load this data into a data warehouse ecosystem (EDW, data marts, operational
data stores, analytic sandboxes, etc.).
•Users then interact with the data warehouse ecosystem via BI and analytical tools.

What if you used Hadoop to handle your ETL processing? You could write MapReduce jobs to load the application
data into HDFS, transform it and then send the transformed data to the data warehouse. The bonus? Because of the
low cost of Hadoop storage, you could store both versions of the data in HDFS: the “before” application data and the
“after” transformed data. Your data would all be in one place, making it easier to manage, reprocess, and possibly
analyze at a later date.
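
As a rough sketch of the "load into HDFS" step, the Java snippet below copies an extracted application file into HDFS using the standard Hadoop FileSystem API. The paths and the NameNode address are hypothetical, and in practice this step is often handled by an ETL vendor's connector or a tool such as Sqoop or distcp rather than hand-written code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: stage a raw application-data extract into HDFS so MapReduce jobs
// can transform it later. Paths and the HDFS URI are illustrative only.
public class StageToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

        FileSystem fs = FileSystem.get(conf);

        Path local = new Path("/tmp/erp_orders_extract.csv");         // "before" data from ERP/CRM
        Path raw   = new Path("/staging/raw/erp_orders_extract.csv"); // kept alongside transformed output

        fs.copyFromLocalFile(local, raw);
        System.out.println("Staged " + local + " into " + raw);
    }
}

Keeping both the raw and the transformed copies under a common HDFS directory layout (for example /staging/raw and /staging/transformed) is what makes later reprocessing straightforward.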

If you're experiencing rapid application data growth, or you're having trouble getting all your ETL jobs to finish in a
timely manner, consider handing off some of this work to Hadoop – using your ETL vendor's Hadoop/HDFS
connector or MapReduce – to get ahead of your data, not behind it.
Data processing

This is a simple example from a Facebook presentation a few years ago:

Instead of using costly data warehouse resources to update data in the warehouse, why not send the
necessary data to Hadoop, let MapReduce do its thing, and then send the updated data back to the
warehouse? The example Facebook used was updating your mutual friends list on a regular basis. As you can
imagine, this is a resource-intensive process involving a lot of data – a job that is easily handled by
Hadoop.
This example applies not only to the processing of data stored in your data warehouse, but also to data in any
of your operational or analytical systems. Take advantage of Hadoop's low-cost processing power so that your
relational systems are freed up to do what they do best.
Data archiving

This third scenario is very common and pretty straightforward. Since Hadoop runs on commodity hardware that
scales easily and quickly, organizations can now store and archive a lot more data at a much lower cost.

For example, what if you didn’t have to destroy data after its regulatory life to save on storage costs? What if
you could easily and cost-effectively keep all your data? Or maybe it’s not just about keeping the data on hand,
but rather, a need to analyze more data. Why limit your analysis to the last three, five or seven years when you
can easily store and analyze decades of data? Isn’t this a data geek’s paradise?

The bottom line

Don’t fall into the trap of believing that Hadoop is a big-data-only solution. It’s much more than that. Hadoop is
powerful open source technology that is fully capable of supporting and managing one of your organization’s
greatest assets: your data. Hadoop is ready for the challenge. Are you?
The emergence of big data:

In the Web 2.0 era, the amount of data being generated is reaching petabytes. Programmers and business
analysts want to analyze these large volumes of data to drive the business, and the data is key for any
business.
There are two important characteristics of big data that create the challenges:
•Store data fail-safe
•Process the data faster
1. Store data fail-safe

Over the years, the storage capacity of a single disk has increased considerably; today a 1 TB hard disk is quite
normal. But the speed of reading data from the disk has not kept up with the increase in capacity: on average we
get only about 100 MB/s. At that rate it takes more than two and a half hours to read all the data from a 1 TB disk
(1 TB ÷ 100 MB/s ≈ 10,000 seconds).

How to reduce the reading time: parallel access


One way to improve the read process is to read data from multiple disks in parallel, so the overall computation finishes faster.
The drawback of this approach is that we may end up with more disks than the actual data size requires. But the growth in
storage capacity and the fall in disk prices have made this approach affordable.

What about hardware failure?


It is inevitable that hardware will fail, so there must be a technique for duplicating data across different storage
systems: even if one system fails, another picks up. In other words, we need an effective distributed file system
(HDFS) to store the data.
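
As a small illustration of this duplication, HDFS replicates every block of a file to several DataNodes. The sketch below uses the standard FileSystem API to request three copies of one file; the path is hypothetical, and in practice the replication factor is usually set cluster-wide via the dfs.replication property rather than per file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: ask HDFS to keep three copies of a file's blocks so that losing a
// single node does not lose data. The path is illustrative only.
public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/staging/raw/erp_orders_extract.csv"); // hypothetical file
        boolean changed = fs.setReplication(file, (short) 3);        // three copies across DataNodes
        System.out.println("Replication factor updated: " + changed);
    }
}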
2. Process the data faster:

Often the analysis needs to combine data from different nodes, for computations such as sorting and merging. So we need an
effective programming model like MapReduce, which Hadoop is built on. The ability to process unstructured data and the
slowness of disk seeks are the biggest challenges in computing over big data.
Seek time:
Seeking is the act of moving the disk head to a particular place on the disk to read or write, and data access time
depends primarily on it. The traditional B-Tree structure used by an RDBMS is good for updating and selecting
individual records, but it is not as efficient as MapReduce for sorting and merging large volumes of data. Batch
processing, which mostly writes data once and reads it often, suits MapReduce, whereas a relational database is
the better fit for data that is continuously updated.
Processing semi-structured data:
An RDBMS is a good fit when your data is organized in a structured way, such as XML documents or tables, because the
whole data structure is built around the relationships in the data. Semi-structured data, like a spreadsheet, is organized as
rows and cells, but each row and cell can hold any kind of data, and unstructured data such as image files or PDFs won't
fit into a relational database at all. MapReduce works well with unstructured and semi-structured data because it interprets
the data at processing time, unlike an RDBMS, which enforces structure at storage time (with constraints and data types).
Normalization:
An RDBMS is often normalized to reduce duplication, whereas distributed data processing is built on top of duplicating
data across different nodes. Duplication is required so that even if one node goes down, the data is not lost and the
computation can continue undisturbed. The Hadoop HDFS file system and the MapReduce model are built for exactly this.
Linear scalability:
With MapReduce, if you increase the amount of input data, the processing takes proportionally longer; if you instead
increase the number of nodes in the cluster, the processing gets proportionally faster. In other words, processing time
scales linearly with both the size of the data and the size of the cluster. This is not generally true of SQL queries.
Recommendation:
An RDBMS is a good choice when you have gigabytes of structured data that is read and written often and needs
high integrity.

Hadoop is a good choice when you have petabytes of semi-structured or unstructured data (though it works for
structured data too) that is written once and read many times, needs to be processed in batch mode with linear
scaling, and has lower integrity requirements. People are now starting to use Hadoop for real-time analytics as well.
The Algorithm

•Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
•A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage (a minimal Java word-count sketch follows at the end of this section).
• Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
• Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
•During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
•The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
•Most of the computing takes place on nodes with the data on local disks, which reduces network traffic.
•After completion of the given tasks, the cluster collects and reduces the data to form the final result, and sends it back to the Hadoop server.
•Client applications submit jobs to the JobTracker.
•The JobTracker talks to the NameNode to determine the location of the data.
•The JobTracker locates TaskTracker nodes with available slots at or near the data.
•The JobTracker submits the work to the chosen TaskTracker nodes.
•The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to
have failed and the work is scheduled on a different TaskTracker.

•A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may
resubmit the job elsewhere, it may mark that specific record as something to avoid, or it may even blacklist
the TaskTracker as unreliable.

•When the work is completed, the JobTracker updates its status.

Client applications can poll the JobTracker for information.

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the
JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so
the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker
tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on
the same server that hosts the DataNode containing the data, and failing that, it looks for an empty slot on a
machine in the same rack.
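
To make the map and reduce stages concrete, here is the classic word-count job written as a minimal sketch against the standard Hadoop MapReduce Java API; input and output paths are taken from the command line, and the code favours brevity over production concerns such as combiners or counters.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is split into words, and (word, 1) is emitted.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle groups values by word, the counts are summed.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this is typically launched with something like hadoop jar wordcount.jar WordCount /input /output, and the summed counts end up as files in the HDFS output directory.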
1. Storage

HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style
computations using MapReduce and point queries (random reads).
Architecture: [figure: HBase Architecture]
Uses:
-Storage of large data volumes (billions of rows) atop clusters of commodity hardware
-Bulk storage of logs, documents, real-time activity feeds and raw imported data
-Consistent performance of reads/writes to data used by Hadoop applications
-Data store that can be aggregated or processed using MapReduce functionality
-Data platform for Analytics and Machine Learning
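
As a brief sketch of a point query (random read) from Java, the snippet below writes and reads one cell with the standard HBase client API. The table name "activity_feed" and column family "d" are hypothetical and assumed to have been created beforehand, for example from the HBase shell.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: one Put and one Get against a hypothetical "activity_feed" table
// with column family "d". Assumes the table already exists.
public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("activity_feed"))) {

            // Write one cell: row key "user42", column d:last_event
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("last_event"), Bytes.toBytes("login"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user42")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("last_event"));
            System.out.println("last_event = " + Bytes.toString(value));
        }
    }
}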

HCatalog
A table and storage management layer for Hadoop that enables Hadoop applications (Pig, MapReduce, and Hive) to read and
write data in a tabular form rather than as raw files. It also provides REST APIs so that external systems can access these tables'
metadata.
Architecture: [figure: HCatalog Architecture]
Uses:
-Centralized location of storage for data used by Hadoop applications
-Reusable data store for sequenced and iterated Hadoop processes (ex: ETL)
-Storage of data in a relational abstraction
2. Processing

MapReduce
A distributed data processing model and execution environment that runs on large clusters of commodity machines. It uses the
MapReduce algorithm which breaks down all operations into Map or Reduce functions.
Architecture: [figure: MapReduce Architecture]

Uses:
-Aggregation (Counting, Sorting, Filtering, Stitching) on large and disparate data sets
-Scalable parallelism of Map or Reduce tasks
-Distributed task execution
-Machine learning

Pig
A high-level scripting language and execution environment for creating complex MapReduce transformations. Scripts are
written in Pig Latin (the language) and translated into executable MapReduce jobs. Pig also allows the user to create extended
functions (UDFs) in Java.
Architecture: [figure: Pig Architecture]
Uses:
-Scripting environment to execute ETL tasks/procedures on raw data in HDFS
-High-level language for creating and running complex MapReduce transformations
-Data processing, stitching, schematizing on large and disparate data sets
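
For illustration only, the sketch below embeds a tiny Pig Latin data flow in a Java program through Pig's PigServer API; the input path, field layout, and delimiter are hypothetical, and the same script could just as well be run directly with the pig command-line tool.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Sketch: run a small Pig Latin data flow from Java. It loads a hypothetical
// tab-delimited log file, filters the large requests, and stores them in HDFS.
public class PigEtlSketch {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        pig.registerQuery("logs = LOAD '/staging/raw/access_log' "
                + "USING PigStorage('\\t') AS (user:chararray, url:chararray, bytes:long);");
        pig.registerQuery("big = FILTER logs BY bytes > 10000;");

        // Translated into a MapReduce job that writes the result to HDFS.
        pig.store("big", "/staging/out/large_requests");
    }
}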
3. Querying

HIVE
A distributed data warehouse built on top of HDFS to manage and organize large amounts of data. Hive provides a query
language based on SQL semantics (HiveQL) which is translated by the runtime engine to MapReduce jobs for querying the data.
Architecture: [figure: Hive Architecture]
Uses:
-Schematized data store for housing large amounts of raw data
-SQL-like Environment to execute analysis and querying tasks on raw data in HDFS
-Integration with outside RDBMS applications
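
One common way to reach Hive from application code is the HiveServer2 JDBC driver. The sketch below runs a single HiveQL aggregation; the host, port, table name web_logs, and its columns are hypothetical assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: query Hive over JDBC. Host, port, table, and columns are hypothetical.
public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL is translated by the Hive runtime into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url")) {
            while (rs.next()) {
                System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
            }
        }
    }
}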

4. External integration

Flume
A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into
HDFS. Flume transports large quantities of event data using a streaming data-flow architecture that is fault tolerant and ready
for failover recovery.
Architecture: [figure: Flume Architecture]
Uses:
-Transportation of large amounts of event data (network traffic, logs, email messages)
-Stream data from multiple sources into HDFS
-Guaranteed and reliable real-time data streaming to Hadoop applications
