
Hadoop Distributed File System (HDFS)
Agenda
Introduction to Hadoop
Hadoop eco-system
HDFS Definition
HDFS Architecture
HDFS Components

hadoop

Introduction to Big Data and Hadoop


The underlying technology was invented by Google and developed by Doug Cutting.
Hadoop is designed to efficiently process large volumes of information by connecting
many commodity computers together to work in parallel.
Big data is a term applied to data sets whose size is beyond the ability of commonly
used software tools to capture, manage, and process the data within a tolerable
elapsed time. Big data sizes are a constantly moving target, currently ranging from a
few dozen terabytes to many petabytes of data in a single data set.

Examples
web logs,
sensor networks,
social networks,
Internet search indexing,
call detail records,


Hadoop and its eco-system

HDFS Definition
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware. HDFS is a distributed, scalable, and portable filesystem written in Java
for the Hadoop framework. It has many similarities with existing distributed file systems,
and it is the primary storage system used by Hadoop applications.
HDFS creates multiple replicas of data blocks and distributes them on compute nodes
throughout a cluster to enable reliable, extremely rapid computations. HDFS is highly
fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides
high-throughput access to application data and is suitable for applications that have
large data sets.
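As a rough illustration of the block-and-replica model described above, here is a toy sketch (not HDFS's actual implementation): a file is split into fixed-size blocks, and each block is assigned to several DataNodes. The 64 MB block size and replication factor of 3 mirror classic HDFS defaults; the round-robin placement and the DataNode names are simplifications for illustration.

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Real HDFS uses rack-aware placement; this round-robin scheme is
# only meant to show the shape of the idea.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return how many blocks a file of file_size bytes occupies."""
    return (file_size + block_size - 1) // block_size

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 200 MB file needs 4 blocks: 3 full 64 MB blocks plus 1 partial block.
blocks = split_into_blocks(200 * 1024 * 1024)
layout = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single DataNode then still leaves two replicas of every block, which is the basis of the fault tolerance mentioned above.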

HDFS consists of the following components (daemons):


Name Node
Data Node
Secondary Name Node

HDFS Components
NameNode:
NameNode, a master server, manages the file system namespace and regulates
access to files by clients.

Meta-data in Memory
The entire metadata is in main memory

Types of Metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor

A Transaction Log
Records file creations, file deletions, etc.
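The metadata kinds listed above can be sketched as a small toy data structure (hypothetical names, not real HDFS code): a file-to-blocks map, a block-to-DataNodes map, per-file attributes, and an append-only transaction log of namespace changes.

```python
# Toy sketch of the NameNode's in-memory metadata, for illustration only.
import time

class ToyNameNode:
    def __init__(self):
        self.files = {}            # path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode names
        self.attributes = {}       # path -> {"ctime": ..., "replication": ...}
        self.edit_log = []         # transaction log: (op, path) records
        self._next_block = 0

    def create_file(self, path, num_blocks, replication=3):
        """Register a new file: allocate block ids, record attributes, log it."""
        blocks = list(range(self._next_block, self._next_block + num_blocks))
        self._next_block += num_blocks
        self.files[path] = blocks
        self.attributes[path] = {"ctime": time.time(),
                                 "replication": replication}
        self.edit_log.append(("CREATE", path))

    def delete_file(self, path):
        """Remove a file's metadata and log the deletion."""
        for b in self.files.pop(path):
            self.block_locations.pop(b, None)
        self.attributes.pop(path)
        self.edit_log.append(("DELETE", path))

nn = ToyNameNode()
nn.create_file("/user/root/README", num_blocks=2)
```

Because all of this lives in main memory, as the slide notes, the NameNode's capacity bounds the number of files and blocks the cluster can hold.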
Data Node:
DataNodes, one per node in the cluster, manage the storage attached to the nodes
that they run on

A Block Server
Stores data in the local file system (e.g. ext3)
Stores meta-data of a block (e.g. CRC)
Serves data and meta-data to Clients
Block Report
Periodically sends a report of all existing blocks to the NameNode

Facilitates Pipelining of Data


Forwards data to other specified DataNodes
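The pipelining just described can be sketched as a toy chain (simplified; real HDFS streams packets and acknowledgements): the first DataNode stores the block locally, then forwards it to the next node in the pipeline until every replica is written.

```python
# Toy sketch of HDFS write pipelining, for illustration only.

def pipeline_write(block_data, pipeline, storage):
    """Write block_data through a chain of DataNode names.

    storage maps each DataNode name to the list of blocks it holds.
    """
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    storage[head].append(block_data)        # store the block locally
    pipeline_write(block_data, rest, storage)  # forward to the next DataNode

storage = {"dn1": [], "dn2": [], "dn3": []}
pipeline_write(b"block-0 bytes", ["dn1", "dn2", "dn3"], storage)
```

The client only talks to the first DataNode; replication to the others happens node-to-node, which keeps the client's outbound bandwidth constant regardless of the replication factor.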

HDFS Components
Secondary Name Node
Not used as a hot stand-by or mirror node; a failover node is planned for a future release.
Will be renamed in 0.21 to CheckpointNode
Acts as a backup for the NameNode: periodically wakes up, processes a checkpoint, and
updates the NameNode
Memory requirements are the same as the NameNode's (big)
Typically on a separate machine in a large cluster (> 10 nodes)
Its directory layout is the same as the NameNode's, except it keeps the previous checkpoint
version in addition to the current one.
It can be used to restore a failed NameNode (just copy the current directory to the new
NameNode)
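The checkpoint cycle above can be sketched as a toy merge (hypothetical functions, not real HDFS code): the Secondary NameNode replays the edit log onto a copy of the namespace image and produces a fresh checkpoint, after which the edit log can start empty again.

```python
# Toy sketch of the Secondary NameNode checkpoint, for illustration only.
# The namespace image is modeled as a dict of paths; the edit log as a
# list of (op, path) records like those in the NameNode's transaction log.

def replay_edits(fsimage, edit_log):
    """Apply CREATE/DELETE edit records to a copy of the namespace image."""
    image = dict(fsimage)
    for op, path in edit_log:
        if op == "CREATE":
            image[path] = {}
        elif op == "DELETE":
            image.pop(path, None)
    return image

def checkpoint(fsimage, edit_log):
    """Produce a new checkpoint image and an emptied edit log."""
    return replay_edits(fsimage, edit_log), []

image = {"/user": {}}
edits = [("CREATE", "/user/root"), ("CREATE", "/tmp"), ("DELETE", "/tmp")]
new_image, new_edits = checkpoint(image, edits)
```

This is also why the restored-NameNode trick in the slide works: the checkpoint directory holds a complete, recent namespace image that a new NameNode can start from.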


Basic Hadoop Filesystem commands


To work with HDFS you use the hadoop fs command.
For example, to list the contents of /:
hadoop fs -ls /
hadoop fs -ls /user/root
hadoop fs -lsr /user
To make the directory test you can issue the following command:
hadoop fs -mkdir test
To move files between your regular Linux filesystem and HDFS:
hadoop fs -put /test/README README
You should now see a new file called /user/root/README listed. To view the contents of
this file, use the cat command as follows:
hadoop fs -cat README
To find the size of files, use the du or dus commands:
hadoop fs -du README
moveFromLocal:
hadoop fs -moveFromLocal <src> <dst>
