
INTRODUCTION TO HADOOP

Dr. G Sudha Sadhasivam


Professor, CSE
PSG College of Technology
Coimbatore
Contents

Distributed System
DFS
Hadoop
Why is it needed?
Issues
Mutate / lease
Operating systems
Operating system - software that supervises
and controls tasks on a computer. Types of
individual OS:
Batch processing - jobs are collected and placed in
a queue; no interaction with a job during processing
Time shared - computing resources are provided
to different users, with interaction with the program during
execution
RT systems - fast response, can be interrupted
Distributed Systems
Consists of a number of computers that are connected and
managed so that they automatically share the job processing load
among the constituent computers.
A distributed operating system is one that appears to its users as
a traditional uniprocessor system, even though it is actually
composed of multiple processors.
It gives a single system view to its users and provides a single
service.
The location of files is transparent to users. It provides a virtual
computing environment.
E.g. the Internet, ATM banking networks, mobile computing
networks, Global Positioning Systems and air traffic control
A DISTRIBUTED SYSTEM IS A COLLECTION OF INDEPENDENT
COMPUTERS THAT APPEARS TO ITS USERS AS A SINGLE
COHERENT SYSTEM
Network Operating System
In a network operating system the users are aware of the existence of multiple computers.
The operating systems of the individual computers must provide facilities for
communication and shared functionality.
Each machine runs its own OS and has its own user.
Remote login and file access
Less transparent, but more independence

[Diagram: Distributed OS - applications run on a shared layer of distributed operating system services over the network; Networked OS - each machine runs its own application, OS and network stack]
DFS
Resource sharing is the motivation behind distributed
systems. To share files, a file system is needed.
File System is responsible for the organization, storage,
retrieval, naming, sharing, and protection of files.
The file system is responsible for controlling access to
the data and for performing low-level operations such as
buffering frequently used data and issuing disk I/O
requests
The goal is to allow users of physically distributed
computers to share data and storage resources by
using a common file system.
Hadoop
What is Hadoop?

It is a framework for running applications on large clusters of
commodity hardware that both store huge amounts of data and
process it

Apache Software Foundation Project

Open source

Amazon's EC2

alpha (0.18) release available for download

Hadoop Includes

HDFS - a distributed file system

Map/Reduce - Hadoop implements this programming model on top of
HDFS; it is an offline (batch) computing engine

Concept
Moving computation is more efficient than moving large
data
Data intensive applications with Petabytes of data.
Web pages - 20+ billion web pages x 20KB = 400+
terabytes
One computer can read 30-35 MB/sec from disk: ~ four
months to read the web
The same problem with 1000 machines: < 3 hours (see the sketch
after this list)
Difficulty with a large number of machines
communication and coordination
recovering from machine failure
status reporting
debugging
optimization
locality
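The back-of-the-envelope numbers above can be reproduced directly. A minimal Java sketch follows; all constants (20 billion pages, 20 KB per page, 35 MB/sec per disk, 1000 machines, perfect parallelism) are the slide's rough figures or simplifying assumptions, not measured values.

// Back-of-the-envelope sketch of the "read the web" estimate above.
// All constants are the slide's rough figures, not measurements.
public class ReadTheWeb {
    public static void main(String[] args) {
        double pages = 20e9;                      // 20+ billion web pages
        double bytesPerPage = 20 * 1024;          // ~20 KB per page
        double totalBytes = pages * bytesPerPage; // ~400 TB
        double readRate = 35.0 * 1024 * 1024;     // ~35 MB/s per disk

        double oneMachineSecs = totalBytes / readRate;
        double thousandMachineSecs = oneMachineSecs / 1000; // assumes perfect parallelism

        System.out.printf("Total data: %.0f TB%n", totalBytes / 1e12);
        System.out.printf("One machine: %.0f days (~%.1f months)%n",
                oneMachineSecs / 86400, oneMachineSecs / (86400 * 30));
        System.out.printf("1000 machines: %.1f hours%n", thousandMachineSecs / 3600);
    }
}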
FACTS
Single-thread performance doesn't matter
We have large problems; total throughput/price is more important
than peak performance
Stuff breaks - more machines means more failures
If you have one server, it may stay up three years (1,000 days)
If you have 10,000 servers, expect to lose ten a day
Ultra-reliable hardware doesn't really help
At large scales, super-fancy reliable hardware still fails, albeit less often
software still needs to be fault-tolerant
commodity machines without fancy hardware give better
perf/price

DECISION : COMMODITY HARDWARE.

DFS: WHY HADOOP?


WHAT SOFTWARE MODEL?
HDFS Why? Seek vs Transfer
CPU & transfer speed, RAM & disk size double every 18
- 24 months
Seek time nearly constant (~5%/year)
Time to read an entire drive is growing relative to the transfer rate.
Moral: scalable computing must operate at transfer rate
B-Tree (relational DBs)
operates at seek rate, log(N) seeks/access
- memory / stream based
Sort/merge flat files (MapReduce)
operates at transfer rate, log(N) transfers/sort
- batch based
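To make the seek-vs-transfer argument concrete, here is a small illustrative Java sketch comparing a seek-per-record update pattern (B-Tree style) with a full sequential pass (sort/merge style). All numbers - 10 ms per seek, 100 MB/s transfer, a 1 TB store of 100-byte records, 1% of records touched - are assumed for illustration and do not come from the slides.

// Illustrative seek-vs-transfer comparison: updating 1% of the records in a
// 1 TB store by random seeks vs. rewriting the whole file sequentially.
// All numbers are assumed for illustration.
public class SeekVsTransfer {
    public static void main(String[] args) {
        double seekTime = 0.010;                  // 10 ms per seek
        double transferRate = 100e6;              // 100 MB/s sequential transfer
        double storeSize = 1e12;                  // 1 TB of data
        double recordSize = 100;                  // 100-byte records
        double records = storeSize / recordSize;  // 10 billion records
        double updatedFraction = 0.01;            // touch 1% of records

        // B-Tree style: one seek per updated record
        double seekApproach = records * updatedFraction * seekTime;
        // Sort/merge style: stream the whole store through once
        double transferApproach = storeSize / transferRate;

        System.out.printf("Seek-based updates  : %.1f days%n", seekApproach / 86400);
        System.out.printf("Full sequential pass: %.1f hours%n", transferApproach / 3600);
    }
}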
Characteristics

Fault-tolerant, scalable, efficient, reliable distributed
storage system
Moving computation to place of data
Single cluster with computation and data.
Process huge amounts of data.
Scalable: store and process petabytes of data.
Economical:
It distributes the data and processing across clusters of
commonly available computers.
Clusters PCs into a storage and computing platform.
It minimises the CPU cycles and RAM required on individual
machines, etc.
Efficient:
By distributing the data, Hadoop can process it in parallel on
the nodes where the data is located. This makes it extremely
rapid.
Computation is moved to place where data is present.
Reliable:
Hadoop automatically maintains multiple copies of data
Automatically redeploys computing tasks based on failures.
Each cluster node runs both DFS and MR
Data Model
Data is organized into files and directories
Files are divided into uniform sized blocks and
distributed across cluster nodes
Replicate blocks to handle hardware failure
Checksums of data for corruption detection
and recovery
Block placement is exposed so that computation can
be migrated to the data
Supports large streaming reads and small random reads
Facility for multiple clients to append to a file
Assumes commodity hardware that fails
Files are replicated to handle hardware failure
Checksums for corruption detection and
recovery
Continues operation as nodes / racks
added / removed

Optimized for fast batch processing


Data location exposed to allow computation to
move to the data
Stores data in chunks/blocks on every node
in the cluster
Provides VERY high aggregate bandwidth
Files are broken into large blocks
Typically 128 MB block size
Blocks are replicated for reliability
One replica on the local node, a second replica on a remote rack,
a third replica on the same remote rack
Additional replicas are randomly placed
Understands rack locality
Data placement exposed so that computation can be
migrated to data
Client talks to both NameNode and DataNodes
Data is not sent through the NameNode; clients access
data directly from DataNodes
Throughput of file system scales nearly linearly with the
number of nodes.
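As an illustration of how block size and replication are chosen per cluster, the sketch below sets them through the Hadoop client Configuration. The property keys dfs.block.size and dfs.replication are the 0.18-era names (later releases renamed dfs.block.size to dfs.blocksize); the values simply mirror the slide.

// Minimal sketch: setting HDFS block size and replication on the client side.
// Property names are the 0.18-era keys; values mirror the slide.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128 MB blocks, as on the slide
        conf.setInt("dfs.replication", 3);                  // three replicas per block
        FileSystem fs = FileSystem.get(conf);               // client picks up these defaults
        System.out.println("Working against: " + fs.getUri());
    }
}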
Block Placement
Hadoop Cluster Architecture:
Components
DFS Master - NameNode
Manages the file system namespace
Controls read/write access to files
Manages block replication
Checkpoints namespace and journals namespace
changes for reliability

Metadata of Name node in Memory


The entire metadata is in main memory
No demand paging of FS metadata

Types of Metadata:
List of files, file and chunk namespaces; list of blocks,
location of replicas; file attributes etc.
DFS SLAVES or DATA NODES
Serve read/write requests from clients
Perform replication tasks upon instruction by namenode
Data nodes act as:
1) A Block Server
Stores data in the local file system
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
2) Block Report: Periodically sends a report of all existing
blocks to the NameNode
3) Periodically sends heartbeat to NameNode (detect node
failures)
4) Facilitates Pipelining of Data (to other specified
DataNodes)
Map/Reduce Master - JobTracker
Accepts MR jobs submitted by users
Assigns Map and Reduce tasks to TaskTrackers
Monitors task and TaskTracker status, re-executes
tasks upon failure
Map/Reduce Slaves - TaskTrackers
Run Map and Reduce tasks upon instruction
from the Jobtracker
Manage storage and transmission of
intermediate output.
SECONDARY NAME NODE

Copies FsImage and Transaction Log from


NameNode to a temporary directory
Merges FSImage and Transaction Log into
a new FSImage in temporary directory
Uploads new FSImage to the NameNode
Transaction Log on NameNode is purged
HDFS Architecture

NameNode: maps (filename, offset) -> block id, and block -> DataNode

DataNode: maps block -> local disk
Secondary NameNode: periodically merges edit logs
A block is also called a chunk
JOBTRACKER, TASKTRACKER AND JOBCLIENT
HDFS API
Most common file and directory operations
supported:
Create, open, close, read, write, seek, list,
delete etc.
Files are write-once and have exactly
one writer at a time
Some operations peculiar to HDFS:
set replication, get block locations
Support for owners, permissions
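A minimal sketch of these operations through Hadoop's Java FileSystem API; the path and file contents are made up for illustration.

// Minimal sketch of common HDFS operations via the Java FileSystem API.
// The path and contents are made up for illustration.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsApiDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/example.txt");

        // create + write (files are write-once: exactly one writer)
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("hello hdfs\n");
        out.close();

        // open + seek + read
        FSDataInputStream in = fs.open(file);
        in.seek(6);                               // small random read within the block
        byte[] buf = new byte[4];
        in.readFully(buf);
        System.out.println(new String(buf));
        in.close();

        // HDFS-specific operations: replication and block locations
        fs.setReplication(file, (short) 3);
        FileStatus st = fs.getFileStatus(file);
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("block hosts: " + String.join(",", loc.getHosts()));
        }

        // list and delete
        for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(s.getPath());
        }
        fs.delete(file, false);
    }
}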
DATA CORRECTNESS
Use Checksums to validate data
Use CRC32
File creation
Client computes a checksum per 512 bytes
DataNode stores the checksums
File access
Client retrieves the data and checksums from the
DataNode
If validation fails, the client tries other replicas
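The idea can be sketched as follows, using java.util.zip.CRC32 as a stand-in for Hadoop's internal checksum code: one CRC32 per 512-byte chunk, recomputed and compared on read. This illustrates the scheme and is not the actual HDFS implementation.

// Sketch of per-chunk checksumming: one CRC32 per 512-byte chunk,
// recomputed and compared on read. Illustrative only, not HDFS code.
import java.util.zip.CRC32;

public class ChunkChecksums {
    static final int CHUNK = 512;

    // compute one checksum per 512-byte chunk (write path)
    static long[] checksum(byte[] data) {
        int chunks = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // verify on read; a mismatch would make the client try another replica
    static boolean verify(byte[] data, long[] expected) {
        return java.util.Arrays.equals(checksum(data), expected);
    }

    public static void main(String[] args) {
        byte[] block = new byte[2048];
        long[] sums = checksum(block);
        System.out.println("valid: " + verify(block, sums));
        block[100] ^= 1;                          // simulate corruption
        System.out.println("after corruption: " + verify(block, sums));
    }
}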
MUTATION ORDER AND LEASES
A mutation is an operation that changes the
contents / metadata of a chunk such as append /
write operation.
Each mutation is performed at all replicas.
Leases are used to maintain a consistent order of mutations
across replicas
Master grants chunk lease to one replica (primary)
Primary picks the serial order for all mutations to
the chunk
All replicas follow this order (consistency)
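A toy sketch of the ordering idea: the lease-holding primary assigns a serial number to each mutation, and every replica applies mutations in that serial order, so all replicas converge to the same contents. This is illustrative Java pseudocode, not GFS or HDFS source.

// Toy sketch of lease-based mutation ordering: the primary assigns a serial
// number to each mutation; replicas apply mutations by serial order.
// Illustrative only - not actual GFS/HDFS code.
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

public class MutationOrder {
    static class Primary {                         // holds the chunk lease
        private final AtomicLong nextSerial = new AtomicLong(0);
        long order(String mutation) { return nextSerial.getAndIncrement(); }
    }

    static class Replica {
        private final TreeMap<Long, String> log = new TreeMap<>();
        void apply(long serial, String mutation) { log.put(serial, mutation); }
        String contents() {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<Long, String> e : log.entrySet()) sb.append(e.getValue());
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        Primary primary = new Primary();           // granted the lease by the master
        Replica r1 = new Replica(), r2 = new Replica();
        String[] muts = {"A", "B", "C"};
        long[] serials = new long[muts.length];
        for (int i = 0; i < muts.length; i++) {
            serials[i] = primary.order(muts[i]);   // primary picks the serial order
            r1.apply(serials[i], muts[i]);         // r1 receives mutations in order
        }
        for (int i = muts.length - 1; i >= 0; i--) {
            r2.apply(serials[i], muts[i]);         // r2 receives them out of order
        }
        // both replicas end up identical because they order by serial number
        System.out.println(r1.contents() + " == " + r2.contents());
    }
}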
Software Model - ???
Parallel programming improves performance and
efficiency.
In a parallel program, the processing is broken up into
parts, each of which can be executed concurrently
Identify whether the problem can be parallelised (e.g. Fibonacci has
serial dependencies; matrix operations on independent elements do not)
Master/Worker
The MASTER:
initializes the array and splits it up according
to the number of available WORKERS
sends each WORKER its subarray
receives the results from each WORKER
The WORKER:
receives the subarray from the MASTER
performs processing on the subarray
returns results to MASTER
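A small sketch of this master/worker pattern in plain Java: the master splits an array among worker threads and combines their partial results. The array size and worker count are arbitrary illustrative values.

// Small master/worker sketch: the master splits an array among workers,
// each worker processes its sub-array, the master collects the results.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class MasterWorker {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);
        int workers = 4;
        int chunk = data.length / workers;

        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Long>> parts = new ArrayList<>();

        // MASTER: split the array and send each WORKER its sub-array
        for (int w = 0; w < workers; w++) {
            final int from = w * chunk;
            final int to = (w == workers - 1) ? data.length : from + chunk;
            parts.add(pool.submit(() -> {          // WORKER: process the sub-array
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }));
        }

        // MASTER: receive results from each WORKER and combine them
        long total = 0;
        for (Future<Long> f : parts) total += f.get();
        pool.shutdown();
        System.out.println("total = " + total);    // expect 1000000
    }
}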
CALCULATING PI
The area of the square: As = (2r)^2 = 4r^2
The area of the circle: Ac = pi * r^2
So pi = Ac / r^2, and since r^2 = As / 4,
pi = 4 * Ac / As
pi ~= 4 * (number of points inside the
circle) / (number of points in the
square)
Randomly generate points in the square
Count the generated points that fall
inside the circle - MAP
(find ra = number of points inside the circle / number of points
in the square)
Gather all ra values
PI = 4 * ra - REDUCE
The counting of points inside the circle is parallelised
(MAP)
The partial results are then merged to find PI - REDUCE
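The same decomposition sketched in plain Java (not actual Hadoop code): each map call counts the random points that fall inside the circle for its share of samples, and the reduce step sums those counts and computes 4 * inside / total.

// Plain-Java sketch of the Monte Carlo Pi decomposition above (not Hadoop code):
// each "map" task counts points inside the unit circle; the "reduce" step sums
// the counts and computes PI = 4 * inside / total.
import java.util.Random;

public class MonteCarloPi {
    // MAP: generate 'samples' random points in the unit square, count those inside the circle
    static long map(long samples, long seed) {
        Random rnd = new Random(seed);
        long inside = 0;
        for (long i = 0; i < samples; i++) {
            double x = rnd.nextDouble(), y = rnd.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return inside;
    }

    public static void main(String[] args) {
        int tasks = 8;                            // number of map tasks (illustrative)
        long samplesPerTask = 1_000_000;

        // REDUCE: gather the per-task counts and combine them
        long inside = 0;
        for (int t = 0; t < tasks; t++) {
            inside += map(samplesPerTask, t);     // each call is one independent map task
        }
        double pi = 4.0 * inside / (tasks * samplesPerTask);
        System.out.println("pi ~= " + pi);
    }
}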
Each cluster node runs both DFS and MR
WHAT IS MAP REDUCE PROGRAMMING
Restricted parallel programming model meant
for large clusters
User implements Map() and Reduce()
Parallel computing framework (the Hadoop MapReduce library, on top of HDFS)
Libraries take care of EVERYTHING else
(abstraction)
Parallelization
Fault Tolerance
Data Distribution
Load Balancing
Useful model for many practical tasks
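As a concrete example of implementing Map() and Reduce(), below is a minimal word-count sketch in the style of Hadoop's newer org.apache.hadoop.mapreduce API (the classic WordCount example); the framework supplies parallelization, data distribution and fault tolerance around these two methods.

// Minimal word-count sketch in the style of Hadoop's org.apache.hadoop.mapreduce API:
// the user supplies only map() and reduce(); the framework handles the rest.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {          // emit (word, 1) for every token
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // sum the counts for each word
            context.write(key, new IntWritable(sum));
        }
    }
}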
Conclusion
Why commodity hardware?
because it is cheaper
the software is designed to tolerate faults
Why HDFS?
network bandwidth vs seek latency
Why the MapReduce programming model?
parallel programming
large data sets
moving computation to data
single compute + data cluster
