
How to monitor the $H!T out of Hadoop

Developing a comprehensive open approach to monitoring Hadoop clusters
Relevant Hadoop Information
From 3 – 3000 Nodes
Hardware/Software failures “common”
Redundant Components: DataNode, TaskTracker
Non-redundant Components: NameNode, JobTracker, SecondaryNameNode
Fast-Evolving Technology (Best Practices?)
Monitoring Software
Nagios
– Red/Yellow/Green Alerts, Escalations
– De facto Standard – Widely Deployed
– Text-based Configuration
– Web Interface
– Pluggable with shell scripts/external apps (return 0 = OK – see the sketch below)
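A Nagios plugin is simply a program that prints one status line and reports its result through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of a custom check in that style; the script name, the use of jps, and the daemon names are illustrative assumptions, not part of the stock plugins:

#!/bin/sh
# check_hadoop_daemon.sh - minimal Nagios plugin sketch (illustrative name).
# Nagios reads the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
PROC="$1"                              # e.g. DataNode or TaskTracker
if [ -z "$PROC" ]; then
  echo "UNKNOWN - usage: $0 <JavaDaemonName>"
  exit 3
fi
# jps (ships with the JDK) lists running JVMs by main class name
if jps 2>/dev/null | grep -q "$PROC"; then
  echo "OK - $PROC is running"
  exit 0
else
  echo "CRITICAL - $PROC is not running"
  exit 2
fi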
Cacti
Performance Graphing System
RRD/RRA Front End
Slick Web Interface
Template System for Graph Types
Pluggable
– SNMP input
– Shell script/external program (see the sketch below)
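When Cacti polls a shell script as a data input method, the script just prints space-separated name:value pairs that Cacti maps onto graph data sources. A rough sketch; the mount point and field names are assumptions for illustration:

#!/bin/sh
# cacti_disk_usage.sh - illustrative Cacti data input script.
# Cacti maps the space-separated name:value pairs onto graph data sources.
# The mount point /data and the field names are assumptions for this sketch.
USED_KB=$(df -P /data | awk 'NR==2 {print $3}')
AVAIL_KB=$(df -P /data | awk 'NR==2 {print $4}')
echo "used:$USED_KB avail:$AVAIL_KB"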
hadoop-cacti-jtg
JMX Fetching Code w/ kick-off scripts
Cacti Templates for Hadoop
Premade Nagios Check Scripts
Helper/Batch/automation scripts
Apache License
Hadoop JMX
A Sample Cluster (Part 1)
NameNode & SecNameNode
– Hardware RAID
– 8 GB RAM
– 1x QUAD CORE
– DerbyDB (Hive) on SecNameNode
JobTracker
– 8GB RAM
– 1x QUAD CORE
A Sample Cluster (Part 2)
Slave (hadoopdata1-XXXX)
– JBOD 8x 1TB SATA Disk
– RAM 16GB
– 2x Quad Core
Prerequisites
Nagios: install from the DAG RPMs
Cacti: install (several RPMs) – example yum commands below
Liberal network access to the cluster
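On a RHEL/CentOS box with the DAG/RPMforge-style repositories enabled, the installs look roughly like the following; the exact package names vary by repository and release, so treat these as assumptions:

# Nagios from the DAG/RPMforge repository (package names vary by repo/release)
yum install nagios nagios-plugins
# Cacti plus its web, database, SNMP, and RRD dependencies
yum install cacti net-snmp rrdtool mysql-server httpd php php-mysql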
Alerts & Escalations
X Nodes * Y Services = Less Sleep
Define a policy
– Wake-Me-Ups (SMS)
– Don't-Wake-Me-Ups (Email)
– Review (Daily, Weekly, Monthly)
Wake-Me-Ups
NameNode
– Disk Full (Big Big Headache)
– RAID Array Issues (failed disk)
JobTracker
SecNameNode
– Don't realize too late that it has stopped working
Don't-Wake-Me-Ups
Or 'Wake Someone Else Up'
DataNode
– Warning: currently a failed disk will take down the whole DataNode (see JIRA)
TaskTracker
Hardware
– Bad Disk (Start RMA)
Slaves are expendable (up to a point)
Monitoring Battle Plan
Start With the Basics
– Ping, Disk
Add Hadoop Specific Alarms
– check_data_node
Add JMX Graphing
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
The Basics: Nagios
Nagios (All Nodes)
– Host up (Ping check)
– Disk % Full
– SWAP > 85 %

* Load-based alarms are somewhat useless – 389% CPU load is not necessarily a bad thing in Hadoopville
The Basics: Cacti
Cacti (All Nodes)
– CPU (full CPU)
– RAM/SWAP
– Network
– Disk Usage
Disk Utilization
RAID Tools
hpacucli – not a Street Fighter move
– Alerts on RAID events (NameNode) – see the sketch below
Disk failed
Rebuilding
– JBOD (DataNode)
Failed Drive
Drive Errors
Dell, Sun, and other vendor-specific tools
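A simple way to turn hpacucli into a Nagios alarm is to grep its configuration report for anything that is not OK. A rough sketch, assuming the "ctrl all show config" subcommand and its usual status strings; the exact subcommand and output format depend on the controller and tool version:

#!/bin/sh
# check_hp_raid.sh - rough sketch of a RAID health check for Nagios.
# Assumes hpacucli's "ctrl all show config" lists arrays/drives with a status;
# anything matching failed/rebuilding/predictive/error trips the alarm.
OUT=$(hpacucli ctrl all show config 2>&1)
BAD=$(echo "$OUT" | grep -Ei 'fail|rebuild|predictive|error')
if [ -n "$BAD" ]; then
  echo "CRITICAL - RAID problem: $BAD"
  exit 2
fi
echo "OK - all arrays and drives report OK"
exit 0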
Before you jump in
X Nodes * Y Checks = Lots of Work
About 3 Nodes into the process …
– Wait!!! I need some interns!!!
Solution: S.I.C.C.T. – Semi-Intelligent Configuration-Cloning Tools
– (I made that up for this presentation)
Nagios
Answers “IS IT RUNNING?”
Text-based Configuration
Cacti
Answers “HOW WELL IS IT RUNNING?”
Web-based Configuration
– php-cli tools
Monitoring Battle Plan
Thus Far
Start With the Basics
– Ping, Disk !!!!!!Done!!!!!!
Add Hadoop Specific Alarms
– check_data_node
Add JMX Graphing
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
Add Hadoop Specific Alarms
Hadoop Components with a Web Interface
– NameNode 50070
– JobTracker 50030
– TaskTracker 50060
– DataNode 50075
check_http + regex = simple + effective
nagios_check_commands.cfg

define command {
    command_name check_remote_namenode
    command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description check_remote_namenode
    use generic-service
    host_name hadoopname1
    check_command check_remote_namenode!50070
}

Component Failure
(Future) Newer Hadoop releases will expose an XML status page
Monitoring Battle Plan
Start With the Basics
– Ping, Disk (Done)
Add Hadoop Specific Alarms
– check_data_node (Done)
Add JMX Graphing
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
JMX Graphing
Enable JMX (hadoop-env.sh example below)
Import Templates
JMX Graphing (Cacti graph screenshots)
Standard Java JMX
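Enabling JMX is just a matter of passing the standard Java JMX system properties to each daemon in conf/hadoop-env.sh, for example for the NameNode. The port number and the disabled authentication/SSL below are illustrative; lock this down on a real cluster:

# conf/hadoop-env.sh - expose the NameNode's MBeans over remote JMX
# (standard JVM flags; port 8004 and the relaxed security are illustrative)
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  $HADOOP_NAMENODE_OPTS"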
Monitoring Battle Plan
Thus Far
Start With the Basics !!!!!!Done!!!!!
– Ping, Disk
Add Hadoop Specific Alarms !Done!
– check_data_node
Add JMX Graphing !Done!
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
Add JMX based Alarms
hadoop-cacti-jtg is flexible
– Extend the fetch classes
– Don't call output()
– Write your own check logic (see the sketch below)
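The same fetch code that feeds Cacti can back a Nagios alarm: fetch the NameNode values, compare them to a threshold, and exit with a Nagios status instead of printing Cacti output. A sketch of the wrapper side in shell; "namenode-fetch.sh" and its "FilesTotal:<n>" output format are hypothetical stand-ins for the kick-off script around your extended hadoop-cacti-jtg class:

#!/bin/sh
# check_files_total.sh - sketch of a JMX-based Nagios alarm.
# "namenode-fetch.sh" and its "FilesTotal:<n>" output are hypothetical;
# substitute the kick-off script / class you built on hadoop-cacti-jtg.
THRESHOLD=1000000
JMXURL="service:jmx:rmi:///jndi/rmi://hadoopname1:8004/jmxrmi"
FILES=$(./namenode-fetch.sh "$JMXURL" | awk -F: '/FilesTotal/ {print $2}')
if [ -z "$FILES" ]; then
  echo "UNKNOWN - could not read FilesTotal over JMX"
  exit 3
fi
if [ "$FILES" -gt "$THRESHOLD" ]; then
  echo "CRITICAL - FilesTotal=$FILES exceeds $THRESHOLD"
  exit 2
fi
echo "OK - FilesTotal=$FILES"
exit 0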
Quick JMX Base Walkthrough

url, user, pass, object specified from the CLI
wantedVariables, wantedOperations set by inheritance
fetch() and output() are provided
Extend for NameNode
Extend for Nagios
Monitoring Battle Plan
Start With the Basics !DONE!
– Ping, Disk
Add Hadoop Specific Alarms !DONE!
– check_data_node
Add JMX Graphing !DONE!
– NameNodeOperations
Add JMX Based alarms !DONE!
– FilesTotal > 1,000,000 or LiveNodes < 50%
Review
File System Growth
– Size
– Number of Files
– Number of Blocks
– Ratios
Utilization
– CPU/Memory
– Disk
Email (Nightly) – see the sketch below
– FSCK
– DFSADMIN
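The nightly review email can be a plain cron job that captures the output of hadoop fsck and hadoop dfsadmin -report. A minimal sketch; the hadoop install path, recipient address, and schedule are illustrative assumptions:

# /etc/cron.d/hadoop-nightly-report (illustrative path, schedule, recipient)
# Mail the HDFS health and capacity reports every night.
0 2 * * * hadoop /usr/lib/hadoop/bin/hadoop fsck / 2>&1 | mail -s "Nightly HDFS fsck" hadoop-admins@example.com
15 2 * * * hadoop /usr/lib/hadoop/bin/hadoop dfsadmin -report 2>&1 | mail -s "Nightly dfsadmin report" hadoop-admins@example.com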
The Future
JMX Coming to JobTracker and TaskTracker (0.21)
– Collect and Graph Jobs Running
– Collect and Graph Map/Reduce Tasks per Node
– Profile Specific Jobs in Cacti?
