Developing a comprehensive
open approach to monitoring
Hadoop clusters
Relevant Hadoop Information
From 3 – 3000 Nodes
Hardware/Software failures “common”
Redundant components: DataNode, TaskTracker
Non-redundant components: NameNode, JobTracker, SecondaryNameNode
Fast-evolving technology (best practices?)
Monitoring Software
Nagios
– Red/Yellow/Green alerts, escalations
– De facto standard, widely deployed
– Text-based configuration
– Web interface
– Pluggable with shell scripts/external apps
– Plugins report status through their exit code (return 0 = OK)
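To illustrate the exit-code convention mentioned above, here is a minimal sketch of how a check could map a numeric reading to a Nagios status. The codes 0 to 3 are the standard Nagios plugin return values; the function names `classify` and `plugin_result` and the thresholds are hypothetical, chosen only for this example.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a reading to a status, assuming higher values are worse."""
    if value is None:
        return UNKNOWN  # no reading could be obtained
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_result(name, value, warn, crit):
    """Return (exit_code, status_line) as a plugin would report them."""
    status = classify(value, warn, crit)
    return status, f"{name} {LABELS[status]} - value={value}"
```

A real plugin would print the status line and then call `sys.exit()` with the returned code, since Nagios reads the process exit status.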
Cacti
Performance Graphing System
RRD/RRA Front End
Slick Web Interface
Template System for Graph Types
Pluggable
– SNMP input
– Shell script /external program
hadoop-cacti-jtg
JMX fetching code with kick-off scripts
Cacti templates For Hadoop
Premade Nagios Check Scripts
Helper/Batch/automation scripts
Apache License
Hadoop JMX
Sample Cluster (Part 1)
NameNode & SecNameNode
– Hardware RAID
– 8 GB RAM
– 1x quad core
– DerbyDB (Hive) on SecNameNode
JobTracker
– 8 GB RAM
– 1x quad core
Sample Cluster (Part 2)
Slave (hadoopdata1-XXXX)
– JBOD 8x 1 TB SATA disk
– 16 GB RAM
– 2x quad core
Prerequisites
Nagios installed (DAG RPMs)
Cacti installed (several RPMs)
Liberal network access to the cluster
Alerts & Escalations
X nodes * Y services = less sleep
Define a policy
– Wake Me Up’s (SMS)
– Don’t Wake Me Up’s (EMAIL)
– Review (Daily, Weekly, Monthly)
Wake Me Up’s
NameNode
– Disk Full (Big Big Headache)
– RAID Array Issues (failed disk)
JobTracker
SecNameNode
– You may not realize it is not working until it is too late
Don’t Wake Me Up’s
Or ‘Wake someone else up’
DataNode
– Warning: currently a failed disk will take down the DataNode (see Jira)
TaskTracker
Hardware
– Bad Disk (Start RMA)
Slaves are expendable (up to a point)
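The alert policy above can be written down as a small routing table. The sketch below is only an illustration: the service-to-channel mapping follows the two lists in this deck, while the names `POLICY`, `channel_for` and `total_checks` and the `"review"` fallback are assumptions made for the example.

```python
# Hypothetical routing table built from the two lists above:
# "wake me up" services page by SMS, the rest go to email.
POLICY = {
    "NameNode": "sms",
    "JobTracker": "sms",
    "SecNameNode": "sms",
    "DataNode": "email",
    "TaskTracker": "email",
    "Hardware": "email",
}


def channel_for(service, default="review"):
    """Pick the notification channel; unknown services wait for review."""
    return POLICY.get(service, default)


def total_checks(nodes, services_per_node):
    """X nodes * Y services: how many checks could raise an alert."""
    return nodes * services_per_node
```

For example, `total_checks(3000, 4)` gives 12000 individual checks, which is why only a few of them should be allowed to send an SMS.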
Monitoring Battle Plan
Start With the Basics
– Ping, Disk
Add Hadoop Specific Alarms
– check_data_node
Add JMX Graphing
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
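As a rough idea of what a Hadoop-specific alarm can look like, the sketch below tests whether a DataNode port accepts TCP connections. It is not the `check_data_node` script shipped with hadoop-cacti-jtg, whose logic is not shown here. The function names are hypothetical, and port 50010 is an assumption based on the usual default DataNode transfer port in Hadoop releases of this era.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or unresolvable host.
        return False


def check_datanode_port(host, port=50010, timeout=3.0):
    """Return (nagios_exit_code, message); port 50010 is an assumed default."""
    if port_open(host, port, timeout):
        return 0, f"DATANODE OK - {host}:{port} accepting connections"
    return 2, f"DATANODE CRITICAL - {host}:{port} unreachable"
```

A port check only shows that the process is listening; it says nothing about whether the node is healthy from the NameNode's point of view.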
The Basics Nagios
Nagios (All Nodes)
– Host up (Ping check)
– Disk % Full
– Swap > 85%
Component Failure
(Future) Newer Hadoop will have XML status
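The basic disk and swap checks listed above are normally handled by stock Nagios plugins. The sketch below only shows the underlying arithmetic. It assumes a Linux host where swap figures come from `/proc/meminfo`; the function names are hypothetical, and the 85% swap threshold is the one quoted on this slide.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Compute swap usage from the text of Linux /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - free) / total


def swap_alarm(percent_used, threshold=85.0):
    """True when swap usage exceeds the threshold from the slide."""
    return percent_used > threshold
```

On a Linux node, the swap figure could be obtained with `swap_percent_used(open("/proc/meminfo").read())`.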
Monitoring Battle Plan
Start With the Basics
– Ping, Disk (Done)
Add Hadoop Specific Alarms
– check_data_node (Done)
Add JMX Graphing
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
JMX Graphing
Enable JMX
Import Templates
Standard Java JMX
Monitoring Battle Plan
Thus Far
Start With the Basics (Done)
– Ping, Disk
Add Hadoop Specific Alarms (Done)
– check_data_node
Add JMX Graphing (Done)
– NameNodeOperations
Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%
Add JMX based Alarms
hadoop-cacti-jtg is flexible
– extend fetch classes
– Don’t call output()
– Write your own check logic
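The "write your own check logic" step can be sketched as below. hadoop-cacti-jtg itself is Java, and its fetch classes are not reproduced here, so this Python example only illustrates the threshold logic applied to values that have already been fetched over JMX. The function name `jmx_alarm` and the `expected_nodes` parameter are assumptions; the two thresholds are the ones from the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%).

```python
def jmx_alarm(files_total, live_nodes, expected_nodes,
              max_files=1_000_000, min_live_fraction=0.5):
    """Return (nagios_exit_code, message) for already-fetched JMX values."""
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal={files_total} > {max_files}")
    if expected_nodes > 0 and live_nodes / expected_nodes < min_live_fraction:
        problems.append(
            f"LiveNodes={live_nodes}/{expected_nodes} "
            f"below {min_live_fraction:.0%}"
        )
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)
    return 0, f"OK - FilesTotal={files_total}, LiveNodes={live_nodes}"
```

Keeping the check separate from the fetch mirrors the idea on this slide: the fetching code gathers the numbers, and the check decides what exit status they deserve instead of printing them for Cacti.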
Quick JMX Base Walkthrough