
Czech Technical University in Prague

Faculty of Electrical Engineering

Bachelor's Project

Sun servers open-source software systems management
Ondřej Jakubčík

Supervisor: Ing. Josef Hajas


Study Program: Electrical Engineering and Information Technology
Computer Engineering
May 27, 2010

Acknowledgement
I would like to thank my family, my friends and my colleagues for their insight, support and wisdom. I am truly grateful for being surrounded by such brilliant people.

Declaration
I hereby declare that I have completed this project independently and that I have
listed all the literature and publications used.
I have no objection to usage of this work in compliance with the act §60 Zákon
č. 121/2000 Sb. (copyright law), and with the rights connected with the copyright act
including the changes in the act.
In . . . . . . . . . . . . . . . . . . . . . . . on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Abstrakt
The purpose of this bachelor's thesis is to analyze the available software products for systems management, both commercial and open, to further analyze the possibilities of integration with servers made by Oracle (Sun), and to implement an integration solution in a selected tool.
The analysis also includes a theoretical part focused on the usefulness of systems management, the methods used for data acquisition, and the protocols used for monitoring and managing servers.

Abstract
The objective of this bachelor's project is to analyze available systems management products, both commercial and open-source. It analyzes the possibilities of integration with servers made by Oracle (Sun), and its result is an integration into a selected software product.
The analysis also includes a theoretical part focused on the benefits of systems management, the available methods of data acquisition, and the protocols used for monitoring and managing servers.

Contents

1 Introduction
2 Systems management software
    2.1 Commercial offerings
    2.2 Open-source offerings
3 Protocols for system management
    3.1 Simple Network Management Protocol
        3.1.1 Monitoring over SNMP
        3.1.2 Important terms related to SNMP
        3.1.3 Management Information Base
            3.1.3.1 ASN.1
    3.2 Intelligent Platform Management Interface
    3.3 Web-Based Enterprise Management
    3.4 Other protocols
        3.4.1 Remote shell access
        3.4.2 Other protocols
4 Approaches to system management
    4.1 Way of communication
        4.1.1 In-band communication
        4.1.2 Out-of-band communication
        4.1.3 Side-band communication
    4.2 By means of data gathering
        4.2.1 Active monitoring
        4.2.2 Passive monitoring
        4.2.3 Combination of active and passive monitoring
    4.3 Final comparison
5 Sensors and components
6 Management interfaces of Oracle Sun servers
    6.1 System controllers
    6.2 Command-line interface
    6.3 SNMP
        6.3.1 Oracle Sun MIBs
            6.3.1.1 Origin and purpose of these MIBs
            6.3.1.2 Notifications
            6.3.1.3 Polled data
    6.4 IPMI
    6.5 Other interfaces
7 Zenoss integration
    7.1 Choosing an approach
    7.2 Development environment
    7.3 Important design decisions
        7.3.1 Event classes
        7.3.2 Per-trap mapping vs. default mapping
    7.4 Development steps
        7.4.1 Compiling MIBs
        7.4.2 Creating Event classes
        7.4.3 Creating Event mappings
        7.4.4 Adding products
        7.4.5 Final modifications
    7.5 Testing
    7.6 Future extension
Conclusion
CD Contents

1 Introduction
Systems management has become a very important topic in almost every organisation that depends on IT services. It encompasses the entire life cycle of IT infrastructure, including, for example, tracking and documenting requirements, purchasing and renewing equipment, license management, and fault and risk monitoring. While systems management has, in some way, always been present in the IT departments of mid-size to big enterprises, the approach to it was often defined in a company-specific way, with no standardization.
However, many companies now span a number of countries or even continents. For all but the biggest companies, it would be very inefficient to invest in the development of a complete in-house solution for systems management; these companies rely on third-party solutions that offer a cheaper, well-tested and supported alternative.
Decentralization of IT resources is a very important factor in the need for systems management. It has become quite common to have more than one datacenter, often in remote locations, possibly quite far apart from each other, so that in case of an accident at or near one of them, the operations of a company can continue relatively uninterrupted (in this case, by accident we mean either a natural phenomenon, like flooding, storm or fire, or an act of ill will, such as a terrorist attack). Because IT support may not always be present on site, an advance warning of possible component failures is very important. Some, albeit not all, system management software suites can even tie individual systems, groups of systems or even components to a service, so when a failure is imminent, one can see which services are in jeopardy.
Businesses of today rely on IT more than ever before. Even a minute-long outage can cost thousands of dollars. Therefore, some companies (notably telecommunication companies, banks, etc.) build systems with a certain level of redundancy, so that when one system fails, another system takes over in a reasonable amount of time and the interruption is barely noticeable. System management is necessary in this case, as it provides information about the nature of the failure and helps in selecting and migrating to a different system.
Computing power (in the sense of CPU processing speeds, RAM and storage sizes, etc.) keeps growing and its price is falling. However, workloads are so variable that a processing node may be loaded so lightly that the cost of the power it consumes exceeds the value of the work it performs.
This led to a rebirth of one area of the IT industry: virtualization. To a certain degree, virtualization has been possible since 1967, in that case on the IBM CP-40. However, the main reason back then was to enable various software to run unmodified or simultaneously (computers were batch oriented and most software was not designed for any level of multitasking). Now, the reasons for virtualization are consolidation, power consumption reduction and control of expenses.
The availability of relatively cheap but powerful commodity hardware has led to a new architecture of IT: instead of renting a dedicated machine (although this is still possible), one can rent virtual machines, running on a possibly very different set of hardware. With a properly set up infrastructure (Fibre Channel or iSCSI disk arrays, virtualization software supporting live migration, etc.), it is possible to achieve very high availability and reliability.
However, cheaper systems are built from cheaper components that are more prone to failure, so the need for proper monitoring is high. With proper software, the migration of virtual machines in case of a hardware malfunction can be automated.
Power consumption monitoring is a very important part of systems management. With power becoming more expensive, careful monitoring of power consumption in relation to the tasks performed is required to manage the costs of one's IT operations or to properly bill customers (the latter applies specifically to cloud computing customers).
This bachelor's project focuses on one area of systems management: system health monitoring. With the above in mind, we aim for a clear design that allows the features described above to be implemented, or to be connected with existing features already in place.
The objective is to design and implement a Zenoss extension (also known as a ZenPack) that allows the user to discover, monitor and report the system health status of some Oracle Sun servers. Zenoss was chosen because it is a very advanced integration platform, with advanced features such as graphing, so future extensions like recording and analyzing power consumption trends can be implemented. The selection was made in an unpublished work by the author, available separately [1].

2 Systems management software


In this chapter, an incomplete list of both commercial and non-commercial software used for systems management is presented. Where possible, the manageability features of Oracle servers under these particular software solutions are also described.
While there are many software solutions available from various vendors, only a few are listed in this section, just to give a brief overview of the features present. The objective is to make the resulting integration with open-source software comparable to already existing integrations.

2.1 Commercial offerings


The following commercial products have been used by the author to manage Oracle
Sun servers:

CA Unicenter NSM
HP Operations Manager
IBM Director
IBM Tivoli Enterprise Console
IBM Tivoli NetCool OMNIbus

All of these products can do passive monitoring, that is, listen for events received via SNMP traps, system logs or some other mechanism (like direct database entry, command-line tool execution, etc.).
The Tivoli Enterprise Console, also known as TEC, is one of the oldest systems management packages. It relies on the Tivoli Management Framework, which also provides a way to install other extensions and patches. TEC itself has a rather simple GUI written in Java, but the backend consists of many helper programs, usually written in C. TEC is used for passive monitoring only: it waits for events, and those events are processed by an internal engine (some of its parts are based on the Prolog language). This software package, however, requires a preinstalled database system.

Figure 2.1 IBM Tivoli Enterprise Console with graphed amount of incoming events

NetCool OMNIbus is similar to TEC, but it has a more modern GUI. Being a product obtained through an acquisition, it is not written in Java but in a compiled language. It uses a totally different language for writing custom extensions and, as one of the few, it comes with its own database bundled.
Operations Manager, Director and Unicenter NSM are products of different companies, but they have one feature in common: they support active polling. Other than that, they offer similar features, and all can receive and process notifications from Oracle Sun servers.
The following features are present in all integrations with these products:

Translating SNMP traps and notifications into user-readable form.
Removing duplicates of events.
Having events with lower severity automatically close events with higher severity.
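
The last two features in the list above can be illustrated with a short sketch. It is written in Python and does not reproduce the logic of any of the products named; the `EventConsole` class, the event fields and the three-level severity scale are hypothetical and serve only to show the idea of deduplication and automatic closing.

```python
# Hypothetical severity scale: larger numbers are more serious.
CLEAR, WARNING, CRITICAL = 0, 1, 2


class EventConsole:
    """Keeps at most one open event per (host, component) pair."""

    def __init__(self):
        self.open_events = {}  # (host, component) -> event dict

    def receive(self, host, component, severity, message):
        key = (host, component)
        current = self.open_events.get(key)

        if current is not None and severity < current["severity"]:
            # A less severe event closes the more severe one.
            del self.open_events[key]
            if severity == CLEAR:
                return "closed"
            current = None

        if severity == CLEAR:
            return "ignored"  # nothing open that could be closed

        if (current is not None
                and current["severity"] == severity
                and current["message"] == message):
            # Same event again: count it instead of opening a new one.
            current["count"] += 1
            return "duplicate"

        self.open_events[key] = {
            "severity": severity, "message": message, "count": 1,
        }
        return "opened"
```

In this sketch a repeated critical event only increases a counter, and a later clearing event for the same host and component removes the open event.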

Figure 2.2 TEC showing new events

Integrations that support polling can usually at least display the state of system LEDs; some (CA Unicenter NSM) can display a hierarchy of sensors.

2.2 Open-source offerings


In the open-source market, the following major products are currently available:

Nagios
OpenNMS
Zabbix
Zenoss

Figure 2.3 CA Unicenter NSM showing hierarchy of sensors

Nagios is the oldest and most mature open-source product. It is very scalable and well documented, but its web GUI lacks some modern features, which of course means it is very fast, albeit sometimes not very user friendly.
It is written mainly in C, which is another reason for its high speed. Monitoring data is obtained by running checks, either built-in ones or user-supplied scripts called plugins, whose exit code and (optionally) output are processed and evaluated by Nagios.
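
To make the plugin mechanism more concrete, the following sketch shows what such a check might look like in Python. The exit codes 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN) follow the usual Nagios plugin convention; the temperature check itself, its thresholds and the function names are assumptions made only for illustration.

```python
import sys

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(temperature, warn=60.0, crit=75.0):
    """Map a temperature reading to (exit code, one-line status text)."""
    if temperature is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if temperature >= crit:
        code = CRITICAL
    elif temperature >= warn:
        code = WARNING
    else:
        code = OK
    return code, "TEMP %s - %.1f C" % (LABELS[code], temperature)


def main(argv):
    # The reading is taken from the command line to keep the sketch
    # self-contained; a real plugin would query the monitored device.
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        reading = None
    code, text = evaluate(reading)
    print(text)   # the output line that Nagios displays
    return code   # the exit code that Nagios evaluates

# A plugin script would finish with: sys.exit(main(sys.argv))
```

Nagios would run such a script periodically, show the printed line in its GUI and derive the service state from the exit code.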
Checks can be run either locally or remotely using a tool called NRPE (Nagios Remote Plugin Executor). In addition to having Nagios run a check actively (see subsection 4.2.1), one can also feed data into Nagios asynchronously (see subsection 4.2.2). For more information, please see www.nagios.org or [2].
Figure 2.4 Nagios showing status of services (image from www.nagios.org)

OpenNMS is another network monitoring/management software package. While Nagios achieves portability across different platforms by using C as its programming language, OpenNMS is written in Java, which also makes it very portable. It requires a backing database. It provides a more modern GUI to the user; otherwise its features are mostly comparable to those of the others.
Figure 2.5 OpenNMS event list

From [3]:
Zabbix is an enterprise-class open source distributed monitoring solution. Zabbix is software that monitors numerous parameters of a network and the health and integrity of servers. Zabbix uses a flexible notification mechanism that allows users to configure e-mail based alerts for virtually any event. This allows a fast reaction to server problems. Zabbix offers excellent reporting and data visualisation features based on the stored data. This makes Zabbix ideal for capacity planning.
Zabbix supports both polling and trapping. All Zabbix reports and statistics, as well as configuration parameters, are accessed through a web-based front end. A web-based front end ensures that the status of your network and the health of your servers can be assessed from any location. Properly configured, Zabbix can play an important role in monitoring IT infrastructure. This is equally true for small organisations with a few servers and for large companies with a multitude of servers.

Zabbix is written in C and PHP and requires a backing database.
Finally, we look at Zenoss, which is our integration platform. The official documentation [4] says:
Zenoss is today's premier open source IT management solution. Through integrated monitoring, it enables you to manage the status and health of your infrastructure through a single, Web-based console.
The power of Zenoss starts with its in-depth Inventory and Configuration Management Database (CMDB). Zenoss creates this database by discovering managed resources (servers, networks, and other devices) in your IT environment. The resulting environment model provides a complete inventory of your key systems, down to the level of resource components (interfaces, services, and processes, and installed software.)
With the model built, you can use Zenoss' integrated availability and performance monitoring features to monitor and report on all aspects of your IT infrastructure. Zenoss also provides events and fault management features that tie into the CMDB. These features help drive operational efficiency and productivity by automating many of the notification, alerts, escalation, and remediation tasks you perform each day.

Zenoss is written in Python and is based on the Zope application platform and, like most of the previously mentioned software products, it requires a database, specifically MySQL.

Figure 2.6 Zenoss with list of manufacturers


3 Protocols for system management


Systems management can be thought of as a network application. As such, it needs one or more protocols that allow the user to gather data (for a description of data gathering methods, please see chapter 4). These protocols differ in their complexity, reliability and verbosity.
Some devices may also implement two or more protocols simultaneously, but the amount of data exposed may not be the same, even for the same device. Also, the level of support for these protocols varies considerably (e.g. very few software packages support IPMI out of the box). In this chapter we describe some of the most commonly used systems management protocols.

3.1 Simple Network Management Protocol


Taken from Wikipedia [5]:
Simple Network Management Protocol (SNMP) is a UDP-based network protocol. It is used mostly in network management systems to monitor network-attached devices for conditions that warrant administrative attention. SNMP is a component of the Internet Protocol Suite as defined by the Internet Engineering Task Force (IETF). It consists of a set of standards for network management, including an application layer protocol, a database schema, and a set of data objects.
SNMP exposes management data in the form of variables on the managed systems, which describe the system configuration. These variables can then be queried (and sometimes set) by managing applications.
Although in the early days of the internet the term network device mostly meant a computer, the specification is designed to be very device-independent, therefore devices such as

servers
routers
racks
switches
wireless access points
uninterruptible power supplies

can be monitored. Since SNMP can be implemented even on very small devices, it can also be used for equipment such as air conditioning controllers.
Currently, SNMP exists in three versions (the years of standardization by the Internet Engineering Task Force are given in parentheses):

SNMP v1 (1988-1990) [6-8]
SNMP v2c (1993)
SNMP v3 (2002)

Even though the latest version of SNMP brings very important new features, like authentication and encryption, it is still not supported by some network management software suites.

3.1.1 Monitoring over SNMP


A network infrastructure that implements monitoring contains two important software components: the agent and the network management software, also known as NMS.
The agent implements the SNMP protocol and uses it to expose data. The structure of the data is defined using a Management Information Base (see below). Vendors usually choose to define their MIBs very broadly, so an agent implementing a particular MIB may not make use of all its structures.
The network management software also uses the MIB to gather and translate the data it gets from the agent, and performs further processing, among others statistics, error notification and automated error processing.
SNMP supports both active and passive monitoring. In active monitoring, the NMS uses SNMP requests (gets or sets) to get data or set configuration parameters on the managed device directly. When monitoring passively, the NMS only listens for SNMP data coming from the managed device (SNMP uses two terms, trap and notification; they are often used interchangeably, although the first term refers to SNMP v1 and the latter to SNMP v2c and v3). Version 2c also specifies an inform packet, which differs from a trap or notification in that it makes the NMS send a confirmation when such a packet is received. However, this mechanism is rarely used.

SNMP is a datagram protocol, and therefore there is a possibility of data being lost en route. This is especially important when using passive monitoring: network elements such as routers can cause UDP packets to be lost, and in the case of a fatal error (by fatal error we mean an error that causes the monitored device to power off) the notification may not be received at all, so the error is discovered only through some other malfunction (typically a segment of the network being down, or a service like a database or web server being inaccessible).

3.1.2 Important terms related to SNMP


When working with SNMP-based technologies, one often comes across the following terms:

OID
varbind
table
scalar
index

OID is an abbreviation for object identifier. It is represented as a dotted n-tuple of integers (MIBs actually describe the textual representation of these OIDs).
Varbind stands for variable binding. It consists of an OID and its value, which can be another OID, a number, a string, or any other data structure expressible using ASN.1.
A scalar value is defined in a MIB and is always referenced using a single OID.
A table is defined in a MIB too, but to access the rows in its columns, one must append an index after the OID of the column. A table is simply a set of columns.
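
The relationship between these terms can be illustrated with a small Python sketch. The helper functions are hypothetical and model OIDs simply as tuples of integers; the two OIDs used are the well-known sysDescr scalar and the ifDescr column of the standard interface table, while the interface name in the varbind is invented for the example. Note that on the wire the single instance of a scalar is addressed by appending 0 to its OID.

```python
def parse_oid(text):
    """Turn a dotted string such as '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def format_oid(oid):
    """Turn a tuple of integers back into dotted notation."""
    return ".".join(str(number) for number in oid)


def scalar_instance(scalar_oid):
    """OID of the single instance of a scalar: the scalar OID plus 0."""
    return scalar_oid + (0,)


def cell_oid(column_oid, index):
    """OID of one table cell: the column OID followed by the row index."""
    return column_oid + tuple(index)


SYS_DESCR = parse_oid("1.3.6.1.2.1.1.1")     # scalar: system description
IF_DESCR = parse_oid("1.3.6.1.2.1.2.2.1.2")  # column: interface description

# A varbind pairs an OID with a value; the value here is made up.
varbind = (cell_oid(IF_DESCR, (3,)), "eth2")
print(format_oid(varbind[0]), "=", varbind[1])
```

Reading a whole table column then amounts to requesting the column OID with each row index appended in turn.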

3.1.3 Management Information Base


As mentioned in subsection 3.1.1, there is a special format that describes the data sent over SNMP. The format of a MIB is derived from ASN.1 (see subsection 3.1.3.1). Formally, it is defined in [9], which states:

Management information is viewed as a collection of managed objects, residing in a virtual information store, termed the Management Information Base (MIB). Collections of related objects are defined in MIB modules. These modules are written using an adapted subset of OSI's Abstract Syntax Notation One, ASN.1 [10]. It is the purpose of this document, the Structure of Management Information (SMI), to define that adapted subset, and to assign a set of associated administrative values.

3.1.3.1 ASN.1
Abstract Syntax Notation One is one of many approaches to describing data structures. What makes it stand out is that it not only allows the specification of a structure, but also describes its encoding and decoding into various formats (ranging from binary formats to XML).
ASN.1 is an international standard adopted by the International Telecommunication Union (ITU) and by ISO/IEC. It has been standardized as [10-13]. Due to its versatility, ASN.1 and its hierarchical data model are used in other application protocols as well, including internet telephony (H.323) and directory services (LDAP).

3.2 Intelligent Platform Management Interface


Rather than being a single protocol specification, IPMI specifies a full set of physical interfaces to a system controller, a communication protocol and a data representation. It is specified in [14], a standard designed by a consortium of computer manufacturers led by Intel. Quoting [14]:
The IPMI specifications define standardized, abstracted interfaces to the platform management subsystem. IPMI includes the definition of interfaces for extending platform management between board within the main chassis, and between multiple chassis.
The term platform management is used to refer to the monitoring and control functions that are built in to the platform hardware and primarily used for the purpose of monitoring the health of the system hardware. This typically includes monitoring elements such as system temperatures, voltages, fans, power supplies, bus errors, system physical security, etc. It includes automatic and manually driven recovery capabilities such as local or remote system resets and power on/off operations. It includes the logging of abnormal or out-of-range conditions for later examination and alerting where the platform issues the alert without aid of run-time software. Lastly it includes inventory information that can help identify a failed hardware unit.

3.3 Web-Based Enterprise Management


A modified excerpt from [15]:
WBEM is a set of management and Internet standard technologies developed to unify the management of distributed computing environments, facilitating the exchange of data across otherwise disparate technologies and platforms. It consists of a core set of standards developed by DMTF (Distributed Management Task Force), which includes the Common Information Model (CIM), CIM-XML, CIM Query Language, WBEM Discovery using Service Location Protocol (SLP) and WBEM Universal Resource Identifier (URI) mapping. In addition, the DMTF has developed a WBEM Management Profile template, allowing for simplified profile development to deliver a complete, standalone definition for the management of a particular system, subsystem, service or other entity.
WBEM is extensible, facilitating the development of platform-neutral, reusable infrastructure, tools and applications. In addition to its use by vendors, end users and the open source community, WBEM is enabling other industry organizations to build on its foundation in areas including Web services, security, storage, grid and utility computing.

The openness of the WBEM specifications led to the development of several implementations, notably OpenPegasus [16] and WMI (Windows Management Instrumentation). WMI does not rely on Web Services, but rather on COM objects and RPC calls.
WBEM is now part of many operating systems: apart from Windows' WMI, it is present in most enterprise Linux distributions and in commercial Unices, like Oracle Solaris and HP-UX.

3.4 Other protocols

3.4.1 Remote shell access


System management has traditionally used a particularly simple approach: a serial line or one of its alternatives, telnet or secure shell access, to the system controller or to the system itself.
The system controller on most server platforms offers a broad range of system management possibilities. Besides power control and console control, it also gives the system administrator the ability to display the status of sensors and to list system events.

# ssh root@myhost
Password:
Waiting for daemons to initialize...
Daemons ready
Sun(TM) Integrated Lights Out Manager
Version 3.0.6.1.d r48331
Copyright 2009 Sun Microsystems, Inc. All rights
reserved.
Use is subject to license terms.
-> show /SYS product_name
/SYS
Properties:
product_name = SPARC-Enterprise-T5220

Figure 3.1

Output from service console

Although the output is optimized for human reading and not for programmatic analysis, there are well-established tools that can parse this output (expect [17]) and feed the resulting data to system management software.
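
As an illustration of such parsing, the following Python sketch extracts the `name = value` properties from console output like that in Figure 3.1. It is not the expect tool mentioned above; the function name and the regular expression are assumptions chosen for this example, and the sample text is copied from the figure.

```python
import re

# A property line looks like "product_name = SPARC-Enterprise-T5220".
PROPERTY_RE = re.compile(r"^\s*([A-Za-z_][\w-]*)\s*=\s*(.*?)\s*$")


def parse_properties(output):
    """Collect all 'name = value' lines of a console listing into a dict."""
    properties = {}
    for line in output.splitlines():
        match = PROPERTY_RE.match(line)
        if match:
            properties[match.group(1)] = match.group(2)
    return properties


ILOM_SAMPLE = """\
-> show /SYS product_name
/SYS
Properties:
product_name = SPARC-Enterprise-T5220
"""

print(parse_properties(ILOM_SAMPLE))
```

Lines that do not match the pattern, such as the prompt and the section headings, are simply skipped.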
This technique applies not only to system controllers, but also to BIOSes and even operating system command-line utilities. There are a few Zenoss extensions (ZenPacks) that use the technique of parsing text output to deliver information on processes, CPU load, storage status and more.

# cat /proc/partitions
major minor  #blocks  name
   8     0  312571224 sda
   8     1  309917916 sda1
   8     2          1 sda2
   8     5    2650693 sda5

Figure 3.2 Output of cat command
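
Column-oriented output such as that in Figure 3.2 can be turned into structured data with very little code. The sketch below is a minimal Python example, not taken from any existing ZenPack; the function name and the dictionary keys are chosen freely, and the sample text repeats the values from the figure.

```python
def parse_partitions(text):
    """Parse /proc/partitions style output into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four fields, the first three numeric;
        # the header line and empty lines are skipped.
        if len(fields) != 4 or not all(f.isdigit() for f in fields[:3]):
            continue
        major, minor, blocks = (int(f) for f in fields[:3])
        rows.append({"major": major, "minor": minor,
                     "blocks": blocks, "name": fields[3]})
    return rows


PARTITIONS_SAMPLE = """\
major minor  #blocks  name

   8     0  312571224 sda
   8     1  309917916 sda1
   8     2          1 sda2
   8     5    2650693 sda5
"""

for row in parse_partitions(PARTITIONS_SAMPLE):
    print(row["name"], row["blocks"])
```

The resulting list can then be handed over to the monitoring software, for example as one component per partition.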

3.4.2 Other protocols


In addition to the protocols listed above, there are some other protocols used for system management. One of the more mature ones is the syslog protocol.
The Unix system log protocol is specified in [18]. It was designed with networking in mind, so although it is generally used on the local host, it is possible to set up the daemon to filter and forward messages to a network host. On this host, further processing can be done. Usually, traditional syslog will not record the originating host name, so there needs to be a special daemon, or the system logging daemon needs a special configuration.
Being a very old protocol, it has almost no security (besides facilities like rejecting a host that is not on a list), and by generating a flood of messages, it is possible to overload the daemon or fill the space in the /var/log filesystem, which may lead to unexpected failures.
Commercial products (especially those that contain or can be used with their own agents on remote hosts) also use various RPC mechanisms. Among the most common are the following:

ONC RPC (Open Network Computing Remote Procedure Call) [19]
CORBA (Common Object Request Broker Architecture) [20]
SOAP (Simple Object Access Protocol) [21]
XML-RPC (XML Remote Procedure Call) [22]

A description of these protocols is beyond the scope of this project; for further information please consult the references. In the case of proprietary software, details about the usage of these protocols may not be fully known, therefore their use as a communication protocol with custom software may be very challenging.

4 Approaches to system management


In this chapter we describe possible approaches to system management and compare them in terms of protocol requirements, generated network traffic and reliability.
Possibly the simplest approach to system management (more specifically, system health monitoring) is simply to wait until a device stops working, rendering some service or services unusable. While it is possible to do so (indeed, the author has observed such an approach in an educational institution), there is no advance warning, and therefore such an approach is only feasible in environments where setting up monitoring would be more expensive than repairing failed systems.

4.1 Way of communication


To be able to monitor any system, there must be a way to connect to it. In systems
management, we usually use one of the following four communication channels:

local only
in-band communication
out-of-band communication
side-band communication

By local communication we usually mean non-network communication with the monitored system. This may involve manually connecting a serial console (e.g. a laptop with a serial line) or a display, keyboard and mouse. Watching status LEDs in person can also be used for quick system status checking. For the purpose of this project, we will not consider this a viable method of system monitoring. All other communication channels are described below.

4.1.1 In-band communication


In-band communication is a way of system monitoring and management communication in which the monitoring data is sent over the same network channel as production data (e.g. web traffic).
This implies that the operating system on the monitored device has to support the handling of management traffic (usually, this is accomplished by running a so-called agent). It also means that management traffic occupies (at least partially) useful bandwidth and that the agent uses some CPU cycles.
On the other hand, this type of communication poses no additional requirements on the existing network infrastructure: no additional cabling is required and no changes to network switches and routers need to be made. Especially when dealing with many servers, the savings on network infrastructure may be significant.
One significant drawback of this approach is that without a running operating system, management may not be possible (although servers with Wake-on-LAN capability can at least be turned on remotely).

4.1.2 Out-of-band communication


Out-of-band communication is complementary to in-band communication. It uses its own network port or, in some setups, a serial line connected to a network terminal server.
Monitoring capabilities therefore do not depend on a running operating system, nor does the monitoring traffic affect production network bandwidth and CPU load. Depending on the system controller (this term is used mainly in connection with SPARC systems; other terms in use are BMC, baseboard management controller, and SP, service processor), additional features may be offered to the system administrator, for example console redirection, storage redirection and management, or firmware update. Power control is one of the basic features.
This type of communication requires additional cabling and switching, so the resulting network infrastructure is denser and also more expensive. The system controller, on the other hand, does not use any special network features, so very low-cost commodity switches can be used.
Security of this dedicated management network is of vital concern to the user. A breach may lead to disruption of management traffic, and it may be possible to overload the system controller. By breaking into the system controller, an adversary could not only take the entire system down (possibly damaging production data), but might also be able to boot a totally different operating system from redirected storage, leading to a data leak or intentional corruption. Of course, booting a different operating system using a direct (i.e. production network) breach is also possible, but this channel is expected to be much more secure (strong passwords, firewall, etc.). A separate network, however, may lead to the temptation to keep default passwords, therefore it is very important to develop and enforce security guidelines with the same strictness as the guidelines applying to operating system and network security.
In conclusion, the drawback of this approach is higher network infrastructure costs, but for setups requiring additional features like storage redirection, this approach is beneficial.

4.1.3 Side-band communication


Side-band communication combines the best features of both communication methods described above. It usually involves a system controller that uses the same network port as the production network, but operates in a separate virtual LAN (VLAN).
Features are usually comparable to those of out-of-band communication, yet there are some savings in network infrastructure. Setting up network components to correctly route traffic based on VLAN information may be more complicated than with the other methods.
Finally, not all service controllers support this type of communication, so unless a larger number of servers support it, investing time into setting up side-band monitoring in addition to any of the previously mentioned ways is probably not a worthwhile effort.

4.2 By means of data gathering

4.2.1 Active monitoring


By active monitoring we mean a setup where the monitoring station (i.e. a box running monitoring software) actively queries managed (monitored) devices.
Certain protocols (like IPMI) support only this type of monitoring, others (like SNMP) support both active and passive.
During active monitoring, the following data (although not all of it may be available) is usually gathered and/or updated at regular time intervals:

list of hardware components with their statuses
list of sensors with current values, thresholds and statuses
overall system health status

Depending on the verbosity of the data obtained and on the time intervals, active monitoring can cause significant network traffic (this may be unfavourable especially when using in-band communication). However, the amount of data transmitted may be regulated by selecting only a subset of data (e.g. checking a system status and reading an extended set of data only when the status changes).
An advantage of active monitoring is reliability: even when using unreliable data transfer (the UDP protocol used with SNMP), the monitoring station can usually detect missing data and request it again.
Another huge advantage is the ability to gather statistically relevant data to be stored and processed (like power consumption, network port traffic etc.). Advanced features of monitoring software can include graphing and reporting, which can in turn be used to consolidate computing resources in a power-efficient way.
This type of monitoring is supported by most network devices, ranging from servers to low-cost switches.

4.2.2 Passive monitoring


Passive monitoring is the opposite (and complementary) approach to active monitoring: this time, it is the responsibility of the monitored device to report a status change to the monitoring software. Based on the received information, the monitoring station performs some action, either predefined or defined by the user. Actions can range from operator notification using paging or SMS to automatic failure correction (like starting a migration of virtual machines).
However, when using unreliable data transport (UDP), a passive notification may not be received at all. Also, especially when using the SNMP protocol, the management station does not usually send a reception confirmation. Multiple switches en route can adversely affect datagrams, causing a message to be delayed, received out of order or lost entirely. To mitigate this, some management and monitoring software can listen for SNMP notifications in the local network and forward them to the master management host using a reliable protocol (in most software this is implemented as RPC, either the original ONC RPC, a web service call or a proprietary protocol).
A huge advantage is that very little network traffic is generated, and this method is also very CPU friendly (neither the agent or system controller nor the monitoring station processes huge amounts of data).
This method may not be supported by all devices.

4.2.3 Combination of active and passive monitoring


When both above-mentioned approaches are combined, possibly the most reliable monitoring system can be built. However, not all monitoring packages allow these two approaches to be combined.
The modus operandi is as follows:
1. The monitoring station reads all data using the active approach (i.e. the full repository).
2. Monitored hosts issue notifications based on their status changes.
3. The monitoring station updates its data either by:
a. using solely data from the passive notification
b. refreshing all data from the appropriate monitored device
4. Once in a while, the monitoring station refreshes all data (just in case a notification was lost).
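
The four steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the class name, the `poll_device` callable and the refresh interval are hypothetical and do not correspond to any particular monitoring package.

```python
import time


class CombinedMonitor:
    """Toy model of combined active/passive monitoring (all names hypothetical)."""

    def __init__(self, poll_device, full_refresh_interval=300.0, clock=time.monotonic):
        self.poll_device = poll_device      # callable: device name -> dict of values
        self.full_refresh_interval = full_refresh_interval
        self.clock = clock
        self.repository = {}                # device name -> last known data
        self.last_refresh = {}              # device name -> time of last full poll

    def full_poll(self, device):
        # Steps 1 and 4: read the complete data set using the active approach.
        self.repository[device] = dict(self.poll_device(device))
        self.last_refresh[device] = self.clock()

    def on_notification(self, device, changes, refresh_all=False):
        # Step 3: either trust the notification (3a) or re-read everything (3b).
        if refresh_all or device not in self.repository:
            self.full_poll(device)
        else:
            self.repository[device].update(changes)

    def tick(self):
        # Step 4: periodic safety net in case a notification was lost.
        now = self.clock()
        for device, refreshed in list(self.last_refresh.items()):
            if now - refreshed >= self.full_refresh_interval:
                self.full_poll(device)
```

Calling `tick()` from a timer keeps the repository eventually consistent even when a UDP notification never arrives.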

4.3 Final comparison


To choose correctly between the various approaches to monitoring, it helps to compare these methods in tables:
Feature                       In-band            Out-of-band    Side-band
OS independent                no                 yes            yes
Communication port            shared             separate       shared
Uses host CPU                 yes                no             no
Special net. requirements     none               yes, cabling   yes, setup
Display/storage redirection   needs OS support   yes            yes
Power management              limited            yes            yes

Table 4.1 Comparison of communication methods


Feature                   Active            Passive            Combination
Comm. initiator           management host   monitored device   both
Network traffic           high              low                medium
Reliability               high              lower              highest
Stat. data available      yes               no                 yes
Mgmt software support     medium            very high          very low
Managed devices support   high              lower

Table 4.2 Comparison of data acquisition methods


The selection in a particular setup will depend on the available software, the number and type of devices, the current network infrastructure hierarchy and also the time and budget allotted.

5 Sensors and components


Before we can get deeper into the actual data presented by Oracle Sun system controllers and agents, we need to define and explain terms that are connected with a server.
A component is any functional part of the server. Components may find themselves in a number of states:

present
absent
functioning
about to malfunction
malfunctioning
unknown

A very closely related term is sensor. Sensors are usually connected with components, although they may be connected with a whole system. There are fundamentally two types of sensors:

physical (e.g. voltage, fan speed, etc.)
virtual (e.g. system is OK)

The difference is that virtual sensors are computed based on physical sensors. It should be noted that for some virtual sensors, the underlying physical sensors may be hidden.
Physical sensors usually detect values being out of range or simple true/false conditions. Some types of physical sensors:

button sensor (power buttons, chassis intrusion detection)
fan speed sensor
current sensor
presence sensor
temperature sensor
voltage sensor


Among virtual sensors are those whose condition is based on the state of other sensors (e.g. a power sensor measuring in watts will be calculated from the appropriate voltage and current sensors) or on a condition detected by software. For example:

memory ECC error sensor
OK/not--OK sensor
power sensor

Some sensors (mostly physical) have thresholds set up. A threshold is a value which the measured value must reach and cross for the sensor to change its state. Usually, only sensors that measure continuous values (numeric sensors, the opposite being discrete sensors) have defined thresholds:

non--critical
critical
non--recoverable

When a non--critical threshold is crossed, a notification is usually generated, but the condition is not severe and it won't impact the function of the system. Staying beyond a critical threshold may affect reliability and endurance. Crossing a non--recoverable threshold usually signals that something has gone very wrong, and the system is immediately shut down (although this can be modified and sometimes disabled).
Also, thresholds can be low and high. For example, a temperature sensor measuring ambient temperature has all six thresholds defined (high temperatures are as undesirable as freezing temperatures).
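
As a rough illustration of how the six thresholds of a numeric sensor could be evaluated, consider the following sketch. The field names and the example values are assumptions made for this example only; they are not taken from any Oracle Sun MIB or firmware.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Six optional thresholds of a numeric sensor (field names are hypothetical)."""
    lower_non_recoverable: Optional[float] = None
    lower_critical: Optional[float] = None
    lower_non_critical: Optional[float] = None
    upper_non_critical: Optional[float] = None
    upper_critical: Optional[float] = None
    upper_non_recoverable: Optional[float] = None


def classify(value: float, t: Thresholds) -> str:
    """Return the most severe state whose threshold the value has reached or crossed."""
    checks = [
        ("non-recoverable", t.lower_non_recoverable, t.upper_non_recoverable),
        ("critical", t.lower_critical, t.upper_critical),
        ("non-critical", t.lower_non_critical, t.upper_non_critical),
    ]
    # Check the most severe pair first so that it wins over milder ones.
    for state, low, high in checks:
        if low is not None and value <= low:
            return state
        if high is not None and value >= high:
            return state
    return "ok"
```

With illustrative ambient temperature thresholds such as `Thresholds(-10.0, 0.0, 5.0, 35.0, 40.0, 45.0)`, a reading of 41.0 is classified as critical and a reading of 3.0 as non-critical.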
Discrete sensors have only a certain set of states they can have. Here is an incomplete list of discrete values certain sensors can have:

disabled
memory error detected
OK/fail
present/absent

Both kinds of sensors have so-called assertions and deassertions. These two are opposites of each other. An assertion means that the sensor assumes some state (usually an error state); a deassertion means that the sensor leaves a state that was previously asserted.
However, this may sometimes be tricky, so let's see an example. We have a sensor HDD0 (the names are usually longer, but for the sake of the example let's keep this one) that has the following states:

Device Present
Device Absent
Hot Spare
Rebuild In Progress

and for all of them, both assertion and deassertion are enabled. In this particular example, having the sensor in Device Present Assert means that the particular device is present. Similarly, Device Absent Assert means that the device has been removed.
There is, however, one more approach: have the device in Device Absent Deassert and Device Present Deassert. These mean the same things as the ones in the previous paragraph: the device has been inserted (is no longer absent) and the device has been removed (is no longer present), respectively. Any integration dealing with sensors must be aware of this and preferably should translate incoming notifications into one common format and discard the less common and more confusing one.
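
The translation into one common format could look like the following minimal sketch. The table of opposite states is a hypothetical example built from the HDD0 sensor above; a real integration would have to derive it from the sensors it actually encounters.

```python
# Hypothetical pairs of mutually exclusive states.
OPPOSITES = {
    "Device Present": "Device Absent",
    "Device Absent": "Device Present",
}


def normalize(state: str, asserted: bool):
    """Rewrite '<state> Deassert' as '<opposite state> Assert' where possible.

    Returns a (state, asserted) tuple; deassertions of states without a known
    opposite are returned unchanged.
    """
    if not asserted and state in OPPOSITES:
        return OPPOSITES[state], True
    return state, asserted
```

States without a natural opposite, such as Rebuild In Progress, keep their deassertion, because there is no assertion that would express the same fact.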


6 Management interfaces of Oracle Sun servers


Since this project focuses on systems management of Oracle Sun servers, we first need to describe the management capabilities of these servers.
Oracle (and previously Sun) has a very broad portfolio of servers. However, for this project, we will focus on the following hardware families:

Oracle Sun Fire x86 Servers (X2000 and X4000 series)
Oracle Sun SPARC Enterprise Server (T1000, T2000 and T5000 series)
Oracle Sun Blade Server Modules (X6000 and T6000 series)
older Sun Fire Servers (SPARCs, V210 for example)

The work will be done primarily on the latest available servers (i.e. not End--of--Life ones). Although it may seem a waste of time to also target servers no longer in production, it is the author's belief that these servers may still be present, especially in educational institutions, where their performance is still sufficient and having an open--source tool for monitoring will be more than beneficial.

6.1 System controllers


All the servers mentioned above have a special, independent computer on--board that controls power, monitors environmental and system characteristics (voltages, device presence, fan speeds etc.) and reports them using the methods described below. This computer is called a system controller on SPARCs and a service processor on x86 servers.
On the Oracle Sun servers mentioned above, one may encounter the following versions of system controllers:

Advanced Lights Out Manager (ALOM) [23 and 24]
Embedded Lights Out Manager (eLOM) [25]
Integrated Lights Out Manager (ILOM) [26]

ALOM is the oldest of the three, and one can find it only on older SPARC servers (there are two versions, ALOM and ALOM--CMT, the first being used on sun4u platforms and the latter on servers with the UltraSPARC T1 processor; these processors can run several threads in parallel, which is also called Chip Multithreading, hence the abbreviation CMT).
ALOM has only a command line interface and can send e-mail to the administrator in the event of a malfunction; newer versions of ALOM--CMT also support the SNMP protocol. There is no web GUI, though. ALOM is primarily out--of--band (using a serial line or its own network port), but it can be configured from within Solaris using the scadm(1M) command. The features are pretty much standard:

power control
serial console redirection
logical domains (on CMT machines, [27])
environment monitoring
listing, disabling and enabling components

eLOM, on the other hand, can be found only on older x86 platforms. It offers a command line interface, an SNMP interface and a web interface. In addition to the features listed for ALOM (except logical domains), eLOM has these additional features:

graphical console redirection
storage redirection

ILOM is the latest and actively developed system controller software. It can be found on both SPARC and x86 servers and it offers everything ALOM and eLOM offer together.

6.2 Command--line interface


The command--line interface is universally available on all three service controllers. However, the syntax of commands differs considerably (to mitigate this for veteran SPARC administrators, ILOM on SPARC can be run in ALOM--compatible mode, so that most commands and possibly even scripts these administrators know or have written will work as expected). Please see the examples:


# ssh root@alom-server
Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Sun(tm) Advanced Lights Out Manager CMT v1.7.6

Please login: admin
Please Enter password: *****
sc> showhost
Sun-Fire-T2000 System Firmware 6.7.6 2009/10/29 16:06

Host flash versions:
OBP 4.30.4 2009/08/19 07:24
Hypervisor 1.7.3.a 2009/10/29 15:50
POST 4.30.4 2009/08/19 07:47

Figure 6.1 ALOM example: information about server

The command--line interface can be accessed over the following interfaces:

serial line
telnet (may be disabled for security reasons)
secure shell
internally over OS tool (e.g. scadm(1M))

6.3 SNMP
The SNMP interface is arguably the most used interface for system management. Both eLOM and ILOM have supported SNMP from their very first versions; ALOM--CMT started to support SNMP directly relatively late.
However, either due to the absence of an SNMP interface (ALOM--CMT prior to v1.4) or due to a simple wish to monitor the system in--band, there are so-called agents. There are currently two:

Monitoring Agent for Sun Fire and Netra Systems (MASF) [28]
Oracle Server Hardware Management Agent [29]

MASF is available only on SPARC systems, but it supports both ALOM (including the CMT variant) and ILOM system controllers. On the other hand, the Hardware Management Agent supports only x86 systems and only those running specific versions of ILOM.

# ssh root@elom-host
root@elom-host's password:
Sun(TM) Embedded Lights Out Manager
Copyright 2004-2006 Sun Microsystems, Inc. All rights reserved.
Version 2.91
Hostname: SUNSP0016365B97FB
IP address: 10.18.141.146
MAC address: 00:16:36:5B:97:FB
System serial number: 0624QC0029
/SP -> show /SP/SystemInfo/ProductInfo
/SP/SystemInfo/ProductInfo
Targets:
Properties:
ProductManufacturer = Sun Microsystems
ProductProductName = Sun Fire X2200 M2
ProductPartlNumber = 1S39U9ZST61
ProductSerialNumber = 0624QC0029
AssetTag =
Target Commands:
show

Figure 6.2 eLOM example: information about server

All system controllers supporting SNMP and both agents can be configured to accept incoming SNMP requests for data (useful when monitoring these systems actively, also known as polling) and/or they can send SNMP traps or notifications on their own (passive monitoring). However, the format of the data differs considerably among the types of service controllers and agents. Its structure is important for further work on the integration with Zenoss, so the data structure (described using MIBs) will be discussed in the next section.

# ssh root@sparc-ilom
Password:
Waiting for daemons to initialize...
Daemons ready
Sun(TM) Integrated Lights Out Manager
Version 3.0.6.1.d r48331
Copyright 2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Warning: password is set to factory default.
-> show /SYS
...
Properties:
type = Host System
ipmi_name = /SYS
keyswitch_state = Normal
product_name = SPARC-Enterprise-T5220
product_part_number = 602-3821-08
product_serial_number = BEL07513TT
product_manufacturer = SUN MICROSYSTEMS
fault_state = OK
power_state = On
...

Figure 6.3 ILOM example: information about server

6.3.1 Oracle Sun MIBs


The format and purpose of a MIB were already defined (see section 3.1.3 on page 13). Oracle Sun systems (or more precisely, the system controllers and agents) implement some of the following MIBs:


ENTITY-MIB
SUN-PLATFORM-MIB
SUN-ILOM-PET-MIB
SUN-HW-TRAP-MIB
SUN-HW-MONITORING-MIB
SUN-ASR-NOTIFICATION-MIB

In the following paragraphs, we will look into these MIBs in greater detail.

6.3.1.1 Origin and purpose of these MIBs


ENTITY-MIB is the only MIB that has not been defined by Oracle (formerly Sun). It is defined in an independent specification [30]. The purpose of the MIB is given as follows ([30]):
    In particular, it (this MIB) describes managed objects used for managing multiple logical and physical entities managed by a single SNMP agent.
ENTITY-MIB contains structures that (in terms of server management) describe various components of the server, including details such as the count and type of processors, the manufacturer of DIMM modules etc.
SUN-PLATFORM-MIB is a MIB that extends ENTITY-MIB with details about operational state, and it also contains tables that identify and list system sensors, together with their thresholds and current values. This MIB also defines some notifications that can be used to dynamically modify the model of the monitored system and/or can be translated and displayed to the user. However, these traps do not carry all the information (like the type of sensor issuing the warning), so additional action is required to get such information (typically, this is done using a regular expression that looks for a certain pattern in sensor names). Using regular expressions is a quick and functional way, but the author believes the correct approach is to poll the agent or system controller for the correct sensor type based on the OIDs present in the received notifications. These two MIBs are supported in MASF (SPARC) and all ILOMs and eLOMs.
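
The regular expression approach mentioned above can be sketched as follows. The patterns and the sensor names used with them are hypothetical; real sensor names differ between platforms and firmware versions, so a production rule set would have to be built from observed data.

```python
import re

# Hypothetical name patterns, checked in order; the first match wins.
SENSOR_PATTERNS = [
    (re.compile(r"/V_|VOLT", re.IGNORECASE), "voltage"),
    (re.compile(r"/T_|TEMP", re.IGNORECASE), "temperature"),
    (re.compile(r"FAN|RPM|TACH", re.IGNORECASE), "fan speed"),
    (re.compile(r"/I_|CURRENT", re.IGNORECASE), "current"),
]


def guess_sensor_type(sensor_name: str) -> str:
    """Guess the sensor type from its name; return 'unknown' if nothing matches."""
    for pattern, sensor_type in SENSOR_PATTERNS:
        if pattern.search(sensor_name):
            return sensor_type
    return "unknown"
```

The "unknown" result is exactly the case in which polling the agent for the sensor type, as preferred by the author, becomes necessary.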
SUN-ILOM-PET-MIB is one of the MIBs that doesn't use the typical Sun (Oracle) OID tree; instead it uses the tree wiredformgmt (Wired for Management). This is an OID tree reserved by Intel [31] for so-called PETs (Platform Event Traps). These largely correspond with IPMI and often carry similar data. However, each generated trap carries a computed specific type (a number that identifies the type of trap or notification being sent). Most NMSes can't deal with dynamic specific types; they expect these numbers to be assigned statically and defined in the MIB, and that is the purpose of this MIB. However, in case there is another PET MIB by a different vendor, they will share the OID tree and the numbers will collide. Not only will the names and descriptions of most or all notifications be different, but some may have a totally different meaning.
SUN-HW-TRAP-MIB was designed relatively recently with a single purpose: to eliminate the need for regular expression matching or polling the agent when a trap is received. Hence, a direct display of these traps is preferred.
SUN-HW-MONITORING-MIB was designed to remove the dependency on ENTITY-MIB and to provide some more information about the monitored system. It features data like cumulative state, which is computed on the monitored host side. The main advantage of this approach is saving network traffic: the NMS may poll only a few values in the MIB and get the full tree only in case something goes wrong. This MIB is implemented only in the Hardware Management Agent.
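
The traffic-saving polling pattern can be sketched like this. The object names sunHwMonCumulativeSensorAlarmStatus and sunHwMonInventoryTable come from the MIB; the two callables and the status values are assumptions standing in for whatever SNMP library the NMS uses.

```python
def poll_cycle(snmp_get, snmp_walk, previous_status):
    """One polling cycle: read a cheap scalar first, walk a table only on change.

    snmp_get and snmp_walk are caller-supplied callables with hypothetical
    signatures: snmp_get(name) -> value, snmp_walk(name) -> list of rows.
    """
    status = snmp_get("sunHwMonCumulativeSensorAlarmStatus")
    details = None
    if status != previous_status:
        # The cumulative status changed, so fetch the detailed data as well.
        details = snmp_walk("sunHwMonInventoryTable")
    return status, details
```

In the common case only a single value crosses the network per cycle; the expensive walk happens only after a change.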
SUN-ASR-NOTIFICATION-MIB is currently implemented by the ASR agent. Description from [32]:
    ASR is a secure, scalable, customer--installable software feature of warranty and SunSpectrum support that provides auto-case generation when specific hardware faults occur. ASR is designed to enable faster problem resolution by eliminating the need to initiate contact with Sun for hardware failures, reducing both the number of phone calls needed and overall phone time required. ASR also simplifies support operations by utilizing electronic diagnostic data.
In case a hardware error is detected, the ASR agent sends details about the error, together with a unique identifier of the system, to Oracle, where the data is filtered and entered as a Service Request on behalf of the customer. This saves time and communication effort. In addition, ASR generates an SNMP notification to inform the customer about the Service Request being created on their behalf.

6.3.1.2 Notifications

It is not feasible to describe every single notification declared in all MIBs, as that would make this document excessively long and also very quickly outdated. In this section, we will describe the basic principles behind notifications in Oracle (Sun) MIBs.


ENTITY-MIB has only one notification, entConfigChange. Its sole purpose is to inform the NMS that a configuration change has occurred and that it should reread all data.
SUN-PLATFORM-MIB has at present twelve notifications defined. These notifications were designed to work in cooperation with ENTITY-MIB, and as such each notification carries an OID that points to the ENTITY-MIB and contains some additional information. However, this is not practical for integrations that only translate notifications, so there is an additional varbind, sunPlatNotificationAdditionalInfo, that contains a human--readable text of the event that occurred.
SUN-ILOM-PET-MIB was already briefly described. What is interesting about its notifications is that they contain only one varbind, but with a string of encoded binary data. Among the data there is also a sensor name, which is often decoded from the trap while the rest is discarded, as the meaning of the notification is already given by the specification.
SUN-HW-TRAP-MIB is the only MIB designed solely for the purpose of sending traps. As of now, it has seventy-three notifications defined. The names of the notifications contain both the type of sensor on which the event occurred and the threshold that was crossed. The additional varbinds carry the full name of the sensor, the threshold value and the current value. Example:

sunHwTrapVoltageNonCritThresholdExceeded: a non--critical threshold was exceeded
sunHwTrapVoltageOk: the voltage is OK now

Please bear in mind that SNMP is UDP based, and therefore each trap with lower severity (e.g. one suggesting the system is getting into a better condition) should automatically close all previous events with higher severity, if they were sent for the same sensor.
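
The auto-closing rule can be illustrated with a toy event console. This is a sketch under stated assumptions: the class, the severity names and the sensor names are hypothetical, and this is not the Zenoss event API.

```python
SEVERITY = {"ok": 0, "non-critical": 1, "critical": 2, "non-recoverable": 3}


class EventConsole:
    """Keeps the open events per sensor (a toy model, not the Zenoss event API)."""

    def __init__(self):
        self.open_events = {}   # sensor name -> list of open severities

    def receive(self, sensor, severity):
        level = SEVERITY[severity]
        # Events with a higher severity than the incoming one are closed.
        kept = [s for s in self.open_events.get(sensor, []) if SEVERITY[s] <= level]
        # An "ok" trap only closes events; anything else is opened if not yet open.
        if level > 0 and severity not in kept:
            kept.append(severity)
        self.open_events[sensor] = kept
        return kept
```

This way a lost "critical cleared" trap does not leave a stale critical event open once a later trap with lower severity arrives for the same sensor.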
SUN-ASR-NOTIFICATION-MIB has only five notifications:

sunAsrSrCreatedTrap
sunAsrSrCreationInProgressTrap
sunAsrSrUpdatedTrap
sunAsrSrDelayedTrap
sunAsrSrFailureTrap


With these notifications, the NMS can display appropriate messages when a service request has been created, is being created, has been updated, is delayed or has failed, respectively.

6.3.1.3 Polled data


ENTITY-MIB contains the following tables:

entPhysicalTable
entLogicalTable
entLPMappingTable
entAliasMappingTable
entPhysicalContainsTable

It also contains the scalar value entLastChangeTime.

Taken from [30]:
    The entPhysicalTable contains one row per physical entity, and must always contain at least one row for an overall physical entity, which should have an entPhysicalClass value of 'stack(11)', 'chassis(3)' or 'module(9)'.
    Each row is indexed by an arbitrary, small integer, and contains a description and type of the physical entity. It also optionally contains the index number of another entPhysicalEntry indicating a containment relationship between the two.
    The entLogicalTable contains one row per logical entity. Each row is indexed by an arbitrary, small integer and contains a name, description, and type of the logical entity. It also contains information to allow access to the MIB information for the logical entity.
    The entLPMappingTable contains mappings between entLogicalIndex values (logical entities) and entPhysicalIndex values (the physical components supporting that entity). A logical entity can map to more than one physical component, and more than one logical entity can map to (share) the same physical component.
    The entAliasMappingTable contains mappings between entLogicalIndex, entPhysicalIndex pairs and 'alias' object identifier values. This allows resources managed with other MIBs (e.g., repeater ports, bridge ports, physical and logical interfaces) to be identified in the physical entity hierarchy. Note that each alias identifier is only relevant in a particular naming scope.
    The entPhysicalContainsTable contains simple mappings between 'entPhysicalContainedIn' values for each 'container/containee' relationship in the managed system. The indexing of this table allows an NMS to quickly discover the 'entPhysicalIndex' values for all children of a given physical entity.
    Scalar object entLastChangeTime represents the value of sysUptime when any part of the Entity MIB configuration last changed.
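
To illustrate the containment relationship described in the quotation, the following sketch rebuilds the component hierarchy from (entPhysicalIndex, entPhysicalContainedIn) pairs. The function names and the sample rows are hypothetical; the sketch assumes the rows have already been retrieved from the agent, and that a containedIn value of 0 marks an entity that is not contained in any other entity.

```python
from collections import defaultdict


def build_children(rows):
    """Map each entPhysicalIndex to the indexes of the entities it contains."""
    children = defaultdict(list)
    for index, contained_in in rows:
        children[contained_in].append(index)
    return children


def render_tree(rows, names):
    """Return the hierarchy as a list of indented lines, top-level entities first."""
    children = build_children(rows)
    lines = []

    def visit(index, depth):
        lines.append("  " * depth + names[index])
        for child in sorted(children.get(index, [])):
            visit(child, depth + 1)

    for root in sorted(children.get(0, [])):
        visit(root, 0)
    return lines
```

The same lookup structure is what entPhysicalContainsTable offers directly on the agent side, which saves the NMS from walking the whole entPhysicalTable.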

SUN-PLATFORM-MIB is an extension of ENTITY-MIB. Specifically, it augments entPhysicalTable with information about Oracle/Sun specific equipment, and most importantly it adds information about sensors (i.e. when a row in entPhysicalTable refers to a sensor, the agent implementing the MIB will fill in details about this sensor, like sensor type, thresholds and values, into the appropriate table with the same index as the row in entPhysicalTable).
SUN-HW-MONITORING-MIB is independent of ENTITY-MIB and is complemented by SUN-HW-TRAP-MIB, which contains the definitions of notifications.
This MIB contains similar data to ENTITY-MIB, but the data is spread among more tables:

sunHwMonInventoryTable
sunHwNumericVoltageSensorTable
sunHwDiscreteVoltageSensorTable
sunHwNumericCurrentSensorTable
sunHwDiscreteCurrentSensorTable
sunHwNumericPowerDeviceSensorTable
sunHwDiscretePowerDeviceSensorTable
sunHwNumericCoolingDeviceSensorTable
sunHwDiscreteCoolingDeviceSensorTable
sunHwNumericTemperatureSensorTable
sunHwDiscreteTemperatureSensorTable
sunHwNumericProcessorSensorTable
sunHwDiscreteProcessorSensorTable
sunHwNumericMemorySensorTable
sunHwDiscreteMemorySensorTable
sunHwNumericHardDriveSensorTable
sunHwDiscreteHardDriveSensorTable
sunHwNumericIOSensorTable
sunHwDiscreteIOSensorTable


sunHwNumericSlotOrConnectorSensorTable
sunHwDiscreteSlotOrConnectorSensorTable
sunHwNumericOtherSensorTable
sunHwDiscreteOtherSensorTable
sunHwMonIndicatorTable
sunHwMonTotalPowerConsumption

As one can see, this MIB is more fine-grained than ENTITY-MIB. In addition to these tables, certain values of interest are also directly available as scalars, which radically simplifies writing management extensions. There are quite a few scalars; only some are listed below (for a full list and description see the MIB itself, it is well commented):

sunHwMonProductName
sunHwMonProductType
sunHwMonCumulativeSensorAlarmStatus
sunHwMonIndicatorServiceName
sunHwMonIndicatorServiceCurrentStatus

6.4 IPMI
IPMI is supported only in eLOM and ILOM. Utilities that access system controllers over IPMI (e.g. ipmitool(1M), [33]) can use two connection methods:

out--of--band or side--band over network
locally over KCS interface

While the first is always available, KCS (Keyboard Controller Style) was not available on SPARC systems until recently; this was caused by a missing driver, not a hardware defect [35].
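
The two connection methods translate into different ipmitool invocations, which the following sketch assembles as argument lists. The helper function is hypothetical, and the interface names are assumptions: "lanplus" presumes an IPMI v2.0 capable controller and "open" presumes a Linux host with the OpenIPMI driver (other operating systems use a different local interface name).

```python
def ipmitool_command(subcommand, host=None, user=None, password_file=None):
    """Build an ipmitool argument list for a network or a local (KCS) connection."""
    if host is None:
        # Local access through the operating system's IPMI driver.
        args = ["ipmitool", "-I", "open"]
    else:
        # Network access; "lanplus" selects the IPMI v2.0 RMCP+ interface.
        args = ["ipmitool", "-I", "lanplus", "-H", host]
        if user is not None:
            args += ["-U", user]
        if password_file is not None:
            args += ["-f", password_file]
    return args + list(subcommand)
```

The resulting list can be passed to `subprocess.run`, for example `ipmitool_command(["sdr", "list"], host="10.0.0.5", user="root")` for a remote sensor listing.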

6.5 Other interfaces


All of the system controllers can send notifications using e-mail, and they can also forward events to a system logging daemon running on a remote host. To the author's knowledge, these interfaces are seldom used.
However, the web interface is used quite often; it offers a quick way to check server status and server components, and also to upgrade firmware remotely without having to run a TFTP server.

Figure 6.4 ILOM login screen

7 Zenoss integration
Since we have now described all management protocols, approaches and the available interfaces of Oracle Sun servers, we can start designing and implementing the Zenoss integration. The resource materials [36-41] were invaluable and provided all the information needed for designing and implementing the integration.

7.1 Choosing an approach


Zenoss supports both the active and the passive approach. To be able to actively poll system controllers or agents for data, it is necessary to develop plugins in Python that extend Zenoss' object model. While the API is not overly complex and ENTITY-MIB modelling is already present, it would be time consuming to implement the other MIB (SUN-HW-MONITORING-MIB), and management capabilities would thus be limited to system controllers with ILOM and eLOM and to SPARC hosts running MASF.
On the other hand, implementing trap handling is easier, and as a result of implementing support for SUN-PLATFORM-MIB and SUN-HW-TRAP-MIB notifications many more platforms will be supported.
Eventually, the desired functionality is that of the existing integrations with IBM Tivoli Enterprise Console [42] or IBM Tivoli NetCool OMNIbus [43].

7.2 Development environment


The development environment was a VirtualBox virtual machine running Debian GNU/Linux 5.0 with the Zenoss 2.5.1 stack installed (recently updated to 2.5.2). Development followed Jane Curry's [40] approach: the development tree was stored outside of Zenoss and versioned in a Mercurial repository.

7.3 Important design decisions

7.3.1 Event classes


Zenoss organizes events into event classes. There are certain already existing classes, like /Hw/Perf etc. Two approaches were possible:
1. extend existing event classes
2. create a completely separate namespace with new event classes
While the first approach would suggest that the integration would fit seamlessly into the existing environment (especially helpful when users already have some paging, e-mail or other notifications set up), the second approach guarantees that there will be no clashes with the existing setup (unless, of course, the user creates their own event classes with the same names).
As this integration should not break anything in the end user's setup, it has been decided to create a completely separate namespace.

7.3.2 Per-trap mapping vs. defaultmapping


When Zenoss receives an event (in this case caused by receiving an SNMP notification), it will try to process the event using the Event Class Key, which is usually the name of the SNMP notification (provided the MIB is loaded and compiled). To do that, it will search its database and look for Event Class Mappings, which play a similar role to rules in other software.


Figure 7.1 Zenoss Event Processing


When the mapping is not found, it will look for a defaultmapping that may process the generated event. Although it would be simpler to develop just one block of code to process these events, there is a concern that running a larger block of code for every single notification would make the application much slower. Hence, the decision was made to create a mapping for every single SNMP notification.
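
The difference between the two designs can be modelled with a plain dictionary lookup. This is only a conceptual sketch: the handler functions and the `dispatch` helper are hypothetical and do not use the Zenoss API; the trap name and event class names are the ones used elsewhere in this document.

```python
def handle_voltage_warning(event):
    # Small, specific handler: the event class is known from the trap name alone.
    return "/Events/Oracle/Voltage"


def handle_default(event):
    # Catch-all handler: would have to inspect every event to classify it.
    return "/Events/Oracle/Other"


# One entry per SNMP notification name (the Event Class Key).
MAPPINGS = {
    "sunHwTrapVoltageNonCritThresholdExceeded": handle_voltage_warning,
}


def dispatch(event):
    """Pick the per-trap handler if one exists, otherwise fall back to the default."""
    handler = MAPPINGS.get(event["eventClassKey"], handle_default)
    return handler(event)
```

With per-trap mappings each handler stays short, whereas a single default handler would run its full classification logic for every notification.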

7.4 Development steps


In this section we will describe the steps taken to develop this integration. There is one step common to all subsequent steps: once it has been verified that the described action was successful, the resulting objects are added to the ZenPack (called ZenPacks.ojakubcik.OracleHwMonitoring), the ZenPack is exported and then committed to the Mercurial repository.

7.4.1 Compiling MIBs


This is arguably the simplest step. It involves copying the MIBs used to the location where Zenoss expects them ($ZENHOME/share/mibs/site). The $ZENHOME environment variable is set by default for the user zenoss.
Then, as user zenoss, one has to run the command

$ zenmib -v 10

to process the new MIBs and load them into Zenoss.

7.4.2 Creating Event classes


Before creating mappings, it is necessary to have all the event classes to which we
want to map events. Based on the two MIBs used now, the following classes will
be created:

/Events/Oracle
/Events/Oracle/Voltage
/Events/Oracle/Temperature
/Events/Oracle/Electrical Current
/Events/Oracle/Fan Speed
/Events/Oracle/Other
/Events/Oracle/Power Supply
/Events/Oracle/Fan
/Events/Oracle/Processor
/Events/Oracle/Memory
/Events/Oracle/Hard Drive

/Events/Oracle/IO
/Events/Oracle/Slot or Connector
/Events/Oracle/Component
/Events/Oracle/FRU
/Events/Oracle/Power Consumption

These can be created from the GUI by following the Events menu item in the left navigation bar and then clicking Add New Organizer in the menu to the left of
Subclasses.
However, it is also possible to do this using the zendmd tool, which is essentially
a Python interpreter with preloaded Zenoss classes [44] (this is just a skeleton script;
the full version can be found on the CD in the directory scripts as the file createEventClasses.py):

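import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase

# Connect to the Zenoss object database and get the DMD root
dmd = ZenScriptBase(connect=True).dmd

event_classes = [
    '/Events/Oracle',
    '/Events/Oracle/Voltage',
    # ... (remaining event classes from the list above)
]

# Create one organizer per event class path, then persist the changes
for ec in event_classes:
    dmd.Events.manage_addOrganizer(ec)
commit()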

As a result, we now have all the event classes we need in place and can proceed to
creating the event mappings.

7.4.3 Creating Event mappings


The recommended procedure for creating Event Class mappings is to have the Zenoss
SNMP daemon receive all possible notifications and then to create the mappings
from the GUI. These mappings can then be modified again from the GUI [39].

However, if we do that for just one notification, we can observe that the following attributes are present (pre-filled values are in parentheses); the rest have to be filled in manually:

Name (SNMP trap name, e.g. sunPlatObjectCreation)


Event Class Key (SNMP trap name, e.g. sunPlatObjectCreation)
Sequence (number, in my case 7)
Rule
Regex
Example (snmp trap sunPlatObjectCreation)
Transform
Explanation
Resolution
The meaning of these fields is given in [36]:

Name: An identifier for this event class mapping. Not important for matching events.
Event Class Key: Must match the incoming event's eventClassKey field
for this mapping to be considered as a match for events.
Sequence: Sequence number of this mapping, among mappings with an
identical event class key property. Go to the Sequence tab to alter its position.
Rule: Provides a programmatic secondary match requirement. It takes a
Python expression. If the expression evaluates to True for an event, this
mapping is applied.
Regex: The regular expression match is used only in cases where the rule
property is blank. It takes a Perl Compatible Regular Expression (PCRE).
If the regex matches an event's message field, then this mapping is applied.
Transform: Takes Python code that will be executed on the event only if
it matches this mapping. For more details on transforms, see the section
titled Event Class Transform.
Explanation: Free-form text field that can be used to add an explanation
field to any event that matches this mapping.
Resolution: Free-form text field that can be used to add a resolution field
to any event that matches this mapping.
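How these fields interact during matching can be sketched in plain Python. This is an illustrative model of the description above, not Zenoss code: events and mappings are plain dictionaries, select_mapping is a hypothetical name, Python's re module stands in for PCRE, and a mapping with neither rule nor regex is assumed to match on the key alone.

```python
import re


def select_mapping(mappings, event):
    """Pick the first mapping, in sequence order, that matches the event."""
    candidates = [m for m in mappings
                  if m["eventClassKey"] == event["eventClassKey"]]
    for m in sorted(candidates, key=lambda m: m["sequence"]):
        if m.get("rule"):
            # The rule is a Python expression evaluated against the event.
            if eval(m["rule"], {"evt": event}):
                return m
        elif m.get("regex"):
            # The regex is consulted only when the rule is blank.
            if re.search(m["regex"], event.get("message", "")):
                return m
        else:
            # Assumption: with no rule and no regex, the key match suffices.
            return m
    return None
```

In this model the Sequence field only matters when several mappings share the same Event Class Key, which is why it can be left at its generated value when every notification gets exactly one mapping.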
Although we could enter all the mappings using the GUI, this would be error
prone and not very efficient. Luckily, as Zenoss is based on Zope, every GUI action
has a corresponding Python function that can be called.

To manipulate event classes, we first need to get the object that represents them.
This can be done with the following method:

dmd.Events.getOrganizer(name)

where name is the full path to the event class organizer.


Each organizer has a method createInstance that takes one parameter: the identifier of the created mapping (in our case, this will be the name of the notification).
This method returns an instance of EventClassInst, which we will manipulate
further.
EventClassInst has attributes that correspond to the fields described earlier
(e.g. eventClassKey). After creating the new mapping instance, all we need to do
is set the corresponding attributes using standard Python syntax and finally commit
everything into the ZODB (Zope Object Database) by calling the commit() procedure.
The following list describes which attributes need to or should be set, and how:

eventClassKey and id shall be set to the translated name of the SNMP notification.
example shall be set to snmp trap <name>.
transform shall contain Python code that modifies the received event's text and severity, and possibly sets other values so that clearing works.
explanation and resolution may contain text explaining the nature of the
event.
The Transform field, corresponding to the transform attribute, will contain different
Python code for notifications from different MIBs. Some notifications may be dropped
automatically:

# Drop this event from the event console (send it straight to history)
evt._action = "history"

Most of the traps from SUN-HW-TRAP-MIB will have processing similar to this
(please note that although MIBs do specify a user-friendly mapping of integers to
names, Zenoss does not use these mappings):

# Get interesting attributes
component = getattr(evt, 'sunHwTrapComponentName', None)
threshold_type = getattr(evt, 'sunHwTrapThresholdType', None)
threshold_value = getattr(evt, 'sunHwTrapThresholdValue', None)
reading = getattr(evt, 'sunHwTrapSensorValue', None)

if threshold_type == 1:
    # Upper
    thr_type_text = "upper"
    thr_word = "over"
    thr_compare = ">="
elif threshold_type == 2:
    # Lower
    thr_type_text = "lower"
    thr_word = "below"
    thr_compare = "<="
else:
    # Unknown threshold
    evt._action = "drop"
    evt.severity = 2 # Info
    return

# <Sensor type> and SEVERITY are placeholders filled in per notification
evt.summary = "<Sensor type> sensor %{component}s: reading is ..."
evt.component = component
evt.severity = SEVERITY
evt._action = "status"
# Severity scale: 0 = Clear, 1 = Debug, 2 = Info, 3 = Warning, 4 = Error, 5 = Critical
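Because Zenoss does not apply the integer-to-name mappings from the MIB, the transform has to translate the threshold type itself. The following stand-alone sketch isolates that translation so that it can be tried outside Zenoss. The values 1 and 2 follow the transform above; the function name describe_threshold and the exact summary wording are illustrative assumptions, not the text used in the actual ZenPack.

```python
# Threshold type values as used in the transform above (1 = upper, 2 = lower).
THRESHOLD_WORDS = {
    1: ("upper", "over", ">="),
    2: ("lower", "below", "<="),
}


def describe_threshold(sensor, component, threshold_type, reading, threshold):
    """Build a summary line, or return None for an unknown threshold type."""
    words = THRESHOLD_WORDS.get(threshold_type)
    if words is None:
        return None
    kind, word, compare = words
    return "%s sensor %s: reading %s is %s the %s threshold (%s %s)" % (
        sensor, component, reading, word, kind, compare, threshold)
```

A table lookup like this keeps the per-notification code short, which matters here because the same logic is generated into many mappings.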

Other notifications will have similar processing. How do we put all this together?
Let's put together an algorithm:

1. Construct a list of notification names.
2. For each notification, assign an Event Class and severity.
3. Based on predefined templates, generate the transformation code for each notification.
4. For each notification, find the appropriate organizer (Event Class) and, based on the
previously obtained information, create a mapping.

When this is done, one may end up with the following script. Of course, this is not the
complete script; the full version is present on the CD. First, we need to prepare a list of
notifications, together with their Event Classes:

definitions = []
# No /Events/Oracle needed, that is added automatically
# Sun HW Trap MIB - threshold notifications
for sensor_short, sensor_type, zen_group in [
        ('Voltage', 'Voltage', '/Voltage'),
        ('Temp', 'Temperature', '/Temperature'),
        # ...
        ]:
    for thr_value, severity, threshold_type in [
            ('Fatal', 5, 'non-recoverable'),
            ('Crit', 4, 'critical'),
            ('NonCrit', 3, 'non-critical')]:
        # Notification name, e.g. sunHwTrapVoltageCritThresholdExceeded
        name = ('sunHwTrap' + sensor_short + thr_value +
                'ThresholdExceeded')
        organizer = zen_group
        # Fill the transform template with per-notification values
        transform = hw_thr_assert % {
            'severity': severity,
            'type': sensor_type,
            'threshold_type': threshold_type}
        d = {
            'name': name,
            'organizer': organizer,
            'transform': transform}
        definitions.append(d)

Here, hw_thr_assert and hw_thr_deassert are strings that contain the
templates for the transformation scripts to be entered into Zenoss.
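The real templates are part of the full script on the CD and are not reproduced here. To show the mechanism only, the following sketch uses a hypothetical template, hw_thr_assert_example, whose content is an assumption. The point is that placeholders such as %(severity)d are filled in when the mapping is generated, while a doubled %% survives as a single % for formatting that must happen later, when Zenoss runs the transform.

```python
# Hypothetical template; the real hw_thr_assert lives in the full script.
hw_thr_assert_example = '''\
component = getattr(evt, 'sunHwTrapComponentName', None)
evt.summary = "%(type)s sensor %%s: %(threshold_type)s threshold exceeded" %% component
evt.component = component
evt.severity = %(severity)d
evt._action = "status"
'''

# Generation time: fill in the per-notification values.
transform = hw_thr_assert_example % {
    'severity': 4,
    'type': 'Voltage',
    'threshold_type': 'critical',
}
```

After expansion, transform holds ordinary Python source in which only the component name remains to be substituted at event time.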
Once the definitions list is filled with transformation rules, we
can loop through it and create the mappings in Zenoss:

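for definition in definitions:
    # Look up the target event class, e.g. /Events/Oracle/Voltage
    org = dmd.Events.getOrganizer('/Events/Oracle' +
                                  definition['organizer'])
    # Create the mapping; its identifier is the notification name
    inst = org.createInstance(definition['name'])
    inst.example = 'snmp trap ' + definition['name']
    inst.transform = definition['transform']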

Finally, we need to add some preamble to the script:

import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase
dmd = ZenScriptBase(connect=True).dmd

Also we need to commit the changes to database:

commit()
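Before running such a script against a live ZODB, the mapping loop can be exercised with stand-in objects. The sketch below is an assumption-laden dry run: FakeMapping and FakeOrganizer are hypothetical classes that only mimic the createInstance call and the attributes used in this section, and create_mappings wraps the loop shown above so that the organizer lookup can be swapped.

```python
class FakeMapping(object):
    """Stand-in for the object returned by createInstance (illustrative)."""
    def __init__(self, name):
        self.id = name
        self.example = ''
        self.transform = ''


class FakeOrganizer(object):
    """Stand-in for an event class organizer (illustrative)."""
    def __init__(self, path):
        self.path = path
        self.instances = {}

    def createInstance(self, name):
        inst = FakeMapping(name)
        self.instances[name] = inst
        return inst


def create_mappings(get_organizer, definitions, prefix='/Events/Oracle'):
    """Run the mapping loop through any organizer lookup function."""
    created = []
    for definition in definitions:
        org = get_organizer(prefix + definition['organizer'])
        inst = org.createInstance(definition['name'])
        inst.example = 'snmp trap ' + definition['name']
        inst.transform = definition['transform']
        created.append(inst)
    return created
```

Inside a Zenoss session the same function could be called as create_mappings(dmd.Events.getOrganizer, definitions), followed by commit().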

7.4.4 Adding products


Finally, we may want to add a new manufacturer and a list of products. This again
can be done from the GUI or from the command line using zendmd.
However, the syntax here is not as easy as in the first example, so for the purpose of
this project, products were created by hand using the GUI.
The manufacturer Oracle was added to Zenoss, and a list of servers was created:

Oracle Sun Fire X2250 Server


Oracle Sun Fire X2270 Server
Oracle Sun Fire X4100 M2 Server
Oracle Sun Fire X4200 M2 Server
Oracle Sun Fire X4600 M2 Server
Oracle Sun Fire X4540 Server
Oracle Sun Fire X4140 Server
etc.

7.4.5 Final modifications


Even though scripting the creation of the mappings saved a considerable amount
of time, the script cannot be expected to generate all messages and severities
correctly. Hence, walking through the generated mappings is recommended, and modifying the generated code to make it more efficient for the given purpose is encouraged.
Small modifications were needed especially for notifications that cover more
than one event (sunHwTrapHardDriveStatus) and for most SUN-PLATFORM-MIB
notifications.

7.5 Testing
The optimal approach to testing would be to create automation that simulates
failures on physical machines, which would in turn respond with notifications. Semi-manual checking would then be required to confirm that the integration works as
expected.
However, due to time constraints and the unavailability of all testing machines, a
different approach was chosen. One server (Oracle Sun SPARC Enterprise T5220
Server) was configured to send notifications from both the system controller and the MASF agent
to the IP address of the machine running Zenoss with this integration. Hard drives, power supplies and fans were then removed and reinstalled to verify that traps are received
and cleared.
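What "received and cleared" means in this test can be pictured with a toy model of an event console. The sketch is a deliberate simplification and not Zenoss code: it assumes that an event with severity 0 (Clear) closes open events that share the same device, component and event class, and the function apply_event and the sample values are hypothetical.

```python
CLEAR = 0


def apply_event(open_events, event):
    """Return the open-event list after processing one incoming event."""
    key = (event["device"], event["component"], event["eventClass"])
    if event["severity"] == CLEAR:
        # A clear event removes matching open events and is not kept itself.
        return [e for e in open_events
                if (e["device"], e["component"], e["eventClass"]) != key]
    return open_events + [event]
```

Removing a fan would add an open event for that component, and reinstalling it would deliver the matching clear event, leaving events for other components untouched.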

7.6 Future extension


As of now, the integration has just basic functionality. The following paragraphs describe
possible new features to be developed, possibly as future work by the author.
Testing framework. To ensure this software works, a complete automated testing
framework supporting physical servers needs to be developed and run regularly.
Better clearing mechanism. Right now, because of the way Zenoss handles clearing
(only events with the Clear severity can clear others), all notifications ending with Deassert are given the Clear severity. This is not always accurate: when
a sensor reading drops below the non-recoverable threshold, it may still be in the
critical range rather than OK.
Polling. This would mean developing a Zenoss plugin that would discover
and model the server using data obtained by periodically reading MIB data.
Model updates from traps. Instead of, or in addition to, writing to the event console when
an SNMP notification is received, a previously obtained model of the server could
either be updated directly or a reread of all data could be forced. This of course requires
functional polling; to function properly, the model will need to be refreshed
from time to time anyway, to make sure that no SNMP notification was lost en route.
Graphing and reporting. Based on data obtained by the previous two extensions, it
would be possible to implement graphing and reporting, showing for example temperature trends and, more importantly, power consumption.

8 Conclusion
This project was partially research oriented and partially implementation oriented. As a result, a brief yet hopefully useful description of system management motivations, technologies and software was given.
In addition, a basic but functional integration into an open-source system management tool was developed and tested (albeit only in a limited way), by which this project
fulfilled its assignment.
The author implemented a new and previously unknown (or at least not publicly described) way to create Event Class mappings programmatically.
However, the original idea of a complete monitoring solution that would handle
polling, graphing and notifications simultaneously was not realized. This solution
therefore does not use all features of Zenoss, and there is room for
improvement, as described earlier.

References
[1] O. Jakubčík, Selecting open-source system management solution for integrating
with Sun servers (unpublished, 2009). Available on CD.
[2] E. Galstad Nagios Core Version 3.x Documentation. (2009).
[3] Zabbix SIA, Zabbix 1.8 manual.
[4] Zenoss, Inc., Zenoss: getting started (Zenoss, Inc., 2009).
[5] Wikipedia, Simple network management protocol (2010).
[6] M. Rose and K. McCloghrie, RFC1155: Structure and identification of management information for TCP/IP-based internets (IETF, 1990).
[7] K. McCloghrie and M. Rose, RFC1156: Management Information Base for network management of TCP/IP-based internets (IETF, 1990).
[8] J. Case, M. Fedor, M. Schoffstall, and J. Davin, RFC1157: Simple Network Management Protocol (SNMP) (IETF, 1990).
[9] K. McCloghrie, D. Perkins, and J. Schoenwaelder, RFC2578: Structure of Management Information Version 2 (SMIv2) (IETF, 1999).
[10] ITU, Abstract Syntax Notation One: Specification of basic notation (ITU, 2002a).
[11] ITU, Abstract Syntax Notation One: Information object specification (ITU,
2002b).
[12] ITU, Abstract Syntax Notation One: Constraint specification (ITU, 2002c).
[13] ITU, Abstract Syntax Notation One: Parameterization of ASN.1 specifications
(ITU, 2002d).
[14] Intel, HP, NEC, and Dell, Intelligent Platform Management Interface Specification (Intel, 2009). Second generation, v2.0.
[15] DMTF, Inc., Web-based enterprise management (wbem) faqs (DMTF, Inc., 2010).
[16] The Open Group OpenPegasus. (2010). www.openpegasus.org.
[17] D. Libes, The expect home page (Don Libes, 2009). http://expect.nist.gov/.
[18] R. Gerhards, RFC5424: The Syslog Protocol (IETF, 2009).
[19] R. Thurlow, RFC5531 RPC: Remote Procedure Call Protocol Specification Version
2 (IETF, 2009).
[20] Object Management Group, Inc. Common Object Request Broker Architecture
(CORBA) Specification, Version 3.1. (2008).
[21] World Wide Web Consortium SOAP Version 1.2 Part 1: Messaging Framework.
(2007). Second edition.
[22] D. Winer, XML-RPC specification (xml-rpc.com, 1999).
[23] Sun Microsystems, Inc. Sun Advanced Lights Out Manager (ALOM) 1.6 Administration Guide. (2007b). 819-2445-11.

[24] Sun Microsystems, Inc. Advanced Lights Out Management (ALOM) CMT v1.4
Guide. (2007a). 819-7991-10.
[25] Sun Microsystems, Inc. Embedded Lights Out Manager Administration Guide: For
the Sun Fire X2200 M2 and Sun Fire X2100 M2 Servers. (2009). 819-6588-14.
[26] Oracle, Inc. Oracle Integrated Lights Out Manager (ILOM) 3.0 Getting Started
Guide. (2010c). 820-5523-11.
[27] Oracle, Inc. Oracle VM Server for SPARC. (2010e). (formerly LDOMS).
[28] Sun Microsystems, Inc. Sun SNMP Management Agent for Sun Fire and Netra
Systems. (2004).
[29] Oracle, Inc. Sun Server Management Agents 2.0 User's Guide. (2010b).
821-1610.
[30] K. McCloghrie and A. Bierman, RFC2737: Entity MIB (Version 2) (IETF, 1999).
Obsoleted by RFC 4133.
[31] Intel, HP, NEC, and Dell Platform Event Trap Format Specification. v1.0.
[32] Oracle, Inc. Auto Service Request (ASR) v2.6: Installation and Operations
Guide. (2010a). http://wikis.sun.com/display/ASRSO/Home.
[33] D. Laurie IPMItool. (2007). http://ipmitool.sourceforge.net/.
[34] Oracle, Inc. IPMItool. (2010d).
http://www.sun.com/systemmanagement/tools.jsp.
[35] Sun Microsystems, Inc., PSARC 2008/119 sun4v /dev/bmc (Sun Microsystems,
Inc., 2008). (not available publicly).
[36] Zenoss, Inc. Zenoss Administration. (2010b).
[37] Zenoss, Inc. Zenoss Developer's Guide. (2010c).
[38] Zenoss, Inc., Zenoss 2.5 source code documentation (Zenoss, Inc., 2010a).
[39] J. Curry Zenoss Event Management. (2010). version 3.
[40] J. Curry, Creating Zenoss ZenPacks (Jane Curry, 2009a).
[41] J. Curry Crafting Zenoss Core users for events and zProperties. (2009b). draft.
[42] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli Enterprise
Console Environment. (2009b).
[43] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli Netcool/OMNIbus Environment. (2009a).
[44] N. Brockett, batchaddlocations.py (Zenoss, Inc., 2009).

A CD Contents
As a part of this project, a CD was created. It contains the following files and directories:

Others/: Directory containing other documents.
Project/: Directory containing the PDF file of this project.
RFC/: Directory containing RFCs.
ZenPack/: Directory containing source files for the ZenPack.
README: Description of the files on the CD.
