
RDS Technical Training - All about the N-series RLM and BMC Modules

Norman Bogard Americas N series ATS

Steve Lawler NetApp Technical Marketing Engineer

Agenda
What are the RLM and BMC modules?
Differences between the RLM and the BMC
Configuration
Communications mechanisms
Inter-controller dialogue
Hardware-assisted takeover

2009 NetApp. All rights reserved.

What are the RLM and BMC modules?


Enables remote management of storage system irrespective of controller state
Thorough, flexible, simple to operate
Make appliances more robust

Reduce total cost of ownership

Centralized administration
Allow easier deployment and administration of appliances in remote locations

Enterprise customers expect remote platform management

Differences between RLM and BMC

BMC (Baseboard Management Controller):


Incorporated on N3000 series controllers

RLM (Remote LAN Module):


Incorporated on N5000/6000/7000 series controllers

Functionally equivalent
User access process varies slightly

Benefits and Solution Design


Robust remote platform management solution:

Remote console access
TCP/IP over Ethernet
SSH for secure connectivity
Hardware integration on current controllers
No additional hardware or connectivity required
Leverages existing data center infrastructure
Eliminates separate remote support infrastructure

Topology Enterprise Data Center


[Figure: topology diagram. Labels: Customer Data Center, RLM, BMC, Private Mgmt LAN, Gateway, Customer LAN, Firewall, The Internet, Firewall, Support (CS Tools and DB), Operations Manager; links labelled SSL and SSH (remote CLI/console access).]

Features
Secure network interface
Console pass-through
Remote power cycle
Down filer notification
Remote diagnosis of failures
Remote reset
Remotely initiate coredumps
Capture console logs
Access to HW event logs
ZAPI interface for DFM/RAM
SNMP
Remote GDB
Platform independent
SW extensibility

Remote Platform Management


Remote Platform Management built-into the Appliance

Remote power control
Remote console access
Data ONTAP CLI, firmware and system diagnostics
Secure network interface (SSH)
Call home - down filer notifications
Initiate core-dump (CPU NMI Interrupt)
Access to system logs from a down appliance
Non-volatile HW system event logs
Captured console logs
Software events

External LAN Interface


TCP/IP connection over a physical layer
10/100Mb Ethernet
Dedicated LAN port for RLM/BMC
Allows management LAN for physical security
Secure connection to clients
SSH protocol
UserIDs, passwords, keys etc. managed through Data ONTAP
Logging and Auditing
Multiple services
Appliance console redirection
RLM/BMC CLI
GDB over Ethernet
ZAPI
Alerts
SMTP, SNMP

Data ONTAP and Controller Integration


Management through Data ONTAP
Install and Configuration
Firmware Update
Provides direct access to hardware
Works even when controller is off, hung or inoperative

Customer Interface using SSH


Multiple ports and SSH services
Appliance console redirection
GDB connection to appliance
RLM CLI

Extensible, field upgradeable SW architecture


Integration with NetApp support model


Configuration
What information is needed?
Decide if DHCP or static addressing will be used
DHCP
Tie MAC address in DHCP server
MAC address from FRU
MAC address from toaster> rlm status

Static IP address
IP address
Netmask of network
Gateway (GW) of network

Mailhost address
SMTP (email server) used by RLM to send ASUPs
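
Before entering static values, it can help to confirm that the gateway actually lies in the subnet defined by the IP address and netmask. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `check_static_config` is hypothetical, and the sample values are the ones shown in the `rlm status` example in this deck.

```python
import ipaddress

def check_static_config(ip: str, netmask: str, gateway: str) -> bool:
    """Return True if the gateway lies in the subnet defined by ip/netmask."""
    # strict=False accepts a host address instead of the network address
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return ipaddress.ip_address(gateway) in network

# Values taken from the rlm status example slide
print(check_static_config("172.22.136.64", "255.255.224.0", "172.22.128.1"))
```

This only checks address arithmetic; it says nothing about whether the gateway is reachable from the management LAN.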

Configuring the RLM using Data ONTAP


There are 3 ways to configure an RLM using Data ONTAP:
Initial appliance setup
Zeros the appliance's file system and sets up the appliance, including the RLM

toaster> setup
Reconfigures the appliance and RLM without zeroing the file system

toaster> rlm setup
Configures only the RLM

RLM Testing Autosupport


To test the RLM's AutoSupport:
toaster> rlm test autosupport

Provided that AutoSupport has been properly configured, you should soon receive the RLM's ASUP message.

RLM Updating RLM Firmware


RLM on NOW site
http://now.netapp.com/NOW/download/tools/rlm_fw/

Latest firmware and instructions on NOW site
Changes to update instructions posted on NOW site as relevant

RLM firmware can be updated in 2 ways
Data ONTAP CLI
RLM CLI

RLM Firmware Update using ONTAP


Use software command to get RLM firmware (RLM_FW.zip)
toaster> software install http://webserver/path/RLM_FW.zip -f

Update the RLM


toaster> rlm update


RLM Firmware Update from RLM CLI


Install RLM firmware image (RLM_FW.tar.gz)
RLM toaster> rlm update http://web_server_ip_address/path/RLM_FW.tar.gz

web_server_ip_address is the IP address of the web server on a network accessible to your appliance

How To Connect To The Module?


Must access CLI securely
Why? The network or Internet is between customer and filer
SSH only: Telnet not supported
Telnet disabled by default in Data ONTAP 8

Users in group Administrators allowed access to RLM
For security, logging in as root not allowed at RLM
Log in as user naroot on the RLM when using root credentials (password)
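
The login rule above (root credentials are presented under the name naroot) can be sketched as a small helper that builds the SSH command line. This is an illustration only: the function names are hypothetical, and the address is the sample RLM IP address from the `rlm status` slide.

```python
def rlm_login_name(user: str) -> str:
    """Map a Data ONTAP account name to the name used at the RLM login."""
    # Logging in as root is not allowed at the RLM; naroot is used instead
    return "naroot" if user == "root" else user

def ssh_command(user: str, rlm_host: str) -> list[str]:
    """Build the argument list for an SSH session to the RLM."""
    return ["ssh", f"{rlm_login_name(user)}@{rlm_host}"]

print(ssh_command("root", "172.22.136.64"))
```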

RLM Commands
RLM toaster> ?
date exit events help priv rlm system version

RLM toaster> system
system console - connect to the system console
system core - dump the system core and reset
system log - print system console logs
system power - commands controlling system power
system reset - reset the system using the selected firmware

RLM toaster> rlm
rlm reboot - reboot the RLM
rlm sensors - print RLM environmental sensors status
rlm status - print RLM status
rlm update - update RLM firmware

BMC Commands
help - Display a list of BMC commands.
reboot - Force the BMC to reboot itself and perform a self-test. If your console connection is through the BMC, it will be dropped.
setup - Interactively configure the BMC local-area network (LAN) settings.
status - Display the current status of the BMC.
test autosupport - Test the BMC AutoSupport by commanding the BMC to send a test AutoSupport to all AutoSupport email addresses in the option lists autosupport.to, autosupport.noteto, and autosupport.support.to.

Troubleshooting Scenario #1 - System Down / Hung / Reboot Loop

RLM will have sent ASUP. Console logs:


RLM toaster> system log

**************************************************
* Log Starts *
**************************************************
Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13
CPU= AMD Opteron(tm) Processor 852 X 4
Testing RAM.
512MB RAM tested
32768MB RAM installed
Fixed Disk 0: SMART ATA Flash Disk
New event log messages, please check the event log
ERROR 0251: System CMOS checksum bad - Default configuration used

Boot Loader version 1.0X5
Copyright (C) 2000,2001,2002,2003 Broadcom Corporation.
Portions Copyright (C) 2002-2005 Network Appliance Inc.
CPU Type: AMD Opteron(tm) Processor 852
BIOS POST Failure(s) detected. Abort AUTOBOOT
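
When a captured console log is long, the firmware error lines are the first thing to look for. The sketch below scans saved `system log` output for lines of the form `ERROR nnnn: message`, as seen in the log above. It assumes the log has been saved as plain text with one message per line; the function name `find_post_errors` is hypothetical.

```python
import re

# Matches lines such as "ERROR 0251: System CMOS checksum bad - ..."
ERROR_LINE = re.compile(r"^ERROR (\d{4}): (.+)$")

def find_post_errors(log_text: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs for every ERROR line in a console log."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append((match.group(1), match.group(2)))
    return errors

sample = """\
BIOS Version: 1.0X13
ERROR 0251: System CMOS checksum bad - Default configuration used
BIOS POST Failure(s) detected. Abort AUTOBOOT
"""
print(find_post_errors(sample))
```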

Troubleshooting Scenario #2 - Obtain Console Access

RLM provides remote controller console login


RLM toaster> system console
Type Ctrl-D to exit.
LOADER> version
Variable Name        Value
-------------------- --------------------------------------------------
BIOS_VERSION         1.0X13
LOADER_VERSION       1.0X5

LOADER> boot_ontap
Loader:elf64 Filesys:fat Dev:ide0.0 File:X86_64/kernel/primary.krn Options:(null)
Loading: 0x200000/40125488 0x2844430/42433840 0x50bc160/1929773 0x529338d/3 Entry at 0x00202008
Starting program at 0x00202008
[...]
toaster> sysconfig -v
[...]

Troubleshooting Scenario #3 - Power Cycle

On the RLM console


RLM toaster> system power cycle
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

toaster>
Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13
CPU= AMD Opteron(tm) Processor 852 X 4
Testing RAM.
512MB RAM tested
32768MB RAM installed
Fixed Disk 0: SMART ATA Flash Disk

Troubleshooting Scenario #4 - Corrupt Motherboard Firmware

On the RLM console


RLM toaster> system reset backup
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

LOADER> update_flash
** DO NOT TURN OFF YOUR MACHINE UNTIL THE FLASH UPDATE COMPLETES!! **
Programming... [accidentally power off the machine here]
Phoenix TrustedCore(tm) Server
Copyright 1985-2005 Phoenix Technologies Ltd. All Rights Reserved
Portions Copyright (c) 2005 Network Appliance, Inc. All Rights Reserved
BIOS Version: 1.0X13

Troubleshooting Scenario #5 - Remotely Generate a Core

On the RLM console


RLM toaster> system core
This will cause a dirty shutdown of your appliance. Continue? [y/n] y

On the controller console

toaster> PANIC: RLM NMI .. dumping core! in process idle_thread2 on release NetApp Release mainN_051023_2300 on Wed Oct 26 18:37:38 GMT 2005
version: NetApp Release mainN_051023_2300: Mon Oct 24 04:28:07 PDT 2005
cc flags: 2
DUMPCORE: START
Dumping to disks: 0d.55 0d.48
....................................

Troubleshooting Scenario #6 - System Down with HW/FW Problem


events command
RLM toaster> events all
Record 1: [...]
Record 89: Wed Oct 26 19:46:39 2005 [Agent Event.normal]: FIFO 0x4042 Agent Excelsior, PCIE_RESET deasserted.
Record 90: Wed Oct 26 19:46:39 2005 [Agent Event.normal]: FIFO 0x4043 Agent Excelsior, FC_RESET deasserted.
Record 91: Wed Oct 26 19:46:57 2005 [Excelsior BIOS.warning]: POST error 0x0051: ERR_CMOS_CHECKSUM

system sensors command


RLM toaster> priv set advanced
RLM toaster*> system sensors
Sensor  Sensor      Sensor  Current
ID      Name        State   Value
======  ==========  ======  =======
0x001   POW1_FAIL   good    D
0x002   POW2_FAIL   good    D
0x003   P0_THRMTRP  BAD     A
0x004   P1_THRMTRP  good    D
0x005   P2_THRMTRP  good    D
...
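
With many sensors, the rows that matter are the ones whose state is not good. The sketch below picks those out of saved `system sensors` output. It assumes the whitespace-separated column order shown on this slide (ID, name, state, current value); the function name `bad_sensors` is hypothetical.

```python
def bad_sensors(table_text: str) -> list[str]:
    """Return the names of sensors whose state column is not 'good'."""
    names = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows start with a hexadecimal sensor ID such as 0x003
        if len(fields) >= 3 and fields[0].startswith("0x"):
            sensor_name, state = fields[1], fields[2]
            if state.lower() != "good":
                names.append(sensor_name)
    return names

sample = """\
0x001 POW1_FAIL good D
0x002 POW2_FAIL good D
0x003 P0_THRMTRP BAD A
0x004 P1_THRMTRP good D
"""
print(bad_sensors(sample))
```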

Summary: Using the RLM/BMC CLI for troubleshooting


If you need controller console access
RLM toaster> system console

If you need controller console log
RLM toaster> system log

If controller is hanging / unresponsive
RLM toaster> system core
RLM toaster> system reset
RLM toaster> system power cycle
RLM toaster> priv set diag
RLM toaster*> system debug_port

If FW can't boot
RLM toaster> system reset backup

Find out why controller is misbehaving
RLM toaster> events all
RLM toaster> priv set advanced
RLM toaster*> system sensors

RLM Status
RLM status can be obtained in two ways from the Data ONTAP console:
toaster> rlm status
Shows only RLM information.

toaster> sysconfig
Shows appliance and RLM status.

Example - RLM status

Output from rlm status


toaster> rlm status
Remote LAN Module           Status: Online
        Part Number:        110-00030
        Revision:           B0
        Serial Number:      304926
        Firmware Version:   1.2
        Mgmt MAC Address:   00:A0:98:01:9A:86
        Using DHCP:         no
        IP Address:         172.22.136.64
        Netmask:            255.255.224.0
        Gateway:            172.22.128.1
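
For scripted checks, the `rlm status` output can be turned into a dictionary. This is a minimal sketch that assumes the "label: value" layout shown on this slide; the function name `parse_rlm_status` is hypothetical and this is not a NetApp-provided parser.

```python
import re

def parse_rlm_status(output: str) -> dict[str, str]:
    """Parse 'Label: value' lines from rlm status output into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first ": " so MAC addresses keep their colons
        key, sep, value = line.strip().partition(": ")
        if sep:
            # "Remote LAN Module   Status" carries a heading; keep the label
            key = re.split(r"\s{2,}", key)[-1]
            fields[key] = value.strip()
    return fields

sample = """\
Remote LAN Module           Status: Online
        Part Number:        110-00030
        Mgmt MAC Address:   00:A0:98:01:9A:86
        Using DHCP:         no
        IP Address:         172.22.136.64
"""
status = parse_rlm_status(sample)
print(status["Status"], status["IP Address"])
```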

RLM status (via sysconfig)


sysconfig will not show RLM IP address information unless the option autosupport.content is set to complete

Site-specific information for the RLM in sysconfig stays in line with current AutoSupport policies.

RLM - System Console Access (Redirection)


RLM toaster>
RLM toaster> system console
Type Ctrl-D to exit.
Password:
Thu Nov 10 06:11:45 GMT [rlm_console_login_m:info]: root logged in from RLM
toaster*> (Ctrl-D)
RLM toaster>

EMS Error Messages for RLM errors


Data ONTAP generates EMS messages for RLM errors
Hourly status monitoring of RLM fails
Mailhost not set up correctly for AutoSupport
Network configuration of RLM failed
Firmware update errors
Heartbeat from RLM
Stopped
Resumed
Booted from backup
Data ONTAP RLM communication errors
Errors sending userid/password information to RLM

Error Messages and Troubleshooting Guide
Describes RLM EMS error messages
Provides corrective actions

RLM generated Down-Controller ASUPs


RLM continuously monitors the System Health
Firmware POST Errors
Boot failures
Heartbeat from Data ONTAP
Data ONTAP abnormal reboots
Watchdog resets
Hardware errors
User initiated reboots/power-cycles/NMI

When the system goes down or fails to boot


RLM generates Down-Controller AutoSupport email

Remote Support Diagnostics Tool


[Figure: Remote Support Diagnostics Tool topology. Labels: Customer, Firewall, Internet (HTTPS), Firewall, IBM/NetApp Support, Remote Support Customer Data Repository.]

RLM v3.0
Secured access model
Nondisruptive upgrade
Functional even when appliance is down

Appliance down notification
Optimized CORE handling
Remote data collection
Trigger AutoSupport on-demand

Hardware Assisted Takeover


Slow Node Failure Detection: Results in Long Takeover Time
HA systems take over partner workload after failure detection
Legacy failure detection is a slow process
Partners in a cluster use a heartbeat mechanism to determine failover
This heartbeat over the IB link is SW driven
Partner waits up to 15 seconds to avoid premature takeovers due to Data ONTAP scheduling issues
cf.takeover.detection.seconds
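
The legacy software heartbeat check can be pictured as a simple timeout test. The sketch below is an illustrative model only, not Data ONTAP code: the function name and the use of plain timestamps in seconds are assumptions, and the 15-second window is the figure quoted on this slide.

```python
DETECTION_SECONDS = 15.0  # the "up to 15 seconds" wait quoted on the slide

def partner_presumed_failed(last_heartbeat: float, now: float,
                            detection_seconds: float = DETECTION_SECONDS) -> bool:
    """True once no heartbeat has arrived for the whole detection window."""
    return (now - last_heartbeat) >= detection_seconds

# A heartbeat seen at t=100 s: still within the window at t=110 s,
# presumed failed at t=115 s
print(partner_presumed_failed(100.0, 110.0), partner_presumed_failed(100.0, 115.0))
```

The long window is what makes this path slow: a partner that dies immediately after a heartbeat is not declared failed until the full window has elapsed.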

Hardware Assisted Takeover


Key Features

Predictable failure detection time
Platform independent configuration
Secure alerting mechanism
Prevent replay attacks

Native diagnosability
Continuous runtime diagnosis

Customer-initiated test mechanisms
Leverage existing infrastructure
Based on standard SNMP v1 Traps

Hardware Assisted Takeover Using Out-of-Band Hardware Alerting Mechanism

Out-of-band hardware-based failure detection


Predictable failure detection time for a class of failures
Detection time reduced from 15 seconds to less than 3 seconds (RLM detection and reporting takes ~20ms)

Leverageable across product portfolio


Based on standard SNMP Traps

Does not replace the existing HA mechanism


Optimization for hardware-assist detected failures

Hardware Assisted Takeover: Separate Storage Controllers


[Figure: two separate storage controllers. Labels: Controller 1 and Controller 2, each with Data ONTAP, Gig-E and an hwassist Enet port; InfiniBand Interconnect; Network.]

Hardware Assisted Takeover Benefits


First introduced with release of Data ONTAP 7.3
Speeds takeover in the event of:
Abnormal system reboot (aka panic)
System reset due to watchdog timeout
System power off, power cycle, or reset of the partner
System POST error during boot
Complete loss of power to the partner
Environmental shutdown conditions

Takeover not expedited when:
Operator-initiated halt of the partner - already at minimum latency via cluster interconnect
'Busy-Hung' of partner, where it continues to service its watchdog

Hardware Assisted Takeover: Data ONTAP Commands and Options


The following customer-visible options are supported
1. options cf.hw_assist.enable
To enable/disable hwassist on partner
2. options cf.hw_assist.partner.address
To configure partner IP address on which alerts will be sent by RLM.
3. options cf.hw_assist.partner.port
To configure partner UDP port on which alerts will be sent by RLM.

The following hidden options are supported
1. options cf.hw_assist.health_check_interval
Interval in secs to send periodic keep alive alerts.
2. options cf.hw_assist.retry_count
Number of times each hardware assist alert is sent.
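
Before setting cf.hw_assist.partner.address and cf.hw_assist.partner.port, the intended values can be sanity-checked offline. This is a minimal sketch using Python's standard `ipaddress` module; the function name `validate_hw_assist` is hypothetical, and the port number in the example is an arbitrary illustration, not a documented default.

```python
import ipaddress

def validate_hw_assist(address: str, port: int) -> list[str]:
    """Return a list of problems with the proposed partner address and port."""
    problems = []
    try:
        ipaddress.ip_address(address)
    except ValueError:
        problems.append(f"invalid partner address: {address!r}")
    # UDP port numbers are 16-bit; 0 is not usable as a destination port
    if not 1 <= port <= 65535:
        problems.append(f"UDP port out of range: {port}")
    return problems

# 4444 is an arbitrary example port, not a documented default
print(validate_hw_assist("172.22.136.196", 4444))
```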

Hardware Assisted Takeover: Data ONTAP Commands and Options


The following commands are supported in advanced mode
1. cf hw_assist status
Command is used to get latest status of hwassist feature. If hwassist is active it will print port and IP address on which hwassist is listening for traps. If hwassist is inactive it will print the reason with a possible solution.
2. cf hw_assist test
Command is used to test send/recv path of hwassist alerts between clustered filers.
3. cf hw_assist stats
Command will print detailed information of all hwassist alerts received by the filer.
4. cf hw_assist stats clear
Command will clear information of all hwassist alerts received by the filer.

The following commands are supported in test mode
5. cf test_hw_assist get ss
Command will print current shared secret for local as well as partner node.
6. cf test_hw_assist update ss
Command will update shared secret for local node.

Hardware Assisted Takeover


Single Management Port; Integrated HWAssist
[Figure: two controllers joined by a backplane. Labels per controller: RJ-45 port, Enet switch, RLM, Agent, SIO, 10/100 Enet, Data ONTAP, Gig-E ports.]

Hardware Assisted Takeover


Alerting Mechanism using SNMP Traps
UDP message formatted as SNMP trap
Multiple trap messages based on configuration settings
Resumes working in the event of reboot of the hwassist during the uptime of the filer
Backward compatible with Data ONTAP kernels that do not support this feature
Extensible data format for future improvements
On an N6xxx system, the IP address specified in the cf.hw_assist.partner.address option should specify the partner's e0M interface. (The e0M interface is dedicated to Data ONTAP management activities.)

Hardware Assisted Takeover


Basic Design Flow of Events
RLM/BMC on Failed Controller (Downfiler)
Detects a failure event at its monitored controller
Triggers an alert to partner controller (Data ONTAP)
Alert message identifies cause of failure
Alert message sent via UDP (in SNMP Trap format)

CFO software on partner controller
Receives RLM/BMC alert (UDP packet in SNMP Trap format)
Applies policy to received alert
Initiates takeover if warranted

Estimated failure detection time savings
RLM/BMC: ~20ms to detect event and send alert
CFO: <1 to 3 seconds to process RLM/BMC alert
Detection time reduced by >10 sec for RLM/BMC detected failures
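
The quoted saving follows from the figures in this deck: a legacy wait of up to 15 seconds, against roughly 20 ms for the RLM/BMC plus at most 3 seconds of CFO processing. The short sketch below works through that arithmetic. The constant and function names are hypothetical, and the result is a back-of-the-envelope estimate rather than a measured value.

```python
LEGACY_DETECTION_S = 15.0        # legacy heartbeat wait quoted in this deck
RLM_DETECT_AND_SEND_S = 0.020    # ~20 ms for the RLM/BMC to detect and alert
CFO_PROCESSING_WORST_S = 3.0     # upper end of the "<1 to 3 seconds" range

def worst_case_saving() -> float:
    """Seconds saved versus the legacy wait, using the slowest hwassist path."""
    return LEGACY_DETECTION_S - (RLM_DETECT_AND_SEND_S + CFO_PROCESSING_WORST_S)

print(f"{worst_case_saving():.2f} s saved in the worst case")
```

Even with the slowest CFO processing time, the estimate stays above the ">10 sec" figure quoted on the slide.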

Example of SNMP v1 Trap:


2006-06-23 11:02:16 or-196-rlm.lab.netapp.com [172.22.136.196] (via 172.22.136.196) TRAP, SNMP v1, community public
iso.3.6.1.4.1.789 Enterprise Specific Trap (536) Uptime: 0:00:01.90
iso.3.6.1.4.1.789.1.1.12.0 = STRING: Remote Management Event: type=system_down, severity=notice, event=power_cycle_via_rlm, ss=ABCDE56789, system_id=0118044518
iso.3.6.1.4.1.789.1.1.9.0 = STRING: "12345678"

Where:
iso.3.6.1.4.1.789: the NetApp enterprise OID.
iso.3.6.1.4.1.789.1.1.12.0: OID used for the variable field that will contain the trap-specific info.
iso.3.6.1.4.1.789.1.1.9.0: OID with product serial
type: type of event, i.e. system_down, system_up, keep_alive, test
severity: alert, warning, notice, normal, info, or debug
event: post_error, abnormal_reboot, l2_watchdog_timeout, etc.
ss: shared secret key (will be 0's if we have no key; there will be no key for periodic and test types)
system_id: system id of the system from which the trap is sent
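
The trap-specific string is a comma-separated list of key=value pairs, so it is easy to take apart when inspecting captured traps. The sketch below parses that string and checks whether a shared secret is present, following the "0's if we have no key" rule above. The function names are hypothetical, and this is not how Data ONTAP itself processes the alert.

```python
def parse_event(payload: str) -> dict[str, str]:
    """Split a 'Remote Management Event: k=v, k=v' string into a dict."""
    prefix = "Remote Management Event:"
    if not payload.startswith(prefix):
        raise ValueError("not a Remote Management Event string")
    fields = {}
    for item in payload[len(prefix):].split(","):
        key, sep, value = item.strip().partition("=")
        if sep:
            fields[key] = value
    return fields

def has_shared_secret(fields: dict[str, str]) -> bool:
    """An all-zero (or missing) ss field means no key was sent."""
    ss = fields.get("ss", "")
    return bool(ss) and set(ss) != {"0"}

payload = ("Remote Management Event: type=system_down, severity=notice, "
           "event=power_cycle_via_rlm, ss=ABCDE56789, system_id=0118044518")
event = parse_event(payload)
print(event["type"], event["event"], has_shared_secret(event))
```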

Hardware Assisted Takeover


Types of Failures Detected

Loss of power
Level 2 Watchdog Timer Reset
System POST Failures
Firmware POST fatal errors
Boot media corruption

Operator Initiated system down events
Power cycle, power down or reset

Boot Timeout
Abnormal Reboots including Panics
Data ONTAP RLM heartbeat timeouts

Thank You!
