
Slide 1

Oracle

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.


Proprietary and Confidential
2008
Oracle Corporation Proprietary and Confidential

Slide 2

Sun x86 Servers Troubleshooting Tools and FRU/CRU Replacement Course

Welcome to the x86 Servers Troubleshooting Tools and FRU/CRU Replacement course.

Slide 3

Course Objectives
By the end of this course, you should be able to:
  Interpret system indicators
  Locate data gathering tools used to troubleshoot x86 system issues
  Locate and describe diagnostic troubleshooting tools supported on x86 systems
  Describe FRU/CRU replacement procedures

By the end of this course, you should be able to interpret system indicators, locate and describe
tools used to gather troubleshooting data, locate and describe diagnostic tools, and describe
FRU/CRU replacement procedures for x86 systems.

Slide 4

Before You Begin


You have the option to Test Out of this course at any time.
So, let's get started.

As a reminder, you have the option to Test Out of this course at any time. So, let's get started.

Slide 5

Troubleshooting Tools

System Indicators (LEDs)
Data Gathering Troubleshooting Tools
Diagnostic Troubleshooting Tools
Storage Diagnostic Tools

There are many ways to gather data on x86 servers for troubleshooting purposes. In this course we will
discuss a few of these methods, including system indicators, data gathering tools, and system and
storage diagnostic tools. We'll start by looking at the x86 system indicators.

Slide 6

Interpreting System Indicators


Power LED (green):
  OFF = Server not powered
  BLINK = Standby Power Mode
  Fast BLINK = Service Processor initializing
  Slow BLINK = Server Host is initializing
  ON = OS Booted
Service Action Required LED (amber):
  ON = Hardware failure
OK-to-Remove LED (blue):
  ON = Component ready for removal

System indicators, or LEDs, are a good place to start when troubleshooting a hardware issue. The
x86 servers typically have three system LEDs to indicate the state of the platform. The Power LED
is a green LED with three basic states. When the LED is off, the
server is not connected to AC power. When the LED is blinking, the server is in Standby Power
mode. This means that the server is connected to AC power, but the host is not powered.
Depending on the server, you may have two blinking states, the fast blink and the slow blink
states. The fast blink lasts for two minutes after AC power is applied to the server, during the
time that the SP is initializing. The slow blink occurs when power is applied to the server host,
during the time that the server host is booting. To determine which blink states are supported by
a specific server, refer to that server's service manual. When the Power LED is steady on, this
indicates that the server is connected to AC power and its host is powered. The Service Action
Required LED is amber. This LED turns steady on when any hardware failure occurs within
the server or in any of its components. A corresponding component Service Action Required
LED may light up on the specific server component that has failed. The OK-to-Remove LED is
blue. This LED turns steady on to indicate when it is safe to remove a failed component for
replacement.
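
The LED states on this slide can be captured as a small lookup table, which may be handy when a customer reads the indicators to you over the phone. This is only a sketch: the dictionary layout and the function name interpret_led are illustrative assumptions, and the meanings are copied from the slide rather than from any Oracle tool.

```python
# Lookup table transcribed from the slide. The layout and the function
# name are illustrative assumptions, not part of any Oracle utility.
SYSTEM_LED_STATES = {
    ("power", "off"): "Server not powered",
    ("power", "blink"): "Standby Power Mode",
    ("power", "fast blink"): "Service Processor initializing",
    ("power", "slow blink"): "Server Host is initializing",
    ("power", "on"): "OS Booted",
    ("service action required", "on"): "Hardware failure",
    ("ok-to-remove", "on"): "Component ready for removal",
}


def interpret_led(led: str, state: str) -> str:
    """Return the slide's meaning for an LED/state pair, or a fallback."""
    key = (led.strip().lower(), state.strip().lower())
    return SYSTEM_LED_STATES.get(
        key, "Not listed; check the server's service manual"
    )
```

Because blink states differ between servers, the fallback deliberately points back to the service manual instead of guessing.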

Slide 7

Interpreting Locator, CPU, and DIMM Indicators


Locator LED (white):
  Programmed to flash to help onsite technicians find the server
CPU and DIMM LEDs (amber):
  Push the Fault Remind button to light the component fault LED
[Figure callout: Fault Remind Button]

White locator LEDs may be present on some server models. These can be programmed to flash to assist
onsite technicians in finding a specific server among a number of servers.
Amber LEDs are found on certain components internal to the server. Each CPU and each DIMM
has its own amber LED. To provide capacitor power to the CPU and DIMM fault LEDs, press
the Fault Remind button. The LED of the faulty component will then light.
Since the amber component LEDs are powered by a capacitor, they will only have enough power
to light up a couple of times, and only for a short time. Due to this limitation, be prepared to
locate the faulty component quickly, as you may only get results from pressing the Fault Remind
button a couple of times. This action needs to be completed within a set time period, which is
specific to each server.

Slide 8

Interpreting Disk and Ethernet Indicators


Disk LEDs:
  OK Power (green)
  Service Action Required (amber)
  Ready-to-Remove (blue)
Ethernet Port LEDs:
  Link/Activity (green)
  Speed (green = 1 Gbit/sec, amber = 100 Mbit/sec, off = 10 Mbit/sec)

Hard disk drives have three LEDs. The green OK Power LED will light when the disk is
powered and will flash according to the disk's activity. The amber Service Action Required LED,
again, is the hardware fault indicator. The blue Ready-to-Remove LED is steady on when the
disk is ready to be removed.
Ethernet ports have two LEDs. The green Link/Activity LED is located on the left of the port and
it flashes according to port activity. The Speed LED is located on the right of the port and its
color indicates the speed the port is configured for. Green indicates 1 gigabit per second,
amber indicates 100 megabits per second, and off indicates a 10 megabits per second speed
configuration.

Slide 9

Interpreting Power Supply and Fan Indicators


Power Supply LEDs:
  AC (green)
  DC (green)
  Service Action Required (amber)
  Ready-to-Remove (blue) [optional]
Fan LED:
  OK Power (green)
  Service Action Required (amber)

x86 Platform power supplies support two or three LEDs. These LEDs may include: a green AC
LED, a green DC LED, an amber Service Action Required LED and/or a blue Ready-to-Remove
LED. A lighted AC LED indicates that AC power is present at the power supply, while the DC
LED indicates that the power supply is generating DC power when it is on.
The fan modules support one or two LEDs. A lighted green OK power LED indicates that the fan
module is powered. The amber Service Action Required LED will light when the fan module
encounters a hardware failure.
Now that we have a good understanding of the platform and component level LEDs, let's look at
some other data gathering tools that can be used to troubleshoot x86 issues.

Slide 10

Data Gathering Troubleshooting Tools


Tools | Resident on | Description
System Event Logs | ILOM, BIOS, IPMI, Solaris, Linux and Windows | Logs of system events categorized along with a timestamp, severity and error message.
faultmgmt and ipmitool | ILOM and IPMI | Displays sensor and indicator information
prtdiag, prtconf, sysconfig | Solaris | Displays system configuration
ifconfig, netstat | Solaris and Linux | Displays and configures network configuration
ipconfig, netstat | Windows | Displays and configures the network configuration
snapshot | ILOM | This utility enables you to produce a snapshot of the server SP at any instant in time.
Explorer | Solaris | A Solaris specific data gathering tool for collection of the more advanced feature information and its current status.
MPS Report | Windows | Microsoft Product Support Reports (Data Collection Tool) http://www.microsoft.com/download/en/details.aspx?id=24745
sosreport | RedHat, Oracle Enterprise Linux (OEL) | Data Collection Tool (built into OS)
supportconfig | SuSE | Data Collection Tool (built into OS)

In this next section of the course we will discuss data gathering tools to use for troubleshooting.
The x86 Platform data gathering troubleshooting tools allow the user to view system status and
configuration. The tools to view system status are listed in the table on the slide, along with
where they reside and their function. We will look at each of these tools in more detail on the
next few slides.

Slide 11

OS System Event Logs


A system error may be displayed to system console and recorded in the OS system error log
OS System Logs
Solaris and Linux System Logs
  # vi /var/adm/messages      (Solaris)
  # view /var/log/messages    (Linux)
Windows System Logs (using Event Viewer)
  Start -> Control Panel -> Administrative Tools -> Computer Management

The customer's first indication of an error may be displayed to the system console and will be
recorded in the OS system error log. To access these logs through Solaris and Linux, use the vi
or view command to display the contents of the messages files. The messages files are generated
and updated by the syslog facility of Solaris and Linux.
For Windows, navigate to the Computer Management screen, which gives you access to the
Event Viewer that displays the system log among other screens.
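
Paging through a large messages file with vi or view can be slow, so a short filter may help narrow it down first. The sketch below is an assumption-laden illustration: the keyword list and both function names are hypothetical, and it simply matches substrings rather than understanding syslog severities.

```python
from typing import Iterable, List

# Hypothetical keyword list; tune it for the messages you care about.
DEFAULT_KEYWORDS = ("error", "warning", "fail", "panic")


def find_suspect_lines(lines: Iterable[str],
                       keywords: Iterable[str] = DEFAULT_KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [line.rstrip("\n") for line in lines
            if any(k in line.lower() for k in lowered)]


def scan_messages_file(path: str) -> List[str]:
    """Scan a messages file such as /var/adm/messages or /var/log/messages."""
    with open(path, "r", errors="replace") as handle:
        return find_suspect_lines(handle)
```

For example, scan_messages_file("/var/adm/messages") would be the Solaris form and scan_messages_file("/var/log/messages") the Linux form, run with enough privilege to read the file.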

Slide 12

ILOM, BIOS and IPMI System Event Logs


ILOM SEL (System Event Logs)
  CLI: -> show /SP/logs/event/list
  BUI: System Monitoring -> Event Log
BIOS DMI (Desktop Management Interface) / SEL
  BUI: Advanced -> Event Logging
IPMI SEL
  CLI: # ipmitool -U root -H <HOST> sel elist
Log Fields
  Timestamp
  Severity
  Description
  Device

System event logs are also supported within ILOM, BIOS and IPMI. ILOM's system event logs
are available through its CLI and BUI interfaces. The CLI command line is displayed here along
with the path to the BUI screen.
BIOS has a system event log that can be accessed by using the navigation path displayed here to
open the Event Logging screen. IPMI, which is the closest management tool to the hardware, has
a system event log that is accessible through the IPMItool. This log is a subset of the events
posted within the BIOS system event log.
Logs can give you an ordered list of events that led up to the hardware problem by using the
timestamp field to order the entries. The severity field indicates the severity of the reported
event. The description field can pinpoint the source of the problem, or it can give enough
information about the problem to start the FRU isolation process. The device field of a log is
useful to determine the log entries that are related to the same device.
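
The way the four log fields are used can be sketched in a few lines: order by timestamp to see what led up to the fault, then group by device to see related entries together. The LogEntry class and both function names are hypothetical, and the field types are an assumption since the real logs are plain text.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List


@dataclass
class LogEntry:
    # Field names mirror the slide; the types are an assumption.
    timestamp: datetime
    severity: str
    description: str
    device: str


def order_by_time(entries: Iterable[LogEntry]) -> List[LogEntry]:
    """Oldest first, so the events leading up to a fault read top to bottom."""
    return sorted(entries, key=lambda e: e.timestamp)


def group_by_device(entries: Iterable[LogEntry]) -> Dict[str, List[LogEntry]]:
    """Collect time-ordered entries under the device they refer to."""
    grouped: Dict[str, List[LogEntry]] = defaultdict(list)
    for entry in order_by_time(entries):
        grouped[entry.device].append(entry)
    return dict(grouped)
```

Parsing the actual ILOM, BIOS or IPMI output into LogEntry objects is left out, since the exact text layout varies by tool and firmware version.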

Slide 13

ILOM and IPMItool LED Information


ILOM CLI: -> show /SP/faultmgmt
ILOM BUI: System Monitoring -> Fault Management
IPMItool shell/command prompt
  # ipmitool -U root -H <HOST> sdr list (sensors)
  On some platforms the command is:
  # ipmitool -U root -H <HOST> led get all (indicators)
  But this command generates errors on some platforms (e.g. X4440)
  and you need to use the following command instead:
  # ipmitool -U root -H <HOST> sbled get all (indicators)
IPMItool is available from the server's Tools and Driver CD/DVD or within MOS
  MOS/ISP -> Patches & Updates -> Patch Search -> Product or Family (Advanced) -> Select Platform

Diagnostic Guide

If the customer does not have easy access to the x86 server and therefore cannot visually report
on the status of the LED indicators, there are ways to view this information remotely. The
x86 server indicator and sensor data can be accessed using the ILOM CLI under
/SP/faultmgmt, or within the ILOM BUI under the Fault Management tab. This data can
also be displayed using the IPMItool CLI command that is supported by IPMI. In the IPMItool
examples on the slide, notice that sdr within the first command line corresponds to the sensor
data repository while led within the second command refers to the indicators. Note that Fault
Management is only available on current systems.
The IPMItool software is available from the server's Tools and Driver CD, or from the MOS
location displayed. Manuals on the use of the IPMItool commands are available within the
Diagnostics Guide for a specific server. Click on the link for a diagnostic guide sample. The
presentation will now stop to allow you to access the path and link.
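
If you query several service processors, it can help to build the IPMItool command lines programmatically instead of retyping them. The sketch below only assembles the argument list in the form shown on the slides (ipmitool -U user -H host subcommand); the function name, the dictionary keys, and the idea of a fallback entry are assumptions, and nothing here checks which subcommand a given platform accepts.

```python
from typing import List

# Subcommands exactly as written on the slides. Whether "led" or "sbled"
# works depends on the platform, as the slide notes.
SLIDE_SUBCOMMANDS = {
    "sensors": ["sdr", "list"],
    "indicators": ["led", "get", "all"],
    "indicators_alt": ["sbled", "get", "all"],
    "event_log": ["sel", "elist"],
}


def build_ipmitool_argv(host: str, what: str, user: str = "root") -> List[str]:
    """Build: ipmitool -U <user> -H <host> <subcommand...>."""
    if what not in SLIDE_SUBCOMMANDS:
        raise ValueError(
            f"unknown query {what!r}; expected one of {sorted(SLIDE_SUBCOMMANDS)}"
        )
    return ["ipmitool", "-U", user, "-H", host] + SLIDE_SUBCOMMANDS[what]
```

The returned list can be handed to subprocess.run on a machine where IPMItool is installed; if the "indicators" form reports errors, retrying with "indicators_alt" follows the slide's advice.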

Slide 14

Operating System Utilities


Solaris and Linux Utilities
  prtdiag       (HW configuration and state for Solaris)
  prtconf       (Logical configuration for Solaris)
  ifconfig -a   (Network configuration for Solaris and Linux)
  sysconfig     (HW configuration for Linux)
  netstat -a    (Network config for Solaris, Linux & Windows)
Windows Utilities
  ipconfig      (Network configuration and state for Windows)
  View hardware configuration information with msinfo32
    Start -> Run and type msinfo32 to view the configuration
    File -> Save to save the configuration.

Solaris, Linux, and Windows operating systems also provide utilities to view the current
hardware and network configurations. The utilities are listed along with the operating systems
that support them. Notice that the ifconfig command also gives you the ability to modify the
network configuration. There are no exact Windows equivalents to most of the utilities shown;
however, there is a command line utility called ipconfig. Also, hardware configuration information
can be displayed with msinfo32 by navigating to Start -> Run and typing msinfo32, then selecting
File -> Save to save the configuration.

Slide 15

ILOM Snapshot

-> set /SP/diag/snapshot dataset=data
-> set /SP/diag/snapshot dump_uri=URL

data:
  normal           Collects ILOM, OS and HW information
  full             Collects all information and may reset the server
  normal-logonly   Collects only log files in normal mode
  full-logonly     Collects only log files in full mode

URL:
  Any valid target directory location
  protocol://username:password@host/directory
  Protocols supported: tftp, ftp, sftp, scp, http, or https



ILOM Addendum

Until now we have been gathering individual portions of system data. It would be more efficient
to gather larger portions of data to analyze. The ILOM snapshot utility collects log files, runs
various commands and collects their output from the service processor, then sends the data
collection as a downloaded file to a user-defined location.
To perform an ILOM snapshot, define the data to be collected using the first set command shown
on the slide. The data field can be normal, full, normal-logonly and full-logonly. The variable
you select depends on how much data you want to collect as indicated by their definitions.
The second command sets the location where the data will be sent using the format displayed.
The protocols supported are tftp, ftp, sftp, scp, http, or https. These are the same protocols
that ILOM supports for backing up and restoring the ILOM configuration.
For more information on ILOM snapshot, click on the ILOM Addendum link.
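
The snapshot target uses the format protocol://username:password@host/directory, which is easy to mistype. A minimal sketch of assembling and checking it is shown below; the function names are hypothetical, the dataset descriptions are copied from the slide, and the code does not escape special characters in the user name or password, which a real script would need to consider.

```python
# Values transcribed from the slide; function names are illustrative only.
DATASETS = {
    "normal": "Collects ILOM, OS and HW information",
    "full": "Collects all information and may reset the server",
    "normal-logonly": "Collects only log files in normal mode",
    "full-logonly": "Collects only log files in full mode",
}
PROTOCOLS = ("tftp", "ftp", "sftp", "scp", "http", "https")


def describe_dataset(dataset: str) -> str:
    """Return the slide's description of a snapshot dataset value."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {sorted(DATASETS)}")
    return DATASETS[dataset]


def build_dump_uri(protocol: str, username: str, password: str,
                   host: str, directory: str) -> str:
    """Assemble protocol://username:password@host/directory."""
    if protocol not in PROTOCOLS:
        raise ValueError(f"protocol must be one of {PROTOCOLS}")
    # No URL-escaping is done here (assumption: plain credentials).
    return f"{protocol}://{username}:{password}@{host}/{directory.lstrip('/')}"
```

Checking the dataset name before running the snapshot is worthwhile because the full dataset may reset the server.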

Slide 16

Oracle Explorer Data Collector


Oracle Explorer Data Collector is a diagnostic data collection tool made up of scripts and executables as part of the Services Tools Bundle (STB).
Oracle Explorer Data Collector is a collection of shell scripts that gathers information and creates a detailed snapshot of a system's configuration.
Type of information collected includes
  Information related to drivers and patches
  Recent system event history and log file entries
View Oracle Explorer Data Collector
Install Oracle Explorer Data Collector using the Service Tools Bundle (STB) Installer
  STB User's Guide
  STB Installer

Oracle Explorer Data Collector is a diagnostic data collection tool that is made up of shell scripts
and a few binary executables. Oracle Explorer Data Collector is designed to run on Solaris x86
platforms and is distributed as part of the Services Tools Bundle or STB.
Oracle Explorer Data Collector is a collection of shell scripts that gather information and create a
detailed snapshot of a system's configuration. Information related to drivers, patches, recent
system event history and log file entries is obtained from the Explorer output. For additional
information click on the Oracle Explorer link.
You can install Oracle Explorer Data Collector using the STB installer. The STB software and
documentation can be accessed from the STB link. Refer to the STB documentation for the
installation instructions.
Now that we've discussed the use of Data Gathering Troubleshooting Tools to view system
information, let's Check Your Knowledge.

Slide 18

Diagnostic Troubleshooting Tools


x86 Platform troubleshooting tools include diagnostics to generate more information about a problem

Tools | Resident in | Description
Oracle VTS (SunVTS) | Bootable CD or Solaris | Oracle VTS (SunVTS) is an exerciser that can either be booted from a CD/ISO image or installed directly on a host running the Solaris OS.
POST | BIOS | Power On Self Test that executes when the server is reset or powers on.
U-Boot | SP | SP diagnostics that execute when the SP resets or powers on.
spdiags and hostdiags | ILOM | SP diagnostics that run during SP initialization
Pc-Check | ILOM | Pc-Check diagnostics are integrated into ILOM and also come on the server's Tools and Driver CD/DVD

For tools for specific x86 servers, view the Sun x86 Servers Diagnostic Guide
found in the Sun System Handbook's Related Documentation link.

In the next section of the course, we will review some diagnostic troubleshooting tools that are
compatible with most x86 Platforms. Along with the tool, the table displays where the diagnostic
tool resides and a description of its functionality. For the tools that are compatible with a specific
x86 Server, view the Oracle x86 Servers Diagnostic Guide that can be found in the Sun System
Handbook's Related Documentation link.

Slide 19

Oracle Validation Test Suite (Oracle VTS)


Oracle Validation Test Suite (previously known as SunVTS) is an exerciser that tests and validates Sun hardware and is used for repair verification
Oracle Validation Test Suite runs from:
  Server running Solaris OS
  Bootable CD/DVD
  USB boot image
Oracle Validation Test Suite downloads from:
  Oracle VTS Download

Oracle Validation Test Suite, previously known as SunVTS, is an exerciser that tests and
validates Sun hardware. Oracle Validation Test Suite or Oracle VTS is used to ensure the proper
operation of the overall system under test and its underlying hardware. It stimulates, detects, and
identifies hardware faults and is used for both hardware validation and repair verification.
The Oracle VTS diagnostics can be run from an installed Solaris OS, a USB boot image, or a bootable
CD. The bootable CD allows you to boot a CD resident Solaris OS, which starts a CD resident Oracle VTS
that then tests the server and generates a report.
The minimum Oracle VTS version supported is the one that comes shipped with the server. The
current Oracle VTS version can be found in the server's product notes and can be downloaded
from the link provided on the slide. The SunVTS download link also provides SunVTS versions
for Linux.

Slide 20

Oracle Validation Test Suite (Oracle VTS) (cont.)


Oracle Validation Test Suite:
  CD DVD Test (cddvdtest)
  CPU Test (cputest)
  Cryptographics Test (cryptotest)
  Disk and Diskette Drives Test (disktest)
  Data Translation Look-aside Buffer Test (dtlbtest)
  Emulex HBA Test (emlxtest)
  Floating Point Unit Test (fputest)
  InfiniBand Host Channel Adapter Test (ibhcatest)
  Level 1 Data Cache Test (l1dcachetest)
  Level 2 SRAM Test (l2sramtest)
  Physical Memory Test (pmemtest)
  QLogic Host Bus Adapter Test (qlctest)
  RAM Test (ramtest)
  Serial Port Test (serialtest)
  System Test (systest)
  Universal Serial Bus Test (usbtest)
  Virtual Memory Test (vmemtest)
  Tape Drive Test (tapetest)
  Network Hardware Test (nettest)
  Ethernet Loopback Test (netlbtest)

For descriptions of these tests and instructions on how to run Oracle VTS refer to:
http://www.oracle.com/technetwork/documentation/sys-mgmt-networking-190072.html

The Oracle VTS tests are listed. As you can see, these tests cover all server internal components as well
as I/O components. For descriptions of these tests and instructions on how to run Oracle VTS, click
on the link.

Slide 21

POST Diagnostics
Power On Self Test is a series of diagnostics that execute before
the server OS is booted to verify that the hardware is healthy and
the configuration is valid
Fatal HW Error:
  OS boot will stop
  The error is reported to:
    ILOM Fault Management
    ILOM System Event Logs
POST Events table is in the Service Manuals
  POST Events Table
POST Error Codes table is found in the appendix of the Diagnostic Guides

POST Error Codes

Power On Self Test is a series of diagnostics that execute before the server OS is booted to verify
that the hardware is healthy and the configuration is valid. If a fatal hardware error is
encountered, the OS boot will stop and the error is reported to ILOM Fault Management and
ILOM System Event Logs. A list of POST events that can stop or allow OS boot to continue is
found within the server's service manual. A list of POST error codes is found in the appendix of
the Diagnostic Guides. These tables can be useful in trying to determine the cause of a hardware
failure caught by POST.

Slide 22

U-Boot Diagnostics
At system start-up, U-Boot diagnostic software initializes the
server and tests the server SP prior to booting the ILOM firmware
U-Boot Test | Description
Memory Data Bus Test | Checks for opens/shorts on SP memory's data bus.
Memory Address Bus Test | Checks for opens/shorts on SP memory's address bus.
Memory Data Integrity Test | Checks for data integrity on the SP memory.
Flash Test | Checks access to Flash.
Watch Dog Test | Checks the Watch Dog functionality on the SP.
I2C Probe Tests | Checks the connectivity to I2C devices on standby power.
Ethernet Test | Verifies ability to read from specified Ethernet port.
[Normal / Quick / Extended columns: per-test marks not recoverable from this copy]

U-Boot execution modes:
  Normal (default)
  Quick (optional)
  Extended (optional)

At system start-up, when AC power is connected to the x86 Platform, the U-Boot diagnostic
software initializes the server and tests aspects of the server service processor prior to booting
the ILOM firmware. The U-Boot diagnostic tests are designed to test the hardware required to
enable the server SP to boot successfully.
There are three execution modes that U-Boot supports. These include normal mode, which is the
default mode, quick and extended modes, which are optional. The modes determine which tests
are run and for how long.
The U-Boot tests are listed in this table according to the mode they run in with a description of
the test. The presentation will now stop to allow you to view this table.

Slide 23

U-Boot Diagnostics (cont.)


U-Boot Test | Description
Ethernet Link Test | Verifies link on specified PHY.
Ethernet Internal Loopback Test | Verifies Ethernet functionality by sending and receiving packets.
Real Time Clock Test | Checks functionality of the real-time clock on the SP.
USB 1.1 Test | Checks USB 1.1 functionality.
USB 1.1 BIST | Runs internal USB 1.1 built-in self-test (BIST).
USB 2.0 Test | Checks USB 2.0 functionality.
BIOS Flash ID Test | Verifies ability to read from the BIOS flash.
Serial Presence Detect (SPD) Access Test | Verifies DIMM SPD access along with checksum and prints SPD information.
Power CPLD | Test verifies the correct power revision of the complex programmable logic device (CPLD).
[Normal / Quick / Extended columns: per-test marks not recoverable from this copy]

This is the continuation of the table on the U-Boot tests.

Slide 24

U-Boot Diagnostics (cont.)


Reset or Reboot the server host
.
.
.

Enter Diagnostics Mode ['q'uick/'n'ormal (default)/e'x'tended]...


.
.
.

Any U-Boot failures are reported to the ILOM System Event Log and the Fault Management.
For more test information on U-Boot refer to the Oracle x86 Servers Diagnostics Guide which can be found on this link:
  X86 Server Documentation

To configure and run U-Boot diagnostics, power cycle or reset the server, then wait for the
U-Boot message that will display over the serial port. When it appears, select either q, n, or x for
the U-Boot mode. The U-Boot tests will display on the console.
Note, any U-Boot failures are reported to the ILOM System Event Log and the Fault
Management. For more information on U-Boot refer to the Oracle x86 Servers Diagnostics
Guide.
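
The key-to-mode choice at the U-Boot prompt can be written down as a tiny mapping, for example in a script that drives the serial console. This is a sketch only: the function name is hypothetical, and treating an empty response as normal mode is an assumption based on the prompt labelling normal as the default.

```python
# Keys as shown in the prompt: 'q'uick, 'n'ormal (default), e'x'tended.
UBOOT_MODES = {"q": "quick", "n": "normal", "x": "extended"}


def select_uboot_mode(key: str = "") -> str:
    """Map a key press at the U-Boot prompt to a diagnostics mode.

    An empty key is treated as the default (normal); that fallback is an
    assumption drawn from the prompt text, not from the Diagnostics Guide.
    """
    key = key.strip().lower()
    if not key:
        return "normal"
    if key not in UBOOT_MODES:
        raise ValueError(f"expected one of {sorted(UBOOT_MODES)}, got {key!r}")
    return UBOOT_MODES[key]
```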

Slide 25

Pc-Check Diagnostics
Pc-Check is a DOS-based diagnostics utility
Available from:
  Within ILOM through its CLI and BUI
  Server's Tools and Driver CD/DVD

Pc-Check is a DOS-based diagnostics utility that can be used to test the x86 Platforms. Pc-Check
is available in newer service processors. For servers that do not have a service processor,
Pc-Check can be executed from the server's Tools and Driver CD/DVD.

Slide 26

Pc-Check Diagnostics (cont.)


Pc-Check Operating modes:
  Enabled: Runs a select list of tests upon start up of the host
  Extended: Runs a comprehensive test suite upon start up of the host
  Manual: Select individual tests or test suites
  Disabled: Disables testing

Pc-Check supports four operating modes.
The enabled mode runs a select list of tests upon start up of the host, which takes
approximately 3 minutes to execute. Once the tests complete, the host continues to boot the next
device based on the BIOS Boot Priority List. This is a quick test that is recommended upon first
time field installation.
The extended mode runs a comprehensive test suite upon start up of the host, which takes
approximately 30 minutes or longer to execute. This is a longer test that is recommended any
time you physically change the system configuration, to verify the new configuration.
The manual mode allows you to select individual tests from the Pc-Check menus, or to select
predefined test suites available through the Immediate Burn-in test menu. This is recommended
when you want to test individual server components for fault isolation testing.
The disabled mode is the default mode. In this mode, Pc-Check diagnostic tests do not run
upon start-up of the host.
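
The four modes and the situations they are recommended for can be summarized in a small table in code. This is a sketch under stated assumptions: the class, the scenario labels (first_install, config_change, fault_isolation) and the function name are hypothetical shorthand for the situations described above, and durations are quoted from the notes where given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PcCheckMode:
    name: str
    runs: str
    approx_duration: str
    recommended_for: str


PC_CHECK_MODES = {
    "enabled": PcCheckMode(
        "enabled", "a select list of tests upon start up of the host",
        "approximately 3 minutes", "first-time field installation"),
    "extended": PcCheckMode(
        "extended", "a comprehensive test suite upon start up of the host",
        "approximately 30 minutes or longer",
        "verifying a physically changed system configuration"),
    "manual": PcCheckMode(
        "manual", "individual tests or predefined test suites from the menus",
        "not stated", "fault isolation of individual server components"),
    "disabled": PcCheckMode(
        "disabled", "no Pc-Check tests at host start-up",
        "none", "default setting"),
}

# Hypothetical scenario labels; anything else falls back to the default mode.
SCENARIO_TO_MODE = {
    "first_install": "enabled",
    "config_change": "extended",
    "fault_isolation": "manual",
}


def recommend_mode(scenario: str) -> PcCheckMode:
    """Return the mode the notes recommend for a scenario label."""
    return PC_CHECK_MODES[SCENARIO_TO_MODE.get(scenario, "disabled")]
```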

Slide 27

Pc-Check Diagnostics Through ILOM


Accessing Pc-Check through
  ILOM CLI
    -> set /SP/diag state=mode
    -> stop /SYS
    -> start /SYS
  ILOM BUI
    Remote Control -> Diagnostics
Manual Mode
  Advanced Diagnostics Testing Menu = individual test
  Immediate Burn-in Testing Menu = test suites
For more test information on Pc-Check, refer to the Oracle x86 Servers Diagnostics Guide which can be found on this link:
  X86 Server Documentation

To access Pc-Check you can either use the ILOM CLI or BUI.
The CLI commands listed set the Pc-Check mode then reboot the server. This will include Pc-Check
testing during the server host boot. The BUI navigation path displayed gives you access to the Pc-Check
setup where you can select the mode.
If manual mode was selected via either CLI or BUI, the Pc-Check menus will be displayed with a choice to
select the Advanced Diagnostics Testing Menu which will display the individual tests, or to select the
Immediate Burn-in Testing Menu to display the test suites.
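
The three CLI steps (set the mode, stop the host, start the host) can be generated for a chosen mode as follows. This is a minimal sketch: the function name is hypothetical, and spelling the state values in lowercase (enabled, extended, manual, disabled) is an assumption based on the mode names in this course.

```python
from typing import List

# Mode names from the course; the lowercase spelling is an assumption.
PC_CHECK_STATES = ("enabled", "extended", "manual", "disabled")


def pc_check_cli_sequence(mode: str) -> List[str]:
    """Return the ILOM CLI lines that set the Pc-Check mode and restart the host."""
    if mode not in PC_CHECK_STATES:
        raise ValueError(f"mode must be one of {PC_CHECK_STATES}")
    return [
        f"set /SP/diag state={mode}",  # choose the Pc-Check mode
        "stop /SYS",                   # power off the server host
        "start /SYS",                  # power it on so the tests run at boot
    ]
```

Keeping the stop and start steps in the same list makes it harder to forget that the new mode only takes effect at the next host boot.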

Slide 28

HW Level ILOM Diagnostics


spdiags
  LED test to check proper functioning of LEDs
  Temperature sensor testing of components
hostdiags
  Command line examples:
  -> hostdiags info
  -> hostdiags fan_test
  -> hostdiags memerr
  -> hostdiags psu_test 0
An example of the use of hostdiags is documented in this bug report under the Sun database.
  Bug#6828998 under Sun
  Bug#6828998 under Oracle

The spdiags command, within ILOM 2.0, opens a menu of tests that allows you to select the
component to test. Two examples are the LED test that can turn on/off LEDs and the temperature
command that tests the temperature sensors of the CPU, DIMMs and other components.
Another set of diagnostics is the hostdiags. This is a CLI command that has a series of options
that can be added to a command line. The most useful commands are: info for the host state, fan
test to verify the fans and memerr to display memory errors.
For more information on Hostdiags, refer to the documented bug. It is also important to note that
this particular bug number was duplicated in the Oracle and Sun bug databases. The same bug
number was used, but they are two different issues. To avoid confusion, be sure to indicate in the
Product Source field whether this is a Sun or Oracle bug in the bug search.

Slide 30

Determining Storage Configuration


Sun Disk Management Overview document (820-6350)
  Library
x86 Platform Disk Controllers:
  Disk controller supported
  RAID levels supported
  Disk controller configuration mechanisms
  OS driver support
  Disk management upgrade tools
  Firmware upgrade tools

Storage Management Solutions supported by the x86 Platforms are identified within the Sun
Disk Management Overview document listed. Click the link provided to open a library where
this document is located. Scroll down the document list and open the Sun Disk Management
Overview entry.
This document lists the x86 servers along with the disk controllers they support, the RAID levels
supported, what mechanism is used to configure the disk controller, what operating systems have
drivers to support the controller, along with the disk management and firmware upgrade tools.

Slide 31

Determining Disk Controller Type


# /usr/sbin/prtconf -D

[Solaris]

# lsscsi -H

[Linux]

Control Panel -> System -> Hardware -> Device Manager
-> SCSI and RAID Host Bus Adapter [Windows 2003]

Control Panel -> System -> Device Manager
-> Storage Controller [Windows 2008]

Disk controller Manufacturers: Intel, LSI, Adaptec and Nvidia


Disk controller types: SAS, SATA , IDE and SCSI
Disk types: HDDs, SSDs, Compact Flash cards, and Flash Modules

From the OS, you can display the type of disk controller in your server using the Solaris
command, the Linux command, or the Windows navigation paths shown.
The x86 platforms may use disk controllers provided by Intel, LSI, Adaptec and Nvidia,
supporting SAS, SATA, IDE and SCSI. The storage disk types supported are HDDs, SSDs, Compact
Flash cards, and Flash Modules.
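
As a rough sketch, the OS-specific lookups above can be collected in one helper that selects the command from the OS name. The function name controller_cmd is hypothetical; only the prtconf -D and lsscsi -H commands and the Windows paths come from the slide. The helper prints the command rather than running it.

```shell
#!/bin/sh
# controller_cmd: print the command the slide gives for a given OS name.
# OS names follow `uname -s` output: SunOS for Solaris, Linux for Linux.
# Anything else falls back to a pointer to the Windows Device Manager paths.
controller_cmd() {
    case "$1" in
        SunOS) echo "/usr/sbin/prtconf -D" ;;
        Linux) echo "lsscsi -H" ;;
        *)     echo "Device Manager (see the Windows paths on the slide)" ;;
    esac
}
```

On a Solaris or Linux host, calling controller_cmd "$(uname -s)" would print the matching command, which can then be run as root.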

Slide 32

Storage Management Tools

Oracle Hardware Installation Assistant (OHIA)

OHIA Library

Sun LSI 106x RAID User's Guide (820-4933)

LSISAS1064/1064E

LSISAS1068/1068E

Sun Intel Adaptec BIOS RAID Utility User's Manual - 820-4708

Intel Matrix Storage - 820-7143


The Oracle Hardware Installation Assistant, or OHIA, is a storage management tool that can
update some of the Host Bus Adapters provided by Oracle. For OHIA documentation, refer to
the library link provided.
For LSI disk controllers, the document listed provides instructions on how to configure and
manage the disks supported by the controllers that are listed. Be mindful that this list will
grow as more LSI disk controllers are released. The Oracle LSI part numbers are listed on
the slide.
For Adaptec disk controllers, the document listed provides instructions on how to configure and
manage supported Adaptec disks. This list will also expand as more Adaptec disk controllers
are released. The Sun Intel Adaptec BIOS RAID Utility User's Manual part number is
820-4708. The Intel Matrix Storage document part number is 820-7143.

Slide 33


Slide 34

Difference Between a FRU and CRU

FRUs and CRUs are server components that were designated as replaceable at the customer site
FRUs can only be replaced by a qualified Oracle or Partner technician
CRUs are replaced by an Oracle Customer


In earlier courses, we learned the definitions of a Field Replaceable Unit, or FRU, and a
Customer Replaceable Unit, or CRU. For the x86 Platform, FRUs and CRUs are server
components that were designated as replaceable at the customer site. A component designated as
a FRU can only be replaced by a qualified Oracle or Oracle Partner technician. A CRU is
replaced by the Oracle customer.

Slide 35

Locating the FRU and CRU List


Sun System Handbook

IPMI
# ipmitool -U root -H 10.8.151.171 fru list

X4140 FRUs and CRUs


The FRU and CRU list for a specific server can be found in the Sun System Handbook. Click the
link to review a sample server list of CRUs and FRUs. From a command line, you can use the
ipmitool command to list the FRUs and CRUs. The example command on the slide lists the
FRUs on the X4140 server.
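
The FRU listing can be long, so a small filter helps when you only need the component names. The sketch below assumes the output contains header lines of the form "FRU Device Description : NAME (ID n)"; the exact layout can differ between ipmitool versions and platforms, and the function name fru_names is hypothetical.

```shell
#!/bin/sh
# fru_names: read FRU listing text on stdin and print one FRU name per line.
# Assumption: each FRU starts with a header line such as
#   FRU Device Description : NAME (ID n)
# Lines that do not match that pattern are ignored.
fru_names() {
    sed -n 's/^FRU Device Description : \(.*\) (ID [0-9]*)$/\1/p'
}
```

For example, the command from the slide could be piped through the filter as ipmitool -U root -H 10.8.151.171 fru list | fru_names, after which ipmitool prompts for the password as usual.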

Slide 36

Replacing FRUs and CRUs


Locate the installation and replacement procedures
Server Installation and Service Manuals (docs.oracle.com)
Server top cover label
EIS Checklists
X2200 M2


x86 installation and replacement procedures are located in the server installation and service
manuals, which can be found on the Oracle Technology Network. The procedures can also be
found on a label on some of the server top covers. As mentioned earlier, the EIS checklists are
highly recommended for server installation.

Slide 37

Hot Swap, Hot Pluggable or Cold Swap


Replacement Methods
Hot Swap
Component can be removed and installed in the server without
software intervention. An example is a cooling fan.

Hot Pluggable
Component requires software intervention prior to removal. An
example is running cfgadm to remove a disk drive.

Cold Swap
Component requires the server to be powered down prior to
removal. An example is a DIMM.


It's vital to understand component replacement procedures. Before starting a replacement of a
component, you must determine whether the component is a hot swap, a hot pluggable, or a cold
swap component. A hot swap component can be removed or installed in the server without
software intervention. An example of this would be a cooling fan. A hot pluggable component
can be installed in the server while it is running, but requires software intervention prior to
removal of the part. The operating system must first be told that the component is no longer
available for operation. An example is the use of cfgadm to prepare a disk drive for removal.
A cold swap component requires the server to be powered down prior to removal or installation
of the component. An example is a DIMM.
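
The three replacement methods can be summarized as a small lookup, shown here as a sketch. The function name prep_for_removal and the method keywords are hypothetical; the required preparation for each method is taken from the descriptions above.

```shell
#!/bin/sh
# prep_for_removal: print the preparation needed before a component is removed.
# Argument: hot-swap, hot-pluggable, or cold-swap (hypothetical keywords).
# Returns non-zero for any other value so that callers can stop and check.
prep_for_removal() {
    case "$1" in
        hot-swap)      echo "none: remove and install with the server running" ;;
        hot-pluggable) echo "software: prepare the component in the OS first" ;;
        cold-swap)     echo "power: power down the server first" ;;
        *)             echo "unknown: check the server's service manual"; return 1 ;;
    esac
}
```

For a hot pluggable disk on Solaris, the software step typically means listing the attachment points with cfgadm -al and then running cfgadm -c unconfigure on the disk's attachment point. The exact attachment point name depends on the server, so confirm it against the service manual.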

Slide 38

Replacement of Memory
Population guideline differences
Between Intel-based and AMD-based servers
Between current Intel and earlier, legacy Intel processors
Server configuration of memory and their relative locations
within the server

DIMM manufacturer, type, density and speed

Location of the population guidelines


Server's Service Manual
Server's top cover label
EIS Checklist


Server memory varies, so you need to rely on server memory population guidelines. How you
proceed with populating DIMMs depends on the CPU, the server's memory configuration, and
the DIMM specifications.
There are population guideline differences between Intel-based and AMD-based servers, and also
between current Intel processors and earlier legacy Intel processors. The server configuration
may not utilize the processor's full memory capacity, which affects the population guidelines.
The manufacturer and type of the DIMMs, as well as their density and speed, may also determine
the population guidelines.
Due to these differences, it is important to reference the population guidelines for each server.
These guidelines can be found in the server's service manual, the server's top cover label, or
the EIS checklist.

Slide 39

Sample AMD Opteron-Based Memory Configuration


X6240 Server Module


This is the first of three examples of x86 platform memory configurations. This slide shows the
X6240 server blade, which is an AMD Opteron-based server. Each CPU supports 8 DIMM slots
that are shared using a HyperTransport link between the CPUs. The DIMMs need to be installed
in pairs, as indicated in the tables. The DDR2 DIMMs must come from the same manufacturer
and have the same density and speed. The population order starts from the DIMM slot farthest
from the CPU.
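
The pairing rule can be expressed as a simple check, sketched below. Each DIMM is described by a hypothetical "manufacturer:density:speed" string, and the DIMMs are passed in population order, two per pair. The sketch assumes the matching requirement applies within each pair and does not model slot order; the server's service manual remains the authority.

```shell
#!/bin/sh
# dimm_pairs_ok: succeed only if the DIMMs form complete, matching pairs.
# Arguments: descriptors such as "acme:4GB:667" (hypothetical format),
# listed so that arguments 1 and 2 form the first pair, 3 and 4 the second.
dimm_pairs_ok() {
    # An odd count means at least one DIMM has no partner.
    [ $(( $# % 2 )) -eq 0 ] || return 1
    while [ $# -gt 0 ]; do
        # Both DIMMs in a pair must have identical descriptors.
        [ "$1" = "$2" ] || return 1
        shift 2
    done
    return 0
}
```

A call such as dimm_pairs_ok "acme:4GB:667" "acme:4GB:667" succeeds, while a single DIMM or a mixed pair fails.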

Slide 40

Example Legacy Intel-Based Memory Configurations

X4150
Server


The X4150 server provides the second example of an x86 platform memory configuration. This is a
legacy Intel-based server, where the two CPUs share 16 DIMM slots. The DIMMs need to be installed
in pairs, as indicated in the tables, using FB-DIMMs that come from the same manufacturer and
have the same density and speed. Notice that the A and B channel DIMM slots are the first matched
DIMM slots to be populated, while the C and D channel DIMM slots are the second matched
DIMM slots to be populated. All DIMM slots are shared by both CPUs through a Northbridge
chip, since the A to D channels are directly connected to this chip.

Slide 41

Sample of a Current Intel Memory Configuration


Ranked DIMMs

X4170 Server

DDR3


A third example of a memory configuration is the X4170 server. This is a Xeon-based server,
with two CPUs that share 12 DIMM slots through a QuickPath link between the CPUs. Each of
the two Intel processors has eight associated DIMM sockets, D0 through D7, as shown in the
diagram. The DIMM types supported are DDR3 DIMMs that are quad, dual or single ranked. Click
the link for a description of ranked DIMMs, starting with quad-ranked DIMMs.

Slide 42

Replacement of Disks
The disk population guidelines are dependent on the
platform type and whether the disk is directly or
indirectly connected to the server.
Replacement rules
Before a disk can be removed it needs to be isolated from its
operating environment, if not under RAID control
Replacement procedures are located in
Server's Service Manual
Server Top Cover Label
EIS Checklist

NOTE: Disk replacement is also dependent on the type of OS.



Disk population guidelines depend on several parameters. These include the platform
type and whether the disk is directly or indirectly connected to the server.
No matter what type of disk it is or how it is physically associated with the server, a disk that
is not under RAID control must be isolated from its operating environment before removal. The
replacement procedure can be found in the server's service manual, the server top cover label,
or the EIS checklist.
Note that disk replacement also depends on the type of OS.

Slide 43

Replacement of Other Components


Other external server components:
Power supplies
CD/DVD drives

Internal Components:
CPUs
Riser Cards
CPU Modules
Memory Board
Motherboard
Fan Modules
SP Board
Fan Boards
REM and FEM
System Battery
Disk Backplanes
I/O Adapters
Front Panel Indicator Board
Power Distribution Board

Sun System Handbook


X4600 M2 Service Manual

The slide lists some of the external and internal server components that can be replaced on the
x86 platforms. The supported components for a specific server can be determined by displaying its
Full Components list within the Sun System Handbook. Click on the link provided to view the
Sun System Handbook.
As with memory and disks, the procedures for external and internal component
replacements can be found in the server's service manual, the server's top cover label, or
the EIS checklist. Click on the link provided to display the X4600 M2 Service Manual so that
you can view examples of its component procedures.

Slide 44


Slide 45

x86 Platform Course Summary

In summary, you have studied how to interpret system indicators,
locate and describe tools used to gather troubleshooting data,
locate and describe diagnostic tools, and describe FRU/CRU
replacement procedures for x86 systems.


In summary, you have studied how to interpret system indicators, locate and describe tools used
to gather troubleshooting data, locate and describe diagnostic tools, and describe FRU/CRU
replacement procedures for x86 systems.

Slide 46

End of Sun x86 Servers Troubleshooting Tools and
FRU/CRU Replacement Course


WZD-SSx86-301


This completes the Sun x86 Servers Troubleshooting Tools and FRU/CRU Replacement course.
Remember, in order to get credit for this course, you must take the course assessment and pass
with a score of 80% or higher.

Slide 47

Thank You


Slide 48

Oracle
