Oracle
Slide 2
Welcome to the x86 Servers Troubleshooting Tools and FRU/CRU Replacement course.
Slide 3
Course Objectives
By the end of this course, you should be able to:
Interpret system indicators
Locate data gathering tools used to troubleshoot x86 system issues
Locate and describe diagnostic troubleshooting tools
By the end of this course, you should be able to interpret system indicators, locate and describe
tools used to gather troubleshooting data, locate and describe diagnostic tools, and describe
FRU/CRU replacement procedures for x86 systems.
Slide 4
As a reminder, you have the option to Test Out of this course at any time. So, let's get started.
Slide 5
Troubleshooting Tools
There are many ways to gather data on x86 servers for troubleshooting purposes. In this course we will
discuss a few of these methods including system indicators, data gathering tools, as well as system and
storage diagnostic tools. We'll start by looking at the x86 system indicators.
Slide 6
ON = OS Booted
ON = Hardware failure
OK-to-Remove LED (blue):
System indicators, or LEDs, are a good place to start when troubleshooting a hardware issue. The
x86 servers typically have 3 system LEDs to indicate the state of the platform. The Power LED
is a green LED. There are three states for the green Power LED. When the LED is off, the
server is not connected to AC power. When the LED is blinking, the server is in Standby Power
mode. This means that the server is connected to AC power, but the host is not powered.
Depending on the server, you may have two blinking states, the fast blink and the slow blink
states. The fast blink lasts for two minutes after the AC power is applied to the host, during the
time that the SP is initializing. The slow blink occurs when power is applied to the server host,
during the time that the server host is booting. To determine which blink states are supported by
a specific server, refer to that server's service manual. When the Power LED is steady on, the
server is connected to AC power and its host is powered. The Service Action
Required LED is amber. This LED turns steady on when a hardware failure occurs within
the server or in any of its components. A corresponding component Service Action Required
LED may light up on the specific server component that has failed. The OK-to-Remove LED is
blue. This LED turns steady on to indicate when it is safe to remove a failed component for
replacement.
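The LED states described above can be summarized as a small lookup table. The following is only a sketch in Python: the names `SYSTEM_LEDS` and `interpret_led` are hypothetical and are not part of any Oracle tool, and the table restates the narration rather than any specific server's behavior, so the server's service manual remains the authority.

```python
# Illustrative lookup of the system LED states described in the narration.
# The keys and the function name are hypothetical, not part of any Oracle tool.
SYSTEM_LEDS = {
    ("power", "off"): "Server is not connected to AC power",
    ("power", "blinking"): "Standby power: AC connected, host not powered",
    ("power", "on"): "AC connected and host powered",
    ("service", "on"): "Hardware failure in the server or one of its components",
    ("ok_to_remove", "on"): "Failed component is safe to remove",
}


def interpret_led(led: str, state: str) -> str:
    """Return the meaning of an LED state, or a pointer to the service manual."""
    return SYSTEM_LEDS.get(
        (led.lower(), state.lower()),
        "Not covered here; refer to the server's service manual",
    )


print(interpret_led("power", "blinking"))
```

Servers that distinguish fast and slow blink states would need extra entries taken from their service manuals.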
Slide 7
White locator LEDs may be present on some server models. These can be programmed to flash to assist
onsite technicians in finding a specific server among a number of servers.
Amber LEDs are found on certain components internal to the server. Each CPU and each DIMM
has its own amber LED. To provide capacitor power to the CPU and DIMM fault LEDs, press
the Fault Remind button. The LED of the faulty component will then light.
Since the amber component LEDs are powered by a capacitor, they will only have enough power
to light up a couple of times, and only for a short time. Due to this limitation, be prepared to
locate the faulty component quickly, as you may only get results from pressing the Fault Remind
button a couple of times. This action needs to be completed within a set time period, determined
by each specific server.
Slide 8
OK Power (green)
Service Action Required (amber)
Ready-to-Remove (blue)
Ethernet Port LEDs
Link/Activity (green)
Hard disk drives have three LEDs. The green OK Power LED will light when the disk is
powered and will flash according to the disk's activity. The amber Service Action Required LED,
again, is the hardware fault indicator. The blue Ready-to-Remove LED is steady on when the
disk is ready to be removed.
Ethernet ports have two LEDs. The green Link/Activity LED is located on the left of the port and
flashes according to port activity. The Speed LED is located on the right of the port, and its
color indicates the speed the port is configured for: green indicates 1 gigabit per second,
amber indicates 100 megabits per second, and off indicates 10 megabits per second.
Slide 9
Fan LED:
OK Power (green)
Service Action Required (amber)
x86 Platform power supplies support two or three LEDs. These LEDs may include: a green AC
LED, a green DC LED, an amber Service Action Required LED and/or a blue Ready-to-Remove
LED. A lighted AC LED indicates that AC power is present at the power supply, while the DC
LED indicates that the power supply is generating DC power when it is on.
The fan modules support one or two LEDs. A lighted green OK power LED indicates that the fan
module is powered. The amber Service Action Required LED will light when the fan module
encounters a hardware failure.
Now that we have a good understanding of the platform and component level LEDs, let's look at
some other data gathering tools that can be used to troubleshoot x86 issues.
Slide 10
Tool                 Resident on
ifconfig, netstat    Solaris
ipconfig, netstat    Windows
snapshot             ILOM
Explorer             Solaris
MPS Report           Windows
sosreport
supportconfig        SuSE
In this next section of the course we will discuss data gathering tools to use for troubleshooting.
The x86 Platform data gathering troubleshooting tools allow the user to view system status and
configuration. The tools to view system status are listed in the table on the slide, along with
where they reside and their function. We will look at each of these tools in more detail on the
next few slides.
Slide 11
The customer's first indication of an error may be displayed on the system console and will be
recorded in the OS system error log. To access these logs in Solaris and Linux, use the vi or
view command to display the contents of the messages files. The messages files are generated
and updated by the syslog facility of Solaris and Linux.
For Windows, navigate to the Computer Management screen, which gives you access to the
Event Viewer that displays the system log among other screens.
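Paging through a large messages file in vi can be slow, so a short filter can help narrow it down first. The sketch below is in Python and makes several assumptions: the default file locations are /var/adm/messages on Solaris and /var/log/messages on Linux (a site's syslog configuration may differ), and the keyword list and function names are hypothetical.

```python
import re
import sys

# Usual syslog message files; a site's syslog configuration may place them elsewhere.
MESSAGE_FILES = {"sunos": "/var/adm/messages", "linux": "/var/log/messages"}

# Hypothetical keyword list; adjust it to the problem being investigated.
KEYWORDS = re.compile(r"error|warning|fail|panic", re.IGNORECASE)


def filter_messages(lines):
    """Return only the lines that contain one of the keywords."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


def default_messages_path(platform=sys.platform):
    """Pick the messages file for the given platform string, or None if unknown."""
    for prefix, path in MESSAGE_FILES.items():
        if platform.startswith(prefix):
            return path
    return None


# Made-up sample lines standing in for the contents of a messages file.
sample = [
    "Mar 14 09:18:47 host1 kernel: EDAC MC0: CE error on DIMM\n",
    "Mar 14 09:19:02 host1 sshd[411]: session opened for user root\n",
]
for line in filter_messages(sample):
    print(line)
```

On a live system, the list passed to `filter_messages` would be an open file handle for the path returned by `default_messages_path()`, read with sufficient privileges.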
Slide 12
IPMI SEL
CLI: # ipmitool -U root -H <HOST> sel elist
Log Fields
Timestamp
Severity
Description
Device
System event logs are also supported within ILOM, BIOS and IPMI. ILOM's system event logs
are available through its CLI and BUI interfaces. The CLI command line is displayed here along
with the path to the BUI screen.
BIOS has a system event log that can be accessed by using the navigation path displayed here to
open the Event Logging screen. IPMI, which is the closest management tool to the hardware, has
a system event log that is accessible through the IPMItool. This log is a subset of the events
posted within the BIOS system event log.
Logs can give you an ordered list of events that led up to the hardware problem by using the
timestamp field to order the entries. The severity field indicates the severity of the reported
event. The description field can pinpoint the source of the problem, or it can give enough
information about the problem to start the FRU isolation process. The device field of a log is
useful to determine the log entries that are related to the same device.
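Ordering entries by timestamp and grouping them by device can be sketched in a few lines of Python. The sample text below is made up and follows the pipe-delimited layout that `ipmitool sel elist` prints; the function names are hypothetical, and this output has no separate severity column, so only the time, device, and description are extracted.

```python
from datetime import datetime

# Made-up sample in the pipe-delimited layout printed by "ipmitool sel elist".
SAMPLE = """\
   2 | 03/14/2011 | 09:20:11 | Power Supply #0x51 | Failure detected | Asserted
   1 | 03/14/2011 | 09:18:47 | Temperature #0x30 | Upper Critical going high | Asserted
   3 | 03/14/2011 | 09:20:40 | Power Supply #0x51 | Failure detected | Deasserted
"""


def parse_sel(text):
    """Turn each well-formed line into a dict; skip lines that do not fit."""
    entries = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue
        try:
            stamp = datetime.strptime(f"{fields[1]} {fields[2]}", "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue  # entries without a parseable date and time are ignored
        entries.append({
            "id": fields[0],
            "time": stamp,
            "device": fields[3],
            "description": " | ".join(fields[4:]),
        })
    return entries


def timeline(entries, device=None):
    """Order entries by timestamp, optionally keeping a single device."""
    selected = [e for e in entries if device is None or e["device"] == device]
    return sorted(selected, key=lambda e: e["time"])


for entry in timeline(parse_sel(SAMPLE)):
    print(entry["time"].isoformat(), entry["device"], entry["description"])
```

Passing `device="Power Supply #0x51"` to `timeline` keeps only the entries for that device, which mirrors how the device field is used to relate log entries.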
Slide 13
show /SP/faultmgmt
Diagnostic Guide
If the customer does not have easy access to the x86 server and therefore cannot visually report
on the status of the LED indicators, there are ways to view this information remotely. The
x86 server indicator and sensor data can be accessed using the ILOM CLI under
/SP/faultmgmt, or within the ILOM BUI under the Fault Management tab. This data can
also be displayed using the IPMItool CLI commands that are supported by IPMI. In the IPMItool
examples on the slide, notice that sdr within the first command line corresponds to the sensor
data repository while led within the second command refers to the indicators. Note that Fault
Management is only available on current systems.
The IPMItool software is available from the server's Tools and Driver CD, or from the MOS
location displayed. Manuals on the use of the IPMItool commands are available within the
Diagnostics Guide for a specific server. Click on the link for a diagnostic guide sample. The
presentation will now stop to allow you to access the path and link.
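The exact IPMItool command lines from the slide are not preserved in this transcript. As a hedged sketch, the Python helper below builds remote command lines in the same `-U`/`-H` form used elsewhere in this course. The subcommands `sdr list` and `sunoem led get all` are assumptions about what the slide showed, the helper name `ipmitool_argv` is hypothetical, and the address is a documentation placeholder.

```python
import shlex


def ipmitool_argv(host, user, *subcommand):
    """Build the argument list for a remote ipmitool call in the -U/-H form."""
    if not subcommand:
        raise ValueError("a subcommand such as 'sdr' is required")
    return ["ipmitool", "-U", user, "-H", host, *subcommand]


# Assumed subcommands: "sdr list" reads the sensor data repository, and
# "sunoem led get all" queries indicator state on Sun/Oracle platforms.
for args in (("sdr", "list"), ("sunoem", "led", "get", "all")):
    print(shlex.join(ipmitool_argv("192.0.2.10", "root", *args)))
```

The returned list can be handed to `subprocess.run` on a machine that has ipmitool installed. Depending on the service processor, an interface option such as `-I lanplus` may also be needed; consult the Diagnostics Guide for the specific server.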
Slide 14
prtdiag
prtconf
ifconfig -a
sysconfig
netstat -a
Windows Utilities
ipconfig
(Network configuration and state for Windows)
View hardware configuration information with msinfo32
Start -> Run and type msinfo32 to view the configuration
File -> Save to save the configuration.
Solaris, Linux, and Windows operating systems also provide utilities to view the current
hardware and network configurations. The utilities are listed along with the operating systems
that support them. Notice that the ifconfig command also gives you the ability to modify the
network configuration. There are no exact Windows equivalents to most of the utilities shown;
however, there is a command line utility called ipconfig. Also, hardware configuration information
can be displayed with msinfo32 by navigating to Start -> Run and typing msinfo32, then selecting
File -> Save to save the configuration.
Slide 15
ILOM Snapshot
data:
normal
full
normal-logonly
full-logonly
URL:
Any valid target directory location
protocol://username:password@host/directory
ILOM Addendum
Until now we have been gathering individual portions of system data. It would be more efficient
to gather larger portions of data to analyze. The ILOM snapshot utility collects log files, runs
various commands and collects their output from the service processor, then sends the data
collection as a downloaded file to a user-defined location.
To perform an ILOM snapshot, define the data to be collected using the first set command shown
on the slide. The data field can be normal, full, normal-logonly, or full-logonly. The value
you select depends on how much data you want to collect, as indicated by their definitions.
The second command sets the location where the data will be sent, using the format displayed.
The protocols supported are tftp, ftp, sftp, scp, http, and https. These are the same protocols
that ILOM uses for backup and restore of the ILOM configuration.
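The two set commands themselves are not preserved in this transcript. The Python sketch below assembles them under the assumption that they target /SP/diag/snapshot with the properties dataset and dump_uri; verify both names against the ILOM documentation for the firmware in use. The function name and the example values are hypothetical.

```python
# Values taken from the slide text: data options and supported protocols.
DATASETS = ("normal", "full", "normal-logonly", "full-logonly")
PROTOCOLS = ("tftp", "ftp", "sftp", "scp", "http", "https")


def snapshot_commands(dataset, protocol, username, password, host, directory):
    """Return two ILOM CLI lines for a snapshot after validating the inputs.

    The target /SP/diag/snapshot and the property names are assumptions.
    """
    if dataset not in DATASETS:
        raise ValueError(f"unsupported dataset: {dataset}")
    if protocol not in PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol}")
    # Follows the slide's format: protocol://username:password@host/directory
    uri = f"{protocol}://{username}:{password}@{host}/{directory.strip('/')}"
    return [
        f"set /SP/diag/snapshot dataset={dataset}",
        f"set /SP/diag/snapshot dump_uri={uri}",
    ]


for line in snapshot_commands("normal", "sftp", "user", "secret", "192.0.2.20", "/dumps"):
    print(line)
```

Validating the dataset and protocol before building the URI catches typing mistakes before anything is sent to the service processor. Credentials that contain characters such as @ or : would need separate handling, which this sketch does not attempt.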
For more information on ILOM snapshot, click on the ILOM Addendum link.
Slide 16
STB Installer
Oracle Explorer Data Collector is a diagnostic data collection tool that is made up of shell scripts
and a few binary executables. Oracle Explorer Data Collector is designed to run on Solaris x86
platforms and is distributed as part of the Services Tools Bundle or STB.
Oracle Explorer Data Collector is a collection of shell scripts that gather information and create a
detailed snapshot of a system's configuration. Information related to drivers, patches, recent
system event history and log file entries is obtained from the Explorer output. For additional
information click on the Oracle Explorer link.
You can install Oracle Explorer Data Collector using the STB installer. The STB software and
documentation can be accessed from the STB link. Refer to the STB documentation for the
installation instructions.
Now that we've discussed the use of Data Gathering Troubleshooting Tools to view system
information, let's Check Your Knowledge.
Slide 17
Slide 18
Resident in
Description
Bootable CD or Solaris
POST
BIOS
U-Boot
SP
ILOM
Pc-Check
ILOM
For tools for specific x86 servers, view the Sun x86 Servers Diagnostic Guide
found in the Sun System Handbooks Related Documentation link.
In the next section of the course, we will review some diagnostic troubleshooting tools that are
compatible with most x86 Platforms. Along with the tool, the table displays where the diagnostic
tool resides and a description of its functionality. For the tools that are compatible with a specific
x86 Server, view the Oracle x86 Servers Diagnostic Guide that can be found in the Sun System
Handbook's Related Documentation link.
Slide 19
Oracle Validation Test Suite, previously known as SunVTS, is an exerciser that tests and
validates Sun hardware. Oracle Validation Test Suite or Oracle VTS is used to ensure the proper
operation of the overall system under test and its underlying hardware. It stimulates, detects, and
identifies hardware faults and is used for both hardware validation and repair verification.
The Oracle VTS diagnostics are available on Solaris, as a USB boot image, or on a bootable CD. The
bootable CD lets you boot a CD-resident Solaris OS, which starts a CD-resident Oracle VTS
that tests the server and generates a report.
The minimum Oracle VTS version supported is the one that shipped with the server. The
current Oracle VTS version can be found in the server's product notes and can be downloaded
from the link provided on the slide. The SunVTS download link also provides SunVTS versions
for Linux.
Slide 20
For descriptions of these tests and instructions on how to run Oracle VTS refer to:
http://www.oracle.com/technetwork/documentation/sys-mgmt-networking-190072.html
The Oracle VTS tests are listed. As you can see, these tests cover all server internal components as well
as I/O components. For descriptions of these tests and instructions on how to run Oracle VTS, click
on the link.
Slide 21
POST Diagnostics
Power On Self Test is a series of diagnostics that execute before
the server OS is booted to verify that the hardware is healthy and
the configuration is valid
Fatal HW Error:
OS boot will stop
The error is reported to:
ILOM Fault Management
Power On Self Test is a series of diagnostics that execute before the server OS is booted to verify
that the hardware is healthy and the configuration is valid. If a fatal hardware error is
encountered, the OS boot will stop and the error is reported to ILOM Fault Management and
ILOM System Event Logs. A list of POST events that can stop or allow OS boot to continue is
found within the server's service manual. A list of POST error codes is found in the appendix of
the Diagnostic Guides. These tables can be useful in trying to determine the cause of a hardware
failure caught by POST.
Slide 22
U-Boot Diagnostics
At system start-up, U-Boot diagnostic software initializes the
server and tests the server SP prior to booting the ILOM firmware
U-Boot Test
Normal
Quick
Extended
Description
Ethernet Test
At system start-up, when AC power is connected to the x86 Platform, the U-Boot diagnostic
software initializes the server and tests aspects of the server service processor prior to booting
the ILOM firmware. The U-Boot diagnostic tests are designed to test the hardware required to
enable the server SP to boot successfully.
There are three execution modes that U-Boot supports: normal mode, which is the
default, and the optional quick and extended modes. The modes determine which tests
are run and for how long.
The U-Boot tests are listed in this table according to the mode they run in with a description of
the test. The presentation will now stop to allow you to view this table.
Slide 23
[Table: U-Boot tests by mode (Quick, Normal, Extended), with descriptions]
Slide 24
To configure and run U-Boot diagnostics, power cycle or reset the server, then wait for the U-Boot
message that will display over the serial port. When it appears, select q, n, or x for
the U-Boot mode. The U-Boot tests will display on the console.
Note that any U-Boot failures are reported to the ILOM System Event Log and to ILOM Fault
Management. For more information on U-Boot, refer to the Oracle x86 Servers Diagnostics
Guide.
Slide 25
Pc-Check Diagnostics
Pc-Check is a DOS-based diagnostics utility
Available from:
Pc-Check is a DOS-based diagnostics utility that can be used to test the x86 Platforms. Pc-Check
is available in newer service processors. For servers that do not have a service processor,
Pc-Check can be executed from the server's Tools and Driver CD/DVD.
Slide 26
Slide 27
To access Pc-Check you can either use the ILOM CLI or BUI.
The CLI commands listed set the Pc-Check mode then reboot the server. This will include Pc-Check
testing during the server host boot. The BUI navigation path displayed gives you access to the Pc-Check
setup where you can select the mode.
If manual mode was selected via either CLI or BUI, the Pc-Check menus will be displayed with a choice to
select the Advanced Diagnostics Testing Menu which will display the individual tests, or to select the
Immediate Burn-in Testing Menu to display the test suites.
Slide 28
hostdiags
Command line examples:
-> hostdiags info
-> hostdiags fan_test
-> hostdiags memerr
-> hostdiags psu_test 0
An example of the use of Hostdiags is documented in this bug report under the Sun database:
Bug#6828998 under Sun
Bug#6828998 under Oracle
The spdiag command, within ILOM 2.0, opens a menu of tests that allows you to select the
component to test. Two examples are the LED test, which can turn LEDs on and off, and the temperature
command, which tests the temperature sensors of the CPU, DIMMs and other components.
Another set of diagnostics is hostdiags. This is a CLI command that has a series of options
that can be added to a command line. The most useful options are: info for the host state, fan_test
to verify the fans, and memerr to display memory errors.
For more information on Hostdiags, refer to the documented bug. It is also important to note that
this particular bug number was duplicated in the Oracle and Sun bug databases. The same bug
number was used, but they are two different issues. To avoid confusion, be sure to indicate in the
Product Source field whether this is a Sun or Oracle bug in the bug search.
Slide 29
Slide 30
Library
Storage Management Solutions supported by the x86 Platforms are identified within the Sun
Disk Management Overview document listed. Click the link provided to open a library where
this document is located. Scroll down the document list and open the Sun Disk Management
Overview entry.
This document lists the x86 servers along with the disk controllers they support, the RAID levels
supported, what mechanism is used to configure the disk controller, what operating systems have
drivers to support the controller, along with the disk management and firmware upgrade tools.
Slide 31
[Solaris]
# lsscsi -H
[Linux]
From the OS, the type of disk controller your server supports can be displayed using the Solaris
command, Linux command or Windows navigation paths displayed.
The x86 Platforms may use disk controllers provided by Intel, LSI, Adaptec and Nvidia,
supporting SAS, SATA, IDE and SCSI. Storage disk types supported are HDDs, SSDs, Compact
Flash cards, and Flash Modules.
Slide 32
OHIA Library
LSISAS1064/1064E
LSISAS1068/1068E
The Oracle Hardware Installation Assistant, or OHIA, is a storage management tool that has the
capability to update some of the Host Bus Adapters provided by Oracle. For documents on OHIA
refer to the library link provided.
If you are dealing with LSI disk controllers the document listed provides the instructions on how
to configure and manage the disks supported by the controllers that are listed. Be mindful that
this list will grow as more LSI disk controllers are released. The Oracle LSI part numbers are
listed on the slide.
For Adaptec disk controllers, the document listed provides instructions on how to configure and
manage supported Adaptec disks. Consider that this list will also expand as more Adaptec disk
controllers are released. The Oracle Intel Adaptec BIOS RAID Utility Manual part number is
820-4708. The Intel disk controller part number is 820-7143.
Slide 33
Slide 34
In earlier courses, we learned the definitions of a Field Replaceable Unit, or FRU, and a
Customer Replaceable Unit, or CRU. For the x86 Platform, FRUs and CRUs are server
components that were designated as replaceable at the customer site. A component designated as
a FRU can only be replaced by a qualified Oracle or Oracle Partner technician. A CRU is
replaced by the Oracle customer.
Slide 35
IPMI
# ipmitool -U root -H 10.8.151.171 fru list
The FRU and CRU list for a specific server can be found in the Sun System Handbook. Click the
link to review a sample server list of CRUs and FRUs. From a command line you can use the
IPMItool command to list the FRUs and CRUs. The example command on the slide will list
FRUs on the X4140 Server.
Slide 36
x86 installation and replacement procedures are located in the server installation and service
manuals which can be found on the Oracle Technology Network. The procedures can also be
found on a label on some server top covers. As mentioned earlier, the EIS checklists are
highly recommended for server installation.
Slide 37
Hot Pluggable
Component requires software intervention prior to removal. An
example is running cfgadm to remove a disk drive.
Cold Swap
Component requires the server to be powered down prior to
removal. An example is a DIMM.
Slide 38
Replacement of Memory
Population guideline differences
Between Intel-based and AMD-based servers
Between current Intel and earlier, legacy Intel processors
Server configuration of memory and their relative locations
within the server
Server memory varies, so you need to rely on server memory population guidelines. How you
proceed with populating DIMMs depends on the CPU, the server's memory configuration, and
the DIMM specifications.
There are population guideline differences between Intel-based and AMD-based servers and also
between current Intel processors and earlier legacy Intel processors. The server configuration
may not utilize the processor's full memory capacity, which affects the population guidelines.
The manufacturer, type of DIMM, as well as their density and speed may also determine the
population guidelines.
Due to these differences, it is important to reference the population guidelines for each server. These
guidelines can be found in the server's service manual, the server's top cover label, or the EIS
checklist.
Slide 39
This is the first of three examples of x86 platform memory configurations. This slide shows the
X6240 server blade which is an AMD Opteron-based server. Each CPU supports 8 DIMM slots
that are shared using a HyperTransport link between the CPUs. The DIMMs need to be installed
in pairs, as indicated in the tables. The DDR2 DIMMs must come from the same manufacturer
and have the same density and speed. The population order starts from the DIMM slot farthest
from the CPU.
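The pairing rules described above lend themselves to a quick pre-installation check. The Python sketch below assumes that the matching requirement (same manufacturer, density, and speed) applies within each pair; the `Dimm` type, the pair layout, and the vendor names are hypothetical, and the real slot pairing and population order come from the server's service manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimm:
    """Hypothetical record of the DIMM attributes that must match."""
    manufacturer: str
    density_gb: int
    speed_mhz: int


def check_pairs(pairs):
    """Return a list of problems for (first, second) slot pairs; None means empty."""
    problems = []
    for index, (first, second) in enumerate(pairs):
        if first is None and second is None:
            continue  # an unpopulated pair is acceptable
        if first is None or second is None:
            problems.append(f"pair {index}: only one DIMM installed")
        elif first != second:
            problems.append(
                f"pair {index}: DIMMs differ in manufacturer, density, or speed"
            )
    return problems


a = Dimm("VendorA", 4, 667)
b = Dimm("VendorB", 4, 667)
print(check_pairs([(a, a), (a, b), (a, None), (None, None)]))
```

The check only covers pairing and matching. It does not model the population order, which starts from the slot farthest from the CPU on this server.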
Slide 40
X4150 Server
The X4150 server provides a second example of x86 platform memory configuration. This is a
legacy Intel-based server, where the two CPUs share 16 DIMM slots. The DIMMs need to be installed
in pairs, as indicated in the tables, with FB-DIMMs that come from the same manufacturer with
the same density and speed. Notice that the A and B channel DIMM slots are the first matched
DIMM slots to be populated while the C and D channel DIMM slots are the second matched
DIMM slots to be populated. All DIMM slots are shared by both CPUs through a NorthBridge
chip since the A to D channels are directly connected to this chip.
Slide 41
X4170 Server
DDR3
A third example of a memory configuration is the X4170 server. This is a Xeon-based server,
with two CPUs that share 12 DIMM slots through a QuickPath link between the CPUs. Each of
the two Intel processors has eight associated DIMM sockets, D0 through D7, as shown in the
diagram. The DIMM types supported are DDR3s that are quad, dual or single ranked. Click the
link for a description of ranked DIMMs starting with Quad Ranked DIMMs.
Slide 42
Replacement of Disks
The disk population guidelines are dependent on the
platform type and whether the disk is directly or
indirectly connected to the server.
Replacement rules
Before a disk can be removed it needs to be isolated from its
operating environment, if not under RAID control
Replacement procedures are located in
Servers Service Manual
Server Top Cover Label
EIS Checklist
Disk population guidelines are dependent on several parameters. These include the platform
type and whether the disk is directly or indirectly connected to the server.
No matter what type of disk or how it is physically associated with the server, it must be isolated
from its operating environment before removal, if not under RAID control. The procedure for its
replacement can be located in the server's service manual, the server top cover label, or the EIS
checklist.
Note that disk replacement is also dependent on the type of OS.
Slide 43
Internal Components:
CPUs
Riser Cards
CPU Modules
Memory Board
Motherboard
Fan Modules
SP Board
Fan Boards
System Battery
Disk Backplanes
I/O Adapters
Some of the external and internal server components that can be replaced on the x86
platforms are listed. The supported components for a specific server can be determined by displaying its
Full Components list within the Sun System Handbook. Click on the link provided to view the
Sun System Handbook.
As in the case of the memory and disks, the procedures for the external and internal component
replacements can be located in the server's service manual, on the server's top cover label, or
in the EIS checklist. Click on the link provided to display the X4600 M2 Service Manual so that
you can view examples of its component procedures.
Slide 44
Slide 45
In summary, you have studied how to interpret system indicators, locate and describe tools used
to gather troubleshooting data, locate and describe diagnostic tools, and describe FRU/CRU
replacement procedures for x86 systems.
Slide 46
WZD-SSx86-301
This completes the Sun x86 Servers Troubleshooting Tools and FRU/CRU Replacement course.
Remember, in order to get credit for this course, you must take the course assessment and pass
with a score of 80% or higher.
Slide 47
Thank You
Slide 48
Oracle