
Sun Fire 280R Server

Troubleshooting Guide

Product Technical Support – VSP


(Editors: Olly Sharwood & Mike LaFlamme)

Abstract
The intent of this Troubleshooting Guide is to provide Support Services Customer Facing
engineers with the ability to understand and recognize known problems on Sun Fire 280R servers,
and to know what information is necessary, and how to gather it, for correctly diagnosing certain
classes of difficult problems. The stated goal is that the customer receive the same service
responses and assistance for Sun Fire 280R server cases, regardless of the expertise level of
the Customer Facing engineer they open the case with, or the difficulty of the problem. The
guide is organized into two sections. The first section covers setup of the system in preparation
for enabling diagnostic data to be captured. The second section categorizes common problems
encountered based on the FRU's in the Sun Fire 280R server. It must be realized that for some
problems, even with all possible output data, it is still difficult to narrow the cause down to one
of several possible FRU types. In these cases, it is useful to apply basic trial-and-error
procedures with a second known-good system and parts, to eliminate as many of the suspects
as possible by testing them one at a time.

Revision 1.2.1; March 9, 2005


Sun Proprietary/Confidential – Internal Use Only
Table of Contents

Section 1. General System Configuration for Diagnosability
    Sun Fire 280R Server Serial Numbers
    Patches
    Explorer Data Collector
    SunVTS 5.1 Diagnostic & System Exercising Tool
    Hardware Watchdog Mechanisms
    Console Logging
    Remote System Controller (RSC)
    OBP Settings
    Core Dump Analysis
    General Questions

Section 2. Hardware Troubleshooting Sun Fire 280R Server to Correct FRU
    Responding to System Error States
    Responding to System Hang States
    Fatal Reset Errors and RED State Exceptions
    About Unexpected Reboots
    CPU Module
        Panics with “send_mondo_set: timeout”
        “Invalid AFSR” on CPUx Messages
        “CPU seeprom format:” Messages
        Detecting Bad CPU Writers
    Memory DIMMs
        General Memory Configuration Rules and Guidelines
        Patches
        Memory Replacement Guidelines
        OBP Firmware
    Interpreting AFSR & AFAR Outputs
    Disks
        General Disk Troubleshooting
    PCI Cards
    Power Supplies and PDB
    Motherboard
    Miscellaneous Issues
        Sun Rack 900 (NGR)
    Appendix
        Appendix A: Trap Types Table for UltraSPARC III CPU's
        Appendix B: Manual Decoding of ECC Memory Errors
        Appendix C: Device Tree Layout for Sun Fire 280R Server


Section 1. General System Configuration
for Diagnosability
Overview of Customer's Application
It is always useful to gather a profile of the type of applications the customer is running on the
system, to better understand what the system is doing at the time of failure. The application type
and mix may be loading the system in such a way that brings out the failure faster or uncovers a
unique failure signature that other similar systems running different applications will not exhibit.

Some applications can be categorized as follows:


1) CPU Intensive – Mostly scientific and computational applications, e.g. cryptography
2) Memory Intensive – Database queries
3) I/O Intensive – Web hosting, transactional processing, backup/restore applications, file/print

Sun Fire 280R Server Serial Numbers


For better case history tracking, accurate reporting of persistent field problems, assistance in
implementing proactive services, and checking of applicability of FIN's, FCO's and SunAlerts, it
is critical to have the correct system serial number entered into the Radiance case.

The serial number is located on the system in two places. There is a label on the rear of the
system, to the left of the PCI slots, which contains the S/N, the top-level part number (P/N 6xx-xxxx-xx)
of the original system configuration, text similar to “Assembled in <Country>”, and a barcode.
There is also a label on the front of the system, on the metal immediately below the internal disk
drives, which contains just the S/N and P/N.

The following is a breakdown of valid serial numbers for Sun Fire 280R systems:

Valid serial numbers on the Sun Fire 280R server start at approximately 109xxxxx, as the first
systems shipped were manufactured in week 9 of 2001. Plant codes where Sun Fire 280R
server systems have been assembled are:
Plant                        Code
Foothill Ranch, CA           C
Ashton, UK                   S
Santa Palomba, Italy         Z
Toronto, Canada              AD
Kladno, Czech Republic       AA



Patches
It has been consistently seen during customer escalations that a majority of diagnosability
concerns and system downtime can be avoided by keeping current with recommended patch
levels on the Sun Fire 280R server. PTS VSP maintains current recommended patch lists on its
website, and this information almost always supersedes the current revision of FIN I0697-1,
SunAlert 28290 or InfoDoc 27524. While these documents originally served a specific purpose to
get the message out at the time, they are difficult to keep current and updated due to the
processes involved in re-publishing them with updated content. As long as it is available, the
field should always use the patch lists for Solaris 8 and Solaris 9 (including firmware patches)
posted here:

http://pts-americas.west/vsp/wgs/products/littleneck/patches_index.html

The brief page provides a handy table listing just the patches and their synopses. The detailed
page provides more detail as to the specific bug fixes in each patch revision that affect Sun Fire
280R servers. The text file version provides the same table in a convenient text format for
emailing to CU's.

In the event a CU refuses to update patches as the first step in troubleshooting their problem,
please make use of the bug lists provided on the detailed page, and the information in this guide,
to make them aware of specifically why they should load the patch and how it will take care of this
issue without needless hardware replacements, and try to sell them some of the patch
management solutions that Sun provides such as PatchPro, Patch Manager (Solaris 9), SRS
NetConnect, use of Solaris Management Console, or periodic “flash” JumpStart installs from a
standard image. If there are further bugs or information that the CU still needs to justify the patch
update, please make PTS VSP aware of the situation by opening a PTS VSP Engagement Task
in Radiance, and we will work with the Customer Facing engineer to assist in fulfilling the CU's
specific needs.

Kernel Update Patch (KUP) for Solaris 8 108528-16 (or later) and Solaris 9 KUP 112233-01 (or
later) changes the way in which error events for memory and cache errors are logged. On prior
KUP revisions, only the 256th error counted for each event type would be logged to the console
and “/var/adm/messages”, which is not useful in most cases. To make every error be logged
on older revisions, the “/etc/system” file needs to be modified to include “set ce_verbose = 1”.
With the newer KUP revisions, no additional setting of “/etc/system” parameters is required,
as the default for VSP servers is now to log all error events in “/var/adm/messages”. Also, if the
CU has console logging and/or RSC configured, the CU can choose to have the output sent to
the console device in addition to “/var/adm/messages” by setting the following variables in
“/etc/system”:
set ce_verbose_memory = 2
set ce_verbose_other = 2
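
As a quick sanity check before relying on this logging behaviour, the installed KUP revision and
any ce_verbose entries already present in “/etc/system” can be confirmed from a root shell; a
minimal sketch (substitute patch ID 112233 on Solaris 9):

# showrev -p | grep 108528
# grep ce_verbose /etc/system

Remember that changes to “/etc/system” only take effect after the next reboot.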

Versions earlier than Solaris 8 KUP 108528-16 or Solaris 9 KUP 112233-03 may require hand-
decoding of AFSR and AFAR values to ensure the correct bad DIMM or bank is identified from
the Solaris output. In addition, new functionality has been (and is continuing to be) added to the
kernel (KUP-20 & KUP-06) to allow offlining of identified failing DIMMs by preventing pages from
being allocated to the physical memory that DIMM covers. This is designed to allow the system
to continue running until such time as the reported error and failing DIMM can be replaced during
a regular maintenance window. The core of this functionality is expected to be included in
Solaris 8 KUP 108528-24 and Solaris 9 KUP 112233-09 when released, and these will be the
new minimum recommended versions when available.



In particular for Sun Fire 280R server, the minimum recommended versions are Solaris 8 KUP
108528-21 (or later) or Solaris 9 KUP 112233-06 (or later).

Future bug fixes and KUP schedules are available internally here:
http://jurassic.eng/shared/ON/patch_docs/data/

Explorer Data Collector


The Explorer data collector should be run regularly, and preferably uploaded to the
“proactive.central” database. Explorer provides a number of files that will need to be analyzed
upon a failure, including but not limited to “prtdiag -v” system configuration,
“/var/adm/messages” files, “showrev -p” patch levels, “diskinfo” firmware levels. Other files
may be useful for other configuration and diagnostic information as necessary.
If Explorer output is not available, then review the complete set of /var/adm/messages files for
any messages that may be related to the failure, such as those from the SCSI driver or the picld
environmental monitoring daemon. Especially in the case of reported disk drive and power supply
failures, these messages are rarely checked and verified before replacement parts are sent out for
a problem that may in fact be caused by seating, incorrect cabling, or a transient application error
rather than bad hardware. See the sections below for more detailed troubleshooting procedures
for disk drive problems.
The latest version of Explorer should always be used, and is downloadable from
http://sunsolve.central/ where there is also a link to the http://proactive.central/ database.
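
Where Explorer is installed in its default location, a fresh collection can be generated and the
system logs pre-screened with something like the following; a minimal sketch only (the install
path and output directory shown are the usual defaults but may differ with the Explorer version
in use):

# /opt/SUNWexplo/bin/explorer
# egrep -i "warning|panic|fatal" /var/adm/messages*

The resulting archive (typically under /opt/SUNWexplo/output) should then be uploaded to the
proactive.central database as described above.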

Sun Install Check (SunIC) tool, also uses Explorer and the eRAS database of checks, and is
available for use with new Sun Fire 280R server installations, and/or verification that older
systems are up to current levels to avoid known issues. The externally available production
SunIC version is updated with new checks every 2 weeks. It no longer requires Explorer to be
installed, but does still use Explorer as part of its operation, and is available for separate
download here:
http://wwws.sun.com/software/installcheck/index.html

SunVTS 5.1 Diagnostic & System Exercising Tool


In cases where the system is failing intermittently, it is useful to use an exercising tool such as
SunVTS to try and bring about the failure more frequently. It is always recommended to use the
latest version of SunVTS 5.1 PSx, to allow taking advantage of new diagnostic tests and
improved existing tests. The current latest recommended version is SunVTS 5.1 PS3, where PS
stands for Patch Set. This is shipped with Solaris 9 HW 08/03 and Solaris 8 HW 07/03, and
supports Solaris 8 from the HW 2/02 release onward, and all Solaris 9 versions. Although “untested” and
therefore “unsupported”, SunVTS 5.1 should also work on all earlier Solaris 8 versions. All future
versions will be 5.1 PS#. It is preferred to take the system out of production and into a lab for
SunVTS stress testing, preferably using “Exclusive Mode”, which runs single tests of particular
devices selected sequentially. Starting with 5.1, SunVTS also has the ability to run in “Online” mode,
which runs tests non-intrusively whenever the device being tested is idle, so as not to affect
production applications. With 5.1 PS2, a scheduler was added to enable running tests at a pre-
defined time and for a pre-defined duration, to help schedule stress testing around the CU's
primary application idle times.
The latest version of SunVTS and test documentation is downloadable internally or externally
here:
http://diagnostics.eng/sunvts/
http://www.sun.com/oem/products/vts/
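
As a starting point, the SunVTS packages can be verified and the tool launched from the system
console; a minimal sketch, assuming the default /opt install location (the -t flag selects the
terminal-based interface on systems without a graphics display, and exact options may differ
between PS releases):

# pkginfo -l SUNWvts SUNWvtsx
# /opt/SUNWvts/bin/sunvts -t

When running in Exclusive Mode, select only the tests for the suspect device class so that a
failure can be tied to a single FRU.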



Hardware Watchdog Mechanisms
The hardware watchdog mechanism is a hardware timer that is continually reset as long as the
operating system is running. If the system hangs, the operating system is no longer able to reset
the timer. The timer then expires and causes an automatic reset, displaying debug information
on the system console. The hardware watchdog mechanism is disabled by default. On the Sun
Fire 280R server, the Solaris operating environment must be configured before the hardware
watchdog mechanism can be enabled. In addition, if the watchdog is not enabled, but the system
is hung, a manual eXternally Initiated Reset (XIR) may be done. For details on doing this
manual XIR, see Section 2 “Responding to System Hang States”.
To enable the watchdog, set watchdog_enable = 1 in /etc/system, then reboot the system (see the
sketch at the end of this section). On newer server platforms, e.g. the Sun Fire V210/V240, this
watchdog mechanism is enabled by default.
The OBP configuration variable error-reset-recovery allows you to control how the
hardware watchdog mechanism behaves when the timer expires. Using the OBP “setenv” or
Solaris “eeprom” command, the variable can be set for the following:
• boot (default) - Resets the timer and attempts to reboot the system
• sync (recommended) - Attempts to automatically generate a core dump file, reset
the timer, and reboot the system. This will become the default setting in future OBP
versions.
• none (equivalent to issuing a manual XIR from RSC) - Drops the server to the ok prompt,
enabling you to issue commands and debug the system
Due to some bugs related specifically to Solaris 8's handling of this mechanism, it is required to
upgrade KUP to 108528-17 or later, and OBP to version 4.5.16 or later to ensure this
functionality works correctly.
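
Putting these pieces together, enabling the watchdog and the recommended recovery behaviour
amounts to the following; a minimal sketch (the echo append assumes watchdog_enable is not
already set in /etc/system):

# echo "set watchdog_enable = 1" >> /etc/system
# eeprom error-reset-recovery=sync
# eeprom | grep error-reset-recovery
# init 6

The eeprom change takes effect at the next reset; the /etc/system change requires the reboot
shown above.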

Console Logging
Console logging is recommended to capture as much information as possible about the system
state if/when important events occur, because in some failure states the system is not able to
log messages anywhere else. Examples of this are troubleshooting POST failures of critical
components, Fatal Reset errors and RED State Exceptions. In these conditions, either Solaris
has not yet started, or the Solaris operating environment terminates abruptly, and although it
sends messages to the system console, the operating environment software does not log any
messages in traditional file system locations like the “/var/adm/messages“ file.

The error logging daemon, syslogd, automatically records various system warnings and errors in
the “/var/adm/messages“ files. By default, many of these system messages are also
displayed on the system console and stored in the “/var/adm/messages“ file. You can direct
where these messages are stored or have them sent to a remote system by setting up system
message logging. For more information, see "How to Customize System Message Logging" in
the System Administration Guide: Advanced Administration, which is part of the Solaris System
Administrator Collection.
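
For sites with a central loghost, one common approach is to also forward important messages off
the system; a minimal sketch (the "loghost" name is an assumption and must resolve on the
system, and the separator between the two fields in /etc/syslog.conf must be a tab, not spaces):

Add to /etc/syslog.conf:
*.err;kern.notice;auth.notice           @loghost

Then make syslogd re-read its configuration:
# pkill -HUP syslogd

This complements, rather than replaces, capturing the system console itself with a console
server or RSC.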



Remote System Controller (RSC)
PTS VSP strongly recommends configuring and using the RSC which is provided with every
system, but disabled and not configured by default. It is useful for getting console output if you
do not have a logging console server, in addition to providing XIR capability for system hangs.
RSC can also provide basic monitoring, alerting of failures and act as POST diag output for
systems remotely. The RSC provides for remote capability and is a lights-out always-on
solution, regardless of the power state of the system. RSC can be configured to be accessible
through any of three built-in mechanisms: the 10BaseT Ethernet network port, the RJ45 serial port
attached to another terminal server or workstation (an RJ45-DB25 adapter is provided in the ship
kit), or the built-in dial-up modem.

RSC provides an event log of RSC-defined events that have been detected, e.g. fan tray failures
or system resets/power state changes, accessible with the “loghistory” command. RSC also logs
the system console when configured (including POST diagnostics if optionally configured), using
four logs, accessible through the “consolehistory” command. For more details on these
commands, see the RSC documentation linked below. The four logs are named “boot”, “run”,
“boot-old” and “run-old”. The “boot” log captures output from reset up to the point where Solaris
starts to boot, at which point logging cycles over to the “run” log; the “-old” logs hold the same
output from the immediately prior boot/run cycle. Each log can be individually viewed with
“consolehistory”, and all four logs should be looked at when troubleshooting errors logged on the
console. In some failure situations, a large stream of data is sent to the system console. Because
RSC log messages are written into a "circular buffer" that holds 64 Kbytes of data, it is possible
that the output identifying the original failing component can be overwritten. If it is possible for the
customer to configure RSC and also connect it to a logging console, that is the ideal situation.
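
From the RSC command-line interface, the event and console logs described above can be
reviewed as follows; a minimal sketch (the log-name arguments accepted by consolehistory can
vary slightly between RSC firmware revisions, so check the RSC documentation below if these
are rejected):

rsc> loghistory
rsc> consolehistory boot
rsc> consolehistory run

Capture this output (for example with the terminal emulator's logging function) before any power
cycle, since the 64-Kbyte circular buffers can wrap and overwrite the original failure data.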

For more information on RSC, refer to the PTS TOI (updated February 2003), available in PDF
or StarOffice here:
http://pts-americas.west/vsp/wgs/products/littleneck/RSC/

The RSC software and documentation are available on the Solaris Supplemental CDROM in the
Solaris 8 or 9 Media Kit, or downloadable from here:
http://www.sun.com/servers/rsc.html



OBP Settings
PTS recommends the following OBP settings for enabling diagnostics by default, and providing
maximum diagnosability output verbosity:

Variable Name            Default Setting       Recommended Setting
auto-boot?               true                  true
diag-device              net                   same as boot-device
diag-level               min                   max
diag-switch?             false                 true
error-reset-recovery     boot                  sync
security-mode            no default            none
test-args                (blank)               verbose,subtests
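
These variables can be set either from the ok prompt with setenv, or from a running Solaris with
the eeprom command; a minimal sketch (the quoting protects the "?" from the shell, and the
diag-device value should be set to match the system's actual boot-device):

# eeprom "diag-switch?=true"
# eeprom diag-level=max
# eeprom diag-device=disk
# eeprom error-reset-recovery=sync
# eeprom test-args=verbose,subtests

The new values take effect at the next reset.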

OpenBoot (OBP) Firmware should also be updated regularly and kept up to the latest available
version. To help facilitate this, and reduce maintenance window downtime, the later versions of
firmware patch include a shell script to update the firmware directly from a normal running
Solaris. The patch also includes the older method that requires booting from the special binary
update file.

Keeping up to date on OBP firmware ensures always having and using the latest POST and
OBDiag diagnostic testing components, as well as correction of diagnosability output, OBP
behaviour and initialization bugs. POST and OBDiag tests are continuously being improved to
identify newly discovered hardware failure modes, as well as to provide better diagnostic error
reporting. For Sun Fire 280R servers, POST and OBDiag versions are typically the same as the
OBP version number; however, a deviation did occur with the OBP 4.5.19 and 4.5.21 versions,
which include POST 4.7.4 to take advantage of the new FPU test to detect the same type of problem
the CPU Diagnostic Monitor will identify, as described in SunAlert 55081. As of August 29, 2004,
the current OBP, POST and OBDiag version is 4.13.0, delivered in patch 111292-17.

A future version of OBP may change these settings to be the defaults. This will significantly
increase boot time, particularly on a system with 8GB of memory, where diag-level=max memory
tests take ~36 minutes with 1.2GHz CPU's, and up to 50 minutes with 750MHz CPU's. If this
increased boot time poses a problem for customers who regularly reboot or otherwise reset the
machine, or who are experiencing software-induced panics, it may be preferred to leave diag-
switch?=false. In these cases, diagnostics may be enabled temporarily for one-time runs using
either the system keyswitch turned to the DIAG position, or the RSC “bootmode diag” command.
This is recommended any time the system is powered on, e.g. after any hardware change or
power outage. If the system then develops what is suspected to be a hardware problem, then
enable diagnostics by setting the variable “diag-switch?=true” after the first failure, to ensure any
subsequent failures report verbose full messages and run through max-level POST.



Core Dump Analysis
In some failure situations, a Sun engineer might need to analyze a system core dump file to
determine the root cause of a system failure. Although the core dump process is enabled by
default, the system can still be configured to suit the customer's system configuration. The
customer may want to change the default core dump directory to another locally mounted
location so they can better manage any system core dumps, or to an alternate location with
more available space to adequately save multiple core dumps. In certain testing and pre-
production environments this is recommended, since core dump files can take up a large amount
of file system space.
See How to Enable the Core Dump Process for instructions on how to calculate the amount of
available swap space.

How to Enable the Core Dump Process:

This is normally a task that would be completed just prior to placing a system into the production
environment.
1. Access the system console. Check that the core dump process is enabled. As root, type the
dumpadm command.

# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/machinename
Savecore enabled: yes

By default, the core dump process is enabled in the Solaris 8 operating environment.
2. Verify that there is sufficient swap space to dump memory. Type the swap -l command.

# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t3d0s0 32,24 16 4097312 4062048
/dev/dsk/c0t1d0s0 32,8 16 4097312 4060576
/dev/dsk/c0t1d0s1 32,9 16 4097312 4065808

To determine how many bytes of swap space are available, multiply the number in the blocks
column by 512 bytes per block. Taking the number of blocks from the first entry, c0t3d0s0,
calculate as follows:
4097312 blocks x 512 bytes/block = 2097823744 bytes.
The result is approximately 2 Gbytes is available to capture core dump files.
3. Verify that there is sufficient file system space for storing the core dump files. Type the
df -k command.

# df -k /var/crash/`uname -n`

By default the location where savecore files are stored is:


/var/crash/`uname -n`



For instance, for the mysystem server, the default directory is:
/var/crash/mysystem
The file system specified must have space for the core dump files.
If you see messages from savecore indicating there is not enough space in the /var/crash/
directory, any other locally mounted (not NFS) file system can be used. Following is a sample
message from savecore.
System dump time: Wed Apr 23 17:03:48 2003

savecore: not enough space in /var/crash/lneck-a (216 MB avail, 246 MB needed)

Perform Steps 4 and 5 if there is not enough space.


4. Type the df -kl command to identify locations with more space.

# df -kl
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c1t0d0s0 832109 552314 221548 72% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
swap 3626264 16 362624 81% /var/run
swap 3626656 408 362624 81% /tmp
/dev/dsk/c1t0d0s7 33912732 9 33573596 1% /export/home

5. Type the dumpadm -s command to specify a location for storing the dump files generated by
savecore. See the dumpadm (1M) man page for more information.

# dumpadm -s /export/home/
Dump content: kernel pages
Dump device: /dev/dsk/c3t5d0s1 (swap)
Savecore directory: /export/home
Savecore enabled: yes

How to Test the Core Dump Setup:

Before placing the system into a production environment, it might be useful to test whether the
core dump setup works. This procedure might take some time depending on the amount of
installed memory.
1. Back up all your data and access the system console.
2. Take the core dump using either of the two following methods:
A) If you have the Dump Device setup with dumpadm to be a dedicated device (i.e. not
swap), you can test the dump on the system live using the savecore -L command. This
takes a snapshot of the live running Solaris system, and saves it to the dump device
configured without actually rebooting or altering the system in any way.
B) If you have the Dump Device setup with dumpadm to be the default swap device, you
need to gracefully shut down the system using the shutdown command. Then, at the ok
prompt, issue the sync command.
You should see "dumping" messages on the system console. During this process, you can see
the savecore messages.



3. Wait for the system to finish dumping or rebooting, depending on the method you used in
step 2.
4. Look for system core dump files in your savecore directory.
The files are named unix.y and vmcore.y, where y is the integer dump number. There should
also be a bounds file that contains the next crash number savecore will use. If a core dump is
not generated, perform the procedure described in How to Enable the Core Dump Process.
Using SCAT for analysis of corefiles:
SCAT stands for Solaris Core Analysis Tool. SCAT was formerly called the "FM" tool.
The SCAT homepage can be found at:
http://openproject.eng.sun.com/projectweb/solariscat/
A list of commands and their descriptions can be found at:
http://openproject.eng.sun.com/projectweb/solariscat/commands.html
The download is available from the homepage. The above information, examples of some
commands, and the install procedure can be found at:
http://pts-americas.west/vsp/wgs/products/littleneck/SCAT.html

SCAT is useful for determining whether the problem can be quickly attributed to hardware, without
having to engage a kernel engineer for a simple hardware issue. An example would be having only
a corefile for data and looking at the msgbuf to find the server was experiencing UE errors, which
would indicate a hardware issue.
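
As an illustration of that workflow, the saved dump can be opened and the kernel message buffer
reviewed; a minimal sketch (the SCAT install path and invocation shown are assumptions, so
check the SCAT documentation linked above; the mdb alternative uses only the bundled Solaris
debugger):

# cd /var/crash/`uname -n`
# /opt/SUNWscat/bin/scat vmcore.0

or, without SCAT installed:

# echo "::msgbuf" | mdb unix.0 vmcore.0

Look in the message buffer output for UE/CE, Fatal Reset or panic strings recorded just before
the dump was taken.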



General Questions
To summarize this section, the appropriate actions above should be taken proactively to prepare
the system for maximum diagnosability in the event a problem arises on that system. The
following questions should be used to start troubleshooting the specifics of the problem.
1. Is the system serial number given valid and correct?
2. Have there been any changes to the system in the past day? Week? Month? 3 months?
6 months? Since the system was installed and placed into service?
3. What was the system doing immediately prior to the problem occurring?
4. What applications are being run on the system?
5. What is the physical system configuration, including “/usr/sbin/prtfru” and
“/usr/platform/`uname -i`/sbin/prtdiag -v” output, PCI cards and externally
attached hardware, and if any volume manager software is being used?
6. Is the system regularly patched to current levels? If not, what are the levels of firmware,
Kernel Update Patch and other required patches on the system? Please update these to the
latest to ensure this is not a known problem that is already fixed.
7. Is RSC configured? If not, recommend it.
8. Is the system configured with diagnostics enabled, hardware watchdog enabled, and
savecore enabled as recommended? If not, recommend it.
9. What is the frequency of the errors? Are they transient, occurring randomly without a pattern,
or persistent, occurring in a repeatable pattern every time a specific event happens, e.g. an
application starting?
10. Please provide Explorer and full console logs prior to, during and immediately following the
problem occurring. Review as many files as possible in the Explorer to create a picture of
the answers to the question above. Check the applicability of SunAlerts, FIN's and FCO's as
needed per the sections below.
11. What actions have been taken to correct the problem to date?
12. Check with the customer and search by serial number for case history of any previous cases
opened for this system and/or problem. This may be a second case report of the same
problem, where the first case correctly recommended performing preparation actions such as
patch and OBP setting updates. This second case would then be expected to contain more
captured output information, allowing troubleshooting to the correct FRU the first time.



Section 2. Hardware Troubleshooting
Sun Fire 280R server to Correct FRU
When troubleshooting, it is important to understand what kind of error has occurred, to
distinguish between real and apparent system hangs, and to respond appropriately to error
conditions so as to preserve valuable information.

Responding to System Error States


Depending on the severity of a system error, a Sun Fire 280R server might or might not respond
to commands you issue to the system. Once you have gathered all available information, you
can begin taking action.
Guidelines to remember:
• Avoid power cycling the system until you have gathered all the information you can. Error
information might be lost when power cycling the system.
• If your system appears to be hung, attempt multiple approaches to get the system to
respond. See “Responding to System Hang States” section below.

Responding to System Hang States


Troubleshooting a hanging system can be a difficult process because the root cause of the hang
might be masked by false error indications from another part of the system e.g. application bugs.
Therefore, it is important that you carefully examine all the information sources available to you
before you attempt any remedy. Also, it is helpful to understand the type of hang the system is
experiencing. This hang state information is especially important to Sun Support Services –
Product Technical Support engineers or other Support Services – Customer Facing engineers
that may be working on this type of problem.
A system soft hang can be characterized by any of the following symptoms:
• Usability or performance of the system gradually decreases.
• New attempts to access the system fail.
• Some parts of the system appear to stop responding.
• You can drop the system into the OpenBoot ok prompt level.
Some soft hangs might dissipate on their own, while others will require that the system be
interrupted to gather information at the OpenBoot prompt level. A soft hang should respond to a
break signal sent via the system console or RSC card. If the break signal does not work,
then the next available option is to turn the keyswitch from one position to another (except OFF).
If a message is reported by picld on the console, acknowledging that the keyswitch position
changed, this indicates that Solaris is still running to some degree. Next, try sending an
eXternally Initiated Reset (XIR) signal, which works to break hangs up to the highest interrupt
level, 15 (one less than a power-cycle). An XIR can be sent on a Sun Fire 280R server either
manually from the RSC card or, if platform drivers patch 109888-17 or later is installed, by
performing a “triple-tap” sequence (tapping the power button 3 times within 1.5 seconds), which
will send an XIR to the system.
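
Where RSC is configured, this escalation path can be driven from its command-line interface; a
minimal sketch (command names are as documented for RSC 2.2, so verify against the RSC
release installed):

rsc> break
rsc> xir

The break command should drop a soft-hung system to the ok prompt, while xir forces the reset
and, with error-reset-recovery=sync set, preserves a core dump for later analysis.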



A system hard hang leaves the system unresponsive to a system break sequence. You will know
that a system is in a hard hang state when you have attempted all the soft hang remedies with
no success.

Fatal Reset Errors and RED State Exceptions

Background Information:
Fatal Reset errors and RED State Exceptions are most often caused by hardware problems. In
some isolated cases, software can cause a Fatal Reset error or RED State Exception. Typically,
these are device driver problems that can be identified easily. Information on known problems
can usually be found in SunSolve Online in known bugs and patches, or by contacting the third-
party driver vendor.
Fatal Reset Errors:
Hardware Fatal Reset errors are the result of an "illegal" hardware state that is detected by the
system. A hardware Fatal Reset error can either be a transient error or a hard error. A transient
error causes intermittent failures. A hard error causes persistent failures that occur in the same
way. The following example shows a sample Fatal Reset error alert from the system console
with OBP variable “diag-switch?=true“:

System Reset: (SPOR) (PLL)


Fatal Error reported by: (Cpu1)
IO-Bridge CE AFSR: 0 IO-Bridge CE AFAR: 0
IO-Bridge UE AFSR: 0 IO-Bridge UE AFAR: 0
JTAG-ID: ffffffffffffffff
JTAG DATA:
0 0 0 0
0 0 0 0
JTAG-ID: ffffffffffffffff
AFSR: a000000000000 AFAR: 0000040004e02000
JTAG DATA:
0 0 0 0
0 0 0 0
JTAG-ID: 1483203b
JTAG DATA:
0 0 0 0
0 0 0 0
JTAG-ID: 15060045
JTAG DATA:
0 0 0 0
0 0 0 0
JTAG-ID: 1142903b
JTAG DATA:
0 0 0 0
0 0 0 0
Probing gptwo at 0,0 SUNW,UltraSPARC-III (750 MHz @ 5:1, 8 MB)
memory-controller
Probing gptwo at 1,0 SUNW,UltraSPARC-III (750 MHz @ 5:1, 8 MB)
memory-controller
Probing gptwo at 8,0

With diag-switch?=false, only one line is reported, which does not provide any diagnostic data.
It is critical to note that the word “reported” means just that. The device(s) listed on this line
are reporting the error, and are not necessarily (or normally) the root cause of the error.
Using the procedures and tools below, the above example decodes to a “PERR System Protocol
Error” with an address in I/O space and an “ISAP system request parity error on incoming
address”, which is an error in the data detected by the victim, CPU1. This example is typical of a
problem associated with a suspect IO-Bridge (Schizo) on the motherboard or a faulty PCI card,
which, as you can see, is not CPU1, the device reporting the Fatal Reset.

Why doesn't the fatal reset output always tell us which hardware component is faulty?
Why does the fatal error decoder often state “No recognizable error condition detected”?

Fatal hardware errors (bus protocol errors and internal errors) are reported in the EMU Error
Status Register (EESR) if the corresponding bit-mask bits are 0 in the EMU Error Mask Register
(EEMR).

For each bit in the EESR, there is a corresponding bit in the EMU Shadow Register (ESR) to
allow designers to gain visibility into the error status of the EMU.
The EMU Shadow Register carries out only two functions:
1. Capturing values from the EESR into scannable flops in the ESR
2. Shifting out the captured values through the scan-out port.

A more detailed description of register functions can be found in the SPARC JPS1
Implementation Supplement p.n. 806-6754

The problem is that we do not do a JTAG scan of the CPUs after OBP 4.5.9 because it can
cause the system to hang and not recover. We are not getting all the data that may be needed to
indicate the failing component. Please see bug id 4635979.

Troubleshooting in this situation means searching for data in other places that may indicate
what has led to the fatal reset. Look at the big picture: are there any indications of errors or
problems in the /var/adm/messages file? Has the server's use or configuration changed recently?
Diagnostics such as SunVTS and POST should be used to see if the fatal condition can be
triggered and then isolated.

RED_State Exceptions:
A RED State Exception (RSE) condition is most commonly a hardware fault that is detected by
the system. A RSE causes a loss of system integrity, which would jeopardize the system if
Solaris software continued to operate. Therefore, Solaris software terminates ungracefully
without logging any details of the RED State Exception error in the /var/adm/messages file,
and all output is only logged to the system console. It is critical to obtain the first RSE output, as
subsequent outputs may show cascading (or looping) RSE errors as a symptom of the original
error. The following example shows a sample RED State Exception error that CPU 0 reported on
the system console. Determining that CPU0 is bad in this example would be done by
interpreting that the “Trap Type” (TT) event is 070 in all 5 “Trap Levels” (TL), which is a Fast
ECC Error, i.e. an L2-cache event local to the reporting CPU module, in this case CPU0, and by
noting that there were no RSE events from CPU1 or memory events in “/var/adm/messages”
occurring prior to the RSE. The OBP command “.traps” can be run at the ok prompt to determine
what trap types are possible on this system; for Sun Fire 280R servers these are listed in Appendix A.
Making the determination from trap types to failing component is highly subjective and relies on
reviewing all RSE outputs from one or both CPU modules, any other data available from other
types of errors, and knowledge of what was occurring on the system at the time of the event or
the system's service history.



RED State Exception

CPU: 0000.0000.0000.0000
TL=0000.0000.0000.0005 TT=0000.0000.0000.0070
TPC=0000.0000.1014.6654 TnPC=0000.0000.1014.6658
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0004 TT=0000.0000.0000.0070
TPC=0000.0000.1014.667c TnPC=0000.0000.1014.6680
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0003 TT=0000.0000.0000.0070
TPC=0000.0000.1014.6654 TnPC=0000.0000.1014.6658
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0002 TT=0000.0000.0000.0070
TPC=0000.0000.1014.64e0 TnPC=0000.0000.1014.64e4
TSTATE=0000.0099.8000.1500
TL=0000.0000.0000.0001 TT=0000.0000.0000.0070
TPC=0000.0000.1007.29a0 TnPC=0000.0000.1007.29a4
TSTATE=0000.0044.8000.1600

Recommended Solution Steps:


The most important pieces of information to gather when diagnosing a Fatal Reset error or RED
State Exception are:
• System console output at the time of the error
• Recent service history of systems that encounter Fatal Reset errors or RED State
Exceptions. Double-check correct system serial number is captured and entered into
Radiance Case to ensure accurate tracking of Sun Service Calls/Radiance Case history
for this specific system.
Capturing system console indications and messages at the time of the error can help you isolate
the true cause of the error. In some cases, the true cause of the original error might be masked
by false error indications from another part of the system. For example, POST results (shown by
the output from the prtdiag command) might indicate failed components, when, in fact, the
"failed" components are not the actual cause of the Fatal Reset error. In most cases, a good
component will actually report the Fatal Reset error.
By analyzing the system console output at the time of the error, you can avoid replacing
components based on these false error indications. In addition, knowing the service history of a
system experiencing transient errors can help you avoid repeatedly replacing "failed"
components that do not fix the problem.
On the Sun Fire 280R server, RSE's and Fatal Reset errors are most commonly caused by a
bad CPU or memory. It is unlikely the motherboard is the cause of an RSE or Fatal Reset;
however, in the Fatal Reset example above, as noted, the decoded error data is normally
associated with a problem IO-Bridge (Schizo) or PCI card. Once the console data is captured, it
can be decoded to provide hints as to the root cause of the problem. In some instances, the
output will not be of the form above, but will instead be of a form similar to:
CPU: 0000.0000.0000.0000 AFSR: 0008.0000.0000.0000 AFAR: 0000.0400.0470.0200

In these cases, this information can be manually decoded using the information in the Memory
section below.
The Fatal Reset and RED State Exception outputs can be decoded with a tool developed by
PTS EMEA engineers. These tools are available here:
http://cpre-emea.uk/cgi-bin/fatal.pl



http://cpre-emea.uk/cgi-bin/redstate.tcl

Once the tools have been used to decode the output messages, they attempt to interpret any
AFSR and AFAR information or Trap Type information present. It is critical to note that, with
regard to Fatal Reset errors, the word “reported” means just that. The device(s) listed on this line
of the Fatal Reset error are reporting the error, and are not necessarily (or normally) the root
cause of the error. Note also that correctly decoding these errors relies on the output provided by
OBP 4.10.1 or later firmware, which should therefore be the minimum recommended firmware
level installed, as detailed in Section 1 above.

To correctly diagnose a Fatal Reset or RSE error, it is necessary to gather as much data as
possible from the system console, and decode these outputs with the tools above and the
information detailed below in the Memory section, as well as using some reference information
contained in InfoDoc 43642, FIN I0954-1, and the Trap Types defined in “SparcV9 Joint
Programming Specification 1: Commonality” and “SparcV9 JPS1: UltraSPARC III Supplement”,
which is copied in Appendix A. Having gathered and decoded all of the data, and considering
the system's complete service history, the most likely FRU causing the error can be determined
accurately.

About Unexpected Reboots


Sometimes, a system might reboot unexpectedly. In that case, ensure that the reboot was not
caused by a panic. For example, L2-cache errors, which occur in user space (not kernel space),
might cause Solaris software to log the L2-cache failure data and reboot the system. The
information logged might be sufficient to troubleshoot and correct the problem. If the reboot was
not caused by a panic, it might be caused by a Fatal Reset error or a RED State Exception that
was only logged to the console.
Also, POST settings can determine the system response to certain error conditions. If POST is
not invoked during the reboot process, or if the system diagnostics level is not enabled and set to
max, you might need to run system diagnostics at a higher level of coverage to determine the
source of the reboot if the system message and system console files do not clearly indicate the
source of the reboot. See Section 1 above for more details on the configuration of recommended
OBP diagnostics settings.
It is extremely rare to have a failure-mode without any messages reported on the console. In
these cases there is most likely a power issue (See the Power/PDB section below).



CPU Module
Panics with “send_mondo_set: timeout”
Background Information:
The “send_mondo_set” panic occurs when one CPU has sent an interrupt request to another
CPU and that other CPU fails to respond within a specified timeout period.

The mondo mechanism is used in SparcV9 architectures to send an interrupt to one or more
processors. In a multiprocessor system, when "CPU A" wants to interrupt "CPU B", CPU A
sends a mondo interrupt to CPU B. CPU A is the initiator and CPU B is supposed to respond to
the mondo dispatched by CPU A. If CPU B does not respond to the request of CPU A, CPU A
keeps retrying for a specified time.

Once this time limit is reached, a "send mondo timeout" panic is initiated by CPU A. As part of
the panic procedure CPU A will attempt to stop all other CPUs, and it will send an interrupt to all
other CPUs to request this. If some CPUs fail to stop as requested, then CPU A will complain
with "failed to stop" messages; hence a send mondo timeout to CPU B is often accompanied by
a "failed to stop CPU B" message.

There is also a known set of specific SparcV9 processor instructions on the UltraSPARC III
family of CPU's that can trigger a send_mondo_set timeout panic by locking up the CPU into a
state where it ignores all incoming mondo interrupt requests, thereby guaranteeing that the
timeout and subsequent panic will occur.

Recommended Solution Steps:


1. Upgrade Kernel Update Patch. Specific kernel routines were developed to detect the
occurrence of the locked up CPU and to restart it online. These are part of Solaris 8 KUP
108528-19 (or later) and Solaris 9 KUP 112233-05 (or later). There is a known issue with
certain Oracle versions that can trigger this condition, which is resolved with these kernel
fixes.
2. Check for the presence of applications using J2SE v1.2.2 that are running with or compiled
with the non-standard/experimental JIT compiler optimization option -Xoptimize. This option
has been proven to trigger “send_mondo_set: timeout” panic's on otherwise good hardware.
See Case 2 in FIN I0765-1 to determine how to check for this problem and the corrective
action as necessary.
3. If a system should experience "send_mondo_set: timeout" panics *after* applying steps 1 &
2, then the CPU should be considered bad and replaced. Note that the CPU that is good is
the one that takes the panic thread; the CPU that is bad is the one indicated by the
message:
panic: failed to stop cpu0 <<< Non responsive CPU 0 >>>
The CPU module should be returned for CPAS Corrective Action, since this is a non-normal
failure mode after applying steps 1 & 2. Returning the module through this process will help
to identify (root-cause) what caused the CPU to be non-responsive, and corrective action can
be established for the problem to ensure no other future CPU will encounter a similar
problem.
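
To confirm which CPU was non-responsive across the available history, the panic strings can be
pulled out of the retained logs; a minimal sketch (the messages files rotate, so include all of them):

# egrep "send_mondo|failed to stop" /var/adm/messages*

The CPU number named in the "failed to stop cpuN" line is the suspect module per step 3 above;
the CPU that logged the panic itself is the reporter, not the fault.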



“Invalid AFSR” on CPUx Messages

Background Information:
The “Invalid AFSR” message occurs when the CPU's AFSR register has been corrupted by
another, earlier error. Since the nature of this message is extremely misleading, it is critical to
have the complete messages files dating back as far as possible, and to ensure all error output is
being appropriately logged to the “/var/adm/messages” file. A specific improvement was added in
Solaris 8 KUP 108528-16 (or later) and Solaris 9 KUP 112233-02 (or later) to improve the
diagnosability output of this message type.

Recommended Solution Steps:


1. Upgrade Kernel Update Patch to Solaris 8 KUP 108528-16 (or later) and Solaris 9 KUP
112233-02 (or later), if it is not already.
2. Evaluate all other memory or cache type error messages, based on InfoDoc 43642. Any
preceding error, or error with the same ErrID number, is the root cause of the problem, and the
“Invalid AFSR on CPUx” message is an artifact. There is likely nothing wrong with the CPU,
unless it is also reporting distinct L2-cache errors unique to that CPU.
3. In addition, look for “Invalid AFSR” messages coming from only one or from both CPU's on a
2-CPU system. Only if there are 2 CPU's in the system, the KUP revision is up to date, only one
CPU is reporting this message, and there is no indication of another error occurring alongside
these messages, should you suspect the reporting CPU itself is bad.
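
A quick way to see which CPU's are reporting this artifact, and what errors preceded it, is to
search the complete logs; a minimal sketch:

# grep "Invalid AFSR" /var/adm/messages*
# grep errID /var/adm/messages*

Match the errID values of any surrounding memory or L2-cache events against InfoDoc 43642
before suspecting a CPU.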

“CPU seeprom format:” Messages

Background Information:
The following messages are normally seen on the system console, whenever the system does a
reset, and indicate that the CPU FRUID SEEPROM has been read by OBP and is being used to
initialize the CPU's:

CPU seeprom format: 0000.0000.0000.0001


or
CPU seeprom format: 0000.0000.0000.0002

The “1” indicates the CPU present is an UltraSPARC III (Cheetah) module. The “2” indicates the
CPU present is an UltraSPARC III Cu (Cheetah+/Cheetah++) module. A message of this type
will be printed for every CPU present in the system.

Recommended Solution Steps:


1. There is nothing to do here. This is a normal message from OBP that is expected on every
reset. It does not indicate any type of problem.



Detecting Bad CPU Writers

Background Information:
ECC errors are usually a result of faulty DIMMs. In some cases ECC errors are generated by
UPA or Safari devices writing bad ECC to memory, or they result from corruption on the datapath.
Serengeti, Starfire and Starcat/kitty systems all have datapath parity error detection, but the
volume systems do not.

Interrupt vector errors (IVC or IVU):

Interrupt vectors are CPU-to-CPU, or CPU-to-IO-Bridge (Schizo), transactions where memory is
not involved. It is not possible to diagnose the fault from these messages, as only the destination
CPU is known; they just indicate the presence of a bad writer or datapath fault.

Please refer to Infodoc 70134 for more information on bad writers.

1. CPU to CPU Interrupt vectors
===============================

It is not possible to diagnose the fault from these messages as only the destination CPU is
known; they are just an indicator of the presence of a bad writer or datapath fault.

SUNW,UltraSPARC-III+: [ID 216810 kern.info] NOTICE: [AFT0] IVC Event
detected by CPU2 at TL=0, errID 0x0000feda.85273f30
AFSR 0x00004002<IVC,CE>.00000184 AFAR 0x00000061.f684f9d0 INVALID
Fault_PC <unknown> Esynd 0x0184 AMBIGUOUS unum not available
SUNW,UltraSPARC-III+: [ID 727630 kern.info] [AFT0] errID
0x0000feda.85273f30 Corrected Unknown Error on unum not available is
Intermittent
SUNW,UltraSPARC-III+: [ID 280347 kern.info] [AFT0] errID
0x0000feda.85273f30 Data Bit 84 was in error and corrected

2. Schizo to CPU Interrupt vectors
==================================

An interrupt vector from a known Schizo to the reporting CPU. Both devices involved in the
transaction are reported in the messages, so it is possible to say there is a datapath fault
between two points.

In this Sun Fire 6800 example:

CPU 15 and AFAR 0x00000424.00912f90 == Schizo 0 on IO board 8

The easiest way to decode the AFAR is to use the online decoder:

http://watch-dog.central/afar/afar.html



SUNW,UltraSPARC-III+: [ID 225070 kern.info] NOTICE: [AFT0] IVC Event
detected by CPU15 at TL=0, errID 0x00064019.f8a2e010
AFSR 0x00004002<IVC,CE>.000000c2 AFAR 0x00000424.00912f90 INVALID
Fault_PC <unknown> Esynd 0x00c2 AMBIGUOUS unum not available
SUNW,UltraSPARC-III+: [ID 633724 kern.info] [AFT0] errID
0x00064019.f8a2e010 Data Bit 110 was in error and corrected

PIO:
Programmed Input/Output (PIO) is a way of moving data between devices in a computer in
which all data must pass through the processor. Because each read/write operation passes
through the CPU, this is slow compared to DMA operations. PIO on Safari-bus systems is
a direct transfer between a CPU and the IO-Bridge (Schizo), where memory is not involved. The
unique aspect of these error events is that the device writing the data is logged.

DMA:
Direct Memory Access (DMA) is a capability that allows data to be sent directly from an attached
device (such as a disk drive or network interface) to the memory on the computer. The
microprocessor initially sets up the operation and is then freed from involvement with the data
transfer, thus speeding up overall computer operation.

Recommended Solution Steps:


1. Upgrade Kernel Update Patch to Solaris 8 KUP 108528-16 (or later) and Solaris 9 KUP
112233-02 (or later), if it is not already.
2. Look in the complete “/var/adm/messages” logs for IVC or IVU events that may indicate
the presence of a bad CPU writer or datapath fault.
3. Look in the complete “/var/adm/messages” logs for PIO write failures which come from the
bad CPU writer. DMA errors may also be present as symptoms from a previously badly
written PIO memory transaction.
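
The same searches can be scripted so nothing in the rotated logs is missed; a minimal sketch:

# egrep "IVC|IVU" /var/adm/messages*
# egrep "PIO write|DVMA" /var/adm/messages*

Compare the Esynd/syndrome values across the hits, as in the examples that follow, to tell a bad
writer from a genuinely failing DIMM.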

Examples:
1. PIO write transaction from a known CPU to the IO-Bridge (Schizo)

Aug 14 13:07:05 test pcisch: [ID 285080 kern.info] NOTICE: correctable
error detected by pci0 (safari id 0) during
Aug 14 13:07:05 test PIO write transaction
Aug 14 13:07:05 test pcisch: [ID 534207 kern.info] mask is 102.
Aug 14 13:07:05 test pcisch: [ID 956438 kern.info] pci bus B
memory access, IO safari command, address 0000000e.7e000030.
Aug 14 13:07:05 test pcisch: [ID 144388 kern.info]
AFSR=80000102.c0800079 AFAR=0000090e.7e000030,
Aug 14 13:07:05 test quad word offset 00000000.00000003, Memory
Module id 8.
Aug 14 13:07:05 test pcisch: [ID 916270 kern.info] syndrome bits 79
Aug 14 13:07:05 test pcisch: [ID 545677 kern.info] mtag 0, mtag
ecc syndrome 0

Note that these messages will be enhanced in a future revision of KUP under bug 4866710.

From this message we can see that Safari ID 0 (CPU0) was talking to pci0 (Schizo) and a
Correctable Error (CE) event occurred. Additional CE events logged may match the same
Esynd 79 to the same DIMM J# location on both groups of memory banks.

2. PIO write transaction from a known CPU to the IO-Bridge (Schizo), with a subsequent
matching DMA event:



Jan 1 03:12:42 fred pcisch: [ID 285080 kern.info] NOTICE: correctable
error detected by pci0 (safari id 0) during
Jan 1 03:12:42 fred PIO write transaction
Jan 1 03:12:42 fred pcisch: [ID 534207 kern.info] mask is 2ff.
Jan 1 03:12:42 fred pcisch: [ID 956438 kern.info] pci bus A registers
access, IO safari command, address 00000000.04602800.
Jan 1 03:12:42 fred pcisch: [ID 144388 kern.info]
AFSR=800002ff.00800112 AFAR=00000a00.04602800,
Jan 1 03:12:42 fred quad word offset 00000000.00000000, Memory
Module id 8.
Jan 1 03:12:42 fred pcisch: [ID 916270 kern.info] syndrome bits 112
Jan 1 03:12:42 fred pcisch: [ID 545677 kern.info] mtag 0, mtag ecc
syndrome 0
Jan 1 03:12:42 fred pcisch: [ID 285080 kern.info] NOTICE: correctable
error detected by pci0 (safari id 0) during
Jan 1 03:12:42 fred PIO write transaction
Jan 1 03:12:42 fred pcisch: [ID 534207 kern.info] mask is 10f.
Jan 1 03:12:42 fred pcisch: [ID 956438 kern.info] pci bus B memory
access, IO safari command, address 0000000e.00100010.
Jan 1 03:12:42 fred pcisch: [ID 144388 kern.info]
AFSR=9000010f.40800112 AFAR=0000090e.00100010,
Jan 1 03:12:42 fred quad word offset 00000000.00000001, Memory
Module id 8.
Jan 1 03:12:42 fred pcisch: [ID 916270 kern.info] syndrome bits 112
Jan 1 03:12:42 fred pcisch: [ID 545677 kern.info] mtag 0, mtag ecc
syndrome 0
Jan 1 03:12:42 fred pcisch: [ID 308334 kern.info] secondary error
from PIO write transaction
Jan 1 03:12:43 fred pcisch: [ID 285080 kern.info] NOTICE: correctable
error detected by pci0 (safari id 8) during
Jan 1 03:12:43 fred DVMA read transaction
Jan 1 03:12:43 fred pcisch: [ID 475334 kern.info] Transaction was a
block operation.
Jan 1 03:12:43 fred pcisch: [ID 956438 kern.info] dvma access, Memory
safari command, address 00000000.3801f440, owned_in asserted.
Jan 1 03:12:43 fred pcisch: [ID 144388 kern.info]
AFSR=48000000.08400112 AFAR=00000000.3801f440,
Jan 1 03:12:43 fred quad word offset 00000000.00000000, Memory
Module J0100 id 8.
Jan 1 03:12:43 fred pcisch: [ID 916270 kern.info] syndrome bits 112
Jan 1 03:12:43 fred pcisch: [ID 545677 kern.info] mtag 0, mtag ecc
syndrome 0
Jan 1 03:12:43 fred pcisch: [ID 308334 kern.info] secondary error
from DVMA read transaction

The first two logged events occurred during PIO write operations. The CPU read the data in and
checked the ECC, yet when the ECC was checked again by the pcisch driver a correctable error
was detected, and the driver logged safari id 0 (CPU0) as the device that sent the data.
The third error was logged during a DVMA read transaction from memory to the safari id 8
device (Schizo). A CE occurred and implicated a memory DIMM (J0100). The Esynd
bits 112 are the same as those reported in the writer errors above, which is when this memory
was corrupted. The reported DVMA error is therefore a symptom of the bad CPU0 that wrote bad
data into memory. Two different PCI busses (A 66MHz & B 33MHz) are implicated, as well as the
memory modules. Not recognizing the meaning of the original PIO write error messages could
result in the wrong parts being replaced; in this example a memory DIMM might have been
replaced when the CPU was the truly bad part.
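
When this pattern is suspected, it can help to pull all of the related pcisch notices out of the
message logs and compare the safari ids and syndrome bits side by side. A minimal sketch only;
the file paths are the Solaris defaults and the search strings match the messages shown above:

# Collect the correctable-error notices plus the transaction-type and syndrome lines
egrep 'correctable error detected by pci0|PIO write transaction|DVMA read transaction|syndrome bits' /var/adm/messages*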

3. PIO read transaction - A read from the IO-Bridge (Schizo) to a known CPU



In this example the CPU detects the error and reports the event. The reported AFAR being
accessed during the transaction decodes to the IO-Bridge as the bad writing device. See the
Memory section below for more information on decoding AFAR's on Sun Fire 280R server.

SUNW,UltraSPARC-III: [ID 440028 kern.info] NOTICE: [AFT0] Corrected
system bus (CE) Event detected by CPU0 at TL=0, errID
0x00002489.d1b13260
AFSR 0x00000002<CE>.0000002c AFAR 0x00000400.04701590
Fault_PC 0x780c0024 Esynd 0x002c
SUNW,UltraSPARC-III: [ID 526731 kern.info] [AFT0] errID
0x00002489.d1b13260 Data Bit 7 was in error and corrected



Memory DIMMs

General Memory Configuration Rules & Guidelines

Nomenclature:
• Each Physical Group of 4 DIMM's contains 2 Logical Banks, each logical bank contributes ½
of the total memory provided by that Group of DIMM's, since all NGDIMM's are double-sided.
• Each side of each DIMM contributes 1/4th of the memory to each logical bank.
• Physical Group 0 contains Logical Banks 0 & 2 on DIMM's J0100, J0202, J0304, J0406
• Physical Group 1 contains Logical Banks 1 & 3 on DIMM's J0101, J0203, J0305, J0407

Do....
Do Install DIMM's in groups of four at a time within the same group.
Do Install at least 4 DIMM's in either GROUP 0 or GROUP 1 for minimum support.
Do Install same size DIMM's in same group for automatic 2-way memory interleaving between
the 2 logical banks in the group.
Do Install same size DIMM's in both groups for automatic 4-way memory interleaving between
the 4 logical banks in both groups.
Do Install the latest Kernel Update Patch (KUP) to ensure correct reporting of memory DIMM
errors

Don't....
Don't Mix any DIMM capacities within the same group, as not all of the memory on the larger
DIMM's would be addressable:
- Larger DIMM's in group will take on identity of smallest DIMM capacity.
- Ability of automatic 2-way memory interleaving will be DISABLED.
Don't Mix third-party DIMM's and Sun supported DIMM's in the same group. In fact, third-party
DIMM's of any size are NOT supported and may be the root cause of the problem. They should
be completely removed until troubleshooting is complete and the problem is resolved.

Notes:
• Although DIMM capacities can differ between GROUP 0 or GROUP 1, automatic 4-way
memory interleaving will be DISABLED.
• The entire memory subsystem is addressable via the CPU0 memory controller, which is only
accessible when CPU0 is installed.
• Special note on third-party 2GB NG-DIMM's – engineering has purchased and tested these
DIMM's from the third-party manufacturers and proven that they cause a variety of signal
integrity, thermal and power issues. Sun will not ship a 2GB NGDIMM on Sun Fire
280R server's due to lack of an approved and qualified vendor that can manufacture such a
NGDIMM to meet Sun specifications and operate within system cooling and power
requirements.

Configuration Reporting Examples:


“/usr/platform/`uname -i`/sbin/prtdiag” reports the memory configuration showing
all logical banks, interleaving, physical DIMM size and total memory in the system.

1. The following excerpt of “prtdiag” output shows a system with only Physical Group 0
populated with 4 x 256MB DIMM's, which gives a total memory size of 1024MB, or 1GB. One
row is listed for each logical bank present in the system. The 2nd column shows that CPU0 is
the memory controller. This is true for both physical groups in the Sun Fire 280R server
architecture. The 3rd and 4th columns identify the logical bank numbers present and the size
of each logical bank. Since each logical bank is ½ of the total provided by that physical
group, each logical bank here is 512MB, built from 4 DIMM's with each DIMM side
contributing 1/4th, i.e. 128MB per DIMM side goes into each 512MB logical bank.
The 5th column reports the status of the logical bank, but will never change on Sun Fire 280R
server since ASR is not supported. The 6th column gives the total size of each
DIMM making up the logical banks; in this case all are 256MB DIMM's. The interleave
factor is also reported, and is 2-way since only 1 physical group is populated with DIMM's.
The 7th column shows the CPU number that this memory controller is interleaving with. This
will be the same as MC ID if no interleaving is occurring between CPU's. Since the Sun Fire
280R server only has 1 memory controller active, CPU0, this will always show 0.
...
Memory size: 1024 Megabytes
...
===================== Memory Configuration ============================
Logical Logical Logical
MC Bank Bank Bank DIMM Interleave Interleaved
Brd ID num size Status Size Factor with
--- --- ---- ------ ----------- ------ ---------- -----------
CA 0 0 512MB no_status 256MB 2-way 0
CA 0 2 512MB no_status 256MB 2-way 0

2. The following excerpt of “prtdiag” output shows a system with both Physical Groups
populated, each with 4 x 1GB DIMM's which gives a total memory size of 8192MB, or 8GB.
This shows all 4 logical banks each of 2GB size (½ of the physical group's total memory), 0 &
2 contributed by Group 0 DIMM's, and 1 & 3 contributed by Group 1 DIMM's. Since both
physical groups are populated with the same sized DIMM's, we are able to do maximum 4-
way interleaving.
...
Memory size: 8192 Megabytes
...
====================== Memory Configuration============================
Logical Logical Logical
MC Bank Bank Bank DIMM Interleave Interleaved
Brd ID num size Status Size Factor with
--- --- ---- ------ ----------- ------ ---------- -----------
CA 0 0 2048MB no_status 1024MB 4-way 0
CA 0 1 2048MB no_status 1024MB 4-way 0
CA 0 2 2048MB no_status 1024MB 4-way 0
CA 0 3 2048MB no_status 1024MB 4-way 0

Possible Memory Configurations:


The Sun Fire 280R server supports NGDIMM's (or LC/CR1 DIMM's) of 128MB, 256MB, 512MB
or 1GB, for totals from 512MB to 8GB. The following are the typical standard configurations
shipped. It is possible to make other totals by filling the two groups with different DIMM
sizes, but this limits interleaving to 2-way, as noted above.

Total     DIMM's        Locations
512MB     4 x 128MB     Either Group 0 or 1
1GB       4 x 256MB     Either Group 0 or 1
1GB       8 x 128MB     Both Group 0 and 1
2GB       4 x 512MB     Either Group 0 or 1
2GB       8 x 256MB     Both Group 0 and 1
4GB       4 x 1GB       Either Group 0 or 1
4GB       8 x 512MB     Both Group 0 and 1
8GB       8 x 1GB       Both Group 0 and 1

Patches

It is strongly advised to install the latest recommended Kernel Update Patches, as the latest
versions have improvements in memory error message reporting and will aid in diagnosing
memory problems. See the “Patches” information in Section 1 above, for necessary patches to
be configured prior to diagnosing memory errors.
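
A quick way to confirm which KUP revision is installed (a minimal check only – the exact
revision required should be taken from the “Patches” section referenced above; 108528 is the
Solaris 8 KUP and 112233 the Solaris 9 KUP):

# List the installed Kernel Update Patch revision, if any
showrev -p | egrep '108528|112233'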

Memory Replacement Guidelines


INTERMITTENT:
Check for the reporting of parity or ECC errors, otherwise ignore. Replace DIMM if 3 or
more correctable memory events occur within 24-hour period on same DIMM.
ribera pcisch: [ID 419818 kern.warning] WARNING: correctable error from
pci0 (safari id 8) during DVMA read transaction
ribera pcisch: [ID 307114 kern.info] Transaction was a block operation.
ribera pcisch: [ID 971798 kern.info] dvma access, Memory safari
command, address 00000000.7600bc00, owned_in not asserted.
ribera pcisch: [ID 635908 kern.info] AFSR=40000000.0800003e
AFAR=00000000.7600bc00,
ribera quad word offset 00000000.00000000, Memory Module J0304 id 8.

PERSISTENT:
Replace DIMM if 3 or more correctable memory events occur within 24-hour period on
same DIMM.
Jul 28 15:39:33 k1test unix: [ID 356634 kern.notice] 141 Intermittent,
167 Persistent, and 0 Sticky Softerrors accumulated
Jul 28 15:39:33 k1test unix: [ID 340762 kern.notice] from Memory Module
on J0100, Memory controller 0
Jul 28 15:39:36 k1test unix: [ID 596940 kern.warning] WARNING: [AFT0]
10 soft errors in less than 24:00 (hh:mm) detected from Memory Module
J0100

STICKY:
Replace DIMM on the first occurrence.
firefly unix: [ID 356634 kern.notice] 0 Intermittent, 0 Persistent, and
256 Sticky Softerrors accumulated
firefly unix: [ID 340762 kern.notice] from Memory Module on J0100,
Memory controller 0

Handling of Correctable Errors


When a processor detects a CE as a result of a read to main memory, it will correct the incoming
data and continue its operation. The error will be logged in the processor's asynchronous fault
status register (AFSR) and the faulting physical address will be logged in the asynchronous fault
address register (AFAR). The processor will then take a trap so that the error information can be
logged. The Solaris kernel takes care of this logging, and will record the event information in
“/var/adm/messages” by default, depending on the KUP version and kernel variable settings, as
noted in Section 1 above.



One such event log, taken from a Sun Fire 280R server system running Solaris 8, appears
below:

Jul 28 15:39:36 k1test unix: [ID 596940 kern.warning] WARNING: [AFT0]
10 soft errors in less than 24:00 (hh:mm) detected from Memory Module
J0100
Jul 28 15:39:42 k1test SUNW,UltraSPARC-III+: [ID 536831 kern.info]
NOTICE: [AFT0] Corrected system bus (CE) Event detected by CPU0 at
TL=0, errID 0x0000020a.c65021c0
Jul 28 15:39:42 k1test AFSR 0x00000002<CE>.0000010c AFAR
0x00000000.7f2e08a0
Jul 28 15:39:42 k1test Fault_PC <unknown> Esynd 0x010c J0100
Jul 28 15:39:42 k1test SUNW,UltraSPARC-III+: [ID 327732 kern.info]
[AFT0] errID 0x0000020a.c65021c0 Corrected Memory Error on J0100 is
Intermittent
Jul 28 15:39:42 k1test SUNW,UltraSPARC-III+: [ID 941182 kern.info]
[AFT0] errID 0x0000020a.c65021c0 Data Bit 101 was in error and
corrected
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 800354 kern.info]
NOTICE: [AFT0] First Error Corrected system bus (CE) Event detected by
CPU0 at TL=0, errID 0x0000020e.175c6738
Jul 28 15:39:56 k1test AFSR 0x00000002<CE>.0000010c AFAR
0x00000000.1670e9b0
Jul 28 15:39:56 k1test Fault_PC 0x1000721c Esynd 0x010c J0100
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 465930 kern.info]
[AFT0] errID 0x0000020e.175c6738 Corrected Memory Error on J0100 is
Intermittent
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 190338 kern.info]
[AFT0] errID 0x0000020e.175c6738 Data Bit 101 was in error and
corrected
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 240978 kern.info]
[AFT2] errID 0x0000020e.175c6738 PA=0x00000000.1670e980
Jul 28 15:39:56 k1test E$tag 0x00000000.59092492 E$state_6
Exclusive
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x00) 0x705f6d61.736b5f74 0x6f5f696e.64657800 ECC 0x1a9
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x10) 0x7274735f.64617461 0x5f6d7367.5f73697a ECC 0x0e0
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x20) 0x65006e64.5f6c6f61 0x64006d6c.645f7469 ECC 0x176
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 895151 kern.info]
[AFT2] E$Data (0x30) 0x6d657273.5f617265 0x5f72756e.6e696e67 ECC 0x082
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 929717 kern.info]
[AFT2] D$ data not available
Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 335345 kern.info]
[AFT2] I$ data not available

It is important to recognize that the first two errors in the above output are the result of one single
CE event, as evidenced by the identical errID value. The third error is a subsequent error of the
same type. Each of the messages is tagged with an asynchronous fault tag (AFT) to identify the
data being logged. Continuation messages begin with four spaces. The different AFT tag values
are: AFT0 for correctable errors; AFT1 for uncorrectable errors as well as for errors that result in
a panic; AFT2 and AFT3 are used for logging diagnostic data and other error-related messaging.
The extracts below were taken from the previous example:

Jul 28 15:39:42 k1test SUNW,UltraSPARC-III+: [ID 327732 kern.info]
[AFT0] errID 0x0000020a.c65021c0 Corrected Memory Error on J0100 is
Intermittent



Jul 28 15:39:56 k1test SUNW,UltraSPARC-III+: [ID 240978 kern.info]
[AFT2] errID 0x0000020e.175c6738 PA=0x00000000.1670e980
Jul 28 15:39:56 k1test E$tag 0x00000000.59092492 E$state_6 Exclusive

– errID is a timestamp of the event. This is very useful for correlating multiple errors that
occurred at the same time (a grep example is shown after this list)

– AFSR and AFAR are the asynchronous fault status and address registers.
On UltraSPARC III (750MHz) CPU's, there is only one AFSR and AFAR recording the most
recent event. On UltraSPARC III Cu (900MHz or faster) CPU's, there are 2 AFSR and
AFAR's recorded. The primary is denoted AFSR/AFAR and records the most recent event.
The secondary is denoted AFSR2/AFAR2 and records the first error event logged. This CPU
enhancement is useful for troubleshooting by identifying the source of the first error.

– Fault_PC is the value of the program counter (PC) at the time of the fault; whether the value
is valid depends on the fault type. See below for more information on decoding these
registers.

– Esynd is the ECC syndrome captured and can be used to determine the DIMM within the
Bank in the event of a single-bit correctable error (CE).

– J #### is the identifier of the memory module which corresponds to the faulting address on
the Sun Fire 280R server, in the event of a single-bit correctable error (CE) similar to this one.
In the event of a multi-bit uncorrectable error (UE), the DIMM cannot be identified distinctly,
so the Group is reported as J#### J#### J#### J#### where the DIMM slots for either Group
0 or Group 1 are listed.

– The Solaris software error handling code provides a disposition code as one of Intermittent,
Persistent, or Sticky. The definition of each of these codes is:

– Intermittent means the error was not detected on a reread of the affected memory
location. This can occur for many reasons and should not normally be acted upon.

– Persistent means the error was detected again on a reread of the affected memory
location but the scrub operation corrected it. This is indicative of a potentially failing DIMM
and should 3 Persistent errors occur within 24 hours, the DIMM should be replaced. In
addition, soft errors caused by transient random events such as cosmic rays would also
appear as Persistent. However, since these events are typically random in nature, they are
unlikely to repeat at the same AFAR address in multiple events, so they are easily
separated from true persistent errors. These random events are part of the reason for the
3 events on the same DIMM in 24 hours rule.

– Sticky means the error is likely a hard fault of a failing DRAM device and the DIMM should
be replaced as soon as possible.
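
Two quick log searches that support the rules above (a sketch only – the errID value is the one
from the example earlier in this section, and the message wording matches the KUP revisions
shown in this guide):

# Gather every line belonging to a single event via its errID
grep '0x0000020e.175c6738' /var/adm/messages*

# Count corrected memory errors per DIMM slot, to apply the 3-in-24-hours rule
grep 'Corrected Memory Error on J' /var/adm/messages* | \
    nawk '{ for (i = 1; i <= NF; i++) if ($i ~ /^J[0-9]/) print $i }' | sort | uniq -c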

The above examples show memory errors that occurred while a CPU was reading/writing to
memory. Similar errors may be reported while the IO-Bridge is reading/writing to
memory, and these typically take the form “NOTICE: correctable error detected by pci0 (safari
id 8) during DVMA read transaction” and “NOTICE: correctable error detected by pci0 (safari id
1) during PIO write transaction”. See also the section above on “Detecting Bad CPU Writers” for
example outputs of these transactions.



Special Syndromes
Be on the lookout for UE errors caused by other events. These are
recognizable by an Esynd containing one of the three special ECC syndromes 0x003, 0x071 and
0x11c. These have a special meaning that needs to be identified and interpreted differently when
troubleshooting memory and L2 cache events.

The ECC special syndrome is a flag used to indicate the data was corrupted by a previous
transaction, likely a CPU module cache event, and not by the memory itself. Note that a
message is additionally printed with the special syndrome event to indicate exactly this: “Two
Bits in error, likely from E$”. The 3 special syndromes are generated when the CPU accessing
memory recognizes the other Safari Bus event and “poisons” (flips) 2 specific bits. To determine
the correct bad part, it is critical to look back through the full /var/adm/messages logs in
search of additional events which do not have an Esynd with a special syndrome but are related
to, and the cause of, the special syndrome. It is these additional non-special syndrome events
that may pinpoint which CPU module likely caused this bad data to be in memory initially.
Note that the msgbuf contained in any core file generated by the panic usually does not contain
sufficient log history to show the prior event that enables diagnosis to the CPU module. Also
note that the associated events may be logged before or after the special syndrome event, and
should be related by their errID.

The events to look for associated with each special syndrome event occurring are:
0x003 (ECC Check bits 0 & 1 flipped) - EDU event
0x071 (Data bits 126 & 127 flipped) - CPU or WDU event
0x11c (Data bits 0 & 1 flipped) - BERR event
See InfoDoc 43642 for detailed information on the meaning of these event types.
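
To locate special-syndrome events in the full logs (and then search around them for the related
non-special events), a simple search such as the following can be used; the Esynd formatting
matches the kernel messages shown in this guide:

# Flag error lines carrying one of the three special ECC syndromes
egrep 'Esynd 0x0003|Esynd 0x0071|Esynd 0x011c' /var/adm/messages*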

In the following example, the UE event with Esynd 0x071 on CPU0 may be mis-interpreted as a
bad DIMM in memory Group 0, whereas careful examination of preceding events shows WDU
and UCU events on CPU1 with non-special syndromes. Note also some later INVALID AFSR
events on subsequent UE errors seen by CPU1. Therefore the bad hardware in this example is
CPU1 module.

WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU0 Privileged
Data Access at TL=0, errID 0x0001a2f1.fa7ce508
AFSR 0x00100004<PRIV,UE>.00000071 AFAR 0x00000000.1e684c50
Fault_PC 0x10073c24 Esynd 0x0071 J0100 J0202 J0304 J0406
[AFT1] errID 0x0001a2f1.fa7ce508 Two Bits in error, likely from E$
WDU/CPU
WARNING: [AFT1] UCU Event on CPU1 in Privileged mode at TL=0, errID
0x0001a2f1.faeb8238
AFSR 0x00300624<ME,PRIV,UCC,UCU,WDU,UE>.000000e3 AFAR
0x00000000.ff9440d0 AMBIGUOUS
Fault_PC 0x10141c80 Esynd 0x00e3 AMBIGUOUS
[AFT1] errID 0x0001a2f1.faeb8238 Three Bits were in error
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU1 in
Privileged mode at TL=0, errID 0x0001a2f1.faeb8238
AFSR 0x00300624<ME,PRIV,UCC,UCU,WDU,UE>.000000e3 AFAR
0x00000000.ff9440d0 INVALID
Fault_PC 0x10141c80 Esynd 0x00e3 AMBIGUOUS
[AFT1] errID 0x0001a2f1.faeb8238 Three Bits were in error
WARNING: [AFT1] WDU Event on CPU1 at TL=0, errID 0x0001a2f1.faeb8238
AFSR 0x00300624<ME,PRIV,UCC,UCU,WDU,UE>.000000e3 AFAR
0x00000000.ff9440d0 INVALID
Fault_PC 0x10141c80 Esynd 0x00e3 AMBIGUOUS
[AFT1] errID 0x0001a2f1.faeb8238 Three Bits were in error
WARNING: [AFT1] Uncorrectable system bus (UE) Event on CPU1 in
Privileged mode at TL=0, errID 0x0001a2f1.faeb8238
AFSR 0x00300064<ME,PRIV,WDC,WDU,UE>.00000071 AFAR
0x00000000.12d36e50 AMBIGUOUS
Fault_PC 0x10141c80 Esynd 0x0071 AMBIGUOUS
[AFT1] errID 0x0001a2f1.faeb8238 Two Bits in error, likely from E$
WDU/CPU
WARNING: [AFT1] WDU Event on CPU1 at TL=0, errID 0x0001a2f1.faeb8238
AFSR 0x00300064<ME,PRIV,WDC,WDU,UE>.00000071 AFAR
0x00000000.12d36e50 AMBIGUOUS
Fault_PC 0x10141c80 Esynd 0x0071 AMBIGUOUS
[AFT1] errID 0x0001a2f1.faeb8238 Two Bits were in error
NOTICE: Scheduling clearing of error on page 0x00000000.12d36000
WARNING: [AFT1] WDC Event on CPU1 at TL=0, errID 0x0001a2f1.faeb8238
AFSR 0x00300064<ME,PRIV,WDC,WDU,UE>.00000071 AFAR
0x00000000.12d36e50 INVALID
Fault_PC 0x10141c80 Esynd 0x0071 INVALID
WARNING: [AFT1] Orphaned UCU Event on CPU1 Privileged Data Access at
TL=0, errID 0x0001a2f1.faf16478
AFSR 0x00100204<PRIV,UCU,UE>.00000071 AFAR 0x00000000.1e684c50
Fault_PC 0x10044478 Esynd 0x0071 AMBIGUOUS J0100 J0202 J0304 J0406
[AFT1] errID 0x0001a2f1.faf16478 Two Bits were in error

Solaris Memory Scrubber

All Sun Fire 280R systems require Solaris 8 or later, and therefore include the memory scrubber
tuned to the current best practice. The purpose of the scrubber is to read all of physical memory
within 12 hours and detect correctable errors that might otherwise turn into transient
uncorrectable errors. The read is done in 8MB pages under kernel protection, so any
uncorrectable errors that occur during the operation will not cause a panic. The messages
produced for failing bits the scrubber identifies are different from those reported above, so if the
scrubber reports correctable errors repeating every 12 hours, there is likely a hard error on a
DIMM that needs replacing.

What If the DIMM replacement does not fix the error?


There are several things to investigate in the event the first DIMM replacement does not correct
the problem. In these cases, it is essential to run diagnostics and validation tools such as SunVTS,
and even to do some trial-and-error FRU movement, to try to stress the system into accelerating
the failure and so isolate the failing FRU faster.

1. Move or switch the DIMM's to the opposite bank, and if the problem persists on the same
DIMM slot, then this may be a poor solder joint or other manufacturing defect that is affecting
CPU0 (the memory management unit on Sun Fire 280R server) and its address lines. If the
problem follows the DIMM to the other bank, this might indicate a possible DOA DIMM or
another DIMM is actually causing noise on the bus lines and masking itself as the problem.
These are more difficult to determine and may need a lot of trial and error to identify the truly
bad DIMM.
2. If only CPU0 or only CPU1 is reporting this, it is possibly a problem with a single bit on that
CPU module. This may be traced to a poor solder joint or other manufacturing defect that is
affecting just a single bit unique to that CPU module's connector. These are more difficult to
isolate on a single CPU system.
3. If both CPU's in a 2 CPU system are reporting this then it is possible there is a problem in the
datapath between the CPU's and the memory DIMM. This might indicate a possible DOA
DIMM or another DIMM is actually causing noise on the bus lines and masking itself as the
problem. It is possible for it also to be a problem with the motherboard Safari Bus ASIC's,
though this is unlikely.



4. If both CPU's are reporting errors on 2 DIMM's which are the same numbered DIMM's in each
bank, then it is less likely to be a problem with 2 DIMM's, or the motherboard. Two types of
problem may show this type of behavior. Check to see if there is any evidence of a bad
CPU0 or CPU1 writer, using the “Detecting Bad CPU Writers” section above. If there is no
evidence of this, then there is likely a problem with the memory controller CPU0. This may be
traced to a poor solder joint or other manufacturing defect that is affecting just a single
address bit on CPU0 module's connector.
5. The presence of PIO write errors during disk or other I/O (e.g. SunVTS disktest) may be an
indication that a CPU is writing bad ECC into the memory, masking that the CPU is bad. See
the “Detecting Bad CPU Writers” section above for more information.
6. If the CPU module is thought to be the cause, then it should be very thoroughly inspected for
bent pins on the MB slot and CPU connector as it may be an inherent problem that would
affect subsequent FRU replacements. See the Motherboard section below for more details.

OBP Firmware does not complete initialization with “Data
Access Error”, “Corrected ECC Error” (memory), “Fast ECC
Error” (L2 cache), “Fast Data Access MMU Miss”

Background Information:
This message indicates either that OBP memory has been trashed and OBP is unable to access its
own instructions and data, or that an operation it is performing to initialize a memory or L2 cache
device has failed. This is most commonly seen after a break, an XIR, or a Solaris crash of some
kind, when commands run normally at the ok prompt fail in this manner because memory was
trashed by the prior crash. In those cases, the message should be ignored and OBP reset with the
“reset-all” command. When these errors occur during system initialization following a reboot,
prior to getting to the ok prompt, there is likely a hardware problem. It is possible the problem
was also detected by POST diagnostics before OBP used the bad hardware, but since the Sun Fire
280R server does not support Automatic System Recovery (ASR), there is no way to offline the
bad hardware and prevent OBP from using it before completing its initialization, where it would
report the results of POST and fail to boot.

Recommended Solution Steps:


1. The following commands should be run at ok prompt (for example after the following error):

<...prior OBP initialization output...>
Probing /pci@8,700000 Device 5 network usb Corrected ECC Error
i. {0} ok .cpu-afsr
E_SYND:1cc M_SYND:0 CE:1 UE:0 EDU: 0 EDC: 0 WDU: 0 WDC: 0 CPU: 0
CPC: 0
UCU: 0 UCC: 0 BERR: 0 TO: 0 IVU: 0 IVC: 0 EMU: 0 EMC: 0 ISAP: 0
IERR: 0 PERR: 0
PRIV: 0 ME: 0
This command provides the AFSR pre-decoded into the Esynd of the event, and the set
bits that correspond to the various cache, memory and bus events that could have caused
the “Corrected ECC Error”.

ii. {0} ok 0 4d spacex@ .vpt
0000.0001.ffe1.1f40
This command provides the AFAR of the access that set the error bits in the AFSR.

iii. The following commands provide the current values of the CPU registers and
information on what code most recently ran, which could be used in engineering debug if
there is an OBP, Solaris or Application bug causing the problem.
{0} ok .registers
Normal Alternate MMU Vector
0: 0 0 0 0
1: 1039ebc fff646c0 f00436dc 222
2: 0 f0000000 0 c
3: 1 4 0 14003f8
4: 0 4 3c0 1400000
5: 0 fa000000000f fff68000 1000
6: 0 1007538 800000003ff5c0b6 14007e8
7: 2a100045d40 4048 2 60
%PC f0046d34 %nPC f0046d38
%TBA 1000000 %CCR 0 XCC:nzvc ICC:nzvc
{0} ok .pstate
AG:0 IE:1 PRIV:1 AM:0 PEF:1 RED:0 MM:0 TLE:0 CLE:0 MG:0 IG:0
{0} ok .errors
{0} ok .trap-registers
%TL:1 %TT:60 %TPC:100416e8 %TnPC:100416ec
%TSTATE:4400001603 %CWP:3
%PSTATE:16 AG:0 IE:1 PRIV:1 AM:0 PEF:1 RED:0 MM:0 TLE:0 CLE:0
MG:0 IG:0
%ASI:0 %CCR:44 XCC:nZvc ICC:nZvc
<... output truncated for documentation example ...>
{0} ok ctrace
PC: 100416e8
Last leaf: call 10031f70 from 10041768
0 w %o0-%o7: (b b 0 1041b2f8 2a10001fd20 0 2a10001f1b1
10041768 )
call 100416d4 from 100406b0
1 w %o0-%o7: (1041b2f8 0 1041c290 10423a00 0 0 2a10001f261
100406b0 )
jmpl 1004060c from 100295bc
2 w %o0-%o7: (0 0 0 1041b2f8 3000006b188 0 2a10001f311
100295bc )
{0} ok 0 w begin .locals %o7 .adr cr (+w) key? or until
<... output truncated for documentation example ...>
{0} ok
iv. {0} ok 1 switch-cpu
This switches OBP from running on the current CPU indicated in the { } brackets, to the
other CPU identified in the command. If the initially running CPU was {1}, use the
“0 switch-cpu” command instead. If there are no { } brackets before the ok prompt, then
the system has only 1 CPU.

v. Repeat steps i, ii and iii to gather the same data for the second CPU.

2. Use the “Interpreting AFSR & AFAR outputs” section below, and the manual ECC decode
procedure in Appendix B (OBP does not do this decoding automatically), to determine which
FRU (DIMM Slot or CPU) is the most likely cause of the problem.



Interpreting AFSR & AFAR outputs
Background Information:
The Asynchronous Fault Status Register (AFSR) and the Asynchronous Fault Address Register
(AFAR) provide clear information as to the error encountered by a CPU or IO-Bridge (schizo)
during a normal transaction of data and instructions to/from memory or cache. As there have
been numerous bugs related to incorrect reporting and interpreting of AFSR/AFAR combinations
in error messages by the kernel, it is very important to ensure the latest Kernel Update Patch
(KUP) is installed for Solaris. See the above section for more details.

Decoding AFSR's
The AFSR can be decoded with a tool developed by PTS EMEA engineers. This tool is
available here:
http://cpre-emea.uk/cgi-bin/afsr/afsr.pl

In most instances Sun Fire 280R servers output the AFSR as four 16-bit portions (4 x 4 hex
values) separated by periods. Unfortunately the tool requires very specific input: any AFSR
entered must be free of all “.” periods, or contain only a single “.” period separating the
two 32-bit portions (2 x 8 hex values) of the AFSR that make up the 64-bit register. This is due
to the original tool being designed around error messages from UltraSPARC I and II-based
systems. It is recommended that Customer Facing engineers using this tool simply remove all
“.” periods from the output provided by the system. If this is not done, the tool will decode the
input incorrectly, as it will find the first “.” period and ignore the last 2 x 4 hex values.
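
A simple way to strip the periods before pasting the value into the tool (the AFSR shown is the
one used in the example that follows):

# Remove the '.' separators from the AFSR as printed by the system
echo '0008.0000.0000.0000' | tr -d '.'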

Example from a system:

CPU: 0000.0000.0000.0000 AFSR: 0008.0000.0000.0000 AFAR: 0000.0400.0470.0200

When entered into the tool as supplied by the system, the AFSR “0008.0000.0000.0000” is
decoded as follows:

AFSR: 0x800000000

Bits Field Use


~~~~~~~~~~~~~~~~~~~
<35> EDU uncorrectable ECC error from Ecache

This would typically indicate an L2 cache problem local to the CPU module reporting it.

When entered into the tool with all “.” periods removed, the AFSR “0008000000000000” is
decoded as follows:

AFSR: 0x8000000000000

Bits Field Use


~~~~~~~~~~~~~~~~~~~
<51> PERR system interface protocol error

Note the additional 0's that make up the full 64-bit register value. This error would typically
indicate a problem on the system bus while the CPU was requesting data/instructions from the
other CPU, memory, or the IO-Bridge. The AFAR could be used to help narrow down
specifically which part of the system bus was being accessed at the time, using the fixed address
ranges listed below.



In other words, incorrectly decoding this AFSR would have led to the misdiagnosis that the CPU
was bad, when in actual fact the problem lies somewhere other than the CPU.

Note that the decoder tool requires selecting the appropriate device type, since the different CPU
and IO-Bridge devices have different meanings for each error status bit stored in the AFSR of
that device type. On Sun Fire 280R server, the 3 devices that are used are:
Cheetah – UltraSPARC III (750MHz)
Cheetah+ - UltraSPARC III Cu / III+ (900MHz, 1015MHz, 1200MHz CPU's)
Schizo – PCI IO-Bridge

Once the tool has been used to correctly decode the AFSR into an error type, InfoDoc 43642 in
conjunction with FIN I0954-1, and the corresponding AFAR should be used to narrow down
which FRU is the suspect cause of the error. Where an Esynd # is given and the KUP on the
system is earlier than 108528-16 (Solaris 8) or 112233-01 (Solaris 9), it is useful to follow
the procedure for manually decoding the Esynd, AFSR and AFAR down to a specific DIMM or
bank, described in Appendix B, and additionally available here:
http://pts-americas.west/vsp/wgs/products/littleneck/excalibur.mem.pdf

Decoding AFAR's
The Sun Fire 280R architecture is the same as the Sun Blade 1000/2000 architecture originally
described in the “Excalibur Architecture Manual v1.0”. This is available for download from a
number of internal websites, including the PTS Americas website here:
http://pts-americas.west/vsp/desktop/products/excalibur/excal_architecture_manual_1.0.pdf

Memory AFAR's:
Cacheable Memory lies in the 0x0 through 0x3ff.ffff ffff address space. Any AFAR in this range
may be an address in physical memory, or in physical cache. No distinction is possible between
the two, but such a distinction can be drawn based on the AFSR error type that is flagged with
the AFAR. The memory address space is initialized by OBP which sets up the interleaving
pattern and prints out the ranges being used according to the physical memory present in the
system. This is printed only when the OBP parameter “diag-switch?” is set to “true”, which is
the variable that currently controls OBP output verbosity. Note that this may change in a future
OBP version. The message to look for on the Sun Fire 280R server is:

Membase: 0000.0000.0000.0000
MemSize: 0000.0000.4000.0000

This indicates the size of the memory address space allocated starting at address 0x0; in this
example a MemSize of 0x4000.0000 corresponds to 1GB (an 8GB system would report a
MemSize of 0000.0002.0000.0000).
Since Sun Fire 280R server has only 1 memory controller active (CPU0), this is relatively simple
to understand. Other platforms such as Sun Fire V480 and Sun Fire V880 require more detailed
output from OBP to determine which CPU/Memory Slot and CPU is associated with which
memory address ranges.

The AFAR can be used only to interpret which bank of memory is the source of the error, in
systems where both memory banks contain DIMMs (i.e. 8 DIMMs). It cannot be used to
determine which DIMM within the bank is at fault. If the error is of correctable type (CE), then the
Esynd # which is part of the AFSR can be used for narrowing to the specific DIMM following the
procedure in the document referenced above in the AFSR section.

The procedure for interpreting the bank is:


1. Decode the AFAR into its bits.



2. Bits 9-6 (as counted from the right) provide the LM field which refers to the logical
bank. Since Sun Fire 280R server only has 4-way interleaving, we only care about bits
7-6. These bits provide 4 binary values which correspond to the 4 logical banks as
follows:
Bits [7-6]     Logical Bank     Physical Bank
00             0                Group 0
01             1                Group 1
10             2                Group 0
11             3                Group 1
Group 0 is DIMM's J0100, J0202, J0304, J0406
Group 1 is DIMM's J0101, J0203, J0305, J0407
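
As a worked example, take the AFAR 0x00000000.7f2e08a0 from the correctable error log
shown in the "Handling of Correctable Errors" section above:

Low byte of AFAR = 0xa0 = 1010 0000 binary
Bits [7-6] = 10 binary = 2  ->  Logical Bank 2  ->  Physical Group 0

This agrees with the kernel message for that event, which named J0100, a Group 0 slot.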

I/O & Special AFAR's:

The I/O address space is the next 4 Terabytes of address space, from 0x400.0000 0000 to
0x7ff.ffff ffff. The address ranges are allocated to the various I/O devices as follows, per the
Excalibur Architecture Manual:

Safari Configuration Space:


Address Range Safari Device
0x400 0000 0000 - 0x400 007f ffff CPU0
0x400 0080 0000 - 0x400 00ff ffff CPU1
0x400 0100 0000 - 0x400 03ff ffff Reserved
0x400 0400 0000 - 0x400 047f ffff Schizo
0x400 0480 0000 - 0x401 ffff ffff Reserved

Additional device memory ranges for all the specific onboard devices are defined in Chapter 2 of
the Excalibur Architecture Manual.

NewLink Devices Space (unused):


0x500 0000 0000 - 0x5ff ffff ffff

PCI Memory Space:


0x600 0000 0000 - 0x6ff ffff ffff

UPA64 Space:
0x700 0000 0000 - 0x701 ffff ffff UPA64S Slot 0
0x702 0000 0000 - 0x703 ffff ffff UPA64S Slot 1
0x704 0000 0000 - 0x7ff efff ffff Reserved

NOTE: Addresses in the ranges 0x7fb., 0x7fc., 0x7fd. and 0x7fe.
These addresses are seen regularly and are special addresses that correspond to the
PCI busses set up by OBP and used for PCI transactions to/from actual devices on the
bus.

Only 0x7fd. and 0x7fe. are used in the Sun Fire 280R server since there is only 1 IO-Bridge.
The 7fd corresponds to the 66MHz bus and the 7fe corresponds to the 33MHz bus. The
additional bits in the rest of the address can be used to translate to a particular device
on the bus if one knows how PCI bus transactions are constructed. If one of
these special addresses is seen in the AFAR, then it is a sign of a failure during a
transaction to a PCI card or onboard device. It is recommended in these cases to rule
out any PCI cards as suspect.

BootBus Space:

0x7ff f000 0000 - 0x7ff f00f ffff Motherboard Flash PROM space / PROM Emulator
0x7ff f010 0000 - 0x7ff f0ff ffff PROM Emulator/ Reserved
0x7ff f100 0000 - 0x7ff f7ff ffff Reserved
0x7ff f800 0000 - 0x7ff f8ff ffff Philips I2C controller, PCF8584
0x7ff f900 0000 - 0x7ff f9ff ffff SuperI/O
0x7ff fa00 0000 - 0x7ff faff ffff Serial Lines Controller
0x7ff fb00 0000 - 0x7ff ffef ffff Reserved
0x7ff fff0 0000 - 0x7ff ffff ffff BBC (internal Registers)



Disks
General Disk Troubleshooting

Background Information & Procedures:


Most disk problems on the Sun Fire 280R server are related to a specific disk, and not the disk
backplane or cables. Note that each FC-AL disk has a World Wide Name (WWN) attached to
it, which affects how the device appears to Solaris depending on the slot it is installed
in.

1. Determine the build date of a system based on the serial number of the system. See “Sun
Fire 280R server Serial Numbers” in Section 1 above, for information on how to determine
this.

2. Check the FIN's below for information on known firmware upgrade issues. These FIN's are
controlled proactive FIN's and should be applied to all potentially affected systems as soon as
possible.

3. To confirm a bad disk, there are a few things that can be checked. If the disk was just
replaced, and similar errors from prior to the replacement are continuing, then most likely the
new disk is DOA.

a. Carefully examine the output of the “/usr/bin/iostat -E” command, looking for any error
events that are affecting one of the two disks. Look for non-zero counts on the first, 4th and
5th lines. If both disks have non-zero counts, it could be problems with one disk and
artifacts of that problem on the other disk, so this case would be noticeable if there are
significantly higher error counts on one disk compared to the other. Sample output from a
Sun Fire 280R server's disk drive is:
# iostat -E
<...>
ssd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAN3735F SUN72G Revision: 0704 Serial
No: 0304V87742
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
<...>

b. It is strongly suggested that the “/var/adm/messages” file be examined for errors. In basic
terms the errors likely to appear are "individual disk" types of error, or "bus or host based"
types.

i. Below is an example of an "individual disk" type of error. Errors of this
type are very likely to be caused by the disk in question.

penu22028 scsi: [ID 107833 kern.warning] WARNING: /
pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf966fd5,0 (ssd0):
Feb 18 16:29:16 penu22028 Error for Command: read(10)
Error Level: Retryable
Feb 18 16:29:16 penu22028 scsi: [ID 107833 kern.notice]
Requested Block:55980576 Error Block: 55980583
Feb 18 16:29:16 penu22028 scsi: [ID 107833 kern.notice]
Vendor: SEAGATE Serial Number: 0218P1LH33
Feb 18 16:29:16 penu22028 scsi: [ID 107833 kern.notice]
Sense Key: Media Error
Feb 18 16:29:16 penu22028 scsi: [ID 107833 kern.notice]
ASC: 0x11 (unrecovered read error), ASCQ: 0x0, FRU: 0xe4

Errors of this type generally indicate the drive listed needs to be replaced. Notice
that this type of error lists "Vendor", "Sense Key", and "ASC/ASCQ" information.
These values will vary with the type of drive error and are explained further in
InfoDoc 14140. To relate the information given above to the "cXtXdX" disk being identified,
match the WWN w21000004cf966fd5,0 from the error above to the output of the format
command:

# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf966fd5,0
1. c1t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w500000e010368268,0
Specify disk (enter its number):

So in the example above the failing drive is c1t0d0

ii. The three examples below are of the "bus or host based" type error. That in no way
implies that a disk could not be at fault.

Example 1. The problem was troubleshot by booting from CDROM and running
"test" from the format analyze menu. By swapping drive positions it was
determined that the drive was failing. Use of the format program is explained
later in this section.

Dec 16 11:57:40 marge qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(0):
Loop OFFLINE
Dec 16 11:58:43 marge qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(0):
Loop ONLINE
Dec 16 11:58:54 marge scsi: [ID 243001 kern.warning] WARNING: /pci@8,
600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf96a89f,0 (ssd0):
Dec 16 11:58:54 marge SCSI transport failed: reason 'tran_err':
retrying command

Example 2. The message is from "picld" which is the daemon that monitors
environmental data. Notice that both disks are called in error. The problem was
the internal disk backplane.

Mar 31 11:30:03 bou280r-01 picld[72]: [ID 961923 daemon.error] WARNING:
Device DISK0 failure detected by sensor DISK0_FAULT_SENSOR
Mar 31 11:30:03 bou280r-01 picld[72]: [ID 961923 daemon.error] WARNING:
Device DISK1 failure detected by sensor DISK1_FAULT_SENSOR
Mar 31 11:30:33 bou280r-01 picld[72]: [ID 449286 daemon.error] Device
DISK0 OK
Mar 31 11:30:33 bou280r-01 picld[72]: [ID 449286 daemon.error] Device
DISK1 OK
Mar 31 11:31:03 bou280r-01 picld[72]: [ID 961923 daemon.error] WARNING:
Device DISK0 failure detected by sensor DISK0_FAULT_SENSOR

Example 3. There were a number of OFFLINE/ONLINE messages ultimately
resulting in the disks being offlined. The cause was a failing disk controller on the
motherboard.

NOTICE: Qlogic qlc(0): Loop OFFLINE


NOTICE: Qlogic qlc(0): Loop ONLINE
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037efa9e1,0 (ssd5) offline
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w2100002037e3eb92,0 (ssd4) offline

4. The following diagnostic tips should be used to assist troubleshooting:

a. Use of "probe-scsi-all" from the ok> prompt should usually be the first diagnostic run since
it is not dependent on any operating system to run.All the disks should be seen.

b. Next, use "obdiag" from the ok> prompt which presents a menu of devices. Set the
environment variables test-args = subtests,verbose,media,bist,iopaths and diag-level =
max, then run the “test-all” command at the obdiag> prompt.

c. If the drives are seen okay, boot Solaris in single-user mode from either CDROM or
network (“boot cdrom -s” or “boot net -s”). This provides the advantage of using a device
tree image loaded into memory rather than the one loaded on the disk, which is
helpful in isolating problems where the installed Solaris is suspected of being damaged or
mis-configured, as well as allowing swapping of drive positions without worrying about the
effects of the WWN and slot id's. Once you have booted the Solaris image you can enter
the format utility and run some analyze tests.

# format
Searching for disks...done

AVAILABLE DISK SELECTIONS:


0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w500000e0102b5291,0
Specify disk (enter its number): 0
selecting c1t0d0
[disk formatted]
Warning: Current Disk has mounted partitions.

FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> analyze

ANALYZE MENU:

read - read only test (doesn't harm SunOS)
refresh - read then write (doesn't harm data)
test - pattern testing (doesn't harm data)
write - write then read (corrupts data)
compare - write, read, compare (corrupts data)
purge - write, read, write (corrupts data)
verify - write entire disk, then verify (corrupts data)
print - display data buffer
setup - set analysis parameters
config - show analysis parameters
!<cmd> - execute <cmd> , then return
quit
analyze>

It is suggested you choose carefully what tests you will run as some will write over the operating
system. To further refine the running of the tests available in format, use the options available in
the setup sub menu.
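
The OBP command sequence described in step 4b is illustrated below (a sketch only – note the
customer's current diag-level and test-args settings first so they can be restored afterwards):

ok setenv diag-level max
ok setenv test-args subtests,verbose,media,bist,iopaths
ok obdiag
obdiag> test-all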

5. Additional preparation for disk problems:

a. Consider configuring the “/etc/syslog.conf” file to log messages onto another system as
well as locally. See the “syslog.conf” man page for more details.

b. If a volume manager is being used it is suggested to run the following commands
periodically from a cron job and save the outputs to a location that is regularly backed up
(a minimal crontab sketch follows the command lists below):

i. For Veritas Volume Manager (VxVM):


vxdisk list
vxprint -ht
vxprint -g <disk_group> -vpshm (run this once for each disk_group name)

For more information on this procedure that should be useful for any regular VxVM
system administration, see InfoDoc 12006.

ii. For Solstice DiskSuite (SDS/SVM):


metastat
metastat -p
metadb
metadb -i
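
A minimal crontab sketch for the volume manager snapshots described in 5.b (the output
directory, file naming and command paths are examples only – adjust them to whatever location
is regularly backed up at the site):

# Nightly at 01:00; note that '%' characters must be escaped inside a crontab entry
0 1 * * * /usr/sbin/vxprint -ht > /var/tmp/vxprint-ht.`date +\%Y\%m\%d` 2>&1
0 1 * * * /usr/sbin/metastat -p > /var/tmp/metastat-p.`date +\%Y\%m\%d` 2>&1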

Seagate Drives
FIN# I0816-1: Seagate ST336605FC 36GB and ST373405FC 73GB drives with firmware 0438
(or below) could be susceptible to label corruption which results in the drive and its data being no
longer accessible. This FIN affects drives in systems built before June 2002. All such disks
should be proactively updated to prevent data and availability loss, prior to disks failing and
needing replacement.

Fix: Install patch 109962-07 and download F/W 0538 or 0638 to Seagate 36GB ST336605FC
disks and 73GB ST373405FC disks.

# /usr/sbin/luxadm display /dev/rdsk/c1t0d0s2


DEVICE PROPERTIES for disk: /dev/rdsk/c1t0d0s2
Status(Port A): O.K.
Vendor: SEAGATE
Product ID: ST336704FSUN36G

^^^^
OK - update if ST336605F
WWN(Node): 2000002037e98149
WWN(Port A): 2100002037e98149
Revision: 0726
^^^^
OK - update if above model is not 0538.

Fujitsu Drives
FIN# I0963-1: Fujitsu 73GB HDD will not be recognized during a 'boot net' operation on Sun
Blade 2000 or Sun Fire 280R platforms. This FIN affects drives in systems shipped from
approximately July 2002 to January 2003.

Fix: Upon failure, update affected Fujitsu 73.4GB disk drives (MAN3735FC) with firmware
version 0604 to firmware version 0704 via patch 109962-10. Patch 109962-11 has also been
released and has been available since May 15, 2003.
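
A quick way to read the firmware revision of both internal drives without luxadm, using the
same fields shown in the iostat -E sample earlier in this section:

# Vendor, Product and Revision are printed on one line per drive
iostat -E | grep Vendor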

DVD
FIN# I0723-1: Unable to boot Solaris 8 Update 7 (HW 02/02) (or later) DVD-ROM Media from
Toshiba DVD/CDROM. This FIN affects systems shipped prior to November 2001.

Fix: Install Patch 111649-03 for all Toshiba SD-M1401 drives having part number 390-0025-01.
This patch is compatible with Solaris releases 2.5.1, 2.6, 7, and 8.

Note on Internal FC-AL Subsystem:


The FC-AL subsystem in the Sun Fire 280R server is based upon the Qlogic ISP2200 controller.
This is contained on the motherboard and the qlc firmware for this controller is a part of the OBP
firmware. This onboard firmware is updated via the OBP firmware patch and was updated from
v1.12 to v1.14 in OBP 4.6.x or later. The controller is the same as that used for the Amber
single-port PCI card, which also has qlc firmware that is contained on the card itself, and is
updated via a special patch. The controller is supported by the MPxIO project (SAN foundation
kit) and drivers, and the related patches that provide MPxIO support are recommended to get the
latest driver support in Solaris.
The onboard FC-AL loop extends from the 2 disk drives, the internal FC-AL backplane, and out
to the external HSSDC port. This is a single loop and has no capability for dual-loop or
multipathing. Also, due to loop and length restrictions, it is not supported to attach any array
larger than the Sun StorEdge Multipack-FC to the external HSSDC port.



PCI Cards
Background Information & Procedures:
See the “Detecting Bad CPU Writers” above in the CPU section for more information on CPU-IO
transaction errors. See the “Interpreting AFSR & AFAR Outputs” above in the Memory section
for more information on PCI I/O memory addressing.

The Sun Fire 280R server contains four PCI slots on two PCI Busses from the single IO-Bridge
(Schizo) on the motherboard. PCI slot 1 provides option for 64-bit 66MHz 3.3V or 33MHz 5V
cards, and PCI slots 2, 3 and 4 provide option for 64-bit 33MHz 5V cards. All slots accept
universal keyed 3.3V/5V cards. It is recommended best-practice not to place a 33MHz card into
the 66MHz slot if possible, as this will slow the whole bus performance down to 33MHz, thereby
halving the performance of the on-board 66MHz FC-AL disk controller for the internal disks. The
theoretical bandwidth provided by the IO-Bridge is 1.2GB/s maximum throughput between PCI
and the Safari bus. The 66MHz bus provides 8 bytes (64-bit) x 66MHz = 528MB/s maximum
throughput, and the 33MHz bus provides 8 bytes (64-bit) x 33MHz = 264MB/s maximum
throughput, shared between all devices and slots on each bus.

The two PCI busses are designated in the device tree as “/pci@8,600000” where the ,600000
indicates the 66Mhz bus, and “/pci@8,700000” where the ,700000 indicates the 33Mhz bus. The
number 8 indicates the safari agent ID of this component, referring to the IO-Bridge itself.
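
To see which devices OBP actually probed on each bus, the standard show-devs command can
be given the bus path (a quick illustration; run with no argument it lists the whole device tree):

ok show-devs /pci@8,600000
ok show-devs /pci@8,700000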

The following describes the internal devices on each Bus and the slot device numbers, as well
as how to interpret the “Device #” messages that are reported in PCI errors. This is reproduced
from an article created by a PTS EMEA engineer (Mick Mullins) and is also available here:

http://cpre-emea.uk/technotes/showentry.php?id=1108404754

66MHz Bus - /pci@8,600000


This bus is made up of two devices, one being an on-board controller (fc-al) and the other being
a single PCI slot (J2301). The Excalibur architecture manual reports that this bus can support up
to four devices, though only two are used in the Sun Fire 280R server. The paths for the devices
on this bus are:
/pci@8,600000/"device"@1 – PCI card (Slot 1 J2301)
/pci@8,600000/SUNW,qlc@4 -- on-board FC-AL Qlogic isp2200 controller

33MHz Bus - /pci@8,700000


This bus is made up of five devices out of a total possible of six. The devices are three PCI
slots, one dual channel SCSI controller and the RIO multifunction controller. The paths for the
devices on this bus are:
/pci@8,700000/"device"@1 -- PCI card (Slot 4 J2601)
/pci@8,700000/"device"@2 -- PCI card (Slot 3 J2501)
/pci@8,700000/"device"@3 -- PCI card (Slot 2 J2401)
/pci@8,700000/ebus@5 -- on-board RIO multifunction controller
/pci@8,700000/scsi@6 -- on-board SCSI controller (Symbios/LSI 53C876)

Interpreting the Device Number in PCI Error Messages

The device numbers assigned by the IO-Bridge (Schizo) to the devices on the two PCI busses are
based on the PCI req/gnt lines from the IO-Bridge to each individual bus device. The following
table indicates which req/gnt line is used for each device:

/pci@8,600000 - 66Mhz
PCI 64conn J2301 - EPCI_GNT_0
ISP2200 CONTROLLER – EPCI_GNT_3

/pci@8,700000 – 33Mhz
PCI 64conn J2601 - PCI_GNT_0
PCI 64conn J2501 - PCI_GNT_1
PCI 64conn J2401 - PCI_GNT_2
RIO CONTROLLER - PCI_GNT_3
RIO CONTROLLER - PCI_GNT_4
SYM53C876 SCSI - PCI_GNT_5

Note that the RIO controller supports two PCI req/gnt pairs to minimize DMA latency in the
system. DMA requests from the channel engines are routed to both PCI req/gnt pairs following
arbitration and availability of internal resources. In systems without the second PCI req/gnt pair,
RIO can use the single pair to request the bus.

These tables show how to decode the device #'s that are reported in PCI error messages, based
on the table of used req/gnt lines:

/pci@8,600000 - 66Mhz
Bits 3 2 1 0 Device Type DEVICE #
0 0 0 0 Bus idle
0 0 0 1 PCI slot J2301 DEVICE 0
0 0 1 0 Not used
0 1 0 0 Not used
1 0 0 0 ISP2200 On-board DEVICE 3

/pci@8,700000 – 33Mhz
Bits 5 4 3 2 1 0 Device Type DEVICE #
0 0 0 0 0 0 Bus idle
0 0 0 0 0 1 PCI slot J2401 DEVICE 0
0 0 0 0 1 0 PCI slot J2501 DEVICE 1
0 0 0 1 0 0 PCI slot J2601 DEVICE 2
0 0 1 0 0 0 RIO On-board DEVICE 3
0 1 0 0 0 0 RIO On-board DEVICE 4
1 0 0 0 0 0 SYMB 53c876 On-board DEVICE 5

Note that DEVICE 6 is the SCHIZO chip itself, on both buses. This is indicated in the Schizo
ASIC specs (Sect. 22.4.1.1, PCI Control & Status register, ERR_SLOT bits 55:48).

Example PCI Error Message:

Mar 9 20:44:23 foobar pcisch: [ID 831440 kern.warning] WARNING: pcisch-
1: PCI fault log start: Mar 9 20:44:23 foobar pcisch: [ID 630226
kern.notice] PCI error occurred on device #0
Mar 9 20:44:23 foobar pcisch: [ID 120591 kern.notice] dwordmask=0
bytemask=3
Mar 9 20:44:23 foobar pcisch: [ID 607383 kern.notice] pcisch-1: PCI
primary error (20):
Mar 9 20:44:23 foobar pcisch: [ID 938423 kern.notice] Master Abort
Mar 9 20:44:23 foobar pcisch: [ID 259679 kern.notice] pcisch-1: PCI
secondary error (0):
Mar 9 20:44:23 foobar pcisch: [ID 467665 kern.notice] pcisch-1: PBM
AFAR 0.001000c0:
Mar 9 20:44:23 foobar pcisch: [ID 127741 kern.warning] WARNING:
pcisch1: PCI config space CSR=0x22a0<received-master-abort>
Mar 9 20:44:23 foobar pcisch: [ID 141464 kern.notice] pcisch-1: PCI
fault log end.
Mar 9 20:44:24 foobar unix: [ID 578303 kern.notice] pcisch-1: PCI bus 2
error(s)!
foobar# grep pcisch /etc/path_to_inst

"/pci@8,700000" 0 "pcisch"
"/pci@8,600000" 1 "pcisch"

“pcisch-1” is the driver instance reporting the PCI error. Checking the /etc/path_to_inst file shows
that instance 1 of the pcisch driver is bound to “/pci@8,600000”, so the error has
occurred on the 66Mhz bus. Using the tables above, the message “PCI error occurred on
device #0” specifically relates to the PCI slot J2301. The message “PCI config space
CSR=0x22a0<received-master-abort>“ indicates that the PBM within the Schizo received
a master abort signal from an external device. This leads to a suspect PCI card in slot J2301.

FIN #I0722-1:

Due to bug 4482600 in the Schizo ASIC, an interaction between 64-bit and 32-bit cards may
cause a PCI SERR panic. The bug is due to Schizo putting incorrect parity when filling the upper
32-bit data, when 32-bit cards using the lower 32-bits of data are doing long PIO transactions.
Any 64-bit card may legitimately check and detect the bad parity and initiate the PCI SERR
panic. The bug is fixed in hardware in Schizo version 2.4, which has not actually shipped as of
this writing. The hardware fix will be available in motherboard 501-6230-10 or later, which will
contain Schizo version 2.5 or ELE version 1.1+.

It was found through investigation of Sun PCI cards, that this problem only occurred between
Sun PCI graphics cards PGX32 and PGX64, where the other PCI card checking the parity and
asserting SERR was an Emulex Lightpulse FC-AL adapter. To work around this problem, given
that the hardware fix is not available, 2 solutions are possible:

1. Due to the architecture of the Sun Fire 280R server, only one bus can be affected by
this problem, i.e. when the two interacting cards are installed in two of the three 33MHz slots.
Move one of the 2 cards (either Emulex or PGX) that are causing the problem to the 66MHz
slot, thus isolating the cards onto separate buses where they cannot interact.

2. Apply the workarounds listed in the FIN for the PGX32 and PGX64 cards respectively. This
requires installation of minimum revision patches for these drivers, as well as some specific
commands required to configure OBP firmware and Solaris driver variables to disable the
behavior that causes the Emulex card to check the bad parity.

Theoretically this problem can occur between any third-party PCI cards as well. Please use the
PTS engagement mechanism to escalate any additional interactions showing behavior similar to
this bug where workaround 1 above is not possible due to the customer configuration
requirements and new third-party PCI cards are triggering symptoms similar to this bug and FIN.

Only one other case has been reported and escalated, occurring between two
application-specific cards. In this case, the customer required three cards of this type
in the system, which were interacting with each other, so they had no choice but to
have two together on the 33MHz bus. The hardware fix was initially tested using a special test
motherboard that contained a Schizo 2.4, and the problem still occurred with these PCI cards.
This problem was determined to be caused by the compilation of the card drivers and special
user application. Though the card drivers were not performing transactions similar to those seen
in the Emulex case, using Sun Forte 6.x C compilers were creating code that would trigger the
Schizo bug as soon as the application was run that accessed the cards. This was resolved by
re-compiling the drivers and user application with an earlier revision of Sun Workshop 4.2 and
5.0 C compilers.

Power Supplies and PDB

Background Information & Procedures:


Machines suddenly rebooting or powering off unexpectedly, without any messages on the
console or in the “/var/adm/messages” file, can have several causes. Check the following in
this order (a quick Solaris-side check is sketched after the list):
1) The AC power source(s) and cables. Is the problem local to one machine, or
are others experiencing similar reboots?
2) The power supplies. Try swapping PS0 and PS1 around.
3) RSC. Check the RSC logs with the loghistory and consolehistory commands.
Not using RSC? A bad RSC card could still be the cause of your problem, if
installed, even if you're not using it. Try pulling it out and see if the problems
persist. (You'll get some error messages, but the system should be ok for testing.)
4) Power Distribution Board.
5) Internal cabling – all power cables and the 3-wire I2C cable.
6) Motherboard. Unlikely.
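
Before swapping hardware, it is worth confirming from the Solaris side that the outages really are
silent. A minimal sketch using standard Solaris commands (the egrep pattern is illustrative, not
exhaustive):

foobar# last reboot | head
foobar# egrep -i "panic|fatal|watchdog|reset" /var/adm/messages*
foobar# uptime

If nothing at all is logged around the reboot times, a power-path or RSC problem from the list above
becomes more likely; if panics or resets are logged, troubleshoot those messages first using the
earlier sections of this guide.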

Thermal Event Issue

FIN #I0992-1: A small number of Power Distribution Boards may experience a limited thermal
event at the Power Supply 1 connector. This is due to a material issue: the connector is
susceptible to humidity and was typically out of dimension when PS1 was installed at system
assembly. An affected PS1 connector will have broken ears, which can be checked for by visual
inspection. Technically PS0 can be affected too, but this has never been seen in the field,
because PS0 was installed into systems at a different assembly location than PS1. This is a
reactive FIN – replace on failure.

Motherboard
Background Information & Procedures:
The motherboard should be the last item considered when troubleshooting for bad hardware,
unless there is a clear problem related to the onboard SCSI or FC-AL controllers, or other I/O
devices that are part of the motherboard.

The most common problem with the motherboard is bent pins on the CPU connectors due to
improper installation of CPU modules. The pins on the CPU connectors and the motherboard
slots should be inspected very carefully for even the slightest damage or deviation from
perpendicular. If this appears to be the problem, then the motherboard and both CPUs should be
replaced together to remove all suspect parts from the system, because a bent pin on one FRU
will bend the pins on the other FRU and on every subsequent FRU that touches that slot or
connector. Fatal Resets, RSEs, and repeated DIMM or apparent CPU problems are very rarely
due to a bad motherboard component itself, so replacing only the motherboard in these
situations is not going to help.

When replacing CPU modules, be sure to follow these instructions precisely:

1) Remove CPU modules using the reverse of the install procedure in step 4 below, making sure
to alternate between the screws every half to full turn of the driver tool.
2) Closely inspect the pins on the motherboard and CPU module for damage. Damaged pins may
be difficult to see. Do not re-use any damaged component.
3) Use torque tool part number 250-1611 and not the ring tool 340-6395.
4) Insert CPU modules:
a. Turn both thumbscrews by hand simultaneously to locate the screws in their
threads, until the screws are finger tight.

b. Turn one screw a half or full turn clockwise, then turn the opposite screw a half
or full turn clockwise, using the torque tool provided in the unit.

c. Repeat the above step until both screws lock and the CPU module is
securely in place. The torque tool will give an audible "click" when the screws
are at the correct 5 inch-pound torque spec.

d. Repeat steps a-c for the other CPU module (if present).

e. Make sure the screws are not re-tightened after the first "click", or they will be over-
tightened and can cause damage.

When bent pins or insufficient torque are suspected as the cause of problems:
DO NOT tighten screws currently in place, regardless of whether you think they are loose. The only
correct and approved way to verify that the torque is correct is to remove and re-install the module
following the procedures above. Due to the nature of the design, it is expected that the screws
will slacken from the CPU shroud over time, but this does not loosen the pressure at the
motherboard slot/CPU module connector that holds the module in place. Tightening screws that
seem loose will over-torque the connector at the motherboard slot, and could result in cracked and
damaged connectors.
Torque Tool Issues – Sun Alert 55900
There have been reports of torque tools not functioning properly and coming apart. The potentially
affected tools are identified on the top cap label by part number 250-1611-02 only, with date codes
12/02, 01/03, 02/03, 03/03, 04/03, 05/03, and 06/03. Sun Alert 55900 covers this issue.

Firmware prints “IDPROM Contents Invalid” and 0's for Ethernet MAC Address
The most common reason this occurs is a prior POST or OBP initialization error that has caused
the system to stop initializing prematurely. As a result, the IDPROM may not have been read yet,
so typing the “banner” command at the ok prompt, or OBP printing the banner itself, gives this
output. Look for previous failures in the console log, such as a CPU or memory error, that may be
diagnosable and may be the reason OBP got into this state. Troubleshooting this type of output
should start with running POST using the keyswitch in the DIAG position.
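
If changing the keyswitch position is not practical (for example on a remote system), roughly the
same effect can be achieved with the standard OBP diagnostic variables from the ok prompt; a
sketch, assuming default OBP behavior on this platform:

ok setenv diag-switch? true
ok setenv diag-level max
ok reset-all

Remember to set diag-switch? back to false once troubleshooting is complete, or every subsequent
reset will run the longer diagnostics.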

On rare occasions, this may be caused by a bad socketed SEEPROM (NVRAM) chip that was not
programmed correctly during an OBP update, or that has bent pins (for example after being
transferred to a new motherboard FRU).

On very rare occasions, this may be caused by OBP bug 4446946, which affected a small
number of very early systems that shipped with OBP version 4.0.46. It is unlikely that any customer
system still has this version of OBP, as the fix for this bug was in the first released OBP patch
with OBP 4.2.2, which was also incorporated onto new motherboards (and new systems) not
long after the Sun Fire 280R server was released. Any failures caused by this bug are likely to
have happened well before the writing of this guide.

Miscellaneous Issues
Sun Rack 900 (NGR)
A change made from the original rack kit (-01) precludes the most commonly shipped rack kit
(-02) from fitting into the Sun Rack 900 and some third-party racks.

A new rack kit (-04) is now shipping and can be ordered as a FRU for customers moving older
systems into new racks. The -03 kit made the slides long enough to fit the 900 rack, and the -04
kit additionally adds M6 screws for use with NGR.

The part number for the Rack Rail Kit is: 560-2625-04

The trim strips on earlier shipping servers do not fit properly on the NGR rack. Servers with
serial numbers 325xxxx or later have trim strips that fit both 10-32 and M6 screws. The trim
strips serve no functional purpose and are purely decorative. For older systems without the
new trim strips that are being relocated into new racks, either discard the strips and secure the
server with the bare metal, which has holes large enough for both screw types, or bore out the
holes in the plastic with a hand file or knife to make them large enough.

This issue is also described in FIN I0995-1.

Appendix A: Trap Types Table for UltraSPARC III CPU's
(From Joint Programmers Specification 1 (JPS1): Commonality & JPS1: UltraSPARC III Extensions)

TT Description
000 Reserved
001 Power On Reset
002 Watchdog Reset
003 Externally Initiated Reset
004 Software Initiated Reset
005 RED State Exception
006 ... 007 Reserved
008 Instruction Access Exception
009 Instruction Access MMU Miss
00a Instruction Access Error
00b ... 00f Reserved
010 Illegal Instruction
011 Privileged Opcode
012 Unimplemented LDD
013 Unimplemented STD
014 ... 01f Reserved
020 FP Disabled
021 FP Exception IEEE 754
022 FP Exception Other
023 TAG Overflow
024 ... 027 Clean Window
028 Division by Zero
029 ... 02f Reserved
030 Data Access Exception
031 Data Access MMU Miss
032 Data Access Error
034 Memory Address not Aligned
035 LDDF Memory Address not Aligned
036 STDF Memory Address not Aligned
037 Privileged Action
038 LDQF Memory Address not Aligned
039 STQF Memory Address not Aligned
03a ... 03f Reserved
040 Asynchronous Data Error
041 ... 04f Interrupt Level 1 - 15
050 ... 05f Reserved
060 Interrupt Vector
061 PA Watchpoint
062 VA Watchpoint
063 Corrected ECC Error
064 ... 067 Fast Instruction Access MMU Miss
068 ... 06b Fast Data Access MMU Miss
06c ... 06f Fast Data Access Protection
070 ... 07f Implementation Dependent Exception
070 Fast ECC Error (UltraSPARC III-only Extension – L2 Cache ECC Error)
080 ... 09f Spill Normal 0 - 7
0a0 ... 0bf Spill Other 0 - 7
0c0 ... 0df Fill Normal 0 - 7
0e0 ... 0ff Fill Other 0 - 7
100 ... 17f Trap Instruction (Ticc)
180 ... 1ff Reserved
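
As a worked example of using this table: a RED State Exception or Fatal Reset console dump
normally includes the trap type, typically on a line of the form TT=0000.0000.0000.0032 (the exact
formatting varies with OBP revision). Looking up 032 in the table above identifies a Data Access
Error, which points the investigation toward the memory and L2 cache sections of this guide rather
than, for example, an MMU or alignment problem.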

Appendix B: Manual Decoding of ECC Memory Errors

(For use with Solaris 8 Kernel Update Patch 108528-15 or earlier, and OBP-level failures)
There are two types of memory that can be the cause of an ECC error. There is L2 cache
memory that is physically part of the CPU module FRU, and there is system-bus memory
comprised of DIMMs located in sockets on the motherboard. The asynchronous fault status
register (AFSR) indicates the error type, and the asynchronous fault address register (AFAR) holds
the physical address that was being accessed when the error occurred. On Solaris 8 with Kernel
Update Patch (KUP) 108528-15 or earlier, the error message Solaris generates will only appear
after the 256th event on Sun Fire 280R servers, and will list the DIMM location incorrectly. In
these cases, the following manual procedure should be followed to determine the correct faulty
DIMM location, and all Solaris messages showing “Memory Module Jxxxx” locations should be
ignored. The procedure was written by Matthew Finch (PTS EMEA) and is also provided on the
following web page:

http://cpre-emea.uk/technotes/showentry.php?id=1916233701
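
Before starting the manual decode, it is worth confirming which KUP is actually installed; if the
system is already at 108528-16 or later, Solaris should name the correct DIMM itself. A quick check
(output trimmed for illustration):

foobar# showrev -p | grep 108528
Patch: 108528-15 Obsoletes: ... Requires: ... Incompatibles: ... Packages: SUNWcsu ...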

We will use the following error as an example. The system is a Sun Fire 280R server. The error
is a CE (correctable error) reported by the IO-Bridge (Schizo) on an earlier KUP with the bugs
that cause the incorrect DIMM(s) to be identified. The same decode process is correct
for other forms of CE errors as well. This procedure is also valid for Sun Blade 1000 and 2000
systems, which use the same motherboard, CPU modules, and Excalibur architecture.
WARNING: correctable error from pci0 (safari id 8) during
DVMA read transaction
Transaction was a block operation.
dvma access, Memory safari command, address 00000000.3a73e550,
owned_in asserted.
AFSR=40000000.48400098 AFAR=00000000.3a73e550,
quad word offset 00000000.00000001, Memory Module J0100 J0202
J0304 J0406 id 8.
syndrome bits 98
mtag 0, mtag ecc syndrome 0

Let's first look at the AFSR. Bits 8 - 0 comprise the system-bus or L2 cache data ECC syndrome
(Esynd). In this example it breaks down as follows:

AFSR=40000000.48400098
.................../^\
................../ | \
................./ | \
............ 0000 1001 1000 = low 12 bits (last three hex digits) of the AFSR above
.......bits.... 8 7654 3210 = 098
ECC syndrome (Esynd) = 098
x coordinate = 8
y coordinate = 09
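
If you prefer not to do the binary conversion by hand, the low 9 bits can be pulled out of the low
word of the AFSR with bc (a minimal sketch; 200 hex is 2^9, so the modulo keeps bits 8 - 0):

foobar# echo "obase=16; ibase=16; 48400098 % 200" | bc
98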

Using the truth table below, find where the "y" value (vertical left margin) and the "x" value
(horizontal top margin) intersect to find the data bit value. In this example Esynd 098 decodes to
data bit 114. Notice that some of the values are not single data bits but instead are single ECC
check bits or multibit errors, as described below. The following procedure assumes the table
decoded to a single data bit or ECC check bit in error, i.e. a Correctable Error (CE).

0-127   Data bits
C0-C8   ECC check single bit error, check bit 0-8
M2      Probable double bit error within a nibble
M3      Probable triple bit error within a nibble
M4      Probable quad bit error within a nibble
M       Multibit error

This procedure assumes there is a data bit in error. If the value you have decoded is not a data
bit, you can replace all DIMMs in the group, or place the /etc/system settings listed at the end of
this appendix in the hope that errors naming the specific DIMM will be reported. If the system
panics and leaves a core, see
http://pts-americas.west/vsp/wgs/products/littleneck/SCAT.html to use the
“scat” utility to look for a bad DIMM.

Next, break out the low bits of the AFAR in the same way:

AFAR=00000000.3a73e550
.................../^\
................../ | \
................./ | \
............ 0101 0101 0000 = low 12 bits (last three hex digits) of the AFAR
..........bits 98 76 = 0101
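
The same extraction can be done with bc (a sketch; 400 hex is 2^10 and 40 hex is 2^6, so this prints
bits 9 - 6 in binary, with the leading zero dropped):

foobar# echo "obase=2; ibase=16; 3A73E550 % 400 / 40" | bc
101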

As the procedure says, you want to use bits 9 - 6 with the LM (lower mask) value to
determine the logical bank. In this example you can see that bits 9 - 6 are 0101. The type of
interleaving used (2-way or 4-way) determines which of the bits 0101 are used and which are
don't-care bits, as shown in Table 3-6. To go any further at this point you must look at
the output of “prtdiag -v” to determine what the interleaving factor is.

# prtdiag -v
...
Memory size: 2048 Megabytes
================ Memory Configuration============================
Logical Logical Logical
MC Bank Bank Bank DIMM Interleave Interleaved
Brd ID num size Status Size Factor with
---- --- ---- ------ -------- ------ ---------- -----------
CA 0 0 512MB no_status 256MB 2-way 0
CA 0 2 512MB no_status 256MB 2-way 0

The type of interleaving used is a function of the number of DIMMs in the system as well as the
size of the DIMMs. When 2-way interleaving is shown in the “prtdiag -v” output, it will also be
clear whether there is only one group of DIMMs present (ONLY logical banks 0/2 or 1/3 will be
listed), or two groups of different-sized DIMMs present (all four logical banks listed).

In the case of two groups of different-sized DIMMs, it is necessary to use the upper bits of the
AFAR address, together with the output of OBP initialization with diag-level=max (which shows
where the two groups of address ranges start), in order to determine whether the error occurred
while addressing Group 0 or Group 1 memory. An example of the OBP messages to look for, and
of prtdiag output from a system with two different DIMM sizes in the two groups:

@(#)OBP 4.10.1 2003/04/09 10:56 Sun Fire 280R


...
<POST and OBP initialization messages>
...
Memory Configuration:
Segment @ Base: 0 Size: 2048 MB ( 2-Way)
Segment @ Base: 80000000 Size: 1024 MB ( 2-Way)
...
<pci probing>
banner

# prtdiag -v
...
Memory size: 3072 Megabytes
================ Memory Configuration=====================
Logical Logical Logical
MC Bank Bank Bank DIMM Interleave Interleaved
Brd ID num size Status Size Factor with
---- --- ---- ------ -------- ------ ---------- -----------
CA 0 0 512MB no_status 256MB 2-way 0
CA 0 1 1024MB no_status 512MB 2-way 1
CA 0 2 512MB no_status 256MB 2-way 0
CA 0 3 1024MB no_status 512MB 2-way 1

The above would be interpreted as follows: AFAR addresses from 0 to 0.7fffffff fall in the 2048MB
segment composed of the two 1024MB logical banks (banks 1 and 3), and addresses from
0.80000000 and higher fall in the 1024MB segment composed of the two 512MB logical banks
(banks 0 and 2). It follows from the table below that logical banks 0 and 2 are Group 0 DIMMs and
logical banks 1 and 3 are Group 1 DIMMs, so if our example system had these two groups of
different-sized DIMMs, the AFAR of 0.3a73e550 would be in the first range and we would be
looking at Group 1 memory.

Going back to the original example: the system is using 2-way interleaving and only has Group 0
memory installed per the “prtdiag” output, so the upper three of the four bits are don't-care bits,
0101 becomes xxx1, and the lower mask (LM) is 1. Per the table below, LM 1 with 2-way
interleaving corresponds to Logical Bank 2 or 3; since only banks 0 and 2 are present, this is
Logical Bank 2.

In the case of all 8 DIMMs being the same size with 4-way interleaving shown in “prtdiag -v”, use
the following table of LM bits to determine the appropriate logical bank.

Logical Bank   Lower Mask LM          LM b'
               (2-way interleaving)   (4-way interleaving)
     0                 0                      00
     1                 0                      01
     2                 1                      10
     3                 1                      11

So based on the following chart our Logical Bank 2 is part of Group 0.

Group 0     Logical Banks    Physical Bank
J0100       Bank 0 & 2       Bank 0
J0202       "                "
J0304       "                "
J0406       "                "
Group 1
J0101       Bank 1 & 3       Bank 1
J0203       "                "
J0305       "                "
J0407       "                "

Now that we have the data bit value of 114 and know to look in the Group 0 DIMMs, we can
proceed. Find the data bit value of 114 in the left margin of the table below. Looking at the columns
to the right, you will see the value D[16].

From Figure 3-11 above, we can see that DRAM D[16] is located on DIMM1. From the chart below,
DIMM1 in Group 0 is location J0202.

Group 0     Logical Banks    Physical Bank
J0100       Bank 0 & 2       Bank 0          DIMM0
J0202       "                "               DIMM1
J0304       "                "               DIMM2
J0406       "                "               DIMM3
Group 1
J0101       Bank 1 & 3       Bank 1          DIMM0
J0203       "                "               DIMM1
J0305       "                "               DIMM2
J0407       "                "               DIMM3

If the Esynd value you have decoded is not a single bit error, i.e. “M2”, “M3”, “M4” or “M”, you
have what is considered an Uncorrectable Error (UE), which leaves only a few choices. Once the
AFAR has been decoded to the bank of 4 DIMMs, you can replace all DIMMs in that group,
knowing you will definitely replace the bad one, since the multiple bits in error may be spread
across one or more suspect DIMMs. Replacing the bank of 4 DIMMs is the currently accepted
best practice for these types of errors.

Alternatively, you can upgrade the KUP to 108528-16 or later, or, if you stay on the older KUP,
place the following settings in the “/etc/system” file and reboot:
set ce_verbose=1
set aft_verbose=1
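
After the reboot, the new values can be checked in the running kernel with mdb; a quick sketch,
assuming these tunables are exported under the same names (output format may differ slightly):

foobar# echo "ce_verbose/D" | mdb -k
ce_verbose:
ce_verbose:     1
foobar# echo "aft_verbose/D" | mdb -k
aft_verbose:
aft_verbose:    1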

Stressing the system with SunVTS “ramtest”, power-cycle testing, and physically moving DIMMs
around while running POST with “diag-level=max” may also reveal a specific DIMM, but it may take
a long time, as this relies on seeing a single Correctable Error (CE) that reports against a single
DIMM rather than against the bank of 4 DIMMs. Also, if the system panics and leaves a core, it may
be possible to identify a trend of errors on a particular bad DIMM from the kernel soft error rate
counters, using the "fm" or “scat” utilities.

Appendix C: Device Tree Layout for Sun Fire 280R server
