Troubleshooting Guide
Abstract
The intent of this Troubleshooting Guide is to give Support Services Customer Facing
engineers the ability to understand and recognize known problems on Sun Fire 280R servers,
and to know what information is necessary, and how to gather it, to correctly diagnose certain
classes of difficult problems. The stated goal is that the customer receive the same service
responses and assistance for Sun Fire 280R server cases, regardless of the expertise level of
the Customer Facing engineer they open the case with, or the difficulty of the problem. The
guide is organized into two sections. The first section covers setup of the system in preparation
for enabling diagnostic data to be captured. The second section categorizes common problems
encountered based on the FRU's in the Sun Fire 280R server. It must be realized that for some
problems, even with all possible output data, it can still be difficult to narrow the cause down to
one of several possible FRU types. In these cases, it is useful to apply basic trial-and-error
procedures with a second known-good system and parts, eliminating as many of the suspects
as possible by testing them one at a time.
Motherboard................................................................48
Miscellaneous Issues
Sun Rack 900 (NGR).......................................50
Appendix
Appendix A: Trap Types Table for UltraSPARC III CPU's....51
Appendix B: Manual Decoding of ECC Memory Errors......52
Appendix C: Device Tree Layout for Sun Fire 280R
Server..........................................................61
The serial number is located on the system in two places. There is a label on the rear of the
system, to the left of the PCI slots, which contains the S/N, top-level part number P/N 6xx-xxxx-xx
of the original system configuration, text similar to “Assembled in <Country>” and a barcode.
There is also a label on the front of the system, on the metal immediately below the internal disk
drives, which contains just the S/N and P/N.
The following is a breakdown of valid serial numbers for Sun Fire 280R systems:
Valid serial numbers on the Sun Fire 280R server start at approximately 109xxxxx, as the first
systems shipped were manufactured in week 9 of 2001. Plant codes where Sun Fire 280R
server systems have been assembled are:
Foothill Ranch, CA C
Ashton, UK S
Santa Palomba, Italy Z
Toronto, Canada AD
Kladno, Czech Republic AA
http://pts-americas.west/vsp/wgs/products/littleneck/patches_index.html
The brief page provides a handy table listing just the patches and their synopses. The detailed
page describes the specific bug fixes in each patch revision that affect Sun Fire 280R servers.
The text file version provides the same table in a convenient plain-text form for emailing to
CU's.
In the event a CU refuses to update patches as the first step in troubleshooting their problem,
please make use of the bug lists provided on the detailed page, and the information in this guide,
to make them aware of specifically why they should load the patch and how it will take care of the
issue without needless hardware replacements. Also try to sell them some of the patch
management solutions that Sun provides, such as PatchPro, Patch Manager (Solaris 9), SRS
NetConnect, use of Solaris Management Console, or periodic “flash” JumpStart installs from a
standard image. If there is further bug or other information that the CU still needs to justify the
patch update, please make PTS VSP aware of the situation by opening a PTS VSP Engagement
Task in Radiance, and we will work with the Customer Facing engineer to assist in fulfilling the
CU's specific needs.
Kernel Update Patch (KUP) for Solaris 8 108528-16 (or later) and Solaris 9 KUP 112233-01 (or
later) changes the way in which error events for memory and cache errors are logged. On prior
KUP revisions, only the 256th error counted for each event type would be logged to the console
and “/var/adm/messages”, which is not useful in most cases. To make every error be logged
on older revisions, the “/etc/system” file needs to be modified to include “set ce_verbose = 1”.
With the newer KUP revisions, no additional setting of “/etc/system” parameters is required,
as the default for VSP servers is now to log all error events in “/var/adm/messages”. Also, if the
CU has console logging and/or RSC configured, the CU can choose to have the output sent to
the console device in addition to “/var/adm/messages” by setting the following variables in
“/etc/system”:
set ce_verbose_memory = 2
set ce_verbose_other = 2
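After the next reboot, the values in effect can be confirmed from a root shell. A minimal sketch
using the standard mdb debugger (this assumes a KUP revision on which these variables exist;
the command prints the current decimal value of the variable):
# echo "ce_verbose_memory/D" | mdb -k
# echo "ce_verbose_other/D" | mdb -k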
Versions earlier than Solaris 8 KUP 108528-16 or Solaris 9 KUP 112233-03 may require hand-
decoding of the AFSR and AFAR values to confirm that the Solaris output is identifying the correct bad
DIMM or Bank. In addition, new functionality has been (and is continuing to be) added to the
kernel (KUP-20 & KUP-06) to allow offlining of identified failing DIMMs by preventing pages from
being allocated to the physical memory that DIMM covers. This is designed to allow the system
to continue running until such time as the reported error and failing DIMM can be replaced during
a regular maintenance window. The core of this functionality is expected to be included in
Solaris 8 KUP 108528-24 and Solaris 9 KUP 112233-09 when released, and these will be the
new minimum recommended versions when available.
Future bug fixes and KUP schedules are available internally here:
http://jurassic.eng/shared/ON/patch_docs/data/
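To confirm which KUP revision is currently installed on a system before starting diagnosis,
commands similar to the following can be used (the hostname and revision shown here are only
examples):
# uname -a
SunOS myhost 5.8 Generic_108528-19 sun4u sparc SUNW,Sun-Fire-280R
# showrev -p | grep 108528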
Sun Install Check (SunIC) tool, also uses Explorer and the eRAS database of checks, and is
available for use with new Sun Fire 280R server installations, and/or verification that older
systems are up to current levels to avoid known issues. The externally available production
SunIC version is updated with new checks every 2 weeks. It no longer requires Explorer to be
installed, but does still use Explorer as part of its operation, and is available for separate
download here:
http://wwws.sun.com/software/installcheck/index.html
Console Logging
Console logging is recommended to capture as much information as possible about the system
state if/when important events occur. Systems urgently needing maintenance may not be able to
log messages elsewhere. Examples of this are when troubleshooting POST failures of critical
components, Fatal Reset errors and RED State Exceptions. In these conditions, either Solaris
has not yet started, or the Solaris operating environment terminates abruptly, and although it
sends messages to the system console, the operating environment software does not log any
messages in traditional file system locations like the “/var/adm/messages“ file.
The error logging daemon, syslogd, automatically records various system warnings and errors in
the “/var/adm/messages“ files. By default, many of these system messages are also
displayed on the system console and stored in the “/var/adm/messages“ file. You can direct
where these messages are stored or have them sent to a remote system by setting up system
message logging. For more information, see "How to Customize System Message Logging" in
the System Administration Guide: Advanced Administration, which is part of the Solaris System
Administrator Collection.
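As a minimal illustration (the loghost name is hypothetical), a line similar to the following in
“/etc/syslog.conf” forwards kernel notices and error messages to a remote log host in addition
to the local logging; the selector and action fields must be separated by tabs:
*.err;kern.notice;auth.notice	@loghost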
RSC provides an event log of RSC-defined events that have been detected, e.g. fan tray failures
or system resets/power state changes, accessible with the “loghistory” command. RSC also logs
the system console when configured (including POST diagnostics if optionally configured), using
four logs, accessible through the “consolehistory” command. For more details on these
commands, see the RSC documentation linked below. The four logs are named “boot”, “run”,
“boot-old” and “run-old”. The “boot” log captures console output from power-on up to the point
where Solaris starts to boot, at which point logging cycles over to the “run” log; the “-old” logs
hold the output from the immediately prior boot/run cycle. Each log can be individually viewed
with “consolehistory”, and all four logs should be examined when troubleshooting errors logged
on the console. In some failure
situations, a large stream of data is sent to the system console. Because RSC log messages are
written into a "circular buffer" that holds 64 Kbytes of data, it is possible that the output identifying
the original failing component can be overwritten. If it is possible for the customer to configure
RSC and connect that to a logging console, that would be the ideal situation.
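For example, from the RSC command shell the event log and the console logs described above
can be displayed with commands of the following form (exact options vary by RSC version; see
the RSC documentation):
rsc> loghistory
rsc> consolehistory boot
rsc> consolehistory run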
For more information on RSC, refer to the PTS TOI (updated February 2003), available in PDF
or StarOffice here:
http://pts-americas.west/vsp/wgs/products/littleneck/RSC/
The RSC software and documentation are available on the Solaris Supplemental CDROM in the
Solaris 8 or 9 Media Kit, or downloadable from here:
http://www.sun.com/servers/rsc.html
OpenBoot (OBP) Firmware should also be updated regularly and kept up to the latest available
version. To help facilitate this, and reduce maintenance window downtime, the later versions of
firmware patch include a shell script to update the firmware directly from a normal running
Solaris. The patch also includes the older method that requires booting from the special binary
update file.
Keeping up to date on OBP firmware ensures always having and using the latest POST and
OBDiag diagnostic testing components, as well as correction of diagnosability output, OBP
behaviour and initialization bugs. POST and OBDiag tests are continuously being improved to
identify newly discovered hardware failure modes, as well as to provide better diagnostic error
reporting. For Sun Fire 280R server's, POST and OBDiag versions are typically the same as the
OBP version number; however, a deviation did occur with the OBP 4.5.19 and 4.5.21 versions,
which include POST 4.7.4 to take advantage of the new FPU test that detects the same type of
problem the CPU Diagnostic Monitor will identify, as described in SunAlert 55081. As of Aug
29th 2004, the current OBP, POST and OBDiag version is 4.13.0, delivered in patch 111292-17.
A future version of OBP may change these settings to be the defaults. This will significantly
increase boot time, particularly on a system with 8GB of memory, where diag-level=max memory
tests take ~36 minutes with 1.2GHz CPU's, and up to 50 minutes with 750MHz CPU's. If this
increased boot time poses a problem for a customer who regularly reboots or otherwise resets
the machine, or who is experiencing software-induced panics, it may be preferable to leave
diag-switch?=false. In these cases, diagnostics may be enabled temporarily for one-time runs,
using either the system keyswitch turned to the DIAG position, or the RSC “bootmode diag”
command. This is recommended any time the system is powered on e.g. after any hardware
change, or power outage. If the system then develops what is suspected to be a hardware
problem, then enable diagnostics by setting the variable “diag-switch?=true” after the first failure,
to ensure any subsequent failures report verbose full messages and run through max level
POST.
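When permanently enabling diagnostics after a first failure, as described above, the OBP
variables can be set at the ok prompt similar to the following sketch (diag-level max gives the
longest but most thorough POST run):
ok setenv diag-switch? true
ok setenv diag-level max
ok reset-all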
This is normally a task that would be completed just prior to placing a system into the production
environment.
1. Access the system console. Check that the core dump process is enabled. As root, type the
dumpadm command.
# dumpadm
Dump content: kernel pages
Dump device: /dev/dsk/c0t0d0s1 (swap)
Savecore directory: /var/crash/machinename
Savecore enabled: yes
By default, the core dump process is enabled in the Solaris 8 operating environment.
2. Verify that there is sufficient swap space to dump memory. Type the swap -l command.
# swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t3d0s0 32,24 16 4097312 4062048
/dev/dsk/c0t1d0s0 32,8 16 4097312 4060576
/dev/dsk/c0t1d0s1 32,9 16 4097312 4065808
To determine how many bytes of swap space are available, multiply the number in the blocks
column by 512 bytes per block. Taking the number of blocks from the first entry, c0t3d0s0,
calculate as follows:
4097312 blocks x 512 bytes/block = 2097823744 bytes.
The result is approximately 2 Gbytes is available to capture core dump files.
3. Verify that there is sufficient file system space for storing the core dump files. Type the
df -k command.
# df -k /var/crash/`uname -n`
# df -kl
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c1t0d0s0 832109 552314 221548 72% /
/proc 0 0 0 0% /proc
fd 0 0 0 0% /dev/fd
mnttab 0 0 0 0% /etc/mnttab
swap 3626264 16 362624 81% /var/run
swap 3626656 408 362624 81% /tmp
/dev/dsk/c1t0d0s7 33912732 9 33573596 1% /export/home
5. Type the dumpadm -s command to specify a location for storing the dump files generated by
savecore. See the dumpadm (1M) man page for more information.
# dumpadm -s /export/home/
Dump content: kernel pages
Dump device: /dev/dsk/c3t5d0s1 (swap)
Savecore directory: /export/home
Savecore enabled: yes
Before placing the system into a production environment, it might be useful to test whether the
core dump setup works. This procedure might take some time depending on the amount of
installed memory.
1. Back up all your data and access the system console.
2. Take the core dump using either of the two following methods:
A) If you have the Dump Device setup with dumpadm to be a dedicated device (i.e. not
swap), you can test the dump on the system live using the savecore -L command. This
takes a snapshot of the live running Solaris system, and saves it to the dump device
configured without actually rebooting or altering the system in any way.
B) If you have the Dump Device setup with dumpadm to be the default swap device, you
need to gracefully shut down the system using the shutdown command. Then, at the ok
prompt, issue the sync command.
You should see "dumping" messages on the system console. During this process, you can see
the savecore messages.
SCAT is useful for determining whether a problem can be quickly attributed to hardware, without
having to engage a kernel engineer for a simple hardware issue. An example would be having
only a corefile for data, and examining the msgbuf to find that the server was experiencing UE
errors, which would indicate a hardware issue.
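As an illustration (the crash directory and dump file numbers are examples), the message buffer
saved in a crash dump can also be inspected with the standard mdb debugger, independently of
SCAT, provided the mdb version in use supports the ::msgbuf dcmd:
# cd /var/crash/machinename
# mdb unix.0 vmcore.0
> ::msgbuf
> $q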
Background Information:
Fatal Reset errors and RED State Exceptions are most often caused by hardware problems. In
some isolated cases, software can cause a Fatal Reset error or RED State Exception. Typically,
these are device driver problems that can be identified easily. Information on known problems
can usually be found in SunSolve Online in known bugs and patches, or by contacting the third-
party driver vendor.
Fatal Reset Errors:
Hardware Fatal Reset errors are the result of an "illegal" hardware state that is detected by the
system. A hardware Fatal Reset error can either be a transient error or a hard error. A transient
error causes intermittent failures. A hard error causes persistent failures that occur in the same
way. The following example shows a sample Fatal Reset error alert from the system console
with OBP variable “diag-switch?=true“:
With diag-switch?=false, only one line is reported, which does not provide any diagnostic data.
It is critical to note that the word “reported” means just that: the device(s) listed on this line are
the ones reporting (detecting) the error, and are not necessarily (or normally) the root cause of the error.
Using the below procedures and tools, the above example decodes to a “PERR System Protocol
Error” with an address in I/O Space and “ISAP system request parity error on incoming
address” which is an error on the data being detected by the victim - CPU1. This example is
typical of a problem associated with a suspect IO-Bridge (Schizo) on the motherboard or a faulty
Fatal hardware errors (bus protocol errors and internal errors) are reported in the EMU Error
Status Register (EESR) if the corresponding bit mask bits are 0 in the EMU Error Mask Register
(EEMR).
For each bit in the EESR, there is a corresponding bit in the EMU Shadow Register (ESR) to
allow designers to gain visibility into the error status of the EMU.
The EMU Shadow Register carries out only two functions:
1. Capturing values on the EESR to scannable flops in the ESR
2. Shifting out the captured values through the scan-out port.
A more detailed description of register functions can be found in the SPARC JPS1
Implementation Supplement p.n. 806-6754
The problem is that we do not do a JTAG scan of the CPUs after OBP 4.5.9 because it can
cause the system to hang and not recover. We are not getting all the data that may be needed to
indicate the failing component. Please see bug id 4635979.
Troubleshooting in this situation involves searching for data in other places that may indicate
what has led to the Fatal Reset. Look at the big picture. Are there any indications of errors or
problems in the /var/adm/messages file? Has the server's use or configuration changed recently?
Diagnostics such as SunVTS and POST should be used to see if the fatal condition
can be triggered and then isolated.
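For example, assuming SunVTS is installed in its default location, a test session can be started
from a root shell as follows (the install path may differ by SunVTS release):
# /opt/SUNWvts/bin/sunvts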
RED_State Exceptions:
A RED State Exception (RSE) condition is most commonly a hardware fault that is detected by
the system. A RSE causes a loss of system integrity, which would jeopardize the system if
Solaris software continued to operate. Therefore, Solaris software terminates ungracefully
without logging any details of the RED State Exception error in the /var/adm/messages file,
and all output is only logged to the system console. It is critical to obtain the first RSE output, as
subsequent outputs may show cascading (or looping) RSE errors as a symptom of the original
error. The following example shows a sample RED State Exception error that CPU 0 reported on
the system console. Determining that CPU0 is bad in this example is done by interpreting that
the “Trap Type” (TT) is 070 at all 5 “Trap Levels” (TL). This is a Fast ECC Error, an L2 cache
event local to the reporting CPU module, in this case CPU0, and there were no RSE events
from CPU1, nor memory events in “/var/adm/messages”, occurring prior to the RSE. The OBP
command “.traps” can be run at the ok prompt to determine what trap
types are possible on this system, or for Sun Fire 280R server's these are listed in Appendix A.
Making the determination from trap types to failing component is highly subjective and relies on
reviewing all RSE outputs from one or both CPU modules, any other data available from other
types of errors, and knowledge of what was occurring on the system at the time of the event or
the system's service history.
CPU: 0000.0000.0000.0000
TL=0000.0000.0000.0005 TT=0000.0000.0000.0070
TPC=0000.0000.1014.6654 TnPC=0000.0000.1014.6658
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0004 TT=0000.0000.0000.0070
TPC=0000.0000.1014.667c TnPC=0000.0000.1014.6680
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0003 TT=0000.0000.0000.0070
TPC=0000.0000.1014.6654 TnPC=0000.0000.1014.6658
TSTATE=0000.0099.8008.1400
TL=0000.0000.0000.0002 TT=0000.0000.0000.0070
TPC=0000.0000.1014.64e0 TnPC=0000.0000.1014.64e4
TSTATE=0000.0099.8000.1500
TL=0000.0000.0000.0001 TT=0000.0000.0000.0070
TPC=0000.0000.1007.29a0 TnPC=0000.0000.1007.29a4
TSTATE=0000.0044.8000.1600
In these cases, this information can be manually decoded using the information in the Memory
section below.
The Fatal Reset and RED State Exception outputs can be decoded with tools developed by
PTS EMEA engineers. These tools are available here:
http://cpre-emea.uk/cgi-bin/fatal.pl
Once the tools have been used to decode the output messages, they attempt to interpret any
AFSR and AFAR information or Trap Type information present. It is critical to note that the word
“reported” means just that with regard to Fatal Reset errors. The device(s) listed on that line of
the Fatal Reset error are the ones reporting the error, and are not necessarily (or normally)
the root cause of the error. Note also that correctly decoding these errors relies on the output
provided by OBP 4.10.1 or later firmware, which should therefore be the minimum
recommended firmware patch installed, as detailed in Section 1 above.
To correctly diagnose a Fatal Reset or RSE error, it is necessary to gather as much data as
possible from the system console, and decode these outputs with the tools above and the
information detailed below in the Memory section, as well as using some reference information
contained in InfoDoc 43642, FIN I0954-1, and the Trap Types defined in “SparcV9 Joint
Programming Specification 1: Commonality” and “SparcV9 JPS1: UltraSPARC III Supplement”,
which is copied in Appendix A. Having gathered and decoded all of the data, and considering
the system's complete service history, the most likely FRU causing the error can be determined
accurately.
The mondo mechanism is used in SparcV9 architectures to send an interrupt to one or more
processors. In a multiprocessor system, when "CPU A" wants to interrupt "CPU B", CPU A
sends a mondo interrupt to CPU B. CPU A is the initiator and CPU B is supposed to respond to
the mondo dispatched by CPU A. If CPU B does not respond to the request of CPU A, CPU A
keeps retrying for a specified time.
Once this time limit is reached, a "send mondo timeout" panic is initiated by CPU A. As part of
the panic procedure CPU A will attempt to stop all other CPUs, and it will send an interrupt to all
other CPUs to request this. If some CPUs fail to stop as requested, then CPU A will complain
with "failed to stop" messages; hence a send mondo timeout to CPU B is often accompanied by
a "failed to stop CPU B" message.
There is also a known set of specific SparcV9 processor instructions on the UltraSPARC III
family of CPU's that can trigger a send_mondo_set timeout panic to occur by locking up the CPU
into a state such that it ignores all incoming mondo interrupt requests, thereby guaranteeing the
timeout and subsequent panic will occur.
Background Information:
The “Invalid AFSR” message occurs when the CPU's AFSR register has been corrupted by
another earlier error. Since the nature of this message is extremely misleading it is critical to
have the complete messages file dating as far back as possible, and to ensure all error output is
being appropriately logged to the “/var/adm/messages” file. A specific improvement in Solaris
8 KUP 108528-16 (or later) and Solaris 9 KUP 112233-02 (or later) was added to improve the
diagnosability output of this message type.
Background Information:
The following messages are normally seen on the system console, whenever the system does a
reset, and indicate that the CPU FRUID SEEPROM has been read by OBP and is being used to
initialize the CPU's:
The “1” indicates the CPU present is an UltraSPARC III (Cheetah) module. The “2” indicates the
CPU present is an UltraSPARC III Cu (Cheetah+/Cheetah++) module. A message of this type
will be printed for every CPU present in the system.
Background Information:
ECC errors are usually a result of faulty DIMMs. In some cases ECC
errors are generated by UPA or safari devices writing bad ECC to memory or
from corruption on the datapath.
Serengeti, Starfire and Starcat/kitty all have datapath parity error
detection but the volume systems do not.
An interrupt vector sent from a known Schizo to the reporting CPU: both devices
involved in the transaction are reported in the messages, so it is possible
to say there is a datapath fault between two points.
PIO:
Programmed Input/Output (PIO) is a way of moving data between devices in a computer in
which all data must pass through the processor. Due to the actions of each read/write operation
through the CPU this is slow when compared to DMA operations. PIO on Safari-bus systems is
a direct transfer between a CPU and the IO-Bridge (Schizo), where memory is not involved. The
unique aspect of these error events is that the device writing the data is logged.
DMA:
Direct Memory Access (DMA) is a capability that allows data to be sent directly from an attached
device (such as a disk drive or network driver) to the memory on the computer. The
microprocessor initially sets up the operation then is freed from involvement with the data
transfer, thus speeding up overall computer operation.
Examples:
1. PIO write transaction from a known CPU to the IO-Bridge (Schizo)
Note that these messages will be enhanced in a future revision of KUP under bug 4866710.
From this message we can see that Safari ID 0 (CPU0) was talking to pci0 (Schizo) and a
Correctable Error (CE) event occurred. Additional CE events logged may match the same
Esynd 79 to the same DIMM J# location on both groups of memory banks.
2. PIO write transaction from a known CPU to the IO-Bridge (Schizo), with a subsequent
matching DMA event:
The first two logged events occurred during a PIO write operation. The CPU will have read in
the data and checked the ECC, yet when the data ECC was checked by the pcisch driver it has
detected a correctable error, and logs the safari id 0 (CPU0) of the device that sent the data.
The third error logged was during a DVMA read transaction from memory to the safari id 8
device (Schizo). A CE occurred and this has implicated a memory dimm (J0100). The Esynd
bits 112 are the same as that reported in the writer error above and this is when this memory
was corrupted. The reported DVMA is a symptom of the bad CPU0 who wrote bad data into
memory. Two different PCI busses (A 66MHz & B 33MHz) are both implicated as well as the
memory modules. Not recognizing the meaning of the original PIO write error message could
result in wrong parts replaced. In this example a memory dimm might have been replaced
whereas the CPU was truly bad.
3. PIO read transaction - A read from the IO-Bridge (Schizo) to a known CPU
Nomenclature:
• Each Physical Group of 4 DIMM's contains 2 Logical Banks, each logical bank contributes ½
of the total memory provided by that Group of DIMM's, since all NGDIMM's are double-sided.
• Each side of each DIMM contributes 1/4th of the memory to each logical bank.
• Physical Group 0 contains Logical Banks 0 & 2 on DIMM's J0100, J0202, J0304, J0406
• Physical Group 1 contains Logical Banks 1 & 3 on DIMM's J0101, J0203, J0305, J0407
Do....
Do Install DIMM's in groups of four at a time within the same group.
Do Install at least 4 DIMM's in either GROUP 0 or GROUP 1 for minimum support.
Do Install same size DIMM's in same group for automatic 2-way memory interleaving between
the 2 logical banks in the group.
Do Install same size DIMM's in both groups for automatic 4-way memory interleaving between
the 4 logical banks in both groups.
Do Install the latest Kernel Update Patch (KUP) to ensure correct reporting of memory DIMM
errors
Don't....
Don't Mix any DIMM capacities within the same group, as not all of the memory on the larger
DIMM's would be addressable:
- Larger DIMM's in group will take on identity of smallest DIMM capacity.
- Ability of automatic 2-way memory interleaving will be DISABLED.
Don't Mix third-party DIMM's and Sun supported DIMM's in same group. Actually, third-party
DIMM's of any size are NOT supported and may be the root-cause of the problems. They should
be completely removed until further troubleshooting is completed and the problems resolved.
Notes:
• Although DIMM capacities can differ between GROUP 0 or GROUP 1, automatic 4-way
memory interleaving will be DISABLED.
• The entire memory subsystem is addressable via the CPU0 memory controller, which is only
accessible when CPU0 is installed.
• Special note on third-party 2GB NG-DIMM's – these DIMM's have been purchased and
tested by engineering from the third-party manufacturers and proven to cause a variety of
signal integrity, thermal and power issues. Sun will not ship a 2GB NGDIMM on Sun Fire
280R server's due to lack of an approved and qualified vendor that can manufacture such a
NGDIMM to meet Sun specifications and operate within system cooling and power
requirements.
1. The following excerpt of “prtdiag” output shows a system with only Physical Group 0
populated with 4 x 256MB DIMM's, which gives a total memory size of 1024MB, or 1GB. One
2. The following excerpt of “prtdiag” output shows a system with both Physical Groups
populated, each with 4 x 1GB DIMM's which gives a total memory size of 8192MB, or 8GB.
This shows all 4 logical banks each of 2GB size (½ of the physical group's total memory), 0 &
2 contributed by Group 0 DIMM's, and 1 & 3 contributed by Group 1 DIMM's. Since both
physical groups are populated with the same sized DIMM's, we are able to do maximum 4-
way interleaving.
...
Memory size: 8192 Megabytes
...
====================== Memory Configuration ============================
           Logical  Logical  Logical
      MC   Bank     Bank     Bank        DIMM     Interleave  Interleaved
Brd   ID   num      size     Status      Size     Factor      with
---   ---  ----     ------   ----------  ------   ----------  -----------
 CA    0    0       2048MB   no_status   1024MB   4-way        0
 CA    0    1       2048MB   no_status   1024MB   4-way        0
 CA    0    2       2048MB   no_status   1024MB   4-way        0
 CA    0    3       2048MB   no_status   1024MB   4-way        0
Patches
It is strongly advised to install the latest recommended Kernel Update Patches, as the latest
versions have improvements in memory error message reporting and will aid in diagnosing
memory problems. See the “Patches” information in Section 1 above, for necessary patches to
be configured prior to diagnosing memory errors.
PERSISTENT:
Replace DIMM if 3 or more correctable memory events occur within 24-hour period on
same DIMM.
Jul 28 15:39:33 k1test unix: [ID 356634 kern.notice] 141 Intermittent,
167 Persistent, and 0 Sticky Softerrors accumulated
Jul 28 15:39:33 k1test unix: [ID 340762 kern.notice] from Memory Module
on J0100, Memory controller 0
Jul 28 15:39:36 k1test unix: [ID 596940 kern.warning] WARNING: [AFT0]
10 soft errors in less than 24:00 (hh:mm) detected from Memory Module
J0100
STICKY:
Replace DIMM on the first occurrence.
firefly unix: [ID 356634 kern.notice] 0 Intermittent, 0 Persistent, and
256 Sticky Softerrors accumulated
firefly unix: [ID 340762 kern.notice] from Memory Module on J0100,
Memory controller 0
It is important to recognize that the first two errors in the above output are the result of one single
CE event, as evidenced by the identical errID value. The third error is a subsequent error of the
same type. Each of the messages is tagged with an asynchronous fault tag (AFT) to identify the
data being logged. Continuation messages begin with four spaces. The different AFT tag values
are: AFT0 for correctable errors; AFT1 for uncorrectable errors as well as for errors that result in
a panic; AFT2 and AFT3 are used for logging diagnostic data and other error-related messaging.
The extracts below were taken from the previous example:
– errID is a timestamp of the event. This is very useful for correlating multiple errors that
occurred at the same time
– AFSR and AFAR are the asynchronous fault status and address registers.
On UltraSPARC III (750MHz) CPU's, there is only one AFSR and AFAR recording the most
recent event. On UltraSPARC III Cu (900MHz or faster) CPU's, there are 2 AFSR and
AFAR's recorded. The primary is denoted AFSR/AFAR and records the most recent event.
The secondary is denoted AFSR2/AFAR2 and records the first error event logged. This CPU
enhancement is useful for troubleshooting by identifying the source of the first error.
– Fault_PC is the value of the program counter (PC) at the time of the fault and is dependent
upon the fault type as to whether the value is valid. See below for more information on
decoding these registers.
– Esynd is the ECC syndrome captured and can be used to determine the DIMM within the
Bank in the event of a single-bit correctable error (CE).
– J #### is the identifier of the memory module which corresponds to the faulting address on
the Sun Fire 280R server, in the event of a single-bit correctable error (CE) similar to this one.
In the event of a multi-bit uncorrectable error (UE), the DIMM cannot be identified distinctly,
so the Group is reported as J#### J#### J#### J#### where the DIMM slots for either Group
0 or Group 1 are listed.
– The Solaris software error handling code provides a disposition code as one of Intermittent,
Persistent, or Sticky. The definition of each of these codes is:
– Intermittent means the error was not detected on a reread of the affected memory
location. This can occur for many reasons and should not normally be acted upon.
– Persistent means the error was detected again on a reread of the affected memory
location but the scrub operation corrected it. This is indicative of a potentially failing DIMM
and should 3 Persistent errors occur within 24 hours, the DIMM should be replaced. In
addition, soft errors caused by transient random events, such as cosmic rays, also appear as
Persistent. However, since these events are typically random in nature, it is unlikely that the
error will repeat at the same AFAR address in multiple events, so they are easily separated
from true persistent errors. These random events are part of the reason for the
"3 events on the same DIMM in 24 hours" rule.
– Sticky means the error is likely a hard fault of a failing DRAM device and the DIMM should
be replaced as soon as possible.
The above examples show memory errors that occurred while a CPU was reading/writing to
memory. Similar errors in memory may also occur while the IO-Bridge is reading/writing to
memory, and these typically are of the form “NOTICE: correctable error detected by pci0 (safari
id 8) during DVMA read transaction” and “NOTICE: correctable error detected by pci0 (safari id
1) during PIO write transaction”. See also the section above on “Detecting Bad CPU Writers” for
example outputs of these transactions.
The ECC special syndrome is a flag used to indicate the data was corrupted by a previous
transaction, likely CPU module cache event, and not due to the memory itself. Note that a
message is additionally printed with the special syndrome event, to indicate exactly this “Two
Bits in error, likely from E$”. The 3 special syndromes are caused when the CPU accessing
memory recognizes the other Safari Bus event and “poisons” or flips 2 specific bits, generating
these syndromes. To determine the correct bad part, it is critical to look back through the full
"/var/adm/messages" logs in search of additional events which do not have an Esynd with a special
syndrome but are related and the cause of the special syndrome. It is these additional non-
special syndrome events that may pinpoint which CPU module likely caused this bad data to be
in memory initially. Note the msgbuf contained in any core file generated by the panic usually
does not contain sufficient log history to show the prior event that enables diagnosis to the CPU
module. Also note that the associated events may be logged prior or after the special syndrome
event, and should be related by their errID.
The events to look for associated with each special syndrome event occurring are:
0x003 (ECC Check bits 0 & 1 flipped) - EDU event
0x071 (Data bits 126 & 127 flipped) - CPU or WDU event
0x11c (Data bits 0 & 1 flipped) - BERR event
See InfoDoc 43642 for detailed information on the meaning of these event types.
In the following example, the UE event with Esynd 0x071 on CPU0 may be mis-interpreted as a
bad DIMM in memory Group 0, whereas careful examination of preceding events shows WDU
and UCU events on CPU1 with non-special syndromes. Note also some later INVALID AFSR
events on subsequent UE errors seen by CPU1. Therefore the bad hardware in this example is
CPU1 module.
All Sun Fire 280R systems require Solaris 8 or later, and therefore include the memory scrubber
tuned to the current best practice. The purpose of the scrubber is to read all of physical memory
within 12 hours, and detect correctable errors that may likely turn into transient uncorrectable
errors. The read is done in 8MB pages under kernel protection so any uncorrectable errors that
occur during the operation will not cause a panic. The messages produced when the scrubber
identifies failing bits are different from those reported above, so if the scrubber reports correctable
errors repeating every 12 hours, there is likely a hard error on a DIMM that needs replacing.
1. Move or switch the DIMM's to the opposite bank, and if the problem persists on the same
DIMM slot, then this may be a poor solder joint or other manufacturing defect that is affecting
CPU0 (the memory management unit on Sun Fire 280R server) and its address lines. If the
problem follows the DIMM to the other bank, this might indicate a possible DOA DIMM or
another DIMM is actually causing noise on the bus lines and masking itself as the problem.
These are more difficult to determine and may need a lot of trial and error to identify the truly
bad DIMM.
2. If only CPU0 or only CPU1 is reporting this, it is possibly a problem with a single bit on that
CPU module. This may be traced to a poor solder joint or other manufacturing defect that is
affecting just a single bit unique to that CPU module's connector. These are more difficult to
isolate on a single CPU system.
3. If both CPU's in a 2 CPU system are reporting this then it is possible there is a problem in the
datapath between the CPU's and the memory DIMM. This might indicate a possible DOA
DIMM or another DIMM is actually causing noise on the bus lines and masking itself as the
problem. It is possible for it also to be a problem with the motherboard Safari Bus ASIC's,
though this is unlikely.
Background Information:
This message indicates that OBP memory has been trashed and it is unable to access either its
own instructions and data, or an operation it is performing to initialize a memory or L2 cache
device has failed. This is most commonly seen after a break or XIR has been issued, or after
Solaris has crashed in some manner, and commands run normally at the ok prompt then fail in
this way, since
memory is trashed from the prior crash. In those cases, this should be ignored and OBP reset
with the “reset-all” command. When these errors occur during system initialization following
a reboot, prior to getting to the ok prompt, then there is likely a hardware problem. It is possible
the problem was also detected by POST diagnostics prior to OBP using the bad hardware, but
since Sun Fire 280R server does not support Automatic System Recovery (ASR), there is no
way to offline and prevent OBP from using the bad hardware prior to completing its initialization
where it would report the results of POST and fail to boot.
iii.The following commands provide the current values of the CPU registers and
information on what code most recently ran, that could be used in engineering debug if
v. Repeat steps i, ii and iii to gather the same data for the second CPU.
2. Using the “Interpreting AFSR & AFAR outputs” section below, and the manual ECC decode
procedure in Appendix B (OBP does not do this decoding automatically) to determine which
FRU (DIMM Slot or CPU) is the most likely cause of the problem.
Decoding AFSR's
The AFSR can be decoded with a tool developed by PTS EMEA engineers. This tool is
available here:
http://cpre-emea.uk/cgi-bin/afsr/afsr.pl
In most instances Sun Fire 280R server's output the AFSR as four 16-bit portions (4 x 4 hex
values) separated by periods. Unfortunately the tool requires very specific input, and requires
any AFSR entered to be free of all “.” periods or to contain only a single “.” period separating the
two 32-bit portions (2 x 8 hex values) of the AFSR that make up the 64-bit register. This is due
to the original tool being designed based on error messages from UltraSPARC I and II-based
systems. It is recommended for the purposes of Customer Facing engineers using this tool, to
remove all “.” periods from the output provided by the system. If you don't do this, the tool will
incorrectly decode the input, as it will find the first “.” period and ignore the last 2 x 4 hex values.
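A quick way to strip the periods from the value as reported by the system, before pasting it into
the tool, is a simple shell one-liner:
# echo "0008.0000.0000.0000" | tr -d '.'
0008000000000000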
When entered into the tool as supplied by the system, the AFSR “0008.0000.0000.0000” is
decoded as follows:
AFSR: 0x800000000
This would typically indicate an L2 cache problem local to the CPU module reporting it.
When entered into the tool with all “.” periods removed, the AFSR “0008000000000000” is
decoded as follows:
AFSR: 0x8000000000000
Note the additional 0's that make up the full 64-bit register value. This error would typically
indicate a problem on the system bus while the CPU was requesting data/instructions from the
other CPU, memory, or the IO-Bridge. The AFAR could be used to help narrow down
specifically which part of the system bus was being accessed at the time, using the fixed address
ranges listed below.
Note that the decoder tool requires selecting the appropriate device type, since the different CPU
and IO-Bridge devices have different meanings for each error status bit stored in the AFSR of
that device type. On Sun Fire 280R server, the 3 devices that are used are:
Cheetah – UltraSPARC III (750MHz)
Cheetah+ - UltraSPARC III Cu / III+ (900MHz, 1015MHz, 1200MHz CPU's)
Schizo – PCI IO-Bridge
Once the tool has been used to correctly decode the AFSR into an error type, InfoDoc 43642 in
conjunction with FIN I0954-1, and the corresponding AFAR should be used to narrow down
which FRU is the suspect cause of the error. Where an Esynd # is given, and the KUP version
on the system is earlier than 108528-16 (Solaris 8) or 112233-01 (Solaris 9), it is useful to follow
the procedure for manually decoding the Esynd, AFSR and AFAR down to a specific DIMM or
bank, described in Appendix B, and additionally available here:
http://pts-americas.west/vsp/wgs/products/littleneck/excalibur.mem.pdf
Decoding AFAR's
The Sun Fire 280R architecture is the same as the Sun Blade 1000/2000 architecture originally
described in the “Excalibur Architecture Manual v1.0”. This is available for download from a
number of internal websites, including the PTS Americas website here:
http://pts-americas.west/vsp/desktop/products/excalibur/excal_architecture_manual_1.0.pdf
Memory AFAR's:
Cacheable Memory lies in the 0x0 through 0x3ff.ffff ffff address space. Any AFAR in this range
may be an address in physical memory, or in physical cache. No distinction between the two is
possible from the AFAR alone, but such a distinction can be drawn based on the AFSR error type
that is flagged with
the AFAR. The memory address space is initialized by OBP which sets up the interleaving
pattern and prints out the ranges being used according to the physical memory present in the
system. This is printed only when OBP parameter “diag-switch?” is set “true” which is the
current variable that affects OBP output verbosity. Note that this may change in a future OBP
version. The message to look for on Sun Fire 280R server is:
Membase: 0000.0000.0000.0000
MemSize: 0000.0000.4000.0000
This indicates that 1GB (0x4000.0000 bytes) of memory address space has been allocated starting at address 0x0.
Since Sun Fire 280R server has only 1 memory controller active (CPU0), this is relatively simple
to understand. Other platforms such as Sun Fire V480 and Sun Fire V880 require more detailed
output from OBP to determine which CPU/Memory Slot and CPU is associated with which
memory address ranges.
The AFAR can be used only to interpret which bank of memory is the source of the error, in
systems where both memory banks contain DIMMs (i.e. 8 DIMMs). It cannot be used to
determine which DIMM within the bank is at fault. If the error is of correctable type (CE), then the
Esynd # which is part of the AFSR can be used for narrowing to the specific DIMM following the
procedure in the document referenced above in the AFSR section.
The I/O address space is the next 4 Terabytes of address space from 0x400.0000 0000 to
0x7ff.ffff ffff. The address ranges are allocated to the various I/O devices as follows, per the
Excalibur Architecture Manual:
Additional device memory ranges for all the specific onboard devices are defined in Chapter 2 of
the Excalibur Architecture Manual.
UPA64 Space:
0x700 0000 0000 - 0x701 ffff ffff UPA64S Slot 0
0x702 0000 0000 - 0x703 ffff ffff UPA64S Slot 1
0x704 0000 0000 - 0x7ff efff ffff Reserved
Only 0x7fd. and 0x7fe. are used in the Sun Fire 280R server since there is only 1 IO-Bridge. The
7fd corresponds to the 66MHz bus and the 7fe corresponds to the 33MHz bus. The
additional bits in the rest of the address can be used to translate to a device in particular
on the bus if one knows how the data in PCI bus transactions are constructed. If one of
these special addresses is seen in the AFAR, then it is a sign of a failure during a
transaction to a PCI card or onboard device. It is recommended in these cases to rule
BootBus Space:
0x7ff f000 0000 - 0x7ff f00f ffff Motherboard Flash PROM space / PROM Emulator
0x7ff f010 0000 - 0x7ff f0ff ffff PROM Emulator/ Reserved
0x7ff f100 0000 - 0x7ff f7ff ffff Reserved
0x7ff f800 0000 - 0x7ff f8ff ffff Philips I2C controller, PCF8584
0x7ff f900 0000 - 0x7ff f9ff ffff SuperI/O
0x7ff fa00 0000 - 0x7ff faff ffff Serial Lines Controller
0x7ff fb00 0000 - 0x7ff ffef ffff Reserved
0x7ff fff0 0000 - 0x7ff ffff ffff BBC (internal Registers)
1. Determine the build date of a system based on the serial number of the system. See “Sun
Fire 280R server Serial Numbers” in Section 1 above, for information on how to determine
this.
2. Check the FIN's below for information on known firmware upgrade issues. These FIN's are
controlled proactive FIN's and should be carried out on all potentially affected systems as soon
as possible.
3. To confirm a bad disk, there are a few things that can be checked. If the disk was just
replaced, and similar errors from prior to the replacement are continuing, then most likely the
new disk is DOA.
a. Carefully examine the output of the “/usr/bin/iostat -E” command, looking for any error
events that are affecting one of the two disks. Look for non-zero counts on the first, 4th and
5th lines. If both disks have non-zero counts, it could be problems with one disk and
artifacts of that problem on the other disk, so this case would be noticeable if there are
significantly higher error counts on one disk compared to the other. Sample output from a
Sun Fire 280R server's disk drive is:
# iostat -E
<...>
ssd0 Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAN3735F SUN72G Revision: 0704 Serial
No: 0304V87742
Size: 73.40GB <73400057856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 0 Predictive Failure Analysis: 0
<...>
b. It is strongly suggested that the “/var/adm/messages” file be examined for errors. In basic
terms the errors likely to appear are "individual disk" types of error, or "bus or host based"
types.
Errors of this type generally indicate the drive listed needs to be replaced. Notice
that this type of error lists "Vendor" , "Sense Key" , and "ASC/ASCQ" information.
These values will vary with the type of drive error and are explained further in
InfoDoc 14140. To relate the info given above to which "cXtXdX" disk is being identified,
match the WWN of w21000004cf966fd5,0 from the error above to the output of the format
command:
# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf966fd5,0
1. c1t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
/pci@8,600000/SUNW,qlc@4/fp@0,0/ssd@w500000e010368268,0
Specify disk (enter its number):
ii. The three examples below are of the "bus or host based" type error. That in no way
implies that a disk could not be at fault.
Example 1. The problem was troubleshot by booting from CDROM and running
"test" from the format analyze menu. By swapping drive positions it was
determined that the drive was failing. Use of the format program is explained
later in this section.
Dec 16 11:57:40 marge qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(0):
Loop OFFLINE
Dec 16 11:58:43 marge qlc: [ID 686697 kern.info] NOTICE: Qlogic qlc(0):
Loop ONLINE
Dec 16 11:58:54 marge scsi: [ID 243001 kern.warning] WARNING: /pci@8,
600000/SUNW,qlc@4/fp@0,0/ssd@w21000004cf96a89f,0 (ssd0):
Dec 16 11:58:54 marge SCSI transport failed: reason 'tran_err':
retrying command
Example 2. The message is from "picld" which is the daemon that monitors
environmental data. Notice that both disks are called out in error. The problem was
the internal disk backplane.
a. Use of "probe-scsi-all" from the ok> prompt should usually be the first diagnostic run since
it is not dependent on any operating system to run. All of the disks should be seen.
b. Next, use "obdiag" from the ok> prompt which presents a menu of devices. Set the
environment variables test-args = subtests,verbose,media,bist,iopaths and diag-level =
max, then run the “test-all” command at the obdiag> prompt (example commands are shown after this list).
c. If the drives are seen okay, boot Solaris in single-user mode from either CDROM or
network (“boot cdrom -s” or “boot net -s”). This provides the advantage of using a device
tree image loaded into memory rather than the one loaded on the disk, which is
helpful in isolating problems where Solaris is suspected of being damaged or mis-
configured, as well as allowing swapping of drive positions without worrying about the
effects of the WWN and slot id's. Once you have booted the Solaris image you can enter
the format utility and run some analyze tests.
# format
Searching for disks...done
FORMAT MENU:
disk - select a disk
type - select (define) a disk type
partition - select (define) a partition table
current - describe the current disk
format - format and analyze the disk
repair - repair a defective sector
label - write label to the disk
analyze - surface analysis
defect - defect list management
backup - search for backup labels
verify - read and display labels
save - save new disk/partition definitions
inquiry - show vendor, product and revision
volname - set 8-character volume name
!<cmd> - execute <cmd>, then return
quit
format> analyze
ANALYZE MENU:
It is suggested you choose carefully what tests you will run as some will write over the operating
system. To further refine the running of the tests available in format, use the options available in
the setup sub menu.
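For step b above, the OBP setup would look similar to the following sketch (variable names
taken from that step; previous values can be restored afterwards with set-defaults or by
resetting the individual variables):
ok setenv test-args subtests,verbose,media,bist,iopaths
ok setenv diag-level max
ok obdiag
obdiag> test-all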
a. Consider configuring the “/etc/syslog.conf” file to log messages onto another system as
well as locally. See the “syslog.conf” man page for more details.
For more information on this procedure that should be useful for any regular VxVM
system administration, see InfoDoc 12006.
Seagate Drives
FIN# I0816-1: Seagate ST336605FC 36GB and ST373405FC 73GB drives with firmware 0438
(or below) could be susceptible to label corruption which results in the drive and its data being no
longer accessible. This FIN affects drives in systems built before June 2002. All such disks
should be proactively updated to prevent data and availability loss, prior to disks failing and
needing replacement.
Fix: Install patch 109962-07 and download F/W 0538 or 0638 to Seagate 36GB ST336605FC
disks and 73GB ST373405FC disks.
Fujitsu Drives
FIN# I0963-1: Fujitsu 73GB HDD will not be recognized during a 'boot net' operation on Sun
Blade 2000 or Sun Fire 280R platforms. This FIN affects drives in systems shipped from
approximately July 2002 to January 2003.
Fix: Upon failure, update the affected Fujitsu 73.4GB disk drive (MAN3735FC) having firmware
version 0604 to firmware version 0704 via patch 109962-10. Patch 109962-11 has been
released and has been available since May 15, 2003.
DVD
FIN# I0723-1: Unable to boot Solaris 8 Update 7 (HW 02/02) (or later) DVD-ROM Media from
Toshiba DVD/CDROM. This FIN affects systems shipped prior to November 2001.
Fix: Install Patch 111649-03 for all Toshiba SD-M1401 drives having part number 390-0025-01.
This patch is compatible with Solaris releases 2.5.1, 2.6, 7, and 8.
The Sun Fire 280R server contains four PCI slots on two PCI Busses from the single IO-Bridge
(Schizo) on the motherboard. PCI slot 1 provides option for 64-bit 66MHz 3.3V or 33MHz 5V
cards, and PCI slots 2, 3 and 4 provide option for 64-bit 33MHz 5V cards. All slots accept
universal keyed 3.3V/5V cards. It is recommended best-practice not to place a 33MHz card into
the 66MHz slot if possible, as this will slow the whole bus performance down to 33MHz, thereby
halving the performance of the on-board 66MHz FC-AL disk controller for the internal disks. The
theoretical bandwidth provided by the IO-Bridge is 1.2GB/s maximum throughput between PCI
and the Safari bus. The 66MHz bus provides 8 bytes (64-bit) x 66MHz = 528MB/s maximum
throughput, and the 33MHz bus provides 8 bytes (64-bit) x 33MHz = 264MB/s maximum
throughput, shared between all devices and slots on each bus.
The two PCI busses are designated in the device tree as “/pci@8,600000” where the ,600000
indicates the 66Mhz bus, and “/pci@8,700000” where the ,700000 indicates the 33Mhz bus. The
number 8 indicates the safari agent ID of this component, referring to the IO-Bridge itself.
The following describes the internal devices on each Bus and the slot device numbers, as well
as how to interpret the “Device #” messages that are reported in PCI errors. This is reproduced
from an article created by a PTS EMEA engineer (Mick Mullins) and is also available here:
http://cpre-emea.uk/technotes/showentry.php?id=1108404754
The device numbers assigned by the IO-Bridge (Schizo) to the devices on the two PCI busses are
based on the PCI req/gnt lines from the IO-Bridge to each individual bus device. The following
lists the req/gnt line assignments:
/pci@8,600000 - 66Mhz
PCI 64conn J2301 - EPCI_GNT_0
ISP2200 CONTROLLER – EPCI_GNT_3
/pci@8,700000 – 33Mhz
PCI 64conn J2601 - PCI_GNT_0
PCI 64conn J2501 - PCI_GNT_1
PCI 64conn J2401 - PCI_GNT_2
RIO CONTROLLER - PCI_GNT_3
RIO CONTROLLER - PCI_GNT_4
SYM53C876 SCSI - PCI_GNT_5
Note that the RIO controller supports two PCI req/gnt pairs to minimize DMA latency in the
system. DMA requests from the channel engines are routed to both PCI req/gnt pairs following
arbitration and availability of internal resources. In systems without the second PCI req/gnt pair,
RIO can use the single pair to request the bus.
These tables show how to decode the device #'s that are reported in PCI error messages, based
on the table of used req/gnt lines:
/pci@8,600000 - 66Mhz
Bits 3 2 1 0 Device Type DEVICE #
0 0 0 0 Bus idle
0 0 0 1 PCI slot J2301 DEVICE 0
0 0 1 0 Not used
0 1 0 0 Not used
1 0 0 0 ISP2200 On-board DEVICE 3
/pci@8,700000 – 33Mhz
Bits 5 4 3 2 1 0 Device Type DEVICE #
0 0 0 0 0 0 Bus idle
0 0 0 0 0 1 PCI slot J2401 DEVICE 0
0 0 0 0 1 0 PCI slot J2501 DEVICE 1
0 0 0 1 0 0 PCI slot J2601 DEVICE 2
0 0 1 0 0 0 RIO On-board DEVICE 3
0 1 0 0 0 0 RIO On-board DEVICE 4
1 0 0 0 0 0 SYMB 53c876 On-board DEVICE 5
Note that DEVICE 6 is the SCHIZO chip itself, on both buses. This is indicated in the Schizo
ASIC specs (Sect. 22.4.1.1 PCI Control & Status register (ERR_SLOT bits 55:48))
"/pci@8,700000" 0 "pcisch"
"/pci@8,600000" 1 "pcisch"
“pcisch-1” is the driver instance reporting the PCI error. Checking /etc/path_to_inst file shows us
that the path of instance 1 of the pcisch driver is bound to “/pci@8,600000”, so the error has
occurred on the 66Mhz bus. Using the tables above, the message “PCI error occurred on
device #0” specifically relates to the PCI slot J2301. The message “PCI config space
CSR=0x22a0<received-master-abort>“ indicates that the PBM within the Schizo received
a master abort signal from an external device. This leads to a suspect PCI card in slot J2301.
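The instance-to-path binding can be confirmed directly on the running system; the grep pattern
below is illustrative:
# grep pcisch /etc/path_to_inst
"/pci@8,700000" 0 "pcisch"
"/pci@8,600000" 1 "pcisch"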
FIN #I0722-1:
Due to bug 4482600 in the Schizo ASIC, an interaction between 64-bit and 32-bit cards may
cause a PCI SERR panic. The bug is due to Schizo putting incorrect parity when filling the upper
32-bit data, when 32-bit cards using the lower 32-bits of data are doing long PIO transactions.
Any 64-bit card may legitimately check and detect the bad parity and initiate the PCI SERR
panic. The bug is fixed in hardware in Schizo version 2.4, which has not actually shipped as of
this writing. The hardware fix will be available in motherboard 501-6230-10 or later, which will
contain Schizo version 2.5 or ELE version 1.1+.
It was found through investigation of Sun PCI cards, that this problem only occurred between
Sun PCI graphics cards PGX32 and PGX64, where the other PCI card checking the parity and
asserting SERR was an Emulex Lightpulse FC-AL adapter. To work around this problem, given
that the hardware fix is not available, 2 solutions are possible:
1. Due to the architecture of the Sun Fire 280R server, only one bus can be affected by
this problem, namely when the two interacting cards are installed in two of the three 33MHz slots. Move
one of the 2 cards (either Emulex or PGX) that are causing the problem to the 66MHz slot,
thus isolating the cards on to separate buses where they cannot interact.
2. Apply the workarounds listed in the FIN for the PGX32 and PGX64 cards respectively. This
   requires installing minimum revision patches for these drivers, as well as running specific
   commands to configure OBP firmware and Solaris driver variables to disable the behavior
   that causes the Emulex card to check the bad parity.
Theoretically this problem can occur between any third-party PCI cards as well. Please use the
PTS engagement mechanism to escalate any additional interactions showing behavior similar to
this bug, where workaround 1 above is not possible due to the customer's configuration
requirements and new third-party PCI cards are triggering symptoms similar to this bug and FIN.
Only one other case has been reported and escalated, which occurred between two
application-specific cards. In that case, the customer required three cards of this type
FIN #I0992-1: A small number of Power Distribution Boards may experience a limited thermal
event at the Power Supply 1 connector. This is due to a material issue: the connector is
susceptible to humidity and can be out of dimension, and PS1 was typically the supply installed
at system assembly. An affected PS1 connector will have broken ears and can be identified by
visual inspection. Technically PS0 can be affected too, but this has never been seen in the field
because PS0 is installed at a different assembly location than PS1. This is a reactive FIN:
replace on failure.
The most common problem with the motherboard is bent pins on the CPU connectors due to
improper installation of CPU modules. The pins on the CPU connectors and the motherboard
slots should be inspected very carefully for even the slightest amount of damage or deviation
from the perpendicular alignment they should have. If this appears to be the problem, then the
motherboard and both CPUs should be replaced together to remove all suspect parts from the
system, as any bent pin on one FRU will bend the pins on the other FRU and on any subsequent
FRU that touches that slot or connector. Fatal Resets, RSEs, and repeated DIMM or apparent
CPU problems are very rarely caused by a bad motherboard component itself, so replacing only
the motherboard in these situations is not going to help.
1) Remove the CPU modules by reversing install step 4) below, making sure to alternate
   between the screws every half to full turn of the driver tool.
2) Closely inspect the pins on the motherboard and the CPU modules for damage. Damaged
   pins may be difficult to see. Do not re-use any damaged component.
3) Use torque tool part number 250-1611 and not the ring tool 340-6395.
4) Insert the CPU modules:
   a. Turn both thumbscrews by hand simultaneously to locate the screws in their threads,
      until the screws are finger tight.
   b. Turn one screw a half or full turn clockwise, then turn the opposite screw a half or full
      turn clockwise, using the torque tool provided in the unit.
   c. Repeat the above step until both screws lock and the CPU module is securely in place.
      The torque tool will emit an audible "click" when the screws reach the correct 5
      inch-pound torque specification.
Firmware prints “IDPROM Contents Invalid” and 0's for Ethernet MAC Address
The most common reason this occurs is a prior POST or OBP initialization error that has caused
the system to stop initializing prematurely. As a result, the IDPROM may not have been read
yet, so typing the "banner" command at the ok prompt, or OBP printing the banner itself, gives
this output. Look for previous failures in the console log, such as a CPU or memory error, that
may be diagnosable and may explain why OBP got into this state. Troubleshooting this type of
output should start with running POST with the keyswitch in the DIAG position.
On rare occasions, this may be caused by a bad socketed SEEPROM (NVRAM) chip that did not
program correctly during an OBP update, or that has bent pins (for example, after being
transferred to a new motherboard FRU).
On very rare occasions, this may be caused by OBP bug 4446946, which affected a small
number of very early systems that shipped with OBP version 4.0.46. It is unlikely that any
customer system still has this version of OBP, as the fix for this bug was in the first released
OBP patch with OBP 4.2.2, which was also incorporated onto new motherboards (and new
systems) not long after the Sun Fire 280R server was released. Any failures caused by this bug
are likely to have already occurred well before the writing of this guide.
A new rack kit (-04) is now shipping that adds M6 screws for use with the NGR; it can be
ordered as a FRU for those CUs moving older systems into new racks. The -03 kit made the
slides long enough to fit the 900 rack, while the -04 kit adds the M6 screws.
The part number for the Rack Rail Kit is: 560-2625-04
The trim strips on earlier shipping servers do not fit properly on the NGR rack. Servers with
serial numbers 325xxxx or later have trim strips that fit both 10x32 and M6 screws. The trim
strips serve no functional purpose and are merely decorative. For older systems without the
new trim strips that are being relocated into new racks, either discard the strips and secure the
server with the bare metal, which has holes large enough for both screw types, or bore out the
holes in the plastic with a hand file or knife to make them large enough.
TT Description
000 Reserved
001 Power On Reset
002 Watchdog Reset
003 Externally Initiated Reset
004 Software Initiated Reset
005 RED State Exception
006 ... 007 Reserved
008 Instruction Access Exception
009 Instruction Access MMU Miss
00a Instruction Access Error
00b ... 00f Reserved
010 Illegal Instruction
011 Privileged Opcode
012 unimplemented LDD
013 unimplemented STD
014 ... 01f Reserved
020 FP Disabled
021 FP Exception IEEE 754
022 FP Exception Other
023 TAG Overflow
024 ... 027 Clean Window
028 Division by Zero
029 ... 02f Reserved
030 Data Access Exception
031 Data Access MMU Miss
032 Data Access Error
034 Memory Address not Aligned
035 LDDF Memory Address not Aligned
036 STDF Memory Address not Aligned
037 Privileged Action
038 LDQF Memory Address not Aligned
039 STQF Memory Address not Aligned
03a ... 03f Reserved
040 Asynchronous Data Error
041 ... 04f Interrupt Level 1 - 15
050 ... 05f Reserved
060 Interrupt Vector
061 PA Watchpoint
062 VA Watchpoint
063 Corrected ECC Error
064 ... 067 Fast Instruction Access MMU Miss
068 ... 06b Fast Data Access MMU Miss
06c ... 06f Fast Data Access Protection
070 ... 07f Implementation Dependent Exception
070 Fast ECC Error (UltraSPARC III-only Extension – L2 Cache ECC Error)
080 ... 09f Spill Normal 0 - 7
0a0 ... 0bf Spill Other 0 - 7
0c0 ... 0df Fill Normal 0 - 7
0e0 ... 0ff Fill Other 0 - 7
100 ... 17f Trap Instruction (Ticc)
180 ... 1ff Reserved
http://cpre-emea.uk/technotes/showentry.php?id=1916233701
We will use the following error as an example. The system is a Sun Fire 280R server. The error
is a CE (correctable error) reported by the IO-Bridge (Schizo) on an earlier KUP with the bugs
that cause the incorrect DIMM(s) to be identified. The same decode process would be correct
for other forms of CE errors as well. This procedure is also valid for Sun Blade 1000 and 2000
systems that use the common motherboard, CPU modules and Excalibur architecture.
WARNING: correctable error from pci0 (safari id 8) during
DVMA read transaction
Transaction was a block operation.
dvma access, Memory safari command, address 00000000.3a73e550,
owned_in asserted.
AFSR=40000000.48400098 AFAR=00000000.3a73e550,
quad word offset 00000000.00000001, Memory Module J0100 J0202
J0304 J0406 id 8.
syndrome bits 98
mtag 0, mtag ecc syndrome 0
Let's first look at the AFSR. Bits 8 - 0 comprise the system-bus or L2 cache data ECC syndrome
(Esynd). In this example it breaks down as follows:
AFSR=40000000.48400098
The last three hex digits of the AFSR are 098:
    binary  0000 1001 1000
    bits       8 7654 3210
ECC syndrome (Esynd) = 098
x coordinate = 8 (low hex digit)
y coordinate = 09 (upper hex digits)
Using the truth table, find where the "y" value (vertical left margin) and the "x" value (horizontal
top margin) intersect to find the data bit value. In this example, Esynd 098 decodes to data bit
114. Notice that some of the values are not single data bits but instead are single ECC check
bits or multibit errors, as described below. The following procedure assumes the table decoded
to a single data bit or ECC check bit in error, i.e. a Correctable Error (CE).
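The bit extraction above can be double-checked with a few lines of arithmetic. The following is only an illustrative Python sketch of the Esynd/coordinate split described above; identifying the actual data bit still requires the syndrome truth table:

# Illustrative sketch: extract the ECC syndrome (AFSR bits 8:0) and split it
# into the x/y coordinates used to index the syndrome truth table.
afsr = 0x4000000048400098

esynd = afsr & 0x1FF         # bits 8:0                            -> 0x098
x = esynd & 0xF              # low hex digit (horizontal margin)   -> 0x8
y = (esynd >> 4) & 0x1F      # upper hex digits (vertical margin)  -> 0x09

print("Esynd = %03x, x = %x, y = %02x" % (esynd, x, y))
# Esynd = 098, x = 8, y = 09 -> the truth table gives data bit 114 for this example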
AFAR=00000000.3a73e550
The last three hex digits of the AFAR are 550:
    binary  0101 0101 0000
    bits      98 76
Bits 9 - 6 = 0101
As the procedure says, you want to use bits 9 - 6 with the LM (lower mask) value to determine
the logical bank. In this example you can see that bits 9 - 6 are 0101. The type of interleaving
used (2-way or 4-way) determines which of the bits in 0101 are used and which are don't-care
bits, as shown in table 3-6. To go any further at this point you must look at the output of
"prtdiag -v" to determine what the interleaving factor is.
# prtdiag -v
...
Memory size: 2048 Megabytes
================================ Memory Configuration ================================
           Logical  Logical  Logical
      MC   Bank     Bank     Bank        DIMM    Interleave  Interleaved
 Brd  ID   num      size     Status      Size    Factor      with
----  ---  ----     -------  ----------  ------  ----------  -----------
 CA   0    0        512MB    no_status   256MB   2-way       0
 CA   0    2        512MB    no_status   256MB   2-way       0
The type of interleaving used is a function of the number of DIMMs in the system as well as the
size of the DIMMs. When 2-way interleaving is shown in the "prtdiag -v" output, it will also be
clear whether there is only one group of DIMMs present (ONLY logical banks 0/2 or 1/3 will be
listed) or two groups of different-size DIMMs present (all 4 logical banks listed).
In the case of two groups of different-size DIMMs, it is necessary to use the upper bits of the
AFAR address, together with the output of OBP initialization at diag-level=max (which reports
where the two groups' address ranges start), to determine whether the error occurred while
addressing Group 0 or Group 1 memory. An example of the OBP messages to look for, and the
prtdiag output, from a system with two different-size DIMM groups:
The above would be interpreted as follows: AFAR addresses from 0 to 0.7fffffff are the 2048MB
segment composed of the 2 x 1024MB logical banks, and addresses from 0.80000000 and
higher are the 1024MB segment composed of the 2 x 512MB logical banks. It follows from the
table below that logical banks 0,2 are Group 0 and logical banks 1,3 are Group 1 DIMMs. So in
the example above, if there were two sets of different-sized DIMMs, the AFAR (0.3a73e550)
falls in the first range, and therefore we would be looking at Group 1 memory.
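For the two-group case, the segment check is just an address comparison. A small illustrative Python sketch, using the segment boundary from the example interpretation above; the actual boundaries must come from the OBP diag-level=max output for the system in question:

# Illustrative sketch: decide which memory segment the AFAR falls in. The
# 0x80000000 boundary is taken from the example above; real systems report
# their segment base addresses in the OBP diag-level=max initialization output.
afar = 0x3a73e550

if afar < 0x80000000:
    print("AFAR is in the first segment (2048MB, 2 x 1024MB logical banks)")
else:
    print("AFAR is in the second segment (1024MB, 2 x 512MB logical banks)")

Which DIMM group that segment corresponds to then follows from the logical bank to group mapping described above.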
Going back to the original example: the system is using 2-way interleaving and only has Group 0
memory installed per the "prtdiag" output, so the upper three bits are don't-care bits. That
makes 0101 become xxx1, which is lower mask (LM) 1, and lower mask 1 equates to Logical
Bank 2.
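The lower-mask step for the original example (2-way interleaving, Group 0 DIMMs only) can be sketched the same way. This is illustrative Python only; the LM-to-logical-bank mapping shown is the one used in this example and should be confirmed against the LM tables referenced in this procedure:

# Illustrative sketch of the lower-mask (LM) step for the original example:
# 2-way interleaving with Group 0 DIMMs only, so only AFAR bit 6 is significant.
afar = 0x3a73e550

lm_bits = (afar >> 6) & 0xF     # AFAR bits 9:6 -> 0b0101
lm = lm_bits & 0x1              # bits 9:7 are don't-care here -> LM = 1

# Mapping used in this example (Group 0 only, 2-way): LM 0 -> Logical Bank 0,
# LM 1 -> Logical Bank 2.  Confirm against the LM tables for other configurations.
logical_bank = {0: 0, 1: 2}[lm]
print("LM = %d -> Logical Bank %d" % (lm, logical_bank))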
In the case of all 8 DIMMs being the same size with 4-way interleaving shown in “prtdiag -v”, use
the following table of LM bits to determine the appropriate logical bank.
So going back to our chart we can see that DIMM1 in Group 0 is location J0202.
Alternatively, you can upgrade the KUP to 108528-16 or later, or, if you stay on the older KUP,
place the following settings in the /etc/system file and reboot:
set ce_verbose=1
set aft_verbose=1
Stressing the system with the SunVTS "ramtest", power-cycle testing, and physically moving
DIMMs around while running POST with diag-level=max may also reveal a specific DIMM, but it
may take a long time, as this relies on seeing a single Correctable Error (CE) that reports
against a single DIMM rather than against the bank of 4 DIMMs. Also, if the system panics and
leaves a core, it may be possible to identify a trend of errors on a particular bad DIMM from the
kernel soft error rate counters, using the "fm" or "scat" utilities.