
Zimbra: rlopez@gcs-iberia.com

http://127.0.0.1:7633/desktop/login.jsp?at=67ecab1f-91be-4cdf-ac2f-3...

Zimbra Collaboration Suite

rlopez@gcs-iberia.com

Re: Incident # 71978902 - M5000 down -- Snapshot --

Saturday, 14 November 2009 5:50:13

From: jmartinez@gcs-iberia.com
To: Angel.Rico@sun.com
CC: ooh-spain-field@sun.com
Attachments: sech43edaeor13-xscf0_10.235.33.215_2009-11-14T04-35-39.zip (6608.2KB)

Snapshot attached.

----- Original message -----
From: "Angel Rico" <Angel.Rico@Sun.COM>
To: "Jesus Martinez Fernandez" <jmartinez@gcs-iberia.com>
Sent: Saturday, 14 November 2009 5:14:55 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: Incident # 71978902 - M5000 down

Hello Jesús.

I am calling your mobile right now. Attached is the document for clearing faulted components. I assume it is the one you have been applying.

=================================
Document Audience: INTERNAL
Document ID: 209792
Old Document ID: (formerly 89850)
Title: Sun SPARC Enterprise Mx000 (OPL) Servers: Fault clearing and LEDs behavior
Copyright Notice: Copyright 2009 Sun Microsystems, Inc. All Rights Reserved
Update Date: Tue Aug 26 00:00:00 MDT 2008

Solution Type: Technical Instruction

Solution 209792 : Sun SPARC Enterprise Mx000 (OPL) Servers: Fault clearing and LEDs behavior

Related Categories * Home>Product>Systems>Servers

Description

Fault Management Architecture on the Sun SPARC[TM] Enterprise Mx000 (OPL) Servers

The implementation of the Fault Management Architecture (FMA) on the Sun SPARC Enterprise Mx000 (OPL) Servers is complex. The goal of this document is not to describe how FMA behaves on these servers, but to help identify and display the faults reported on the components of these platforms, and to explain how and when these faults can be cleared and how and when the fault LEDs are turned ON or OFF.

Faults on Sun SPARC Enterprise Mx000 (OPL) Servers :

FMA is available on the Sun SPARC Enterprise domains running Solaris[TM] 10. FMA has also been ported to run on the Service Processor, where the fmd daemon runs as part of the XSCF software. XSCF on the Service Processor and Solaris on the domain exchange ereports and faults via the "Event Transport Module" (ETM), which runs on top of the "Domain to Service Processor Communications Protocol" (DSCP).

When a fault is diagnosed, the system is able to identify one or several suspects, depending on the nature of the fault. The suspect or the list of suspects can be

17/11/2009 11:40

displayed using the 'fmdump' command on XSCF.

Example where a list of suspects has been identified :

    XSCF> fmdump -v
    May 25 16:02:53.0556 6070a711-49ad-4b23-a172-5524274deceb SCF-8001-KC
      66% upset.chassis.SPARC-Enterprise.io.disk.boot
          Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0/disk=1
          Affects:
          FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=sandc3-1-0/component=/IOU#8/HDD#1
      33% upset.chassis.SPARC-Enterprise.io.disk.boot
          Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0
          Affects:
          FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=sandc3-1-0/component=/IOU#8/PCI#0/IOUA

Example where only one suspect has been identified :

    XSCF> fmdump -v -u 7d1b6fac-ff1f-4d3d-afff-faf6c0a2ed07
    Jun 15 02:53:32.1628 7d1b6fac-ff1f-4d3d-afff-faf6c0a2ed07 SCF-8005-PX
      100% upset.chassis.domain.panic
          Problem in: hc:///chassis=0/domain=0
          Affects:
          FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=sandc3-1-0/component=CHASSIS

The Knowledge Articles available at http://www.sun.com/msg ask for the output of 'fmdump -m' to be collected. The "-m" option is available only on the XSCF (not on the Solaris domain) and displays the Fault Manager syslog message contents for the event(s).

Example :

    XSCF> fmdump -m
    T-TIME: Fri Apr 13 08:06:05 PDT 2007
    PLATFORM: SPARC-Enterprise, CSN: BE80601000, HOSTNAME: san-dc3-1-0
    SOURCE: sde, REV: 1.12
    EVENT-ID: cfcd90f3-5988-4707-ba8e-fdd03d417fc3
    DESC: An internal fatal error within a strand on a CPU chip was detected. Refer to http://www.sun.com/msg/SCF-8000-EQ for more information.
    AUTO-RESPONSE: The domain using this CPU will be reset and the strand will be deconfigured.
    IMPACT: The domain using this CPU chip is reset.
    REC-ACTION: Schedule a repair action to replace the affected Field Replaceable Unit (FRU), the identity of which can be determined using fmdump -v -u EVENT_ID. Please consult the detail section of the knowledge article for additional information.
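The suspect lists printed by 'fmdump -v' can also be extracted by a script once the output has been saved from the XSCF. The following is only a minimal Python sketch: the function name parse_fmdump_v is hypothetical, and the line layout of the sample (one header line per event, then one "NN% class" line per suspect followed by its "Problem in:" and "FRU:" lines) is an assumption reconstructed from the example above, so it may need adjusting against real output.

```python
import re

# Assumed layout: "<time> <uuid> <msg-id>" header, then per-suspect lines.
EVENT_RE = re.compile(
    r"^(?P<time>\w{3} +\d{1,2} [\d:.]+) (?P<uuid>[0-9a-f-]{36}) (?P<msgid>[A-Z0-9-]+)$"
)
SUSPECT_RE = re.compile(r"^(?P<pct>\d+)%\s+(?P<cls>\S+)$")

SAMPLE = """\
May 25 16:02:53.0556 6070a711-49ad-4b23-a172-5524274deceb SCF-8001-KC
  66% upset.chassis.SPARC-Enterprise.io.disk.boot
      Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0/disk=1
      Affects:
      FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=sandc3-1-0/component=/IOU#8/HDD#1
  33% upset.chassis.SPARC-Enterprise.io.disk.boot
      Problem in: hc:///chassis=0/iou=8/pcislot=0/ioua=0/pci_br=0/sas=0
      Affects:
      FRU: hc://:product-id=SPARC-Enterprise:chassis-id=BE80601000:server-id=sandc3-1-0/component=/IOU#8/PCI#0/IOUA
"""


def parse_fmdump_v(text):
    """Return a list of events, each with its suspects in listed order."""
    events = []
    for raw in text.splitlines():
        line = raw.strip()
        header = EVENT_RE.match(line)
        if header:
            events.append({**header.groupdict(), "suspects": []})
            continue
        if not events:
            continue  # skip anything before the first event header
        suspects = events[-1]["suspects"]
        suspect = SUSPECT_RE.match(line)
        if suspect:
            suspects.append({
                "certainty": int(suspect["pct"]),
                "class": suspect["cls"],
                "problem_in": None,
                "fru": None,
            })
        elif suspects and line.startswith("Problem in:"):
            suspects[-1]["problem_in"] = line.split(":", 1)[1].strip()
        elif suspects and line.startswith("FRU:"):
            suspects[-1]["fru"] = line.split(":", 1)[1].strip()
    return events


if __name__ == "__main__":
    for event in parse_fmdump_v(SAMPLE):
        print(event["msgid"], event["uuid"])
        for s in event["suspects"]:
            print("  %3d%% %s" % (s["certainty"], s["fru"]))
```

The first entry of each suspect list is the one the article treats as the first (primary) suspect; the sketch keeps the listed order and does not re-sort by percentage.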
Only users with the platop, platadm, or fieldeng privileges can run the 'fmdump' command.

The information about the faulty status of the components is available in the CMEM database on XSCF. Based on the certainty of the fault, the following flags are set in the CMEM :

* CFF => Certainly Faulty Flag
* UFF => Uncertainly Faulty Flag

Every FRU that is considered a suspect in the list has the uncertain_secondary_status set, but only the primary suspect may also have the CFF or UFF bit set. As a consequence, depending on whether the CFF or UFF bit is set, the primary suspects in a suspect list are reported as "faulted" (completely broken/not working) or "degraded" (should be replaced, but is still working with some limitations) in the 'showhardconf' and 'showstatus' outputs. The secondary suspects show up as "degraded". Some components may be reported as "deconfigured" if they are victims of a fault detected on another component.

As part of FMA ported on XSCF, the 'fmadm faulty' command is only available in Escalation mode. The resource cache has less information about platform faults than the CMEM database: CMEM, not the resource cache, is the database that holds the real information about faulty FRUs for OPL. Therefore, the 'showstatus' command is the preferred method for the customer and the field, as it provides an accurate picture of the faults currently present.

The 'showstatus' and 'showhardconf' commands are available to users with the following privileges : useradm, platadm, platop, fieldeng

Example for the showstatus and showhardconf commands reporting a chip on a CPU Module as "faulted" :

    XSCF> showstatus
        CMU#1 Status:Normal;
    *       CPUM#2-CHIP#0 Status:Faulted;

    XSCF> showhardconf -M
    SPARC Enterprise M9000;
        + Serial:BE80601000; Operator_Panel_Switch:Locked;
        + Power_Supply_System:Dual-3Phase; Ex:Dual-3Phase; SCF-ID:XSCF#0;
        + System_Power:On;
    [output omitted]
        CMU#1 Status:Normal; Ver:0101h; Serial:PP0642Z470 ;
            + FRU-Part-Number:CA06620-D001 A8 ;
            + Memory_Size:64 GB;
            CPUM#0-CHIP#0 Status:Normal; Ver:0201h; Serial:PP06447337 ;
                + FRU-Part-Number:CA06620-D021 A6 ;
                + Freq:2.280 GHz; Type:16;
                + Core:2;Strand:2;
            CPUM#1-CHIP#0 Status:Normal; Ver:0201h; Serial:PP06447340 ;
                + FRU-Part-Number:CA06620-D021 A6 ;
                + Freq:2.280 GHz; Type:16;
                + Core:2;Strand:2;
    *       CPUM#2-CHIP#0 Status:Faulted; Ver:0201h; Serial:PP06447336 ;
                + FRU-Part-Number:CA06620-D021 A6 ;
                + Freq:2.280 GHz; Type:16;
                + Core:2;Strand:2;
    [output omitted]

Example from an M9000 system where a CMU is reported as degraded because some DIMMs were deconfigured as a consequence of a fault detected on the Memory Address Controller :
    XSCF> fmdump -av
    TIME UUID MSG-ID
    Apr 29 20:03:02.7818 5817837d-6ee9-4ffd-af17-fee44d76da0d SCF-8005-CA
      100% fault.chassis.SPARC-Enterprise.asic.sc.fe
          Problem in: hc:///chassis=0/cmu=6/sc=2
          Affects: hc:///chassis=0/cmu=6/mac=2/bank=0

    XSCF> showstatus
    *   CMU#6 Status:Degraded;
    *       MEM#00A Status:Deconfigured;
    *       MEM#00B Status:Deconfigured;
    *       MEM#01A Status:Deconfigured;
    *       MEM#01B Status:Deconfigured;
    *       MEM#02A Status:Deconfigured;
    *       MEM#02B Status:Deconfigured;
    *       MEM#03A Status:Deconfigured;
    *       MEM#03B Status:Deconfigured;
    *       MEM#10A Status:Deconfigured;
    *       MEM#10B Status:Deconfigured;
    *       MEM#11A Status:Deconfigured;
    *       MEM#11B Status:Deconfigured;
    *       MEM#12A Status:Deconfigured;
    *       MEM#12B Status:Deconfigured;
    *       MEM#13A Status:Deconfigured;
    *       MEM#13B Status:Deconfigured;

For some components considered as primary suspect and certainly faulty (CFF), the Maintenance Action Required bit is set. This information is available in the 'fmdump -V' output.

Example :

    XSCF> fmdump -Ve
    TIME CLASS
    Jun 15 2007 02:48:35.110134400 ereport.chassis.SPARC-Enterprise.cpu.SPARC64-VI.se-offlinereq
    nvlist version: 0
        class = ereport.chassis.SPARC-Enterprise.cpu.SPARC64-VI.se-offlinereq
    [output omitted]
        opl_platform = DC3
        detected-by = ANALYZE
        maintenance-action-required = true
        __ttl = 0x1
        __tod = 0x46726073 0x6908480

Further OPL FMA information can be found at OPL FMA.

Steps to Follow

Clearing Faults on Sun SPARC[TM] Enterprise Mx000 (OPL) Servers

Clearing Faults (Sun SPARC Enterprise Mx000 (OPL) Servers running XCP 1050 or later) :

1. Usual process :

The right course of action when the system has identified faulty components on a platform is to replace the primary suspect. In order to repair a fault with single or multiple FRUs, the typical repair action is :

* Replace the first FRU indicted in the suspect list. If the FRU is a CMU or sub-FRU of a CMU (on M8000/M9000), IOU (on M8000/M9000), FAN, PSU, DDC_A (on M8000), XSCFU (on M8000/M9000), or XSCFU_C (M9000 plus expansion cabinet), use the 'replacefru' command to do so; otherwise, use cold replacement. The fieldeng privileges are required to run the 'replacefru' command.
* For all the other FRUs on the suspect list (secondary suspects), use the 'clearfault' command. The 'clearfault' command returns an error message when trying to clear a FRU that is the first suspect on a suspect list :
      "clearfault: Fault cannot be cleared for this FRU"
  You must be in Service or Escalation mode to run the 'clearfault' command.
* Sun Shared Shell : the Service Engineer should offer the use of Shared Shell to enter service mode and run clearfault. Before the Shared Shell session is terminated, the "disablemodes" command should be run.

2. Complex cases :

In some more complex cases, the context might lead to the decision not to replace the first FRU in the suspect list.
This might be because the diagnosis engine is wrong, because the FRU in a single-FRU indictment seems to be wrong, or because the first FRU has already been replaced and this is a repeat fault with an identical suspect list, etc. The decision not to replace the first FRU in the suspect list MUST be made by Service, and the whole process of clearing a fault without replacing the suspect component must be done under the supervision of TSC. The following sections describe how to handle the complex cases.

2.1 - Power cycle (NFB Off/On) :

For systems running XCP1050, it is possible to clear the fault status for primary suspects by power-cycling the platform while the keyswitch is in the Service position. For systems running XCP1060 and later, faults are not cleared on NFB-on, no matter what the position of the keyswitch.

Note : Whatever the component (with or without FRUID), a power cycle with the keyswitch in the Locked position does not have any effect on the fault status of this component, unless clearfault/clearstatus/clearfru have been invoked previously; see the section below.

2.2 - Commands available to clear the faults :

2.2.1 - clearfru / clearstatus :

These 2 commands can be used to clear the fault information of all the FRUs (clearfru) or the fault information of FRUs that have been detected as faulty units (clearstatus). You must be in Escalation mode to run the clearfru / clearstatus commands. The domains must be down and an immediate platform power cycle is required. The component is reported as faulted as long as the power cycle has not occurred. clearfru and clearstatus must be used only on directions from TSC and/or Engineering.

Example :

    XSCF> showstatus
        CMU#0 Status:Faulted;
    service# clearstatus /CMU#0
    XSCF> showstatus
    No failures found in System Initialization.

2.2.2 - clearfault :

The 'clearfault' command provides a much cleverer way to manage faults for primary suspects and can be used to :

* Immediately clear the fault status for FRUs other than the first suspect in a multi-suspect list,
* Clear the fault for FRUs reported as the first suspect in a suspect list in the following cases :

2.2.2.1 - During the next power-cycle by using 'clearfault [-l]' (no need for the keyswitch to be in the service position). As a consequence, the next power-cycle clears the faults for the components.

Notes :

* If the user invokes 'clearfault -l', there is no attempt to clear the fault status now. The user is informed that the "FRU will be marked to clear fault on next circuit breaker off and on."

  Example :

      service# clearfault -l /IOU#0
      Fault will be cleared after circuit breaker off and on

* If the user invokes 'clearfault' (without -l), the system attempts to clear the status now. If the FRU is a primary suspect and clearfault cannot clear the fault status, it prompts the user to confirm whether the FRU should be marked to clear fault on next circuit breaker off and on, behaving in the same manner as invoking 'clearfault -l' directly.
  Example :

      service# clearfault /IOU#0
      clearfault: Fault cannot be cleared for this FRU.
      FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: y
      Fault will be cleared after circuit breaker off and on

* If clearfault is invoked when another maintenance command is running, it behaves in the same manner as invoking 'clearfault -l' directly.

  Example :

      service# clearfault /IOU#0
      Unable to get maintenance lock
      clearfault: Fault cannot be cleared for this FRU.
      FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]:

2.2.2.2 - M4000/M5000 :

* Immediately clear the fault for the first suspect in a suspect list if the FRU is a PSU or a FAN.

  Example :

      service# clearfault /FAN_B#0
      Testing the hardware...

* For the other FRUs, a power cycle is required to clear the fault.

  Example :

      service# clearfault /MBU_A/MEMB#0/MEM#0A
      clearfault: Fault cannot be cleared for this FRU.
      FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: yes
      Fault will be cleared after circuit breaker off and on

Note : With XCP 1050, there is no way to clear a UFF or CFF fault in a DDCR on a M4000/M5000 IOU using the clearfault command. The only way is to invoke clearfru in escalation mode and power cycle the platform. Reference : CR#6577745. This is fixed in XCP 1060.

2.2.2.3 - M8000/M9000 :

* Immediately clear the fault for the first suspect in a suspect list if the FRU is a PSU, FAN, DDC_A, XSCFU_B or XSCFU_C.

  Example :

      service# clearfault /PSU#0
      Testing the hardware...

* Immediately clear the fault if the FRU is a CMU or IOU, including sub-FRUs like CPUM, DIMM ... and non-FRUs like DDC, SSM ..., without NFB off/on, when :
    o the CMU or IOU (and sub-FRU and non-FRU) is not part of a running domain. See the Examples section below.
    o If the FRU is a CMU (and sub-FRU and non-FRU) or an IOU and is part of a running domain, Dynamic Reconfiguration (deleteboard) must be invoked before running the clearfault command in order to detach the FRU from the domain. After clearing the fault, the component can safely be added back to the domain. See the Examples section below.
    o If the FRU is a CMU (and sub-FRU and non-FRU) or an IOU and is part of a running domain but DR cannot be used, then the domain must be powered off in order to immediately clear the fault status. Otherwise, the fault status will be cleared on next circuit breaker off and on.
See the Examples section below.

2.3 - As a summary, to clear fault on a FRU :

M4000 / M5000 :

* PSU/FAN
    o Power cycle (XCP1050 only) : keyswitch in service position
    o clearfault : immediate
    o clearstatus/clearfru : domains down and requires a power cycle, whatever the keyswitch position
* CPUM/DIMM/IOU/MEMB/OPNL
    o Power cycle (XCP1050 only) : keyswitch in service position
    o clearfault : after a power cycle, whatever the keyswitch position
    o clearstatus/clearfru : domains down and requires a power cycle, whatever the keyswitch position

M8000 / M9000 :

* PSU/FAN/XSCFU/DDC_A
    o Power cycle (XCP1050 only) : keyswitch in service position
    o clearfault : immediate
    o clearstatus/clearfru : domains down and after a power cycle, whatever the keyswitch position
* OPNL
    o Power cycle (XCP1050 only) : keyswitch in service position
    o clearfault : after a power cycle, whatever the keyswitch position
    o clearstatus/clearfru : domains down and after a power cycle, whatever the keyswitch position
* CMU/CPUM/DIMM/IOU/non-FRU
    o Power cycle (XCP1050 only) : keyswitch in service position
    o clearfault, if not part of a running domain :
        + immediate
    o clearfault, if part of a running domain :
        + immediate after DR (deleteboard) + clearfault
        + immediate after powering off the domain + clearfault
        + otherwise after clearfault + NFB Off/On, whatever the keyswitch position
    o clearstatus/clearfru : domains down and after a power cycle, whatever the keyswitch position

For more detail on accessing escalation or service mode, see Sun Document 91335.

LED behaviour :

Each Mx000 system has an Operator Panel (OPNL) with 3 LEDs :

* the Power LED,
* the XSCF Standby LED,
* the Check LED.

When turned ON, the Check LED, aka the System Check LED, indicates a fault on the system. See below.

Most of the FRUs on the SPARC Enterprise servers have a FRU check LED which reports that the unit contains an error. However, some FRUs like DIMMs or CPUMs do not have LEDs. Refer to the SPARC Enterprise Mx000/Mx000 Servers Service Manuals for more information about LEDs.

For Sun SPARC Enterprise servers running a version of XCP later than 1050, the check LEDs are set and reset as below :

* the FRU check LED is set if the FRU is the sole FRU in a suspect list, including sub-FRUs (CPUM, DIMM ...) and non-FRUs (DDC, SSM ...).
* the system check LED is set if any FRU is considered a primary suspect (CFF / UFF) or a secondary suspect, that is, when 'showstatus' reports any FRU as faulty or degraded. This includes IO Box FRUs reported as suspect.

Note that the check LED for the PSUs on the M8000/M9000 may not behave as expected: it may not be set when the PSU is the primary suspect.

Check LEDs behaviour after clearfault, clearstatus, replacefru :

* replacefru :
    o the FRU's check LED is :
        + ON until the maintenance,
        + blinking during the maintenance,
        + OFF as soon as the replacefru has completed successfully.
    o the System check LED is OFF :
        + as soon as the replacefru has completed successfully,
        + and there is no other suspect component left in the system.
* clearfault :
    o the FRU's check LED is turned off as soon as the clearfault command has succeeded in clearing the fault for the FRU.
    o the System's check LED is turned off as soon as the fault status for the last suspect component is cleared. This implies that, in the cases where the fault is only cleared on power cycle, the LED turns off after the subsequent power cycle.
* clearstatus / clearfru : the FRU and System check LEDs remain ON until the next power cycle.

Faults on IOBox :

Faults detected on IOBox are stored in the CMEM and in the FRUID of the IOBox (Status_CurrentR). This information is reported in the showstatus output on the XSCF.

Example :

    XSCF> showstatus
        IOU#4 Status:Normal;
    *       PCI#5 Status:Degraded;
        IOX@X156 Status:Normal;
    *       IOB1 Status:Faulted;
    *       PS0 Status:Degraded;
    *       PS1 Status:Degraded;
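When the same check has to be repeated on several platforms, saved 'showstatus' output can be filtered with a few lines of script. The sketch below is an illustration only: parse_showstatus and needs_attention are hypothetical names, and it assumes, based on the examples in this document, that each component line reads "NAME Status:VALUE;" with a leading "*" on components that are not Normal. It does not try to rebuild the parent/child hierarchy, since the exact indentation of the real output is not known here.

```python
import re

# Optional "*" flag, component name, then "Status:<value>;"
LINE_RE = re.compile(r"^\s*(?P<flag>\*)?\s*(?P<name>\S+)\s+Status:(?P<status>\w+);")

SAMPLE = """\
    IOU#4 Status:Normal;
*       PCI#5 Status:Degraded;
    IOX@X156 Status:Normal;
*       IOB1 Status:Faulted;
*       PS0 Status:Degraded;
*       PS1 Status:Degraded;
"""


def parse_showstatus(text):
    """Return (name, status, flagged) tuples in the order they are listed."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append((match["name"], match["status"], match["flag"] is not None))
    return entries


def needs_attention(entries):
    """Keep only the components that showstatus marked with '*'."""
    return [(name, status) for name, status, flagged in entries if flagged]


if __name__ == "__main__":
    for name, status in needs_attention(parse_showstatus(SAMPLE)):
        print(name, status)
```

A clean system, where showstatus prints only "No failures found in System Initialization.", yields an empty list because that line does not match the component pattern.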

When a fault is reported on the IOBox or its components, the Service LED on the IOBox or PSU is lit. When an IOBox FRU is discovered, dfrud reads the Status_CurrentR. If it contains fault information, that information is added to CMEM and the Service LED is turned on. This can be checked via the ioxadm command :

    XSCF> ioxadm env -v
    Location Sensor Min Min Alarm Value Max Alarm Max Units
    [...]
    IOX@X156/IOB1 SERVICE On - LED

Even if a fault is reported on the IOBox and the Service LED is lit, the OPNL System Check LED is not lit.

For XCP > 1050, the clearfault command can be used to clear the fault status for primary and secondary suspects on the IOBox and its components, as for any other component in the platform chassis (CMU, DIMM, IOU etc ...).

Example :

    XSCF> showstatus
        IOU#4 Status:Normal;
    *       PCI#5 Status:Degraded;
        IOX@X156 Status:Normal;
    *       IOB1 Status:Faulted;
    *       PS0 Status:Degraded;
    *       PS1 Status:Degraded;
    service> clearfault IOU#4-PCI#5
    service> clearfault IOX@X156/IOB1
    service> clearfault IOX@X156/PS0
    service> clearfault IOX@X156/PS1
    XSCF> showstatus
    No failures found in System Initialization.

Clearing the LINK to the IOBox :

Example :

    service> clearfault IOX@X1CK/IOB0/LINK

As soon as no more fault status is reported in the showstatus output, all the Service LEDs are cleared. There is no condition that requires power-cycling the IOBox to clear a fault status (similar to clearfault -l).

Hierarchical fault clearing :

In certain cases, the faulted resources appear to be hierarchical. In the following example, after clearing the fault on CMU#0, we need to clear the fault on the subordinates.
    XSCF> showstatus
    *   CMU#0 Status:Faulted;
    *       CPUM#0-CHIP#0 Status:Faulted;
    *       MEM#03A Status:Faulted;
    service# clearfault CMU#0
    XSCF> showstatus
        CMU#0 Status:Normal;
    *       CPUM#0-CHIP#0 Status:Faulted;
    *       MEM#03A Status:Faulted;

CMU#0 remains in the output, although not marked faulted, until the subordinates are cleared :

    service# clearfault CMU#0/CPUM#0
    XSCF> showstatus
        CMU#0 Status:Normal;
    *       MEM#03A Status:Faulted;
    service# clearfault CMU#0/MEM#03A
    XSCF> showstatus
    No failures found in System Initialization.

Examples :

1. M4000 / M5000 :

1.1 - clearing a fault on a PSU :

    XSCF> showstatus
    *   PSU#1 Status:Faulted;
    service# clearfault /PSU#1
    Testing the hardware...
    XSCF> showstatus
    No failures found in System Initialization.

1.2 - clearing a fault on a DIMM :

    XSCF> showstatus
        MBU_A Status:Normal;
            MEMB#0 Status:Normal;
    *           MEM#0A Status:Faulted;
    service# clearfault /MBU_A/MEMB#0/MEM#0A
    clearfault: Fault cannot be cleared for this FRU.
    FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: yes
    Fault will be cleared after circuit breaker off and on
    XSCF> showstatus
        MBU_A Status:Normal;
            MEMB#0 Status:Normal;
    *           MEM#0A Status:Faulted;

1.3 - clearing a fault on a CPUM :

    XSCF> showstatus
        MBU_A Status:Normal;
    *       CPUM#0-CHIP#0 Status:Faulted;
    *       CPUM#0-CHIP#1 Status:Faulted;
    service# clearfault /MBU_A/CPUM#0
    clearfault: Fault cannot be cleared for this FRU.
    FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: y
    Fault will be cleared after circuit breaker off and on
    XSCF> showstatus
        MBU_A Status:Normal;
    *       CPUM#0-CHIP#0 Status:Faulted;
    *       CPUM#0-CHIP#1 Status:Faulted;

2. M8000 / M9000 :

2.1 - clearing a fault on a PSU :

    XSCF> showstatus
    *   PSU#0 Status:Faulted;
    service# clearfault /PSU#0
    Testing the hardware...
    XSCF> showstatus
    No failures found in System Initialization.

2.2 - clearing a fault on the OPNL :

    XSCF> showstatus
    *   OPNL#0 Status:Faulted;
    service# clearfault /OPNL
    clearfault: Fault cannot be cleared for this FRU.
    FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: y
    Fault will be cleared after circuit breaker off and on

    XSCF> showstatus
    *   OPNL#0 Status:Faulted;

2.3 - clearing a fault on an IOU not part of a running domain :

    XSCF> showstatus
    *   IOU#1 Status:Faulted;
    XSCF> showboards -v -a
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-0 * 00(00)   Assigned    y    n    n    Unknown Normal   n
    01-0 * 00(01)   Assigned    y    n    n    Unknown Faulted  n
    02-0   SP       Unavailable y    n    n    Unknown Normal   n
    03-0   SP       Unavailable y    n    n    Unknown Normal   n
    service# clearfault /IOU#1
    Testing the hardware. This may take up to six minutes
    XSCF> showstatus
    No failures found in System Initialization.

2.4 - clearing a fault on a CMU not part of a running domain :

    service# clearfault /CMU#2/CPUM#2
    Testing the hardware. This may take up to six minutes
    XSCF> showstatus
    No failures found in System Initialization.

2.5 - clearing a fault on a CMU which is part of a running domain :

    XSCF> showstatus
        CMU#3 Status:Normal;
    *       CPUM#0-CHIP#0 Status:Faulted;
    *   OPNL#0 Status:Faulted;
    XSCF> showboards -v -a
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
    01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
    03-0   00(03)   Assigned    y    y    y    Passed  Degraded n
    service# clearfault /CMU#3/CPUM#0
    FRU cannot be detached
    clearfault: Fault cannot be cleared for this FRU.
    FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: n

We can use DR to detach the XSB and clear the fault.

    XSCF> deleteboard -c unassign 03-0
    XSB#03-0 will be unassigned from domain immediately. Continue?[y|n] :y
    Start unconfiguring XSB from domain.
    Unconfigured XSB from domain.
    XSB power off sequence started. [1200sec]
    0...end
    Operation has completed.
    XSCF> showboards -v -a
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
    01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
    03-0   SP       Available   y    n    n    Passed  Degraded n
    service# clearfault /CMU#3/CPUM#0
    Testing the hardware. This may take up to six minutes
    XSCF> showboards -v -a
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-0   00(00)   Assigned    y    y    y    Passed  Normal   n
    01-0   00(01)   Assigned    y    y    y    Passed  Normal   n
    03-0   00(03)   Assigned    y    y    y    Passed  Normal   n

    XSCF> showstatus
    No failures found in System Initialization.

2.6 - clearing a fault on a CMU which is part of a running domain but DR cannot be used :

    XSCF> showstatus
        CMU#3 Status:Normal;
    *       MEM#00A Status:Faulted;
    XSCF> clearfault /CMU#3/MEM#00A
    FRU cannot be detached
    clearfault: Fault cannot be cleared for this FRU.
    FRU will be marked to clear fault on next circuit breaker off and on. Continue? [y|n]: n
    XSCF> showboards -v -a
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-2   00(00)   Assigned    y    y    y    Passed  Normal   n
    03-0   00(12)   Assigned    y    y    y    Passed  Degraded n

Since DR cannot be used for whatever reason, the domain must be powered off prior to using clearfault :

    XSCF> showdomainstatus -d 0
    DID Domain Status
    00  Powered Off
    service> clearfault /CMU#3/MEM#00A
    Testing the hardware. This may take up to six minutes
    XSCF> showstatus
    No failures found in System Initialization.
    XSCF> showboards -v -d 0
    XSB  R DID(LSB) Assignment  Pwr  Conn Conf Test    Fault    COD
    ---- - -------- ----------- ---- ---- ---- ------- -------- ----
    00-2 * 00(00)   Assigned    y    n    n    Passed  Normal   n
    03-0 * 00(12)   Assigned    y    n    n    Passed  Normal   n
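The rules summarised in section 2.3 and illustrated by the examples above can be written down as a small lookup, which may help when deciding what to expect from clearfault on a first (primary) suspect. This is a hedged Python sketch of the summary only: the function clearfault_timing, its parameters and the returned strings are hypothetical names introduced for illustration, it is not part of XSCF, and it ignores the XCP1050 special cases mentioned in the notes.

```python
IMMEDIATE = "immediate"
AFTER_DR = "immediate after deleteboard (DR) + clearfault"
AFTER_DOMAIN_OFF = "immediate after powering off the domain + clearfault"
AFTER_POWER_CYCLE = "after clearfault + circuit breaker off and on"

MIDRANGE = {"M4000", "M5000"}
HIGHEND = {"M8000", "M9000"}


def clearfault_timing(platform, fru_type, in_running_domain=False,
                      dr_usable=False, domain_can_power_off=False):
    """Return when clearfault takes effect for a primary suspect.

    fru_type uses the labels of the section 2.3 summary, for example
    "PSU", "DIMM" or "non-FRU". The three flags only matter for
    CMU/IOU-level components on M8000/M9000.
    """
    if platform in MIDRANGE:
        if fru_type in {"PSU", "FAN"}:
            return IMMEDIATE
        if fru_type in {"CPUM", "DIMM", "IOU", "MEMB", "OPNL"}:
            return AFTER_POWER_CYCLE
    elif platform in HIGHEND:
        if fru_type in {"PSU", "FAN", "XSCFU", "DDC_A"}:
            return IMMEDIATE
        if fru_type == "OPNL":
            return AFTER_POWER_CYCLE
        if fru_type in {"CMU", "CPUM", "DIMM", "IOU", "non-FRU"}:
            if not in_running_domain:
                return IMMEDIATE
            if dr_usable:
                return AFTER_DR
            if domain_can_power_off:
                return AFTER_DOMAIN_OFF
            return AFTER_POWER_CYCLE
    # Anything not covered by the summary is left to the engineer.
    raise ValueError(f"no rule for {fru_type!r} on {platform!r}")


if __name__ == "__main__":
    print(clearfault_timing("M5000", "DIMM"))
    print(clearfault_timing("M9000", "CPUM", in_running_domain=True, dr_usable=True))
```

Unknown combinations raise an error on purpose rather than guessing, since the decision not to replace a primary suspect remains with Service and TSC.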

Product
Sun SPARC Enterprise M5000 Server
Sun SPARC Enterprise M4000 Server
Sun SPARC Enterprise M9000 Server
Sun SPARC Enterprise M8000 Server

Keywords opl, fault, clear, clearfault, clearstatus, fmdump, showstatus

=================================
Angel Rico
Technical Support Engineer
Sun Microsystems de Mexico.
Prol. Paseo de la Reforma 600-002
Phone x10369/+ 52(55)52-61-03-69
Email Angel.Rico@Sun.COM
