
Sun/Solaris Cluster 3.0/3.1: How to fix wrong DID entries after a disk replacement [ID 1007674.1]

Applies to:
Solaris Cluster - Version: 3.0 and later [Release: 3.0 and later]
All Platforms

Symptoms
This document explains what to do if you are unable to bring a device group online and the messages show the following:

Dec 13 19:58:21 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 474256 daemon.info] Validations of all specified global device services complete.
Dec 13 19:58:25 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:26 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:30 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/dbcal to this node failed: Node failed to become the primary.
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: becoming primary for disk-group1
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.error] stderr: metaset: cronos: disk-group1: stale databases
Dec 13 19:58:34 cronos Cluster.Framework: [ID 801593 daemon.notice] stdout: Stale database for diskset disk-group1
Dec 13 19:58:38 cronos SC[SUNW.HAStoragePlus,MSG01-RG,disk-group1,hastorageplus_prenet_start_private]: [ID 500133 daemon.warning] Device switchover of global service disk-group1 associated with path /global/jes1 to this node failed: Node failed to become the primary.

It is also possible that 'scdidadm -c' complains about changed or missing DIDs.

Cause
Typically, this problem happens after the replacement of a bad disk without following the correct procedure. The correct procedure to replace a bad disk with SVM/DiskSuite and Cluster 3.x is detailed in Technical Instruction Document 1004951.1 - Sun/Solaris Cluster 3.x: How to change SCSI JBOD disk with Solstice DiskSuite (SDS) / Solaris Volume Manager (SVM).

For instance, suppose that while following that procedure the cfgadm command was not run on the node that does not own the diskset, say node2. When the scgdevs command is then run, it removes node2 from the list of nodes for the DID instance that corresponds to the replaced disk and adds a new DID instance for node2 only. You end up with two DID instances for the same physical disk, one for each node. In this scenario node2 can fail to take over the diskset, because the replicas reference a DID instance to which node2 has no access.

As an example, in the scdidadm output below you can see that DID instances 13 and 37 are both present for disk c3t1d0:

root@node2 # /usr/cluster/bin/scdidadm -L
13       node1:/dev/rdsk/c3t1d0         /dev/did/rdsk/d13
37       node2:/dev/rdsk/c3t1d0         /dev/did/rdsk/d37

Using the scdidadm command you can verify that the disk IDs for those DIDs are different:

root@node1 # /usr/cluster/bin/scdidadm -o diskid -l d13
46554a495453552030304e3043344e4a2020202000000000
root@node2 # /usr/cluster/bin/scdidadm -o diskid -l d37
46554a495453552030315830383637342020202000000000

Also, on node1 the 'iostat -E' command returns for disk c3t1d0 a serial number different from the one returned on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):

root@node1 # /usr/bin/iostat -E sd31
Soft Errors: 203 Hard Errors: 242 Transport Errors: 272
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ

root@node2 # /usr/bin/iostat -E sd31
Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAN3367M SUN36G Revision: 1502 Serial No: 01X0867

These outputs are due to the fact that node1 correctly references the serial number of the disk currently present, while node2 is still referencing the serial number of the replaced disk.

Solution
If you are lucky, 'scdidadm -L' shows two DID instances (here 13 and 37) for only one shared disk (here c3t1d0). In this case, first check which of them is referencing a disk that is no longer present in the JBOD. This can easily be done by visually inspecting the serial numbers of the disks currently present in the JBOD; let's say DID 37 is the bad one (the disk with serial number 01X0867 has been replaced).

To fix the issue, first remove disk c3t1d0 from node2:

root@node2 # /usr/sbin/cfgadm -c unconfigure c3::dsk/c3t1d0
root@node2 # /usr/sbin/devfsadm -Cv

Remove DID instance 37 from the cluster:

root@node2 # /usr/cluster/bin/scdidadm -C

Verify with 'scdidadm -L' that DID instance 37 has been cleared. You are now ready to add disk c3t1d0 back on node2 (to find out the sd instance number of a c#t#d# disk, see the "Additional Information" section):

root@node2 # /usr/sbin/cfgadm -c configure c3::sd31
root@node2 # /usr/sbin/devfsadm

On node2, verify that the serial number for disk c3t1d0 has changed:

root@node2 # /usr/bin/iostat -E sd31
Soft Errors: 1 Hard Errors: 21 Transport Errors: 31
Vendor: FUJITSU Product: MAP3367N SUN36G Revision: 0401 Serial No: 00N0C4NJ

Add node2 to the list of nodes for DID instance 13:

root@node2 # /usr/cluster/bin/scgdevs

Verify with 'scdidadm -L' that there are now two entries for DID instance 13, one for each node:

root@node2 # /usr/cluster/bin/scdidadm -L
13       node1:/dev/rdsk/c3t1d0         /dev/did/rdsk/d13
13       node2:/dev/rdsk/c3t1d0         /dev/did/rdsk/d13

If you are unlucky, 'scdidadm -L' shows two DID instances for many shared disks. In this case, if the nodes can be shut down and you have an old scdidadm output, you can execute the procedure provided below; otherwise you have to repeat the steps above for each of the affected shared disks.

1. With an old 'scdidadm -l' output, first check what the DID layout should look like (a sketch for keeping such a baseline output appears after the example listings below):

# /usr/cluster/bin/scdidadm -l

In this example, node "cronos" should look like:

1        cronos:/dev/rdsk/c0t0d0        /dev/did/rdsk/d1
2        cronos:/dev/rdsk/c1t1d0        /dev/did/rdsk/d2
3        cronos:/dev/rdsk/c1t0d0        /dev/did/rdsk/d3
4        cronos:/dev/rdsk/c2t40d0       /dev/did/rdsk/d4
5        cronos:/dev/rdsk/c3t44d23      /dev/did/rdsk/d5
6        cronos:/dev/rdsk/c2t40d23      /dev/did/rdsk/d6
7        cronos:/dev/rdsk/c2t40d22      /dev/did/rdsk/d7
8        cronos:/dev/rdsk/c2t40d21      /dev/did/rdsk/d8
9        cronos:/dev/rdsk/c2t40d20      /dev/did/rdsk/d9
10       cronos:/dev/rdsk/c2t40d19      /dev/did/rdsk/d10
11       cronos:/dev/rdsk/c2t40d18      /dev/did/rdsk/d11
12       cronos:/dev/rdsk/c2t40d17      /dev/did/rdsk/d12
13       cronos:/dev/rdsk/c2t40d16      /dev/did/rdsk/d13
14       cronos:/dev/rdsk/c2t40d15      /dev/did/rdsk/d14
15       cronos:/dev/rdsk/c2t40d14      /dev/did/rdsk/d15
16       cronos:/dev/rdsk/c2t40d13      /dev/did/rdsk/d16
17       cronos:/dev/rdsk/c2t40d12      /dev/did/rdsk/d17
18       cronos:/dev/rdsk/c2t40d11      /dev/did/rdsk/d18
19       cronos:/dev/rdsk/c2t40d10      /dev/did/rdsk/d19
20       cronos:/dev/rdsk/c2t40d9       /dev/did/rdsk/d20
21       cronos:/dev/rdsk/c2t40d8       /dev/did/rdsk/d21
22       cronos:/dev/rdsk/c2t40d7       /dev/did/rdsk/d22
23       cronos:/dev/rdsk/c2t40d6       /dev/did/rdsk/d23
24       cronos:/dev/rdsk/c2t40d5       /dev/did/rdsk/d24
25       cronos:/dev/rdsk/c2t40d4       /dev/did/rdsk/d25
26       cronos:/dev/rdsk/c2t40d3       /dev/did/rdsk/d26
27       cronos:/dev/rdsk/c2t40d2       /dev/did/rdsk/d27
28       cronos:/dev/rdsk/c2t40d1       /dev/did/rdsk/d28
29       cronos:/dev/rdsk/c3t44d22      /dev/did/rdsk/d29
30       cronos:/dev/rdsk/c3t44d21      /dev/did/rdsk/d30
31       cronos:/dev/rdsk/c3t44d20      /dev/did/rdsk/d31
32       cronos:/dev/rdsk/c3t44d19      /dev/did/rdsk/d32
33       cronos:/dev/rdsk/c3t44d18      /dev/did/rdsk/d33
34       cronos:/dev/rdsk/c3t44d17      /dev/did/rdsk/d34

35       cronos:/dev/rdsk/c3t44d16      /dev/did/rdsk/d35
36       cronos:/dev/rdsk/c3t44d15      /dev/did/rdsk/d36
37       cronos:/dev/rdsk/c3t44d14      /dev/did/rdsk/d37
38       cronos:/dev/rdsk/c3t44d13      /dev/did/rdsk/d38
39       cronos:/dev/rdsk/c3t44d12      /dev/did/rdsk/d39
40       cronos:/dev/rdsk/c3t44d11      /dev/did/rdsk/d40
41       cronos:/dev/rdsk/c3t44d10      /dev/did/rdsk/d41
42       cronos:/dev/rdsk/c3t44d9       /dev/did/rdsk/d42
43       cronos:/dev/rdsk/c3t44d8       /dev/did/rdsk/d43
44       cronos:/dev/rdsk/c3t44d7       /dev/did/rdsk/d44
45       cronos:/dev/rdsk/c3t44d6       /dev/did/rdsk/d45
46       cronos:/dev/rdsk/c3t44d5       /dev/did/rdsk/d46
47       cronos:/dev/rdsk/c3t44d4       /dev/did/rdsk/d47
48       cronos:/dev/rdsk/c3t44d3       /dev/did/rdsk/d48
49       cronos:/dev/rdsk/c3t44d2       /dev/did/rdsk/d49
50       cronos:/dev/rdsk/c3t44d1       /dev/did/rdsk/d50
51       cronos:/dev/rdsk/c3t44d0       /dev/did/rdsk/d51

And the output for node "vulcano" should look like:

4        vulcano:/dev/rdsk/c3t44d0      /dev/did/rdsk/d4
5        vulcano:/dev/rdsk/c2t40d23     /dev/did/rdsk/d5
6        vulcano:/dev/rdsk/c3t44d23     /dev/did/rdsk/d6
7        vulcano:/dev/rdsk/c3t44d22     /dev/did/rdsk/d7
8        vulcano:/dev/rdsk/c3t44d21     /dev/did/rdsk/d8
9        vulcano:/dev/rdsk/c3t44d20     /dev/did/rdsk/d9
10       vulcano:/dev/rdsk/c3t44d19     /dev/did/rdsk/d10
11       vulcano:/dev/rdsk/c3t44d18     /dev/did/rdsk/d11
12       vulcano:/dev/rdsk/c3t44d17     /dev/did/rdsk/d12
13       vulcano:/dev/rdsk/c3t44d16     /dev/did/rdsk/d13
14       vulcano:/dev/rdsk/c3t44d15     /dev/did/rdsk/d14
15       vulcano:/dev/rdsk/c3t44d14     /dev/did/rdsk/d15
16       vulcano:/dev/rdsk/c3t44d13     /dev/did/rdsk/d16
17       vulcano:/dev/rdsk/c3t44d12     /dev/did/rdsk/d17
18       vulcano:/dev/rdsk/c3t44d11     /dev/did/rdsk/d18
19       vulcano:/dev/rdsk/c3t44d10     /dev/did/rdsk/d19
20       vulcano:/dev/rdsk/c3t44d9      /dev/did/rdsk/d20
21       vulcano:/dev/rdsk/c3t44d8      /dev/did/rdsk/d21
22       vulcano:/dev/rdsk/c3t44d7      /dev/did/rdsk/d22
23       vulcano:/dev/rdsk/c3t44d6      /dev/did/rdsk/d23
24       vulcano:/dev/rdsk/c3t44d5      /dev/did/rdsk/d24
25       vulcano:/dev/rdsk/c3t44d4      /dev/did/rdsk/d25
26       vulcano:/dev/rdsk/c3t44d3      /dev/did/rdsk/d26
27       vulcano:/dev/rdsk/c3t44d2      /dev/did/rdsk/d27
28       vulcano:/dev/rdsk/c3t44d1      /dev/did/rdsk/d28
29       vulcano:/dev/rdsk/c2t40d22     /dev/did/rdsk/d29
30       vulcano:/dev/rdsk/c2t40d21     /dev/did/rdsk/d30
31       vulcano:/dev/rdsk/c2t40d20     /dev/did/rdsk/d31
32       vulcano:/dev/rdsk/c2t40d19     /dev/did/rdsk/d32
33       vulcano:/dev/rdsk/c2t40d18     /dev/did/rdsk/d33
34       vulcano:/dev/rdsk/c2t40d17     /dev/did/rdsk/d34
35       vulcano:/dev/rdsk/c2t40d16     /dev/did/rdsk/d35
36       vulcano:/dev/rdsk/c2t40d15     /dev/did/rdsk/d36
37       vulcano:/dev/rdsk/c2t40d14     /dev/did/rdsk/d37
38       vulcano:/dev/rdsk/c2t40d13     /dev/did/rdsk/d38
39       vulcano:/dev/rdsk/c2t40d12     /dev/did/rdsk/d39
40       vulcano:/dev/rdsk/c2t40d11     /dev/did/rdsk/d40
41       vulcano:/dev/rdsk/c2t40d10     /dev/did/rdsk/d41
42       vulcano:/dev/rdsk/c2t40d9      /dev/did/rdsk/d42
43       vulcano:/dev/rdsk/c2t40d8      /dev/did/rdsk/d43

44       vulcano:/dev/rdsk/c2t40d7      /dev/did/rdsk/d44
45       vulcano:/dev/rdsk/c2t40d6      /dev/did/rdsk/d45
46       vulcano:/dev/rdsk/c2t40d5      /dev/did/rdsk/d46
47       vulcano:/dev/rdsk/c2t40d4      /dev/did/rdsk/d47
48       vulcano:/dev/rdsk/c2t40d3      /dev/did/rdsk/d48
49       vulcano:/dev/rdsk/c2t40d2      /dev/did/rdsk/d49
50       vulcano:/dev/rdsk/c2t40d1      /dev/did/rdsk/d50
51       vulcano:/dev/rdsk/c2t40d0      /dev/did/rdsk/d51
52       vulcano:/dev/rdsk/c0t0d0       /dev/did/rdsk/d52
53       vulcano:/dev/rdsk/c1t1d0       /dev/did/rdsk/d53
54       vulcano:/dev/rdsk/c1t0d0       /dev/did/rdsk/d54
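The procedure above relies on having a scdidadm listing taken before the disk replacement. As a suggested practice (not part of the original procedure), a dated baseline can be kept on each node so that a pre-failure reference is available when a disk is replaced; the file name below is only an example:

root@node1 # /usr/cluster/bin/scdidadm -L > /var/tmp/scdidadm-L.`date +%Y%m%d`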

Be aware that the sd instance number and the c#t#d# name of a shared disk are not necessarily the same on both nodes. For instance, in the example above the same shared disk is referenced as c2t40d11 on node "vulcano" and as c3t44d11 on node "cronos".

2. Now we have to fix the affected/missing DIDs. To figure out which DIDs have changed, execute:

# /usr/cluster/bin/scdidadm -L

and look for DID entries that do not appear on both nodes (the following example shows an output for nodes "cronos" and "vulcano" with the problem):

40       vulcano:/dev/rdsk/c2t40d11     /dev/did/rdsk/d40
41       vulcano:/dev/rdsk/c2t40d10     /dev/did/rdsk/d41
42       vulcano:/dev/rdsk/c2t40d9      /dev/did/rdsk/d42
43       vulcano:/dev/rdsk/c2t40d8      /dev/did/rdsk/d43
44       vulcano:/dev/rdsk/c2t40d7      /dev/did/rdsk/d44
45       vulcano:/dev/rdsk/c2t40d6      /dev/did/rdsk/d45
46       vulcano:/dev/rdsk/c2t40d5      /dev/did/rdsk/d46
47       vulcano:/dev/rdsk/c2t40d4      /dev/did/rdsk/d47
48       vulcano:/dev/rdsk/c2t40d3      /dev/did/rdsk/d48
49       vulcano:/dev/rdsk/c2t40d2      /dev/did/rdsk/d49
50       vulcano:/dev/rdsk/c2t40d1      /dev/did/rdsk/d50
51       vulcano:/dev/rdsk/c2t40d0      /dev/did/rdsk/d51
...
* 55     cronos:/dev/rdsk/c3t44d11      /dev/did/rdsk/d55
* 56     cronos:/dev/rdsk/c3t44d10      /dev/did/rdsk/d56
* 57     cronos:/dev/rdsk/c3t44d9       /dev/did/rdsk/d57
* 58     cronos:/dev/rdsk/c3t44d8       /dev/did/rdsk/d58
* 59     cronos:/dev/rdsk/c3t44d7       /dev/did/rdsk/d59
* 60     cronos:/dev/rdsk/c3t44d6       /dev/did/rdsk/d60
* 61     cronos:/dev/rdsk/c3t44d5       /dev/did/rdsk/d61
* 62     cronos:/dev/rdsk/c3t44d4       /dev/did/rdsk/d62
* 63     cronos:/dev/rdsk/c3t44d3       /dev/did/rdsk/d63
* 64     cronos:/dev/rdsk/c3t44d2       /dev/did/rdsk/d64
* 65     cronos:/dev/rdsk/c3t44d1       /dev/did/rdsk/d65
* 66     cronos:/dev/rdsk/c3t44d0       /dev/did/rdsk/d66

The entries marked with an asterisk '*' have no match on the other node; their DID instance numbers have changed with respect to the 'scdidadm -l' output taken before the problem. Those DIDs without a match are the ones that need to be changed back: in this example, DIDs 55 through 66 need to be changed back to 40 through 51.
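Picking the unmatched entries out of a long 'scdidadm -L' listing by eye is error-prone. The following is a minimal sketch (not part of the original procedure) that prints the DID instances reported by only one node; note that purely local devices such as boot disks and CD-ROM drives legitimately show up once, so only the entries that belong to shared storage controllers are of interest here:

root@node1 # /usr/cluster/bin/scdidadm -L | /usr/bin/nawk '
    { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
    END { for (d in count) if (count[d] == 1) printf "%s", lines[d] }'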

To change those back, do the following:

- Shut down both nodes to the ok prompt.
- Boot one node (the last one to come down) in single-user mode, out of the cluster:

ok boot -sx

- Edit the file /etc/cluster/ccr/did_instances and change the entries as follows: 55 to 40, 56 to 41, ..., 66 to 51.
- Execute:

# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances -o

- Boot the node back in cluster mode.
- Now boot the other node in single-user mode, out of the cluster:

ok boot -sx

- Edit the file /etc/cluster/ccr/did_instances and change the entries as follows: 55 to 40, 56 to 41, ..., 66 to 51.
- Then execute:

# /usr/cluster/lib/sc/ccradm -i /etc/cluster/ccr/did_instances

- Boot the node back in cluster mode.
- Check whether the problem has been fixed (steps 1 and 2 above).

If everything has been fixed:

- Check the output of 'metastat -s <setName>' and see whether any metadevices need to be resynchronized.

Additional Information
To find out the sd instance number of a c#t#d# disk, match the disk path shown in the 'format' output with an sd entry in the /etc/path_to_inst file:

root@node2 # /usr/sbin/format
       c3t1d0 <SUN36G cyl 24620 alt 2 hd 27 sec 107>
       /pci@8,700000/scsi@5,1/sd@1,0

root@node2 # /usr/bin/grep "/pci@8,700000/scsi@5,1/sd@1,0" /etc/path_to_inst
"/node@2/pci@8,700000/scsi@5,1/sd@1,0" 31 "sd"

In this case the sd instance number is 31.
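The hexadecimal strings returned by 'scdidadm -o diskid' in the example above are ASCII-encoded: in this case they decode to the vendor name followed by the disk serial number, which makes it easy to cross-check them against the 'iostat -E' output and the label of the physical disk. A minimal decoding sketch, assuming perl is available on the node (the hex string is the d13 value from the example above):

root@node1 # /usr/bin/perl -e 'print pack("H*", $ARGV[0]), "\n"' \
    46554a495453552030304e3043344e4a2020202000000000
FUJITSU 00N0C4NJ

Any trailing blanks and NUL bytes in the decoded string are padding and can be ignored.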

==========

For Solaris Cluster 3.2 it is *much* easier to change DID instance numbers; see Technical Instruction Document 1009730.1 - Solaris Cluster 3.2: renaming "did" devices.
