
Copyright (c) 2019, Oracle. All rights reserved. Oracle Confidential.

Solaris Cluster Troubleshooting Private Interconnect Transport Network in a 'failed' or 'faulted' State (Doc
ID 1315772.1)

APPLIES TO:

Solaris Cluster - Version 3.0 to 4.3 [Release 3.0 to 4.3]


Oracle Solaris on SPARC (32-bit)
Oracle Solaris on SPARC (64-bit)
Oracle Solaris on x86-64 (64-bit)
Oracle Solaris on x86 (32-bit)

PURPOSE

Solaris Cluster requires a minimum of two private transports, also known as the private interconnect or private network, for high availability.

This article applies to situations in which one path is in a 'failed' or 'faulted' state.

If all private transports are in a failed/faulted state, it is more likely that the node they connect to is down or is not part of the cluster.
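
In that case, first verify cluster node membership, for example with:

# /usr/cluster/bin/clnode status (Solaris Cluster 3.2 and later)
or
# scstat -n (Sun Cluster 3.0/3.1)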

In Sun Cluster 3.0 or 3.1 you would see the failure/errors in the output of:

# scstat
or
# scstat -W

-- Cluster Transport Paths --

                    Endpoint            Endpoint            Status
                    --------            --------            ------
  Transport path:   clnode1:nxge1       clnode2:nxge1       faulted
  Transport path:   clnode1:nxge0       clnode2:nxge0       Path online

In Oracle Solaris Cluster 3.2 or higher you can also see it with:

# cluster status
or
# clinterconnect status

=== Cluster Transport Paths ===

Endpoint1               Endpoint2               Status
---------               ---------               ------
clnode1:net1            clnode2:net1            Path online
clnode1:net0            clnode2:net0            faulted

This procedure uses nxge interfaces, but it is not specific to nxge and applies to other types of transports. The article uses the Solaris Cluster
3.2 and later command set; for older revisions see the man pages for scconf and scsetup.

This resolution path can also be used to analyze cluster interconnect issues where the speed is not correct, e.g. the interconnect is running
at 100 Mbit instead of 1000 Mbit.
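
A quick way to check the negotiated speed of a transport interface from the Solaris command line (shown here for the nxge1 example interface; adjust the name to your own transport adapters) is, for example:

# dladm show-dev nxge1 (Solaris 10)
# dladm show-phys (Solaris 11, lists all physical links)

Both report the current link state, speed and duplex setting.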

TROUBLESHOOTING STEPS

1) Check the current status of the cluster and its interconnect, and monitor the outputs:

Run one of the status commands mentioned above, "scstat -W" or "clintr status". While these commands execute, monitor the messages file
with:

# tail -f /var/adm/messages

and/or monitor the console in a terminal window.


If all paths show "Path online", no further action is necessary.

If the nxge1 path shows "faulted" as in the example above, go to the next step.

2) Check whether the messages that occur are relevant.

In this example, errors are seen from in.mpathd while one cluster interconnect path is faulted:

# clinterconnect status

=== Cluster Transport Paths ===

Endpoint1               Endpoint2               Status
---------               ---------               ------
clnode1:nxge1           clnode2:nxge1           faulted
clnode1:nxge0           clnode2:nxge0           Path online

The following errors are observed in /var/adm/messages for the public network:

Feb 25 12:22:10 clnode1 in.mpathd[198]: [ID 215189 daemon.error] The link has gone down on nxge4
Feb 25 12:22:10 clnode1 in.mpathd[198]: [ID 594170 daemon.error] NIC failure detected on nxge4 of group primary

Note that nxge4 is not part of the interconnect, yet it shows an error when the status command for the interconnect is executed.

In such a case, check the /etc/path_to_inst file, especially when there has been maintenance on the node.
Ensure that the path entries in /etc/path_to_inst match the configuration in Solaris Cluster. This means the device paths for nxge0 and
nxge1 should point to the device paths used by the private interconnect. In the example above, nxge4 was moved to the device paths of the
private interconnect and nxge1 was moved to the public network in /etc/path_to_inst. When it is necessary to correct errors in
/etc/path_to_inst, a reboot is required afterwards.
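
A minimal way to cross-check this (the nxge instance numbers are only the example used in this article) is to compare the nxge entries in /etc/path_to_inst with the device_name/device_instance properties the cluster has configured for its transport adapters:

# grep nxge /etc/path_to_inst
# scconf -pvv | grep "Adapter property: device"

The physical device paths behind the interconnect instances (nxge0 and nxge1 in this example) must still point to the adapters that are cabled as the private interconnect.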

3) Reset the connection with software

Disable and re-enable the cable to clnode1:nxge1.

This resets the connection so you can see whether the path comes back online.
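
Before disabling a cable endpoint, it is a good idea to confirm that at least one other transport path is still online (disabling the last working transport would lead to a node panic, see the note at the end of this document), for example:

# /usr/cluster/bin/clinterconnect status | grep -c "Path online"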

# /usr/cluster/bin/clinterconnect disable clnode1:nxge1,switch2@1

# /usr/cluster/bin/clinterconnect enable clnode1:nxge1,switch2@1

# clinterconnect status

If the transport path now shows "Path online", you are done.

4) To see the detailed adapter interface configuration and the switch names, use the following command:

# scconf -pvv

---snip--- this is a configuration without switches -----snip-----

(budap2:e1000g7) Adapter enabled: yes
(budap2:e1000g7) Adapter transport type: dlpi
(budap2:e1000g7) Adapter property: device_name=e1000g
(budap2:e1000g7) Adapter property: device_instance=7
(budap2:e1000g7) Adapter property: lazy_free=1
(budap2:e1000g7) Adapter property: dlpi_heartbeat_timeout=10000
(budap2:e1000g7) Adapter property: dlpi_heartbeat_quantum=1000
(budap2:e1000g7) Adapter property: nw_bandwidth=80
(budap2:e1000g7) Adapter property: bandwidth=70
(budap2:e1000g7) Adapter property: ip_address=172.16.1.2
(budap2:e1000g7) Adapter property: netmask=255.255.255.128
(budap2:e1000g7) Adapter port names: 0
(budap2:e1000g7) Adapter port: 0
(budap2:e1000g7@0) Port enabled: yes

Cluster transport switches: <NULL>

Cluster transport cables


                    Endpoint               Endpoint               State
                    --------               --------               -----
  Transport cable:  node2:e1000g3@0        node1:e1000g3@0        Enabled
  Transport cable:  node2:e1000g7@0        node1:e1000g7@0        Enabled

-------- snip ----------

Or use the newer command "clintr show -v" for configuration details.

You can also look at your interface configuration from the Solaris command line:

# ifconfig -a

bge2: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 4
        inet 172.16.0.130 netmask ffffff80 broadcast 172.16.0.255
bge3: flags=1008843<UP,BROADCAST,RUNNING,MULTICAST,PRIVATE,IPv4> mtu 1500 index 3
        inet 172.16.1.2 netmask ffffff80 broadcast 172.16.1.127
clprivnet0: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 5
        inet 172.16.4.2 netmask fffffe00 broadcast 172.16.5.255

clprivnet0 is the cluster's logical private-network failover interface, which runs over the physical private transports.
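
A simple functional test of the private network is to ping the other node's private hostname, which resolves to an address on the interconnect (assuming the default clusternodeN-priv naming), for example:

# ping clusternode2-priv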

In Solaris 11, use the dladm and ipadm commands.
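
For example, the following Solaris 11 commands show the physical link state and speed, the datalink configuration, and the IP addresses plumbed on the private interfaces and on clprivnet0:

# dladm show-phys
# dladm show-link
# ipadm show-addr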

5) Troubleshooting the hardware components

Observe your console and your /var/adm/messages file for possible errors.
If the connection is still faulted, there is a potential hardware problem.

If you have a spare cable, you can replace the cables going to the nxge1 interfaces and repeat the above procedure with
clinterconnect disable/enable. If the nxge1 link still shows offline on either node, the HBA might need to be replaced.
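
To help decide whether the problem is on the cable, the switch port or the adapter, you can also watch the link state and the error counters on both nodes while the path is faulted (a sketch, using the nxge1 example interface):

# dladm show-dev nxge1
# netstat -i -I nxge1 5

Growing input/output error counts on one node point to the cable, switch port or adapter on that side.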

On the switch, if you have extra ports, you can try moving the cables (the same ones going to nxge1) to different ports. Check with your
network administrator and ask them to review the switch logs to see whether the switch is OK or is logging errors.

Because there are several components in the transport path, it is difficult to give more specific instructions.

clnode1 - HBA ---- cable ---- port - switch2 - port ---- cable ---- HBA - clnode2

Any component on this chain can cause this to happen.

Pay close attention to which interfaces and cables you touch if you end up replacing them. If you accidentally remove the last working
transport, one of the nodes will panic.
