Sarvesh S. Patel, Bill Scales IBM Systems and Technology Group ISV Enablement January 2014
Table of contents
Abstract
Getting started
    About IBM SAN Volume Controller Stretched Cluster services
    About this guide
    Assumptions
Resources
About the authors
Trademarks and special notices
Abstract
This white paper describes the procedure for creating an IBM System Storage SAN Volume Controller (SVC) cluster with the Enhanced Stretched Cluster topology. It gives a brief introduction to the Enhanced Stretched Cluster feature, explains the stretched topology and the disaster recovery capability that the feature provides, and offers guidance on assigning site awareness to controllers and other entities. It also describes how to enable and disable the feature and the procedure to follow in case of a disaster.
Getting started
This section gives a brief overview of the IBM System Storage SAN Volume Controller (SVC) Stretched Cluster implementation and the different topologies used to configure it.
Throughout this guide, IBM SAN Volume Controller (SVC) and SVC cluster are general terms that apply to the IBM SVC platform only.
The following two types of configurations are supported:

Stretched Cluster configuration without using inter switch links
As explained in the product documentation, this topology has direct connections from the IBM SVC nodes to switches in different power domains.

Stretched Cluster configuration using inter switch links
This implementation has inter switch links between two sites in different power domains.
For the documentation on the solution as included in version 6.3.0, refer to the following link:
ftp://ftp.software.ibm.com/storage/san/sanvc/V6.3.0/SVC_Split_IO_Group_requirements_Errata_V1.pdf
Assumptions
Below are the assumptions considered while writing this white paper:
- The SVC clusters are successfully installed with the latest (at the time of this publication) IBM SVC 7.2.0 code levels (or later).
- The SVC clusters have the required licenses. (No separate license is required to enable the Enhanced Stretched Cluster site awareness and site disaster recovery feature.)
- The storage SAN is configured according to the product documentation, and the infrastructure to support SVC clusters in a stretched cluster using 8 Gb Fibre Channel is properly in place.
- The user has a basic understanding of SVC stretched and split cluster concepts, SVC storage concepts, and configurations for host attachment.
- The user knows the different heterogeneous SVC platforms that can be added in FC partnerships. The same applies to IP partnerships.
For SVC documentation, refer to: http://pic.dhe.ibm.com/infocenter/svc/ic/index.jsp
Note: Refer to the configuration section in the above documentation for details on how Stretched Cluster works.
IP partnership and SVC terminology, with brief descriptions:

Metro Mirror, Global Mirror, and Global Mirror with Change Volumes: The different remote copy services supported on SVC platforms.
SAN: Storage area network.
NAS: Network-attached storage.
Failover: Failure of a node within an I/O group causes virtual disk access through the surviving node. The IP addresses fail over to the surviving node in the I/O group. When the configuration node of the system fails, management IPs also fail over to an alternate node.
Failback: When the failed node rejoins the system, all failed over IP addresses are failed back from the surviving node to the rejoined node, and virtual disk access is restored through this node.
I/O group: Two nodes or two canisters form an I/O group. A single SVC system supports four I/O groups, that is, eight nodes.
FC: Fibre Channel.
SVC platforms: Unless explicitly specified, a general term used to describe all applicable SVC platforms: IBM SVC (CG8, CF8, 8G4), IBM Storwize V7000, Storwize V5000, Storwize V3700, and IBM PureFlex System storage nodes.
FCoE: Fibre Channel over Ethernet.
Setup description
Hardware summary: Minimum of two IBM SVC nodes
Connectivity details
The Stretched Cluster system feature is supported with two types of implementations.
Neither implementation affects the behavior of the site awareness and site disaster recovery feature; the choice depends entirely on how the administrator wants to configure the system implementation. As mentioned in the earlier sections, the connectivity does not change and the feature is optional; the administrator can choose to use it for disaster recovery. An important point to note is that the site disaster recovery feature can be invoked only if the Stretched Cluster implementation has site awareness.

Implementation details

Without inter switch links
This hardware implementation is the same as that recommended for a non-stretched cluster. No change in connectivity is needed. Refer to the information center documentation for more details about the recommended connections.

With inter switch links
In this implementation, two production sites are connected using inter switch links. These two sites can be in the same rack with different power domains, across racks, across data centers, and so on, as was supported earlier.
A new set of 'site' objects is defined. These are created implicitly and automatically for every system. There are always exactly three sites, numbered 1, 2, and 3 (site index 0 is never reported). There is no means of deleting sites or creating extra sites. The only configurable attribute of a site is its name; the default names are 'site1', 'site2', and 'site3'. Site1 and site2 are where the two halves of the Stretched Cluster are located, and site3 is the optional third site for a quorum tie-break disk.

The appropriate 'site' instance is referenced when a site value is defined for an object. Objects can also leave their 'site' value undefined, which is the default setting. Enabling the site disaster recovery feature, and correct operation of the disaster recovery feature, requires assigning objects to sites.

These are the mandatory three sites needed for a Stretched Cluster implementation:
- Site1: Production site 1
- Site2: Production site 2
- Site3: A site at a different location to house the quorum disk

From IBM SVC version 7.2 onwards, these three sites are present by default. A new CLI command, lssite, has been introduced to list the sites.
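A sketch of what the default lssite output might look like, assuming a hypothetical cluster name (the exact column layout can vary by code level):

    IBM_2145:ITSO_SVC:superuser> lssite
    id site_name
    1  site1
    2  site2
    3  site3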
The site names can be changed by using the chsite CLI command. For example, if the Stretched Cluster implementation is spread across two data centers, the sites can be named accordingly.
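A minimal sketch of the renaming, assuming a chsite syntax of -name followed by the site ID, with hypothetical data center names:

    IBM_2145:ITSO_SVC:superuser> chsite -name DataCenterA 1
    IBM_2145:ITSO_SVC:superuser> chsite -name DataCenterB 2
    IBM_2145:ITSO_SVC:superuser> chsite -name QuorumSite 3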
After the names are assigned to the sites, they can be viewed using the lssite command.
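Continuing the hypothetical example above, lssite would then report the new names:

    IBM_2145:ITSO_SVC:superuser> lssite
    id site_name
    1  DataCenterA
    2  DataCenterB
    3  QuorumSite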
After assigning the appropriate site awareness to nodes, the user can verify the site assignment by using the lsnode command, in both the concise and the detailed view.
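A sketch of assigning a node to a site and verifying the assignment, assuming a -site parameter on chnode and hypothetical node names; the detailed lsnode view is expected to include site_id and site_name fields:

    IBM_2145:ITSO_SVC:superuser> chnode -site DataCenterA node1
    IBM_2145:ITSO_SVC:superuser> lsnode node1
    ...
    site_id 1
    site_name DataCenterA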
Connectivity is permitted between:
- Any node and controllers in site 3, or controllers with no site defined
- A node with no site defined and any controller
- A node configured in a site and a controller or MDisk configured to the same site
- A node configured in a site and a controller or MDisk configured to site 3
The fault reporting algorithms for raising event logs in the case of missing connectivity are also adjusted to allow for these rules. When a controller is configured to site 1, then connectivity to nodes in site 2 is not expected or required, and is disregarded. Faults are only reported if any node in site 1 has inadequate connectivity (that is, if any node in site 1 has less than two SVC ports with visibility to the controller). Similarly, if a controller is configured to site 2, then connectivity to nodes in site 1 is disregarded.
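A sketch of assigning a storage controller to a site, assuming a -site parameter on chcontroller and a hypothetical controller name; after this assignment, missing connectivity from nodes in the other production site is no longer reported as a fault:

    IBM_2145:ITSO_SVC:superuser> chcontroller -site DataCenterA controller0
    IBM_2145:ITSO_SVC:superuser> lscontroller controller0
    ...
    site_id 1
    site_name DataCenterA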
Multi-WWNN controller
When the site is changed on a multi-worldwide node name (WWNN) controller, the site setting is updated on all of the affected controllers at the same time.
There is no precondition on sites being configured for controllers. The feature is not operable on nodes that are absent until they rejoin the cluster. New clusters, and clusters upgrading to version 7.2.0 and later, have the disaster recovery feature disabled by default. The site disaster recovery feature can be enabled or disabled by using the chsystem command, and its state can be checked by using the lssystem command.
Figure 9: Enabling the site disaster recovery feature using the chsystem command
Figure 10: Output of the lssystem command when the feature is enabled
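A sketch of the commands that the two figures illustrate, assuming the feature is enabled through the topology parameter of chsystem (consistent with the 'stretched' topology and 'dual_site' status values referenced later in this paper):

    IBM_2145:ITSO_SVC:superuser> chsystem -topology stretched
    IBM_2145:ITSO_SVC:superuser> lssystem
    ...
    topology stretched
    topology_status dual_site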
Nodes can be unconfigured by using the svctask rmnode command or the satask leavecluster -force command. Before running these commands, the user must disconnect all the FC/FCoE cables from all the nodes that they want to re-add to the cluster.
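A minimal sketch of the two alternatives, with a hypothetical node name (satask commands run through the service CLI of the node itself, not the cluster CLI):

    # From the cluster, remove the node from the configuration:
    IBM_2145:ITSO_SVC:superuser> svctask rmnode node2
    # Or, from the service CLI of the node to be removed:
    satask leavecluster -force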
Returning to normal operation after invoking the site disaster recovery feature
The user must follow a careful set of steps to ensure that the system maintains integrity as connectivity between the two sites is recovered. In particular, care is needed not to conflict with the activity of any still-active nodes in the failed site, for example, if power is recovered after a failure. The following sequence of steps is required.

1. The disaster recovery feature is invoked; an alert is raised indicating that this process must be used.
2. Access to all 'recovered site' volume copies is recovered. This includes the mirror-half of stretched volumes plus any single-copy volumes with a defined local site.
3. Access to all other volume copies is lost. The user must treat all such storage as suspect and potentially corrupted. The conservative approach is to delete all such volume copies. Some users might choose to retain access to such volumes to attempt to recover some data.
4. Mirrored volumes with one online fresh local copy can be retained.
5. Access to all other-site quorum disks is lost. All such quorum disks must be deleted.
6. This can be achieved by using rmmdisk if the MDisk no longer holds any volume copy. If there are volume copies that are being retained, then the process must use chquorum to select new quorum disks and prevent attempts to use the other-site quorum disks (see the sketch after this list).
7. All inter-system remote copy relationships, consistency groups, and partnerships must be destroyed (partnerships will be in the partially configured state).
8. At this point, the user can address the missing nodes. This requires disconnecting the FC/FCoE connectivity of the missing nodes, and then either unconfiguring the node by using svctask rmnode (in the abandoned cluster) or satask leavecluster as described earlier, or decommissioning the node so that it can no longer access the shared storage and then issuing rmnode in the recovered cluster to inform it that this step has been performed.
9. When the last offline node from the failed site is repaired, the alert auto-fixes and any non-local site volume copies come online. The process of reconstructing the system objects can then begin, including:
- Defining quorum disks in the correct sites
- Re-creating volumes that were not automatically recovered earlier
- Re-creating any intra-system copy services that were deleted because their volumes were deleted
- Re-creating any inter-system Metro Mirror or Global Mirror objects

Note that there is no need to explicitly re-enable the disaster recovery feature. The cluster topology remains 'stretched', and when the event log auto-fixes, the cluster topology status returns to 'dual_site'; assuming that there are online nodes at both sites, the voting set is manipulated to prepare for the next disaster recovery.
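A sketch of the quorum cleanup in steps 5 and 6, with hypothetical MDisk and storage pool names (chquorum is assumed to take the quorum index and the new MDisk):

    # Remove an other-site quorum MDisk that no longer holds any volume copy:
    IBM_2145:ITSO_SVC:superuser> rmmdisk -mdisk mdisk7 -force mdiskgrp0
    # Or move the quorum index to an MDisk in a surviving site:
    IBM_2145:ITSO_SVC:superuser> chquorum -mdisk mdisk3 2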
Resources
The following websites provide useful references to supplement the information contained in this paper:
- IBM Systems on IBM PartnerWorld: ibm.com/partnerworld/systems/
- IBM Publications Center: http://www.elink.ibmlink.ibm.com/public/applications/publications/cgibin/pbi.cgi?CTY=US
- IBM Redbooks: ibm.com/redbooks
- IBM developerWorks: ibm.com/developerworks
- IBM SAN and SVC Stretched Cluster and VMware Solution Implementation: ibm.com/redbooks/redbooks/pdfs/sg248072.pdf
- IBM SAN Volume Controller Stretched Cluster with PowerVM and PowerHA: ibm.com/redbooks/redbooks/pdfs/sg248142.pdf
- SVC Split Cluster: How it Works: ibm.com/developerworks/community/blogs/storagevirtualization/entry/split_cluster?lang=en
Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.