
Front cover

Draft Document for Review February 16, 2012 3:49 pm SG24-7521-02

SAN Volume Controller


Best Practices and Performance Guidelines
Read about best practices learned from the field

Learn about SVC performance advantages

Fine-tune your SVC

Mary Lovelace
Otavio Rocha Filho
Katja Gebuhr
Ivo Gomilsek
Ronda Hruby
Paulo Neto
Jon Parkes
Leandro Torolho

ibm.com/redbooks


International Technical Support Organization

SAN Volume Controller Best Practices and Performance Guidelines

October 2011

SG24-7521-02


Note: Before using this information and the product it supports, read the information in Notices on page iii.

Third Edition (October 2011)

This edition applies to Version 6, Release 2 of the IBM System Storage SAN Volume Controller. This document was created or updated on February 16, 2012.

© Copyright International Business Machines Corporation 2011. All rights reserved.

Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Copyright IBM Corp. 2011. All rights reserved.



Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX, alphaWorks, DB2, developerWorks, DS4000, DS6000, DS8000, Easy Tier, Enterprise Storage Server, FlashCopy, Global Technology Services, GPFS, HACMP, IBM, Nextra, pSeries, Redbooks, Redbooks (logo), Storwize, System p, System Storage, System x, System z, Tivoli, XIV, z/OS

The following terms are trademarks of other companies:

ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office.

Microsoft, Windows NT, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Disk Magic, and the IntelliMagic logo are trademarks of IntelliMagic BV in the United States, other countries, or both.

NetApp, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.

Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates.

QLogic, and the QLogic logo are registered trademarks of QLogic Corporation. SANblade is a registered trademark in the United States.

VMware, the VMware "boxes" logo and design are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions.

Intel Xeon, Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.



Contents
Notices
Trademarks

Preface
The team that wrote this book
Become a published author
Comments welcome

Summary of changes
October 2011, Third Edition

Part 1. Configuration guidelines and best practices

Chapter 1. SVC update
1.1 SVC V5.1 enhancements and changes
1.2 SVC V6.1 enhancements and changes
1.3 SVC V6.2 enhancements and changes
1.4 Contents of this book
1.4.1 Part 1 - Configuration Guidelines
1.4.2 Part 2 - Performance Best Practices
1.4.3 Part 3 - Monitoring, Maintenance and Troubleshooting
1.4.4 Part 4 - Practical Scenarios

Chapter 2. SAN topology
2.1 SVC SAN topology
2.1.1 Redundancy
2.1.2 Topology basics
2.1.3 ISL oversubscription
2.1.4 Single switch SVC SANs
2.1.5 Basic core-edge topology
2.1.6 Four-SAN core-edge topology
2.1.7 Common topology issues
2.1.8 Split clustered system / Stretch clustered system
2.2 SAN switches
2.2.1 Selecting SAN switch models
2.2.2 Switch port layout for large edge SAN switches
2.2.3 Switch port layout for director-class SAN switches
2.2.4 IBM System Storage/Brocade b-type SANs
2.2.5 IBM System Storage/Cisco SANs
2.2.6 SAN routing and duplicate WWNNs
2.3 Zoning
2.3.1 Types of zoning
2.3.2 Pre-zoning tips and shortcuts
2.3.3 SVC internode communications zone
2.3.4 SVC storage zones
2.3.5 SVC host zones
2.3.6 Sample standard SVC zoning configuration
2.3.7 Zoning with multiple SVC clustered systems
2.3.8 Split storage subsystem configurations


2.4 Switch Domain IDs
2.5 Distance extension for remote copy services
2.5.1 Optical multiplexors
2.5.2 Long-distance SFPs/XFPs
2.5.3 Fibre Channel: IP conversion
2.6 Tape and disk traffic sharing the SAN
2.7 Switch interoperability
2.8 IBM Tivoli Storage Productivity Center
2.9 iSCSI support
2.9.1 iSCSI initiators and targets
2.9.2 iSCSI Ethernet configuration
2.9.3 Security and performance
2.9.4 Failover of port IP addresses and iSCSI names
2.9.5 iSCSI protocol limitations

Chapter 3. SAN Volume Controller clustered system
3.1 Advantages of virtualization
3.1.1 How does the SVC fit into your environment
3.2 Scalability of SVC clustered systems
3.2.1 Advantage of multi-clustered systems as opposed to single-clustered systems
3.2.2 Growing or splitting SVC clustered systems
3.3 Clustered system upgrade

Chapter 4. Backend storage
4.1 Controller affinity and preferred path
4.2 Considerations for DS4000/DS5000
4.2.1 Setting DS4000/DS5000 so both controllers have the same WWNN
4.2.2 Balancing workload across DS4000/DS5000 controllers
4.2.3 Ensuring path balance prior to MDisk discovery
4.2.4 ADT for DS4000/DS5000
4.2.5 Selecting array and cache parameters
4.2.6 Logical drive mapping
4.3 Considerations for DS8000
4.3.1 Balancing workload across DS8000 controllers
4.3.2 DS8000 ranks to extent pools mapping
4.3.3 Mixing array sizes within a Storage Pool
4.3.4 Determining the number of controller ports for DS8000
4.3.5 LUN masking
4.3.6 WWPN to physical port translation
4.4 Considerations for XIV
4.4.1 Cabling considerations
4.4.2 Host options and settings for IBM XIV systems
4.4.3 Restrictions
4.5 Considerations for V7000
4.5.1 Defining internal storage
4.5.2 Configuring IBM Storwize V7000 storage systems
4.6 Considerations for Third Party storage
4.6.1 Pathing considerations for EMC Symmetrix/DMX and HDS
4.7 Medium error logging
4.8 Mapping physical LBAs to volume extents
4.9 Using Tivoli Storage Productivity Center to identify storage controller boundaries


Chapter 5. Storage pools and Managed Disks
5.1 Availability considerations for Storage Pools

5.2 Selecting storage subsystems
5.3 Selecting the Storage Pool
5.3.1 Selecting the number of arrays per Storage Pool
5.3.2 Selecting LUN attributes
5.3.3 Considerations for IBM XIV Storage System
5.4 SVC quorum disk considerations
5.5 Tiered storage
5.6 Adding MDisks to existing Storage Pools
5.6.1 Checking access to new MDisks
5.6.2 Persistent reserve
5.6.3 Renaming MDisks
5.7 Restriping (balancing) extents across a Storage Pool
5.7.1 Installing prerequisites and the SVCTools package
5.7.2 Running the extent balancing script
5.8 Removing MDisks from existing Storage Pools
5.8.1 Migrating extents from the MDisk to be deleted
5.8.2 Verifying an MDisk's identity before removal
5.8.3 LUNs to MDisk translation
5.9 Remapping managed MDisks
5.10 Controlling extent allocation order for volume creation
5.11 Moving an MDisk between SVC clusters


Chapter 6. Volumes
6.1 Volume Overview
6.1.1 Thin-provisioned volumes
6.1.2 Space allocation
6.1.3 Thin-provisioned volume performance
6.1.4 Limits on Virtual Capacity of Thin-provisioned volumes
6.1.5 Testing an application with Thin-provisioned volume
6.2 What is volume mirroring
6.2.1 Creating or adding a mirrored volume
6.2.2 Availability of mirrored volumes
6.2.3 Mirroring between controllers
6.3 Creating Volumes
6.3.1 Selecting the Storage Pool
6.3.2 Changing the preferred node within an I/O Group
6.3.3 Moving a volume to another I/O Group
6.4 Volume migration
6.4.1 Image type to striped type migration
6.4.2 Migrating to image type volume
6.4.3 Migrating with volume mirroring
6.5 Preferred paths to a volume
6.5.1 Governing of volumes
6.6 Cache mode and cache-disabled volumes
6.6.1 Underlying controller remote copy with SVC cache-disabled volumes
6.6.2 Using underlying controller flash copy with SVC cache-disabled volumes
6.6.3 Changing cache mode of volumes
6.7 The effect of load on storage controllers
6.8 Setting up FlashCopy services
6.8.1 Steps to making a FlashCopy volume with application data integrity
6.8.2 Making multiple related FlashCopy volumes with data integrity
6.8.3 Creating multiple identical copies of a volume
6.8.4 Creating a FlashCopy mapping with the incremental flag


6.8.5 Thin-provisioned FlashCopy
6.8.6 Using FlashCopy with your backup application
6.8.7 Using FlashCopy for data migration
6.8.8 Summary of FlashCopy rules
6.8.9 IBM Tivoli Storage FlashCopy Manager
6.8.10 IBM System Storage Support for Microsoft Volume Shadow Copy Service

Chapter 7. Remote Copy services
7.1 Remote Copy services: an introduction
7.1.1 Common terminology and definitions
7.1.2 Intercluster link
7.2 SVC functions by release
7.2.1 What is new in SVC 6.2
7.2.2 Remote copy features by release
7.3 Terminology and functional concepts
7.3.1 Remote copy partnerships and relationships
7.3.2 Global Mirror control parameters
7.3.3 Global Mirror partnerships and relationships
7.3.4 Asynchronous remote copy
7.3.5 Understanding Remote Copy write operations
7.3.6 Asynchronous remote copy
7.3.7 Global Mirror write sequence
7.3.8 Importance of write ordering
7.3.9 Colliding writes
7.3.10 Link speed, latency, and bandwidth
7.3.11 Choosing a link capable of supporting GM applications
7.3.12 Remote Copy Volumes: Copy directions and default roles
7.4 Intercluster (Remote) link
7.4.1 SAN configuration overview
7.4.2 Switches and ISL oversubscription
7.4.3 Zoning
7.4.4 Distance extensions for the Intercluster Link
7.4.5 Optical multiplexors
7.4.6 Long-distance SFPs/XFPs
7.4.7 Fibre Channel: IP conversion
7.4.8 Configuration of intercluster (long distance) links
7.4.9 Link quality
7.4.10 Hops
7.4.11 Buffer credits
7.5 Global Mirror design points
7.5.1 Global Mirror parameters
7.5.2 chcluster and chpartnership commands
7.5.3 How GM Bandwidth is distributed
7.5.4 1920 errors
7.6 Global Mirror planning
7.6.1 Summary of Metro Mirror and Global Mirror rules
7.6.2 Planning overview
7.6.3 Planning specifics
7.7 Global Mirror use cases
7.7.1 Synchronize a Remote Copy relationship
7.7.2 Setting up GM relationships: saving bandwidth and resizing volumes
7.7.3 Master and auxiliary volumes and switching their roles
7.7.4 Migrating a Metro Mirror relationship to Global Mirror



7.7.5 Multiple Cluster Mirroring (MCM)
7.7.6 Performing three-way copy service functions
7.7.7 When to use storage controller Advanced Copy Services functions
7.7.8 Using Metro Mirror or Global Mirror with FlashCopy
7.7.9 Global Mirror upgrade scenarios
7.8 Inter-cluster MM / GM source as FC target
7.9 States and steps in the GM relationship
7.9.1 Global Mirror states
7.9.2 Disaster Recovery and GM/MM states
7.9.3 State definitions
7.10 1920 errors
7.10.1 Diagnosing and fixing 1920
7.10.2 Focus areas for 1920 errors (the usual suspects)
7.10.3 Recovery
7.10.4 Disabling gmlinktolerance feature
7.10.5 Cluster error code 1920: check list for diagnosis
7.11 Monitoring Remote Copy relationships

Chapter 8. Hosts
8.1 Configuration recommendations
8.1.1 Recommended host levels and host object name
8.1.2 The number of paths
8.1.3 Host ports
8.1.4 Port masking
8.1.5 Host to I/O Group mapping
8.1.6 Volume size as opposed to quantity
8.1.7 Host volume mapping
8.1.8 Server adapter layout
8.1.9 Availability as opposed to error isolation
8.2 Host pathing
8.2.1 Preferred path algorithm
8.2.2 Path selection
8.2.3 Path management
8.2.4 Dynamic reconfiguration
8.2.5 Volume migration between I/O Groups
8.3 I/O queues
8.3.1 Queue depths
8.4 Multipathing software
8.5 Host clustering and reserves
8.5.1 AIX
8.5.2 SDD compared to SDDPCM
8.5.3 Virtual I/O server
8.5.4 Windows
8.5.5 Linux
8.5.6 Solaris
8.5.7 VMware
8.6 Mirroring considerations
8.6.1 Host-based mirroring
8.7 Monitoring
8.7.1 Automated path monitoring
8.7.2 Load measurement and stress tools

167 171 173 173 174 175 177 177 179 180 182 182 183 186 187 188 189 191 192 192 192 193 194 194 194 194 199 199 199 199 200 201 201 203 205 205 207 207 209 212 213 215 216 217 220 221 221 222 223 223

Part 1. Performance best practices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

Chapter 9. SVC 6.2 performance highlights . . . 227
9.1 SVC continuing performance enhancements . . . 228
9.2 Solid State Drives (SSDs) and Easy Tier . . . 229
9.2.1 Internal SSDs Redundancy . . . 230
9.2.2 Performance scalability and I/O Groups . . . 231
9.3 Real Time Performance Monitor . . . 232

Chapter 10. Backend performance considerations . . . 233
10.1 Workload considerations . . . 234
10.2 Tiering . . . 235
10.3 Storage controller considerations . . . 235
10.3.1 Backend IO capacity . . . 235
10.4 Array considerations . . . 244
10.4.1 Selecting the number of LUNs per array . . . 245
10.4.2 Selecting the number of arrays per storage pool . . . 245
10.5 I/O ports, cache, throughput considerations . . . 247
10.5.1 Back-end queue depth . . . 247
10.5.2 MDisk transfer size . . . 248
10.6 SVC extent size . . . 250
10.7 SVC cache partitioning . . . 252
10.8 DS8000 considerations . . . 253
10.8.1 Volume layout . . . 253
10.8.2 Cache . . . 258
10.8.3 Determining the number of controller ports for DS8000 . . . 258
10.8.4 Storage pool layout . . . 260
10.8.5 Extent size . . . 264
10.9 XIV considerations . . . 265
10.9.1 LUN size . . . 265
10.9.2 IO ports . . . 266
10.9.3 Storage pool layout . . . 268
10.9.4 Extent size . . . 268
10.10 Storwize V7000 considerations . . . 268
10.10.1 Volume setup . . . 268
10.10.2 IO ports . . . 270
10.10.3 Storage pool layout . . . 273
10.10.4 Extent size . . . 275
10.11 DS5000 considerations . . . 275
10.11.1 Selecting array and cache parameters . . . 275
10.11.2 Considerations for controller configuration . . . 277
10.11.3 Mixing array sizes within the storage pool . . . 278
10.11.4 Determining the number of controller ports for DS4000 . . . 278

Chapter 11. Easy Tier . . . 279
11.1 Overview of Easy Tier . . . 280
11.2 Easy Tier concepts . . . 280
11.2.1 SSD arrays and MDisks . . . 280
11.2.2 Disk tiers . . . 281
11.2.3 Single tier storage pools . . . 281
11.2.4 Multiple tier storage pools . . . 281
11.2.5 Easy Tier process . . . 282
11.2.6 Easy Tier operating modes . . . 283
11.2.7 Easy Tier activation . . . 284
11.3 Easy Tier implementation considerations . . . 284
11.3.1 Prerequisites . . . 284

11.3.2 Implementation rules . . . 284
11.3.3 Limitations . . . 285
11.4 Measuring and activating Easy Tier . . . 286
11.4.1 Measuring by using the Storage Advisor Tool . . . 286
11.5 Using Easy Tier with the SVC CLI . . . 287
11.5.1 Initial cluster status . . . 288
11.5.2 Turning on Easy Tier evaluation mode . . . 288
11.5.3 Creating a multitier storage pool . . . 290
11.5.4 Setting the disk tier . . . 291
11.5.5 Checking a volume's Easy Tier mode . . . 292
11.5.6 Final cluster status . . . 292
11.6 Using Easy Tier with the SVC GUI . . . 293
11.6.1 Setting the disk tier on MDisks . . . 293
11.6.2 Checking Easy Tier status . . . 295
11.7 Solid State Drives . . . 296

Chapter 12. Applications . . . 297
12.1 Application workloads . . . 298
12.1.1 Transaction-based workloads . . . 298
12.1.2 Throughput-based workloads . . . 298
12.1.3 Storage subsystem considerations . . . 299
12.1.4 Host considerations . . . 299
12.2 Application considerations . . . 299
12.2.1 Transaction environments . . . 300
12.2.2 Throughput environments . . . 300
12.3 Data layout overview . . . 301
12.3.1 Layers of volume abstraction . . . 301
12.3.2 Storage administrator and AIX LVM administrator roles . . . 302
12.3.3 General data layout recommendations . . . 302
12.3.4 Database strip size considerations (throughput workload) . . . 305
12.3.5 LVM volume groups and logical volumes . . . 305
12.4 Database Storage . . . 306
12.5 Data layout with the AIX virtual I/O (VIO) server . . . 306
12.5.1 Overview . . . 306
12.5.2 Data layout strategies . . . 307
12.6 Volume size . . . 307
12.7 Failure boundaries . . . 307

Part 2. Management, monitoring and troubleshooting . . . 309

Chapter 13. Monitoring . . . 311
13.1 Using Tivoli Storage Productivity Center to analyze the SVC . . . 312
13.1.1 IBM SAN Volume Controller (SVC) or Storwize V7000 . . . 312
13.2 SVC considerations . . . 316
13.2.1 SVC traffic . . . 316
13.2.2 SVC best practice recommendations for performance . . . 316
13.3 Storwize V7000 considerations . . . 317
13.3.1 Storwize V7000 traffic . . . 317
13.3.2 Storwize V7000 best practice recommendations for performance . . . 317
13.4 Top 10 reports for SVC and Storwize V7000 . . . 318
13.4.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports . . . 319
13.4.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports . . . 327
13.4.3 Top 10 for SVC #3: Managed Disk Group Performance reports . . . 335
13.4.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports . . . 341

13.4.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports . . . 346
13.5 Reports for Fabric and Switches . . . 352
13.5.1 Switches reports: Overview . . . 352
13.5.2 Switch Port Data Rate performance . . . 353
13.6 Case study: Server - performance problem with one server . . . 355
13.7 Case study: Storwize V7000 - disk performance problem . . . 359
13.8 Case study: Top volumes response time and I/O rate performance report . . . 368
13.9 Case study: SVC and Storwize V7000 performance constraint alerts . . . 370
13.10 Case study: Fabric - monitor and diagnose performance . . . 374
13.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration . . . 380
13.11.1 Ensuring that all SVC ports are online . . . 380
13.11.2 Verifying SVC port zones . . . 383
13.11.3 Verifying paths to storage . . . 383
13.11.4 Verifying host paths to the Storwize V7000 . . . 386
13.12 Using SVC or Storwize V7000 GUI for real-time monitoring . . . 388
13.13 Gathering manually the SVC statistics . . . 391

Chapter 14. Maintenance . . . 395
14.1 Automating SVC and SAN environment documentation . . . 396
14.1.1 Naming Conventions . . . 396
14.1.2 SAN Fabrics documentation . . . 399
14.1.3 SVC . . . 400
14.1.4 Storage . . . 401
14.1.5 Technical Support Information . . . 401
14.1.6 Tracking Incident & Change tickets . . . 402
14.1.7 Automated Support Data collection . . . 403
14.1.8 Subscribing for SVC support information . . . 403
14.2 Storage Management IDs . . . 404
14.3 Standard operating procedures . . . 405
14.3.1 Allocate and de-allocate volumes to hosts . . . 405
14.3.2 Add and remove hosts in SVC . . . 405
14.4 SVC Code upgrade . . . 406
14.4.1 Prepare for upgrade . . . 406
14.4.2 SVC Upgrade from 5.1 to 6.2 . . . 410
14.4.3 Upgrade SVC clusters participating in MM or GM . . . 412
14.4.4 SVC upgrade . . . 412
14.5 SAN modifications . . . 413
14.5.1 Cross-referencing HBA WWPNs . . . 413
14.5.2 Cross-referencing LUNids . . . 414
14.5.3 HBA replacement . . . 415
14.6 SVC Hardware Upgrades . . . 416
14.6.1 Add SVC nodes to an existing cluster . . . 416
14.6.2 Upgrade SVC nodes in an existing cluster . . . 417
14.6.3 Move to a new SVC cluster . . . 417
14.7 Wrap up . . . 418

Chapter 15. Troubleshooting and diagnostics . . . 419
15.1 Common problems . . . 420
15.1.1 Host problems . . . 420
15.1.2 SVC problems . . . 420
15.1.3 SAN problems . . . 422
15.1.4 Storage subsystem problems . . . 422
15.2 Collecting data and isolating the problem . . . 424

15.2.1 Host data collection . . . 424
15.2.2 SVC data collection . . . 427
15.2.3 SAN data collection . . . 431
15.2.4 Storage subsystem data collection . . . 435
15.3 Recovering from problems . . . 438
15.3.1 Solving host problems . . . 438
15.3.2 Solving SVC problems . . . 440
15.3.3 Solving SAN problems . . . 443
15.3.4 Solving back-end storage problems . . . 443
15.4 Mapping physical LBAs to volume extents . . . 447
15.4.1 Investigating a medium error using lsvdisklba . . . 447
15.4.2 Investigating thin-provisioned volume allocation using lsmdisklba . . . 447
15.5 Medium error logging . . . 448
15.5.1 Host-encountered media errors . . . 448
15.5.2 SVC-encountered medium errors . . . 449

Part 3. Practical examples . . . 451

Chapter 16. SVC scenarios . . . 453
16.1 SVC upgrade with CF8 nodes and internal SSDs . . . 454
16.2 Move an AIX server to another LPAR . . . 465
16.3 Migration to new SVC using Copy Services . . . 468
16.4 SVC Scripting . . . 472

Related publications . . . 477
IBM Redbooks publications . . . 477
Other resources . . . 477
Referenced Web sites . . . 478
How to get IBM Redbooks publications . . . 479
Help from IBM . . . 479

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481

Preface
This IBM Redbooks publication captures several of the best practices based on field experience and describes the performance gains that can be achieved by implementing the IBM System Storage SAN Volume Controller at the V6.2 level. This book is intended for experienced storage, SAN, and SVC administrators and technicians. Readers are expected to have an advanced knowledge of the SAN Volume Controller (SVC) and SAN environment, and we recommend these books as background reading:

IBM System Storage SAN Volume Controller, SG24-6423
Introduction to Storage Area Networks, SG24-5470
Using the SVC for Business Continuity, SG24-7371

The team that wrote this book


This book was produced by a team of specialists from around the world working at the International Technical Support Organization, San Jose Center.

Mary Lovelace is a Project Manager at the International Technical Support Organization, San Jose Center. She has more than 20 years of experience with IBM in large systems, storage, and storage networking product education, system engineering and consultancy, and systems support. She has written many Redbooks publications about IBM z/OS storage products, IBM Tivoli Storage Productivity Center, Tivoli Storage Manager, and Scale Out NAS.

Katja Gebuhr is a Level 3 Service Specialist at IBM UK, Hursley. She joined IBM Germany in 2003, completed an apprenticeship as an IT-System Business Professional in 2006, and then worked for four years in Front End SAN Support, providing customer support for SAN Volume Controller and SAN products. She then worked in SVC Development Testing in Mainz and moved in 2010 from IBM Germany to IBM UK, where she provides worldwide Level 3 customer support for the SAN Volume Controller and IBM Storwize V7000.

Ivo Gomilsek is a.

Ronda Hruby is a V7000 and SAN Volume Controller Level 3 Support Engineer at the Almaden Research Center in San Jose, California. Before joining Level 3 in 2011, she supported multipathing software and virtual tape products. Before joining the IBM Storage Software PFE organization in 2002, she worked in hardware and microcode development for more than 20 years. She is a SNIA certified professional.

Paulo Neto is a SAN Designer for Managed Storage Services supporting European clients. He has been with IBM for more than 23 years and has eleven years of storage and SAN experience. Before joining MSS, he provided Tivoli Storage Manager, SAN, and IBM AIX support and services for IBM Global Technology Services in Portugal. His areas of expertise include SAN design, storage implementation, storage management, and Disaster Recovery.
He is an IBM Certified IT Specialist (Level 2) and a Brocade Certified Fabric Designer. He holds a BSc in Electronics and Computer Engineering from the ISEP, Portugal, as well as an MSc in Informatics from the FCUP, Portugal.

Jon Parkes is a Level 3 Service Specialist at IBM UK, Hursley. He has over 15 years of experience in testing and developing disk drives, storage products, and applications. He is experienced in managing product test and product quality assurance activities and in providing technical advocacy to clients. For the past four years he has specialised in the test and support of the SAN Volume Controller and V7000 product range.

Otavio Rocha Filho is a SAN Storage Specialist for Strategic Outsourcing at the IBM Brazil Global Delivery Center in Hortolandia. Since joining IBM in 2007 he has been the SAN Storage subject matter expert (SME) for many of its international customers. Working in information technology since 1988, he has been dedicated to storage solution design, implementation, and support since 1998, deploying the latest in Fibre Channel and SAN technology since its early years. Otavio's certifications include Open Group Master IT Specialist, Brocade SAN Manager, and ITIL Service Management Foundation.

Leandro Torolho is an IT Specialist for IBM Global Services in Brazil. With a background in the UNIX and backup areas, he is currently a SAN Storage subject matter expert (SME) working on implementation and support for its international customers. He holds a Bachelor degree in Computer Science from USCS/SCS, São Paulo, Brazil, as well as a post-graduate qualification in Computer Networks from FASP/SP, São Paulo, Brazil. He has 10 years of experience in information technology and is AIX, TSM, and ITIL certified.

We extend our thanks to the many people who contributed to this project. In particular, we thank the development and PFE teams in Hursley, England.

The authors of the previous edition of this book were:
Katja Gebuhr
Alex Howell
Nik Kjeldsen
Jon Tate

We also want to thank the following people for their contributions:
Lloyd Dean
Parker Grannis
Brian Sherman
Bill Wiegand

Become a published author


Join us for a two- to six-week residency program. Help write a book dealing with specific products or solutions, while getting hands-on experience with leading-edge technologies. You will have the opportunity to team with IBM technical professionals, IBM Business Partners, and Clients. Your efforts will help increase product acceptance and client satisfaction. As a bonus, you will develop a network of contacts in IBM development labs, and increase your productivity and marketability. Find out more about the residency program, browse the residency index, and apply online at: ibm.com/redbooks/residencies.html

Comments welcome
Your comments are important to us! We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:

Use the online Contact us review IBM Redbooks publications form found at:
ibm.com/redbooks

Send your comments in an e-mail to:
redbooks@us.ibm.com

Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400

Summary of changes
This section describes the technical changes made in this edition of the book and in previous editions. This edition might also include minor corrections and editorial changes that are not identified.

Summary of Changes for SG24-7521-02, SAN Volume Controller Best Practices and Performance Guidelines, as created or updated on February 16, 2012.

October 2011, Third Edition


This revision reflects the addition, deletion, or modification of new and changed information described below.

New information
- SVC 6.2 function
- Space-Efficient VDisks
- SVC Console
- VDisk Mirroring

Copyright IBM Corp. 2011. All rights reserved.

Part 1. Configuration guidelines and best practices


In this part we discuss SVC configuration guidelines and best practices.


Chapter 1. SVC update

In this chapter we provide a summary of the enhancements in the SAN Volume Controller (SVC) since version 4.3.0. Changed terminology from previous SVC releases is explained. We also provide a summary of the contents of this book.

1.1 SVC V5.1 enhancements and changes


In this section we present the major enhancements and changes introduced in SVC V5.1:

- New capabilities with the 2145-CF8 hardware engine. SVC improves its performance capabilities going forward by upgrading to a 64-bit software kernel, which allows it to take advantage of cache increases such as the 24 GB provided in the new 2145-CF8. SVC V5.1 runs on all SVC 2145 models that use 64-bit hardware, which includes Models 8F2, 8F4, 8A4, 8G4, and CF8. The 2145-4F2 node (32-bit hardware) is not supported in this version. SVC V5.1 also supports optional Solid State Drives (SSDs) on the 2145-CF8, which provide a new ultra-high-performance storage option. Each 2145-CF8 node supports up to four SSDs in conjunction with the required SAS adapter.

- Multi-Target Reverse IBM FlashCopy and Storage FlashCopy Manager. With SVC V5.1, Reverse FlashCopy support is available. Reverse FlashCopy enables FlashCopy targets to become restore points for the source without breaking the FlashCopy relationship and without having to wait for the original copy operation to complete. It supports multiple targets and thus multiple rollback points.

- 1 Gb iSCSI host attachment. SVC V5.1 delivers native support of the iSCSI protocol for host attachment. However, all inter-node and back-end storage communications still flow through the Fibre Channel adapters.

- Split an SVC I/O group across long distances. With the option to use 8 Gbps LW SFPs in the SVC 2145-CF8, SVC V5.1 introduces the ability to split an SVC I/O group across long distances.

- Remote authentication for users of SAN Volume Controller clusters. SVC V5.1 provides the Enterprise Single Sign-on client to interact with an LDAP directory server such as IBM Tivoli Directory Server or Microsoft Active Directory.

- Remote copy functions. The number of cluster partnerships has been raised from one to a maximum of three, which means that a single SVC cluster can have partnerships with up to three clusters at the same time. This allows the establishment of multiple partnership topologies, including star, triangle, mesh, and daisy chain. The maximum number of remote copy relationships has been raised to 8192.

- Maximum VDisk size raised to 256 TB. SVC V5.1 provides greater flexibility in expanding provisioned storage by increasing the allowable size of VDisks from the former 2 TB limit to 256 TB.

- Reclaiming unused disk space using space-efficient VDisks and VDisk Mirroring. SVC V5.1 enables the reclamation of unused allocated disk space when converting a fully allocated VDisk to a space-efficient virtual disk using the VDisk Mirroring functionality.

- New reliability, availability, and serviceability (RAS) functionality. SAN Volume Controller's RAS capabilities are further enhanced in V5.1. Administrators benefit from better SVC availability and serviceability through automatic recovery of node metadata, in conjunction with improved error notification capabilities (across e-mail, syslog, and SNMP). Error notification supports up to six e-mail destination addresses, and quorum disk management has been improved with a set of new commands.

- Optional second management IP address configured on the eth1 port. The existing SVC node hardware has two Ethernet ports. Until SVC 4.3, only one Ethernet port (eth0) was used for cluster configuration. In SVC V5.1 a second, new cluster IP address can optionally be configured on the eth1 port.

- Added interoperability. Interoperability with new storage controllers, host operating systems, fabric devices, and other hardware. An updated list can be found at:
  https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003553

- Withdrawal of support for 2145-4F2 nodes (32-bit). As stated before, SVC V5.1 only supports SVC 2145 engines that use 64-bit hardware.

- SVC Entry Edition allows up to 250 drives, running only on 2145-8A4 nodes. The SVC Entry Edition uses a per-disk-drive charge unit, and now may be used for storage configurations of up to 250 disk drives.

1.2 SVC V6.1 enhancements and changes


In this section we detail the major enhancements and changes introduced in SVC V6.1:

- A newly designed user interface (IBM XIV like). The SVC Console has a newly designed graphical user interface (GUI), which now runs on the SVC and can be accessed from anywhere on the network using a web browser. It includes several enhancements, such as greater flexibility of views, display of the command lines being executed, and improved user customization within the GUI. Customers using Tivoli Storage Productivity Center and IBM Systems Director can take advantage of integration points with the new SVC Console.

- SVC for XIV (5639-SX1) new licensing. The new PID 5639-SX1, IBM SAN Volume Controller for XIV Software V6, is priced by the number of storage devices (also referred to as modules or enclosures). It eliminates the appearance of double charging for features bundled in the XIV software license and can be combined with the per-TB license to extend SVC usage with a mix of back-end storage subsystems.

- Service Assistant. SVC V6.1 introduces a new method for performing service tasks on the system. In addition to performing service tasks from the front panel, you can also service a node through an Ethernet connection using either a web browser or the command-line interface. The web browser runs a new service application called the Service Assistant. All functions that were previously available through the front panel are now available from the Ethernet connection, with the advantages of an easier-to-use interface and remote access. Furthermore, Service Assistant commands can also be run through a USB stick, allowing easier serviceability.

- IBM Easy Tier added at no charge. SVC V6.1 delivers IBM System Storage Easy Tier, which is a dynamic data relocation feature that allows host-transparent movement of data between two tiers of storage. This includes the ability to automatically relocate volume extents with high activity to storage media with higher performance characteristics, while extents with low activity are migrated to storage media with lower performance characteristics. This capability aligns the SVC system with current workload requirements, increasing the overall storage performance.
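Conceptually, this kind of tiering decision can be illustrated with a toy extent-placement sketch. This is not IBM's Easy Tier algorithm, only an illustration of the idea of ranking extents by activity and placing the hottest ones on the faster tier; the extent names and heat counters below are invented:

```python
# Toy illustration only -- NOT IBM's Easy Tier algorithm.
# Given per-extent activity counters, place the hottest extents on the
# faster (SSD) tier up to its extent capacity; the rest stay on HDD.

def plan_tiers(extent_heat, ssd_extent_capacity):
    """Return a dict mapping each extent name to 'ssd' or 'hdd'."""
    ranked = sorted(extent_heat, key=extent_heat.get, reverse=True)
    hot = set(ranked[:ssd_extent_capacity])
    return {e: ("ssd" if e in hot else "hdd") for e in extent_heat}

# Invented example data: five extents, SSD tier holds two extents.
heat = {"e0": 950, "e1": 12, "e2": 430, "e3": 3, "e4": 700}
print(plan_tiers(heat, ssd_extent_capacity=2))
# e0 and e4 (the two hottest extents) land on the SSD tier; the rest stay on HDD.
```

In the real product the migration is done online, extent by extent, based on continuously gathered I/O statistics rather than a one-shot ranking.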

- Temporary withdrawal of support for SSDs on the 2145-CF8 nodes. At the time of writing, 2145-CF8 nodes using internal Solid State Drives (SSDs) are unsupported with V6.1.0.x code (fixed in version 6.2).

- Interoperability with new storage controllers, host operating systems, fabric devices, and other hardware. An updated list can be found at:
  https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003697

- Removal of the 15-character maximum name length restriction. SVC V6.1 supports object names up to 63 characters. Previous levels only supported up to 15 characters.

- SVC code upgrades. The SVC Console code has been removed; now you just need to update the SVC code. The upgrade from SVC V5.1 requires the use of the former console interface or the command line. After the upgrade is complete, you can remove the existing ICA (console) application from your SSPC or Master Console. The new GUI is launched through a web browser pointing to the SVC cluster IP address.

- SVC to back-end controller I/O change. SVC V6.1 allows variable block sizes up to 256 KB, versus the 32 KB supported in previous versions. This is handled automatically by the SVC system without requiring any user control.

- Scalability. The maximum extent size increased four times, to 8 GB. With an extent size of 8 GB, the total storage capacity manageable per cluster is 32 PB. The maximum volume size increased to 1 PB. The maximum number of WWNNs increased to 1024, allowing up to 1024 back-end storage subsystems to be virtualized.

- SVC and Storwize V7000 interoperability. The virtualization layer of IBM Storwize V7000 is built upon the IBM SAN Volume Controller technology. SVC V6.1 is the first version supported in this environment.

- Terminology change. To coincide with new and existing IBM products and functions, several common terms have changed and are incorporated in the SAN Volume Controller information. The following table shows the current and previous usage of the changed common terms.
Table 1-1 Terminology mapping table

6.1.0 SAN Volume Controller term: event
Previous term: error
Description: An occurrence of significance to a task or system. Events can include completion or failure of an operation, a user action, or the change in state of a process.

6.1.0 SAN Volume Controller term: host mapping
Previous term: VDisk-to-host mapping
Description: The process of controlling which hosts have access to specific volumes within a cluster.

6.1.0 SAN Volume Controller term: storage pool
Previous term: managed disk (MDisk) group
Description: A collection of storage capacity that provides the capacity requirements for a volume.

6.1.0 SAN Volume Controller term: thin provisioning (or thin-provisioned)
Previous term: space-efficient
Description: The ability to define a storage unit (full system, storage pool, volume) with a logical capacity size that is larger than the physical capacity assigned to that storage unit.

6.1.0 SAN Volume Controller term: volume
Previous term: virtual disk (VDisk)
Description: A discrete unit of storage on disk, tape, or other data recording medium that supports some form of identifier and parameter list, such as a volume label or input/output control.
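The scalability figures quoted earlier (8 GB extents giving 32 PB per cluster) can be cross-checked with simple arithmetic. The sketch below assumes the number of extents per clustered system is fixed at 2**22 (4,194,304) — a value inferred from the figures in the text, not quoted from IBM documentation here:

```python
# Cross-check of the quoted scalability figures. Assumption (inferred, not
# from the text): the per-system extent count is fixed at 2**22 extents,
# which is consistent with 8 GB extents -> 32 PB total capacity.

MAX_EXTENTS = 2 ** 22  # assumed per-system extent limit (4,194,304)

def max_capacity_pb(extent_size_gb):
    """Total manageable capacity in PB for a given extent size in GB."""
    return MAX_EXTENTS * extent_size_gb / 1024 ** 2  # GB -> PB (binary units)

print(max_capacity_pb(8))  # 32.0 -- matches the 32 PB quoted for 8 GB extents
print(max_capacity_pb(2))  # 8.0  -- with the previous 2 GB maximum extent size
```

The same arithmetic explains why a larger extent size trades allocation granularity for total addressable capacity.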

1.3 SVC V6.2 enhancements and changes


In this section we discuss the enhancements and changes introduced in SVC V6.2:

- Support for the SAN Volume Controller 2145-CG8. The new 2145-CG8 engine contains 24 GB of cache and four 8 Gbps Fibre Channel host bus adapter (HBA) ports for attachment to the SAN. The 2145-CG8 auto-negotiates the fabric speed on a per-port basis and is not restricted to run at the same speed as other node pairs in the clustered system. 2145-CG8 nodes can be added in pairs to an existing system composed of 64-bit hardware nodes (8F2, 8F4, 8G4, 8A4, CF8, or CG8), up to the maximum of four pairs.

- 10 Gb iSCSI host attachment. The new 2145-CG8 node comes with the option to add a dual-port 10 Gb Ethernet adapter, which can be used for iSCSI host attachment. The 2145-CG8 node also supports the optional use of SSD devices (up to four); however, both options cannot coexist on the same SVC node.

- Real-time performance statistics through the management GUI. Real-time performance statistics provide short-term status information for the system. The statistics are shown as graphs in the management GUI. Historical data is kept for only about five minutes, so Tivoli Storage Productivity Center can be used to capture more detailed performance information, to analyze mid- and long-term historical data, and to get a complete picture when developing the best performance solutions.

- SSD RAID (0, 1, and 10). Optional SSDs are not accessible over the SAN. They are used through the creation of RAID arrays. The supported RAID levels are 0, 1, and 10. In a RAID 1 or RAID 10 array, the data is mirrored between SSDs on two nodes in the same I/O group.

- Easy Tier available for use with SSDs on the 2145-CF8 and 2145-CG8 nodes. SVC V6.2 reinstates support for internal SSDs, allowing Easy Tier to work with internal SSD storage pools.

- Support for a FlashCopy target as a Remote Copy source. SVC V6.2 allows a FlashCopy target volume to be a source volume in a remote copy relationship.

- Support for the VMware vStorage API for Array Integration (VAAI). SVC V6.2 fully supports the VMware VAAI protocols. One of the improvements provided with VAAI support is the ability to dramatically offload the I/O processing generated by performing a VMware Storage vMotion.

- Command-line interface (CLI) prefix removal. The svctask and svcinfo command prefixes are no longer necessary when issuing a command. If you have existing scripts that use those prefixes, they will continue to function.
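As a brief illustration of the CLI prefix change (the pool name Pool0 and the volume parameters below are hypothetical; lsvdisk and mkvdisk are standard SVC CLI commands), both forms work on a V6.2 system, while earlier releases required the prefixes:

```shell
# Pre-V6.2 style: explicit svcinfo/svctask prefixes
svcinfo lsvdisk
svctask mkvdisk -mdiskgrp Pool0 -iogrp 0 -size 100 -unit gb

# V6.2 style: the prefixes may be omitted
lsvdisk
mkvdisk -mdiskgrp Pool0 -iogrp 0 -size 100 -unit gb
```

Keeping the prefixes in scripts remains safe, because the prefixed forms continue to function.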
- Licensing change: removal of the physical site boundary. The licensing for SVC systems (formerly clusters) within the same country belonging to the same customer can be aggregated in a single license.

- FlashCopy is now licensed on the main source volumes. SVC V6.2 changes the way FlashCopy is licensed so that SVC now counts only the main source in FlashCopy relationships. Previously, if cascaded FlashCopy was set up, multiple source volumes would have to be licensed.

- Interoperability with new storage controllers, host operating systems, fabric devices, and other hardware. An updated list can be found at:
  https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797

- Exceeding the entitled virtualization license is allowed for 45 days from the installation date for the purpose of migrating data from one system to another. With the benefit of virtualization, SVC allows customers to bring new storage systems into their storage environment and very quickly and easily migrate data from their existing storage systems to the new storage systems. To facilitate this migration, IBM allows customers to temporarily (45 days from the date of installation of the SVC) exceed their entitled virtualization license for the purpose of migrating data from one system to another.

The following table shows the current and previous usage of the changed common terms.
Table 1-2 Terminology mapping table

6.2.0 SAN Volume Controller term: clustered system or system
Previous term: cluster
Description: A collection of nodes that are placed in pairs (I/O groups) for redundancy, which provide a single management interface.

1.4 Contents of this book


This book is divided into four parts:

- Configuration Guidelines
- Performance Best Practices
- Monitoring, Maintenance and Troubleshooting
- Practical Scenarios

1.4.1 Part 1 - Configuration Guidelines


Chapter 1. SVC update
This chapter provides an update to the SVC versions, features and functions since this book was last updated.

Chapter 2. SAN topology


In this chapter we explain the SAN topology required for the SVC clustered system along with several fabric configuration scenarios.

Chapter 3. SAN Volume Controller clustered system


In this chapter, we discuss the advantages of virtualization and the optimal time to use virtualization in your environment. Furthermore, we describe the scalability options for the IBM System Storage SAN Volume Controller (SVC) and when to grow or split an SVC cluster.

Chapter 4. Backend storage


In this chapter we describe aspects and characteristics to consider when planning the attachment of a Backend Storage Device to be virtualized by an IBM System Storage SAN Volume Controller (SVC).

Chapter 5. Storage pools and Managed Disks


In this chapter we describe aspects to consider when planning Storage Pools for an IBM System Storage SAN Volume Controller (SVC) implementation. We also discuss several Managed Disk (MDisk) attributes and provide an overview of the process of adding and removing MDisks from existing Storage Pools.

Chapter 6. Volumes
In this chapter, we discuss several aspects and options when creating Volumes (formerly VDisks). We describe how to create, manage, and migrate volumes across I/O groups. We expand on thin-provisioned volumes, presenting performance and limit considerations. We also explain how to take advantage of the mirroring and FlashCopy capabilities.

Chapter 7. Remote Copy services


In this chapter, we discuss the best practices for using the Remote Copy services: Metro Mirror (MM) and Global Mirror (GM). The main focus is on intercluster GM relationships.

Chapter 8. Hosts
In this chapter we provide recommendations about host attachment and pathing configuration with regard to performance and scalability. We discuss host considerations for volume operations such as volume expansion or migration between I/O groups. Host clustering and the underlying reserve policy are also discussed here.

1.4.2 Part 2 - Performance Best Practices


Chapter 9. SVC V6.2 performance highlights
In this chapter, we discuss the latest performance improvements achieved by SAN Volume Controller (SVC) code release 6.2, the new SVC node hardware models CF8 and CG8, and the new SSD internal array capability in conjunction with the Easy Tier functionality.

Chapter 10. Backend performance considerations


In this chapter we provide good practices on how to configure and present back-end storage arrays to the SVC clustered system for better performance and scalability.

Chapter 11. Frontend performance considerations


In this chapter we present several considerations and benefits concerning the overall virtualized storage front-end performance.


Chapter 12. Easy Tier


In this chapter we provide an overview of the Easy Tier disk performance optimization feature and explain how to activate it, both for evaluation and for automatic extent migration purposes.

Chapter 13. Applications


In this chapter, we provide information about laying out storage for the best performance for general applications, IBM AIX Virtual I/O (VIO) servers, and IBM DB2 databases specifically. While most of the specific information is directed to hosts running the IBM AIX operating system, the information is also relevant to other host types.

1.4.3 Part 3 - Monitoring, Maintenance and Troubleshooting


Chapter 14. SVC Monitoring
In this chapter, we first describe how to collect SAN topology and performance information using TotalStorage Productivity Center (TPC). We then show several examples of misconfiguration and failures, and how they can be identified in the TPC Topology Viewer and performance reports.

Chapter 15. Maintenance


In this chapter, we discuss some of the best practices in the day-to-day storage administration activities with SVC that can help you keep your storage infrastructure at the levels of availability, reliability, and resiliency demanded by today's applications while keeping up with storage growth needs.

Chapter 16. Troubleshooting and diagnostics


We discuss and explain problems related to the SVC, Storage Area Network (SAN) environment, storage subsystems, hosts, and multipathing drivers. Furthermore, we explain how to collect the necessary problem determination data and how to overcome these problems.

1.4.4 Part 4 - Practical Scenarios


Chapter 17. SVC scenarios
In this chapter we provide working scenarios to reinforce and demonstrate the best practices and performance information in this book.

Chapter 2. SAN topology

The IBM System Storage Area Network (SAN) Volume Controller (SVC) has unique SAN fabric configuration requirements that differ from what you might be used to in your storage infrastructure. A quality SAN configuration can help you achieve a stable, reliable, and scalable SVC installation; conversely, a poor SAN environment can make your SVC experience considerably less pleasant. This chapter provides you with information to tackle this topic.

Note: As with any of the information in this book, you must check the IBM System Storage SAN Volume Controller V6.2.0 - Software Installation and Configuration Guide, GC27-2286, and IBM System Storage SAN Volume Controller 6.2.0 Configuration Limits and Restrictions, S1003799, for limitations, caveats, updates, and so on that are specific to your environment. Do not rely on this book as the last word in SVC SAN design. Also, anyone planning for an SVC installation must be knowledgeable about general SAN design principles. Refer to the IBM System Storage SAN Volume Controller Support web page for updated documentation before implementing your solution:

http://www-947.ibm.com/support/entry/portal/Overview/Hardware/System_Storage/Storage_software/Storage_virtualization/SAN_Volume_Controller_(2145)

Note: All document citations in this book refer to the 6.2 version of the SVC product documents. If you use a different version, refer to the correct edition of the documents.

As you read this chapter, remember that this is a best practices book based on field experiences. Although there are many other possible (and supported) SAN configurations not found in this chapter, we think they are not the most recommended.

2.1 SVC SAN topology


The topology requirements for the SVC do not differ too much from any other storage device. What makes the SVC unique here is that it can be configured with a large number of hosts, which can cause interesting issues with SAN scalability. Also, because the SVC often serves so many hosts, an issue caused by poor SAN design can quickly cascade into a catastrophe.

2.1.1 Redundancy
One of the fundamental SVC SAN requirements is to create two (or more) entirely separate SANs that are not connected to each other over Fibre Channel in any way. The easiest way is to construct two SANs that are mirror images of each other. Technically, the SVC supports using just a single SAN (appropriately zoned) to connect the entire SVC. However, we do not recommend this design for any production environment. In our experience, we do not recommend this design for development environments either, because a stable development platform is important to programmers, and an extended outage in the development environment can cause an expensive business impact. For a dedicated storage test platform, however, it might be acceptable.

Redundancy through Cisco VSANs or Brocade Virtual Fabrics


Although VSANs and Virtual Fabrics can provide logical separation within a single appliance, they do not replace hardware redundancy. All SAN switches have been known to suffer from hardware or fatal software failures. Furthermore, redundant fabrics should be separated into different, non-contiguous racks and fed from redundant power sources.

2.1.2 Topology basics


Note: Due to the nature of Fibre Channel, it is extremely important to avoid inter-switch link (ISL) congestion. While Fibre Channel (and the SVC) can, under most circumstances, handle a host or storage array that has become overloaded, the mechanisms in Fibre Channel for dealing with congestion in the fabric itself are not effective. The problems caused by fabric congestion can range anywhere from dramatically slow response time all the way to storage access loss. These issues are common with all high-bandwidth SAN devices and are inherent to Fibre Channel; they are not unique to the SVC.

When an Ethernet network becomes congested, the Ethernet switches simply discard frames for which there is no room. When a Fibre Channel network becomes congested, the Fibre Channel switches instead stop accepting additional frames until the congestion clears, in addition to occasionally dropping frames. This congestion quickly moves upstream in the fabric and clogs the end devices (such as the SVC) from communicating anywhere. This behavior is referred to as head-of-line blocking, and while modern SAN switches internally have a non-blocking architecture, head-of-line blocking still exists as a SAN fabric problem. Head-of-line blocking can result in your SVC nodes being unable to communicate with your storage subsystems or mirror their write caches, just because you have a single congested link leading to an edge switch.

No matter the size of your SVC installation, there are a few best practices that you need to apply to your topology design:

- All SVC node ports in a clustered system must be connected to the same SAN switches as all of the storage devices with which the SVC clustered system is expected to communicate. Conversely, storage traffic and inter-node traffic must never cross an ISL, except during migration scenarios.

- High-bandwidth-utilization servers (such as tape backup servers) must also be on the same SAN switches as the SVC node ports. Putting them on a separate switch can cause unexpected SAN congestion problems, and putting a high-bandwidth server on an edge switch is a waste of ISL capacity.

- If at all possible, plan for the maximum size configuration that you ever expect your SVC installation to reach. As you will see later in this chapter, the design of the SAN can change radically for larger numbers of hosts. Modifying the SAN later to accommodate a larger-than-expected number of hosts might produce a poorly designed SAN, and can be difficult, expensive, and disruptive to your business. Planning for the maximum size does not mean that you need to purchase all of the SAN hardware initially; it only requires you to design the SAN with the maximum size in mind.

- Always deploy at least one extra ISL per switch. Not doing so exposes you to consequences ranging from complete path loss (this is bad) to fabric congestion (this is even worse).

- The SVC does not permit the number of hops between the SVC clustered system and the hosts to exceed three, which is typically not a problem.

2.1.3 ISL oversubscription


The IBM System Storage SAN Volume Controller V6.2.0 - Software Installation and Configuration Guide, GC27-2286, specifies a suggested maximum host port to ISL ratio of 7:1. With modern 4 or 8 Gbps SAN switches, this ratio implies an average bandwidth (in one direction) per host port of approximately 57 MBps (at 4 Gbps). If you do not expect most of your hosts to reach anywhere near that value, it is possible to request an exception to the ISL oversubscription rule, known as a Request for Price Quotation (RPQ), from your IBM marketing representative. Before requesting an exception, however, consider the following factors:

- You must take peak loads into consideration, not average loads. For instance, while a database server might only use 20 MBps during regular production workloads, it might perform a backup at far higher data rates.

- Congestion on one switch in a large fabric can cause performance issues throughout the entire fabric, including traffic between SVC nodes and storage subsystems, even if they are not directly attached to the congested switch. The reasons for these issues are inherent to Fibre Channel flow control mechanisms, which are simply not designed to handle fabric congestion. Therefore, any estimates for required bandwidth prior to implementation must have a safety factor built in.

- On top of the safety factor for traffic expansion, implement a spare ISL or ISL trunk, as stated in 2.1.2, Topology basics on page 12. You still need to be able to avoid congestion if an ISL fails due to issues such as a SAN switch line card or port blade failure.

- Exceeding the standard 7:1 oversubscription ratio requires you to implement fabric bandwidth threshold alerts. Anytime one of your ISLs exceeds 70% utilization, you need to schedule fabric changes to distribute the load further.

- You also need to consider the bandwidth consequences of a complete fabric outage. While a complete fabric outage is a fairly rare event, insufficient bandwidth can turn a single-SAN outage into a total access loss event.

- Take the bandwidth of the links into account. It is common to have ISLs run faster than host ports, which obviously reduces the number of required ISLs.
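The ~57 MBps figure follows directly from the oversubscription ratio. The sketch below assumes a 4 Gbps Fibre Channel link carries roughly 400 MB/s of payload in one direction (accounting for 8b/10b encoding overhead):

```python
# Derivation of the ~57 MBps-per-host-port figure quoted in the text.
# Assumption: a 4 Gbps Fibre Channel link provides roughly 400 MB/s of
# usable one-way payload bandwidth (8b/10b encoding overhead included).

def per_host_bandwidth_mbps(link_payload_mbps, hosts_per_isl):
    """Average one-way bandwidth per host port at a given oversubscription."""
    return link_payload_mbps / hosts_per_isl

# 7:1 host-port-to-ISL ratio on 4 Gbps links:
print(round(per_host_bandwidth_mbps(400, 7), 1))  # 57.1
```

The same calculation with 8 Gbps ISLs (roughly 800 MB/s payload) doubles the per-host budget, which is why running ISLs faster than host ports reduces the number of ISLs needed.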

The RPQ process involves a review of your proposed SAN design to ensure that it is reasonable for your proposed environment.

2.1.4 Single switch SVC SANs


The most basic SVC topology consists of nothing more than a single switch per SAN, which can be anything from a 16-port 1U switch for a small installation of just a few hosts and storage devices, all the way up to a director with hundreds of ports. This design obviously has the advantage of simplicity, and it is a sufficient architecture for small to medium SVC installations. It is preferable to use a multi-slot director-class single switch rather than setting up a core-edge fabric made up solely of lower-end switches. As stated in 2.1.2, Topology basics on page 12, keep the maximum planned size of the installation in mind if you decide to use this architecture. If you run too low on ports, expansion can be difficult.

2.1.5 Basic core-edge topology


The core-edge topology is easily recognized by most SAN architects, as illustrated in Figure 2-1 on page 15. It consists of a switch in the center (usually, a director-class switch), which is surrounded by other switches. The core switch contains all SVC ports, storage ports, and high-bandwidth hosts. It is connected via ISLs to the edge switches. The edge switches can be of any size. If they are multi-slot directors, they are usually fitted with at least a few oversubscribed line cards/port blades, because the vast majority of hosts do not ever require line-speed bandwidth, or anything close to it. Note that ISLs must not be on oversubscribed ports.

Figure 2-1 Core-edge topology

2.1.6 Four-SAN core-edge topology


For installations where even a core-edge fabric made up of multi-slot director-class SAN switches is insufficient, the SVC clustered system can be attached to four SAN fabrics instead of the normal two SAN fabrics. This design is especially useful for large, multi-clustered systems installations. As with a regular core-edge, the edge switches can be of any size, and multiple ISL links should be installed per switch. As you can see in Figure 2-2 on page 16, we have attached the SVC clustered system to each of four independent fabrics. The storage subsystem used also connects to all four SAN fabrics, even though this design is not required.

Chapter 2. SAN topology


Figure 2-2 Four-SAN core-edge topology

While certain clients have chosen to simplify management by connecting the SANs together into pairs with a single ISL, we do not recommend this design. With only a single ISL connecting fabrics together, a small zoning mistake can quickly lead to severe SAN congestion.

Using the SVC as a SAN bridge: With the ability to connect an SVC clustered system to four SAN fabrics, it is possible to use the SVC as a bridge between two SAN environments (with two fabrics in each environment). This configuration can be useful for sharing resources between the SAN environments without merging them. Another use is if you have devices with different SAN requirements present in your installation. When using the SVC as a SAN bridge, pay special attention to any restrictions and requirements that might apply to your installation.


2.1.7 Common topology issues


In this section, we describe common topology problems that we have encountered.

Accidentally accessing storage over ISLs


One common topology mistake that we have encountered in the field is to have SVC paths from the same node to the same storage subsystem on multiple core switches that are linked together (refer to Figure 2-3). This problem is commonly encountered in environments where the SVC is not the only device accessing the storage subsystems.

Figure 2-3 Spread out disk paths (figure annotation: SVC-to-storage traffic should be zoned to never travel over the inter-switch links)

If you have this type of topology, it is extremely important to zone the SVC so that it only sees paths to the storage subsystems on the same SAN switch as the SVC nodes. Implementing a storage subsystem host port mask might also be feasible here.

Note: This type of topology means you must have more restrictive zoning than what is detailed in 2.3.6, Sample standard SVC zoning configuration on page 33.

Because of the way that the SVC load balances traffic between the SVC nodes and MDisks, the amount of traffic that transits your ISLs will be unpredictable and vary significantly. If you

have the capability, you might want to use either Cisco Virtual SANs (VSANs) or Brocade Traffic Isolation to dedicate an ISL to high-priority traffic. However, as stated before, internode and SVC to backend storage communication should never cross ISLs.
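Such zoning restrictions can be audited mechanically: any SVC-to-storage zone whose members sit on more than one switch would force traffic over an ISL. A minimal Python sketch of that check, with hypothetical port and zone names:

```python
def zones_crossing_isls(zones, port_switch):
    """Return names of zones whose members span more than one switch,
    i.e. zones whose traffic would have to transit an ISL."""
    return [name for name, members in zones.items()
            if len({port_switch[p] for p in members}) > 1]

# Hypothetical port-to-switch map and zones
port_switch = {
    "svc_n1_p1": "core_a1", "ds8k_p1": "core_a1",   # same switch: OK
    "svc_n2_p1": "core_a1", "ds8k_p3": "core_a2",   # on the other core switch
}
zones = {
    "SVC_DS8K_LOCAL": ["svc_n1_p1", "ds8k_p1"],
    "SVC_DS8K_REMOTE": ["svc_n2_p1", "ds8k_p3"],    # would cross an ISL
}
print(zones_crossing_isls(zones, port_switch))      # ['SVC_DS8K_REMOTE']
```

Running such a check after every zoning change is a cheap way to catch the mistake before it causes congestion.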

Accessing storage subsystems over an ISL on purpose


This practice is explicitly advised against in the SVC configuration guidelines, because the consequences of SAN congestion to your storage subsystem connections can be quite severe. Only use this configuration in SAN migration scenarios, and when doing so, closely monitor the performance of the SAN. For most configurations, trunking is required and ISLs should be regularly monitored to detect failures.

SVC I/O Group switch splitting


Clients often want to attach another I/O Group to an existing SVC clustered system to increase its capacity, but they lack the switch ports to do so. If this situation happens to you, there are two options:
- Completely overhaul the SAN during a complicated and painful redesign.
- Add a new core switch, and use ISLs to connect the new I/O Group and the new switch back to the original, as illustrated in Figure 2-4.


Figure 2-4 Proper I/O Group splitting (figure annotation: SVC-to-storage traffic should be zoned and masked to never travel over the inter-switch links, but they should be zoned for intracluster communications)

This design is a valid configuration, but you must take certain precautions:
- As stated in Accidentally accessing storage over ISLs on page 17, zone the SAN and apply Logical Unit Number (LUN) masking on the storage subsystems so that you do not access the storage subsystems over the ISLs. This design means that your storage subsystems will need connections to both the old and new SAN switches.
- Have two dedicated ISLs between the two switches on each SAN with no data traffic traveling over them. The reason for this design is that if this link ever becomes congested or lost, you might experience problems with your SVC clustered system if there are also issues at the same time on the other SAN.
- If you can, set a 5% traffic threshold alert on the ISLs so that you know if a zoning mistake has allowed any data traffic over the links.


Note: It is not a best practice to use this configuration to perform mirroring between I/O Groups within the same clustered system. And, you must never split the two nodes in an I/O Group between various SAN switches within the same SAN fabric. The optional 8 Gbps longwave (LW) SFPs in the 2145-CF8 and 2145-CG8 allow you to split an SVC I/O Group across long distances, as described in 2.1.8, Split clustered system / Stretch clustered system on page 20.

2.1.8 Split clustered system / Stretch clustered system


For high availability, you can split a SAN Volume Controller clustered system across three locations and mirror the data. A split clustered system configuration locates the active quorum disk at a third site. If communication is lost between the primary and secondary sites, the site with access to the active quorum disk continues to process transactions. If communication is lost to the active quorum disk, an alternative quorum disk at another site can become the active quorum disk. To configure a split clustered system, you need to follow specific rules:
- Directly connect each SAN Volume Controller node to one or more SAN fabrics at the primary and secondary sites. Sites are defined as independent power domains that would fail independently. Power domains could be located in the same room or across separate physical locations.
- Use a third site to house a quorum disk. The storage system that provides the quorum disk at the third site must support extended quorum disks. Storage systems that provide extended quorum support are listed at the IBM System Storage SAN Volume Controller Support Web page.
- Do not use powered devices to provide distance extension for the SAN Volume Controller to switch connections.
- Place independent storage systems at the primary and secondary sites, and use volume mirroring to mirror the host data between storage systems at the two sites.
- SAN Volume Controller nodes that are in the same I/O group and separated by more than 100 meters (109 yards) must use longwave Fibre Channel connections. A longwave small form-factor pluggable (SFP) transceiver can be purchased as an optional SAN Volume Controller component, and must be one of the longwave SFP transceivers listed at the IBM System Storage SAN Volume Controller Support Web page:
  http://www-947.ibm.com/support/entry/portal/Troubleshooting/Hardware/System_Storage/Storage_software/Storage_virtualization/SAN_Volume_Controller_(2145)/
- Using inter-switch links (ISLs) in paths between SAN Volume Controller nodes in the same I/O group is not supported.
- Avoid using ISLs in paths between SAN Volume Controller nodes and external storage systems. If this design is unavoidable, follow the workarounds discussed earlier in this chapter.
- Using a single switch at the third site can lead to the creation of a single fabric rather than two independent and redundant fabrics. A single fabric is an unsupported configuration.
- SAN Volume Controller nodes in the same system must be connected to the same Ethernet subnet.
- A SAN Volume Controller node must be located in the same rack as the 2145 UPS or 2145 UPS-1U that supplies its power.


Some service actions require physical access to all SAN Volume Controller nodes in a system. If nodes in a split clustered system are separated by more than 100 meters, service actions might require multiple service personnel. Figure 2-5 illustrates an example of a split clustered system configuration. When used in conjunction with volume mirroring, this configuration provides a high availability solution that is tolerant of a failure at a single site.

Figure 2-5 A split clustered system with a quorum disk located at a third site

Quorum placement
A split clustered system configuration locates the active quorum disk at a third site. If communication is lost between the primary and secondary sites, the site with access to the active quorum disk continues to process transactions. If communication is lost to the active quorum disk, an alternative quorum disk at another site can become the active quorum disk. Although a system of SAN Volume Controller nodes can be configured to use up to three quorum disks, only one quorum disk can be elected to resolve a situation where the system is partitioned into two sets of nodes of equal size. The purpose of the other quorum disks is to provide redundancy if a quorum disk fails before the system is partitioned.

Note: Do not choose solid-state drive (SSD) managed disks for quorum disk purposes, because SSD lifespan depends on the write workload.

Configuration summary
Generally, when the nodes in a system have been split among sites, configure the SAN Volume Controller system this way:
Site 1: Half of the SAN Volume Controller system nodes + one quorum disk candidate
Site 2: Half of the SAN Volume Controller system nodes + one quorum disk candidate

Site 3: Active quorum disk
Disable the dynamic quorum configuration by using the chquorum command with the override yes option.
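As a sketch of what this looks like on the SVC CLI, assuming a hypothetical MDisk ID and quorum index (substitute your own values and verify the exact syntax for your code level in the CLI reference):

```shell
# List the current quorum disk candidates and see which one is active
lsquorum

# Pin quorum index 2 to the MDisk at the third site (hypothetical MDisk ID 5)
# and disable dynamic quorum selection with -override yes
chquorum -override yes -mdisk 5 2
```

After the change, rerun lsquorum to confirm that the intended MDisk is the active quorum disk.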

Note: Some fix levels of 6.2.0.x do not support split clustered systems. Check the following flash for the latest details: https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003853

2.2 SAN switches


In this section, we discuss several considerations when you select the Fibre Channel (FC) SAN switches for use with your SVC installation. It is important to understand the features offered by the various vendors and associated models in order to meet design and performance goals.

2.2.1 Selecting SAN switch models


In general, there are two classes of SAN switches: fabric switches and directors. While normally based on the same software code and Application Specific Integrated Circuit (ASIC) hardware platforms, there are differences in performance and availability. Directors feature a slotted design and have component redundancy on all active components in the switch chassis (for instance, dual-redundant switch controllers). A SAN fabric switch (or just a SAN switch) normally has a fixed port layout in a non-slotted chassis (there are exceptions to this rule though, such as the IBM/Cisco MDS9200 series, which features a slotted design). Regarding component redundancy, both fabric switches and directors are normally equipped with redundant, hot-swappable environmental components (power supply units and fans). In the past, over-subscription on the SAN switch ports had to be taken into account when selecting a SAN switch model. Over-subscription here refers to a situation in which the combined maximum port bandwidth of all switch ports is higher than what the switch internally can switch. For directors, this number can vary for different line card/port blade options, where a high port-count module might have a higher over-subscription rate than a low port-count module, because the capacity toward the switch backplane is fixed. With the latest generation SAN switches (both fabric switches and directors), this issue has become less important due to increased capacity in the internal switching. This situation is true for both switches with an internal crossbar architecture and switches realized by an internal core/edge ASIC lineup. For modern SAN switches (both fabric switches and directors), processing latency from ingress to egress port is extremely low and is normally negligible. When selecting the switch model, try to take the future SAN size into consideration. 
It is generally better to initially get a director with only a few port modules instead of having to implement multiple smaller switches. Having a high port-density director instead of a number of smaller switches also saves ISL capacity and therefore ports used for inter-switch connectivity.


IBM sells and supports SAN switches from both of the major SAN vendors, listed in the following product portfolios:
IBM System Storage b-type/Brocade SAN portfolio
IBM System Storage/Cisco SAN portfolio

2.2.2 Switch port layout for large edge SAN switches


While users of smaller, non-bladed, SAN fabric switches generally do not need to concern themselves with which ports go where, users of multi-slot directors must pay careful attention to where the ISLs are located in the switch. Generally, the ISLs (or ISL trunks) must be on separate port modules within the switch to ensure redundancy. The hosts must be spread out evenly among the remaining line cards in the switch. Remember to locate high-bandwidth hosts on the core switches directly.

2.2.3 Switch port layout for director-class SAN switches


Each SAN switch vendor has a selection of line cards/port blades available for their multi-slot director-class SAN switch models. Some of these options are over-subscribed, and some of them have full bandwidth available for the attached devices. For your core switches, we suggest only using line cards/port blades where the full line speed that you expect to use will be available. You need to contact your switch vendor for full line card/port blade option details. Your SVC ports, storage ports, ISLs, and high-bandwidth hosts need to be spread out evenly among your line cards in order to help prevent the failure of any one line card from causing undue impact to performance or availability.

2.2.4 IBM System Storage/Brocade b-type SANs


These are several of the features that we have found useful.

Fabric Watch
The Fabric Watch feature found in newer IBM/Brocade-based SAN switches can be useful, because the SVC relies on a healthy, properly functioning SAN. Fabric Watch is a SAN health monitor designed to enable real-time proactive awareness of the health, performance, and security of each switch. It automatically alerts SAN managers to predictable problems in order to help avoid costly failures. It tracks a wide spectrum of fabric elements, events, and counters. Fabric Watch allows you to configure the monitoring and measuring frequency for each switch and fabric element and to specify notification thresholds. Whenever these thresholds are exceeded, Fabric Watch automatically provides notification using several methods, including e-mail messages, SNMP traps, log entries, or alerts posted to Data Center Fabric Manager (DCFM). The components that Fabric Watch monitors are grouped in these classes:
Environment (for example, temperature)
Fabric (zone changes, fabric segmentation, E_Port down, among others)
Field Replaceable Unit (provides an alert when a part replacement is needed)
Performance Monitor (for instance, RX and TX performance between two devices)


Port (monitors port statistics and takes action based on the configured thresholds and actions; actions can include port fencing)
Resource (RAM, flash, memory, and CPU)
Security (monitors different security violations on the switch and takes action based on the configured thresholds and actions)
SFP (monitors the physical aspects of an SFP, such as voltage, current, RXP, TXP, and state changes in physical ports)
By implementing Fabric Watch, you benefit from improved high availability through proactive notification. Furthermore, it allows you to reduce troubleshooting and root cause analysis (RCA) times. Fabric Watch is an optionally licensed feature of Fabric OS; however, it is already included in the base licensing of the new IBM System Storage b-Series switches.

Bottleneck detection
A bottleneck is a situation in which the frames of a fabric port cannot get through as fast as they should; the offered load is greater than the achieved egress throughput on the affected port. The bottleneck detection feature does not require any additional license. It identifies and alerts you to ISL or device congestion, as well as device latency conditions, in the fabric. Bottleneck detection also enables you to prevent degradation of throughput in the fabric and to reduce the time it takes to troubleshoot SAN performance problems. Bottlenecks are reported through RASlog alerts and SNMP traps, and you can set alert thresholds for the severity and duration of the bottleneck. Starting in Fabric OS 6.4.0, you configure bottleneck detection on a per-switch basis, with per-port exclusions.

Virtual Fabrics
Virtual Fabrics adds the capability for physical switches to be partitioned into independently managed logical switches. Implementing Virtual Fabrics has multiple advantages, such as hardware consolidation, improved security, and resource sharing by several customers. The following IBM System Storage platforms are Virtual Fabrics-capable:
SAN768B
SAN384B
SAN80B-4
SAN40B-4
To configure Virtual Fabrics, you do not need to install any additional license.

Fibre Channel Routing and Integrated Routing


Fibre Channel routing (FC-FC) is used to forward data packets between two or more (physical or virtual) fabrics while maintaining their independence from each other. Routers use headers and forwarding tables to determine the best path for forwarding the packets. This technology allows the development and management of large heterogeneous SANs, increasing the overall device connectivity. The main advantages of Fibre Channel routing are:
Increased SAN connectivity by interconnecting (not merging) several physical or virtual fabrics
Sharing devices across multiple fabrics
Centralized management


Smooth fabric migrations during technology refresh projects
Connectivity between fabrics over long distances, in conjunction with tunneling protocols (such as FCIP)
Integrated Routing (IR) is a licensed feature that allows 8-Gbps FC ports of the SAN768B and SAN384B, among others, to be configured as EX_Ports (or VEX_Ports) supporting Fibre Channel routing. With IR-capable switches or directors and the respective license, you do not need to deploy external FC routers or FC router blades for FC-FC routing. For more information about the IBM System Storage b-type/Brocade products, refer to the following IBM Redbooks publications:
Implementing an IBM b-type SAN with 8 Gbps Directors and Switches, SG24-6116
IBM System Storage b-type Multiprotocol Routing: An Introduction and Implementation, SG24-7544

2.2.5 IBM System Storage/Cisco SANs


We have found the following features to be useful.

Port Channels
To ease the required planning efforts for future SAN expansions, ISLs/Port Channels can be made up of any combination of ports in the switch, which means that it is not necessary to reserve special ports for future expansions when provisioning ISLs. Instead, you can use any free port in the switch for expanding the capacity of an ISL/Port Channel.

Cisco VSANs
Virtual SANs (VSANs) let you achieve improved SAN scalability, availability, and security by allowing multiple Fibre Channel SANs to share a common physical infrastructure of switches and ISLs. These benefits are achieved through independent Fibre Channel services and traffic isolation between VSANs. Using Inter-VSAN Routing (IVR), you can establish a data communication path between initiators and targets located on different VSANs without merging the VSANs into a single logical fabric. Because VSANs can group ports across multiple physical switches, enhanced inter-switch links (EISLs) can be used to carry traffic belonging to multiple VSANs (VSAN trunking). The main VSAN implementation advantages are hardware consolidation, improved security, and resource sharing by several independent organizations, such as customers. It is possible to use Cisco VSANs, combined with inter-VSAN routes, to isolate the hosts from the storage arrays. This arrangement provides little benefit for a great deal of added configuration complexity. However, VSANs with inter-VSAN routes can be useful for fabric migrations from non-Cisco vendors onto Cisco fabrics, or other short-term situations. VSANs can also be useful if you have a storage array directly attached by hosts in conjunction with some space virtualized through the SVC. (In this instance, it is best to use separate storage ports for the SVC and the hosts. We do not advise using inter-VSAN routes to enable port sharing.)

2.2.6 SAN routing and duplicate WWNNs


The SVC has a built-in service feature that attempts to detect if two SVC nodes are on the same FC fabric with the same worldwide node name (WWNN). When this situation is detected, the SVC will restart and turn off its FC ports to prevent data corruption. This feature


can be triggered erroneously if an SVC port from fabric A is zoned through a SAN router so that an SVC port from the same node in fabric B can log into the fabric A port. To prevent this situation, whenever you implement advanced SAN FC routing functions, be careful to ensure that the routing configuration is correct.

2.3 Zoning
Because the SVC differs from traditional storage devices, properly zoning it into your SAN fabric is a common source of misunderstanding and errors. Despite the misunderstandings and errors, zoning the SVC into your SAN fabric is not particularly complicated.

Note: Errors caused by improper SVC zoning are often fairly difficult to isolate, so create your zoning configuration carefully.

Here are the basic SVC zoning steps:
1. Create the SVC internode communications zone.
2. Create the SVC clustered system.
3. Create the SVC back-end storage subsystem zones.
4. Assign back-end storage to the SVC.
5. Create the host SVC zones.
6. Create host definitions on the SVC.

The zoning scheme that we describe next is slightly more restrictive than the zoning described in the IBM System Storage SAN Volume Controller V6.2.0 - Software Installation and Configuration Guide, GC27-2286. The Configuration Guide is a statement of what is supported; nevertheless, this publication is a statement of our understanding of the best way to set up zoning, even if other ways are possible and supported.
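To make the scheme concrete, the zone families described in this chapter can be generated from port lists rather than typed by hand. This Python sketch is illustrative only; all port and alias names are hypothetical:

```python
def build_fabric_zones(all_node_ports, iogrp_pair, storage, hosts):
    """Build the per-fabric zone set: one internode zone holding every
    SVC node port, one zone per back-end storage subsystem, and
    single-initiator host zones that see one SVC port per node."""
    zones = {"SVC_INTERNODE": list(all_node_ports)}
    for subsys, ports in storage.items():
        zones[f"SVC_{subsys}"] = list(all_node_ports) + list(ports)
    for host, hba in hosts.items():
        zones[f"{host}_SVC"] = [hba] + list(iogrp_pair)
    return zones

z = build_fabric_zones(
    all_node_ports=["n1p1", "n1p3", "n2p1", "n2p3"],  # all SVC ports, fabric A
    iogrp_pair=["n1p1", "n2p1"],                      # one port per node
    storage={"DS8K": ["ds8k_p1", "ds8k_p3"]},
    hosts={"Foo_Slot3": "foo_hba1"},
)
print(z["Foo_Slot3_SVC"])  # ['foo_hba1', 'n1p1', 'n2p1']
```

Generating the configuration this way also makes it trivial to diff the intended zone set against what the switch reports.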

2.3.1 Types of zoning


Modern SAN switches have three types of zoning available: port zoning, worldwide node name (WWNN) zoning, and worldwide port name (WWPN) zoning. The preferred method is to use only WWPN zoning. There is a common misconception that WWPN zoning provides poorer security than port zoning, which is not the case. Modern SAN switches enforce the zoning configuration directly in the switch hardware, and port binding functions can be used to enforce that a given WWPN must be connected to a particular SAN switch port.

Note: Avoid using a zoning configuration with port and worldwide name zoning intermixed.

There are multiple reasons not to use WWNN zoning. For hosts, it is a particularly bad idea, because the WWNN is often based on the WWPN of only one of the HBAs. If you have to replace that HBA, the WWNN of the host will change on both fabrics, which will result in access loss. In addition, WWNN zoning makes troubleshooting more difficult, because you have no consolidated list of which ports are supposed to be in which zone, and therefore, it is difficult to tell if a port is missing.


Special note for IBM/Brocade SAN Webtools users


If you use the Brocade Webtools Graphical User Interface (GUI) to configure zoning, you must take special care not to use WWNNs. When looking at the tree of available worldwide names (WWNs), the WWNN is always presented one level higher than the WWPNs. Refer to Figure 2-6 on page 27 for an example. Make sure that you use a WWPN, not the WWNN.

Figure 2-6 IBM/Brocade Webtools zoning

2.3.2 Pre-zoning tips and shortcuts


Now, we describe several tips and shortcuts for the SVC zoning.

Naming convention and zoning scheme


It is important to have a defined naming convention and zoning scheme when creating and maintaining an SVC zoning configuration. Without them, your zoning configuration can become extremely difficult to understand and maintain. Remember that different environments have different requirements, which means that the level of detail in the zoning scheme will vary among environments of different sizes. It is important to have an easily understandable scheme with an appropriate level of detail and then to be consistent whenever making changes to the environment. Refer to 14.1.1, Naming Conventions on page 396 for suggestions for an SVC naming convention.


Aliases
We strongly recommend that you use zoning aliases when creating your SVC zones if they are available on your particular type of SAN switch. Zoning aliases make your zoning easier to configure and understand and cause fewer possibilities for errors. One approach is to include multiple members in one alias, because zoning aliases can normally contain multiple members (just like zones). We recommend that you create these aliases:
One that holds all the SVC node ports on each fabric
One for each storage subsystem (or controller blade, in the case of DS4x00 units)
One for each I/O Group port pair (that is, it needs to contain the first node in the I/O Group, port 2, and the second node in the I/O Group, port 2)
Host aliases can be omitted in smaller environments, as in our lab environment.

2.3.3 SVC internode communications zone


This zone needs to contain every SVC node port on the SAN fabric. While it will overlap with the storage zones that you will create soon, it is handy to have this zone as a fail-safe, in case you ever make a mistake with your storage zones. When configuring zones for communication between nodes in the same system, the minimum configuration requires that all Fibre Channel ports on a node detect at least one Fibre Channel port on each other node in the same system. You cannot reduce the configuration in this environment.

2.3.4 SVC storage zones


You need to avoid zoning different vendor storage subsystems together; the ports from the storage subsystem need to be split evenly across the dual fabrics. Each controller might have its own recommended best practice. All nodes in a system must be able to detect the same ports on each back-end storage system. Operation in a mode where two nodes detect a different set of ports on the same storage system is degraded, and the system logs errors that request a repair action. This can occur if inappropriate zoning is applied to the fabric or if inappropriate LUN masking is used.
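The requirement that all nodes detect the same storage ports can be expressed as a simple check: collect the set of back-end ports each node reports and compare. A hypothetical Python sketch:

```python
def controller_view_degraded(views):
    """views maps each SVC node to the set of back-end storage ports it
    detects. If any two nodes see different port sets, the SVC treats
    the controller as degraded and logs a repair-action error."""
    return len({frozenset(ports) for ports in views.values()}) > 1

# Hypothetical node views of a back-end controller
print(controller_view_degraded({"node1": {"p1", "p2"}, "node2": {"p1", "p2"}}))  # False
print(controller_view_degraded({"node1": {"p1", "p2"}, "node2": {"p1"}}))        # True
```

A mismatch almost always points at asymmetric zoning or LUN masking, which is exactly what the text above warns about.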

IBM DS5000 and IBM DS4000 storage controllers


Each IBM DS5000 and IBM DS4000 storage subsystem controller consists of two separate blades. It is a best practice that these two blades are not in the same zone if you have attached them to the same SAN (see Figure 2-7 on page 29). There might be a similar best practice suggestion from non-IBM storage vendors; contact them for details.


Figure 2-7 Example of zoning a DS4000/DS5000 as a back-end controller

For more information about zoning the IBM System Storage IBM DS5000 or IBM DS4000 within the SVC, refer to the following IBM Redbooks publication: IBM Midrange System Storage Implementation and Best Practices Guide, SG24-6363.

XIV
To take advantage of the combined capabilities of SVC and XIV, you should zone two ports (one per fabric) from each interface module with the SVC ports. You need to decide which XIV ports you are going to use for the connectivity with the SVC. If you do not use and do not have plans to use XIV remote mirroring, you must change the role of port 4 from initiator to target on all XIV interface modules and use ports 1 and 3 from every interface module into the fabric for the SVC attachment. Otherwise, you must use ports 1 and 2 from every interface module instead of ports 1 and 3. Each HBA port on the XIV interface module is designed and set to sustain up to 1400 concurrent I/Os. However, port 3 will only sustain up to 1000 concurrent I/Os if port 4 is defined as an initiator. Figure 2-8 shows how to zone an XIV frame as an SVC storage controller.

Note: Only single rack XIV configurations are supported by SVC. Multiple single racks can be supported where each single rack is seen by SVC as a single controller.


Figure 2-8 Example of zoning an XIV as a back-end controller

Storwize V7000
Storwize V7000 external storage systems can present volumes to a SAN Volume Controller. A Storwize V7000 system, however, cannot present volumes to another Storwize V7000 system. To zone the Storwize V7000 as an SVC back-end storage controller, the minimum requirement is that every SVC node has the same view of the Storwize V7000, which should include at least one port per Storwize V7000 canister. Figure 2-9 shows an example of how the SVC can be zoned with the Storwize V7000.


Figure 2-9 Example of zoning a Storwize V7000 as a back-end controller

2.3.5 SVC host zones


There must be a single zone for each host port. This zone must contain the host port, and one port from each SVC node that the host will need to access. While there are two ports from each node per SAN fabric in a usual dual-fabric configuration, make sure that the host only accesses one of them. Refer to Figure 2-10 on page 32. This configuration provides four paths to each volume, which is the number of paths per volume for which IBM Subsystem Device Driver (SDD) multipathing software and the SVC have been tuned.
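The four-path figure is simply the product of the zoning choices, which the following illustrative Python function makes explicit (the parameter names are ours, not SVC terminology):

```python
def paths_per_volume(fabrics, host_ports_per_fabric, nodes_per_iogrp,
                     svc_ports_per_node_in_zone=1):
    """Paths a host sees to a volume under the zoning described above."""
    return (fabrics * host_ports_per_fabric
            * nodes_per_iogrp * svc_ports_per_node_in_zone)

# Dual fabric, one HBA per fabric, two nodes per I/O Group, and one SVC
# port per node in each host zone: the tuned four paths per volume.
print(paths_per_volume(2, 1, 2))  # 4
```

Zoning the host to both SVC ports per node per fabric would double the last factor and yield eight paths, which is why the text insists that each host zone contain only one port per node.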



Hosts Foo and Bar attach through Switch A and Switch B to I/O Group 0, with one single-initiator zone per host HBA:

Zone Foo_Slot3_SAN_A (Switch A):
50:00:11:22:33:44:55:66
SVC_Group0_Port_A
Zone Bar_Slot2_SAN_A (Switch A):
50:11:22:33:44:55:66:77
SVC_Group0_Port_C
Zone Foo_Slot5_SAN_B (Switch B):
50:00:11:22:33:44:55:67
SVC_Group0_Port_D
Zone Bar_Slot8_SAN_B (Switch B):
50:11:22:33:44:55:66:78
SVC_Group0_Port_B

Figure 2-10 Typical host SVC zoning

The IBM System Storage SAN Volume Controller V6.2.0 - Software Installation and Configuration Guide, GC27-2286, describes putting many hosts into a single zone as a supported configuration under certain circumstances. While this design usually works fine, instability in one of your hosts can trigger all sorts of impossible-to-diagnose problems in the other hosts in the zone. For this reason, have only a single host in each zone (single-initiator zones).

Having eight paths to each volume is also a supported configuration, but this design provides no performance benefit (under certain circumstances, it can even reduce performance), and it does not improve reliability or availability to any significant degree.

To obtain the best overall performance of the system and to prevent overloading, the workload to each SAN Volume Controller port must be equal. This typically involves zoning approximately the same number of host Fibre Channel ports to each SAN Volume Controller Fibre Channel port.
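The balancing rule above can be sketched as follows. The host ports and alias names are invented for illustration, and real zoning would of course be done with the switch vendor's tools.

```python
from itertools import cycle

def build_host_zones(host_ports, svc_port_pair_aliases):
    """Create one single-initiator zone per host port, rotating through the
    SVC port-pair aliases so each alias serves a similar number of hosts.

    host_ports: list of (zone_name, host_wwpn) tuples.
    Returns a dict of zone_name -> [host_wwpn, svc_alias].
    """
    aliases = cycle(svc_port_pair_aliases)
    return {name: [wwpn, next(aliases)] for name, wwpn in host_ports}

# Illustrative host ports and aliases (not a real configuration).
zones = build_host_zones(
    [("WinPeter_Slot3", "21:00:00:e0:8b:05:41:bc"),
     ("WinBarry_Slot7", "21:00:00:e0:8b:05:37:ab"),
     ("WinJon_Slot1", "21:00:00:e0:8b:05:28:f9"),
     ("WinIan_Slot2", "21:00:00:e0:8b:05:1a:6f")],
    ["SVC_Group0_Port1", "SVC_Group0_Port3"],
)
```

Each resulting zone has exactly two members (one host initiator plus one SVC port-pair alias), and the two aliases end up serving two hosts each.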

Hosts with four (or more) HBAs


If you have four host bus adapters (HBAs) in your host instead of two, a little more planning is needed. Because eight paths per volume is not an optimum number, configure your SVC host definitions (and zoning) as though the single host were two separate hosts. During volume assignment, alternate which pseudo-host each volume is assigned to.


The reason that we do not simply assign one HBA to each of the paths is that, for any specific volume, one node serves solely as a backup node (a preferred node scheme is used), so the load will never be balanced for that particular volume. It is better to load balance by I/O Group instead and let the volumes be automatically assigned to nodes.

2.3.6 Sample standard SVC zoning configuration


This section contains a sample standard zoning configuration for an SVC clustered system. Our sample setup has two I/O Groups, two storage subsystems, and eight hosts (refer to Figure 2-11). The zoning configuration must be duplicated on both SAN fabrics; we show the zoning for the SAN named A.
Note: All SVC nodes have two connections per switch.

Figure 2-11 Example SVC SAN (four SVC nodes attached to Switch A and Switch B, together with the hosts Peter, Barry, Jon, Ian, Thorsten, Ronda, Deon, and Foo)

For the sake of brevity, we only discuss SAN A in our example.

Aliases
Unfortunately, you cannot nest aliases, so several of these WWPNs appear in multiple aliases. Also, do not be concerned if none of your WWPNs looks like the example; we made a few of them up when writing this book. Note that certain switch vendors (for example, McDATA) do not allow multiple-member aliases, but you can still create single-member aliases. While creating single-member aliases


does not reduce the size of your zoning configuration, it still makes the configuration easier to read than a mass of raw WWPNs. For the alias names, we have appended SAN_A where necessary to indicate that these aliases are the ports on SAN A. This convention helps if you ever have to perform troubleshooting on both SAN fabrics at one time.

SVC clustered system alias


As a side note, the SVC has an extremely predictable WWPN structure, which helps make the zoning easier to read. A node port WWPN always starts with 50:05:07:68 (refer to Example 2-1) and ends with two octets that distinguish which node is which. The first digit of the third octet from the end identifies the port number in the following way:

50:05:07:68:01:4x:xx:xx  Port 1
50:05:07:68:01:3x:xx:xx  Port 2
50:05:07:68:01:1x:xx:xx  Port 3
50:05:07:68:01:2x:xx:xx  Port 4

The clustered system alias that we create will be used for the internode communications zone, for all back-end storage zones, and in any zones that you need for remote mirroring with another SVC clustered system (not discussed in this example).
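The WWPN pattern above can be decoded programmatically; this helper is a sketch based solely on the mapping described in the text, not an SVC-provided tool.

```python
# Map the first digit of the third-from-last octet to the SVC node port
# number, following the pattern described in the text.
_DIGIT_TO_PORT = {"4": 1, "3": 2, "1": 3, "2": 4}

def svc_wwpn_port(wwpn):
    """Return the node port number (1-4) encoded in an SVC port WWPN."""
    octets = wwpn.lower().split(":")
    if octets[:4] != ["50", "05", "07", "68"]:
        raise ValueError("not an SVC node port WWPN: " + wwpn)
    return _DIGIT_TO_PORT[octets[-3][0]]
```

For instance, 50:05:07:68:01:40:37:e5 decodes to port 1 and 50:05:07:68:01:10:37:dc to port 3, matching the aliases built in the examples that follow.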
Example 2-1 SVC clustered system alias

SVC_Cluster_SAN_A:
50:05:07:68:01:40:37:e5
50:05:07:68:01:10:37:e5
50:05:07:68:01:40:37:dc
50:05:07:68:01:10:37:dc
50:05:07:68:01:40:1d:1c
50:05:07:68:01:10:1d:1c
50:05:07:68:01:40:27:e2
50:05:07:68:01:10:27:e2

SVC I/O Group port pair aliases


These aliases are the basic building blocks of our host zones. Because the best practices that we have described specify that each HBA sees only a single port on each node, these are the aliases that will be included in the host zones. To have an equal load on each SVC node port, roughly alternate between the ports when creating your host zones. Refer to Example 2-2.
Example 2-2 I/O Group port pair aliases

SVC_Group0_Port1:
50:05:07:68:01:40:37:e5
50:05:07:68:01:40:37:dc
SVC_Group0_Port3:
50:05:07:68:01:10:37:e5
50:05:07:68:01:10:37:dc
SVC_Group1_Port1:
50:05:07:68:01:40:1d:1c
50:05:07:68:01:40:27:e2


SVC_Group1_Port3:
50:05:07:68:01:10:1d:1c
50:05:07:68:01:10:27:e2

Storage subsystem aliases


The first two aliases here are similar to what you might see with an IBM System Storage DS4800 storage subsystem with four back-end ports per controller blade. We have created different aliases for each blade in order to isolate the two controllers from each other, which is a best practice suggested by DS4000/DS5000 development. Because the IBM System Storage DS8000 has no concept of separate controllers (at least, not from the viewpoint of a SAN), we put all the ports on the storage subsystem into a single alias. Refer to Example 2-3.
Example 2-3 Storage aliases

DS4k_23K45_Blade_A_SAN_A:
20:04:00:a0:b8:17:44:32
20:04:00:a0:b8:17:44:33
DS4k_23K45_Blade_B_SAN_A:
20:05:00:a0:b8:17:44:32
20:05:00:a0:b8:17:44:33
DS8k_34912_SAN_A:
50:05:00:63:02:ac:01:47
50:05:00:63:02:bd:01:37
50:05:00:63:02:7f:01:8d
50:05:00:63:02:2a:01:fc

Zones
Remember when naming your zones that a zone cannot have the same name as an alias. Here is our sample zone set, utilizing the aliases that we have just defined.

SVC internode communications zone


This zone is simple; it only contains a single alias (which happens to contain all of the SVC node ports). And yes, this zone does overlap with every single storage zone. Nevertheless, it is good to have it as a fail-safe, given the dire consequences that will occur if your clustered system nodes ever completely lose contact with one another over the SAN. Refer to Example 2-4.
Example 2-4 SVC clustered system zone

SVC_Cluster_Zone_SAN_A: SVC_Cluster_SAN_A

SVC Storage zones


As we have mentioned earlier, we put each of the storage controllers (and, in the case of the DS4000/DS5000 controllers, each blade) into a separate zone. Refer to Example 2-5.
Example 2-5 SVC Storage zones

SVC_DS4k_23K45_Zone_Blade_A_SAN_A:
SVC_Cluster_SAN_A


DS4k_23K45_Blade_A_SAN_A
SVC_DS4k_23K45_Zone_Blade_B_SAN_A:
SVC_Cluster_SAN_A
DS4k_23K45_Blade_B_SAN_A
SVC_DS8k_34912_Zone_SAN_A:
SVC_Cluster_SAN_A
DS8k_34912_SAN_A

SVC Host zones


We have not created aliases for each host, because each host only appears in a single zone. While this means a raw WWPN appears in each zone, an alias is unnecessary, because it is obvious where the WWPN belongs.

Notice that all of the zone names include the slot number of the host HBA rather than just the fabric name. If you are trying to diagnose a problem (or replace an HBA), it is extremely important to know which HBA you need to work on. For IBM System p hosts, we have also appended the HBA device number (fcs) to the zone name, which makes device management easier. While it is possible to get this information out of SDD, it is convenient to have it in the zoning configuration.

We alternate the hosts between the SVC node port pairs and between the SVC I/O Groups for load balancing. While we simply alternate in our example, you might want to balance the load based on the observed load on ports and I/O Groups. Refer to Example 2-6.
Example 2-6 SVC Host zones

WinPeter_Slot3:
21:00:00:e0:8b:05:41:bc
SVC_Group0_Port1
WinBarry_Slot7:
21:00:00:e0:8b:05:37:ab
SVC_Group0_Port3
WinJon_Slot1:
21:00:00:e0:8b:05:28:f9
SVC_Group1_Port1
WinIan_Slot2:
21:00:00:e0:8b:05:1a:6f
SVC_Group1_Port3
AIXRonda_Slot6_fcs1:
10:00:00:00:c9:32:a8:00
SVC_Group0_Port1
AIXThorsten_Slot2_fcs0:
10:00:00:00:c9:32:bf:c7
SVC_Group0_Port3
AIXDeon_Slot9_fcs3:
10:00:00:00:c9:32:c9:6f
SVC_Group1_Port1

AIXFoo_Slot1_fcs2:
10:00:00:00:c9:32:a8:67
SVC_Group1_Port3
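The naming convention used in Example 2-6 can be expressed as a small helper; this is an illustration of the convention only, not a tool provided by the SVC or the switches.

```python
def host_zone_name(host, slot, fcs=None):
    """Build a zone name like 'WinPeter_Slot3' or 'AIXRonda_Slot6_fcs1'.

    host: host name (for example, 'WinPeter' or 'AIXRonda')
    slot: the slot holding the HBA, so the right adapter can be found
          during troubleshooting or replacement
    fcs:  for IBM System p hosts, the AIX fcs device number of the HBA
    """
    name = "{0}_Slot{1}".format(host, slot)
    if fcs is not None:
        name += "_fcs{0}".format(fcs)
    return name
```

For example, host_zone_name("AIXDeon", 9, fcs=3) produces "AIXDeon_Slot9_fcs3", matching the style of Example 2-6.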

2.3.7 Zoning with multiple SVC clustered systems


Unless two clustered systems participate in a mirroring relationship, all zoning must be configured so that the two systems do not share a zone. If a single host requires access to two different clustered systems, create two zones, with each zone connecting to a separate system. The back-end storage zones must also be separate, even if the two clustered systems share a storage subsystem.

2.3.8 Split storage subsystem configurations


There might be situations where a storage subsystem is used both for SVC attachment and direct-attach hosts. In this case, it is important that you pay close attention during the LUN masking process on the storage subsystem. Assigning the same storage subsystem LUN to both a host and the SVC will almost certainly result in swift data corruption. If you perform a migration into or out of the SVC, make sure that the LUN is removed from one place at the exact same time that it is added to another place.

2.4 Switch Domain IDs


All switch Domain IDs should be unique between both fabrics, and the switch name should incorporate the Domain ID. Having a domain ID that is totally unique makes troubleshooting problems much easier in situations where an error message contains the FCID of the port with a problem.

2.5 Distance extension for remote copy services


To implement remote copy services over a distance, you have several choices:
- Optical multiplexors, such as DWDM or CWDM devices
- Long-distance small form-factor pluggable transceivers (SFPs) and XFPs
- Fibre Channel to IP conversion boxes

Of those options, the optical varieties of distance extension are the gold standard. IP distance extension introduces additional complexity, is less reliable, and has performance limitations. However, we recognize that optical distance extension is impractical in many cases due to cost or unavailability.

Note: Distance extension must only be utilized for links between SVC clustered systems. It must not be used for intra-clustered-system communication. Technically, distance extension is supported for relatively short distances, such as a few kilometers (or miles); refer to the IBM System Storage SAN Volume Controller Restrictions, S1003799, for details explaining why this arrangement is not recommended.


2.5.1 Optical multiplexors


Optical multiplexors can extend your SAN up to hundreds of kilometers (or miles) at extremely high speeds, and for this reason, they are the preferred method for long distance expansion. When deploying optical multiplexing, make sure that the optical multiplexor has been certified to work with your SAN switch model. The SVC has no allegiance to a particular model of optical multiplexor. If you use multiplexor-based distance extension, closely monitor your physical link error counts in your switches. Optical communication devices are high-precision units. When they shift out of calibration, you start to see errors in your frames.

2.5.2 Long-distance SFPs/XFPs


Long-distance optical transceivers have the advantage of extreme simplicity. No expensive equipment is required, and there are only a few configuration steps to perform. However, ensure that you only use transceivers designed for your particular SAN switch. Each switch vendor only supports a specific set of small form-factor pluggable transceivers (SFPs/XFPs), so it is unlikely that Cisco SFPs will work in a Brocade switch.

2.5.3 Fibre Channel: IP conversion


Fibre Channel to IP conversion is by far the most common and least expensive form of distance extension. It is also complicated to configure, and relatively subtle errors can have severe performance implications.

With Internet Protocol (IP)-based distance extension, it is imperative that you dedicate bandwidth to your Fibre Channel over IP traffic if the link is shared with other IP traffic. Do not assume that because the link between two sites currently has low traffic, or is only used for e-mail, that this situation will always be the case. Fibre Channel is far more sensitive to congestion than most IP applications, and you do not want a spyware problem or a spam attack on an IP network to disrupt your SVC.

Also, when communicating with your organization's networking architects, make sure to distinguish between megabytes per second and megabits per second. In the storage world, bandwidth is usually specified in megabytes per second (MBps, MB/s, or MB/sec), while network engineers specify bandwidth in megabits per second (Mbps, Mbit/s, or Mb/sec). If you fail to specify megabytes, you can end up with an impressive-sounding 155 Mb/sec OC-3 link, which is only going to supply a tiny 15 MBps or so to your SVC. With the suggested safety margins included, this is not a fast link at all.

Exact details of the configuration of these devices are beyond the scope of this book; however, the configuration of these units for the SVC is no different than for any other storage device.
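The megabit/megabyte arithmetic above can be sketched as follows. The rule of thumb of roughly 10 bits per byte (8 data bits plus framing and protocol overhead) is an assumption for illustration; real throughput depends on the link and protocols in use.

```python
def link_mbytes_per_sec(link_mbits_per_sec, bits_per_byte=10.0):
    """Rough usable storage bandwidth (MBps) of an IP link quoted in Mbps.

    bits_per_byte defaults to ~10 to allow for framing and protocol
    overhead on top of the 8 data bits per byte.
    """
    return link_mbits_per_sec / bits_per_byte

# An OC-3 link quoted at 155 Mb/sec yields only about 15 MBps of storage
# bandwidth, before any additional safety margin is applied.
oc3_mbps = link_mbytes_per_sec(155)
```

Running the conversion on the 155 Mb/sec OC-3 figure from the text gives about 15.5 MBps, confirming why quoting the wrong unit by a factor of eight to ten is such an expensive mistake.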

2.6 Tape and disk traffic sharing the SAN


If you have free ports on your core switch, there is no problem with putting tape devices (and their associated backup servers) on the SVC SAN; however, you must not put tape and disk traffic on the same Fibre Channel host bus adapter (HBA). Also, do not put tape ports and their backup servers on different switches: modern tape devices have high bandwidth requirements, and splitting them across switches can quickly lead to SAN congestion over the ISL between the switches.

2.7 Switch interoperability


The SVC is rather flexible as far as switch vendors are concerned. The most important requirement is that all of the node connections on a particular SVC clustered system must go to switches from a single vendor; you must not have several nodes or node ports plugged into vendor A and several nodes or node ports plugged into vendor B.

While the SVC supports certain combinations of switches from multiple vendors in the same SAN, in practice we do not particularly recommend this approach. Despite years of effort, interoperability among switch vendors is less than ideal, because the Fibre Channel standards are not rigorously enforced. Interoperability problems between switch vendors are notoriously difficult and disruptive to isolate, and it can take a long time to obtain a fix. For these reasons, we suggest running multiple switch vendors in the same SAN only long enough to migrate from one vendor to another, if this setup is possible with your hardware.

It is acceptable to run a mixed-vendor SAN if both switch vendors agree that they will fully support attachment with each other. In general, Brocade will interoperate with McDATA under special circumstances; contact your IBM marketing representative for details. (McDATA here refers to the switch products sold by the McDATA Corporation prior to its acquisition by Brocade Communications Systems.) The QLogic/BladeCenter FCSM will work with Cisco. We do not advise interoperating Cisco with Brocade at this time, except during fabric migrations, and only then if you have a back-out plan in place. We also do not advise connecting the QLogic/BladeCenter FCSM to Brocade or McDATA. When connecting BladeCenter switches to a core switch, you might consider using NPIV technology.

When you have SAN fabrics with multiple vendors, pay special attention to any particular requirements. For instance, observe from which switch in the fabric the zoning must be performed.

2.8 IBM Tivoli Storage Productivity Center


IBM Tivoli Storage Productivity Center can be used to create, administer, and monitor your SAN fabrics. There is nothing special that you need to do to administer an SVC SAN fabric as opposed to any other SAN fabric. We discuss Tivoli Storage Productivity Center in Chapter 13, Monitoring on page 311. For further information, consult the following IBM Redbooks publications:

Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894
SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364

You can also consult the IBM Tivoli Storage Productivity Center documentation web site at http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp, or contact your IBM marketing representative.


2.9 iSCSI support


iSCSI is a block-level protocol that encapsulates SCSI commands into TCP/IP packets and thereby leverages an existing IP network instead of requiring expensive Fibre Channel HBAs and SAN fabric infrastructure. Since SVC V5.1.0, iSCSI has been available as an alternative to Fibre Channel for host attachment. Nevertheless, all internode communications, as well as the SVC to back-end storage communications (including communications with remote clustered systems), are established through the Fibre Channel links.

2.9.1 iSCSI initiators and targets


In an iSCSI configuration, the iSCSI host or server sends requests to a node. The host contains one or more initiators that attach to an IP network to initiate requests to, and receive responses from, an iSCSI target. Each initiator and target is given a unique iSCSI name, such as an iSCSI qualified name (IQN) or an extended-unique identifier (EUI). An IQN is a 223-byte ASCII name; an EUI is a 64-bit identifier. An iSCSI name represents a worldwide unique naming scheme that is used to identify each initiator or target in the same way that worldwide node names (WWNNs) are used to identify devices in a Fibre Channel fabric.

An iSCSI target is any device that receives iSCSI commands. The device can be an end node, such as a storage device, or an intermediate device, such as a bridge between IP and Fibre Channel devices. Each iSCSI target is identified by a unique iSCSI name. The SAN Volume Controller can be configured as one or more iSCSI targets: each node that has one or both of its node Ethernet ports configured becomes an iSCSI target.

To transport SCSI commands over the IP network, an iSCSI driver must be installed on the iSCSI host and target. The driver is used to send iSCSI commands and responses through a network interface controller (NIC) or an iSCSI HBA in the host or target hardware.

2.9.2 iSCSI Ethernet configuration


A clustered system management IP address is used for access to the SVC CLI, the Console (Tomcat) GUI, and the CIMOM. Each clustered system has one or two clustered system IP addresses, which are bound to Ethernet port 1 and port 2, respectively, of the current configuration node. You can configure a service IP address per clustered system or per node; it is bound to Ethernet port 1. Each Ethernet port on each node can be configured with one iSCSI port address.

As you can see, the on-board Ethernet ports can be used either for management and service or for iSCSI I/O. If you are using IBM Tivoli Storage Productivity Center or an equivalent application to monitor the performance of your SAN Volume Controller clustered system, it is recommended to separate this management traffic from iSCSI host I/O traffic, for example, by using node port 1 for management traffic and node port 2 for iSCSI I/O.

2.9.3 Security and performance


Although all SVC V6.2-capable engines support iSCSI host attachment, the new 2145-CG8 node offers the option of adding 10 Gigabit Ethernet connectivity, with two ports per SVC hardware engine, to improve iSCSI throughput. It is recommended to use a private network between iSCSI initiators and targets to ensure the required performance and security.

The cfgportip command, which configures a new port IP address for a given node and port, allows you to set the maximum transmission unit (MTU). The default value is 1500, and the maximum is 9000. An MTU of 9000 (jumbo frames) reduces CPU utilization and increases efficiency: it reduces the per-packet overhead and increases the payload, providing improved iSCSI performance.

Hosts can use standard NICs or Converged Network Adapters. For standard NICs, you need to use the operating system's iSCSI host attachment software driver. Converged Network Adapters offload TCP/IP processing, and some offload even the iSCSI protocol; these intelligent adapters release CPU cycles for the main host applications. For a complete list of supported software and hardware iSCSI host attachment drivers, consult the SAN Volume Controller Supported Hardware List, Device Driver, Firmware and Recommended Software Levels V6.2, S1003797: https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797
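A rough sense of why jumbo frames help: less of each packet is header. The 88-byte overhead figure used below (20 bytes IP + 20 bytes TCP + 48 bytes iSCSI basic header) is an assumption for illustration only; actual overhead varies with IP options, digests, and encapsulation.

```python
def iscsi_payload_fraction(mtu, header_bytes=20 + 20 + 48):
    """Fraction of each IP packet available for SCSI payload (illustrative).

    header_bytes assumes basic IP + TCP + iSCSI headers with no options
    or digests -- an assumption, not a measured value.
    """
    return (mtu - header_bytes) / mtu

standard = iscsi_payload_fraction(1500)   # standard MTU
jumbo = iscsi_payload_fraction(9000)      # jumbo frames
```

Under these assumptions, roughly 94% of a 1500-byte packet carries payload versus about 99% of a 9000-byte packet, and the host and SVC also process six times fewer packets for the same amount of data.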

2.9.4 Failover of port IP addresses and iSCSI names


Fibre Channel host attachment relies on host multipathing software to provide high availability in the event of the loss of a node in an I/O group. iSCSI allows failover without host multipathing: to achieve this, the partner node in the I/O group takes over the port IP addresses and iSCSI names of a failed node. When the partner node returns to the online state, its IP addresses and iSCSI names fail back after a delay of five minutes. This delay ensures that the recently online node is stable before the host is allowed to begin using it for I/O again.

The svcinfo lsportip command lists a node's own IP addresses and iSCSI names, as well as those of its partner node. The partner node's addresses and names are identified by the failover field being set to yes. A failover_active value of yes in the svcinfo lsnode output indicates that the partner node's IP addresses and iSCSI names have failed over to a given node.

2.9.5 iSCSI protocol limitations


When using an iSCSI connection, you must consider the following iSCSI protocol limitations:
- There is no SLP support for discovery.
- Header and data digest support is provided only if the initiator is configured to negotiate.
- Only one connection per session is supported.
- A maximum of 256 iSCSI sessions per SAN Volume Controller iSCSI target is supported.
- Only ErrorRecoveryLevel 0 (session restart) is supported.
- The behavior of a host that supports both Fibre Channel and iSCSI connections and accesses a single volume can be unpredictable and depends on the multipathing software.
- There can be a maximum of four sessions coming from one iSCSI initiator to a SAN Volume Controller iSCSI target.


Chapter 3. SAN Volume Controller clustered system


In this chapter, we discuss the advantages of virtualization and the optimal time to use virtualization in your environment. Furthermore, we describe the scalability options for the IBM System Storage SAN Volume Controller (SVC) and when to grow or split an SVC clustered system.

Copyright IBM Corp. 2011. All rights reserved.


3.1 Advantages of virtualization


The IBM System Storage SAN Volume Controller (SVC), which is shown in Figure 3-1, enables a single point of control for disparate, heterogeneous storage resources. The SVC enables you to join capacity from various heterogeneous storage subsystem arrays into one pool of capacity for better utilization and more flexible access. This design helps the administrator control and manage this capacity from a single common interface instead of managing several independent disk systems and interfaces. Furthermore, the SVC can improve the performance and efficiency of your storage subsystem arrays by introducing 24 GB of cache memory in each node and the option of using internal solid-state drives (SSDs) in conjunction with the Easy Tier function.

SVC virtualization provides the ability to move data non-disruptively between different storage subsystems. This capability is very useful, for instance, when replacing an existing storage array with a new one or when moving data in a tiered storage infrastructure.

The volume mirroring feature allows storing two copies of a volume on different storage subsystems. This function helps improve application availability in the event of a failure or disruptive maintenance to an array or disk system. Moreover, the two mirror copies can be placed at a distance of 10 kilometers (6.2 miles) when using longwave SFPs in conjunction with a split-clustered system configuration.

Thin-provisioned volumes allow provisioning storage volumes based on future growth while requiring physical storage only for the current utilization. This capability is very helpful for host operating systems that do not support logical volume managers.

In addition to remote replication services, local copy services provide a set of copy functions. Multiple FlashCopy targets for a single source, incremental FlashCopy, and Reverse FlashCopy enrich the virtualization layer provided by the SVC. FlashCopy is commonly used for backup activities and as the source of point-in-time remote copy relationships, among other purposes. Reverse FlashCopy allows a quick restore of a previous snapshot without having to break the FlashCopy relationship and without having to wait for the original copy. This capability is very useful, for instance, after a failed host application upgrade or data corruption, because it allows you to restore the previous snapshot almost instantaneously.

If you are presenting storage to multiple clients with different performance requirements, the SVC is also extremely attractive, because you can create a tiered storage environment and provision storage accordingly.

Figure 3-1 SVC CG8 model


3.1.1 How does the SVC fit into your environment


Here is a short list of the SVC features:
- Combines capacity into a single pool
- Manages all types of storage in a common way from a common point
- Improves storage utilization and efficiency by providing more flexible access to storage assets
- Reduces physical storage usage when allocating volumes (formerly VDisks) for future growth by enabling thin provisioning
- Provisions capacity to applications more easily through a new graphical user interface based on the popular IBM XIV interface
- Improves performance through caching, optional SSD utilization, and striping data across multiple arrays
- Creates tiered storage pools
- Optimizes SSD storage efficiency in tiering deployments with the Easy Tier feature
- Provides advanced copy services over heterogeneous storage arrays
- Removes or reduces the physical boundaries or storage controller limits associated with any vendor's storage controllers
- Insulates host applications from changes to the physical storage infrastructure
- Allows data migration among storage systems without interruption to applications
- Brings common storage controller functions into the Storage Area Network (SAN), so that all storage controllers can use and benefit from these functions
- Delivers low-cost SAN performance through 1 Gbps and 10 Gbps iSCSI host attachments in addition to Fibre Channel
- Enables a single set of advanced network-based replication services that operate in a consistent manner regardless of the type of storage being used
- Improves server efficiency through VMware vStorage APIs, offloading some storage-related tasks that were previously performed by VMware
- Enables more efficient consolidated management with plug-ins to support Microsoft System Center Operations Manager (SCOM) and VMware vCenter

3.2 Scalability of SVC clustered systems


The SAN Volume Controller is highly scalable and can be expanded up to eight nodes in one clustered system. An I/O Group is formed by combining a redundant pair of SVC nodes (IBM System x server-based). Highly available I/O Groups are the basic configuration element of an SVC clustered system. The most recent SVC node (2145-CG8) includes a four-port 8 Gbps-capable host bus adapter (HBA), which allows the SVC to connect and operate at up to 8 Gbps SAN fabric speed. It also contains 24 GB of cache memory that is mirrored with the counterpart node's cache.

Adding I/O Groups to the clustered system is designed to linearly increase system performance and bandwidth. An entry-level SVC configuration contains a single I/O Group. The SVC can scale out to support four I/O Groups, 1024 host servers, and 8192 volumes (formerly VDisks). This flexibility means that SVC configurations can start small, at an attractive price for smaller clients or pilot projects, and yet can grow to manage extremely large storage environments of up to 32 PB of virtualized storage.

3.2.1 Advantage of multi-clustered systems as opposed to single-clustered systems


Growing or adding new I/O Groups to an SVC clustered system is a decision that has to be made when either a configuration limit is reached or when the I/O load reaches a point where a new I/O Group is needed.

Monitor CPU performance


As long as the CPU performance is related to I/O performance, you should consider monitoring the clustered system nodes, when the system growing concern is related to excessive I/O load. You can do it through the real-time performance statistics GUI or using the Tivoli Storage Productivity Center to capture more detailed performance information. You can also use the non-officially supported svcmon tool available at: http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS3177 When the CPUs become consistently 70% busy, you must consider either: Adding more nodes to the clustered system and moving part of the workload onto the new nodes Moving several volumes to a different less busy I/O Group Several of the activities that affect CPU utilization are: Volume activity: The preferred node is responsible for I/Os for the volume and coordinates sending the I/Os to the alternate node. While both systems will exhibit similar CPU utilization, the preferred node is a little busier. To be precise, a preferred node is always responsible for the destaging of writes for volumes that it owns. Therefore, skewing preferred ownership of volumes toward one node in the I/O Group will lead to more destaging, and therefore, more work on that node. Cache management: The purpose of the cache component is to improve performance of read and write commands by holding part of the read or write data in SVC memory. The cache component must keep the caches on both nodes consistent, because the nodes in a caching pair have physically separate memories. Mirror Copy activity: The preferred node is responsible for coordinating copy information to the target and also ensuring that the I/O Group is up-to-date with the copy progress information or change block information. As soon as Global Mirror is enabled, there is an additional 10% overhead on I/O work due to the buffering and general I/O overhead of performing asynchronous Peer-to-Peer Remote Copy (PPRC). 
Processing I/O requests for thin-provisioned volumes also increases SVC CPU overhead. After you reach the performance or configuration maximum for an I/O Group, you can add performance or capacity by attaching another I/O Group to the SVC clustered system.
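As a rough illustration of the 70% rule of thumb above, the following sketch averages per-node CPU-busy samples (such as you might export from the real-time performance statistics GUI or Tivoli Storage Productivity Center) and flags nodes that are candidates for workload redistribution or I/O Group growth. The node names and sample values are hypothetical.

```python
def nodes_needing_action(cpu_samples, threshold=70.0):
    """Return node names whose average CPU-busy percentage is at or
    above the threshold (consistently busy nodes are candidates for
    workload redistribution or adding an I/O Group)."""
    flagged = []
    for node, samples in cpu_samples.items():
        if samples and sum(samples) / len(samples) >= threshold:
            flagged.append(node)
    return sorted(flagged)

samples = {
    "node1": [72, 75, 71, 78],   # consistently above 70% busy
    "node2": [35, 40, 38, 42],
}
print(nodes_needing_action(samples))   # ['node1']
```

In practice, a node that only occasionally peaks above 70% is less of a concern than one that averages above it, which is why the sketch averages the samples rather than checking the maximum.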

SVC I/O Group limits


Table 3-1 on page 47 shows the current maximum limits for one SVC I/O Group. Reaching one of those limits on a not fully configured SVC system might require the addition of a new pair of nodes (I/O Group).

46

SAN Volume Controller Best Practices and Performance Guidelines

Draft Document for Review February 16, 2012 3:49 pm

7521SVC Cluster.fm

Table 3-1 Maximum configurations for an I/O Group

- SAN Volume Controller nodes: eight (arranged as four I/O Groups)
- I/O Groups: four (each containing two nodes)
- Volumes per I/O Group: 2048 (includes managed-mode and image-mode volumes)
- Host IDs per I/O Group: 256 (Cisco, Brocade, or McDATA); 64 (QLogic). A host object may contain both Fibre Channel ports and iSCSI names.
- Host ports (FC and iSCSI) per I/O Group: 512 (Cisco, Brocade, or McDATA); 128 (QLogic)
- Metro/Global Mirror volume capacity per I/O Group: 1024 TB. There is a per I/O Group limit of 1024 TB on the amount of primary and secondary volume address space that can participate in Metro/Global Mirror relationships. This maximum configuration consumes all 512 MB of bitmap space for the I/O Group and allows no FlashCopy bitmap space. The default is 40 TB, which consumes 20 MB of bitmap memory.
- FlashCopy volume capacity per I/O Group: 1024 TB. This is a per I/O Group limit on the amount of FlashCopy mappings using bitmap space from a given I/O Group. This maximum configuration consumes all 512 MB of bitmap space for the I/O Group and allows no Metro Mirror or Global Mirror bitmap space. The default is 40 TB, which consumes 20 MB of bitmap memory.
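The bitmap figures in Table 3-1 imply a simple linear relationship: 1024 TB of participating capacity consumes the full 512 MB of bitmap space per I/O Group, and the 40 TB default consumes 20 MB, that is, 0.5 MB of bitmap memory per TB. A minimal sketch of that arithmetic:

```python
def bitmap_mb(capacity_tb):
    """Bitmap memory (MB) consumed in an I/O Group for a given
    Metro/Global Mirror or FlashCopy capacity in TB, assuming the
    linear scaling implied by Table 3-1 (512 MB covers 1024 TB)."""
    return capacity_tb * 512 / 1024

print(bitmap_mb(40))    # 20.0 MB (the default allocation)
print(bitmap_mb(1024))  # 512.0 MB (the per-I/O-Group maximum)
```

Remember that Metro/Global Mirror and FlashCopy share the same 512 MB of bitmap space, so the two allocations must fit within it together.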

3.2.2 Growing or splitting SVC clustered systems


Growing an SVC clustered system can be done concurrently, up to a maximum of eight SVC nodes (four I/O Groups). Table 3-2 on page 48 contains an extract of the total SVC clustered system configuration limits.

Chapter 3. SAN Volume Controller clustered system

Table 3-2 Maximum SVC clustered system limits

- SAN Volume Controller nodes: eight (arranged as four I/O Groups)
- MDisks: 4,096. The maximum number of logical units that can be managed by SVC; this number includes disks that have not been configured into storage pools.
- Volumes (formerly VDisks) per system: 8,192 (includes managed-mode and image-mode volumes). The maximum requires an 8-node clustered system.
- Total storage capacity manageable by SVC: 32 PB. The maximum requires an extent size of 8192 MB.
- Host objects (IDs) per clustered system: 1,024 (Cisco, Brocade, and McDATA fabrics); 155 (CNT); 256 (QLogic). A host object may contain both Fibre Channel ports and iSCSI names.
- Total Fibre Channel ports and iSCSI names per system: 2,048 (Cisco, Brocade, and McDATA fabrics); 310 (CNT); 512 (QLogic)
If you exceed one of the current maximum configuration limits for a fully deployed SVC clustered system, you scale out by adding a new SVC clustered system and distributing the workload to it. Because the current maximum configuration limits can change, use the following link for a complete table of the current SVC restrictions:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003799

Splitting an SVC system, or having a secondary SVC system, gives you the ability to implement a disaster recovery option in the environment. Having two SVC clustered systems in two locations allows work to continue even if one site is down. With the SVC Advanced Copy functions, you can copy data from the local primary environment to a remote secondary site. The maximum configuration limits apply here as well.

Another advantage of having two clustered systems is the option of using the SVC Advanced Copy functions. The licensing is based on:
- The total amount of storage (in gigabytes) that is virtualized
- The Metro Mirror and Global Mirror capacity in use (primary and secondary)
- The FlashCopy source capacity in use

In each case, the number of terabytes (TB) to order for Metro Mirror and Global Mirror is the total number of source TBs and target TBs participating in the copy operations. FlashCopy is licensed on source capacity, so SVC counts only the source volumes in FlashCopy relationships.

Requirements for growing the SVC clustered system


Before adding a new I/O Group to the existing SVC clustered system, you must make changes. Consider this high-level overview of the requirements and tasks involved:
- Ensure that the SVC clustered system is healthy, that all errors are fixed, and that the installed code supports the new nodes.
- Ensure that all managed disks are online.
- If you are adding a node that has been used previously, consider changing its worldwide node name (WWNN) before adding it to the SVC clustered system. Consult Chapter 3, "SAN Volume Controller user interfaces for servicing your system", in IBM System Storage SAN Volume Controller Troubleshooting Guide, GC27-2284-01.
- Install the new nodes and connect them to the LAN and SAN.
- Power on the new nodes.
- Include the new nodes in the inter-node communication zones as well as in the back-end zones.
- LUN mask the back-end storage LUNs (managed disks) to include the WWPNs of the SVC nodes that you want to add.
- Add the SAN Volume Controller nodes to the clustered system.
- Check the SVC status, including nodes, managed disks, and (storage) controllers.

For an overview about adding a new I/O Group, see "Replacing or adding nodes to an existing clustered system" in the IBM System Storage SAN Volume Controller Software Installation and Configuration Guide, GC27-2286-01.
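One of the checks above (all managed disks online) can be scripted against colon-delimited CLI output, such as what `svcinfo lsmdisk -delim :` might produce. This is a hedged sketch: the sample output below is invented, and because the exact column layout can vary by code level, the status column is located from the header line rather than hard-coded.

```python
def offline_mdisks(lsmdisk_text):
    """Return the names of MDisks whose status is not 'online',
    given colon-delimited lsmdisk-style output with a header row."""
    lines = [l for l in lsmdisk_text.strip().splitlines() if l]
    header = lines[0].split(":")
    name_col = header.index("name")
    status_col = header.index("status")
    return [row.split(":")[name_col]
            for row in lines[1:]
            if row.split(":")[status_col] != "online"]

# Invented sample output for illustration only
sample = """id:name:status:mode:mdisk_grp_name
0:mdisk0:online:managed:pool0
1:mdisk1:degraded:managed:pool0
2:mdisk2:online:unmanaged:"""
print(offline_mdisks(sample))   # ['mdisk1']
```

An empty result from such a check is one precondition satisfied; the remaining items in the list above (zoning, LUN masking, and so on) still need manual verification.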

Splitting the SVC clustered system


Splitting the SVC clustered system might become a necessity if the maximum number of eight SVC nodes is reached and you have a requirement to grow the environment beyond the maximum number of I/Os that a clustered system can support, the maximum number of attachable storage subsystem controllers, or any other maximum mentioned in IBM System Storage SAN Volume Controller 6.2.0 Configuration Limits and Restrictions (S1003799) at:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003799

Instead of having one SVC clustered system handle all I/O operations, hosts, and storage subsystem attachments, the goal is to create a second SVC clustered system so that the workload is distributed equally over the two SVC clustered systems.

Approaches for splitting


There are a number of approaches that you can take to split an SVC clustered system. The first, and probably the easiest, way is to create a new SVC clustered system, attach storage subsystems and hosts to it, and start putting new workload on this new SVC clustered system. The next options are more intensive, and they involve performing more steps:
- Create a new SVC clustered system and start moving workload onto it. To move the workload from an existing SVC clustered system to a new SVC clustered system, you can use the Advanced Copy features, such as Metro Mirror and Global Mirror. We describe this scenario in Chapter 7, "Remote Copy services" on page 131.

Note: This move involves an outage from the host system point of view, because the worldwide port name (WWPN) from the subsystem (SVC I/O Group) changes.

- You can use the volume managed-mode to image-mode migration to move workload from one SVC clustered system to the new SVC clustered system. Migrate a volume from managed mode to image mode, reassign the disk (logical unit number (LUN) masking) from your storage subsystem point of view, introduce the disk to your new SVC clustered system, and use the image-mode to managed-mode migration. We describe this scenario in Chapter 6, "Volumes" on page 99.

Note: This scenario also involves an outage to your host systems and the I/O to the involved SVC volumes.

From a user perspective, the first option (creating a new SVC clustered system and putting new workload on it) is the easiest way to expand your system workload. The second is more difficult, involves more steps (replication services), and requires more preparation in advance. The third option (managed-to-image-mode migration) involves the longest outage to the host systems, and therefore we do not prefer this option.

It is not very common to reduce the number of I/O Groups. It can happen when replacing old nodes with new, more powerful ones. It can also occur in a remote partnership when more bandwidth is required on one side and there is spare bandwidth on the other side.

Adding or upgrading SVC node hardware
If you have a clustered system of six or fewer nodes of older hardware, and you have purchased new hardware, you can choose to either start a new clustered system for the new hardware or add the new hardware to the old clustered system. Both configurations are supported. While both options are practical, we recommend that you add the new hardware to your existing clustered system. This recommendation only holds if, in the short term, you are not scaling the environment beyond the capabilities of this clustered system.
By utilizing the existing clustered system, you maintain the benefit of managing just one clustered system. Also, if you are using mirror copy services to the remote site, you might be able to continue to do so without having to add SVC nodes at the remote site.

Upgrading hardware
You have a couple of choices for upgrading an existing SVC system's hardware. The choices depend on the size of the existing clustered system.

Up to six nodes
If your clustered system has up to six nodes, you have these options available:
- Add the new hardware to the clustered system, migrate volumes to the new nodes, and then retire the older hardware when it is no longer managing any volumes. This method requires a brief outage to the hosts to change the I/O Group for each volume.
- Swap out one node in each I/O Group at a time and replace it with the new hardware. We recommend that you engage an IBM service support representative (IBM SSR) to help you with this process. You can perform this swap without an outage to the hosts.

Up to eight nodes
If your clustered system has eight nodes, the options are similar:
- Swap out a node in each I/O Group one at a time and replace it with the new hardware. We recommend that you engage an IBM SSR to help you with this process. You can perform this swap without an outage to the hosts, and you need to swap a node in one I/O Group at a time. Do not change all I/O Groups in a multi-I/O Group clustered system at one time.
- Move the volumes to another I/O Group so that all volumes are on three of the four I/O Groups. You can then remove the remaining I/O Group, which has no volumes, and add the new hardware to the clustered system. As each pair of new nodes is added, volumes can be moved to the new nodes, leaving another old I/O Group pair that can be removed. After all the old pairs are removed, the last two new nodes can be added and, if required, volumes can be moved onto them. Unfortunately, this method requires several outages to the hosts, because volumes are moved between I/O Groups. This method might not be practical unless you need to implement the new hardware over an extended period of time and the first option is not practical for your environment.

Combination of previous methods


You can mix the two options described previously for upgrading SVC nodes. New SVC hardware provides considerable performance benefits with each release, and there have been substantial performance improvements since the first hardware release. Depending on the age of your existing SVC hardware, the performance requirements might be met by only six or fewer nodes of the new hardware. If this situation fits, you might be able to use a mix of the previous two approaches. For example, use an IBM SSR to help you upgrade one or two I/O Groups, and then move the volumes from the remaining I/O Groups onto the new hardware.

For more details about replacing nodes non-disruptively or expanding an existing SVC clustered system, refer to IBM System Storage SAN Volume Controller Software Installation and Configuration Guide Version 6.2.0, GC27-2286-01.

3.3 Clustered system upgrade


The SVC clustered system is designed to perform concurrent code updates. During the automatic upgrade process, each system node is upgraded and restarted sequentially while its I/O operations are directed to the partner node. In this way, the overall concurrent upgrade process relies on both I/O Group high availability and the host multipathing driver.

Although the SVC code upgrade is designed to be concurrent, multiple host components, such as the operating system level, multipath driver, or HBA driver, might need to be updated, which can require the host operating system to be restarted. It is very important to plan the host requirements for the target SVC code up front.

If you are upgrading from SVC V5.1 or earlier code, to ensure compatibility between the SVC code and the SVC Console GUI, review the SAN Volume Controller and SVC Console (GUI) Compatibility (S1002888) web page at:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1002888

Furthermore, certain concurrent upgrade paths are only available through an intermediate level. Refer to the following Web page for more information, SAN Volume Controller Concurrent Compatibility and Code Cross Reference (S1001707): https://www-304.ibm.com/support/docview.wss?uid=ssg1S1001707

SVC code update steps


Even though the SVC code update is concurrent, we recommend that you perform several steps in advance:
- Before applying a code update, ensure that there are no open problems in your SVC, SAN, or storage subsystems. Use the Run maintenance procedure on the SVC and fix the open problems first. For more information, refer to 15.3.2, "Solving SVC problems" on page 440.
- It is also extremely important to check your host multipathing. Make sure that, from the host's point of view, all paths are available. Missing paths can lead to I/O problems during the SVC code update. Refer to Chapter 8, "Hosts" on page 191 for more information about hosts. You should also confirm that there are no hosts in degraded status.
- Take an svc_snap -c and copy the tgz file out of the clustered system. The -c flag enables running a fresh config_backup.
- Schedule the SVC code update during a period of low I/O activity.
- Upgrade the Master Console GUI first.
- Allow the SVC code update to finish before making any other changes in your environment. Allow at least one hour to perform the code update for a single SVC I/O Group and 30 minutes for each additional I/O Group. In a worst case scenario, an update can take up to two and a half hours, which implies that the SVC code update also updates the BIOS, SP, and the SVC service card.

Important: The Concurrent Code Upgrade (CCU) might appear to stop for a long time (up to an hour) if it is upgrading a low-level BIOS. Never power off during a CCU unless you have been instructed to do so by IBM service personnel. If the upgrade encounters a problem and fails, the upgrade will be backed out.

New features are not available until all nodes in the clustered system are at the same level. Features that depend on a remote clustered system, such as Metro Mirror or Global Mirror, might not be available until the remote clustered system is at the same level too.
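The timing guidance above (at least one hour for the first I/O Group, plus 30 minutes for each additional one) can be expressed as simple arithmetic when planning the maintenance window. The helper name below is invented for illustration.

```python
def estimated_update_minutes(io_groups):
    """Minimum time to allow for an SVC code update: 60 minutes for
    the first I/O Group plus 30 minutes per additional I/O Group."""
    return 60 + 30 * (io_groups - 1)

print(estimated_update_minutes(1))   # 60  (single I/O Group)
print(estimated_update_minutes(4))   # 150 (fully populated system)
```

Remember that this is a minimum: a worst case update, one that also refreshes the BIOS, SP, and service card, can take up to two and a half hours, so size the window accordingly.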

Chapter 4.

Backend storage
In this chapter, we describe aspects and characteristics to consider when planning the attachment of backend storage devices to be virtualized by an IBM System Storage SAN Volume Controller (SVC).

Copyright IBM Corp. 2011. All rights reserved.

4.1 Controller affinity and preferred path


In this section, we describe the architectural differences between common storage subsystems in terms of controller affinity (also referred to as preferred controller) and preferred path. In this context, affinity refers to the controller in a dual-controller subsystem that has been assigned access to the back-end storage for a specific LUN under nominal conditions (that is, both controllers are active). Preferred path refers to the host-side connections that are physically connected to the controller that has the assigned affinity for the corresponding LUN being accessed.

All storage subsystems that incorporate a dual-controller architecture for hardware redundancy employ the concept of affinity. For example, if a subsystem has 100 LUNs, 50 of them have an affinity to controller 0, and 50 of them have an affinity to controller 1. This means that only one controller is serving any specific LUN at any specific instant in time; however, the aggregate workload for all LUNs is evenly spread across both controllers. This relationship exists during normal operation, but each controller is capable of controlling all 100 LUNs in the event of a controller failure.

For the DS4000, preferred path is important, because Fibre Channel cards are integrated into the controller. This architecture allows dynamic multipathing and active/standby pathing through Fibre Channel cards that are attached to the same controller (the SVC does not support dynamic multipathing), with an alternate set of paths configured to the other controller that will be used if the corresponding controller fails. For example, if each controller is attached to hosts through two Fibre Channel ports, 50 LUNs will use the two Fibre Channel ports in controller 0, and 50 LUNs will use the two Fibre Channel ports in controller 1. If either controller fails, the multipathing driver fails the 50 LUNs associated with the failed controller over to the other controller, and all 100 LUNs use the two ports in the remaining controller. The DS4000 differs from the DS8000 in that it can transfer ownership of LUNs at the LUN level as opposed to the controller level.

For the DS8000, the concept of preferred path is not used, because Fibre Channel cards are outboard of the controllers, and therefore all Fibre Channel ports are available to access all LUNs regardless of cluster affinity. While cluster affinity still exists, the network between the outboard Fibre Channel ports and the controllers performs the appropriate controller routing, as opposed to the DS4000, where controller routing is performed by the multipathing driver in the host, such as the IBM Subsystem Device Driver (SDD) or the Redundant Disk Array Controller (RDAC) driver.

4.2 Considerations for DS4000/DS5000


In this section, we discuss controller configuration considerations for DS4000/DS5000.

4.2.1 Setting DS4000/DS5000 so both controllers have the same WWNN


The SAN Volume Controller recognizes that the DS4000/DS5000 controllers belong to the same storage unit if they both have the same worldwide node name (WWNN). There are a number of ways to determine whether this is set correctly for SVC.

The WWPN and WWNN of all devices logged in to the fabric can be checked from the SAN switch GUI. Confirm that the WWPNs of all DS4000/DS5000 host ports are unique, but that the WWNNs are identical for all ports belonging to a single storage unit.

The same information can be obtained from the Controller section when viewing the Storage Subsystem Profile from the Storage Manager GUI, which lists the WWPN and WWNN information for each host port:

World-wide port identifier: 20:27:00:80:e5:17:b5:bc
World-wide node identifier: 20:06:00:80:e5:17:b5:bc

If the controllers are set up with different WWNNs, run the script SameWWN.script, which is bundled with the Storage Manager client download file, to change this.

Caution: This procedure is intended for initial configuration of the DS4000/DS5000. The script must not be run in a live environment, because all hosts accessing the storage subsystem will be affected by the changes.
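The check described above (unique WWPNs, one shared WWNN per storage unit) is easy to automate once you have collected the port identifiers from the switch or the Storage Manager profile. In this hedged sketch, the first entry reuses the identifiers from the profile excerpt above; the second port's WWPN is invented for illustration.

```python
def wwnn_consistent(ports):
    """ports: list of (wwpn, wwnn) pairs collected for one
    DS4000/DS5000 storage unit. True if all WWPNs are unique and
    every port reports the same WWNN."""
    wwpns = [wwpn for wwpn, _ in ports]
    wwnns = {wwnn for _, wwnn in ports}
    return len(set(wwpns)) == len(wwpns) and len(wwnns) == 1

ports = [
    ("20:27:00:80:e5:17:b5:bc", "20:06:00:80:e5:17:b5:bc"),
    ("20:37:00:80:e5:17:b5:bc", "20:06:00:80:e5:17:b5:bc"),  # invented WWPN
]
print(wwnn_consistent(ports))   # True
```

If this returns False because the WWNNs differ, that is the situation the SameWWN.script procedure (initial configuration only) is meant to correct.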

4.2.2 Balancing workload across DS4000/DS5000 controllers


A best practice when creating arrays is to spread the disks across multiple enclosures, as well as alternating slots within the enclosures. This practice improves the availability of the array by protecting against enclosure failures that affect multiple members within the array, and it improves performance by distributing the disks within an array across drive loops. Use the manual method for array creation to spread the disks across multiple enclosures and to alternate slots within the enclosures.

Figure 4-1 shows a Storage Manager view of a 2+p array that is configured across enclosures. Each of the three disks resides in a separate physical enclosure, and slot positions alternate from enclosure to enclosure.

Figure 4-1 Storage Manager view

4.2.3 Ensuring path balance prior to MDisk discovery


It is important that LUNs are properly balanced across storage controllers prior to performing MDisk discovery. Failing to balance LUNs across storage controllers in advance can result in a suboptimal pathing configuration to the back-end disks, which can cause performance degradation. Ensure that storage subsystems have all controllers online and that all LUNs have been distributed to their preferred controller (local affinity) before performing MDisk discovery. Pathing can always be rebalanced later; however, often not until after lengthy problem isolation has taken place.

If you discover that the LUNs are not evenly distributed across the dual controllers in a DS4000/DS5000, you can dynamically change the LUN affinity. However, the SVC will move them back to the original controller, and the storage subsystem will generate an error indicating that the LUN is no longer on its preferred controller. To fix this situation, run the SVC command svctask detectmdisk or use the GUI option Detect MDisks. The SVC queries the DS4000/DS5000 again and accesses the LUNs through the new preferred controller configuration.

4.2.4 ADT for DS4000/DS5000


The DS4000/DS5000 has a feature called Auto Logical Drive Transfer (ADT). This feature allows logical-drive-level failover as opposed to controller-level failover. When you enable this option, the DS4000/DS5000 moves LUN ownership between controllers according to the path used by the host. For the SVC, the ADT feature is enabled by default when you select the IBM TS SAN VCE host type.

Note: It is important that you select the IBM TS SAN VCE host type when configuring the DS4000/DS5000 for SVC attachment in order to allow the SVC to properly manage the back-end paths. If the host type is incorrect, SVC reports a 1625 (incorrect controller configuration) error. Refer to Chapter 15, "Troubleshooting and diagnostics" on page 419 for information about checking the back-end paths to storage controllers.

4.2.5 Selecting array and cache parameters


In this section we discuss SVC array and cache parameters.

DS4000/DS5000 array width


With Redundant Array of Independent Disks 5 (RAID 5) arrays, determining the number of physical drives to put into an array always presents a compromise. Striping across a larger number of drives can improve performance for transaction-based workloads. However, striping can also have a negative effect on sequential workloads.

A common mistake when selecting array width is to focus only on the capability of a single array to perform various workloads. You must also consider the aggregate throughput requirements of the entire storage server. A large number of physical disks in an array can create a workload imbalance between the controllers, because only one controller of the DS4000/DS5000 actively accesses a specific array.

When selecting array width, you must also consider its effect on rebuild time and availability.

A larger number of disks in an array increases the rebuild time for disk failures, which can have a negative effect on performance. Additionally, more disks in an array increase the probability of a second drive failing within the same array before the rebuild of an initial drive failure completes, which is an inherent exposure of the RAID 5 architecture.

Best practice: For the DS4000/DS5000, we recommend array widths of 4+p and 8+p.

Segment size
With direct-attached hosts, considerations are often made to align device data partitions to physical drive boundaries within the storage controller. For the SVC, this alignment is less critical because of the caching that the SVC provides, and because there is less variation in the I/O profile that it uses to access back-end disks.

For the SVC, the only opportunity for full stride writes occurs with large sequential workloads, and in that case, the larger the segment size, the better. Larger segment sizes can adversely affect random I/O; however, the SVC and controller cache do a good job of hiding the RAID 5 write penalty for random I/O, so larger segment sizes can be accommodated. The primary consideration for selecting segment size is to ensure that a single host I/O fits within a single segment, to prevent accessing multiple physical drives. Testing has shown that the best compromise for handling all workloads is to use a segment size of 256 KB.

Best practice: We recommend a segment size of 256 KB as the best compromise for all workloads.
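To illustrate the full-stride-write point above: a full stride covers one segment on every data disk of a RAID 5 array, so a sequential write of that size (and alignment) can compute parity without the read-modify-write penalty. A minimal sketch of the arithmetic, using the recommended 256 KB segment size and the 4+p and 8+p widths recommended earlier:

```python
def full_stride_kb(segment_kb, data_disks):
    """Size in KB of one full RAID 5 stride: one segment on each
    data disk (the parity disk is excluded from the count)."""
    return segment_kb * data_disks

print(full_stride_kb(256, 4))   # 1024 KB full stride for a 4+p array
print(full_stride_kb(256, 8))   # 2048 KB full stride for an 8+p array
```

This is why larger segments favor large sequential workloads: the stride grows with the segment size, and the SVC's large sequential destages are the main chance to fill it.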

Cache block size


The size of the cache memory allocation unit can be 4 KB, 8 KB, 16 KB, or 32 KB. Earlier models of the DS4000 using 2 Gb Fibre Channel (FC) adapters have their block size configured as 4 KB by default. For the newest models (on firmware 7.xx and higher), the default cache block size is 8 KB.

Best practice: We recommend that you keep the default cache block values and use the host type IBM TS SAN VCE to establish the correct cache block size for the SAN Volume Controller cluster.

Table 4-1 is a summary of the recommended SVC and DS4000/DS5000 values.
Table 4-1 Recommended SVC values

  Models          Attribute               Value
  SVC             Extent size (MB)        256
  SVC             Managed mode            Striped
  DS4000/DS5000   Segment size (KB)       256
  DS4000 (a)      Cache block size (KB)   4 KB (default)
  DS5000          Cache block size (KB)   8 KB (default)
  DS4000/DS5000   Cache flush control     80/80 (default)
  DS4000/DS5000   Readahead               1 (Enabled)
  DS4000/DS5000   RAID 5                  4+p, 8+p
  DS4000/DS5000   RAID 6                  8+P+Q

  a. For the newest models (on firmware 7.xx and higher), use 8 KB.

4.2.6 Logical drive mapping


All logical drives must be mapped to the single host group representing the entire SVC cluster. It is not permitted to map LUNs to certain nodes or ports in the SVC cluster while excluding others.

The Access LUN allows in-band management of an IBM System Storage DS4000/DS5000, and it must only be mapped to hosts capable of running the Storage Manager Client and Agent. The SAN Volume Controller ignores the Access LUN if it is mapped to it. Nonetheless, it is good practice to remove the Access LUN from the SVC host group mappings.

Important: The Access LUN must never be mapped as LUN #0.

4.3 Considerations for DS8000


In this section, we discuss controller configuration considerations for DS8000.

4.3.1 Balancing workload across DS8000 controllers


When configuring storage on the IBM System Storage DS8000 disk storage subsystem, it is important to ensure that the ranks on a device adapter (DA) pair are evenly balanced between odd and even extent pools. Failing to do this can result in a considerable performance degradation due to uneven device adapter loading.

The DS8000 assigns server (controller) affinity to ranks when they are added to an extent pool. Ranks that belong to an even-numbered extent pool have an affinity to server0, and ranks that belong to an odd-numbered extent pool have an affinity to server1.

Example 4-1 shows a correct configuration that balances the workload across all four DA pairs and is evenly balanced between odd and even extent pools. Notice that arrays residing on the same DA pair are split between groups 0 and 1.
Example 4-1 lsarray command output

dscli> lsarray -l
Date/Time: Aug 8, 2008 8:54:58 AM CEST IBM DSCLI Version: 5.2.410.299 DS: IBM.2107-75L2321
Array State  Data   RAID type arsite Rank DA Pair DDMcap(10^9B) diskclass
===================================================================================
A0    Assign Normal 5 (6+P+S) S1     R0   0       146.0         ENT
A1    Assign Normal 5 (6+P+S) S9     R1   1       146.0         ENT
A2    Assign Normal 5 (6+P+S) S17    R2   2       146.0         ENT
A3    Assign Normal 5 (6+P+S) S25    R3   3       146.0         ENT
A4    Assign Normal 5 (6+P+S) S2     R4   0       146.0         ENT
A5    Assign Normal 5 (6+P+S) S10    R5   1       146.0         ENT
A6    Assign Normal 5 (6+P+S) S18    R6   2       146.0         ENT
A7    Assign Normal 5 (6+P+S) S26    R7   3       146.0         ENT

dscli> lsrank -l
Date/Time: Aug 9, 2008 2:23:18 AM CEST IBM DSCLI Version: 5.2.410.299 DS: IBM.2107-75L2321
ID Group State  datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts
======================================================================================
R0 0     Normal Normal    A0    5        P0        extpool0   fb      779  779
R1 1     Normal Normal    A1    5        P1        extpool1   fb      779  779
R2 0     Normal Normal    A2    5        P2        extpool2   fb      779  779
R3 1     Normal Normal    A3    5        P3        extpool3   fb      779  779
R4 1     Normal Normal    A4    5        P5        extpool5   fb      779  779
R5 0     Normal Normal    A5    5        P4        extpool4   fb      779  779
R6 1     Normal Normal    A6    5        P7        extpool7   fb      779  779
R7 0     Normal Normal    A7    5        P6        extpool6   fb      779  779
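The balance rule that Example 4-1 illustrates (each DA pair should have its ranks split between rank group 0, with server0 affinity, and group 1, with server1 affinity) can be checked programmatically. This hedged sketch takes the (DA pair, group) tuples for ranks R0 through R7, combining the lsarray and lsrank output above:

```python
def unbalanced_da_pairs(ranks):
    """ranks: list of (da_pair, rank_group) tuples.
    Return the DA pairs whose ranks do not include both groups 0 and 1,
    that is, the pairs whose load would land on a single server."""
    groups_by_pair = {}
    for da_pair, group in ranks:
        groups_by_pair.setdefault(da_pair, set()).add(group)
    return sorted(pair for pair, groups in groups_by_pair.items()
                  if groups != {0, 1})

# (DA pair, group) for R0..R7, transcribed from Example 4-1
ranks = [(0, 0), (1, 1), (2, 0), (3, 1), (0, 1), (1, 0), (2, 1), (3, 0)]
print(unbalanced_da_pairs(ranks))   # [] -> every DA pair is balanced
```

A non-empty result names the DA pairs whose extent pools should be reassigned between odd and even pool numbers before presenting the ranks to the SVC.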

4.3.2 DS8000 ranks to extent pools mapping


When configuring the DS8000, two approaches for mapping ranks to extent pools exist:
- One rank per extent pool
- Multiple ranks per extent pool using DS8000 Storage Pool Striping (SPS)

The most common approach is to map one rank to one extent pool, which provides good control for volume creation, because it ensures that all volume allocations from the selected extent pool come from the same rank.

The SPS feature became available with the R3 microcode release for the DS8000 series and effectively means that a single DS8000 volume can be striped across all the ranks in an extent pool (therefore, the functionality is often referred to as extent pool striping). So, if a given extent pool includes more than one rank, a volume can be allocated using free space from several ranks (which also means that SPS can only be enabled at volume creation; no reallocation is possible). The SPS feature requires that your DS8000 layout has been well thought out from the beginning to utilize all resources in the DS8000. If this is not done, SPS might cause severe performance problems (for example, if you configure a heavily loaded extent pool with multiple ranks from the same DA pair). Because the SVC itself stripes across MDisks, the SPS feature is not as relevant here as when accessing the DS8000 directly, and it should not be used.

Best practice: Configure one rank per extent pool.

Cache
For the DS8000, you cannot tune the array and cache parameters. The arrays are either 6+P or 7+P, depending on whether the array site contains a spare. The segment size (the contiguous amount of data that is written to a single disk) is 256 KB for fixed block volumes. Caching for the DS8000 is done on a 64 KB track boundary.

4.3.3 Mixing array sizes within a Storage Pool


Mixing array sizes within a Storage Pool is generally not a concern. Testing has shown no measurable performance difference between selecting all 6+p arrays or all 7+p arrays as opposed to mixing 6+p and 7+p arrays. In fact, mixing array sizes can actually help balance workload, because it places more data on the ranks that have the extra performance capability provided by the eighth disk. There is one small exposure in the case where an insufficient number of the larger arrays is available to handle access to the higher capacity. To avoid this situation, ensure that the smaller capacity arrays do not represent more than 50% of the total number of arrays within the Storage Pool. Best practice: When mixing 6+p arrays and 7+p arrays in the same Storage Pool, avoid having smaller capacity arrays comprise more than 50% of the arrays.

Chapter 4. Backend storage

7521Storage Controller.fm

Draft Document for Review February 16, 2012 3:49 pm

4.3.4 Determining the number of controller ports for DS8000


Configure a minimum of eight controller ports to the SVC per controller, regardless of the number of nodes in the cluster. Configure 16 controller ports for large controller configurations where more than 48 ranks are being presented to the SVC cluster. Additionally, we recommend that no more than two ports of each of the DS8000's 4-port adapters are used. Table 4-2 shows the recommended number of DS8000 ports and adapters based on rank count.
Table 4-2   Recommended number of ports and adapters

Ranks     Ports    Adapters
2 - 48    8        4 - 8
> 48      16       8 - 16

The DS8000 populates Fibre Channel (FC) adapters across two to eight I/O enclosures, depending on the configuration. Each I/O enclosure represents a separate hardware domain. Ensure that adapters configured to different SAN networks do not share the same I/O enclosure, as part of our goal of keeping redundant SAN networks isolated from each other. Best practices that we recommend:
- Configure a minimum of eight ports per DS8000.
- Configure 16 ports per DS8000 when more than 48 ranks are presented to the SVC cluster.
- Configure a maximum of two ports per 4-port DS8000 adapter.
- Configure adapters across redundant SAN networks from different I/O enclosures.
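The sizing rule above can be captured in a few lines. The following helpers are ours (not part of any SVC or DS8000 tool) and simply encode the guidance in Table 4-2:

```python
def recommended_ds8000_ports(ranks: int) -> int:
    """Ports to present to the SVC: 8 for up to 48 ranks, 16 beyond
    that, following the guidance in Table 4-2."""
    if ranks < 1:
        raise ValueError("rank count must be positive")
    return 8 if ranks <= 48 else 16


def min_adapters(ports: int, max_ports_per_adapter: int = 2) -> int:
    """Minimum number of 4-port adapters needed when using at most two
    ports per adapter, per the best practices above (ceiling division)."""
    return -(-ports // max_ports_per_adapter)
```

For example, a configuration with 40 ranks calls for 8 ports spread over at least 4 adapters; one with 64 ranks calls for 16 ports over at least 8 adapters.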

4.3.5 LUN masking


For a given storage controller, all SVC nodes must see the same set of LUNs from all target ports that have logged into the SVC nodes. If target ports that do not have the same set of LUNs assigned are visible to the nodes, SVC treats this situation as an error condition and generates error code 1625. Validating the LUN masking from the storage controller and then confirming the correct path count from within the SVC are critical. The DS8000 performs LUN masking based on the volume group. Example 4-2 shows the showvolgrp command output for volume group V0, which contains sixteen LUNs that are being presented to a 2-node SVC cluster.

SAN Volume Controller Best Practices and Performance Guidelines
Example 4-2   The showvolgrp command output

dscli> showvolgrp V0
Date/Time: August 3, 2011 3:03:15 PM PDT IBM DSCLI Version: 7.6.10.511 DS: IBM.2107-75L3001
Name SVCCF8
ID   V0
Type SCSI Mask
Vols 1001 1002 1003 1004 1005 1006 1007 1008 1101 1102 1103 1104 1105 1106 1107 1108

Example 4-3 on page 61 shows lshostconnect output from the DS8000. Here, you can see that all 8 ports of the 2-node cluster are assigned to the same volume group (V0) and, therefore, have been assigned the same sixteen LUNs.
Example 4-3   The lshostconnect command output

dscli> lshostconnect
Date/Time: August 3, 2011 3:04:13 PM PDT IBM DSCLI Version: 7.6.10.511 DS: IBM.2107-75L3001
Name        ID   WWPN             HostType Profile               portgrp volgrpID ESSIOport
===========================================================================================
SVCCF8_N1P1 0000 500507680140BC24 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N1P2 0001 500507680130BC24 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N1P3 0002 500507680110BC24 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N1P4 0003 500507680120BC24 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N2P1 0004 500507680140BB91 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N2P3 0005 500507680110BB91 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N2P2 0006 500507680130BB91 San Volume Controller         0       V0       I0003,I0103
SVCCF8_N2P4 0007 500507680120BB91 San Volume Controller         0       V0       I0003,I0103
dscli>

Additionally, you can see from the lshostconnect output that only the SVC WWPNs are assigned to V0. Important: Data corruption can occur if LUNs are assigned to both SVC nodes and non-SVC nodes, that is, direct-attached hosts. Next, we show you how the SVC sees these LUNs if the zoning is properly configured. The Managed Disk Link Count (mdisk_link_count) represents the total number of MDisks presented to the SVC cluster by that specific controller. Example 4-4 shows the general details of the storage controller via the SVC command-line interface (CLI).
Example 4-4   lscontroller command output

IBM_2145:svccf8:admin>svcinfo lscontroller DS8K75L3001
id 1
controller_name DS8K75L3001
WWNN 5005076305FFC74C
mdisk_link_count 16

max_mdisk_link_count 16
degraded no
vendor_id IBM
product_id_low 2107900
product_id_high
product_revision 3.44
ctrl_s/n 75L3001FFFF
allow_quorum yes
WWPN 500507630500C74C
path_count 16
max_path_count 16
WWPN 500507630508C74C
path_count 16
max_path_count 16
IBM_2145:svccf8:admin>

In this case, we can see that the Managed Disk Link Count is 16, which is correct for our example. Example 4-4 on page 61 also shows the storage controller port details. Here, a path_count represents a connection from a single node to a single LUN. Because we have two nodes and sixteen LUNs in this example configuration, we expect to see a total of 32 paths with all paths evenly distributed across the available storage ports. We have validated that this configuration is correct, because we see sixteen paths on one WWPN and sixteen paths on the other WWPN for a total of 32 paths.
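The path arithmetic described above can be sketched as follows. The helper names are ours, for illustration only; they simply encode the nodes × LUNs rule and the even-distribution check:

```python
def expected_path_count(nodes: int, luns: int) -> int:
    """A path is one node-to-LUN connection, so the expected total is
    nodes x LUNs (2 nodes x 16 LUNs = 32 in our example)."""
    return nodes * luns


def paths_look_correct(paths_per_wwpn, nodes, luns):
    """Validate that the per-port path counts sum to the expected total
    and are evenly distributed across the controller ports."""
    counts = list(paths_per_wwpn)
    return (sum(counts) == expected_path_count(nodes, luns)
            and max(counts) - min(counts) <= 1)
```

For the configuration in Example 4-4, `paths_look_correct([16, 16], 2, 16)` returns True: sixteen paths on each of the two WWPNs, 32 in total.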

4.3.6 WWPN to physical port translation


Storage controller WWPNs can be translated to physical ports on the controllers for isolation and debugging purposes. Additionally, you can use this information for validating redundancy across hardware boundaries. In Example 4-5, we show the WWPN to physical port translations for the DS8000.
Example 4-5 DS8000 WWPN format

WWPN format for DS8000 = 50050763030XXYNNN
  XX  = adapter location within storage controller
  Y   = port number within 4-port adapter
  NNN = unique identifier for storage controller

Adapter location (XX) by I/O bay and slot:

I/O Bay    S1    S2    S4    S5
B1         00    01    03    04
B2         08    09    0B    0C
B3         10    11    13    14
B4         18    19    1B    1C
B5         20    21    23    24
B6         28    29    2B    2C
B7         30    31    33    34
B8         38    39    3B    3C

Port number (Y): P1 = 0, P2 = 4, P3 = 8, P4 = C
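This translation can be automated. The following sketch is built purely from the table in Example 4-5 (the bay/slot and port values are transcribed from it, not from an official DS8000 reference), and it assumes a 16-digit WWPN with XX in digits 11-12, Y in digit 13, and NNN in digits 14-16:

```python
# Hypothetical decoder transcribed from the table in Example 4-5.
_SLOTS = ("S1", "S2", "S4", "S5")
_XX_ROWS = {
    "B1": ("00", "01", "03", "04"), "B2": ("08", "09", "0B", "0C"),
    "B3": ("10", "11", "13", "14"), "B4": ("18", "19", "1B", "1C"),
    "B5": ("20", "21", "23", "24"), "B6": ("28", "29", "2B", "2C"),
    "B7": ("30", "31", "33", "34"), "B8": ("38", "39", "3B", "3C"),
}
XX_TO_BAY_SLOT = {xx: (bay, slot)
                  for bay, xxs in _XX_ROWS.items()
                  for slot, xx in zip(_SLOTS, xxs)}
Y_TO_PORT = {"0": "P1", "4": "P2", "8": "P3", "C": "P4"}


def decode_ds8000_wwpn(wwpn: str):
    """Return (bay, slot, port, unique id), assuming the field layout
    described above."""
    wwpn = wwpn.upper()
    xx, y, nnn = wwpn[10:12], wwpn[12], wwpn[13:16]
    bay, slot = XX_TO_BAY_SLOT[xx]
    return bay, slot, Y_TO_PORT[y], nnn
```

Under these assumptions, the DS8000 WWPN 500507630500C74C from Example 4-4 decodes to bay B1, slot S1, port P4.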


4.4 Considerations for XIV


In this section, we discuss controller configuration considerations for IBM XIV Storage System.

4.4.1 Cabling considerations


The IBM XIV supports both iSCSI and Fibre Channel protocols, but when connecting to SVC, only Fibre Channel ports can be utilized. To take advantage of the combined capabilities of SVC and XIV, connect two ports from every interface module into the fabric for SVC use. You need to decide which ports to use for this connectivity. If you do not use, and do not plan to use, XIV functionality for remote mirroring or data migration, change the role of port 4 from initiator to target on all XIV interface modules and use ports 1 and 3 from every interface module. Otherwise, use ports 1 and 2 from every interface module instead of ports 1 and 3. Figure 4-2 shows a two node cluster using redundant fabrics.

Figure 4-2 Two node redundant SVC cluster configuration

SVC supports a maximum of 16 ports from any disk system. The IBM XIV System supports from 8 to 24 FC ports, depending on the configuration (from 6 to 15 modules). Table 4-3 indicates port usage for each IBM XIV System configuration.


Table 4-3   Number of SVC ports and XIV modules

Number of      IBM XIV System modules    FC ports available    Ports used per       Number of SVC
XIV modules    with FC ports             on IBM XIV            card on IBM XIV      ports utilized
6              4,5                       8                     1                    4
9              4,5,7,8                   16                    1                    8
10             4,5,7,8                   16                    1                    8
11             4,5,7,8,9                 20                    1                    10
12             4,5,7,8,9                 20                    1                    10
13             4,5,6,7,8,9               24                    1                    12
14             4,5,6,7,8,9               24                    1                    12
15             4,5,6,7,8,9               24                    1                    12

Port naming convention


The port naming convention for the IBM XIV system ports is:

WWPN: 5001738NNNNNRRMP
  001738 = Registered identifier for XIV
  NNNNN  = Serial number in hex
  RR     = Rack ID (01)
  M      = Module ID (4-9)
  P      = Port ID (0-3)
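A quick decoder for this convention (the function and the sample WWPN below are ours, for illustration):

```python
def decode_xiv_wwpn(wwpn: str):
    """Split a 16-digit XIV port WWPN (5001738NNNNNRRMP) into its
    fields: serial number (from hex), rack ID, module ID, and port ID."""
    wwpn = wwpn.lower()
    if not wwpn.startswith("5001738") or len(wwpn) != 16:
        raise ValueError("not an XIV port WWPN")
    return (int(wwpn[7:12], 16),   # NNNNN: serial number in hex
            wwpn[12:14],           # RR: rack ID
            int(wwpn[14]),         # M: module ID (4-9)
            int(wwpn[15]))         # P: port ID (0-3)
```

For a made-up WWPN 5001738000AB0140, this returns serial 171 (0xAB), rack "01", module 4, port 0.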

4.4.2 Host options and settings for IBM XIV systems


You must use specific settings to identify SAN Volume Controller clustered systems as hosts to IBM XIV Storage Systems. An XIV Nextra host is a single WWPN, so one XIV Nextra host must be defined for each SAN Volume Controller node port in the clustered system. An XIV Nextra host is considered to be a single SCSI initiator. Up to 256 XIV Nextra hosts can be presented to each port. Each SAN Volume Controller host object that is associated with the XIV Nextra system must be associated with the same XIV Nextra LUN map, because each LUN can only be in a single map. An IBM XIV Storage System Type Number 2810 host can consist of more than one WWPN. Configure each SAN Volume Controller node as an IBM XIV Storage System Type Number 2810 host, and create an IBM XIV Storage System cluster definition that contains the hosts for each of the SAN Volume Controller nodes in the SAN Volume Controller system.

Creating a host object for SVC for an IBM XIV type 2810
Although a single host instance can be created for use in defining and then implementing the SVC, the ideal host definition for use with SVC is to consider each node of the SVC (a minimum of two) an instance of a cluster. When creating the SVC host definition, first select Add Cluster and give the SVC host definition a name. Next, select Add Host and give the first node instance a name, making sure to click the Cluster drop-down box and select the SVC cluster that you just created.

After these have been added, repeat the steps for each instance of a node in the cluster. From there, right-click a node instance and select Add Port. In Figure 4-3 on page 65, note that four ports per node can be added to ensure the host definition is accurate.

Figure 4-3 SVC host definition on IBM XIV Storage System

By implementing the SVC as listed above, host management is ultimately simplified, and statistical metrics are more effective, because performance can be determined at the node level instead of the SVC cluster level. For example, after the SVC is successfully configured with the XIV Storage System, if an evaluation of volume management at the I/O Group level is needed to ensure efficient utilization among the nodes, a comparison of the nodes can be achieved using the XIV Storage System statistics.

4.4.3 Restrictions
Here we list restrictions for using the XIV as backend storage for the SVC.

Clearing SCSI reservations and registrations


You must not use the vol_clear_keys command to clear SCSI reservations and registrations on volumes that are managed by SAN Volume Controller.

Copy functions for IBM XIV Storage System models


Advanced copy functions for IBM XIV Storage System models such as taking a snapshot and remote mirroring cannot be used with disks that are managed by the SAN Volume Controller clustered system. Thin provisioning is not supported for use with SAN Volume Controller.

4.5 Considerations for V7000


In this section, we discuss controller configuration considerations for IBM Storwize V7000 storage systems.

4.5.1 Defining internal storage


Especially when planning to attach a V7000 to the SVC, we recommend that you create the arrays (MDisks) manually (via the command-line interface) instead of using the V7000 presets.


Make sure that you select one disk drive per enclosure and that each enclosure selected is part of the same chain (when possible). The recommendation when defining V7000 internal storage is to create a 1:1:1 relationship, meaning one Storage Pool to one MDisk (array) to one volume; then, map the volume to the SVC host. Note: SVC level 6.2 supports V7000 MDisks larger than 2 TB. Because the V7000 can contain mixed disk drive types, such as SSD, SAS, and Nearline SAS, pay attention when mapping V7000 volumes to the SVC Storage Pools (as MDisks), and assign each disk drive type (array) to an SVC Storage Pool with the same characteristics. For example, assume that you have two V7000 arrays, where one (model A) is configured as RAID 5 using 300 GB SAS drives and the other (model B) is configured as RAID 5 using 2 TB Nearline SAS drives. When mapping to the SVC, assign model A to one specific Storage Pool (model A) and model B to another specific Storage Pool (model B). Important: Make sure that you use the same extent size value on both sides (V7000 and SVC). We recommend the use of 256 MB as the extent size.

4.5.2 Configuring IBM Storwize V7000 storage systems


Storwize V7000 external storage systems can present volumes to a SAN Volume Controller. A Storwize V7000 system, however, cannot present volumes to another Storwize V7000 system. To configure the Storwize V7000 system, follow these general tasks:
- On the Storwize V7000 system, define a host object and add all worldwide port names (WWPNs) from the SAN Volume Controller to it.
- On the Storwize V7000 system, create host mappings between each volume on the Storwize V7000 system that you want the SAN Volume Controller to manage and the SAN Volume Controller host object that you created.
The volumes that are presented by the Storwize V7000 system appear in the SAN Volume Controller managed disk (MDisk) view. The Storwize V7000 system appears in the SAN Volume Controller view with a vendor ID of IBM and a product ID of 2145.

4.6 Considerations for Third Party storage


Due to the number of third-party storage options available (supported), we do not cover all of them here. We recommend that you look at IBM System Storage SAN Volume Controller Software Installation and Configuration Guide Version 6.2.0, GC27-2286-01, for details about each specific model.

4.6.1 Pathing considerations for EMC Symmetrix/DMX and HDS


There are certain storage controller types that present a unique worldwide node name (WWNN) and worldwide port name (WWPN) for each port. This behavior can cause problems when attached to the SVC, because the SVC enforces a maximum of four WWNNs per storage controller. Because of this behavior, be sure to group the ports if you want to connect more than four target ports to an SVC.

4.7 Medium error logging


Medium errors on back-end MDisks can be encountered by host I/O and by SVC background functions, such as volume migration and FlashCopy. If the SVC receives a medium error from a storage controller, it attempts to identify which logical block addresses (LBAs) are affected by this MDisk problem and records those LBAs as having virtual medium errors. If a medium error is encountered on a read from the source during a migration operation, the medium error is logically moved to the equivalent position on the destination. This is achieved by maintaining a set of bad blocks for each MDisk. Any read operation that touches a bad block fails with a SCSI medium error. If a destage from the cache touches a location in the medium error table and the resulting write to the Managed Disk is successful, the bad block is deleted. For details about how to troubleshoot a medium error, refer to Chapter 15, Troubleshooting and diagnostics on page 419.
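The bookkeeping described above can be illustrated with a toy model. This is our sketch, not SVC code: reads that touch a recorded bad block fail with a medium error, a successful destage to the same location clears the entry, and a migration logically moves the error to the destination.

```python
class MediumError(Exception):
    """Raised when a read touches a recorded bad block."""


class MDisk:
    def __init__(self):
        self.bad_blocks = set()  # LBAs recorded as virtual medium errors
        self.blocks = {}         # LBA -> data actually stored

    def read(self, lba):
        if lba in self.bad_blocks:
            raise MediumError(f"medium error at LBA {lba}")
        return self.blocks.get(lba, b"\x00")

    def destage(self, lba, data):
        # A successful write to a location in the medium error table
        # deletes the bad block.
        self.blocks[lba] = data
        self.bad_blocks.discard(lba)

    def migrate_block(self, lba, dest, dest_lba):
        # A medium error hit on a migration read is logically moved to
        # the equivalent position on the destination.
        try:
            dest.destage(dest_lba, self.read(lba))
        except MediumError:
            dest.bad_blocks.add(dest_lba)
```

In this model, migrating a block whose source LBA is marked bad leaves the destination LBA marked bad as well, which is the behavior the text describes for volume migration.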

4.8 Mapping physical LBAs to volume extents


Starting with SVC 4.3, functionality is available that makes it easy to find the volume extent to which a physical MDisk LBA maps and the physical MDisk LBA to which a volume extent maps. There are a number of situations where this functionality might be useful:
- If a storage controller reports a medium error on a logical drive, but SVC has not yet taken MDisks offline, you might want to establish which volumes will be affected by the medium error.
- When investigating application interaction with Thin-provisioned volumes, it can be useful to find out whether a given volume LBA has been allocated or not. If an LBA has been allocated when it has not intentionally been written to, it is possible that the application is not designed to work well with thin volumes.
The two commands are svcinfo lsmdisklba and svcinfo lsvdisklba. Their output varies depending on the type of volume (for example, Thin-provisioned as opposed to fully allocated) and the type of MDisk (for example, quorum as opposed to non-quorum). For full details, refer to the IBM System Storage SAN Volume Controller Software Installation and Configuration Guide Version 6.2.0, GC27-2286-01.
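For a simple, fully allocated volume, the underlying extent arithmetic is just integer division. The helper below is our sketch of the idea; the actual commands also handle quorum areas and thin-provisioning metadata:

```python
def lba_to_extent(lba: int, extent_size_mb: int = 256, block_size: int = 512):
    """Map a volume LBA to (extent number, block offset within the
    extent) for a fully allocated volume with the given extent size."""
    blocks_per_extent = extent_size_mb * 1024 * 1024 // block_size
    return lba // blocks_per_extent, lba % blocks_per_extent
```

With 256 MB extents there are 524,288 blocks of 512 bytes per extent, so LBA 524288 is the first block of extent 1.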

4.9 Using Tivoli Storage Productivity Center to identify storage controller boundaries
It is often desirable to map the virtualization layer to determine which volumes and hosts are utilizing resources within a specific hardware boundary on the storage controller, for example, when a specific hardware component, such as a disk drive, is failing and the administrator wants to perform an application-level risk assessment. Information learned from this type of analysis can lead to actions that mitigate risk, such as scheduling application downtime, performing volume migrations, and initiating FlashCopy. Tivoli Storage Productivity Center allows this mapping of the virtualization layer to occur quickly, and using Tivoli Storage Productivity Center eliminates mistakes that can be made by using a manual approach. Figure 4-4 on page 68 shows how a failing disk on a storage controller can be mapped to the MDisk that is being used by an SVC cluster. To display this panel, click Physical Disk → RAID5 Array → Logical Volume → MDisk.

Figure 4-4 Mapping MDisk

Figure 4-5 completes the end-to-end view by mapping the MDisk through the SVC to the attached host. Click MDisk → MDGroup → VDisk → host disk.

Figure 4-5 Host mapping

7521Managed Disk Groups.fm

Chapter 5. Storage pools and Managed Disks


In this chapter we describe aspects to consider when planning Storage Pools for an IBM System Storage SAN Volume Controller (SVC) implementation. We discuss various Managed Disk (MDisk) attributes, as well as provide an overview of the process of adding and removing MDisks from existing Storage Pools.

Copyright IBM Corp. 2011. All rights reserved.


5.1 Availability considerations for Storage Pools


While the SVC itself provides many advantages through the consolidation of storage, it is important to understand the availability implications that storage subsystem failures can have on availability domains within the SVC cluster. In this section, we point out that while the SVC offers significant performance benefits through its ability to stripe across back-end storage volumes, it is also worthwhile to consider the effects that various configurations have on availability. When selecting Managed Disks (MDisks) for a Storage Pool, performance is often the primary consideration; however, there are many instances where the availability of the configuration is traded for little or no performance gain. Note: Increasing the performance potential of a Storage Pool does not necessarily equate to a gain in application performance. Remember that the SVC must take the entire Storage Pool offline if a single MDisk in that Storage Pool goes offline. For instance, if you have 40 arrays of 1 TB each for a total capacity of 40 TB, with all 40 arrays placed in the same Storage Pool, you have put the entire 40 TB of capacity at risk if one of the 40 arrays fails (therefore, causing an MDisk to go offline). If you instead spread the 40 arrays over several Storage Pools, an array failure (an offline MDisk) affects less storage capacity, thus limiting the failure domain. An exception exists for the IBM XIV Storage System, because this product has particular characteristics. Refer to 5.3.3, Considerations for IBM XIV Storage System on page 75. The following availability best practices will guide you to well-designed Storage Pools:
- Each storage subsystem must be used with only a single SVC cluster.
- Each Storage Pool must contain MDisks from only a single storage subsystem (an exception exists when working with Easy Tier; refer to Chapter 11, Easy Tier on page 279).
- Each Storage Pool must contain MDisks from no more than approximately 10 storage subsystem arrays.

5.2 Selecting storage subsystems


When selecting storage subsystems, the decision comes down to the ability of the storage subsystem to be reliable and resilient and to meet application requirements. Because the SVC does not provide any data redundancy, the availability characteristics of the storage subsystem controllers have the most impact on the overall availability of the data virtualized by the SVC. Performance is also a determining factor; adding an SVC as a front end can result in considerable gains. Another factor is the ability of your storage subsystems to be scaled up or scaled out. For example, the DS8000 is a scale-up architecture that delivers best-of-breed performance per unit, while the DS4000/DS5000 can be scaled out with enough units to deliver the same performance.


A significant consideration when comparing native performance characteristics between storage subsystem types, is the amount of scaling that is required to meet the performance objectives. While lower performing subsystems can typically be scaled to meet performance objectives, the additional hardware that is required lowers the availability characteristics of the SVC cluster. Remember that all storage subsystems possess an inherent failure rate, and therefore, the failure rate of a Storage Pool becomes the failure rate of the storage subsystem times the number of units. Of course, there might be other factors that lead you to select one storage subsystem over another, such as utilizing available resources or a requirement for additional features and functions, like the IBM System z attach capability.

5.3 Selecting the Storage Pool


Reducing hardware failure boundaries for back-end storage (for example, having enclosure protection on your DS4000 array) is only part of what you must consider. When determining the Storage Pool layout, you also need to consider application boundaries and dependencies in order to identify any availability benefits that one configuration might have over another. Sometimes reducing the hardware failure boundaries is not an advantage from an application perspective, for instance, when keeping an application's volumes in a single Storage Pool. In contrast, splitting an application's volumes across multiple Storage Pools increases the chance of an application outage, because the application is affected if any one of the Storage Pools associated with it goes offline. We recommend that you start with one Storage Pool per application's volumes, and then split the volumes across other Storage Pools only if you observe that this specific Storage Pool is saturated. Note: For most clusters, a capacity of 1 to 2 PB is sufficient. A best practice is to use 256 MB or, for larger clusters, 512 MB as the standard extent size. On the other hand, when working with the IBM XIV Storage System, the recommended extent size is 1 GB.
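The extent size guidance follows from the cluster-wide extent count limit. Assuming the SVC limit of 2^22 (4,194,304) extents per cluster (a figure taken from SVC documentation, not stated above), the maximum manageable capacity scales linearly with extent size:

```python
# Assumed cluster-wide limit of 2^22 extents (from SVC documentation).
MAX_EXTENTS_PER_CLUSTER = 4 * 1024 * 1024


def max_cluster_capacity_tb(extent_size_mb: int) -> float:
    """Maximum manageable cluster capacity (in TB) for an extent size."""
    return MAX_EXTENTS_PER_CLUSTER * extent_size_mb / (1024 * 1024)
```

Under this assumption, 256 MB extents allow 1 PB (1024 TB) per cluster and 512 MB extents allow 2 PB, which matches the 1 to 2 PB guidance; the 1 GB extents recommended for XIV allow 4 PB.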

Capacity planning consideration


When configuring Storage Pools, we advise that you consider leaving a small amount of MDisk capacity that can be used as swing (spare) capacity for image mode volume migrations. A good general rule is to leave free space equal to the capacity of your largest configured volume.

5.3.1 Selecting the number of arrays per Storage Pool


The capability to stripe across disk arrays is the single most important performance advantage of the SVC; however, striping across more arrays is not necessarily better. The objective is to add only as many arrays to a single Storage Pool as required to meet the performance objectives. Because it is usually difficult to determine what is required in terms of performance, the tendency is to add far too many arrays to a single Storage Pool, which again increases the failure domain, as discussed previously in 5.1, Availability considerations for Storage Pools on page 72. It is also worthwhile to consider the effect of aggregate load across multiple Storage Pools. Striping workload across multiple arrays has a clear positive effect on performance when you are talking about dedicated resources, but the performance gains diminish as the aggregate load increases across all available arrays. For example, if you have a total of eight arrays and are striping across all eight arrays, your performance is much better than if you were striping across only four arrays. However, if the eight arrays are divided into two LUNs each and are also included in another Storage Pool, the performance advantage drops as the load of Storage Pool 2 approaches that of Storage Pool 1, which means that when workload is spread evenly across all Storage Pools, there is no difference in performance. More arrays in the Storage Pool have more of an effect with lower performing storage controllers. So, for example, we require fewer arrays from a DS8000 than from a DS4000 to achieve the same performance objectives. Table 5-1 shows the recommended number of arrays per Storage Pool that is appropriate for general cases. Again, when it comes to performance, there can always be exceptions. Refer to Chapter 10, Backend performance considerations on page 233.
Table 5-1   Recommended number of arrays per Storage Pool

Controller type           Arrays per Storage Pool
DS4000/DS5000             4 - 24
DS6000/DS8000             4 - 12
IBM Storwize V7000        4 - 12

RAID 5 compared to RAID 10


In general, RAID 10 arrays are capable of higher throughput for random write workloads than RAID 5, because RAID 10 requires only two I/Os per logical write compared to four I/Os per logical write for RAID 5. For random reads and sequential workloads, there is typically no benefit. With certain workloads, such as sequential writes, RAID 5 often shows a performance advantage. Obviously, selecting RAID 10 for its performance advantage comes at an extremely high cost in usable capacity, and, in most cases, RAID 5 is the best overall choice. When considering RAID 10, we recommend that you use Disk Magic to determine the difference in I/O service times between RAID 5 and RAID 10. If the service times are similar, the lower cost solution makes the most sense. If RAID 10 shows a service time advantage over RAID 5, the importance of that advantage must be weighed against its additional cost.
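The write penalties above translate directly into back-end load. This rough sizing sketch (our helper; it ignores cache hits and full-stride writes) shows why a write-heavy random workload favors RAID 10:

```python
def backend_iops(frontend_iops: float, write_fraction: float, raid: str) -> float:
    """Back-end physical I/Os generated by a random workload: reads
    cost one I/O each, while each logical write costs 2 I/Os on RAID 10
    and 4 on RAID 5 (read and write of both data and parity)."""
    write_penalty = {"RAID5": 4, "RAID10": 2}[raid]
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * write_penalty
```

For 1000 front-end IOPS at 50% writes, RAID 5 generates 2500 back-end IOPS versus 1500 for RAID 10; for a purely random-read workload, the two are identical.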

5.3.2 Selecting LUN attributes


We generally recommend that you configure LUNs to use the entire array, which is especially true for midrange storage subsystems, where multiple LUNs configured on an array have been shown to result in significant performance degradation. The performance degradation is attributed mainly to smaller cache sizes and the inefficient use of available cache, defeating the subsystem's ability to perform full stride writes for Redundant Array of Independent Disks 5 (RAID 5) arrays. Additionally, I/O queues for multiple LUNs directed at the same array can have a tendency to overdrive the array. Higher end storage controllers, such as the IBM System Storage DS8000 series, make this much less of an issue through the use of large cache sizes. However, arrays with large capacity might require that multiple LUNs are created due to the MDisk size limit. Besides that, on higher end storage controllers, most workloads show a negligible difference between a single LUN per array and multiple LUNs per array. In cases like that, where you have more than one LUN per array, we recommend including the LUNs in the same Storage Pool.

Table 5-2 provides our recommended guidelines for array provisioning on IBM storage subsystems.
Table 5-2   Array provisioning

Controller type                        LUNs per array
IBM System Storage DS4000/DS5000       1
IBM System Storage DS6000/DS8000       1 - 2
IBM Storwize V7000                     1

The selection of LUN attributes for Storage Pools requires the following primary considerations:
- Selecting the array size
- Selecting the LUN size
- The number of LUNs per array
- The number of physical disks per array
Important: We generally recommend that LUNs are created to use the entire capacity of the array. All LUNs (MDisks) used to create a Storage Pool must have the same performance characteristics. If MDisks of varying performance levels are placed in the same Storage Pool, the performance of the Storage Pool can be reduced to the level of the poorest performing MDisk. Likewise, all LUNs must also possess the same availability characteristics. Remember that the SVC does not provide any Redundant Array of Independent Disks (RAID) capabilities within a Storage Pool. The loss of access to any one of the MDisks within the Storage Pool impacts the entire Storage Pool. However, with the introduction of Volume Mirroring in SVC 4.3, you can protect against the loss of a Storage Pool by mirroring a volume across multiple Storage Pools. Refer to Chapter 6, Volumes on page 99 for more information. We recommend these best practices for LUN selection within a Storage Pool:
- LUNs are the same type.
- LUNs are the same RAID level.
- LUNs are the same RAID width (number of physical disks in the array).
- LUNs have the same availability and fault tolerance characteristics.
MDisks created on LUNs with varying performance and availability characteristics must be placed in separate Storage Pools.

5.3.3 Considerations for IBM XIV Storage System


The IBM XIV Storage System currently supports from 27 TB to 79 TB of usable capacity when using 1 TB drives, or from 55 TB to 161 TB when using 2 TB disks. The minimum volume size is 17 GB. Although smaller LUNs can be created, LUNs should be defined on 17 GB boundaries to maximize the physical space available.

Note: Although SVC V6.2 supports MDisks up to 256 TB, at the time of writing, MDisks larger than 2 TB are not supported on the IBM XIV Storage System.

Chapter 5. Storage pools and Managed Disks



The SVC supports a maximum of 511 LUNs presented from the IBM XIV System, and the SVC does not currently support dynamically expanding the size of an MDisk. As the IBM XIV System configuration grows from 6 to 15 modules, use the SVC rebalancing script (refer to 5.7, "Restriping (balancing) extents across a Storage Pool" on page 81) to restripe volume extents to include new MDisks. For a fully populated rack with 12 ports, you should create 48 volumes of 1632 GB each.

Tip: Always use the largest volumes possible without exceeding 2 TB.

Table 5-3 shows the number of 1632 GB LUNs created, depending on the XIV capacity.
Table 5-3   Values using 1632 GB LUNs
  Number of LUNs (MDisks)    IBM XIV System    IBM XIV System TB
  at 1632 GB each            TB used           capacity available
  16                         26.1              27
  26                         42.4              43
  30                         48.9              50
  33                         53.9              54
  37                         60.4              61
  40                         65.3              66
  44                         71.8              73
  48                         78.3              79
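The rows of Table 5-3 follow directly from the 1632 GB LUN size. A quick calculation (an illustrative sketch only; `tb_used` is a name chosen here, not an SVC or XIV tool) reproduces the "TB used" column:

```python
LUN_SIZE_GB = 1632  # largest XIV volume size that stays below the 2 TB MDisk limit

def tb_used(lun_count):
    """Return the decimal TB consumed by lun_count LUNs of 1632 GB each."""
    return round(lun_count * LUN_SIZE_GB / 1000, 1)

# Reproduce two rows of Table 5-3
print(tb_used(16))  # 26.1 TB used on a 27 TB system
print(tb_used(48))  # 78.3 TB used on a fully populated 79 TB rack
```

This also makes it easy to check how much of each rack configuration is left unallocated when 17 GB boundary alignment is respected.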

The best use of the SVC virtualization solution with the XIV Storage System can be achieved by performing LUN allocation using these basic parameters:
- Allocate all LUNs, known to the SVC as MDisks, to one Storage Pool. If multiple IBM XIV Storage Systems are managed by the SVC, there should be a separate Storage Pool for each physical IBM XIV System. This design provides a good queue depth on the SVC to drive the XIV adequately.
- Use 1 GB or larger extent sizes, because a large extent size ensures that data is striped across all XIV Storage System drives.

5.4 SVC quorum disk considerations


When back-end storage is initially added to an SVC cluster as a Storage Pool, three quorum disks are automatically created by allocating space from the assigned MDisks, and just one of them is selected as the active quorum disk. As more back-end storage controllers (and therefore Storage Pools) are added to the SVC cluster, the quorum disks do not get reallocated to span multiple back-end storage subsystems. To eliminate a situation where all quorum disks go offline due to a back-end storage subsystem failure, we recommend allocating quorum disks on multiple back-end storage subsystems. This design is of course only possible when multiple back-end storage subsystems (and therefore multiple Storage Pools) are available.


Important: Do not assign internal SVC SSD drives as quorum disks.

Even when there is only a single storage subsystem with multiple Storage Pools created from it, the quorum disks should be allocated from several Storage Pools so that a single array failure cannot cause the loss of all quorum disks. Reallocating quorum disks can be done from either the SVC GUI or the SVC command line interface (CLI). To list the SVC cluster quorum MDisks and view their number and status, issue the svcinfo lsquorum command as shown in Example 5-1.
Example 5-1 lsquorum command

IBM_2145:ITSO-CLS4:admin>svcinfo lsquorum
quorum_index status id name   controller_id controller_name active object_type
0            online 0  mdisk0 0             ITSO-4700       yes    mdisk
1            online 1  mdisk1 0             ITSO-4700       no     mdisk
2            online 2  mdisk2 0             ITSO-4700       no     mdisk

To move one of your SVC Quorum MDisks from one MDisk to another, or from one storage subsystem to another, use the svctask chquorum command as shown in Example 5-2.
Example 5-2 chquorum command

IBM_2145:ITSO-CLS4:admin>svctask chquorum -mdisk 9 2
IBM_2145:ITSO-CLS4:admin>svcinfo lsquorum
quorum_index status id name   controller_id controller_name active object_type
0            online 0  mdisk0 0             ITSO-4700       yes    mdisk
1            online 1  mdisk1 0             ITSO-4700       no     mdisk
2            online 2  mdisk9 1             ITSO-XIV        no     mdisk

As you can see in Example 5-2, quorum index 2 has been moved from mdisk2 on the ITSO-4700 controller to mdisk9 on the ITSO-XIV controller.

Note: Although the deprecated setquorum command still works, we recommend using the chquorum command to change the quorum association.

The cluster uses the quorum disk for two purposes: as a tie breaker in the event of a SAN fault, when exactly half of the nodes that were previously members of the cluster are present; and to hold a copy of important cluster configuration data. There is only one active quorum disk in a cluster; however, the cluster uses three MDisks as quorum disk candidates. The cluster automatically selects the active quorum disk from this pool of candidates. If a tie-breaker condition occurs, the half of the cluster nodes that is able to reserve the quorum disk after the split has occurred locks the disk and continues to operate. The other half stops its operation. This design prevents both sides from becoming inconsistent with each other.


Note: To be considered eligible as a quorum disk, an MDisk must meet these criteria:
- The MDisk must be presented by a disk subsystem that is supported to provide SVC quorum disks.
- The controller must have been manually allowed to be a quorum disk candidate using the svctask chcontroller -allowquorum yes command.
- The MDisk must be in managed mode (no image mode disks).
- The MDisk must have sufficient free extents to hold the cluster state information, plus the stored configuration metadata.
- The MDisk must be visible to all of the nodes in the cluster.

There are special considerations concerning the placement of the active quorum disk for stretched (split) cluster and split I/O Group configurations. Details are available at this website:
http://www-01.ibm.com/support/docview.wss?rs=591&uid=ssg1S1003311

Note: Running an SVC cluster without a quorum disk can seriously affect your operation. A lack of available quorum disks for storing metadata will prevent any migration operation (including a forced MDisk delete). Mirrored volumes can be taken offline if no quorum disk is available, because the synchronization status for mirrored volumes is recorded on the quorum disk.

During normal operation of the cluster, the nodes communicate with each other. If a node is idle for a few seconds, a heartbeat signal is sent to ensure connectivity with the cluster. If a node fails for any reason, the workload intended for it is taken over by another node until the failed node has been restarted and readmitted to the cluster (which happens automatically). If the microcode on a node becomes corrupted, resulting in a failure, the workload is transferred to another node. The code on the failed node is repaired, and the node is readmitted to the cluster (again, all automatically).

The number of extents required depends on the extent size for the Storage Pool containing the MDisk. Table 5-4 provides the number of extents reserved for quorum use by extent size.
Table 5-4   Number of extents reserved by extent size
  Extent size (MB)   Number of extents reserved for quorum use
  16                 17
  32                 9
  64                 5
  128                3
  256                2
  512                1
  1024               1
  2048               1
  4096               1
  8192               1
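As a quick reference, the values in Table 5-4 can be captured in a lookup that also shows the capacity actually reserved at each extent size (a sketch derived only from the table above; `quorum_reserved_mb` is a name chosen here for illustration):

```python
# Extents reserved for quorum use, keyed by extent size in MB (Table 5-4)
QUORUM_EXTENTS = {16: 17, 32: 9, 64: 5, 128: 3, 256: 2,
                  512: 1, 1024: 1, 2048: 1, 4096: 1, 8192: 1}

def quorum_reserved_mb(extent_size_mb):
    """Return the MB reserved for quorum use in a pool with this extent size."""
    return extent_size_mb * QUORUM_EXTENTS[extent_size_mb]

print(quorum_reserved_mb(16))   # 272 MB reserved at a 16 MB extent size
print(quorum_reserved_mb(512))  # 512 MB reserved at a 512 MB extent size
```

Note that the reserved capacity grows with the extent size, which is one more reason to size pool extents deliberately.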


5.5 Tiered storage


The SVC makes it easy to configure multiple tiers of storage within the same SVC cluster. You might have single-tiered and/or multitiered Storage Pools, as outlined below.

In a single-tiered Storage Pool, the MDisks should have the following characteristics to avoid inducing performance problems and other issues:
- They have the same hardware characteristics, for example, the same RAID type, RAID array size, disk type, and disk revolutions per minute (RPMs).
- The disk subsystems providing the MDisks have similar characteristics, for example, maximum input/output operations per second (IOPS), response time, cache, and throughput.
- The MDisks used are of the same size and therefore provide the same number of extents. If that is not feasible, you will need to check the distribution of the volumes' extents in that Storage Pool.

In a multitiered Storage Pool, you have a mix of MDisks with more than one type of disk tier attribute, for example, a Storage Pool containing a mix of generic_hdd and generic_ssd MDisks. A multitiered Storage Pool therefore contains MDisks with various characteristics, as opposed to a single-tier Storage Pool. However, it is a best practice for each tier to have MDisks of the same size that provide the same number of extents. Multitiered Storage Pools are used to enable the automatic migration of extents between disk tiers using the SVC Easy Tier function, which is described in Chapter 11, "Easy Tier" on page 279.

It is likely that the MDisks (LUNs) presented to the SVC cluster will have various performance attributes due to the type of disk or RAID array on which they reside. The MDisks might be on 15K RPM Fibre Channel or SAS disks, Nearline SAS or SATA disks, or even solid state disks (SSDs). Therefore, a storage tier attribute is assigned to each MDisk, with the default being generic_hdd. With SVC V6.2, a new tier 0 (zero) disk attribute, known as generic_ssd, is available for SSDs.

You can also define tiers of storage using storage controllers of varying performance and availability levels, and then easily provision them based on host, application, and user requirements. Remember that a single tier of storage can be represented by multiple Storage Pools. For example, if you have a large pool of tier 3 storage that is provided by many low-cost storage controllers, it is sensible to use a number of Storage Pools, which prevents a single offline volume from taking all of the tier 3 storage offline.

When multiple storage tiers are defined, you need to take precautions to ensure that storage is provisioned from the appropriate tiers. You can do so through Storage Pool and MDisk naming conventions, along with clearly defined storage requirements for all hosts within the installation.

Note: When multiple tiers are configured, it is a best practice to clearly indicate the storage tier in the naming convention used for the Storage Pools and MDisks.


5.6 Adding MDisks to existing Storage Pools


Before adding MDisks to existing Storage Pools, first ask yourself why you are doing so. If MDisks are being added to the SVC cluster to provide additional capacity, consider adding them to a new Storage Pool. Recognize that adding new MDisks to an existing Storage Pool reduces the reliability characteristics of the Storage Pool and risks destabilizing it if hardware problems exist with the new LUNs. If the Storage Pool is already meeting its performance objectives, we recommend that, in most cases, you add the new MDisks to new Storage Pools rather than to existing Storage Pools.

Important: Do not add an MDisk to a Storage Pool if you want to create an image mode volume from the MDisk that you are adding. As soon as you add an MDisk to a Storage Pool, it becomes managed, and extent mapping is not necessarily one-to-one anymore.

5.6.1 Checking access to new MDisks


You must be careful when adding MDisks to existing Storage Pools to ensure that the availability of the Storage Pool is not compromised by adding a faulty MDisk, because loss of access to a single MDisk causes the entire Storage Pool to go offline. Starting with SVC 4.2.1, an MDisk is automatically tested for reliable read/write access before it is added to a Storage Pool, so no user action is required. The test fails if:
- One or more nodes cannot access the MDisk through the chosen controller port.
- I/O to the disk does not complete within a reasonable time.
- The SCSI inquiry data provided for the disk is incorrect or incomplete.
- The SVC cluster suffers a software error during the MDisk test.

Note that image-mode MDisks are not tested before being added to a Storage Pool, because an offline image-mode MDisk will not take the Storage Pool offline.

5.6.2 Persistent reserve


A common condition in which MDisks can be configured by the SVC but cannot perform read/write I/O is when a persistent reserve (PR) has been left on a LUN from a previously attached host. Subsystems exposed to this condition were previously attached with IBM Subsystem Device Driver (SDD) or SDDPCM, because support for PR comes from these multipath drivers. You do not see this condition on a DS4000 that was previously attached using RDAC, because RDAC does not implement PR. In this condition, you need to rezone the LUNs and map them back to the host holding the reserve, or to another host that can remove the reserve through the use of a utility, such as lquerypr (included with SDD and SDDPCM) or the Windows SDD Persistent Reserve Tool.

5.6.3 Renaming MDisks


We recommend that you rename MDisks from their SVC-assigned name after you discover them. Using a naming convention for MDisks that associates the MDisk to the controller and array helps during problem isolation and avoids confusion that can lead to an administration error.


Note that when multiple tiers of storage exist on the same SVC cluster, you might also want to indicate the storage tier in the name as well. For example, you can use R5 and R10 to differentiate RAID levels or you can use T1, T2, and so on to indicate defined tiers. Best practice: Use a naming convention for MDisks that associates the MDisk with its corresponding controller and array within the controller, for example, DS8K_R5_12345.

5.7 Restriping (balancing) extents across a Storage Pool


Adding MDisks to existing Storage Pools can result in reduced performance across the Storage Pool due to the extent imbalance that occurs and the potential to create hot spots within the Storage Pool. After adding MDisks to a Storage Pool, we recommend rebalancing extents across all available MDisks, either manually through the command line interface (CLI), or automatically by using a Perl script that is available as part of the SVCTools package from the IBM alphaWorks Web site.

If you want to manually balance extents, you can use the following CLI commands (remember that the svcinfo and svctask prefixes are no longer required) to identify and correct extent imbalance across Storage Pools:
- lsmdiskextent
- migrateexts
- lsmigrate

The following section describes how to use the script from the SVCTools package to rebalance extents automatically. You can use this script on any host with Perl and an SSH client installed; we show how to install it on a Windows Server 2003 server.
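To illustrate the idea behind extent rebalancing (a simplified sketch, not the SVCTools balance.pl script; `balance_plan` is a hypothetical helper name), the following computes how many extents each MDisk must donate or receive to even out a pool:

```python
def balance_plan(extents_per_mdisk):
    """Given {mdisk_name: used_extents}, return (donors, receivers), where
    donors hold extents above the per-MDisk average and receivers sit below it.
    Each entry is (mdisk_name, extents_to_move)."""
    total = sum(extents_per_mdisk.values())
    avg = total // len(extents_per_mdisk)
    donors = [(m, n - avg) for m, n in extents_per_mdisk.items() if n > avg]
    receivers = [(m, avg - n) for m, n in extents_per_mdisk.items() if n < avg]
    return donors, receivers

# A pool recently expanded from four MDisks (64 extents each) to eight
pool = {"mdisk0": 64, "mdisk1": 64, "mdisk2": 64, "mdisk3": 64,
        "mdisk4": 0, "mdisk5": 0, "mdisk6": 0, "mdisk7": 0}
donors, receivers = balance_plan(pool)
print(donors)     # each original MDisk donates extents above the average of 32
print(receivers)  # each new MDisk receives extents up to the average
```

In practice, each (donor, receiver) pair would then be turned into a migrateexts command; the real script also weighs per-volume placement, which this sketch ignores.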

5.7.1 Installing prerequisites and the SVCTools package


For this test, we installed SVCTools on a Windows Server 2003 server. The major prerequisites are:
- PuTTY: This tool provides SSH access to the SVC cluster. If you are using an SVC Master Console or a System Storage Productivity Center (SSPC) server, it is already installed. If not, you can download PuTTY from the author's Web site at:
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
The easiest package to install is the Windows installer, which installs all the PuTTY tools in one location.
- Perl: Perl packages for Windows are available from a number of sources. We used ActivePerl, which can be downloaded free of charge from:
http://www.activestate.com/Products/activeperl/index.mhtml

The SVCTools package is available at:
http://www.alphaworks.ibm.com/tech/svctools

This package is a compressed file, which can be extracted wherever convenient. We extracted it to C:\SVCTools on the Master Console. The key files for the extent balancing script are:
- The SVCToolsSetup.doc file, which explains the installation and use of the script in detail


- The lib\IBM\SVC.pm file, which must be copied to the Perl lib directory. With ActivePerl installed in C:\Perl, copy it to C:\Perl\lib\IBM\SVC.pm.
- The examples\balance\balance.pl file, which is the rebalancing script.

5.7.2 Running the extent balancing script


The Storage Pool on which we tested the script was unbalanced, because we recently expanded it from four MDisks to eight MDisks. Example 5-3 shows that all of the volume extents are on the original four MDisks.
Example 5-3 The lsmdiskextent script output showing an unbalanced Storage Pool

IBM_2145:itsosvccl1:admin>lsmdisk -filtervalue "mdisk_grp_name=itso_ds45_18gb"
id name   status mode    mdisk_grp_id mdisk_grp_name capacity ctrl_LUN_#       controller_name UID
0  mdisk0 online managed 1            itso_ds45_18gb 18.0GB   0000000000000000 itso_ds4500 600a0b80001744310000011a4888478c00000000000000000000000000000000
1  mdisk1 online managed 1            itso_ds45_18gb 18.0GB   0000000000000001 itso_ds4500 600a0b8000174431000001194888477800000000000000000000000000000000
2  mdisk2 online managed 1            itso_ds45_18gb 18.0GB   0000000000000002 itso_ds4500 600a0b8000174431000001184888475800000000000000000000000000000000
3  mdisk3 online managed 1            itso_ds45_18gb 18.0GB   0000000000000003 itso_ds4500 600a0b8000174431000001174888473e00000000000000000000000000000000
4  mdisk4 online managed 1            itso_ds45_18gb 18.0GB   0000000000000004 itso_ds4500 600a0b8000174431000001164888472600000000000000000000000000000000
5  mdisk5 online managed 1            itso_ds45_18gb 18.0GB   0000000000000005 itso_ds4500 600a0b8000174431000001154888470c00000000000000000000000000000000
6  mdisk6 online managed 1            itso_ds45_18gb 18.0GB   0000000000000006 itso_ds4500 600a0b800017443100000114488846ec00000000000000000000000000000000
7  mdisk7 online managed 1            itso_ds45_18gb 18.0GB   0000000000000007 itso_ds4500 600a0b800017443100000113488846c000000000000000000000000000000000
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk0
id number_of_extents copy_id
0  64                0
2  64                0
1  64                0
4  64                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk1
id number_of_extents copy_id
0  64                0
2  64                0
1  64                0
4  64                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk2
id number_of_extents copy_id

0  64                0
2  64                0
1  64                0
4  64                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk3
id number_of_extents copy_id
0  64                0
2  64                0
1  64                0
4  64                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk4
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk5
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk6
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk7

The balance.pl script was then run on the Master Console using the following command:

C:\SVCTools\examples\balance>perl balance.pl itso_ds45_18gb -k "c:\icat.ppk" -i 9.43.86.117 -r -e

In this command:
- itso_ds45_18gb is the Storage Pool to be rebalanced.
- -k "c:\icat.ppk" gives the location of the PuTTY private key file, which is authorized for administrator access to the SVC cluster.
- -i 9.43.86.117 gives the IP address of the cluster.
- -r requires that the optimal solution is found. If this option is not specified, the extents can still be somewhat unevenly spread at completion, but omitting -r often requires fewer migration commands and less time. If time is important, it might be preferable to omit -r at first, and then rerun the command with -r if the solution is not good enough.
- -e specifies that the script will actually run the extent migration commands. Without this option, it merely prints the commands that it would have run, which can be used to check that the series of steps is logical before committing to the migration.

In this example, with 4 x 8 GB volumes, the migration completed within around 15 minutes. You can use the svcinfo lsmigrate command to monitor progress; this command shows a percentage for each extent migration command issued by the script. After the script completed, we checked that the extents had been correctly rebalanced, as shown in Example 5-4. In a test run of 40 minutes of I/O (25% random, 70/30 R/W) to the four volumes, performance for the balanced Storage Pool was around 20% better than for the unbalanced Storage Pool.
Example 5-4 The lsmdiskextent output showing a balanced Storage Pool

IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk0
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk1
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  31                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk2
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk3
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk4
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  33                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk5
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk6
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0
IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk7
id number_of_extents copy_id
0  32                0
2  32                0
1  32                0
4  32                0

Notes on the use of the extent balancing script


Note the following points when using the extent balancing script:
- Migrating extents might have a performance impact if the SVC or (more likely) the MDisks are already at the limit of their I/O capability. The script minimizes the impact by using the minimum priority level for migrations. Nevertheless, many administrators prefer to run these migrations during periods of low I/O workload, such as overnight.
- There are command line options to balance.pl that you can use to tune how extent balancing works, for example, excluding certain MDisks or certain volumes from the rebalancing. Refer to SVCToolsSetup.doc in svctools.zip for details.
- Because the script is written in Perl, the source code is available for you to modify and extend its capabilities. If you modify the source code, pay attention to the documentation in Plain Old Documentation (POD) format within the script.


5.8 Removing MDisks from existing Storage Pools


You might want to remove MDisks from a Storage Pool, for example, when decommissioning a storage controller. When removing MDisks from a Storage Pool, consider whether to manually migrate extents from the MDisks, and make sure that you remove the correct MDisks.

Sufficient space: The removal only takes place if there is sufficient space to migrate the volume data to other extents on other MDisks that remain in the Storage Pool. After you remove the MDisk from the Storage Pool, it takes time to change the mode from managed to unmanaged, depending on the size of the MDisk that you are removing.

5.8.1 Migrating extents from the MDisk to be deleted


If an MDisk contains volume extents, these extents need to be moved to the remaining MDisks in the Storage Pool. Example 5-5 shows how to list the volumes that have extents on a given MDisk using the CLI.
Example 5-5 Listing which volumes have extents on an MDisk to be deleted

IBM_2145:itsosvccl1:admin>svcinfo lsmdiskextent mdisk14
id number_of_extents copy_id
5  16                0
3  16                0
6  16                0
8  13                1
9  23                0
8  25                0

Specify the -force flag on the svctask rmmdisk command, or check the corresponding check box in the GUI. Either action causes the SVC to automatically move all used extents on the MDisk to the remaining MDisks in the Storage Pool. Alternatively, you might want to perform the extent migrations manually; otherwise, the automatic migration randomly allocates extents to MDisks (and areas of MDisks). After all extents have been manually migrated, the MDisk removal can proceed without the -force flag.

5.8.2 Verifying an MDisks identity before removal


It is critical that MDisks appear to the SVC cluster as unmanaged prior to removing their controller LUN mapping. Unmapping LUNs from the SVC that are still part of a Storage Pool will result in the Storage Pool going offline and will impact all hosts with mappings to volumes in that Storage Pool. If the MDisks have been named using the best practices, the correct LUNs are easier to identify. However, we recommend verifying that the LUNs being unmapped from the controller match the associated MDisks on the SVC by using the Controller LUN Number field and the unique identifier (UID) field. The UID is unique across all MDisks on all controllers; the Controller LUN Number, on the other hand, is only unique within a given controller and for a certain host. Therefore, when using
the Controller LUN Number, you must check that you are managing the correct storage controller and check that you are looking at the mappings for the correct SVC host object.

Tip: Renaming your back-end storage controllers as recommended also helps with MDisk identification.

For details about how to correlate back-end volumes (LUNs) with MDisks, refer to 5.8.3, "LUNs to MDisk translation".

5.8.3 LUNs to MDisk translation


Correct correlation of the back-end volume (LUN) with the SVC MDisk is crucial to avoid mistakes and possible outages. This section shows you how to correlate back-end volumes with MDisks for the DS4000, DS8000, XIV, and Storwize V7000 storage controllers.

DS4000
The DS4000 volumes should be identified using the Logical Drive ID along with the LUN Number associated with the host mapping. The following example refers to these values:
- Logical Drive ID = 600a0b80001744310000c60b4e2eb524
- LUN Number = 3

To identify the Logical Drive ID using the Storage Manager software, right-click a volume and select the Properties option. Refer to Figure 5-1 on page 87 as an example.


Figure 5-1 Logical Drive Properties for DS4000

To identify your LUN Number, go to the Mappings View, select your SVC Host Group then look at the LUN column on the right side. Refer to Figure 5-2 as an example.

Figure 5-2 Mappings View for DS4000

In order to correlate the above LUN with its corresponding MDisk, look at the MDisk details and check the UID field. The first 32 characters (600a0b80001744310000c60b4e2eb524) of the MDisk UID field should be exactly the same as your DS4000 Logical Drive ID. Then make sure that the associated DS4000 LUN Number correlates with the SVC ctrl_LUN_#. For this task, convert your DS4000 LUN Number to hexadecimal and check the last two digits of the SVC ctrl_LUN_# field. In our example in Figure 5-3 on page 88, it is 0000000000000003.

Note: The command line interface (CLI) references the Controller LUN Number as ctrl_LUN_#, and the graphical user interface (GUI) references it as LUN.


Figure 5-3 MDisk details for DS4000 volume.
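The two checks above can be expressed as a short helper (an illustrative sketch only, not an SVC tool; `matches_ds4000_lun` is a name chosen here, and the values are the example Logical Drive ID and LUN Number used in this section):

```python
def matches_ds4000_lun(mdisk_uid, ctrl_lun, logical_drive_id, lun_number):
    """Check that an SVC MDisk corresponds to a DS4000 logical drive:
    the first 32 characters of the MDisk UID must equal the Logical Drive ID,
    and the ctrl_LUN_# must equal the LUN number when read as hexadecimal."""
    uid_ok = mdisk_uid[:32].lower() == logical_drive_id.lower()
    lun_ok = int(ctrl_lun, 16) == lun_number
    return uid_ok and lun_ok

# Example values from this section (UID padded with zeros as on the SVC)
print(matches_ds4000_lun(
    "600a0b80001744310000c60b4e2eb524" + "0" * 32,
    "0000000000000003",
    "600a0b80001744310000c60b4e2eb524",
    3))  # True
```

Running the same check with a mismatched LUN number or UID prefix returns False, which is exactly the situation that should stop an unmapping operation.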

DS8000
The LUN ID only uniquely identifies LUNs within the same storage controller. If multiple storage devices are attached to the same SVC cluster, the LUN ID needs to be combined with the worldwide node name (WWNN) attribute in order to uniquely identify LUNs within the SVC cluster. To get the WWNN of the DS8000 controller, take the first 16 characters of the MDisk UID and change the first digit from 6 to 5, for example, from 6005076305ffc74c to 5005076305ffc74c.

The DS8000 LUN, when viewed as the SVC ctrl_LUN_#, decodes as 40XX40YY00000000, where XX is the LSS (Logical Subsystem) and YY is the LUN within the LSS. The LUN ID, as seen by the DS8000, is the four digits starting from the 29th character of the MDisk UID:

6005076305ffc74c000000000000100700000000000000000000000000000000

Figure 5-4 on page 89 shows the LUN ID fields that are displayed in the DS8000 Storage Manager.


Figure 5-4 DS8000 Storage Manager view for LUN ID

From the MDisk details panel in Figure 5-5, the Controller LUN Number field is 4010400700000000, which translates to LUN ID 0x1007 (represented in Hex).

Figure 5-5 MDisk Details for DS8000 volume

We can also identify the storage controller from the Storage Subsystem field as DS8K75L3001, which had been manually assigned.
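The 40XX40YY00000000 decoding described above can be sketched in a few lines of string slicing (an illustration only; `decode_ds8000_ctrl_lun` is a hypothetical helper name, not a DS8000 or SVC command):

```python
def decode_ds8000_ctrl_lun(ctrl_lun):
    """Decode an SVC ctrl_LUN_# of the form 40XX40YY00000000 into the
    DS8000 LSS and LUN-within-LSS, returned as the combined hex LUN ID."""
    lss = ctrl_lun[2:4]   # XX: the Logical Subsystem
    lun = ctrl_lun[6:8]   # YY: the LUN within the LSS
    return lss + lun

print(decode_ds8000_ctrl_lun("4010400700000000"))  # 1007
```

The result matches the four digits starting at the 29th character of the MDisk UID shown earlier, which is the cross-check this section recommends.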


IBM XIV
The XIV volumes should be identified using the volume Serial Number along with the LUN Number associated with the host mapping. The following example refers to these values:
- Serial Number = 897
- LUN Number = 2

To identify the volume Serial Number, right-click a volume and select the Properties option. Refer to Figure 5-6 as an example.

Figure 5-6 XIV Volume Properties

To identify your LUN Number, go to the Volumes by Hosts view, expand your SVC Host Group then refer to the LUN column. Refer to Figure 5-7 on page 91 as an example.


Figure 5-7 XIV Volumes by Hosts

The MDisk UID field is composed in part of the controller WWNN (characters 2 to 13). You can check those characters with the svcinfo lscontroller command, as shown in Example 5-6.
Example 5-6 lscontroller command

IBM_2145:tpcsvc62:admin>svcinfo lscontroller 10
id 10
controller_name controller10
WWNN 5001738002860000
...

The correlation can now be performed by taking the first 16 characters of the MDisk UID field. Characters 1 to 13 relate to the controller WWNN, as shown above, and characters 14 to 16 are the XIV volume Serial Number (897) in hexadecimal format (which results in 381 hex). See the translation details below:

0017380002860381000000000000000000000000000000000000000000000000
- 0017380002860 = derived from the controller WWNN (characters 2 to 13)
- 381 = XIV volume Serial Number converted to hex

To correlate the SVC ctrl_LUN_#, take the XIV LUN Number, convert it to hexadecimal format, and then check the last three digits of the SVC ctrl_LUN_#. In our example, it is 0000000000000002, as shown in Figure 5-8 on page 92.


Figure 5-8 MDisk details for XIV volume.
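The serial-number conversion used above can be checked with a small helper (an illustrative sketch, not an XIV or SVC tool; `xiv_serial_to_uid_suffix` is a name chosen here):

```python
def xiv_serial_to_uid_suffix(serial):
    """Convert a decimal XIV volume serial number to the 3-character hex
    string found at characters 14 to 16 of the SVC MDisk UID."""
    return format(serial, "03x")

print(xiv_serial_to_uid_suffix(897))  # 381

# Cross-check against the example UID from this section
uid = "0017380002860381" + "0" * 48
print(uid[13:16] == xiv_serial_to_uid_suffix(897))  # True
```

The same conversion, applied to the XIV LUN Number instead of the Serial Number, yields the trailing digits to expect in the ctrl_LUN_# field.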

V7000
The IBM Storwize V7000 solution is built upon the IBM SAN Volume Controller (SVC) technology base and uses similar terminology, so correlating V7000 volumes with SVC MDisks can be confusing at first. Looking at the V7000 side first, check the UID of the volume that was presented to the SVC host. Refer to Figure 5-9 on page 93 as an example.

Figure 5-9 V7000 Volume Details

Next, check the SCSI ID number for that specific volume on the Host Maps tab. This value is used to match the SVC ctrl_LUN_# value (in hexadecimal format). Refer to Figure 5-10 as an example.

Figure 5-10 V7000 Volume Details for Host Maps


On the SVC side, look at the MDisk details and compare the MDisk UID field with the V7000 volume UID; they should be exactly the same (the first 32 hex digits). Refer to Figure 5-11 as an example.

Figure 5-11 SVC MDisk Details for V7000 volumes

Then, double-check that the SVC ctrl_LUN_# value is the V7000 SCSI ID number in hexadecimal format. In our example, it is 0000000000000004.
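The same check can be sketched in Python. The function is illustrative only (not an SVC API), and the UID string below is a made-up placeholder, not a value from a real system:

```python
# Illustrative check of the correlation above: the first 32 hex digits
# of the SVC MDisk UID should equal the V7000 volume UID, and the
# ctrl_LUN_# value is the SCSI ID in 16-digit hexadecimal format.
def correlates(mdisk_uid: str, v7000_uid: str, ctrl_lun: str, scsi_id: int) -> bool:
    return (mdisk_uid[:32].lower() == v7000_uid[:32].lower()
            and ctrl_lun.lower() == format(scsi_id, "016x"))

v7000_uid = "60050768028180d9c800000000000011"  # hypothetical 32-digit UID
mdisk_uid = v7000_uid + "0" * 32                # SVC-side field is 64 digits; only the first 32 are compared
print(correlates(mdisk_uid, v7000_uid, "0000000000000004", 4))  # True
```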

5.9 Remapping managed MDisks


You generally do not unmap managed MDisks from the SVC, because doing so causes the Storage Pool to go offline. However, if managed MDisks have been unmapped from the SVC for a specific reason, it is important to know that the LUN must present the same attributes (UID, SSID, LUN ID, and so on) to the SVC as before it was unmapped. If the LUN is mapped back with different attributes, the SVC recognizes this MDisk as a new MDisk, and the associated Storage Pool does not come back online. Consider this situation for storage controllers that support LUN selection, because selecting a different LUN ID changes the UID. If the LUN has been mapped back with a different LUN ID, it must be remapped again by using the previous LUN ID.

Another instance where the UID can change on a LUN is the case where DS4000 support has regenerated the metadata for the logical drive definitions as part of a recovery procedure. When logical drive definitions are regenerated, the LUN appears as a new LUN, just as it does when it is created for the first time (the only exception is that the user data is still present). In this case, restoring the UID on a LUN to its prior value can only be done with the assistance of DS4000 support. Both the previous UID and the subsystem identifier (SSID) are required; both can be obtained from the controller profile. Refer to Figure 5-1 on
page 87 for an example of the Logical Drive Properties panel for a DS4000 logical drive. This panel shows Logical Drive ID (UID) and SSID.

5.10 Controlling extent allocation order for volume creation


When creating a volume, it is sometimes desirable to control the order in which extents are allocated across the MDisks in the Storage Pool, for the purpose of balancing workload across controller resources. For example, you can alternate extent allocation across DA pairs and even and odd extent pools in the DS8000. For this reason, it is very important to plan the order in which the MDisks are added to the Storage Pool, because extent allocation follows the sequence in which the MDisks were added.

Note: When volumes are created, the MDisk that contains the first extent is selected by a pseudo-random algorithm. The remaining extents are then allocated across the MDisks in the Storage Pool in a round-robin fashion, in the order in which the MDisks were added to the Storage Pool and according to the free extents available on each MDisk.

Table 5-5 shows the initial discovery order of six MDisks. Note that adding these MDisks to a Storage Pool in this order results in three contiguous extent allocations alternating between the even and odd extent pools, as opposed to alternating between extent pools for each extent.
Table 5-5 Initial discovery order

  LUN ID   MDisk ID   MDisk name   Controller resource (DA pair/extent pool)
  1000     1          mdisk01      DA2/P0
  1001     2          mdisk02      DA6/P16
  1002     3          mdisk03      DA7/P30
  1100     4          mdisk04      DA0/P9
  1101     5          mdisk05      DA4/P23
  1102     6          mdisk06      DA5/P39

To change the extent allocation so that each extent alternates between the even and odd extent pools, the MDisks can be removed from the Storage Pool and then re-added to the Storage Pool in their new order. Table 5-6 on page 96 shows how the MDisks have been re-added to the Storage Pool in their new order, so that the extent allocation alternates between the even and odd extent pools.


Table 5-6 MDisks re-added

  LUN ID   MDisk ID   MDisk name   Controller resource (DA pair/extent pool)
  1000     1          mdisk01      DA2/P0
  1100     4          mdisk04      DA0/P9
  1001     2          mdisk02      DA6/P16
  1101     5          mdisk05      DA4/P23
  1002     3          mdisk03      DA7/P30
  1102     6          mdisk06      DA5/P39

There are two options available for volume creation. We describe both options, along with the differences between them:

Option A: Explicitly select the candidate MDisks within the Storage Pool that will be used (through the command-line interface (CLI) only). Note that when you explicitly select the MDisk list, extent allocation round-robins across the MDisks in the order in which they appear on the list, starting with the first MDisk on the list:
- Example A1: Create a volume with MDisks from the explicit candidate list order: md001, md002, md003, md004, md005, and md006. The volume extent allocations begin at md001 and alternate round-robin around the explicit MDisk candidate list. In this case, the volume is distributed in the following order: md001, md002, md003, md004, md005, and md006.
- Example A2: Create a volume with MDisks from the explicit candidate list order: md003, md001, md002, md005, md006, and md004. The volume extent allocations begin at md003 and alternate round-robin around the explicit MDisk candidate list. In this case, the volume is distributed in the following order: md003, md001, md002, md005, md006, and md004.

Option B: Do not explicitly select the candidate MDisks within the Storage Pool (through the CLI or GUI). Note that when the MDisk list is not explicitly defined, extents are allocated across the MDisks in the order in which they were added to the Storage Pool, and the MDisk that receives the first extent is randomly selected:
- Example B1: Create a volume with MDisks from the candidate list order (based on the order in which the MDisks were added to the Storage Pool): md001, md002, md003, md004, md005, and md006. The volume extent allocations begin at a random MDisk starting point (let us assume md003 is randomly selected) and alternate round-robin around the MDisk candidate list, based on the order in which the MDisks were originally added to the Storage Pool. In this case, the volume is allocated in the following order: md003, md004, md005, md006, md001, and md002.

Be advised that when you create striped volumes specifying the MDisk order, if it is not well planned, you might place the first extent of several volumes on a single MDisk, which can lead to poor performance for workloads that place a large I/O load on the first extent of each volume, or that create multiple sequential streams.

Recommendation: For day-to-day administration, create striped volumes without specifying the MDisk order.
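The allocation behavior of both options can be illustrated with a small Python model. This is a toy sketch of the round-robin scheme described above, not SVC code:

```python
import random

# Toy model of SVC extent allocation: extents round-robin across the
# MDisk list. With an explicit list (Option A), allocation starts at the
# first MDisk; otherwise (Option B), the starting MDisk is chosen
# pseudo-randomly.
def allocate_extents(mdisks, n_extents, explicit_list=True):
    start = 0 if explicit_list else random.randrange(len(mdisks))
    return [mdisks[(start + i) % len(mdisks)] for i in range(n_extents)]

pool = ["md001", "md002", "md003", "md004", "md005", "md006"]
print(allocate_extents(pool, 6))
# ['md001', 'md002', 'md003', 'md004', 'md005', 'md006']
```

With explicit_list=False, re-running the function gives a different starting MDisk each time, which is why several volumes created this way do not all place their first extent on the same MDisk.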


5.11 Moving an MDisk between SVC clusters


It can sometimes be desirable to move an MDisk to a separate SVC cluster. Before beginning this task, consider the alternatives, which include:
- Using Metro Mirror or Global Mirror to copy the data to a remote cluster. One instance in which this approach might not be possible is where the SVC cluster is already in a mirroring partnership with another SVC cluster, and the data needs to be migrated to a third cluster.
- Attaching a host server to two SVC clusters and using host-based mirroring to copy the data.
- Using storage controller-based copy services. If you use storage controller-based copy services, make sure that the volumes containing the data are image mode and cache-disabled.

If none of these options is appropriate, follow these steps to move an MDisk to another cluster:
1. Ensure that the MDisk is in image mode rather than striped or sequential mode. An image-mode MDisk contains only the raw client data and no SVC metadata. If you want to move data from a non-image mode volume, first use the svctask migratetoimage command to migrate it to a single image-mode MDisk. For a Thin-provisioned volume, image mode means that all metadata for the volume is present on the same MDisk as the client data; the MDisk will not be readable by a host, but it can be imported by another SVC cluster.
2. Remove the image-mode volume from the first cluster by using the svctask rmvdisk command.

Note: You must not use the -force option of the svctask rmvdisk command. If you use the -force option, data in the cache is not written to the disk, which might result in metadata corruption for a Thin-provisioned volume.

3. Check, by using the svcinfo lsvdisk command, that the volume is no longer displayed. You must wait until it is removed, which allows the cached data to destage to disk.
4. Change the back-end storage LUN mappings to prevent the source SVC cluster from seeing the disk, and then make it available to the target cluster.
5. Run the svctask detectmdisk command on the target cluster.
6. Import the MDisk to the target cluster. If it is not a Thin-provisioned volume, use the svctask mkvdisk command with the -image option. If it is a Thin-provisioned volume, you also need to use two other options:
- The -import option instructs the SVC to look for thin volume metadata on the specified MDisk.
- The -rsize option indicates that the disk is Thin-provisioned. The value given to -rsize must be at least the amount of space that the source cluster used on the Thin-provisioned volume. If it is smaller, a 1862 error is logged. In this case, delete the volume and try the svctask mkvdisk command again.

The volume is now online. If it is not, and the volume is Thin-provisioned, check the SVC error log for an 1862 error; if an 1862 error is present, it indicates why the volume import failed (for example, metadata corruption). You might then be able to use the repairsevdiskcopy command to correct the problem.



Chapter 6.

Volumes
In this chapter, we discuss volumes (formerly VDisks) and the usage of FlashCopy. We describe creating volumes, managing them, and migrating them across I/O Groups.


6.1 Volume Overview


There are three types of volumes: striped, sequential, and image. These types are determined by the way in which the extents are allocated from the storage pool, as explained here:
- A volume created in striped mode has extents allocated from each MDisk in the storage pool in a round-robin fashion.
- With a sequential mode volume, extents are allocated sequentially from an MDisk.
- Image mode is a one-to-one mapped extent mode volume.

Striping compared to sequential type


With extremely few exceptions, you must always configure volumes using striping. However, one exception to this rule is an environment where you have a 100% sequential workload where disk loading across all volumes is guaranteed to be balanced by the nature of the application. For example, specialized video streaming applications are exceptions to this rule. Another exception to this rule is an environment where there is a high dependency on a large number of flash copies. In this case, FlashCopy loads the volumes evenly and the sequential I/O, which is generated by the flash copies, has higher throughput potential than what is possible with striping. This situation is a rare exception given the unlikely requirement to optimize for FlashCopy as opposed to online workload.

6.1.1 Thin-provisioned volumes


Volumes can be configured to be either thin-provisioned or fully allocated. Thin-provisioned volumes are created with two capacities: a real capacity and a virtual capacity. You can still create volumes by using a striped, sequential, or image mode virtualization policy, just as you can with any other volume.

The real capacity defines how much disk space is actually allocated to a volume. The virtual capacity is the capacity of the volume that is reported to other SVC components (for example, FlashCopy or Remote Copy) and to the hosts. A directory maps the virtual address space to the real address space. The directory and the user data share the real capacity.

Thin-provisioned volumes have two operating modes: autoexpand on or off. You can switch the mode at any time. If you select the autoexpand feature, the SVC automatically adds a fixed amount of additional real capacity to the thin volume as required. Autoexpand therefore attempts to maintain a fixed amount of unused real capacity for the volume. This amount is known as the contingency capacity.

The contingency capacity is initially set to the real capacity that is assigned when the volume is created. If the user modifies the real capacity, the contingency capacity is reset to the difference between the used capacity and the real capacity. A volume that is created without the autoexpand feature, and thus has a zero contingency capacity, goes offline as soon as the real capacity is used and it needs to expand.


Note: We strongly recommend that you enable the warning threshold (by using email or an SNMP trap) when working with Thin-provisioned volumes, on the volume and on the Storage Pool side, especially when you do not use the autoexpand mode. Otherwise, the thin volume goes offline if it runs out of space.

Autoexpand does not cause the real capacity to grow much beyond the virtual capacity. The real capacity can be manually expanded to more than the maximum that is required by the current virtual capacity, and the contingency capacity is then recalculated.

A thin-provisioned volume can be converted nondisruptively to a fully allocated volume, or vice versa, by using the volume mirroring function. For example, you can add a thin-provisioned copy to a fully allocated primary volume and then remove the fully allocated copy from the volume after they are synchronized. The fully allocated to thin-provisioned migration procedure uses a zero-detection algorithm, so that grains containing all zeros do not cause any real capacity to be used.

Tip: Consider using thin-provisioned volumes as targets in FlashCopy relationships.
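The contingency capacity rules described above can be summarized with a small Python model. This is a toy sketch, not SVC code, and the class and method names are illustrative only:

```python
# Toy model of contingency capacity: it starts equal to the initial real
# capacity, and a manual change of the real capacity resets it to
# (real capacity - used capacity).
class ThinVolume:
    def __init__(self, real_gb: float, virtual_gb: float):
        self.real = real_gb
        self.virtual = virtual_gb
        self.used = 0.0
        self.contingency = real_gb  # initially the whole real capacity

    def write_new_grains(self, gb: float) -> None:
        self.used += gb  # writes to new grains consume real capacity

    def resize_real(self, new_real_gb: float) -> None:
        self.real = new_real_gb
        self.contingency = self.real - self.used

vol = ThinVolume(real_gb=10, virtual_gb=100)
vol.write_new_grains(4)
vol.resize_real(20)
print(vol.contingency)  # 16.0
```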

6.1.2 Space allocation


As mentioned, when a thin-provisioned volume is initially created, a small amount of the real capacity is used for the initial metadata. Write I/Os to grains of the thin volume that have not previously been written to cause grains of the real capacity to be used to store metadata and user data. Write I/Os to grains that have previously been written to update the grain where the data was previously written.

Note: The grain size is defined when the volume is created and can be 32 KB, 64 KB, 128 KB, or 256 KB. Smaller granularities can save more space, but they have larger directories. When you use thin-provisioning with FlashCopy (FC), specify the same grain size for both the thin-provisioned volume and FlashCopy. For more details about thin-provisioned FlashCopy, refer to 6.8.5, Thin-provisioned FlashCopy on page 125.

6.1.3 Thin-provisioned volume performance


Thin-provisioned volumes require more I/Os because of the directory accesses:
- For truly random workloads, a thin-provisioned volume requires approximately one directory I/O for every user I/O, so performance will be 50% of a normal volume.
- The directory is two-way write-back cached (just like the SVC fast-write cache), so certain applications perform better.
- Thin-provisioned volumes require more CPU processing, so the performance per I/O Group will be lower.

You need to use the striping policy to spread thin-provisioned volumes across many Storage Pools.

Important: Do not use thin-provisioned volumes where high I/O performance is required.
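The 50% figure follows from simple arithmetic: if every user I/O costs one additional directory I/O, the same back-end I/O budget serves half as many user I/Os. A back-of-envelope Python model (an illustration, not a sizing tool):

```python
# Back-of-envelope model of thin-provisioned volume throughput: each
# user I/O consumes (1 + directory overhead) back-end I/Os, so the
# effective user I/O rate shrinks accordingly.
def effective_user_iops(backend_iops: float, directory_ios_per_user_io: float = 1.0) -> float:
    return backend_iops / (1.0 + directory_ios_per_user_io)

print(effective_user_iops(10000))       # 5000.0 -> 50% of a normal volume
print(effective_user_iops(10000, 0.0))  # 10000.0 -> fully allocated volume
```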


Thin-provisioned volumes only save capacity if the host server does not write to the whole volume. Whether the thin-provisioned volume works well partly depends on how the filesystem allocates space:
- Certain filesystems (for example, NTFS (NT File System)) write to the whole volume before overwriting deleted files, while other filesystems reuse space in preference to allocating new space.
- Filesystem problems can be moderated by tools, such as defrag, or by managing storage by using host Logical Volume Managers (LVMs).
The thin-provisioned volume is also dependent on how applications use the filesystem; for example, certain applications only delete log files when the filesystem is nearly full.

Note: There is no single recommendation for thin-provisioned volumes and best performance or practice. As already explained, it depends on what is used in the particular environment. For the absolute best performance, use fully allocated volumes instead of thin-provisioned volumes. For more considerations about performance, refer to Part 1, Performance best practices on page 225.

6.1.4 Limits on Virtual Capacity of Thin-provisioned volumes


Two factors (extent size and grain size) limit the virtual capacity of thin-provisioned volumes beyond the factors that limit the capacity of regular volumes. Refer to Table 6-1 and Table 6-2 for the maximum thin-provisioned volume virtual capacities for a given extent size and grain size.
Table 6-1 Maximum thin volume virtual capacities for given extent size

  Extent size, MB   Max volume real capacity, GB   Max thin virtual capacity, GB
  16                2,048                          2,000
  32                4,096                          4,000
  64                8,192                          8,000
  128               16,384                         16,000
  256               32,768                         32,000
  512               65,536                         65,000
  1024              131,072                        130,000
  2048              262,144                        260,000
  4096              524,288                        520,000
  8192              1,048,576                      1,040,000


Table 6-2 Maximum thin volume virtual capacities for given grain size

  Grain size, KB   Max thin virtual capacity, GB
  32               260,000
  64               520,000
  128              1,040,000
  256              2,080,000
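Every row of Table 6-1 is consistent with a fixed limit of 2^17 extents per volume, so the maximum real capacity grows linearly with the extent size. The following Python sketch shows that relationship; the 2^17 figure is inferred from the table itself, not quoted from the SVC specification:

```python
# Maximum real capacity scales linearly with extent size, assuming a
# fixed per-volume limit of 2**17 extents (inferred from Table 6-1).
def max_real_capacity_gb(extent_size_mb: int) -> int:
    return extent_size_mb * 2**17 // 1024

for size_mb in (16, 256, 8192):
    print(size_mb, "MB extents ->", max_real_capacity_gb(size_mb), "GB")
# 16 MB extents -> 2048 GB
# 256 MB extents -> 32768 GB
# 8192 MB extents -> 1048576 GB
```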

6.1.5 Testing an application with Thin-provisioned volume


To help you understand what works in combination with thin-provisioned volumes, perform this test:
1. Create a thin-provisioned volume with autoexpand turned off.
2. Test the application.
3. If the application and the thin-provisioned volume do not work well together, the volume fills up and, in the worst case, goes offline.
4. If the application and the thin-provisioned volume work well together, the volume does not fill up and remains online.
5. You can configure warnings and also monitor how much capacity is being used.
6. If necessary, expand or shrink the real capacity of the volume.
7. When you have determined whether the combination of the application and the thin-provisioned volume works well, you can enable autoexpand.

6.2 What is volume mirroring


With the volume mirroring feature, we can create a volume with one or two copies, providing a simple RAID 1 function; thus, a volume has two physical copies of its data. These copies can be in the same Storage Pool or in different Storage Pools (with different extent sizes). The first Storage Pool that is specified contains the primary copy.

If a volume is created with two copies, both copies use the same virtualization policy, just as with any other volume. However, there is also a way to have two copies of a volume with different virtualization policies. In combination with thin-provisioning, each mirror of a volume can be thin-provisioned or fully allocated, and in striped, sequential, or image mode.

A mirrored volume has all of the capabilities of a volume and also the same restrictions (for example, a mirrored volume is owned by an I/O Group, just as any other volume). This feature also provides a point-in-time copy function, which is achieved by splitting a copy from the volume.

6.2.1 Creating or adding a mirrored volume


When a mirrored volume is created and the format has been specified, all copies are formatted before the volume comes online. The copies are then considered synchronized. Alternatively, with the no synchronization option chosen, the mirrored volumes are not synchronized.

Not synchronizing might be helpful in these cases:
- If it is known that the already formatted MDisk space will be used for mirrored volumes.
- If it is simply not required that the copies be synchronized.

6.2.2 Availability of mirrored volumes


Volume mirroring provides a basic level of Redundant Array of Independent Disks 1 (RAID 1) protection against controller and Storage Pool failure, because it allows you to create a volume with two copies in different Storage Pools. If one storage controller or Storage Pool fails, a volume copy is not affected if it has been placed on a different storage controller or in a different Storage Pool.

For FlashCopy usage, a mirrored volume is only online to other nodes if it is online in its own I/O Group and if the other nodes have visibility to the same copies as the nodes in the I/O Group. If a mirrored volume is a source volume in a FlashCopy relationship, asymmetric path failures or a failure of the mirrored volume's I/O Group can cause the target volume to be taken offline.

6.2.3 Mirroring between controllers


As mentioned, one advantage of mirrored volumes is having the volume copies on different storage controllers or Storage Pools. Normally, the read I/O is directed to the primary copy, so the primary copy must be available and synchronized.

Important: For best practice and best performance, put all of the primary mirrored volume copies on the same storage controller, or you might see a performance impact.

Selecting the copy that is allocated on the higher-performance storage controller maximizes the read performance of the volume. The write performance is constrained by the lower-performance controller, because writes must complete to both copies before the volume is considered to have been written successfully.
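The read and write behavior just described can be modeled in a few lines of Python. This is an illustrative latency model, not SVC code:

```python
# Simple latency model of mirrored-volume I/O: reads are serviced by
# the primary copy, while a write completes only after both copies have
# completed, so it is bounded by the slower controller.
def read_latency_ms(primary_ms: float, secondary_ms: float) -> float:
    return primary_ms

def write_latency_ms(primary_ms: float, secondary_ms: float) -> float:
    return max(primary_ms, secondary_ms)

# Primary on a fast controller (2 ms), copy on a slower one (8 ms):
print(read_latency_ms(2.0, 8.0))   # 2.0
print(write_latency_ms(2.0, 8.0))  # 8.0
```

This is why placing the primary copy on the higher-performance controller helps reads but cannot help writes.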

6.3 Creating Volumes


Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933-00, fully describes the creation of volumes. The best practices that we strongly recommend are:
- Decide on your naming convention before you begin. It is much easier to assign the correct names at the time of volume creation than to modify them afterward.
- Each volume has an I/O Group and a preferred node that balances the load between nodes in the I/O Group, so balance the volumes across the I/O Groups in the cluster to balance the load across the cluster.
- In configurations with large numbers of attached hosts, where it is not possible to zone a host to multiple I/O Groups, it might not be possible to choose to which I/O Group to attach the volumes. The volume has to be created in the I/O Group to which its host belongs. For moving a volume across I/O Groups, refer to 6.3.3, Moving a volume to another I/O Group on page 106.


Note: Migrating volumes across I/O Groups is a disruptive action. Therefore, it is best to specify the correct I/O Group at the time of volume creation.

- By default, the preferred node, which owns a volume within an I/O Group, is selected on a load-balancing basis. At the time of volume creation, the workload to be put on the volume might not be known, but it is important to distribute the workload evenly across the SVC nodes within an I/O Group. The preferred node cannot easily be changed. If you need to change the preferred node, refer to 6.3.2, Changing the preferred node within an I/O Group on page 106.
- The maximum number of volumes per I/O Group is 2048. The maximum number of volumes per cluster is 8192 (an eight-node cluster).
- The smaller the extent size that you select, the finer the granularity of the volume space occupied on the underlying storage controller. A volume occupies an integer number of extents, but its length does not need to be an integer multiple of the extent size. The length does need to be an integer multiple of the block size. Any space left over between the last logical block in the volume and the end of the last extent in the volume is unused. A small extent size is used to minimize this unused space. The counter view is that the smaller the extent size, the smaller the total storage capacity that the SVC can virtualize. The extent size does not affect performance. For most clients, extent sizes of 128 MB or 256 MB give a reasonable balance between volume granularity and cluster capacity. There is no longer a default value set; the extent size is set during Managed Disk (MDisk) Group creation.

Important: Volumes can only be migrated between Storage Pools that have the same extent size, except for mirrored volumes. The two copies can be in different Storage Pools with different extent sizes.
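The unused space at the end of the last extent can be estimated with a short Python sketch (illustrative only):

```python
import math

# A volume occupies a whole number of extents, so the tail of the last
# extent is unused unless the volume size is an exact multiple of the
# extent size.
def unused_mb(volume_mb: int, extent_mb: int) -> int:
    extents = math.ceil(volume_mb / extent_mb)
    return extents * extent_mb - volume_mb

print(unused_mb(1000, 256))  # 24 (4 extents = 1024 MB for a 1000 MB volume)
print(unused_mb(1000, 16))   # 8  (63 extents = 1008 MB)
```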
As mentioned in the first section of this chapter, a volume can be created as thin-provisioned or fully allocated; in one of three modes: striped, sequential, or image; and with one or two copies (volume mirroring). With extremely few exceptions, you must always configure volumes using striping mode.

Note: Electing to use sequential mode over striping requires a detailed understanding of the data layout and workload characteristics in order to avoid negatively impacting the system performance.

6.3.1 Selecting the Storage Pool


You can use the SVC to create tiers of storage, each with different performance characteristics. The best practice, when creating volumes for a new server for the first time, is to place all of the volumes for that specific server in a single Storage Pool. Later, if you observe that the Storage Pool is saturated, or if your server demands more performance, start moving certain volumes to another Storage Pool, or move all of the volumes to a higher tier Storage Pool. Remember that by having volumes from the same server in more than one Storage Pool, you increase the availability risk, because the server is affected if any of the Storage Pools related to it goes offline.


6.3.2 Changing the preferred node within an I/O Group


Currently, there is no nondisruptive method to change the preferred node within an I/O Group. The easiest way is to edit the volume properties, as shown in Figure 6-1.

Figure 6-1 changing the preferred node

As you can also see in Figure 6-1, changing the preferred node is disruptive to host traffic, so the best practice to perform this operation is:
a. Cease I/O operations to the volume.
b. Disconnect the volume from the host operating system. For example, in Windows, remove the drive letter.
c. On the SVC, unmap the volume from the host.
d. On the SVC, change the preferred node.
e. On the SVC, remap the volume to the host.
f. Rediscover the volume on the host.
g. Resume I/O operations on the host.

6.3.3 Moving a volume to another I/O Group


The procedure of migrating a volume between I/O Groups is disruptive, because access to the volume is lost. If a volume is moved between I/O Groups, the path definitions of the volume are not refreshed dynamically; the old IBM Subsystem Device Driver (SDD) paths must be removed and replaced with the new ones. The best practice is to migrate volumes between I/O Groups with the hosts shut down. Then, follow the procedure listed in 8.2, Host pathing on page 199 for the reconfiguration of SVC volumes to hosts. We recommend that you remove the stale configuration and reboot the host in order to reconfigure the volumes that are mapped to it.


When migrating a volume between I/O Groups, you can specify the preferred node, if desired, or you can let the SVC assign the preferred node. Ensure that when you migrate a volume to a new I/O Group, you quiesce all I/O operations for the volume. Determine the hosts that use this volume, and make sure that it is properly zoned to the target SVC I/O Group. Stop or delete any FlashCopy mappings or Metro Mirror or Global Mirror relationships that use this volume. To check whether the volume is part of a relationship or mapping, issue the svcinfo lsvdisk command that is shown in Example 6-1, where vdiskname/id is the name or ID of the volume.
Example 6-1 Output of lsvdisk command

IBM_2145:svccf8:admin>svcinfo lsvdisk TEST_1
id 2
name TEST_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id many
mdisk_grp_name many
capacity 1.00GB
type many
formatted no
mdisk_id many
mdisk_name many
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000002
...

Look for the FC_id and RC_id fields. If these fields are not blank, the volume is part of a mapping or a relationship. The procedure is:
1. Cease I/O operations to the volume.
2. Disconnect the volume from the host operating system. For example, in Windows, remove the drive letter.
3. Stop any copy operations.
4. Issue the command to move the volume (refer to Example 6-2). This command does not work while there is data in the SVC cache that is to be written to the volume. After two minutes, the data automatically destages if no other condition forces an earlier destaging.
5. On the host, rediscover the volume. For example, in Windows, run a rescan, and then either mount the volume or add a drive letter. Refer to Chapter 8, Hosts on page 191.
6. Resume copy operations as required.
7. Resume I/O operations on the host.

After any copy relationships are stopped, you can move the volume across I/O Groups with a single command in the SVC:

svctask chvdisk -iogrp newiogrpname/id vdiskname/id
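The FC_id/RC_id check can be scripted. The following Python sketch assumes the field-per-line output format shown in Example 6-1; it is an illustration, not an IBM-provided tool:

```python
# Decide whether a volume is part of a FlashCopy mapping or remote-copy
# relationship by checking whether the FC_id or RC_id field carries a
# value in the (assumed) field-per-line lsvdisk output.
def in_copy_relationship(lsvdisk_output: str) -> bool:
    fields = {}
    for line in lsvdisk_output.splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return bool(fields.get("FC_id") or fields.get("RC_id"))

sample = "id 2\nname TEST_1\nFC_id\nFC_name\nRC_id\nRC_name\n"
print(in_copy_relationship(sample))               # False (both fields blank)
print(in_copy_relationship("FC_id 3\nRC_id\n"))   # True (FlashCopy mapping present)
```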


In this command, newiogrpname/id is the name or ID of the I/O Group to which you move the volume and vdiskname/id is the name or ID of the volume. Example 6-2 shows the command to move the volume named TEST_1 from its existing I/O Group, io_grp0, to io_grp1.
Example 6-2 Command to move a volume to another I/O Group

IBM_2145:svccf8:admin>svctask chvdisk -iogrp io_grp1 TEST_1

Migrating volumes between I/O Groups can cause issues if the old definitions of the volumes are not removed from the configuration prior to importing the volumes to the host. Migrating volumes between I/O Groups is not a dynamic configuration change; it must be done with the hosts shut down. Then, follow the procedure listed in Chapter 8, Hosts on page 191 for the reconfiguration of SVC volumes to hosts. We recommend that you remove the stale configuration and reboot the host to reconfigure the volumes that are mapped to it. For details about how to dynamically reconfigure the IBM Subsystem Device Driver (SDD) for a specific host operating system, refer to Multipath Subsystem Device Driver: Users Guide, GC52-1309, where this procedure is described in great depth.

Note: Do not move a volume to an offline I/O Group under any circumstances. You must ensure that the I/O Group is online before moving the volumes to avoid any data loss.

This command does not work if there is any data in the SVC cache, which must be flushed out first. There is a -force flag; however, this flag discards the data in the cache rather than flushing it to the volume. If the command fails due to outstanding I/Os, it is better to wait a couple of minutes, after which the SVC automatically flushes the data to the volume.

Note: Using the -force flag can result in data integrity issues.

6.4 Volume migration


A volume can be migrated from one storage pool to another regardless of the virtualization type (image, striped, or sequential). The command varies, depending on the type of migration, as shown in Table 6-3.
Table 6-3 Migration types and associated commands

Storage pool-to-storage pool type        Command
Managed to managed / Image to managed    migratevdisk
Managed to image / Image to image        migratetoimage

Migrating a volume from one Storage Pool to another is non-disruptive to the host application using the volume. Depending on the workload of the SVC, there might be a slight performance impact. For this reason, we recommend that you migrate a volume from one Storage Pool to another when there is a relatively low load on the SVC.

SAN Volume Controller Best Practices and Performance Guidelines
Rule: For the migration to be possible, the source and destination storage pools must have the same extent size.

Note that volume mirroring can also be used to migrate a volume between storage pools. This method can be used if the extent sizes of the two pools are not the same. The following sections discuss the best practices to follow when you perform volume migrations.

6.4.1 Image type to striped type migration


When migrating existing storage into the SVC, the existing storage is brought in as image type volumes, which means that each volume is based on a single MDisk. In general, we recommend that the volume be migrated to a striped type volume, which is striped across multiple MDisks and, therefore, multiple RAID arrays, as soon as it is practical. You can generally expect to see a performance improvement by migrating from image type to striped type. Example 6-3 shows the command. This process is fully described in Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933-00.
Example 6-3 Image mode migration command

IBM_2145:svccf8:admin>svctask migratevdisk -mdiskgrp MDG1DS4K -threads 4 -vdisk Migrate_sample

This command migrates our volume, Migrate_sample, to the Storage Pool, MDG1DS4K, and uses four threads while migrating. Note that instead of using the volume name, you can use its ID number. You can monitor the migration process by using the svcinfo lsmigrate command, as shown in Example 6-4.
Example 6-4 Monitoring the migration process

IBM_2145:svccf8:admin>svcinfo lsmigrate
migrate_type MDisk_Group_Migration
progress 0
migrate_source_vdisk_index 3
migrate_target_mdisk_grp 2
max_thread_count 4
migrate_source_vdisk_copy_id 0
IBM_2145:svccf8:admin>

6.4.2 Migrating to image type volume


An image type volume is a direct, straight-through mapping to exactly one image mode MDisk. If a volume is migrated to another MDisk, the volume is represented as being in managed mode during the migration. It is only represented as an image type volume after it has reached the state where it is a straight-through mapping. Image type disks are used to migrate existing data into an SVC and to migrate data out of virtualization. Image type volumes cannot be expanded. The usual reason for migrating a volume to an image type volume is to move the data on the disk to a non-virtualized environment. This operation can also be carried out to change the preferred node that is used by a volume. Refer to 6.3.2, "Changing the preferred node within an I/O Group" on page 106.
In order to migrate a striped type volume to an image type volume, you must be able to migrate to an available unmanaged MDisk. The destination MDisk must be greater than or equal to the size of the volume that you want to migrate. Regardless of the mode in which the volume starts, it is reported as managed mode during the migration. Both of the MDisks involved are reported as being in image mode during the migration. If the migration is interrupted by a cluster recovery, the migration resumes after the recovery completes. You must perform these command line steps:
1. To determine the name of the volume to be moved, issue the command:
svcinfo lsvdisk
The output is in the form that is shown in Example 6-5.
Example 6-5 The lsvdisk output

IBM_2145:svccf8:admin>svcinfo lsvdisk -delim :
id:name:IO_group_id:IO_group_name:status:mdisk_grp_id:mdisk_grp_name:capacity:type:FC_id:FC_name:RC_id:RC_name:vdisk_UID:fc_map_count:copy_count:fast_write_state:se_copy_count
0:NYBIXTDB02_T03:0:io_grp0:online:3:MDG4DS8KL3331:20.00GB:striped:::::60050768018205E12000000000000000:0:1:empty:0
1:NYBIXTDB02_2:0:io_grp0:online:0:MDG1DS8KL3001:5.00GB:striped:::::60050768018205E12000000000000007:0:1:empty:0
2:TEST_1:0:io_grp0:online:many:many:1.00GB:many:::::60050768018205E12000000000000002:0:2:empty:0
3:Migrate_sample:0:io_grp0:online:2:MDG1DS4K:2.00GB:striped:::::60050768018205E12000000000000012:0:1:empty:0

2. In order to migrate the volume, you need the name of the MDisk to which you will migrate it. Example 6-6 shows the command that you use.
Example 6-6 The lsmdisk command output

IBM_2145:svccf8:admin>lsmdisk -delim :
id:name:status:mode:mdisk_grp_id:mdisk_grp_name:capacity:ctrl_LUN_#:controller_name:UID:tier
0:D4K_ST1S12_LUN1:online:managed:2:MDG1DS4K:20.0GB:0000000000000000:DS4K:600a0b8000174233000071894e2eccaf00000000000000000000000000000000:generic_hdd
1:mdisk0:online:array:3:MDG4DS8KL3331:136.2GB::::generic_ssd
2:D8K_L3001_1001:online:managed:0:MDG1DS8KL3001:20.0GB:4010400100000000:DS8K75L3001:6005076305ffc74c000000000000100100000000000000000000000000000000:generic_hdd
...
33:D8K_L3331_1108:online:unmanaged:::20.0GB:4011400800000000:DS8K75L3331:6005076305ffc747000000000000110800000000000000000000000000000000:generic_hdd
34:D4K_ST1S12_LUN2:online:managed:2:MDG1DS4K:20.0GB:0000000000000001:DS4K:600a0b80001744310000c6094e2eb4e400000000000000000000000000000000:generic_hdd

From this output, we can see that D8K_L3331_1108 is a candidate for the image type migration, because it is unmanaged.
3. We now have enough information to enter the command to migrate the volume to image type, as shown in Example 6-7 on page 111.

Example 6-7 The migratetoimage command

IBM_2145:svccf8:admin>svctask migratetoimage -vdisk Migrate_sample -threads 4 -mdisk D8K_L3331_1108 -mdiskgrp IMAGE_Test

4. If there is no unmanaged MDisk to which to migrate, you can remove an MDisk from a Storage Pool. However, you can only remove an MDisk from a Storage Pool if there are enough free extents on the remaining MDisks in the group to migrate any used extents on the MDisk that you are removing.

6.4.3 Migrating with volume mirroring


Volume mirroring offers the facility to migrate volumes between Storage Pools with different extent sizes:
1. Add a copy to the target Storage Pool.
2. Wait until the synchronization is complete.
3. Remove the copy in the source Storage Pool.
The migration from a thin-provisioned to a fully allocated volume is almost the same:
1. Add a target fully allocated copy.
2. Wait for synchronization to complete.
3. Remove the source thin-provisioned copy.
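As an illustration, the mirroring-based migration can be expressed as a small command-builder sketch. The volume name VOL1, the target pool POOL_B, and the source copy ID 0 are hypothetical examples, and we assume the addvdiskcopy, lsvdisksyncprogress, and rmvdiskcopy commands behave as in SVC 6.x:

```python
def mirror_migration_cmds(vdisk, target_pool, source_copy_id=0):
    """Build the SVC CLI sequence for a volume-mirroring migration.

    The copy added by addvdiskcopy lands in the target pool; once
    lsvdisksyncprogress reports 100 percent, the original copy in the
    source pool can be removed.
    """
    return [
        # 1. Add a copy of the volume in the target Storage Pool
        f"svctask addvdiskcopy -mdiskgrp {target_pool} {vdisk}",
        # 2. Poll until synchronization is complete (progress = 100)
        f"svcinfo lsvdisksyncprogress {vdisk}",
        # 3. Remove the copy that remains in the source Storage Pool
        f"svctask rmvdiskcopy -copy {source_copy_id} {vdisk}",
    ]

for cmd in mirror_migration_cmds("VOL1", "POOL_B"):
    print(cmd)
```

Because the new copy is allocated from the target pool's own extents, this sequence works even when the two pools use different extent sizes.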

6.5 Preferred paths to a volume


For I/O purposes, SVC nodes within the cluster are grouped into pairs, which are called I/O Groups. A single pair is responsible for serving I/O on a specific volume. One node within the I/O Group represents the preferred path for I/O to a specific volume. The other node represents the non-preferred path. This preference alternates between nodes as each volume is created within an I/O Group to balance the workload evenly between the two nodes. The SVC implements the concept of each volume having a preferred owner node, which improves cache efficiency and cache usage. The cache component read/write algorithms are dependent on one node owning all the blocks for a specific track. The preferred node is set at the time of volume creation either manually by the user or automatically by the SVC. Because read miss performance is better when the host issues a read request to the owning node, you want the host to know which node owns a track. The SCSI command set provides a mechanism for determining a preferred path to a specific volume. Because a track is just part of a volume, the cache component distributes ownership by volume. The preferred paths are then all the paths through the owning node. Therefore, a preferred path is any port on a preferred controller, assuming that the SAN zoning is correct. Note: The performance can be better if the access is made on the preferred node. The data can still be accessed by the partner node in the I/O Group in the event of a failure.

By default, the SVC assigns ownership of even-numbered volumes to one node of a caching pair and the ownership of odd-numbered volumes to the other node. It is possible for the ownership distribution in a caching pair to become unbalanced if volume sizes are

significantly different between the nodes or if the volume numbers assigned to the caching pair are predominantly even or odd. To provide flexibility in making plans to avoid this problem, the ownership for a specific volume can be explicitly assigned to a specific node when the volume is created. A node that is explicitly assigned as an owner of a volume is known as the preferred node. Because it is expected that hosts will access volumes through the preferred nodes, those nodes can become overloaded. When a node becomes overloaded, volumes can be moved to other I/O Groups, because the ownership of a volume cannot be changed after the volume is created. We described this situation in 6.3.3, Moving a volume to another I/O Group on page 106. SDD is aware of the preferred paths that SVC sets per volume. SDD uses a load balancing and optimizing algorithm when failing over paths; that is, it tries the next known preferred path. If this effort fails and all preferred paths have been tried, it load balances on the non-preferred paths until it finds an available path. If all paths are unavailable, the volume goes offline. It can take time, therefore, to perform path failover when multiple paths go offline. SDD also performs load balancing across the preferred paths where appropriate.

6.5.1 Governing of volumes


I/O governing effectively throttles the amount of IOPS (or MBs per second) that can be achieved to and from a specific volume. You might want to use I/O governing if you have a volume that has an access pattern that adversely affects the performance of other volumes on the same set of MDisks, for example, a volume that uses most of the available bandwidth.
Of course, if this application is highly important, migrating the volume to another set of MDisks might be advisable. However, in some cases, it is an issue with the I/O profile of the application rather than a measure of its use or importance. Base the choice between I/O and MB as the I/O governing throttle on the disk access profile of the application. Database applications generally issue large amounts of I/O, but they only transfer a relatively small amount of data. In this case, setting an I/O governing throttle based on MBs per second does not achieve much throttling. It is better to use an IOPS throttle. At the other extreme, a streaming video application generally issues a small amount of I/O, but it transfers large amounts of data. In contrast to the database example, setting an I/O governing throttle based on IOPS does not achieve much throttling. For a streaming video application, it is better to use an MB per second throttle. Before running the chvdisk command, run the lsvdisk command against the volume that you want to throttle in order to check its parameters as shown in Example 6-8.
Example 6-8 The lsvdisk command output

IBM_2145:svccf8:admin>svcinfo lsvdisk TEST_1
id 2
name TEST_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id many
mdisk_grp_name many
capacity 1.00GB
type many

formatted no
mdisk_id many
mdisk_name many
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000002
throttling 0
preferred_node_id 2
fast_write_state empty
cache readwrite
...

The throttle setting of zero indicates that no throttling has been set. Having checked the volume, you can then run the chvdisk command. To modify only the throttle setting, we run:
svctask chvdisk -rate 40 -unitmb TEST_1
Running the lsvdisk command now gives us the output that is shown in Example 6-9.
Example 6-9 Output of lsvdisk command

IBM_2145:svccf8:admin>svcinfo lsvdisk TEST_1
id 2
name TEST_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id many
mdisk_grp_name many
capacity 1.00GB
type many
formatted no
mdisk_id many
mdisk_name many
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000002
virtual_disk_throttling (MB) 40
preferred_node_id 2
fast_write_state empty
cache readwrite
...

This example shows that the throttle setting (virtual_disk_throttling) is 40 MBps on this volume. If we had set the throttle to an I/O rate by using the I/O parameter, which is the default setting, we do not use the -unitmb flag:
svctask chvdisk -rate 2048 TEST_1
You can see in Example 6-10 that the throttle setting has no unit parameter, which means that it is an I/O rate setting.
Example 6-10 The chvdisk command and lsvdisk output

IBM_2145:svccf8:admin>svctask chvdisk -rate 2048 TEST_1
IBM_2145:svccf8:admin>svcinfo lsvdisk TEST_1
id 2
name TEST_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id many
mdisk_grp_name many
capacity 1.00GB
type many
formatted no
mdisk_id many
mdisk_name many
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000002
throttling 2048
preferred_node_id 2
fast_write_state empty
cache readwrite
...

Note: An I/O governing rate of 0 (displayed as virtual_disk_throttling in the CLI output of the lsvdisk command) does not mean that zero IOPS (or MBs per second) can be achieved. It means that no throttle is set.
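The choice between an IOPS throttle and an MBps throttle can be encoded as a small helper sketch. The workload labels and the example rates here are our own illustration; only the -rate and -unitmb flags shown in the examples above are assumed:

```python
def throttle_cmd(vdisk, rate, profile):
    """Build a chvdisk throttle command based on the workload profile.

    Database workloads issue many small I/Os, so an IOPS throttle is
    effective; streaming workloads move few, large I/Os, so an MBps
    throttle is the better fit.
    """
    if profile == "database":     # many small I/Os -> IOPS throttle
        return f"svctask chvdisk -rate {rate} {vdisk}"
    if profile == "streaming":    # few large I/Os -> MBps throttle
        return f"svctask chvdisk -rate {rate} -unitmb {vdisk}"
    raise ValueError(f"unknown profile: {profile}")

print(throttle_cmd("TEST_1", 2048, "database"))
print(throttle_cmd("TEST_1", 40, "streaming"))
```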

6.6 Cache mode and cache-disabled volumes


You use cache-disabled volumes primarily when you are virtualizing an existing storage infrastructure and you want to retain the existing storage system copy services. You might want to use cache-disabled volumes where there is intellectual capital in existing copy services automation scripts. We recommend that you keep the use of cache-disabled volumes to a minimum for normal workloads. You can also use cache-disabled volumes to control the allocation of cache resources. By disabling the cache for certain volumes, more cache resources are available to cache I/Os to other volumes in the same I/O Group. This technique is particularly effective where an I/O Group is serving volumes that benefit from cache and other volumes where the benefits of caching are small or nonexistent.

6.6.1 Underlying controller remote copy with SVC cache-disabled volumes


Where synchronous or asynchronous remote copy is used in the underlying storage controller, the controller LUNs at both the source and destination must be mapped through the SVC as image mode disks with the SVC cache disabled. Note that, of course, it is possible to access either the source or the target of the remote copy from a host directly,

rather than through the SVC. You can use the SVC copy services with the image mode volume representing the primary site of the controller remote copy relationship. It does not make sense to use SVC copy services with the volume at the secondary site, because the SVC does not see the data flowing to this LUN through the controller. Figure 6-2 shows the relationships between the SVC, the volume, and the underlying storage controller for a cache-disabled volume.

Figure 6-2 Cache-disabled volume in remote copy relationship

6.6.2 Using underlying controller FlashCopy with SVC cache-disabled volumes
Where Flash Copy is used in the underlying storage controller, the controller LUNs for both the source and the target must be mapped through the SVC as image mode disks with the SVC cache disabled as shown in Figure 6-3 on page 116. Note that, of course, it is possible to access either the source or the target of the Flash Copy from a host directly rather than through the SVC.

Figure 6-3 Flash copy with cache-disabled volumes

6.6.3 Changing cache mode of volumes


The cache mode of a volume can be changed concurrently (with I/O) via the svctask chvdisk command. The command does not fail I/O to the user, and it can be run on any kind of volume. Used correctly, without the -force flag, the command does not result in a corrupt volume; the cache is flushed before cache data is discarded when the user disables the cache on a volume. Example 6-11 shows an image volume, VDISK_IMAGE_1, that had its cache parameter changed after creation.
Example 6-11 Changing the cache mode of a volume

IBM_2145:svccf8:admin>svctask mkvdisk -name VDISK_IMAGE_1 -iogrp 0 -mdiskgrp IMAGE_Test -vtype image -mdisk D8K_L3331_1108
Virtual Disk, id [9], successfully created
IBM_2145:svccf8:admin>svcinfo lsvdisk VDISK_IMAGE_1
id 9
name VDISK_IMAGE_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id 5
mdisk_grp_name IMAGE_Test
capacity 20.00GB
type image
formatted no
mdisk_id 33
mdisk_name D8K_L3331_1108
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000014
throttling 0
preferred_node_id 1
fast_write_state empty
cache readwrite
udid
fc_map_count 0
sync_rate 50
copy_count 1
se_copy_count 0
...
IBM_2145:svccf8:admin>svctask chvdisk -cache none VDISK_IMAGE_1
IBM_2145:svccf8:admin>svcinfo lsvdisk VDISK_IMAGE_1
id 9
name VDISK_IMAGE_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id 5
mdisk_grp_name IMAGE_Test
capacity 20.00GB
type image
formatted no
mdisk_id 33
mdisk_name D8K_L3331_1108
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000014
throttling 0
preferred_node_id 1
fast_write_state empty
cache none
udid
fc_map_count 0
sync_rate 50
copy_count 1
se_copy_count 0
...

Note: By default, volumes are created with the cache mode enabled (readwrite), but you can specify the cache mode during volume creation by using the -cache option.

6.7 The effect of load on storage controllers


Because the SVC can share the capacity of a few MDisks among many more volumes (which are, in turn, assigned to hosts generating I/O), an SVC can generate a lot more I/O than the storage controller normally received without an SVC in the middle. Adding FlashCopy to this situation can add more I/O to a storage controller on top of the I/O that hosts are generating. It is important to take the load that you can put onto a storage controller into consideration when defining volumes for hosts, to make sure that you do not overload the storage controller. Assuming that a typical physical drive can handle 150 IOPS (a Serial Advanced Technology Attachment (SATA) drive might handle slightly fewer), you can calculate the maximum I/O capability that a Storage Pool can handle. Then, as you define the volumes and the FlashCopy mappings, calculate the maximum average I/O that the SVC can receive per volume before you start to overload your storage controller. This example assumes:
- An MDisk is defined from an entire array (that is, the array provides only one LUN, and that LUN is given to the SVC as an MDisk).
- Each MDisk that is assigned to a Storage Pool is the same size and the same RAID type and comes from a storage controller of the same type.
- MDisks from a storage controller are contained entirely in the same Storage Pool.
The raw I/O capability of the Storage Pool is the sum of the capabilities of its MDisks. For example, for five RAID 5 MDisks with eight component disks on a typical back-end device, the I/O capability is:
5 x (150 x 7) = 5250
This raw number might be constrained by the I/O processing capability of the back-end storage controller itself. FlashCopy copying contributes to the I/O load of a storage controller, and thus, it must be taken into consideration.
A FlashCopy effectively adds a number of loaded volumes to the group, so a weighting factor can be calculated to make allowance for this load. The effect of FlashCopy copies depends on the type of I/O taking place. For example, in a group with two FlashCopy copies and random reads and writes to those volumes, the weighting factor is 14 x 2 = 28. The total weighting factor for FlashCopy copies is given in Table 6-4.
Table 6-4 FlashCopy weighting

Type of I/O to the volume      Impact on I/O      Weight factor for FlashCopy
None/very little               Insignificant      0
Reads only                     Insignificant      0
Sequential reads and writes    Up to 2x I/Os      2 x F
Random reads and writes        Up to 15x I/O      14 x F
Random writes                  Up to 50x I/O      49 x F

Thus, to calculate the average I/O per volume before overloading the Storage Pool, use this formula:
I/O rate = (I/O capability) / (number of volumes + weighting factor)
So, using the example Storage Pool as defined previously, if we added 20 volumes to the Storage Pool, that Storage Pool was able to sustain 5250 IOPS, and there were two FlashCopy mappings that also have random reads and writes, the maximum I/O per volume is:
5250 / (20 + 28) = 110
Note that this is an average I/O rate, so if half of the volumes sustain 200 I/Os and the other half of the volumes sustain 10 I/Os, the average is still 110 IOPS.
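The calculation above can be captured in a short function. This is a sketch of the rule of thumb only; the 150 IOPS per drive and the weighting factors come from the text and Table 6-4, not from any SVC interface:

```python
FLASHCOPY_WEIGHT = {   # weight factor per FlashCopy mapping (Table 6-4)
    "none": 0, "reads": 0, "sequential": 2, "random_rw": 14, "random_w": 49,
}

def pool_capability(mdisks, disks_per_array, drive_iops=150):
    """Raw IOPS capability of a pool of identical RAID 5 MDisks.

    One disk's worth of capacity per array is parity, so 8 component
    disks contribute 7 spindles, matching the 5 x (150 x 7) example.
    """
    return mdisks * drive_iops * (disks_per_array - 1)

def max_avg_io_per_volume(capability, volumes, fc_mappings, io_type):
    """Average I/O per volume before the Storage Pool is overloaded."""
    weighting = FLASHCOPY_WEIGHT[io_type] * fc_mappings
    return capability / (volumes + weighting)

cap = pool_capability(5, 8)     # 5 x (150 x 7) = 5250
print(cap)
# 20 volumes, 2 FlashCopy mappings with random reads and writes:
print(max_avg_io_per_volume(cap, 20, 2, "random_rw"))   # ~109, which the text rounds to 110
```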

Conclusion
As you can see from the previous examples, Tivoli Storage Productivity Center is an extremely useful and powerful tool for analyzing and solving performance problems. If you want a single parameter to monitor to gain an overview of your system's performance, it is the read and write response times for both volumes and MDisks. This parameter shows everything that you need in one view. It is the key day-to-day performance validation metric. It is relatively easy to notice that a system that usually had 2 ms writes and 6 ms reads suddenly has 10 ms writes and 12 ms reads and is getting overloaded. A general monthly check of CPU usage shows you how the system is growing over time and highlights when it is time to add a new I/O Group (or cluster). In addition, there are useful rules for OLTP-type workloads, such as the maximum I/O rates for back-end storage arrays, but for batch workloads, it really is a case of "it depends."

6.8 Setting up FlashCopy services


Regardless of whether you use FlashCopy to make one target disk or multiple target disks, it is important that you consider the application and the operating system. Even though the SVC can make an exact image of a disk with FlashCopy at the point in time that you require, it is pointless if the operating system, or more importantly, the application, cannot use the copied disk.
Data stored to a disk from an application normally goes through these steps:
1. The application records the data using its defined application programming interface. Certain applications might first store their data in application memory before sending it to disk at a later time. Normally, subsequent reads of the block just written get the block from memory if it is still there.
2. The application sends the data to a file. The file system accepting the data might buffer it in memory for a period of time.
3. The file system sends the I/O to a disk controller after a defined period of time (or even based on an event).
4. The disk controller might cache its write in memory before sending the data to the physical drive. If the SVC is the disk controller, it stores the write in its internal cache before sending the I/O to the real disk controller.
5. The data is stored on the drive.

At any point in time, there might be any number of unwritten blocks of data in any of these steps, waiting to go to the next step. It is also important to realize that sometimes the order of the data blocks created in step 1 might not be the same order that is used when sending the blocks to steps 2, 3, or 4. So it is possible that, at any point in time, data arriving in step 4 might be missing a vital component that has not yet been sent from step 1, 2, or 3. FlashCopy copies are normally created with the data that is visible at step 4. So, to maintain application integrity, when a FlashCopy is created, any I/O that is generated in step 1 must make it to step 4 before the FlashCopy is started. In other words, there must not be any outstanding write I/Os in steps 1, 2, or 3. If there are outstanding write I/Os, the copy of the disk that is created at step 4 is likely to be missing those transactions, and if the FlashCopy is to be used, these missing I/Os can make it unusable.

6.8.1 Steps to making a FlashCopy volume with application data integrity


The steps that you must perform when creating FlashCopy copies are:
1. Your host is currently writing to a volume as part of its day-to-day usage. This volume becomes the source volume in our FlashCopy mapping.
2. Identify the size and type (image, sequential, or striped) of the volume. If the volume is an image mode volume, you need to know its size in bytes. If it is a sequential or striped mode volume, its size, as reported by the SVC GUI or SVC command line interface (CLI), is sufficient. To identify the volumes in an SVC cluster, use the svcinfo lsvdisk command, as shown in Example 6-12. If you want to put Vdisk_1 into a FlashCopy mapping, you do not need to know the byte size of that volume, because it is a striped volume. Creating a target volume of 2 GB is sufficient.
Example 6-12 Using the command line to see the type of the volumes

IBM_2145:svccf8:admin>svcinfo lsvdisk -delim :
id:name:IO_group_id:IO_group_name:status:mdisk_grp_id:mdisk_grp_name:capacity:type:FC_id:FC_name:RC_id:RC_name:vdisk_UID:fc_map_count:copy_count:fast_write_state:se_copy_count
0:NYBIXTDB02_T03:0:io_grp0:online:3:MDG4DS8KL3331:20.00GB:striped:::::60050768018205E12000000000000000:0:1:empty:0
1:NYBIXTDB02_2:0:io_grp0:online:0:MDG1DS8KL3001:5.00GB:striped:::::60050768018205E12000000000000007:0:1:empty:0
3:Vdisk_1:0:io_grp0:online:2:MDG1DS4K:2.00GB:striped:::::60050768018205E12000000000000012:0:1:empty:0
9:VDISK_IMAGE_1:0:io_grp0:online:5:IMAGE_Test:20.00GB:image:::::60050768018205E12000000000000014:0:1:empty:0
...

The VDISK_IMAGE_1 volume, which is used in our example, is an image mode volume. In this case, you need to know its exact size in bytes. In Example 6-13 on page 121, we use the -bytes parameter of the svcinfo lsvdisk command to find its exact size. Thus, the target volume must be created with a size of 21474836480 bytes, not 20 GB.
Example 6-13 Find the exact size of an image mode volume using the command line interface

IBM_2145:svccf8:admin>svcinfo lsvdisk -bytes VDISK_IMAGE_1
id 9
name VDISK_IMAGE_1
IO_group_id 0
IO_group_name io_grp0
status online
mdisk_grp_id 5
mdisk_grp_name IMAGE_Test
capacity 21474836480
type image
formatted no
mdisk_id 33
mdisk_name D8K_L3331_1108
FC_id
FC_name
RC_id
RC_name
vdisk_UID 60050768018205E12000000000000014
...

3. Create a target volume of the required size as identified by the source volume. The target volume can be an image, sequential, or striped mode volume; the only requirement is that it must be exactly the same size as the source volume. The target volume can be cache-enabled or cache-disabled.
4. Define a FlashCopy mapping, making sure that you have the source and target disks defined in the correct order. (If you use your newly created volume as a source and the existing host's volume as the target, you will destroy the data on that volume when you start the FlashCopy.)
5. As part of the define step, you can specify the copy rate from 0 to 100. The copy rate determines how quickly the SVC copies the data from the source volume to the target volume. With the copy rate set to 0 (NOCOPY), the SVC copies to the target volume only the blocks that have changed on the source volume since the mapping was started (or on the target volume, if the target volume is mounted read/write to a host).
6. The prepare process for the FlashCopy mapping can take several minutes to complete, because it forces the SVC to flush any outstanding write I/Os belonging to the source volumes to the storage controller's disks. After the preparation completes, the mapping has a Prepared status, and the target volume behaves as though it were a cache-disabled volume until the FlashCopy mapping is either started or deleted.
Note: If you create a FlashCopy mapping where the source volume is a target volume of an active Metro Mirror relationship, you add additional latency to that existing Metro Mirror relationship (and possibly affect the host that is using the source volume of that Metro Mirror relationship as a result). The reason for the additional latency is that the FlashCopy prepares and disables the cache on the source volume (which is the target volume of the Metro Mirror relationship), and thus, all write I/Os from the Metro Mirror relationship need to commit to the storage controller before the completion is returned to the host.

7. After the FlashCopy mapping is prepared, you can quiesce the host by forcing the host and the application to stop I/Os and flush any outstanding write I/Os to disk. This process is different for each application and for each operating system. One guaranteed way to quiesce the host is to stop the application and unmount the volume from the host.
8. As soon as the host completes its flushing, you can start the FlashCopy mapping. The FlashCopy starts extremely quickly (at most, a few seconds).
9. When the FlashCopy mapping has started, you can unquiesce your application (or mount the volume and start the application), at which point the cache is re-enabled for the source volumes. The FlashCopy continues to run in the background and ensures that the target volume is an exact copy of the source volume as it was when the FlashCopy mapping was started.
You can perform step 1 on page 120 through step 5 on page 121 while the host that owns the source volume performs its typical daily activities (that is, no downtime). While step 6 on page 121 is running, which can last several minutes, there might be a delay in I/O throughput, because the cache on the volume is temporarily disabled. Step 7 must be performed when application I/O is completely stopped (or suspended). However, steps 8 and 9 complete quickly, and application unavailability is minimal. The target FlashCopy volume can now be assigned to another host, and it can be used for read or write even though the FlashCopy process has not completed.
Note: If you intend to use the target volume on the same host (as the source volume is) at the same time that the source volume is visible to that host, you might need to perform additional preparation steps to enable the host to access volumes that are identical.
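As a sketch, the define/prepare/start sequence above can be expressed as a command builder. The names SRC_VOL, TGT_VOL, and FCMAP_1 are hypothetical, and we assume the mkfcmap, prestartfcmap, and startfcmap commands behave as in SVC 6.x:

```python
def flashcopy_cmds(source, target, map_name, copy_rate=0):
    """Build the SVC CLI sequence for a single FlashCopy with integrity.

    prestartfcmap flushes the SVC cache for the source volume (step 6);
    the host quiesce (step 7) happens between prepare and start.
    """
    return [
        # Define the mapping; copy rate 0 is NOCOPY
        f"svctask mkfcmap -source {source} -target {target} "
        f"-name {map_name} -copyrate {copy_rate}",
        # Prepare: flush outstanding writes for the source volume
        f"svctask prestartfcmap {map_name}",
        # ... quiesce the host application here (step 7) ...
        # Start the mapping (step 8), then unquiesce (step 9)
        f"svctask startfcmap {map_name}",
    ]

for cmd in flashcopy_cmds("SRC_VOL", "TGT_VOL", "FCMAP_1"):
    print(cmd)
```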

6.8.2 Making multiple related FlashCopy volumes with data integrity


Where a host has more than one volume, and those volumes are used by one application, FlashCopy might need to be performed across all disks at exactly the same moment in time to preserve data integrity. Here are examples of when this situation might apply:

- A Windows Exchange server has more than one drive, and each drive is used for an Exchange Information Store. For example, the Exchange server has a D drive, an E drive, and an F drive. Each drive is an SVC volume that is used to store a different information store for the Exchange server. Thus, when performing a snap copy of the Exchange environment, all three disks need to be flashed at exactly the same time, so that if they are used during a recovery, no one information store has more recent data on it than another information store.
- A UNIX relational database has several volumes to hold different parts of the relational database. For example, two volumes are used to hold two distinct tables, and a third volume holds the relational database transaction logs. Again, when a snap copy of the relational database environment is taken, all three disks need to be in sync. That way, when they are used in a recovery, the relational database is not missing any transactions that might have occurred if each volume was copied by using FlashCopy independently.

Here are the steps to ensure that data integrity is preserved when volumes are related to each other:

1. Your host is currently writing to the volumes as part of its daily activities. These volumes will become the source volumes in our FlashCopy mappings.
2. Identify the size and type (image, sequential, or striped) of each source volume. If any of the source volumes is an image mode volume, you need to know its size in bytes. If any of the source volumes are sequential or striped mode volumes, their size as reported by the SVC GUI or SVC command line is sufficient.
3. Create a target volume of the required size for each source volume identified in the previous step. The target volumes can be image, sequential, or striped mode volumes; the only requirement is that each target must be exactly the same size as its source volume. The target volumes can be cache-enabled or cache-disabled.
4. Define a FlashCopy Consistency Group. This Consistency Group will be linked to each FlashCopy mapping that you define, so that data integrity is preserved between the volumes.
5. Define a FlashCopy mapping for each source volume, making sure that you have the source disk and the target disk defined in the correct order. (If you use any of your newly created volumes as a source and an existing host volume as the target, you will destroy the data on that volume when you start the FlashCopy.) When defining each mapping, make sure that you link it to the FlashCopy Consistency Group that you defined in the previous step. As part of defining the mapping, you can specify a copy rate from 0 to 100. The copy rate determines how quickly the SVC copies the source volumes to the target volumes. If you set the copy rate to 0 (NOCOPY), the SVC copies only blocks that have changed on either the source volume or the target volume (if the target volume is mounted read/write to a host) since the Consistency Group was started.
6. Prepare the FlashCopy Consistency Group.
This preparation process can take several minutes to complete, because it forces the SVC to flush any outstanding write I/Os that belong to the volumes in the Consistency Group to the storage controller's disks. After the preparation process completes, the Consistency Group has a Prepared status, and all source volumes behave as though they were cache-disabled volumes until the Consistency Group is either started or deleted.

Note: If you create a FlashCopy mapping where the source volume is a target volume of an active Metro Mirror relationship, this mapping adds latency to that existing Metro Mirror relationship (and possibly affects the host that is using the source volume of that Metro Mirror relationship as a result). The reason for the additional latency is that the FlashCopy Consistency Group preparation process disables the cache on all source volumes (which might be target volumes of a Metro Mirror relationship), and thus, all write I/Os from the Metro Mirror relationship need to commit to the storage controller before the complete status is returned to the host.

7. After the Consistency Group is prepared, you can then quiesce the host by forcing the host and the application to stop I/Os and flush any outstanding write I/Os to disk. This process differs for each application and for each operating system. One guaranteed way to quiesce the host is to stop the application and unmount the volumes from the host.
8. As soon as the host completes its flushing, you can then start the Consistency Group. The FlashCopy start completes extremely quickly (at most, a few seconds).


9. When the Consistency Group has started, you can then unquiesce your application (or mount the volumes and start the application), at which point the cache is re-enabled. The FlashCopy continues to run in the background and preserves the data that existed on the volumes when the Consistency Group was started.

Step 1 on page 123 through step 6 on page 123 can be performed while the host that owns the source volumes is performing its typical daily duties (that is, no downtime). While step 6 on page 123 is running, which can take several minutes, there might be a delay in I/O throughput, because the cache on the volumes is temporarily disabled. You must perform step 7 on page 123 when the application I/O is completely stopped (or suspended). However, steps 8 and 9 complete quickly, so application unavailability is minimal.

The target FlashCopy volumes can now be assigned to another host and used for read or write operations even though the FlashCopy processes have not completed.

Note: If you intend to use any of the target volumes on the same host as their source volumes at the same time that the source volumes are visible to that host, you might need to perform additional preparation steps to enable the host to access volumes that are identical.
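Using the hypothetical database volumes from the earlier example, the Consistency Group flow can be sketched in the SVC CLI as follows (a sketch only; verify the syntax against your code level):

```shell
# Step 4: define the Consistency Group
svctask mkfcconsistgrp -name dbsnap

# Step 5: one mapping per volume, all linked to the group
svctask mkfcmap -source dbvol1 -target dbvol1_T -consistgrp dbsnap -copyrate 0
svctask mkfcmap -source dbvol2 -target dbvol2_T -consistgrp dbsnap -copyrate 0
svctask mkfcmap -source dblog  -target dblog_T  -consistgrp dbsnap -copyrate 0

# Step 6: prepare the group (flushes the cache for all member volumes)
svctask prestartfcconsistgrp dbsnap

# Steps 7 and 8: quiesce the application, then start all mappings at one instant
svctask startfcconsistgrp dbsnap
```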

6.8.3 Creating multiple identical copies of a volume


Since SVC 4.2, you can create multiple point-in-time copies of a source volume. These point-in-time copies can be made at different times (for example, hourly) so that an image of a volume can be captured before a previous image has completed.

If there is a requirement to have more than one volume copy created at exactly the same time, using FlashCopy Consistency Groups is the best method. By placing the FlashCopy mappings into a Consistency Group (where each mapping uses the same source volume), when the FlashCopy Consistency Group is started, each target will be an identical image of all the other FlashCopy targets of that volume.

The Volume Mirroring feature also allows you to keep one or two copies of a volume. For more details, refer to 6.2, What is volume mirroring on page 103.

6.8.4 Creating a FlashCopy mapping with the incremental flag


By creating a FlashCopy mapping with the incremental flag, only the data that has changed since the last FlashCopy was started is written to the target volume. This functionality is useful in cases where you want, for example, a full copy of a volume for disaster tolerance, application testing, or data mining. After the first background copy is completed, incremental FlashCopy greatly reduces the time required to re-establish a full copy of the source data as a new snapshot. In cases where clients maintain fully independent copies of data as part of their disaster tolerance strategy, using incremental FlashCopy can be useful as the first layer in their disaster tolerance and backup strategy.
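A sketch of an incremental mapping in the SVC CLI, using hypothetical names (the incremental flag is set when the mapping is created; verify syntax against your code level):

```shell
# Create an incremental mapping with a moderate background copy rate
svctask mkfcmap -source prodvol -target dr_copy -name dr_map \
  -copyrate 50 -incremental

# The first start performs a full background copy
svctask startfcmap -prep dr_map

# Subsequent starts copy only the grains changed since the last copy completed
svctask startfcmap -prep dr_map
```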


6.8.5 Thin-provisioned FlashCopy


Using the thin-provisioned volume feature, which was introduced in SVC 4.3, FlashCopy can be used in a more efficient way. Thin provisioning allows for the late allocation of MDisk space: thin-provisioned volumes present a virtual size to hosts, while the real Storage Pool space (the number of extents x the size of the extents) allocated to the volume might be considerably smaller.

Thin volumes as target volumes offer the opportunity to implement thin-provisioned FlashCopy. Thin volumes as both source and target can also be used to make point-in-time copies. There are two distinct combinations:

- Copy of a thin source volume to a thin target volume: The background copy copies only allocated regions, and the incremental feature can be used to refresh the mapping (after a full copy is complete).
- Copy of a fully allocated source volume to a thin target volume: For this combination, you must use a zero copy rate to avoid fully allocating the thin target volume.

Note: The defaults for grain size are different: 32 KB for a thin-provisioned volume and 256 KB for a FlashCopy mapping.

You can use thin volumes for cascaded FlashCopy and Multiple Target FlashCopy. It is also possible to mix thin with fully allocated volumes, and thin volumes can be used for incremental FlashCopy too, although using thin volumes for incremental FlashCopy only makes sense if both the source and target are thin-provisioned.

The recommendations for thin-provisioned FlashCopy are:
- The thin-provisioned volume grain size must equal the FlashCopy grain size.
- The thin-provisioned volume grain size must be 64 KB for the best performance and the best space efficiency. The exception is where the thin target volume is going to become a production volume (that is, it will be subjected to ongoing heavy I/O). In this case, the 256 KB thin-provisioned grain size is recommended to provide better long-term I/O performance at the expense of a slower initial copy.

Note: Even if the 256 KB thin-provisioned volume grain size is chosen, it is still beneficial to keep the FlashCopy grain size at 64 KB. It is then possible to minimize the performance impact on the source volume, even though this size increases the I/O workload on the target volume. Clients with extremely large numbers of FlashCopy/Remote Copy relationships might still be forced to choose a 256 KB grain size for FlashCopy due to constraints on the amount of bitmap memory.
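A sketch of the recommended combination, using hypothetical pool and volume names (options shown are from the SVC 6.x CLI; verify against your code level):

```shell
# Thin-provisioned target: 64 KB thin grain size, auto-expanding,
# initially allocating 2% of the virtual capacity
svctask mkvdisk -mdiskgrp Pool1 -iogrp 0 -size 100 -unit gb \
  -rsize 2% -autoexpand -grainsize 64 -name thinvol_T

# Mapping with copyrate 0 (required when the source is fully allocated)
# and a matching 64 KB FlashCopy grain size
svctask mkfcmap -source prodvol -target thinvol_T -copyrate 0 -grainsize 64
```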


6.8.6 Using FlashCopy with your backup application


If you are using FlashCopy together with your backup application and you do not intend to keep the target disk after the backup has completed, we recommend that you create the FlashCopy mappings using the NOCOPY option (background copy rate = 0). If you intend to keep the target so that you can use it as part of a quick recovery process, you might choose one of the following options:

- Create the FlashCopy mapping with NOCOPY initially. If the target is used and migrated into production, you can change the copy rate at the appropriate time to an appropriate value to have all the data copied to the target disk. When the copy completes, you can delete the FlashCopy mapping and delete the source volume, thus freeing the space.
- Create the FlashCopy mapping with a low copy rate. Using a low rate might enable the copy to complete without an impact to your storage controller, thus leaving bandwidth available for production work. If the target is used and migrated into production, you can change the copy rate to a higher value at the appropriate time to ensure that all data is copied to the target disk. After the copy completes, you can delete the source, thus freeing the space.
- Create the FlashCopy mapping with a high copy rate. While this copy rate might add an additional I/O burden to your storage controller, it ensures that you get a complete copy of the source disk as quickly as possible.

By placing the target on a different Storage Pool, which, in turn, uses a different array or controller, you reduce your window of risk if the storage providing the source disk becomes unavailable. With Multiple Target FlashCopy, you can also use a combination of these methods. For example, you can use the NOCOPY rate for an hourly snapshot of a volume with a daily FlashCopy using a high copy rate.
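Because the copy rate can be changed on a live mapping, you can move between these options; a sketch with a hypothetical mapping name (verify syntax against your code level):

```shell
# Start as a NOCOPY snapshot for the backup window
svctask mkfcmap -source appvol -target appvol_T -name bkup_map -copyrate 0
svctask startfcmap -prep bkup_map

# Later, if the target must become a full independent copy,
# raise the background copy rate on the existing mapping
svctask chfcmap -copyrate 80 bkup_map
```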

6.8.7 Using FlashCopy for data migration


SVC FlashCopy can help you with data migration, especially if you want to migrate from an unsupported controller (and your own testing reveals that the SVC can communicate with the device). Another reason to use SVC FlashCopy is to keep a copy of your data behind on the old controller in order to help with a back-out plan in the event that you want to stop the migration and revert to the original configuration.

In this example, you can use the following steps to help migrate to a new storage environment with minimum downtime, which enables you to leave a copy of the data in the old environment if you need to back out to the old configuration. To use FlashCopy to help with migration:

1. Your hosts are using the storage from either an unsupported controller or a supported controller that you plan on retiring.
2. Install the new storage into your SAN fabric and define your arrays and logical unit numbers (LUNs). Do not mask the LUNs to any host; you will mask them to the SVC later.
3. Install the SVC into your SAN fabric and create the required SAN zones for the SVC nodes and for the SVC to see the new storage.
4. Mask the LUNs from your new storage controller to the SVC and use svctask detectmdisk on the SVC to discover the new LUNs as MDisks.
5. Place the MDisks into the appropriate Storage Pool.


6. Zone the hosts to the SVC (while maintaining their current zones to their storage) so that you can discover and define the hosts to the SVC.
7. At an appropriate time, install the IBM SDD onto the hosts that will soon use the SVC for storage. If you have performed testing to ensure that the host can use both SDD and the original driver, you can perform this step anytime before the next step.
8. Quiesce or shut down the hosts so that they no longer use the old storage.
9. Change the masking on the LUNs on the old storage controller so that the SVC is now the only user of the LUNs. You can change this masking one LUN at a time so that you can discover the LUNs (in the next step) one at a time and not mix any LUNs up.
10. Use svctask detectmdisk to discover the LUNs as MDisks. We recommend that you also use svctask chmdisk to rename the MDisks to something more meaningful.
11. Define a volume from each LUN and note its exact size (to the number of bytes) by using the svcinfo lsvdisk command.
12. Define a FlashCopy mapping and start the FlashCopy mapping for each volume by using the steps in 6.8.1, Steps to making a FlashCopy volume with application data integrity on page 120.
13. Assign the target volumes to the hosts and then restart your hosts. Your hosts see the original data with the exception that the storage is now an IBM SVC LUN.

With these steps, you have made a copy of the existing storage, and the SVC has not been configured to write to the original storage. Thus, if you encounter any problems with these steps, you can reverse everything that you have done, assign the old storage back to the hosts, and continue without the SVC.

By using FlashCopy in this example, any incoming writes go to the new storage subsystem, and any read requests for data that has not yet been copied to the new subsystem automatically come from the old subsystem (the FlashCopy source). You can alter the FlashCopy copy rate, as appropriate, to ensure that all the data is copied to the new controller.
After the FlashCopy completes, you can delete the FlashCopy mappings and the source volumes. After all the LUNs have been migrated across to the new storage controller, you can remove the old storage controller from the SVC node zones and then, optionally, remove the old storage controller from the SAN fabric.

You can also use this process if you want to migrate to a new storage controller and not keep the SVC after the migration. At step 2 on page 126, make sure that you create LUNs that are the same size as the original LUNs. Then, at step 11, use image mode volumes. When the FlashCopy mappings complete, you can shut down the hosts, map the storage directly to them, remove the SVC, and continue on the new storage controller.
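For step 11, defining an image mode volume and confirming its exact capacity might look as follows (MDisk, pool, and volume names are hypothetical; verify syntax against your code level):

```shell
# Create an image mode volume directly on the old controller's LUN
svctask mkvdisk -mdiskgrp OldPool -iogrp 0 -vtype image \
  -mdisk mdisk12 -name legacy_vol

# Report its exact capacity in bytes (used to size the FlashCopy target)
svcinfo lsvdisk -bytes legacy_vol
```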

6.8.8 Summary of FlashCopy rules


To summarize the FlashCopy rules:
- FlashCopy services can only be provided inside an SVC cluster. If you want to FlashCopy to remote storage, the remote storage needs to be defined locally to the SVC cluster.
- To maintain data integrity, ensure that all application I/Os and host I/Os are flushed from any application and operating system buffers. You might need to stop your application in order for it to be restarted with a copy of the volume that you make. Check with your application vendor if you have any doubts.


- Be careful if you want to map the target flash-copied volume to the same host that already has the source volume mapped to it. Check that your operating system supports this configuration.
- The target volume must be the same size as the source volume; however, the target volume can be a different type (image, striped, or sequential mode) or have different cache settings (cache-enabled or cache-disabled).
- If you stop a FlashCopy mapping or a Consistency Group before it has completed, you will lose access to the target volumes. If the target volumes are mapped to hosts, they will have I/O errors.
- A volume cannot be a source in one FlashCopy mapping and a target in another FlashCopy mapping.
- A volume can be the source for up to 256 targets.
- Starting with SVC 6.2.0.0, you are allowed to create a FlashCopy mapping using a target volume that is part of a remote copy relationship. This enables the reverse feature to be used in conjunction with a disaster recovery implementation. It also enables fast failback from a consistent copy held on a FlashCopy target volume at the auxiliary cluster to the master copy.

6.8.9 IBM Tivoli Storage FlashCopy Manager


The management of many large FlashCopy relationships and Consistency Groups is a complex task without a form of automation for assistance. IBM Tivoli Storage FlashCopy Manager V2.2 provides integration between the SVC and Tivoli Storage Manager for Advanced Copy Services, providing application-aware backup and restore by leveraging the SVC FlashCopy features and functions.

For information about how IBM Tivoli Storage FlashCopy Manager interacts with the IBM System Storage SAN Volume Controller (SVC), see:
http://www.redbooks.ibm.com/redpapers/pdfs/redp4653.pdf

More details about IBM Tivoli Storage FlashCopy Manager are available here:
http://www-01.ibm.com/software/tivoli/products/storage-flashcopy-mgr/

6.8.10 IBM System Storage Support for Microsoft Volume Shadow Copy Service
The SAN Volume Controller provides support for the Microsoft Volume Shadow Copy Service and Virtual Disk Service. The Microsoft Volume Shadow Copy Service can provide a point-in-time (shadow) copy of a Windows host volume while the volume is mounted and files are in use. The Microsoft Virtual Disk Service provides a single vendor and technology-neutral interface for managing block storage virtualization, whether done by operating system software, RAID storage hardware, or other storage virtualization engines.

The following components are used to provide support for the service:
- SAN Volume Controller
- The cluster CIM server
- IBM System Storage hardware provider, known as the IBM System Storage Support for Microsoft Volume Shadow Copy Service and Virtual Disk Service software


- Microsoft Volume Shadow Copy Service
- The vSphere Web Services (when in a VMware virtual platform)

The IBM System Storage hardware provider is installed on the Windows host. To provide the point-in-time shadow copy, the components complete the following process:
1. A backup application on the Windows host initiates a snapshot backup.
2. The Volume Shadow Copy Service notifies the IBM System Storage hardware provider that a copy is needed.
3. The SAN Volume Controller prepares the volumes for a snapshot.
4. The Volume Shadow Copy Service quiesces the software applications that are writing data on the host and flushes file system buffers to prepare for the copy.
5. The SAN Volume Controller creates the shadow copy using the FlashCopy Copy Service.
6. The Volume Shadow Copy Service notifies the writing applications that I/O operations can resume, and notifies the backup application that the backup was successful.

The Volume Shadow Copy Service maintains a free pool of volumes for use as FlashCopy targets and a reserved pool of volumes. These pools are implemented as virtual host systems on the SAN Volume Controller.

For more details on how to implement and work with IBM System Storage Support for Microsoft Volume Shadow Copy Service, refer to Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933-00.


Chapter 7. Remote Copy services


In this chapter, we discuss the best practices for using the Remote Copy services Metro Mirror (MM) and Global Mirror (GM). The main focus is on intercluster GM relationships. For details on the implementation and setup of SVC, including Remote Copy and the intercluster link, refer to Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933.

This chapter is divided into the following sections:
- Remote Copy (RC) services: an introduction
- Terminology and functional concepts
- Remote Copy features by release
- Intercluster (remote) link
- Design points (essentially a recap of what we've just discussed)
- Planning
- Use cases
- GM states
- 1920 error
- Monitoring and troubleshooting


7.1 Remote Copy services: an introduction


The general application of a Remote Copy (RC) service is to maintain two identical copies of a data set. Often the two copies are separated by some distance, hence the term remote, although distance is not a prerequisite. Remote Copy services, as implemented by SVC, can be configured in the form of Metro Mirror (MM) or Global Mirror (GM). Both are based on two or more independent SVC clusters connected over a FC fabric (the exception is intracluster Metro Mirror, in which the remote copy relationships exist within a single cluster).

The clusters are configured in a Remote Copy partnership over the FC fabric; they connect (FC login) to each other and establish communications in the same way as if they were located nearby on the same fabric. The only differences are in the expected latency of that communication, the bandwidth capability of the intercluster link, and the availability of the link as compared with the local fabric.

The local and remote clusters in the Remote Copy partnership contain volumes, in a one-to-one mapping, that are configured as a Remote Copy relationship. It is this relationship that maintains the two identical copies. Each volume performs a designated role: the local volume functions as the source (as well as servicing run-time host application I/O), and the remote volume functions as the target (which shadows the source and is accessible as read-only).

SVC offers remote copy solutions based on distance (and, by implication, the mode of operation differs):

1. Metro Mirror (synchronous mode): Used over metropolitan distances, in which foreground writes (writes to the source volume) and mirrored foreground writes (shadowed writes to the target) are committed at both the local and remote cluster before being acknowledged as complete to the host application.

Note: This ensures that the target volume is fully up-to-date, but the application is fully exposed to the latency and bandwidth limitations of the intercluster link.
Where the remote cluster is truly remote, this exposure may have an adverse effect on application performance.

2. Global Mirror (asynchronous mode): This mode of operation allows for greater intercluster distance and deploys an asynchronous remote write operation. Foreground writes at the local cluster are executed in normal run-time, whereas their associated mirrored foreground writes at the remote cluster are executed asynchronously. Write operations are completed on the source volume (local cluster) and acknowledged to the host application before being completed at the target volume (remote cluster).

Regardless of which mode of remote copy service is deployed, operations between clusters are driven by the background and foreground write I/O processes:

- Background write (re)synchronization: Write I/O across the intercluster link, performed in the background, to synchronize source volumes to their mirrored target volumes on the remote cluster. Also referred to as background copy.
- Foreground I/O: Read and write I/O on the local SAN, which generates mirrored foreground write I/O across the intercluster link and the remote SAN.

When considering a remote copy solution, it is essential to consider each of these processes and the traffic that they generate on the SAN and the intercluster link. It is important to


understand how much traffic the SAN can take without disruption, and how much traffic your application and copy services processes generate. Successful implementation depends on taking a holistic approach, where we consider all components and their associated properties. This includes host application sensitivity, local and remote SAN configurations, local and remote cluster and storage configurations, and the intercluster link.

7.1.1 Common terminology and definitions


When covering such a breadth of technology areas, there can be a multitude of terminologies and definitions for the same technology component, so for the purposes of this document we use the following definitions:

- Local or master cluster: The cluster on which the foreground applications run.
- Local hosts: Hosts running the foreground applications.
- Master volume or source volume: The local volume that is being mirrored. Access is not restricted; mapped hosts are capable of reading and writing to the volume.
- Intercluster link: The remote ISL link between the local and remote clusters. It should be redundant and provide dedicated bandwidth for remote copy processes.
- Remote or auxiliary cluster: The cluster that holds the remote, mirrored copy.
- Auxiliary or target volume: The remote volume that holds the mirrored copy. Read access only.
- Remote Copy: Generic term used to describe either a Metro Mirror or Global Mirror relationship, in which data on the source volume is mirrored to an identical copy on a target volume. Often the two copies are separated by some distance, hence the term remote, although distance is not a prerequisite.

RC relationship states include:
- A consistent relationship: A remote copy relationship where the data set on the target volume represents the data set on the source volume at a certain point in time.
- A synchronized relationship: A relationship is synchronized if it is consistent and the point in time that the target volume represents is the current point in time. Put another way, the target volume contains identical data to the source volume.

- Synchronous remote copy (Metro Mirror): Writes to both the source and target volumes are committed in the foreground before sending confirmation of completion to the local host application.

Note: There is performance loss in foreground write I/O due to intercluster link latency.
- Asynchronous remote copy (Global Mirror): Foreground write I/O is acknowledged as complete to the local host application before the mirrored foreground write I/O is cached at the remote cluster. Mirrored foreground writes are processed asynchronously at the remote cluster, but in a committed sequential order (as determined and managed by the GM Remote Copy process).

Note: Performance loss in foreground write I/O is minimized by adopting an asynchronous policy for executing mirrored foreground write I/O. The effect of intercluster link latency is reduced. However, there is a small increase in processing foreground write I/O as it passes through the RC component of SVC's software stack.


The diagram in Figure 7-1 on page 134 shows some of the definitions, described above, in pictorial form.

Figure 7-1 Remote Copy components and applications

A successful implementation of an intercluster remote copy service depends on the quality and configuration of the intercluster link (ISL). The intercluster link must be able to provide dedicated bandwidth for remote copy traffic.

7.1.2 Intercluster link


The intercluster link is specified in terms of latency and bandwidth. These parameters define the capabilities of the link with respect to the traffic on it, and they must be chosen so that they support all forms of traffic: the mirrored foreground writes, the background copy writes, and the intercluster heartbeat messaging (node-to-node communication).

Link latency is the time taken by data to move across a network from one location to another, and it is measured in milliseconds. The longer the time, the greater the performance impact.

Link bandwidth is the network capacity to move data, as measured in millions of bits per second (Mbps) or billions of bits per second (Gbps).

The term bandwidth is also used in the following contexts:
- Storage bandwidth: The ability of the back-end storage to process I/O. It measures the amount of data (in bytes) that can be sent in a specified amount of time.
- GM partnership bandwidth (parameter): The rate at which background write synchronization is attempted (in MB/s).


Warning: With SVC version 5.1.0, the bandwidth parameter must be explicitly defined by the client when making a MM/GM partnership. Previously, a default value of 50 MB/s was used. The removal of the default is intended to stop users from using the default bandwidth with a link that does not have sufficient capacity.

Intercluster communication: As well as supporting mirrored foreground and background I/O, a proportion of the link is also used to carry traffic associated with the exchange of low-level messaging between the nodes of the local and remote clusters. A dedicated amount of the link bandwidth is required for:
- the exchange of heartbeat messages, and
- the initial configuration of intercluster partnerships.
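Because the default was removed, the bandwidth value must be stated explicitly when the partnership is created; a sketch with a hypothetical remote cluster name (verify syntax against your code level):

```shell
# Create the partnership, stating the background copy bandwidth (MB/s)
# that the link can genuinely sustain alongside foreground traffic
svctask mkpartnership -bandwidth 40 remote_cluster

# The value can be tuned later without deleting the partnership
svctask chpartnership -bandwidth 30 remote_cluster
```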

Summary
The intercluster link bandwidth, as shown in Figure 7-2 on page 135, must be capable of supporting the combined traffic of:
- mirrored foreground writes, as generated by foreground processes at peak times,
- background write synchronization, as defined by the GM bandwidth parameter, and
- intercluster communication (heartbeat messaging).

Figure 7-2 Traffic on the intercluster link
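As a back-of-the-envelope check, the three traffic classes can simply be summed; the figures below are illustrative assumptions only, not values from this book:

```shell
# Illustrative sizing figures (MB/s); substitute your measured values
peak_mirrored_writes=40   # peak foreground write rate to mirrored volumes
background_copy=25        # GM partnership bandwidth parameter
heartbeat=2               # allowance for intercluster messaging

required=$((peak_mirrored_writes + background_copy + heartbeat))
echo "Provision at least ${required} MB/s of dedicated link bandwidth"
```

A link sized only for the peak foreground write rate leaves no headroom for resynchronization, which is why the background copy rate and messaging allowance are added explicitly.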

7.2 SVC functions by release


In this section we discuss new functions in SVC 6.2, and then review remote copy features in SVC by release.

7.2.1 What is new in SVC 6.2


In this section we describe new function in SVC 6.2.

Chapter 7. Remote Copy services

135


Multiple cluster mirroring


Multiple cluster mirroring enables MM/GM partnerships between up to a maximum of four SVC clusters. The rules governing MM/GM relationships remain unchanged, meaning a volume can only exist as part of a single MM/GM relationship, and both Metro Mirror and Global Mirror are supported within the same overall configuration. Multiple cluster mirroring advantages:
- Clients can use a single DR site from multiple production data sites
- Assists clients implementing, or moving to, a consolidated DR strategy
Figure 7-3 shows the supported, and unsupported, configurations for multiple cluster mirroring.

Figure 7-3 Supported multiple cluster mirroring topologies

Improved support for MM/GM relationships and consistency groups


From SVC 5.1.0, the number of Metro Mirror and Global Mirror remote copy relationships that can be supported has increased from 1024 to 8192. This provides improved scalability, with respect to increased data protection, and greater flexibility, in order to take full advantage of the new multiple cluster mirroring possibilities.

Note: It is possible to create up to 256 consistency groups, and all 8192 relationships can be in a single consistency group if required.

Zoning considerations
The zoning requirements have been revised and are covered in detail in the following flash: https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003634. They are also covered in section Intercluster (Remote) link on page 147 of this chapter.

Flashcopy target volumes as Remote Copy source volumes


Prior to the 6.2.0 release of SVC, a FlashCopy target volume could not be part of a MM/GM relationship. Conceptually, a configuration of this type is advantageous as it can reduce the


time, in some disaster recovery scenarios, during which the MM/GM relationship is in an inconsistent state.

Flashcopy target volume as Remote Copy source scenario


For example, consider the following scenario, shown in Figure 7-4. A GM relationship exists between a source volume A and a target volume B. When this relationship is in a consistent-synchronized state, an incremental FlashCopy is taken, providing a point-in-time record of consistency. A FlashCopy of this nature may be made on a regular basis.

Note: An incremental FlashCopy is used in this scenario because, after the initial instance of the FlashCopy has been executed successfully, subsequent executions do not require a full background copy. The incremental parameter means that only the regions of disk space where data has changed since the FlashCopy mapping was completed are copied to the target volume, thus speeding up FlashCopy completion.

If corruption occurs on source volume A, or the relationship stops and becomes inconsistent, we may want to recover from the last incremental FlashCopy taken. Unfortunately, recovering before SVC 6.2.0 meant the destruction of the MM/GM relationship, because the Remote Copy could not be running while a FlashCopy process changed the state of the volume. If both processes ran concurrently, a given volume could be subject to simultaneous data changes.

Allow Remote Copy of Flash Copy Target Volumes


As the figure illustrates, in release 6.1 and earlier you could not Remote Copy (Global or Metro Mirror) a FlashCopy target. You could take a FlashCopy of a Remote Copy secondary, to protect consistency when resynchronizing or to record an important state of the disk, but you could not copy it back to volume B without deleting the remote copy, and recreating the Remote Copy meant copying everything back to volume A.
Figure 7-4 Remote copy of flash copy target volumes

Destruction of the MM/GM relationship means that a complete background copy would be required before the relationship was once again in a consistent-synchronized state, which would mean an extended period of time in which the host applications were unprotected.


With the release of 6.2.0, the relationship does not need to be destroyed, and a consistent-synchronized state can be achieved more quickly. This means host applications are unprotected for a reduced period of time.

Note: SVC has always supported the ability to FlashCopy away from either a MM/GM source or target volume, that is, volumes in remote copy relationships have been able to act as source volumes of a FlashCopy relationship.

Some caveats: When you prepare a FlashCopy mapping, the SVC puts the source volumes into a temporary cache-disabled state. This temporary state adds additional latency to the remote copy relationship, because I/Os that are normally committed to SVC cache must instead be destaged directly to the backend storage controller.

7.2.2 Remote copy features by release


In this section we review remote copy features by SVC code release.

Global Mirror
- Release 4.1.1: Initial release of Global Mirror (asynchronous remote copy).
- Release 4.2.0: Increased size of non-volatile bitmap space; copy-able vdisk space of 16 TB; allows 40 TB of remote copy per I/O group.
- Release 5.1.0: Introduced multiple cluster mirroring.
- Release 6.2.0: Allows a Metro or Global Mirror disk to be a FlashCopy target.

Metro Mirror
- Release 1.1.0: Initial release of remote copy.
- Release 2.1.0: Initial release as Metro Mirror.
- Release 4.1.1: Algorithms employed to maintain synchronization through error recovery were changed to leverage the same non-volatile journal as Global Mirror.
- Release 4.2.0: Increased size of non-volatile bitmap space; copy-able vdisk space of 16 TB; allows 40 TB of remote copy per I/O group.
- Release 5.1.0: Introduced multiple cluster mirroring.
- Release 6.2.0: Allows a Metro or Global Mirror disk to be a FlashCopy target.

7.3 Terminology and functional concepts


In this section, we provide an overview of the functional concepts that define how SVC implements remote copy, and the terminology used to describe and control this functionality. It builds on the definitions outlined previously, and introduces additional information on specified limits and default values. It covers:
- Remote copy partnerships and relationships
- Intracluster versus intercluster
- Asynchronous remote copy
- Write sequence and the importance of write ordering
- Colliding writes
- Link speed, latency, and bandwidth
- Remote copy volumes: copy directions and default roles

If you are looking for in-depth information on setting up remote copy partnerships and relationships, or on administering remote copy relationships, refer to the book Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933.

7.3.1 Remote copy partnerships and relationships


A remote copy partnership is made between a local and a remote cluster using the mkpartnership command. This command defines the operational characteristics of the partnership. The two most important parameters you must consider are:
- bandwidth: The rate at which background write (re)synchronization is attempted.
- gmlinktolerance: The amount of time, in seconds, that a GM partnership will tolerate poor performance of the intercluster link before adversely affecting foreground write I/O.

Note: Although mirrored foreground writes are performed asynchronously, they are inter-related, at a GM process level, with foreground write I/O. Slow responses along the intercluster link may lead to a backlog of GM process events, or an inability to secure process resources on remote nodes. This in turn delays GM's ability to process foreground writes, and hence causes slower writes at the application level.

The bandwidth and gmlinktolerance features used with GM are further refined by:
- relationship_bandwidth_limit: The maximum resynchronization rate, at the relationship level.
- gm_max_hostdelay: The maximum acceptable delay of host I/O attributable to GM.

7.3.2 Global Mirror control parameters


There are four parameters that control the Global Mirror processes:
1. bandwidth
2. relationship_bandwidth_limit
3. gmlinktolerance
4. gm_max_hostdelay
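The first two parameters interact: background write resynchronization for a given relationship proceeds at the lower of the partnership-wide bandwidth setting and the per-relationship limit, as described in the text that follows. A minimal Python sketch of that rule (illustrative only, not the SVC implementation):

```python
def effective_resync_rate(partnership_bandwidth_mbps, relationship_bandwidth_limit_mbps=25):
    """Background copy for one relationship runs at the lower of the GM
    partnership bandwidth parameter and the cluster-wide per-relationship
    limit (default 25 MB/s). Illustrative sketch only."""
    return min(partnership_bandwidth_mbps, relationship_bandwidth_limit_mbps)

# A 100 MB/s partnership bandwidth with the default 25 MB/s relationship
# limit resynchronizes each relationship at no more than 25 MB/s.
rate = effective_resync_rate(100)
```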

The GM partnership bandwidth parameter specifies the rate, in megabytes per second (MBps), at which the background write resynchronization processes are attempted, that is, the total bandwidth they consume. With SVC release 5.1.0, the granularity of control at the volume relationship level for background write resynchronization can be additionally modified using the relationship_bandwidth_limit parameter. Unlike its co-parameter, it does have a default value, which is 25 MB/s. The parameter defines, at a cluster-wide level, the maximum rate at which individual source-to-target volume background write resynchronization is attempted. Background write resynchronization is attempted at the lower of these two parameters.

Note: The term background write (re)synchronization, when used in conjunction with SVC, may also be referred to as GM background copy, within this and other IBM publications.

Asynchronous GM does add some additional overhead to foreground write I/O, as it requires a dedicated portion of the interlink bandwidth to function. Controlling this overhead is

critical with respect to foreground write I/O performance, and is achieved through the use of the gmlinktolerance parameter. This parameter defines the amount of time that GM processes can run on a poorly performing link without adversely affecting foreground write I/O. By setting the gmlinktolerance time limit parameter, you define a safety valve that suspends GM processes so that foreground application write activity continues at acceptable performance levels. When creating a GM partnership, a default limit of 300 seconds is used, but this is adjustable. The parameter can also be set to 0, which effectively turns off the safety valve, meaning a poorly performing link could adversely affect foreground write I/O.

The gmlinktolerance parameter does not define what constitutes a poorly performing link, nor does it explicitly define the latency that is acceptable for host applications.

With release 5.1.0, using the gm_max_hostdelay parameter, you define what constitutes a poorly performing link. With gm_max_hostdelay, you specify the maximum allowable increase in foreground write I/O processing time, in milliseconds, that is attributable to the effect of running GM processes. If this threshold limit is exceeded, the link is considered to be performing poorly, the gmlinktolerance parameter comes into play, and the Global Mirror link tolerance timer starts counting down. The threshold value defines the maximum allowable additional impact that Global Mirror operations can add to the response times of foreground writes on Global Mirror source volumes. The parameter may be used to increase the threshold limit from its default value of 5 milliseconds.
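The combined behavior of gm_max_hostdelay and gmlinktolerance can be pictured as a countdown timer. The following Python sketch models that safety valve; the sampling period, function name, and timer-reset behavior are illustrative assumptions, not the actual SVC implementation:

```python
def gm_safety_valve(hostdelay_samples_ms, gm_max_hostdelay_ms=5,
                    gmlinktolerance_s=300, sample_period_s=10):
    """Illustrative model: each sample is the extra foreground-write delay
    attributable to GM. While samples exceed gm_max_hostdelay, the link
    tolerance timer counts down; if it reaches zero, GM is suspended to
    protect foreground I/O. A healthy sample resets the timer (assumption).
    Setting gmlinktolerance to 0 disables the safety valve."""
    if gmlinktolerance_s == 0:
        return "running"
    remaining_s = gmlinktolerance_s
    for delay_ms in hostdelay_samples_ms:
        if delay_ms > gm_max_hostdelay_ms:
            remaining_s -= sample_period_s        # link performing poorly
            if remaining_s <= 0:
                return "suspended"                # safety valve trips
        else:
            remaining_s = gmlinktolerance_s       # link healthy again
    return "running"
```

With the defaults, 300 seconds of continuously poor samples (30 samples at a 10-second period in this model) trip the safety valve.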

7.3.3 Global Mirror partnerships and relationships


A Global Mirror Partnership is a partnership established between a Master (Local) Cluster and an Auxiliary (Remote) Cluster (see Figure 7-5 on page 140).

Figure 7-5 Global Mirror partnership example


mkpartnership command
The mkpartnership command establishes a one-way Metro Mirror or Global Mirror partnership between the local cluster and a remote cluster. When making a partnership, the client must set a remote copy bandwidth rate (in MBps), which specifies the proportion of the total intercluster link bandwidth to be used for MM/GM background copy operations.

Note: To establish a fully functional Metro Mirror or Global Mirror partnership, you must issue this command from both clusters.

mkrcrelationship command
Once the partnership is established, a Global Mirror relationship can be created between volumes of equal size on the Master (local) and Auxiliary (remote) clusters. The volumes on the local cluster are Master Volumes, and have an initial role as the source volumes. The volumes on the remote cluster are defined as Auxiliary Volumes, and have the initial role as the target volumes.

Notes: After the initial synchronization is complete, the copy direction can be changed, and the roles of the Master and Auxiliary volumes can swap, that is, source becomes target. As with FlashCopy, volumes can be maintained as consistency groups.

Once background (re)synchronization is complete, a Global Mirror relationship provides and maintains a consistent mirrored copy of a source volume to a target volume, but without requiring the hosts connected to the local cluster to wait for the full round-trip delay of the long distance inter-cluster link. That is, it provides the same function as Metro Mirror remote copy, but over longer distances using links with higher latency.

Note: Global Mirror is an asynchronous remote copy service. Writes to the target volume are made asynchronously, meaning that for host writes to the source volume, the host receives confirmation that the write is complete prior to the I/O completing on the target volume.

Intracluster versus Intercluster


Intracluster: Although Global Mirror is available for intracluster use, it has no functional value for production use. Intracluster Metro Mirror provides the same capability with less overhead. However, leaving this functionality in place simplifies testing and allows for experimentation (for example, to validate server failover on a single test cluster).

Intercluster: Intercluster Global Mirror operations require a minimum of a pair of SVC clusters connected by a number of intercluster links.

Limit: When a local and a remote fabric are connected together for Global Mirror purposes, the ISL hop count between a local node and a remote node must not exceed seven hops.


7.3.4 Asynchronous remote copy


Global Mirror is an asynchronous remote copy technique. In asynchronous remote copy, write operations are completed on the primary site and the write acknowledgement is sent to the host before the write is received at the secondary site. An update of this write operation is sent to the secondary site at a later stage, which provides the capability to perform remote copy over distances exceeding the limitations of synchronous remote copy.

7.3.5 Understanding Remote Copy write operations


In this section we review the Remote Copy write operations concept.

Normal I/O writes


Schematically, we can consider SVC as a number of software components arranged in a software stack. I/Os pass through each component of the stack. The first three components define how SVC processes I/O with respect to:
- SCSI target: How the SVC volume is presented to the host.
- Remote Copy (RC): How remote copy processes affect I/O (includes both GM and MM functions).
- Cache: How I/O is cached.
Host I/O to and from volumes that are not in MM/GM relationships passes transparently through the RC component layer of the software stack, as shown in Figure 7-6.

As Figure 7-6 shows, the incoming (1) write passes transparently through the Remote Copy component of the software stack and into cache, where the write is (2) acknowledged.
Figure 7-6 Write IO to volumes not in RC relationships

7.3.6 Asynchronous remote copy


Although Global Mirror is an asynchronous remote copy technique, foreground writes at the local cluster and mirrored foreground writes at the remote cluster are not wholly independent of one another. SVC's implementation of asynchronous remote copy uses algorithms to maintain a consistent image at the target volume at all times. They achieve this by identifying sets of I/Os that are active concurrently at the source, assigning an order to those sets, and applying those sets of I/Os in the assigned order at the target. The multiple I/Os within a single set are applied concurrently.


The process that marshals the sequential sets of I/Os operates at the remote cluster, and so is not subject to the latency of the long distance link.

Definition: A consistent image is defined as point-in-time (PIT) consistency.

Figure 7-7 shows that a write operation to the master volume is acknowledged back to the host issuing the write before the write operation is mirrored to the cache for the auxiliary volume.

The numbered steps in Figure 7-7 are: (1) the foreground write from the host is processed by the Remote Copy component and cached; (2) the foreground write is acknowledged as complete by the SVC to the host application; some time later, (3) a mirrored foreground write is sent to the auxiliary volume, and the mirrored foreground write is then acknowledged.
Figure 7-7 Global Mirror relationship write operation

With Global Mirror, write completion is confirmed to the host server before the write completes at the Auxiliary Volume. When a write is sent to a Master Volume, it is assigned a sequence number. Mirrored writes sent to the Auxiliary Volume are committed in sequence number order. If a write is issued while another write is outstanding, it might be given the same sequence number. This functionality operates to maintain a consistent image at the Auxiliary Volume at all times. It identifies sets of I/Os that are active concurrently at the primary volume, assigns an order to those sets, and applies these sets of I/Os in the assigned order at the Auxiliary Volume. If a further write is received from a host while the secondary write is still active for the same block, even though the primary write might have completed, the new host write on the Auxiliary Volume will be delayed until the previous write has completed.
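The sequence-number mechanism described above can be sketched as follows; the data structures and function name are illustrative, not SVC internals:

```python
def commit_order(mirrored_writes):
    """Illustrative sketch: each mirrored write carries the sequence number
    assigned at the master. Writes that share a sequence number were active
    concurrently and may be applied in parallel; the sets themselves are
    committed at the auxiliary in ascending sequence-number order."""
    sets = {}
    for seq, lba in mirrored_writes:
        sets.setdefault(seq, set()).add(lba)
    # Commit one consistent set at a time, in sequence order.
    return [(seq, sorted(sets[seq])) for seq in sorted(sets)]

# Writes arriving out of order at the auxiliary still commit set by set:
arrivals = [(2, 0x10), (1, 0x08), (2, 0x18), (1, 0x00)]
order = commit_order(arrivals)  # [(1, [0, 8]), (2, [16, 24])]
```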

7.3.7 Global Mirror write sequence


The Global Mirror algorithms maintain a consistent image on the auxiliary at all times. They achieve this consistent image by identifying sets of I/Os that are active concurrently at the master, assigning an order to those sets, and applying those sets of I/Os in the assigned order at the secondary. As a result, Global Mirror maintains the features of Write Ordering and Read Stability that are described in this chapter. The multiple I/Os within a single set are applied concurrently. The process that marshals the sequential sets of I/Os operates at the secondary cluster, and is therefore not subject to the


latency of the long distance link. These two elements of the protocol ensure that the throughput of the total cluster can be grown by increasing cluster size, while maintaining consistency across a growing data set.

In a failover scenario, where the secondary site needs to become the master source of data, certain updates might be missing at the secondary site. Therefore, any applications that use this data must have an external mechanism for recovering the missing updates and reapplying them, such as a transaction log replay.

7.3.8 Importance of write ordering


Many applications that use block storage have a requirement to survive failures, such as loss of power or a software crash, and to not lose data that existed prior to the failure. Because many applications must perform large numbers of update operations in parallel to that block storage, maintaining write ordering is key to ensuring the correct operation of applications following a disruption.

An application that performs a high volume of database updates is usually designed with the concept of dependent writes. With dependent writes, it is important to ensure that an earlier write has completed before a later write is started. Reversing the order of dependent writes can undermine the application's algorithms and can lead to problems, such as detected or undetected data corruption.

7.3.9 Colliding writes


Colliding writes are defined as new write I/Os that overlap existing active write I/Os.

Prior to SVC 4.3.1, the Global Mirror algorithm required that only a single write is active on any given 512-byte LBA of a volume. If a further write is received from a host while the auxiliary write is still active, even though the master write might have completed, the new host write will be delayed until the auxiliary write is complete. This restriction is needed in case a series of writes to the auxiliary have to be retried (called reconstruction). Conceptually, the data for reconstruction comes from the master volume. If multiple writes are allowed to be applied to the master for a given sector, only the most recent write will have the correct data during reconstruction, and if reconstruction is interrupted for any reason, the intermediate state of the auxiliary is inconsistent. Applications that deliver such write activity will not achieve the performance that Global Mirror is intended to support. A volume statistic is maintained about the frequency of these collisions.

From V4.3.1 onward, an attempt is made to allow multiple writes to a single location to be outstanding in the Global Mirror algorithm. There is still a need for master writes to be serialized, and the intermediate states of the master data must be kept in a non-volatile journal while the writes are outstanding, to maintain the correct write ordering during reconstruction. Reconstruction must never overwrite data on the auxiliary with an earlier version. The volume statistic monitoring colliding writes is now limited to those writes that are not affected by this change.

The example in Figure 7-8 shows a colliding write sequence.


Figure 7-8 Colliding writes example

These numbers correspond to the numbers in Figure 7-8:
(1) The first write is performed from the host to LBA X.
(2) The host is provided acknowledgment that the write is complete, even though the mirrored write to the auxiliary volume has not yet completed; the mirrored write proceeds asynchronously with respect to the first write.
(3) A second write is performed from the host, also to LBA X; if this write occurs prior to (2), the write is written to the journal file.
(4) The host is provided acknowledgment that the second write is complete.
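A minimal sketch of what makes a write "colliding": a new write that overlaps the LBA range of a write still active on the auxiliary. The function name and the (start, length) range representation in 512-byte blocks are illustrative assumptions:

```python
def collides(new_write, active_writes):
    """Return True if new_write overlaps any still-active write.
    Writes are (start_lba, length) tuples in 512-byte blocks.
    Illustrative sketch of the overlap test, not SVC internals."""
    new_start, new_len = new_write
    for start, length in active_writes:
        # Standard interval-overlap check on [start, start+length)
        if new_start < start + length and start < new_start + new_len:
            return True
    return False

# An active write covers LBAs 100-107; a new write to LBA 104 collides and,
# pre-4.3.1, would be delayed until the auxiliary write completes.
```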

7.3.10 Link speed, latency, and bandwidth


In this section we review the link speed, latency, and bandwidth concepts.

Link speed
The speed of a communication link determines how much data can be transported and how long the transmission takes. The faster the link the more data can be transferred within a given amount of time.

Latency
Latency is the time taken by data to move across a network from one location to another, and is measured in milliseconds. The longer the time, the greater the performance impact. Latency depends on the speed of light (c = 3 x 10^8 m/s, which corresponds to about 3.3 microsec/km in a vacuum, where microsec represents microseconds, one millionth of a second). The bits of data travel at about two-thirds the speed of light in an optical fiber cable.

However, some latency is added when packets are processed by switches and routers and then forwarded to their destination. While the speed of light may seem infinitely fast, over continental and global distances latency becomes a noticeable factor. There is a direct relationship between distance and latency: speed of light propagation dictates about one millisecond of latency for every 100 miles. For some synchronous remote copy solutions, even a few milliseconds of additional delay may be unacceptable. Latency is a more difficult challenge than bandwidth, because unlike bandwidth, spending more money for higher speeds does not reduce it.


Tip: A SCSI write over Fibre Channel requires two round trips per I/O operation. With a propagation delay of about 5 microsec/km in optical fiber, we have 2 (round trips) x 2 (legs per round trip) x 5 microsec/km = 20 microsec/km. At 50 km, we have an additional latency of 20 microsec/km x 50 km = 1000 microsec = 1 msec (msec represents millisecond); each SCSI I/O has one msec of additional service time. At 100 km, it becomes two msec of additional service time.
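The arithmetic in the tip can be captured in a few lines of Python:

```python
FIBER_DELAY_US_PER_KM = 5        # one-way propagation delay in optical fiber
ROUND_TRIPS_PER_SCSI_WRITE = 2   # per the tip above

def added_service_time_ms(distance_km):
    """Extra SCSI write service time from distance alone:
    2 round trips x 2 legs per round trip x 5 microsec/km."""
    one_way_trips = ROUND_TRIPS_PER_SCSI_WRITE * 2
    return one_way_trips * FIBER_DELAY_US_PER_KM * distance_km / 1000.0

# added_service_time_ms(50)  -> 1.0 ms
# added_service_time_ms(100) -> 2.0 ms
```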

Bandwidth
Bandwidth, with respect to Fibre Channel networks, is the network capacity to move data, measured in millions of bits per second (Mbps) or billions of bits per second (Gbps). In storage terms, bandwidth measures the amount of data that can be sent in a specified amount of time. Storage applications issue read and write requests to storage devices, and these requests are satisfied at a certain speed, commonly called the data rate. Disk and tape device data rates are usually measured in bytes per unit of time, not in bits. Most modern storage device LUNs or volumes can manage sequential sustained data rates in the order of 10 MBps to 80-90 MBps; some manage higher rates.

For example, an application writes to disk at 80 MBps. Assuming a conversion ratio of 1 MB to 10 Mbits (which is reasonable because it accounts for protocol overhead), we have a data rate of 800 Mbps. It is always useful to check and make sure that you correctly correlate MBps and Mbps.

Warning: When setting up a GM partnership using mkpartnership, the -bandwidth parameter does not refer to the general bandwidth characteristic of the links between a local and remote cluster; instead, it refers to the background copy (or write resynchronization) rate, as determined by the client, that the intercluster link can sustain.
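The MBps-to-Mbps conversion used in the example above can be expressed directly; the 10-bits-per-byte factor follows the text's allowance for protocol overhead:

```python
def mbytes_to_mbits(rate_mbps_bytes, bits_per_byte_on_wire=10):
    """Convert a storage data rate (MBps) to a network rate (Mbps),
    using 10 bits per byte to account for protocol overhead."""
    return rate_mbps_bytes * bits_per_byte_on_wire

# An application writing at 80 MBps needs roughly 800 Mbps of link capacity.
needed_mbps = mbytes_to_mbits(80)
```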

7.3.11 Choosing a link capable of supporting GM applications


Intercluster link bandwidth is the networking link bandwidth, usually measured and defined in megabits per second. For GM relationships, the link bandwidth should be sufficient to support all intercluster traffic; this includes:
- Background write resynchronization (or background copy)
- Intercluster node-to-node communication (heartbeat control messages)
- Mirrored foreground I/O (associated with local host I/O)

Rule: Set the GM partnership bandwidth to a value that is less than the sustainable bandwidth of the link between the clusters.

Note: If the GM partnership bandwidth parameter is set to a higher value than the link can sustain, the initial background copy process will consume all available link bandwidth.

Both intercluster links, as used in a redundant scenario, should be capable of providing the bandwidth required. Starting with the SVC 5.1.0 release, there is a mandatory requirement to set a bandwidth parameter when creating a remote copy partnership.
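These rules amount to a simple sizing check: the link must cover all three traffic classes, with the GM partnership bandwidth set below what the link can sustain. A hedged sketch, where heartbeat_mbps is an assumed allowance rather than a published SVC figure, and all values are in MB/s for simplicity:

```python
def link_is_sufficient(link_capacity_mbps, peak_foreground_mbps,
                       gm_bandwidth_param_mbps, heartbeat_mbps):
    """Sizing sketch: the intercluster link must carry mirrored foreground
    writes at peak, the configured background copy rate, and node-to-node
    heartbeat traffic. Illustrative only; heartbeat_mbps is an assumption."""
    required = peak_foreground_mbps + gm_bandwidth_param_mbps + heartbeat_mbps
    return link_capacity_mbps >= required

# A 100 MB/s link carrying 60 MB/s of peak foreground writes, a 25 MB/s
# GM partnership bandwidth, and a 5 MB/s heartbeat allowance is sufficient.
ok = link_is_sufficient(100, 60, 25, 5)
```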


These rules will be considered in greater detail in section Global Mirror parameters on page 154.

7.3.12 Remote Copy Volumes: Copy directions and default roles


When creating a Global Mirror relationship, the source, or master, volume is initially assigned the role of the master, and the target, or auxiliary, volume is initially assigned the role of the auxiliary. This design implies that the initial copy direction of mirrored foreground writes, and background resynchronization writes if applicable, is from master to auxiliary. After the initial synchronization is complete, the copy direction can be changed (see Figure 7-9). The ability to change roles is used to facilitate disaster recovery.

Figure 7-9 Role and direction changes

Warning: When the direction of the relationship is changed, the roles of the volumes are altered. A consequence of this is that the read/write properties are also changed. This means the master volume takes on a secondary role and becomes read-only.

7.4 Intercluster (Remote) link


Global Mirror partnerships and relationships will not work reliably if the SAN fabric on which they are running is incorrectly configured. In this section, we focus on the intercluster link, an integral part of a SAN encompassing local and remote clusters, and the critical part it plays in the overall quality of the SAN configuration.

7.4.1 SAN configuration overview


Redundancy: The intercluster link should adopt the same policy towards redundancy as recommended for the local and remote clusters that it is connecting. There should be redundancy of the ISLs and the individual ISLs should be able to provide the necessary bandwidth in isolation.


Basic topology and problems: Due to the nature of Fibre Channel, it is extremely important to avoid inter-switch link (ISL) congestion, whether within individual SANs or across the intercluster link. While Fibre Channel (and the SVC) can, under most circumstances, handle a host or storage array that has become overloaded, the mechanisms in Fibre Channel for dealing with congestion in the fabric itself are not effective. The problems caused by fabric congestion can range anywhere from dramatically slow response time all the way to storage access loss. These issues are common to all high-bandwidth SAN devices and are inherent to Fibre Channel; they are not unique to the SVC.

When a Fibre Channel network becomes congested, the Fibre Channel switches stop accepting additional frames until the congestion clears, and they may also drop frames. Congestion can quickly move upstream in the fabric and clog the end devices (such as the SVC) from communicating anywhere. This behavior is referred to as head-of-line blocking, and while modern SAN switches internally have a non-blocking architecture, head-of-line blocking still exists as a SAN fabric problem. Head-of-line blocking can result in your SVC nodes being unable to communicate with your storage subsystems, or to mirror their write caches, just because you have a single congested link leading to an edge switch.

7.4.2 Switches and ISL oversubscription


The IBM System Storage SAN Volume Controller - Software Installation and Configuration Guide, SC23-6628, specifies a suggested maximum host port to ISL ratio of 7:1. With modern 4 or 8 Gbps SAN switches, this ratio implies an average bandwidth (in one direction) per host port of approximately 57 MBps (at 4 Gbps). You must take peak loads into consideration, not average loads. For instance, while a database server might only use 20 MBps during regular production workloads, it might perform a backup at far higher data rates.

Congestion to one switch in a large fabric can cause performance issues throughout the entire fabric, including traffic between SVC nodes and storage subsystems, even if they are not directly attached to the congested switch. The reasons for these issues are inherent to Fibre Channel flow control mechanisms, which are simply not designed to handle fabric congestion. Therefore, any estimates for required bandwidth prior to implementation must have a safety factor built into the estimate. On top of the safety factor for traffic expansion, implement a spare ISL or ISL trunk, providing a fail-safe that avoids congestion if an ISL fails due to issues such as a SAN switch line card or port blade failure.

Exceeding the standard 7:1 oversubscription ratio requires you to implement fabric bandwidth threshold alerts. Any time that one of your ISLs exceeds 70% utilization, you need to schedule fabric changes to distribute the load further. You also need to consider the bandwidth consequences of a complete fabric outage. While a complete fabric outage is a fairly rare event, insufficient bandwidth can turn a single-SAN outage into a total access loss event. Take the bandwidth of the links into account; it is common for ISLs to run faster than host ports, which obviously reduces the number of required ISLs.
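The approximately 57 MBps figure follows directly from the 7:1 ratio and the 4 Gbps link speed, again using 10 bits per byte to allow for protocol overhead:

```python
def per_host_port_mbps(isl_speed_gbps, hosts_per_isl=7, bits_per_byte_on_wire=10):
    """Average one-way bandwidth available per host port at the suggested
    7:1 host-port-to-ISL ratio. Uses 10 bits per byte for protocol overhead."""
    isl_mbps = isl_speed_gbps * 1000 / bits_per_byte_on_wire  # Gbps -> MBps
    return isl_mbps / hosts_per_isl

# per_host_port_mbps(4) -> ~57 MBps, matching the figure quoted above.
```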

7.4.3 Zoning
The zoning requirements have been revised and are covered in detail by the following flash: https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003634


Multi-Cluster Mirroring is supported from release 5.1.0, and by its nature it increases the potential to zone multiple clusters (nodes) together in a usable (future-proof) configuration. However, this is not the recommended configuration.

Abstract
SVC nodes in Metro or Global Mirror inter-cluster partnerships may experience lease expiry reboot events if an inter-cluster link to a partner system becomes overloaded. These reboot events may occur on all nodes simultaneously, leading to a temporary loss of host access to Volumes.

Content
If an inter-cluster link becomes severely and abruptly overloaded, it is possible for the local fibre channel fabric to become congested to the extent that no fibre channel ports on the local SVC nodes are able to perform local intra-cluster heartbeat communication. This may result in the nodes experiencing lease expiry events, in which a node will reboot in order to attempt to re-establish communication with the other nodes in the system. If all nodes lease expire simultaneously, this may lead to a loss of host access to Volumes for the duration of the reboot events.

Workaround
The default zoning recommendation for inter-cluster Metro and Global Mirror partnerships has now been revised to ensure that, if link-induced congestion occurs, only two of the four fibre channel ports on each node can be subjected to it. The remaining two ports on each node remain unaffected, and are therefore able to continue performing intra-cluster heartbeat communication without interruption. The revised zoning recommendation is as follows: for each node in a clustered system, exactly two fibre channel ports should be zoned to exactly two fibre channel ports from each node in the partner system. This implies that for each system, there will be two ports on each SVC node that have no remote zones, only local zones.

If dual-redundant ISLs are available, the two ports from each node should be split evenly between the two ISLs; that is, one port from each node should be zoned across each ISL. Local system zoning should continue to follow the standard requirement that all ports on all nodes in a clustered system are zoned to one another.

7.4.4 Distance extensions for the Intercluster Link


To implement remote mirroring over a distance, you have several choices:
- Optical multiplexors, such as DWDM or CWDM devices
- Long-distance small form-factor pluggable transceivers (SFPs) and XFPs
- Fibre Channel IP conversion boxes

Of those options, the optical varieties of distance extension are the gold standard. IP distance extension introduces additional complexity, is less reliable, and has performance limitations. However, we do recognize that optical distance extension is impractical in many cases due to cost or unavailability.

Chapter 7. Remote Copy services


Note: Distance extension must only be utilized for links between SVC clusters. It must not be used for intra-cluster links. Technically, distance extension for intra-cluster links is supported for relatively short distances, such as a few kilometers (or miles), but refer to the IBM System Storage SAN Volume Controller Restrictions, S1003903, for details explaining why this arrangement is not recommended.

7.4.5 Optical multiplexors


Optical multiplexors can extend your SAN up to hundreds of kilometers (or miles) at extremely high speeds, and for this reason, they are the preferred method for long distance expansion. If you use multiplexor-based distance extension, closely monitor your physical link error counts in your switches. Optical communication devices are high-precision units. When they shift out of calibration, you start to see errors in your frames.

7.4.6 Long-distance SFPs/XFPs


Long-distance optical transceivers have the advantage of extreme simplicity. No expensive equipment is required, and there are only a few configuration steps to perform. However, ensure that you only use transceivers designed for your particular SAN switch.

7.4.7 Fibre Channel: IP conversion


Fibre Channel IP conversion is by far the most common and least expensive form of distance extension. It is also complicated to configure, and relatively subtle errors can have severe performance implications. With Internet Protocol (IP)-based distance extension, it is imperative that you dedicate bandwidth to your Fibre Channel (FC) over IP traffic if the link is shared with other IP traffic. Do not assume that because the link between two sites has low traffic, or is only used for e-mail, that this will always be the case. Fibre Channel is far more sensitive to congestion than most IP applications, and you do not want a spyware problem or a spam attack on an IP network to disrupt your SVC.

Also, when communicating with your organization's networking architects, make sure to distinguish between megabytes per second and megabits per second. In the storage world, bandwidth is usually specified in megabytes per second (MBps, MB/s, or MB/sec), while network engineers specify bandwidth in megabits per second (Mbps, Mbit/s, or Mb/sec). If you fail to specify megabytes, you can end up with an impressive-sounding 155 Mb/sec OC-3 link, which supplies only a tiny 15 MBps or so to your SVC. With the suggested safety margins included, this is not a fast link at all.
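The megabit/megabyte confusion above can be made concrete with a small sketch (the 80% efficiency factor is an assumed allowance for protocol overhead; real FCIP overhead varies by configuration):

```python
# Convert a link speed quoted in megabits per second into the usable
# megabytes per second a storage device might see. The efficiency
# factor is an assumption to account for protocol overhead.
def usable_mbytes_per_sec(megabits_per_sec, efficiency=0.8):
    raw_mbytes = megabits_per_sec / 8.0  # 8 bits per byte
    return raw_mbytes * efficiency

# The OC-3 example from the text: 155 Mb/sec is only ~15 MBps of payload.
print(round(usable_mbytes_per_sec(155), 1))  # 15.5
```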

7.4.8 Configuration of intercluster (long distance) links


IBM has tested a number of Fibre Channel extender and SAN router technologies for use with the SVC. The list of supported SAN routers and Fibre Channel extenders is available at this Web site: http://www.ibm.com/storage/support/2145


Link latency considerations


If you use one of these extenders or routers, test the link to ensure that the following requirements are met before you place SVC traffic onto it:
- For SVC 4.1.0.x, the round-trip latency between sites must not exceed 68 ms (34 ms one-way) for Fibre Channel (FC) extenders or 20 ms (10 ms one-way) for SAN routers.
- For SVC 4.1.1.x and later, the round-trip latency between sites must not exceed 80 ms (40 ms one-way).

The latency of long distance links depends on the technology that is used. Typically, it is assumed that each 100 km (62.1 miles) of distance adds 1 ms to the latency, which for Global Mirror means that the remote cluster can be up to 4,000 km (2,485 miles) away. When testing your link for latency, take into consideration both current and future expected workloads, including any times when the workload might be unusually high. Evaluate the peak workload by considering the average write workload over a period of one minute or less plus the required synchronization copy bandwidth.
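These latency figures can be tied together with a back-of-envelope sketch. It assumes roughly 5 microseconds of one-way propagation delay per km of fiber and, per the SCSI-write tip in the Link quality section, two round trips per write I/O; both figures are rules of thumb, not measurements:

```python
US_PER_KM_ONE_WAY = 5  # approximate light propagation in fiber, microsec/km

def write_latency_ms(distance_km, round_trips=2):
    """Extra service time a SCSI write incurs from link distance alone."""
    one_way_us = distance_km * US_PER_KM_ONE_WAY
    # Each round trip traverses the link twice (out and back).
    return round_trips * 2 * one_way_us / 1000.0

# 4,000 km consumes the entire 80 ms budget for SVC 4.1.1.x and later.
print(write_latency_ms(4000))  # 80.0
# 50 km adds about 1 ms per write.
print(write_latency_ms(50))    # 1.0
```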

Link bandwidth consumed by inter-node communication


SVC uses part of the bandwidth for its internal inter-cluster heartbeat. The amount of traffic depends on how many nodes are in each of the local and remote clusters. Table 7-1 shows the amount of traffic, in megabits per second, generated by different sizes of clusters. These numbers represent the total traffic between the two clusters when no I/O is taking place to mirrored volumes on the remote cluster. Half of the data is sent by one cluster, and half by the other. The traffic is divided evenly over all available intercluster links; therefore, if you have two redundant links, half of this traffic is sent over each link during fault-free operation.
Table 7-1   SVC inter-cluster heartbeat traffic (megabits per second)

  Local/remote cluster   Two nodes   Four nodes   Six nodes   Eight nodes
  Two nodes              2.6         4.0          5.4         6.7
  Four nodes             4.0         5.5          7.1         8.6
  Six nodes              5.4         7.1          8.8         10.5
  Eight nodes            6.7         8.6          10.5        12.4

If the link between the sites is configured with redundancy, so that it can tolerate single failures, the link must be sized so that the bandwidth and latency statements continue to be accurate even during single failure conditions.
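For sizing scripts, Table 7-1 can be captured as a lookup (a sketch; the dictionary and function names are ours, with the values transcribed from the table):

```python
# Total inter-cluster heartbeat traffic in megabits per second,
# keyed by (smaller cluster size, larger cluster size) in nodes.
HEARTBEAT_MBIT_PER_S = {
    (2, 2): 2.6, (2, 4): 4.0, (2, 6): 5.4, (2, 8): 6.7,
    (4, 4): 5.5, (4, 6): 7.1, (4, 8): 8.6,
    (6, 6): 8.8, (6, 8): 10.5,
    (8, 8): 12.4,
}

def heartbeat_traffic(local_nodes, remote_nodes):
    # The table is symmetric, so normalize the key ordering.
    key = tuple(sorted((local_nodes, remote_nodes)))
    return HEARTBEAT_MBIT_PER_S[key]

# An eight-node cluster partnered with a four-node cluster:
total = heartbeat_traffic(8, 4)
print(total)       # 8.6 Mb/s in total
print(total / 2)   # 4.3 Mb/s per link with two redundant links
```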

7.4.9 Link quality


The optical properties of the fiber optic cable influence the distance that can be supported. There is a decrease in signal strength along a fiber optic cable: as the signal travels over the fiber, it is attenuated by both absorption and scattering, which is usually expressed in decibels per kilometer (dB/km). Some early deployed fiber was designed to support the telephone network, and it is sometimes insufficient for today's new multiplexed environments. If you are being supplied dark fiber by another party, you normally specify that they must not allow more than x dB loss in total.


Tip: SCSI write over Fibre Channel requires two round trips per I/O operation. At approximately 5 microseconds of propagation delay per kilometer, we have 2 (round trips) x 2 (traversals per round trip) x 5 microsec/km = 20 microsec/km. At 50 km, we have an additional latency of 20 microsec/km x 50 km = 1000 microsec = 1 msec (msec represents millisecond); that is, each SCSI I/O has one millisecond of additional service time. At 100 km, it becomes two milliseconds of additional service time.

The decibel (dB) is a convenient way of expressing an amount of signal loss or gain within a system, or the amount of loss or gain caused by some component of a system. When signal power is lost, you never lose a fixed amount of power; the rate at which you lose power is not linear. Instead, you lose a portion of power: one half, one quarter, and so on. This makes it difficult to add up the lost power along a signal's path through the network if you measure signal loss in watts. For example, if a signal loses half its power through a bad connection and then loses another quarter of its power on a bent cable, you cannot add 1/2 plus 1/4 to find the total loss; you must multiply the fractions. This makes calculating the loss of a large network both time-consuming and difficult. Decibels, though, are logarithmic, allowing you to calculate the total loss or gain characteristics of a system simply by adding them up. Keep in mind that they scale logarithmically: if your signal gains 3 dB, the signal doubles in power; if your signal loses 3 dB, the signal halves in power.

It is important to remember that the decibel is a ratio of signal powers, so you must have a reference point. For example, you can say, "There is a 5 dB drop over that connection," but you cannot say, "The signal is 5 dB at the connection." A decibel is not a measure of signal strength; it is a measure of signal power loss or gain. A decibel milliwatt (dBm), which is often confused with dB, is a measure of signal strength: it is the signal power in relation to one milliwatt. A signal power of zero dBm is one milliwatt, a signal power of three dBm is two milliwatts, six dBm is four milliwatts, and so on. Do not be misled by minus signs; they have nothing to do with signal direction. The more negative the dBm value, the closer the power level gets to zero.

A good link has a very small rate of frame loss. A retransmission occurs when a frame is lost, directly impacting performance. SVC aims to support retransmissions at 0.2 / 0.1.
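The dB arithmetic above can be sketched in a few lines (the function names are ours):

```python
# Decibels are logarithmic: dB values add, while power ratios multiply.
def db_to_power_ratio(db):
    """Power ratio corresponding to a gain of `db` decibels."""
    return 10 ** (db / 10.0)

def dbm_to_milliwatts(dbm):
    """Absolute signal power: 0 dBm is defined as 1 mW."""
    return 10 ** (dbm / 10.0)

# A 3 dB loss roughly halves the power; 3 dB + 3 dB of losses
# (added in dB) leaves about a quarter (multiplied as ratios).
print(round(1 / db_to_power_ratio(3), 2))      # ~0.5
print(round(1 / db_to_power_ratio(3 + 3), 2))  # ~0.25

# 0 dBm = 1 mW, 3 dBm ~ 2 mW, 6 dBm ~ 4 mW, matching the text.
print([round(dbm_to_milliwatts(x), 1) for x in (0, 3, 6)])  # [1.0, 2.0, 4.0]
```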

7.4.10 Hops
The hop count as such is not increased by the inter-site connection architecture. For example, if our SAN extension is based on DWDM, the DWDM components are transparent to the number of hops. The hop count limit within a fabric is set by the operating system of the fabric devices (switches or directors) and is used to derive a frame hold time value for each fabric device. This hold time value is the maximum amount of time that a frame can be held in a switch before it is dropped or a fabric busy condition is returned; for example, a frame might be held if its destination port is not available. The hold time is derived from a formula using the error detect time-out value and the resource allocation time-out value. A discussion of these fabric values is beyond the scope of this book; further information can be found in IBM TotalStorage: SAN Product, Design, and Optimization Guide, SG24-6384. If these times become excessive, the fabric experiences undesirable time-outs. It is considered that every extra hop adds about 1.2 microseconds of latency to the transmission. Currently, SVC Remote Copy Services supports three hops when protocol conversion exists. That means that if you have DWDM extended between the primary and secondary sites, three SAN directors or switches can exist between the primary and secondary SVC.

7.4.11 Buffer credits


SAN device ports need memory to temporarily store frames as they arrive, assemble them in sequence, and deliver them to the upper layer protocol. The number of frames that a port can hold is called its buffer credit. Fibre Channel architecture is based on a flow control that ensures a constant stream of data to fill the available pipe. When two FC ports begin a conversation, they exchange information about their buffer capacities, and a port sends only the number of frames for which the receiving port has given credit. This not only avoids overruns, but also provides a way to maintain performance over distance by filling the pipe with in-flight frames. Two types of credit are used:
- Buffer-to-Buffer Credit: During login, the N_Ports and F_Ports at both ends of a link establish their Buffer-to-Buffer Credit (BB_Credit).
- End-to-End Credit: In the same way, during login, all N_Ports establish End-to-End Credit (EE_Credit) with each other.

During data transmission, a port must not send more frames than the buffer of the receiving port can handle before getting an indication from the receiving port that it has processed a previously sent frame. Two counters are used for this: BB_Credit_CNT and EE_Credit_CNT. Both are initialized to zero during login.

Tip: A rule of thumb says that, to maintain acceptable performance, one buffer credit is required for every 2 km of distance covered.

Each time a port sends a frame, it increments BB_Credit_CNT and EE_Credit_CNT by one. When it receives R_RDY from the adjacent port, it decrements BB_Credit_CNT by one; when it receives ACK from the destination port, it decrements EE_Credit_CNT by one. Should BB_Credit_CNT ever become equal to the BB_Credit of the receiving port, or EE_Credit_CNT equal to its EE_Credit, the transmitting port has to stop sending frames until the respective count is decremented. The previous statements are true for Class 2 service. Class 1 is a dedicated connection, so it does not need to care about BB_Credit and only uses EE_Credit (end-to-end flow control). Class 3, on the other hand, is an unacknowledged service, so it only uses BB_Credit (buffer-to-buffer flow control); the mechanism is the same in all cases.

Here we can see the importance that the number of buffers has in overall performance. We need enough buffers to make sure that the transmitting port can continue sending frames without stopping, in order to use the full bandwidth. This is particularly true with distance. At 1 Gbps, a frame occupies about 4 km of fiber. On a 100 km link, we can send 25 frames before the first one reaches its destination. We need an ACK (acknowledgment) back at the start to refill our EE_Credit, and we can send another 25 before we receive the first ACK, so we need at least 50 buffers to allow for non-stop transmission at 100 km distance.

The maximum distance that can be achieved at full performance depends on the capabilities of the FC nodes attached at either end of the link extenders, which is vendor specific. There should be a match between the buffer credit capability of the nodes at either end of the extenders. A host bus adapter (HBA) with a buffer credit of 64 communicating with a switch port that has only eight buffer credits can read at full performance over a greater distance than it can write, because on writes the HBA can send a maximum of only eight buffers to the switch port, while on reads the switch can send up to 64 buffers to the HBA.
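The rule of thumb in the tip can be sketched as follows (the linear scaling with link speed is our assumption; consult your switch vendor's sizing guidance for real deployments):

```python
import math

# Roughly one buffer credit per 2 km of link at 1 Gbps; faster links
# fit proportionally more frames in flight over the same fiber.
def buffer_credits_needed(distance_km, speed_gbps=1):
    return math.ceil(distance_km * speed_gbps / 2)

print(buffer_credits_needed(100))                # 50, the 100 km example above
print(buffer_credits_needed(100, speed_gbps=4))  # 200 (assumed scaling)
```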

7.5 Global Mirror design points


This section provides a summary of the features of Global Mirror. SVC Global Mirror supports the following features:


- Asynchronous remote copy of volumes dispersed over metropolitan-scale distances is supported.
- SVC implements the Global Mirror relationship between a volume pair.
- SVC supports intracluster Global Mirror, where both volumes belong to the same cluster (and I/O Group). However, this functionality is better suited to Metro Mirror.
- SVC supports intercluster Global Mirror, where each volume belongs to its own separate SVC cluster. A given SVC cluster can be configured for partnership with between one and three other clusters. This is Multi-Cluster Mirroring (introduced in release 5.1.0).

  Warnings: Clusters running software 6.1.0 or higher cannot form partnerships with clusters running software lower than 4.3.1. SAN Volume Controller clusters cannot form partnerships with Storwize V7000 clusters and vice versa.

- Intercluster and intracluster Global Mirror can be used concurrently within a cluster for separate relationships.
- SVC does not require a control network or fabric to be installed to manage Global Mirror. For intercluster Global Mirror, the SVC maintains a control link between the two clusters. This control link is used to control the state and to coordinate updates at either end. The control link is implemented on top of the same FC fabric connection that the SVC uses for Global Mirror I/O.

  Note: Although not separate, this control link does require a dedicated portion of intercluster link bandwidth.

- SVC implements a configuration state model that maintains the Global Mirror configuration and state through major events, such as failover, recovery, and re-synchronization.
- SVC implements flexible re-synchronization support, enabling it to re-synchronize volume pairs that have experienced write I/Os to both disks, copying only those regions that are known to have changed.
- Colliding writes are supported.
- An optional feature for Global Mirror permits a delay simulation to be applied on writes that are sent to auxiliary volumes.
- Remote Copy maintains write consistency: it ensures that, while the primary volume and the secondary volume are synchronized, they stay in sync even in the case of a failure in the primary cluster or other failures that cause the results of writes to be uncertain.

7.5.1 Global Mirror parameters


Here we provide an overview of the parameters used to control remote copy, their default settings, and the commands used to set and display them. The properties and features of clusters can be displayed using the svcinfo lscluster command and changed using the svctask chcluster command.


Of particular importance with respect to GM/MM are the following features:

partnership (GM) bandwidth
  The GM partnership bandwidth parameter specifies the rate, in megabytes per second (MBps), at which the (background copy) write resynchronization process is attempted. From release 5.1.0 onwards, this parameter has no default value (previously 50 MBps).

relationship_bandwidth_limit (25)
  (Optional) Specifies the background copy bandwidth in megabytes per second (MBps), from 1 - 1000. The default is 25 MBps. This parameter operates cluster-wide and defines the maximum background copy bandwidth that any relationship can adopt. The existing background copy bandwidth settings defined on a partnership continue to operate, with the lower of the partnership and volume rates attempted. Note: Do not set this value higher than the default without establishing that the higher bandwidth can be sustained.

gm_link_tolerance (300)
  (Optional) Specifies the length of time, in seconds, for which an inadequate intercluster link is tolerated for a Global Mirror operation. The parameter accepts values from 60 to 400 seconds in steps of 10 seconds. The default is 300 seconds. You can disable the link tolerance by entering a value of zero (0) for this parameter. Note: For later releases there is no default setting; this parameter must be explicitly defined by the client.

gm_max_host_delay (5)
  (Optional) Specifies the maximum time delay, in milliseconds, above which the Global Mirror link tolerance timer starts counting down. This threshold value determines the additional impact that Global Mirror operations can add to the response times of the Global Mirror source volumes. You can use this parameter to increase the threshold from the default value of 5 milliseconds.

gm_inter_cluster_delay_simulation (0)
  (Optional) Specifies the intercluster delay simulation, which simulates the Global Mirror round-trip delay between two clusters, in milliseconds. The default is 0; the valid range is 0 to 100 milliseconds.

gm_intra_cluster_delay_simulation (0)
  (Optional) Specifies the intracluster delay simulation, which simulates the Global Mirror round-trip delay in milliseconds. The default is 0; the valid range is 0 to 100 milliseconds.

7.5.2 chcluster and chpartnership commands


The chpartnership and chcluster commands are used to alter Global Mirror settings at the cluster and partnership level. An invocation example is shown in Example 7-1.
Example 7-1 Alter Global Mirror settings

svctask chpartnership -bandwidth 20 cluster1



svctask chpartnership -stop cluster1

For more details on using MM/GM commands, see the Redbook Implementing the IBM System Storage SAN Volume Controller, SG24-7933, or use the command line help option (-h).

7.5.3 How GM Bandwidth is distributed


In this section we consider how the GM bandwidth resource is distributed within the cluster and how to optimize the distribution of volumes within I/O groups, at the local and remote clusters, in order to maximize performance. Although the bandwidth (the rate of background copy) is defined at the cluster level, it is subdivided and distributed on a per-node basis; that is, it is divided evenly between the nodes that have volumes performing background copy for active copy relationships. This bandwidth allocation is independent of the number of volumes for which a node is responsible. Each node, in turn, divides its bandwidth evenly between the (multiple) remote copy relationships with which it has volumes associated that are currently performing background copy.
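A small sketch of this division (the function and its arguments are hypothetical illustrations of ours, not SVC code; the 25 MBps cap mirrors the relationship_bandwidth_limit default described earlier):

```python
# Divide the cluster-wide background copy bandwidth evenly across the
# nodes that are performing background copy, then evenly across each
# node's actively copying relationships, capped per relationship.
def per_relationship_rate(partnership_mbps, rels_per_node,
                          relationship_limit_mbps=25.0):
    copying_nodes = {n: r for n, r in rels_per_node.items() if r > 0}
    if not copying_nodes:
        return {}
    per_node = partnership_mbps / len(copying_nodes)
    return {node: min(per_node / nrels, relationship_limit_mbps)
            for node, nrels in copying_nodes.items()}

# 200 MBps partnership bandwidth: each of the two copying nodes gets
# 100 MBps, split across its own relationships and capped at 25 MBps.
print(per_relationship_rate(200, {"node1": 2, "node2": 5}))
# {'node1': 25.0, 'node2': 20.0}
```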

Volume preferred node


Conceptually, there is a connection (path) from each node on the primary cluster to each node on the remote cluster, and write I/O associated with remote copying travels along this path. Each node-to-node connection is assigned a finite amount of remote copy resource and can only sustain in-flight write I/O up to this limit. The node-to-node in-flight write limit is determined by the number of nodes in the remote cluster: the more nodes there are at the remote cluster, the lower the limit for the in-flight write I/Os from a local node to a remote node, so less data can be outstanding from any one local node to any other remote node. Therefore, in order to optimize performance, Global Mirror volumes must have their preferred nodes evenly distributed between the nodes of the clusters.

The preferred node property of a volume helps to balance the I/O load between nodes in that I/O Group. This property is also used by Global Mirror to route I/O between clusters. The SVC node that receives a write for a volume is normally that volume's preferred node. For volumes in a Global Mirror relationship, that node is also responsible for sending that write to the preferred node of the target volume. The primary preferred node is also responsible for sending any writes relating to background copy; again, these writes are sent to the preferred node of the target volume.

Note: The preferred node for a volume cannot be changed non-disruptively or easily after the volume is created.

Each node of the remote cluster has a fixed pool of Global Mirror system resources for each node of the primary cluster. That is, each remote node has a separate queue for I/O from each of the primary nodes. This queue is a fixed size and is the same size for every node. If the preferred nodes for the volumes of the remote cluster are set so that every combination of primary node and secondary node is used, Global Mirror performance will be maximized.


Figure 7-10 shows an example of Global Mirror resources that are not optimized. Volumes from the local cluster are replicated to the remote cluster, and all volumes with a preferred node of node 1 are replicated to target volumes whose preferred node is also node 1. With this configuration, the remote cluster node 1 resources reserved for local cluster node 2 are not used, nor are the remote cluster node 2 resources reserved for local cluster node 1.

Figure 7-10 Global Mirror resources not optimized

If the configuration is changed to the configuration shown in Figure 7-11, all Global Mirror resources for each node are used, and SVC Global Mirror operates with better performance than that of the configuration shown in Figure 7-10.

Figure 7-11 Global Mirror resources optimized

How GM Bandwidth can impact foreground I/O latency


The GM bandwidth parameter explicitly defines the rate at which the background copy is attempted, but it also implicitly affects foreground I/O. Background copy bandwidth can affect foreground I/O latency in one of three ways:


- Increasing latency of foreground I/O: If the GM bandwidth parameter is set too high with respect to the actual intercluster link capability, the background copy resynchronization writes consume too much of the intercluster link, starving the link of the ability to service the (a)synchronous mirrored foreground writes. Delays in processing the mirrored foreground writes increase the latency of the foreground I/O as perceived by applications.
- Read I/O overload of primary storage: If the GM bandwidth parameter (background copy rate) is set too high, the additional read I/Os associated with background copy writes can overload the storage at the primary site and delay foreground (read and write) I/Os.
- Write I/O overload of auxiliary storage: If the GM bandwidth parameter (background copy rate) is set too high for the storage at the secondary site, background copy writes overload the secondary storage and, again, delay the (a)synchronous mirrored foreground write I/Os.

Note: An increase in the peak foreground workload has a similarly detrimental effect on foreground I/O, by pushing more mirrored foreground write traffic along the intercluster link (which may not have the bandwidth to sustain it) and potentially overloading the primary storage.

To set the background copy bandwidth optimally, take into consideration all aspects of your environment. The three biggest contributing resources are the primary storage, the intercluster link bandwidth, and the secondary storage. As discussed, changes in the environment, or its loading, may result in foreground I/O being impacted. SVC provides the client with a means of monitoring, and a parameter for controlling, how foreground I/O is affected by running remote copy processes. SVC code monitors the delivery of the mirrored foreground writes, and if their latency or performance extends beyond a (predefined or client-defined) limit for a defined period of time, the remote copy relationship is suspended. This cut-off parameter is called gmlinktolerance.

Internal monitoring and the gmlinktolerance parameter


The gmlinktolerance parameter is used to ensure that hosts do not perceive the latency of the long-distance link, with respect to the bandwidth of either the hardware maintaining the link or the storage at the secondary site. Both must be provisioned such that, combined, they can support the maximum throughput delivered by the applications at the primary site that are using Global Mirror. If the capabilities of this hardware are exceeded, the system becomes backlogged and the hosts receive higher latencies on their write I/O. MM/GM Remote Copy implements a protection mechanism to detect this condition and halts mirrored foreground write and background copy I/O. Suspension of this type of I/O traffic ensures that misconfiguration and/or hardware problems do not impact host application availability. SVC's Global Mirror attempts to detect and differentiate between backlogs that are specifically due to the Global Mirror protocol's operation, as opposed to general delays in a very heavily loaded system, where a host would see high latency even if Global Mirror were disabled.


To detect these specific scenarios, Global Mirror measures the time taken to perform the messaging to assign and record the sequence number for a write I/O. If this process exceeds the expected average over a period of 10 seconds, that period is treated as overloaded. Users set maxhostdelay and gmlinktolerance to control how the software responds to these delays. maxhostdelay is a value in milliseconds that can go up to 100. Every 10 seconds, Global Mirror takes a sample of all Global Mirror writes and determines how much delay it added. If over half of these writes are delayed by more than maxhostdelay, that sample period is marked as bad. The software keeps a running count of bad periods: each bad period increases the count by one, and each good period decreases it by one, to a minimum value of 0. If the link is overloaded for a number of consecutive seconds greater than the gmlinktolerance value, a 1920 error (or another GM-related error code) is recorded against the volume that has consumed the most Global Mirror resource over recent time. A period without overload decrements the count of consecutive periods of overload, so an error log will also be raised if, over any given period of time, the amount of time in overload exceeds the amount of non-overloaded time by gmlinktolerance.

gmlinktolerance Bad Periods


The gmlinktolerance is given in seconds, and bad periods are assessed at intervals of 10 seconds. The maximum bad period count is therefore the gmlinktolerance parameter value divided by 10. With a gmlinktolerance of 300 seconds, the maximum bad period count is 30; once it is reached, a 1920 error fires. Bad periods do not need to be consecutive, and the bad period count either increments or decrements at 10-second intervals. That is to say, 10 bad periods, followed by 5 good periods, followed by 10 bad periods, would result in a bad period count of 15.
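The counter behavior described above can be simulated in a few lines (a sketch of the described algorithm, not the actual SVC implementation):

```python
# Each element of `samples` is one 10-second period: True = bad, False = good.
# The count rises by 1 per bad period, falls by 1 (floor 0) per good one;
# a 1920 error fires when it reaches gmlinktolerance / 10.
def fires_1920(samples, gmlinktolerance=300):
    threshold = gmlinktolerance // 10
    count = 0
    for bad in samples:
        count = count + 1 if bad else max(0, count - 1)
        if count >= threshold:
            return True
    return False

# 10 bad, 5 good, 10 bad -> count reaches 15, below the default threshold of 30.
mixed = [True] * 10 + [False] * 5 + [True] * 10
print(fires_1920(mixed))        # False
print(fires_1920([True] * 30))  # True
```

Note that with the minimum settings described in the Edge Case below (gmlinktolerance of 20 seconds), the threshold drops to 2, so two consecutive bad periods are enough to fire.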

I/O assessment within Bad Periods


Within each sample period, I/Os are assessed. The proportion of bad I/O to good I/O is calculated, and if it exceeds a defined value, the sample period is marked as a bad period. A consequence is that under a light I/O load, a single bad I/O can become significant. For example, if only one write I/O is performed every 10 seconds and that write is considered slow, the bad period count increments.

Edge Case
The worst possible situation occurs when the gm_max_host_delay and gmlinktolerance parameters are set to their minimum values (1 ms and 20 seconds). With these settings, only two consecutive bad sample periods are needed before a 1920 error condition fires. If the foreground write I/O load is very light, say a single I/O every 20 seconds, then with some very unlucky timing:
A single bad I/O occurs (that is, a write I/O that took over 1 ms in Remote Copy), and
That bad I/O spans the boundary of two 10-second sample periods.
This single bad I/O could theoretically be counted as two bad periods and trigger a 1920 error. A higher gmlinktolerance setting, a higher gm_max_host_delay setting, or a higher I/O load all reduce the risk of encountering this edge case.
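The counting scheme described above can be modeled in a few lines of code. This is an illustrative sketch, not SVC source code: the function name and the list-of-samples input are assumptions, but the increment/decrement rule and the gmlinktolerance-divided-by-10 threshold follow the text.

```python
# Sketch of the gmlinktolerance "bad period" counter described above.
# Each entry in `periods` represents one 10-second sample: True = bad.

def simulate_bad_period_counter(periods, gmlinktolerance=300):
    """Return the index of the sample that triggers a 1920 error, or None."""
    threshold = gmlinktolerance // 10   # e.g. 300 s -> 30 bad periods
    count = 0
    for i, bad in enumerate(periods):
        if bad:
            count += 1                  # a bad period increments the count
        else:
            count = max(0, count - 1)   # a good period decrements, floor at 0
        if count >= threshold:
            return i                    # 1920 error fires here
    return None

# The example from the text: 10 bad, 5 good, 10 bad ends at a count of 15,
# which is below the default threshold of 30, so no error fires.
trace = [True] * 10 + [False] * 5 + [True] * 10
assert simulate_bad_period_counter(trace) is None

# With gmlinktolerance lowered to 150 s (threshold 15), the same trace fires.
assert simulate_bad_period_counter(trace, gmlinktolerance=150) == 24
```

Running the same trace against different gmlinktolerance values is a cheap way to reason about how aggressively a given setting would stop relationships.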

Chapter 7. Remote Copy services


7.5.4 1920 errors


The SVC Global Mirror process aims to maintain a low response time for foreground writes even when the long-distance link has a high response time. It monitors how well it is meeting this goal by measuring how long it takes to process I/O; specifically, SVC measures the locking and serialization part of the protocol that takes place when a write is received, and compares this with how much time the I/O would likely have taken had Global Mirror not been active. If this extra time is consistently greater than 5 ms, Global Mirror determines that it is not meeting its goal and shuts down the most bandwidth-hungry relationship. This generates a 1920 error and protects the local SVC from performance degradation. Note: Debugging 1920 errors requires detailed information about I/O at the primary and secondary clusters, as well as node-to-node communication. As a minimum, I/O statistics should be running on both clusters, covering the period of the 1920 error, and if possible Tivoli Storage Productivity Center statistics should also be collected.

7.6 Global Mirror planning


In this section we document Global Mirror planning considerations.

7.6.1 Summary of Metro Mirror and Global Mirror rules


To summarize the Metro Mirror and Global Mirror rules:
Until release 6.2.0, FlashCopy targets cannot be in a Metro Mirror or Global Mirror relationship; only FlashCopy sources can be in a Metro Mirror or Global Mirror relationship (refer to 7.2.1, What is new in SVC 6.2 on page 135).
Metro Mirror or Global Mirror source or target volumes cannot be moved to different I/O Groups.
Metro Mirror or Global Mirror volumes cannot be resized.
Intra-cluster Metro Mirror or Global Mirror can only mirror between volumes in the same I/O Group.
The target volume must be the same size as the source volume; however, the target volume can be a different type (image, striped, or sequential mode) or have different cache settings (cache-enabled or cache-disabled).
When using SVC Global Mirror, all components in the SAN (switches, remote links, and storage controllers) must be capable of sustaining the workload generated by application hosts (foreground I/O on the primary cluster), plus the workload generated by the remote copy processes: mirrored foreground writes, background copy (background write resynchronization), and intercluster heartbeat messaging.
The bandwidth parameter, which controls the background copy rate, must be set to a value appropriate to the link and the secondary back-end storage.
Global Mirror is not supported for cache-disabled volumes participating in a Global Mirror relationship.


We recommend that you use a SAN performance monitoring tool, such as IBM Tivoli Storage Productivity Center, which allows you to continuously monitor the SAN components for error conditions and performance problems. Tivoli Storage Productivity Center can alert you as soon as there is a performance problem, or when a Global Mirror (or Metro Mirror) link has been automatically suspended by the SVC. A remote copy relationship that remains stopped without intervention can severely impact your recovery point objective. Additionally, restarting a link that has been suspended for a long period of time adds further burden to your links while the synchronization catches up. The gmlinktolerance parameter of the remote copy partnership must be set to an appropriate value; the default value of 300 seconds (5 minutes) is appropriate for most clients. If you plan to perform SAN maintenance that might impact SVC Global Mirror relationships, either:
Pick a maintenance window where the application I/O workload is reduced for the duration of the maintenance,
Disable the gmlinktolerance feature or increase the gmlinktolerance value (meaning that application hosts might see extended response times from Global Mirror volumes), or
Stop the Global Mirror relationships.

7.6.2 Planning overview


Ideally, consider this on a holistic basis, and run a trial, with data collection tools running, before going live. We need to consider:
The inter-cluster link
Peak workloads at the primary cluster
Back-end storage at both clusters
Before starting with SVC Remote Copy Services, it is important to consider any overhead associated with their introduction, which means knowing your current infrastructure fully: the inter-cluster link distance and bandwidth, the current SVC cluster load, and the current storage array controller load. Bandwidth analysis and capacity planning for your links help to define how many links you need, and when you need to add more links, in order to ensure the best possible performance and high availability. As part of your implementation project, you may be able to identify and then distribute hot spots across your configuration, or take other actions to manage and balance the load. Consider the following questions:
Is your bandwidth so limited that you might see an increase in the response time of your applications at times of high workload?
Remember that the speed of light is less than 300,000 km/s, that is, less than 300 km/ms on fiber. The data must travel to the other site, and then an acknowledgement has to come back. Add the latency of any active components on the way, and you get approximately 1 ms of overhead per 100 km for write I/Os. Metro Mirror therefore adds latency, proportional to the link distance, to every write operation.
Can your current SVC cluster or clusters handle the extra load?
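The rule of thumb above (roughly 1 ms of write overhead per 100 km) can be checked with simple arithmetic. This sketch assumes a propagation speed of about 200 km/ms in fiber (roughly two thirds of the speed of light in a vacuum) and ignores the latency of active components on the path.

```python
# Back-of-the-envelope round-trip latency estimate for a synchronous
# (Metro Mirror) write over distance. Planning approximation only.

def round_trip_latency_ms(distance_km, speed_km_per_ms=200.0):
    """A synchronous write must travel to the remote site and the
    acknowledgement must come back, so the distance counts twice."""
    return 2 * distance_km / speed_km_per_ms

# 100 km of fiber adds about 1 ms to every Metro Mirror write
assert round_trip_latency_ms(100) == 1.0
assert round_trip_latency_ms(300) == 3.0
```

Real links add switch, DWDM, and protocol-conversion latency on top of this floor, so measured values are higher than the pure propagation delay.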

Problems are not always related to Remote Copy Services or the intercluster link; they can instead be caused by hot spots on the disk subsystems. Be sure these problems are resolved. Is your secondary storage capable of handling the additional workload it receives? This is essentially the same back-end write workload as generated by the primary applications.

7.6.3 Planning specifics


In this section we discuss using both Metro Mirror and Global Mirror between two clusters.

Remote Copy Mirror relationship


A Remote Copy (RC) mirror relationship is a relationship between two individual volumes of the same size. The management of RC mirror relationships is always performed in the cluster where the source volume exists. However, you must consider the performance implications of this configuration, because write data from all mirroring relationships is transported over the same inter-cluster links. Metro Mirror and Global Mirror respond differently to a heavily loaded, poorly performing link. Metro Mirror usually maintains the relationships in a consistent synchronized state, meaning that primary host applications start to see poor performance (a result of the synchronous mirroring being used). Global Mirror, however, offers a higher level of write performance to primary host applications: with a well-performing link, writes complete asynchronously, and if link performance becomes unacceptable, the link tolerance feature automatically stops Global Mirror relationships to ensure that the performance for application hosts remains within reasonable limits. Therefore, with active Metro Mirror and Global Mirror relationships between the same two clusters, Global Mirror writes might suffer degraded performance if Metro Mirror relationships consume most of the inter-cluster link's capability. If this degradation reaches a level where hosts writing to Global Mirror volumes experience extended response times, the Global Mirror relationships can be stopped when the link tolerance threshold is exceeded. If this situation happens, refer to 7.5.4, 1920 errors on page 160.

Supported partner clusters


Inter-cluster compatibility, with respect to SVC release code and hardware types:
Clusters running software 6.1.0 or higher cannot form partnerships with clusters running software lower than 4.3.1.
SAN Volume Controller clusters cannot form partnerships with Storwize V7000 clusters, and vice versa.

Back-end storage controller requirements


The capabilities of the storage controllers at the remote SVC cluster must be provisioned to allow for:
The peak application workload to the Global Mirror or Metro Mirror volumes
The defined level of background copy
Any other I/O being performed at the remote site
The performance of applications at the primary cluster can be limited by the performance of the back-end storage controllers at the remote cluster.
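The provisioning requirement above amounts to simple addition. The following sketch is a hypothetical planning aid, not an SVC tool; the function name, the MB/s units, and the 20% headroom factor are all assumptions made for illustration.

```python
# Rough provisioning check for the remote back end, summing the three
# contributors listed above: mirrored foreground writes, the configured
# background copy rate, and any local I/O at the remote site.

def remote_backend_write_mbps(peak_foreground_mbps,
                              background_copy_mbps,
                              other_io_mbps,
                              headroom=1.2):
    """Return the write throughput (MB/s) the remote controllers should
    sustain, with a headroom multiplier to avoid running at 100%."""
    total = peak_foreground_mbps + background_copy_mbps + other_io_mbps
    return total * headroom

# e.g. 80 MB/s of peak application writes, 25 MB/s of background copy,
# and 15 MB/s of local workload at the DR site
assert abs(remote_backend_write_mbps(80, 25, 15) - 144.0) < 1e-9
```

If the remote MDisk Groups cannot sustain this figure, the primary cluster's application performance is eventually limited by the remote back end, as the text notes.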


To maximize the number of I/Os that applications can perform to Global Mirror and Metro Mirror volumes:
Global Mirror and Metro Mirror volumes at the remote cluster must be in dedicated MDisk Groups.
The MDisk Groups must not contain non-mirror volumes.
Storage controllers must be configured to support the mirror workload that is required of them, which might be achieved by:
Dedicating storage controllers to only Global Mirror and Metro Mirror volumes
Configuring the controller to guarantee sufficient quality of service for the disks used by Global Mirror and Metro Mirror
Ensuring that physical disks are not shared between Global Mirror or Metro Mirror volumes and other I/O
Verifying that MDisks within a mirror MDisk Group are similar in their characteristics (for example, Redundant Array of Independent Disks (RAID) level, physical disk count, and disk speed)

Technical references and limits


The Metro Mirror and Global Mirror operations support the following functions:
Intracluster copying of a volume, in which both volumes belong to the same cluster and the same I/O Group within the cluster.
Intercluster copying of a volume, in which one volume belongs to one cluster and the other volume belongs to a different cluster.
Note: A cluster can participate in active Metro Mirror and Global Mirror relationships with itself and with up to three other clusters.
Intercluster and intracluster Metro Mirror and Global Mirror relationships can be used concurrently within a cluster.
The intercluster link is bidirectional: it can copy data from cluster A to cluster B for one pair of volumes while copying data from cluster B to cluster A for a different pair of volumes.
The copy direction can be reversed for a consistent relationship.
Consistency groups are supported to manage a group of relationships that must be kept synchronized for the same application. Consistency groups also simplify administration, because a single command issued to the consistency group is applied to all the relationships in that group.
SAN Volume Controller supports a maximum of 8192 Metro Mirror and Global Mirror relationships per cluster.

7.7 Global Mirror use cases


The following section provides details on some of the common use cases for Global Mirror.

7.7.1 Synchronize a Remote Copy relationship


We will first describe three methods that can be used to establish (synchronize) a Remote Copy relationship.


Full synchronization after create


This is the default method. It is the simplest, in that it requires no additional administrative activity apart from issuing the necessary SVC commands:
A CreateRelationship with the CreateConsistent state set to FALSE
A Start of the relationship with Clean set to FALSE
However, in some environments, the available bandwidth makes this method unsuitable.

Synchronized before create


In this method, the administrator must ensure that the master and auxiliary volumes contain identical data before creating the relationship. There are two ways this might be done:
Both volumes are created with the security delete feature so that all data is zeroed, or
A complete tape image (or other method of moving data) is copied from one disk to the other.
With either technique, no write I/O must take place on either the master or the auxiliary before the relationship is established. The administrator must then issue:
A CreateRelationship with the CreateConsistent state set to TRUE
A Start of the relationship with Clean set to FALSE
This method has an advantage over full synchronization, in that it does not require all the data to be copied over a constrained link. However, if the data needs to be copied, the master and auxiliary disks cannot be used until the copy is complete, which might be unacceptable.
Warning: If these steps are not performed correctly, Remote Copy reports the relationship as being consistent when it is not. This is likely to make any auxiliary volume useless.

Quick synchronization after Create


In this method, the administrator must still copy data from master to auxiliary, but the copy can be performed without stopping the application at the master:
A CreateRelationship is issued with CreateConsistent set to TRUE.
A Stop of the relationship is issued with EnableAccess set to TRUE.
A tape image (or other method of transferring data) is used to copy the entire master volume to the auxiliary volume.
Once the copy is complete, the relationship is restarted with Clean set to TRUE.
With this technique, only the data that has changed since the relationship was created, including all regions that were incorrect in the tape image, is copied from master to auxiliary by Remote Copy.
Warning: As described in Synchronized before create on page 164, the copy step must be performed correctly, or the auxiliary will be useless even though Remote Copy reports it as being synchronized.
Having established the different methods of starting a Metro Mirror or Global Mirror relationship, we can use one of these scenarios to implement the Remote Copy relationship while saving bandwidth, and introduce a method of resizing the Global Mirror volumes.


7.7.2 Setting up GM relationships: saving bandwidth and resizing volumes


If you have a large source volume (or a large number of source volumes) that you want to replicate to a remote site, and your planning shows that the SVC mirror initial sync will take too long (or will be too costly if you pay for the traffic that you use), here is a method of setting up the sync using another, potentially less expensive, medium. You might also want to use these steps if you need to increase the size of a volume currently in a Metro Mirror or Global Mirror relationship: to resize these volumes, you must delete the current mirror relationships, resize the volumes, and then redefine the mirror relationships. In this example, we use tape media as the source for the initial sync of the Metro Mirror or Global Mirror target before using SVC to maintain the mirror. This example does not require downtime for the hosts using the source volumes. Here are the steps:
1. The hosts are up and running and using their volumes normally. No Metro Mirror or Global Mirror relationship is defined yet. You have identified all the volumes that will become the source volumes in a Metro Mirror or Global Mirror relationship.
2. You have already established the SVC cluster partnership with the target SVC.
3. Define a Metro Mirror or Global Mirror relationship for each source volume. When defining the relationship, ensure that you use the -sync option, which stops the SVC from performing an initial sync.
Note: If you fail to use the -sync option, all of these steps are redundant, because the SVC performs a full initial sync anyway.

4. Stop each mirror relationship by using the -access option, which enables write access to the target volumes. We will need this write access later.
5. Make a copy of the source volume to the alternate media by using the dd command to copy the contents of the volume to tape. Another option is to use your backup tool (for example, IBM Tivoli Storage Manager) to make an image backup of the volume.
Note: Even though the source is being modified while you are copying the image, the SVC is tracking those changes. The image that you create might already contain some of the changes and is likely to have missed some of the changes as well. When the relationship is restarted, the SVC will apply all of the changes that occurred since the relationship was stopped in step 4. After all the changes are applied, you will have a consistent target image.

6. Ship your media to the remote site and apply the contents to the targets of the Metro Mirror or Global Mirror relationship. For example, you can mount the target volumes on a UNIX server and use the dd command to copy the contents of the tape to the target volumes. If you used your backup tool to make an image of the volume, follow the instructions for your tool to restore the image to the target volume. Do not forget to remove the mount, if this is a temporary host.
Note: It does not matter how long it takes to get your media to the remote site and perform this step; however, the quicker you get the media to the remote site and loaded, the quicker the SVC is running and maintaining the Metro Mirror or Global Mirror.

7. Unmount the target volumes from your host. When you start the Metro Mirror or Global Mirror relationship later, the SVC stops write access to the volume while the mirror relationship is running.
8. Start your Metro Mirror or Global Mirror relationships. While the mirror relationship catches up, the target volume is not usable. As soon as it reaches Consistent Copying, your remote volume is ready for use in a disaster.
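To decide whether the tape-based approach above is worthwhile, it helps to estimate how long a full initial sync over the link would take. This is a hypothetical planning calculation, not an SVC feature; the usable_fraction parameter is an assumption standing in for whatever share of the link you are willing to give to background copy.

```python
# Estimate the duration of a full SVC initial sync over the link,
# to compare against the turnaround time of shipping tape media.

def initial_sync_hours(total_gb, link_mbytes_per_sec, usable_fraction=0.5):
    """total_gb: combined size of the source volumes.
    usable_fraction: share of link bandwidth allowed for background copy
    without hurting foreground writes."""
    seconds = (total_gb * 1024) / (link_mbytes_per_sec * usable_fraction)
    return seconds / 3600

# 10 TB of source volumes over a 20 MB/s link at 50% utilisation takes
# roughly 291 hours (about 12 days); shipping tapes starts to look attractive.
assert round(initial_sync_hours(10 * 1024, 20), 1) == 291.3
```

The same arithmetic also tells you how long the delta catch-up in step 8 will take, since only the data changed while the relationship was stopped must cross the link.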

7.7.3 Master and auxiliary volumes and switching their roles


When creating a Global Mirror relationship, the master volume is initially assigned as the master, and the auxiliary volume is initially assigned as the auxiliary. This design implies that the initial copy direction is mirroring the master volume to the auxiliary volume. After the initial synchronization is complete, the copy direction can be changed, if appropriate. In the most common applications of Global Mirror, the master volume contains the production copy of the data and is used by the host application, while the auxiliary volume contains the mirrored copy of the data and is used for failover in DR scenarios.
Notes:
A volume can only be part of one Global Mirror relationship at a time.
A volume that is a FlashCopy target cannot be part of a Global Mirror relationship.

7.7.4 Migrating a Metro Mirror relationship to Global Mirror


It is possible to change a Metro Mirror relationship to a Global Mirror relationship, or a Global Mirror relationship to a Metro Mirror relationship. This procedure, however, requires an outage on the host and is only successful if you can guarantee that no I/O is generated to either the source or the target volumes during these steps:
1. Your host is currently running with volumes that are in a Metro Mirror or Global Mirror relationship. This relationship is in the Consistent Synchronized state.
2. Stop the application and the host.
3. Optionally, unmap the volumes from the host to guarantee that no I/O can be performed on them. If there are currently outstanding write I/Os in the cache, you might need to wait at least two minutes before you can unmap the volumes.
4. Stop the Metro Mirror or Global Mirror relationship, and ensure that the relationship stops in the Consistent Stopped state.
5. Delete the current Metro Mirror or Global Mirror relationship.
6. Create the new Metro Mirror or Global Mirror relationship. Ensure that you create it as synchronized to stop the SVC from resynchronizing the volumes; use the -sync flag with the svctask mkrcrelationship command.


7. Start the new Metro Mirror or Global Mirror relationship.
8. Remap the source volumes to the host if you unmapped them in step 3.
9. Start the host and the application.
Extremely important: If the relationship is not stopped in the consistent state, or if any host I/O takes place between stopping the old relationship and starting the new one, those changes will never be mirrored to the target volumes. As a result, the data on the source and target volumes is not exactly the same, and the SVC will be unaware of the inconsistency.

7.7.5 Multiple Cluster Mirroring (MCM)


The concept of Multiple Cluster Mirroring was introduced with SVC release 5.1.0; previously, mirroring was limited to a one-to-one mapping of clusters. Each SVC cluster can maintain up to three partner cluster relationships, allowing as many as four clusters to be directly associated with each other. This SVC partnership capability enables the implementation of disaster recovery (DR) solutions. Figure 7-12 shows an example of a Multiple Cluster Mirroring configuration.

Figure 7-12 Multiple Cluster Mirroring configuration example


Software level restrictions for Multiple Cluster Mirroring:
Partnership between a cluster running 6.1.0 and a cluster running a version earlier than 4.3.1 is not supported.
Clusters in a partnership where one cluster is running 6.1.0 and the other is running 4.3.1 cannot participate in additional partnerships with other clusters.
Clusters that are all running either 6.1.0 or 5.1.0 can participate in up to three cluster partnerships.

Note: SVC 6.1 supports object names up to 63 characters. Previous levels only supported up to 15 characters. When SVC 6.1 clusters are partnered with 4.3.1 and 5.1.0 clusters, various object names will be truncated at 15 characters when displayed from 4.3.1 and 5.1.0 clusters.

Supported Multiple Cluster Mirroring topologies


Multiple Cluster Mirroring allows for various partnership topologies as illustrated in the following examples:

Star Topology: A B, A C, and A D

Figure 7-13 SVC star topology

Figure 7-13 shows four clusters in a star topology, with cluster A at the center. Cluster A can act as a central DR site for the three other locations. Using a star topology, you can migrate applications by using a process like the one described in the following example:
1. Suspend the application at A.
2. Remove the A B relationship.
3. Create the A C relationship (or alternatively, the B C relationship).
4. Synchronize to cluster C, and ensure that A C is established.


Triangle topology: A B, A C, and B C

Figure 7-14 SVC triangle topology

Figure 7-14 shows three clusters in a triangle topology. A potential use case is that data center B is being migrated to C, where data center A is the host production site and both B and C are DR sites. Using this topology, it is possible to migrate different applications at different times using a process like the following:
1. Suspend the application at A.
2. Remove the A B relationship.
3. Create the A C relationship (or alternatively, the B C relationship).
4. Synchronize to C, and ensure that A C is established.
Migrating different applications over a series of weekends in this way provides a phased migration capability.

Fully connected topology: A B, A C, A D, B C, B D, and C D

Figure 7-15 SVC fully connected topology

Figure 7-15 is a fully connected mesh where every cluster has a partnership to each of the three other clusters. This allows volumes to be replicated between any pair of clusters.


Note: This configuration is not recommended, unless relationships are needed between every pair of clusters. Intercluster zoning should be restricted to where necessary only.

Daisy chain topology: A B, B C, and C D


Figure 7-16 shows a daisy-chain topology.

Figure 7-16 SVC daisy-chain topology

Note that although clusters can have up to three partnerships, each volume can only be part of one Remote Copy relationship, for example A B.

Unsupported topology: A B, B C, C D, and D E

Figure 7-17 Unsupported SVC topology

This topology is unsupported, because five clusters are indirectly connected. If a cluster can detect this condition at the time of the fourth mkpartnership command, the command is rejected with an error message. Sometimes, however, detection at that time is not possible; in this case, an error appears in the error log of each cluster in the connected set.
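The connectivity rules described in this section (at most three partners per cluster, and at most four clusters directly or indirectly connected) can be sketched as a simple graph check. This is an illustrative model only, not the SVC's actual validation logic; the function and parameter names are assumptions.

```python
# Sketch of the Multiple Cluster Mirroring connectivity rules: no cluster
# may have more than 3 partners, and no connected set may exceed 4 clusters.

from collections import defaultdict

def partnership_allowed(existing_pairs, new_pair, max_set=4, max_partners=3):
    """Return True if adding new_pair keeps the configuration valid."""
    pairs = list(existing_pairs) + [new_pair]
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    if any(len(partners) > max_partners for partners in adj.values()):
        return False
    # Measure the connected set containing the new partnership
    seen, stack = set(), [new_pair[0]]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj[node])
    return len(seen) <= max_set

# A daisy chain A-B, B-C, C-D is fine; extending it to a fifth cluster is not
chain = [("A", "B"), ("B", "C"), ("C", "D")]
assert partnership_allowed(chain[:2], chain[2]) is True
assert partnership_allowed(chain, ("D", "E")) is False
```

The second assertion corresponds exactly to the unsupported five-cluster chain above: every individual cluster still has three or fewer partners, but the connected set grows to five.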

7.7.6 Performing three-way copy service functions


Three-way copy service functions using SVC are not directly supported. If you have a requirement to perform three-way (or more) replication using copy service functions (synchronous or asynchronous mirroring), you can address this requirement by using a combination of:
SVC copy services (with image mode cache-disabled volumes), and
Native storage controller copy services.
Both relationships are active, as shown in Figure 7-18 on page 171.


Figure 7-18 Using three-way copy services

Important: The SVC only supports copy services between two clusters.

In Figure 7-18, the primary site uses SVC copy services (Global Mirror or Metro Mirror) to the secondary site. In the event of a disaster at the primary site, the storage administrator enables access to the target volume at the secondary site, and the business application continues processing. While the business continues processing at the secondary site, the storage controller copy services replicate to the third site.

Using native controller Advanced Copy Services functions


Native copy services are not supported on all storage controllers. A summary of the known limitations is available at the following website: http://www-1.ibm.com/support/docview.wss?&uid=ssg1S1002852

The storage controller is unaware of the SVC


When you use the copy services function in a storage controller, remember that the storage controller has no knowledge that the SVC exists and that the SVC uses those disks on behalf of the real hosts. Therefore, when allocating source volumes and target volumes in a point-in-time copy relationship or a remote mirror relationship, make sure that you choose them in the right order. If you accidentally use a source logical unit number (LUN) with SVC data on it as a target LUN, you can destroy that data. If that LUN was a Managed Disk (MDisk) in an MDisk Group (MDG) with striped or sequential volumes on it, the accident might cascade up and bring the MDG offline, which in turn takes all the volumes that belong to that group offline. When defining LUNs in a point-in-time copy or remote mirror relationship, double-check that the SVC does not have visibility to the LUN (mask it so that no SVC node can see it), or if the SVC must see the LUN, ensure that it is an unmanaged MDisk.


The storage controller might, as part of its Advanced Copy Services function, take a LUN offline or suspend reads or writes. The SVC does not understand why this happens and might log errors when these events occur. If you mask target LUNs to the SVC and rename your MDisks as you discover them, and the Advanced Copy Services function then prohibits access to a LUN as part of its processing, the MDisk might be discarded and rediscovered with an SVC-assigned MDisk name.

Cache-disabled image mode volumes


When the SVC uses a LUN from a storage controller that is a source or target of Advanced Copy Services functions, you can only use that LUN as a cache-disabled image mode volume. If you use the LUN for any other type of SVC volume, you risk data loss, not only of the data on that LUN: you can potentially bring down all volumes in the MDG to which you assigned that LUN (MDisk). If you leave caching enabled on a volume, the underlying controller does not receive write I/Os as the host writes them; the SVC caches them and destages them at a later time, which can have additional ramifications if a target host is dependent on the write I/Os from the source host as they are written.

7.7.7 When to use storage controller Advanced Copy Services functions


The SVC provides you with greater flexibility than using only native copy service functions:
Standard storage device driver: regardless of the storage controller behind the SVC, you can use the IBM Subsystem Device Driver (SDD) to access the storage. As your environment and your storage controllers change, using SDD negates the need to update device driver software as those changes occur.
The SVC can provide copy service functions between any supported controller and any other supported controller, even if the controllers are from different vendors. This capability enables you to use a lower class or cost of storage as a target for point-in-time copies or remote mirror copies.
The SVC enables you to move data around without host application interruption, which can be useful, especially when storage infrastructure is retired as new technology becomes available.
However, certain storage controllers provide additional copy service features and functions beyond the capability of the current version of SVC. If you have a requirement to use those features, you can use them, and still leverage the features that the SVC provides, by using cache-disabled image mode volumes.

7.7.8 Using Metro Mirror or Global Mirror with FlashCopy


SVC allows you to use a volume in a Metro Mirror or Global Mirror relationship as a source volume for a FlashCopy mapping. You cannot use a volume that is already in a Metro Mirror or Global Mirror relationship as a FlashCopy mapping target. When you prepare a FlashCopy mapping, the SVC puts the source volume into a temporary cache-disabled state. This temporary state adds latency to the Metro Mirror or Global Mirror relationship, because I/Os that are normally committed to SVC memory now need to be committed to the storage controller.

One method of avoiding this latency is to temporarily stop the Metro Mirror or Global Mirror relationship before preparing the FlashCopy mapping. While the Metro Mirror or Global Mirror relationship is stopped, the SVC records all changes that occur to the source volumes and applies those changes to the target when the remote copy relationship is restarted. The steps to temporarily stop the Metro Mirror or Global Mirror relationship before preparing the FlashCopy mapping are:
1. Stop each mirror relationship by using the -access option, which enables write access to the target volumes. We will need this access later.
2. Make a copy of the source volume to alternate media, for example, by using the dd command to copy the contents of the volume to tape. Another option is to use your backup tool (for example, IBM Tivoli Storage Manager) to make an image backup of the volume.
Note: Even though the source is being modified while you are copying the image, the SVC is tracking those changes. The image that you create might already contain some of the changes and is likely to have missed some of the changes as well. When the relationship is restarted, the SVC applies all changes that occurred since the relationship was stopped in step 1. After all the changes are applied, you will have a consistent target image.
3. Ship your media to the remote site and apply the contents to the targets of the Metro Mirror or Global Mirror relationship. You can mount the Metro Mirror and Global Mirror target volumes on a UNIX server and use the dd command to copy the contents of the tape to the target volume. If you used a backup tool to make an image of the volume, follow the instructions for your tool to restore the image to the target volume. Do not forget to remove the mount if this is a temporary host.
Note: It does not matter how long it takes to get your media to the remote site and perform this step. However, the sooner you can get it to the remote site and loaded, the sooner the SVC is running and maintaining the Metro Mirror and Global Mirror relationships.
4. Unmount the target volumes from your host. When you start the Metro Mirror and Global Mirror relationships later, the SVC stops write access to the volume while the mirror relationship is running.
5. Start your Metro Mirror and Global Mirror relationships. While the mirror relationship catches up, the target volume is not usable at all. As soon as it reaches the ConsistentSynchronized state, your remote volume is ready for use in a disaster.
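The sequence above can be sketched with the SVC CLI and a host-side dd. The relationship and device names here are hypothetical placeholders:

```shell
# Step 1: stop the relationship and enable write access to the target
svctask stoprcrelationship -access GM_REL1

# Step 2: on the source host, dump the source volume to tape
# (SDD path device and tape device names are examples only)
dd if=/dev/vpatha of=/dev/rmt0 bs=1M

# Step 3: at the remote site, restore the tape onto the target volume
dd if=/dev/rmt0 of=/dev/vpathb bs=1M

# Steps 4-5: unmount the target from the temporary host, then restart
# the mirror; the SVC applies the changes recorded while it was stopped
svctask startrcrelationship -primary master GM_REL1
```

This is a sketch of the procedure described in the steps, not a complete runbook; adapt device names and block sizes to your environment.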

7.7.9 Global Mirror upgrade scenarios


When upgrading cluster software where the cluster participates in one or more inter-cluster relationships, upgrade only one cluster at a time. That is, do not upgrade both clusters concurrently.
Warning: This restriction is not policed by the software upgrade process. Allow the software upgrade to complete on one cluster before it is started on the other cluster. If both clusters are upgraded concurrently, a loss of synchronization can occur; in stress situations, it can further lead to a loss of availability. Pre-existing remote copy relationships are not affected by a software upgrade that is performed correctly.
Chapter 7. Remote Copy services


Inter-cluster MM / GM compatibility cross reference


IBM provides a SAN Volume Controller Inter-cluster Metro Mirror and Global Mirror Compatibility Cross Reference. This document provides a compatibility table for inter-cluster Metro Mirror and Global Mirror relationships between SAN Volume Controller code levels. For the latest version of this document, see:
http://www.ibm.com/support/docview.wss?rs=591&uid=ssg1S1003646
Notes:
- If the clusters are at the same code level, the partnership is supported.
- If the clusters are at different code levels, check the table: select the higher code level from the column on the left side of the table, then select the partner cluster code level from the row on the top of the table.
Figure 7-19 shows the Inter-cluster MM / GM compatibility table.

Figure 7-19 Inter-cluster MM / GM Compatibility

Additional notes:
- If all clusters are running software version 5.1.0 or higher, each cluster can be partnered with up to three other clusters. This supports Multi-Cluster Mirroring (MCM).
- If a cluster is running a software level earlier than version 5.1.0, it can be partnered with only one other cluster.

Additional guidance for upgrade to SVC 5.1.0 Multi-Cluster Mirroring


The introduction of Multi-Cluster Mirroring necessitates some upgrade restrictions:
- Concurrent Code Upgrade (CCU) to 5.1.0 is supported from 4.3.1.x only.
- If the cluster is in a partnership, the partnered cluster must meet a minimum software level to allow concurrent I/O:
  - If Metro Mirror relationships are in place, the partnered cluster must be at 4.2.1 or greater (the level at which Metro Mirror started to use the UGW technology, originally introduced for Global Mirror).
  - If Global Mirror relationships are in place, the partnered cluster must be at 4.1.1 or greater (the minimum level that supports Global Mirror).
  - If no I/O is being mirrored (no active remote copy relationships), the remote cluster can be at 3.1.0.5 or greater.
- While a 5.1.0 or greater cluster is partnered with a 4.3.1 or lower cluster, it allows the creation of only one partnership, to prevent the 4.3.1 code from being impacted by the use of Multi-Cluster Mirroring. That is, multiple partnerships can only be created in a set of connected clusters that are all at 5.1.0 or greater.


7.8 Inter-cluster MM / GM source as FC target


The inclusion of this function helps in disaster recovery scenarios. You can have both the FlashCopy function and either Metro Mirror or Global Mirror operating concurrently on the same volume. There are, however, constraints on how these functions can be used together. A summary of these constraints:
- A FlashCopy mapping must be in the idle_copied state when its target volume is the secondary volume of a Metro Mirror or Global Mirror relationship.
- A FlashCopy mapping cannot be manipulated to change the contents of the target volume of that mapping when the target volume is the primary volume of a Metro Mirror or Global Mirror relationship that is actively mirroring.
- The I/O group for the FlashCopy mapping must be the same as the I/O group for the FlashCopy target volume.

Figure 7-20 shows a MM / GM and FlashCopy relationship prior to SVC 6.2.

Figure 7-20 Considerations for MM / GM and FlashCopy relationships prior to SVC 6.2

Figure 7-21 shows a MM / GM and FlashCopy relationship with SVC 6.2.


Figure 7-21 Considerations for MM / GM and FlashCopy relationships with SVC 6.2

7.9 States and steps in the GM relationship


In this section, we discuss the states of a GM relationship and the actions that allow for, or lead to, changes of state. For simplicity, we consider only single relationships, not consistency groups. Note: New GM relationships can be created either as requiring synchronization (the default) or as already synchronized.

Requiring full synchronization (after creation)


Full synchronization after creation is the default method. It is the simplest method. However, in certain environments, the available bandwidth makes this method unsuitable. The following commands are used to create and start a GM relationship of this type:
- Create the GM relationship by using mkrcrelationship (without the -sync flag).
- Start the new relationship by using startrcrelationship (without the -clean flag).
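A minimal CLI sketch of this method follows; the cluster, volume, and relationship names are placeholders:

```shell
# Create the relationship without -sync: a full background copy is required
svctask mkrcrelationship -master MASTER_VOL -aux AUX_VOL \
  -cluster REMOTE_CLUSTER -global -name GM_REL1

# Start it without -clean: the SVC copies the entire source volume
svctask startrcrelationship GM_REL1
```

The -global flag makes this a Global Mirror relationship; omit it for Metro Mirror.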

Synchronized before creation


When creating a Global Mirror relationship of this type, we specify that the source volume and target volume are in sync, that is, they contain identical data at the point at which we start the relationship, so there is no requirement for background copying between the volumes. In this method, the administrator must ensure that the source and target volumes contain identical data before creating the relationship. There are two ways to ensure that the source and target volumes contain identical data:
- Both volumes are created with the security delete (-fmtdisk) feature to make all data zero.
- A complete tape image (or other method of moving data) is copied from the source volume to the target volume prior to starting the GM relationship. With this technique, do not allow I/O on either the source or target before the relationship is established.


Then, the administrator must ensure that the following commands are issued:
- Create the GM relationship by using mkrcrelationship with the -sync flag.
- Start the new relationship by using startrcrelationship with the -clean flag.

Attention: Failure to perform these steps correctly can cause Global Mirror to report the relationship as consistent when it is not, thereby creating a data loss or data integrity exposure for hosts accessing data on the auxiliary volume.
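Under the assumption that the two volumes already contain identical data, the sequence might look like this (names are placeholders):

```shell
# Declare the volumes as already synchronized at creation time
svctask mkrcrelationship -master MASTER_VOL -aux AUX_VOL \
  -cluster REMOTE_CLUSTER -global -sync -name GM_REL2

# Start without a full background copy, per the -clean usage described above
svctask startrcrelationship -clean GM_REL2
```

As the Attention note warns, issue these flags only when the volumes are truly identical; otherwise the relationship will report consistency it does not have.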

7.9.1 Global Mirror states


The following state diagram (see Figure 7-22) details the steps and states with respect to GM relationships created as synchronized and those requiring synchronization after creation.

Figure 7-22 Global Mirror state diagram

GM relationships Synchronized States


Step 1a: The Global Mirror relationship is created with the -sync option, and the relationship enters the ConsistentStopped state. Step 2a: When a Global Mirror relationship in the ConsistentStopped state is started, it enters the ConsistentSynchronized state, provided that no updates (write I/O) have been performed on the master volume while in the ConsistentStopped state.


Otherwise, you must specify the -force option, and the Global Mirror relationship then enters the InconsistentCopying state while the background copy is started.

GM relationships Out of Synchronized States


Step 1b: The GM relationship is created without specifying that the source and target volumes are in sync, and the relationship enters the InconsistentStopped state. Step 2b: When a Global Mirror relationship in the InconsistentStopped state is started, it enters the InconsistentCopying state while the background copy is started. Step 3: When the background copy completes, the Global Mirror relationship transitions from the InconsistentCopying state to the ConsistentSynchronized state. With the relationship in the ConsistentSynchronized state, the target volume now contains a copy of the source data that can be used in a disaster recovery scenario. The ConsistentSynchronized state persists until the relationship is either stopped for administrative purposes or an error condition is detected, typically a 1920 condition.

A Stop condition (with enable access)


Step 4a: When a Global Mirror relationship in the ConsistentSynchronized state is stopped with the -access option, which enables write I/O on the auxiliary volume, the relationship enters the Idling state. (This is used in disaster recovery scenarios.) Step 4b: To enable write I/O on the auxiliary volume when the GM relationship is in the ConsistentStopped state, issue the command svctask stoprcrelationship with the -access option, and the relationship enters the Idling state.
Note: A forced start from ConsistentStopped or Idling changes the state to InconsistentCopying.
Stop or error: When a remote copy relationship is stopped (either intentionally or due to an error), a state transition is applied. For example, relationships in the ConsistentSynchronized state enter the ConsistentStopped state, and relationships in the InconsistentCopying state enter the InconsistentStopped state. If the connection is broken between the SVC clusters in a partnership, all (intercluster) relationships enter a Disconnected state.
We must be careful when restarting relationships that are in the Idling state, because auxiliary volumes in this state are capable of processing both read and write I/O. If an auxiliary volume has been written to while in the Idling state, the state of the relationship has implicitly become inconsistent. If, when restarting the relationship, we want to preserve any write I/Os that occurred on the auxiliary volume, we need to change the direction of the relationship.
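For example, taking a relationship to the Idling state with write access enabled on the auxiliary (Steps 4a and 4b) might look like this; the relationship name is a placeholder:

```shell
# Enable write I/O on the auxiliary volume; the relationship enters Idling
svctask stoprcrelationship -access GM_REL1
```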

Starting from Idle


Step 5a: When starting a Metro Mirror relationship that is in the Idling state, you must specify the -primary argument to set the copy direction. Provided that no write I/O has been performed (to either the master or auxiliary volume) while in the Idling state, the relationship enters the ConsistentSynchronized state. Step 5b: If write I/O has been performed to either the master or auxiliary volume, the -force option must be specified, and the relationship then enters the InconsistentCopying state while the background copy is started.
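The two cases above might look like this on the CLI; the relationship name is a placeholder:

```shell
# Step 5a: no writes occurred while Idling; just choose the copy direction
svctask startrcrelationship -primary master MM_REL1

# Step 5b: writes occurred on one side; -force acknowledges the loss of
# consistency, and the relationship enters InconsistentCopying
svctask startrcrelationship -primary master -force MM_REL1
```

Specify -primary aux instead to reverse the direction and preserve writes made at the auxiliary site.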


7.9.2 Disaster Recovery and GM/MM states


A secondary (target) volume does not contain data that is useful for disaster recovery purposes until the background copy is complete. Note that until this point, all new write I/O since the relationship started is processed through the background copy processes and, as such, is subject to the sequence and ordering of the MM/GM internal processes, which differ from the real-world ordering of the application.
At background copy completion, the relationship enters the ConsistentSynchronized state, and all new write I/O is replicated as it is received from the host. In a ConsistentSynchronized relationship, the primary and secondary volumes differ only in regions where writes from the host are outstanding. In this state, the target volume is also available in read-only mode. As the state diagram shows, there are two possible states that a relationship can enter from ConsistentSynchronized:
- ConsistentStopped: the state entered when a 1920 error is posted.
- Idling: Both source and target volumes have a common point-in-time consistent state, and both are made available in read/write mode. Write available means that both could be used to service host applications, but any additional writing to volumes in this state causes the relationship to become inconsistent.
Note: Moving from this point usually involves a period of inconsistent copying and therefore a loss of redundancy. Errors that occur in this state become even more critical, because an InconsistentStopped volume does not provide a known consistent level of redundancy; the target is unavailable for either read-only or read/write access.

7.9.3 State definitions


The following sections detail the states that are portrayed to the user for either Consistency Groups or relationships, and the extra information that is available in each state. We describe the major states to provide guidance regarding the available configuration commands.

InconsistentStopped
InconsistentStopped is a connected state. In this state, the master is accessible for read and write I/O, but the auxiliary is inaccessible for either read or write I/O. A copy process needs to be started to make the auxiliary consistent.
This state is entered when the relationship or Consistency Group was InconsistentCopying and has either suffered a persistent error or received a stop command that caused the copy process to stop.
A start command causes the relationship or Consistency Group to move to the InconsistentCopying state. A stop command is accepted, but has no effect.
If the relationship or Consistency Group becomes disconnected, the auxiliary side transitions to InconsistentDisconnected. The master side transitions to IdlingDisconnected.

InconsistentCopying
InconsistentCopying is a connected state. In this state, the master is accessible for read and write I/O, but the auxiliary is inaccessible for either read or write I/O.


This state is entered after a start command is issued to an InconsistentStopped relationship or Consistency Group. It is also entered when a forced start is issued to an Idling or ConsistentStopped relationship or Consistency Group.
In this state, a background copy process runs, which copies data from the master to the auxiliary volume. In the absence of errors, an InconsistentCopying relationship is active, and the copy progress increases until the copy process completes. In certain error situations, the copy progress might freeze or even regress.
A persistent error or stop command places the relationship or Consistency Group into the InconsistentStopped state. A start command is accepted, but has no effect. If the background copy process completes on a stand-alone relationship, or on all relationships for a Consistency Group, the relationship or Consistency Group transitions to the ConsistentSynchronized state.
If the relationship or Consistency Group becomes disconnected, the auxiliary side transitions to InconsistentDisconnected. The master side transitions to IdlingDisconnected.

ConsistentStopped
ConsistentStopped is a connected state. In this state, the auxiliary contains a consistent image, but it might be out-of-date with respect to the master. This state can arise when a relationship is in the ConsistentSynchronized state and experiences an error that forces a Consistency Freeze. It can also arise when a relationship is created with the CreateConsistentFlag set to true.
Normally, following an I/O error, subsequent write activity causes updates to the master, and the auxiliary is no longer synchronized (the synchronized flag is set to false). In this case, to reestablish synchronization, consistency must be given up for a period. A start command with the -force option must be used to acknowledge this situation, and the relationship or Consistency Group transitions to InconsistentCopying. Issue this command only after all of the outstanding events are repaired.
In the unusual case where the master and auxiliary are still synchronized (perhaps following a user stop, and no further write I/O was received), a start command takes the relationship to ConsistentSynchronized; no -force option is required. Also, in this unusual case, a switch command is permitted that moves the relationship or Consistency Group to ConsistentSynchronized and reverses the roles of the master and the auxiliary.
If the relationship or Consistency Group becomes disconnected, the auxiliary side transitions to ConsistentDisconnected. The master side transitions to IdlingDisconnected.
An informational status log is generated every time a relationship or Consistency Group enters the ConsistentStopped state with a status of Online. This can be configured to raise an SNMP trap and provide a trigger to automation software to consider issuing a start command following a loss of synchronization.

ConsistentSynchronized
This is a connected state. In this state, the master volume is accessible for read and write I/O, and the auxiliary volume is accessible for read-only I/O. Writes that are sent to the master volume are sent to both the master and auxiliary volumes. Either successful completion must be received for both writes, or the write must be failed to the host, or a state transition out of the ConsistentSynchronized state must occur before the write is completed to the host.
A stop command takes the relationship to the ConsistentStopped state. A stop command with the -access parameter takes the relationship to the Idling state. A switch command leaves the relationship in the ConsistentSynchronized state, but reverses the master and auxiliary roles. A start command is accepted, but has no effect. If the relationship or Consistency Group becomes disconnected, the same transitions are made as for ConsistentStopped.

Idling
Idling is a connected state. Both master and auxiliary disks operate in the master role. Consequently, both master and auxiliary disks are accessible for write I/O.
In this state, the relationship or Consistency Group accepts a start command. Global Mirror maintains a record of regions on each disk that received write I/O while Idling. This record is used to determine which areas need to be copied following a start command.
The start command must specify the new copy direction. A start command can cause a loss of consistency if either volume in any relationship has received write I/O, which is indicated by the synchronized status. If the start command leads to a loss of consistency, you must specify the -force parameter.
Following a start command, the relationship or Consistency Group transitions to ConsistentSynchronized if there is no loss of consistency, or to InconsistentCopying if there is a loss of consistency. Also, while in this state, the relationship or Consistency Group accepts a -clean option on the start command.
If the relationship or Consistency Group becomes disconnected, both sides change their state to IdlingDisconnected.

7.10 1920 errors


In this section we discuss the mechanisms that lead to remote copy relationships stopping and the recovery actions required to start them again.

7.10.1 Diagnosing and fixing 1920


The SVC generates a 1920 error message whenever a Metro Mirror or Global Mirror relationship stops due to adverse conditions that, if left unresolved, would impact the performance of foreground I/O. There are numerous causes of 1920 errors, and the condition itself may be the result of a temporary failure, such as maintenance on the intercluster link or an unexpectedly high foreground host I/O workload, or of a permanent error due to a hardware failure. It is also possible that not all relationships are affected and that multiple 1920 errors are posted.

Internal control policy and raising 1920 errors


While Global Mirror is an asynchronous remote copy service, there is some interplay between the local and remote sites. When data comes in to a local volume, work must be done to ensure that the remote copies are consistent. This work can add a delay to the local write. Normally, this delay is low. Users set maxhostdelay and gmlinktolerance to control how the software responds to these delays.
The maxhostdelay value is in milliseconds and can go up to 100. Every 10 seconds, Global Mirror takes a sample of all Global Mirror writes and determines how much delay it added. If over half of these writes are delayed by more than maxhostdelay, that sample period is marked as bad.
The software keeps a running count of bad periods. Each time there is a bad period, this count goes up by one. Each time there is a good period, this count goes down by one, to a minimum value of 0. The gmlinktolerance value dictates the maximum allowable count of bad periods. It is given in seconds, in intervals of 10 seconds; whatever gmlinktolerance is set to is divided by 10 and used as the maximum bad period count. Thus, if it is set to 300 seconds, the maximum bad period count is 30. Once this count is reached, the 1920 error fires. Bad periods do not need to be consecutive: 10 bad periods, followed by 5 good periods, followed by 10 bad periods, result in a bad period count of 15.
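As an illustration only (not SVC code), the bad-period accounting described above can be modeled in a few lines of shell; the 1920 threshold is gmlinktolerance divided by 10:

```shell
# Illustration only: model the running bad-period count.
# +1 for each bad 10 s sample, -1 for each good sample (floor 0);
# a 1920 error fires when the count reaches gmlinktolerance / 10.
gmlinktolerance=300
max_bad=$((gmlinktolerance / 10))   # 30 with the default setting
count=0
fired=no
# 10 bad periods, 5 good periods, then 10 more bad periods
samples="bad bad bad bad bad bad bad bad bad bad good good good good good bad bad bad bad bad bad bad bad bad bad"
for sample in $samples; do
    if [ "$sample" = "bad" ]; then
        count=$((count + 1))
    elif [ "$count" -gt 0 ]; then
        count=$((count - 1))
    fi
    if [ "$count" -ge "$max_bad" ]; then
        fired=yes
    fi
done
echo "count=$count fired=$fired"    # count=15 fired=no
```

Running it reproduces the example from the text: 15 bad periods accumulated, which is still below the threshold of 30, so no 1920 fires.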

Troubleshooting 1920 errors


When troubleshooting 1920 errors posted across multiple relationships, diagnose the cause of the earliest error first. Also consider whether there are other higher-priority cluster errors and put fixes in place for them, because they may be the underlying cause of the 1920 error. Diagnosis of a 1920 error is greatly assisted by SAN performance statistics. The recommended tool to gather this information is IBM Tivoli Storage Productivity Center, with a statistics monitoring interval of 5 minutes. It is also good practice to turn on the SVC's own internal statistics gathering function, IOstats. Although not as powerful as Tivoli Storage Productivity Center, it can provide valuable debug information if snaps are taken close to the time of failure.
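The internal statistics collection mentioned above can be enabled from the CLI; the interval value shown is an assumption, so check the documentation for your code level:

```shell
# Start SVC internal I/O statistics collection (interval in minutes)
svctask startstats -interval 5
```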

7.10.2 Focus areas for 1920 errors (the usual suspects)


As previously stated, the causes of 1920 errors can be numerous, and in order to fully understand the underlying reasons for posting this error, we must consider all components related to the remote copy relationship:
1. The intercluster link
2. Primary and remote storage
3. SVC nodes (inter-node communications, CPU usage, and the properties and state of the volumes associated with remote copy relationships)
To perform debug, we require information from all components in order to ascertain their health at the point of failure:
- Switch logs (confirmation of the state of the link at the point of failure)
- Storage logs
- SVC snaps from both the master and auxiliary clusters, including IOstats logs if available, and livedumps if they were triggered at the point of failure
- Tivoli Storage Productivity Center statistics (if available)


Note: Contact IBM Level 2 Support for assistance in collecting log information for 1920 errors. They can provide collection scripts that can be used during problem re-creates or deployed during proof of concept activities.

Data collection for diagnostic purposes


Successful diagnosis depends upon data collection at both clusters:
- SNAP with livedump (triggered at the point of failure)
- I/O stats running
- Tivoli Storage Productivity Center (if possible)
- Additional information and logs from other components, including:
  - Intercluster link (and switch) details: technology, bandwidth, typical measured latency on the intercluster link, distance on all links (links can take multiple paths for redundancy), whether trunking is enabled, how the link interfaces with the two SANs, whether compression on the link is enabled, and whether the link is dedicated or shared (if shared, with what, and how much resource do the other users consume?)
  - Switch write acceleration: check with IBM for compatibility and known limitations.
  - Switch compression: should be transparent, but complicates the ability to predict bandwidth.
  - Storage and application: the specific workloads at the time of the 1920 errors (this may or may not be relevant, depending upon the occurrence of the 1920 errors and the volumes involved), RAID rebuilds, and whether the 1920 errors are associated with workload peaks or scheduled backups.

Intercluster link
For diagnostic purposes, ask the following questions regarding the intercluster link:
- Was link maintenance being performed? For example, hardware or software maintenance associated with the intercluster link, such as updating firmware or adding additional capacity.
- Is the intercluster link overloaded? Indications of this can be found by statistical analysis, using I/O stats and/or Tivoli Storage Productivity Center, of inter-node communications and/or storage controller performance. Using Tivoli Storage Productivity Center, you can check the storage metrics from before the GM relationships were stopped (this may be tens of minutes, depending on gmlinktolerance). Diagnose an overloaded link using the following:
a) High response time for inter-node communication. An overloaded long-distance link causes high response times in the inter-node messages sent by the SVC. If delays persist, the messaging protocols will exhaust their tolerance elasticity, and the GM protocol will be forced to delay handling new foreground writes while waiting for resources to free up.
b) Storage metrics (before the 1920 error is posted):

Target volume write throughput approaches link bandwidth: If the write throughput on the target volume is approximately equal to your link bandwidth, it is extremely likely that your link is overloaded. Check what is driving this: does peak foreground write activity exceed the bandwidth, or does a combination of this peak I/O and background copy exceed the link capacity?
Source volume write throughput approaches link bandwidth: This write throughput represents only the I/O performed by the application hosts. If this number approaches the link bandwidth, you might need to upgrade the link's bandwidth, reduce the foreground write I/O that the application is attempting to perform, or reduce the number of remote copy relationships.
Target volume write throughput greater than source volume write throughput: This situation suggests a high level of background copy in addition to the mirrored foreground write I/O. Under these circumstances, decrease the GM partnership's background copy rate parameter to bring the combined mirrored foreground I/O and background copy I/O rate back within the remote link's bandwidth.
Storage metrics (after the 1920 error is posted): Check the source volume write throughput after the GM relationships were stopped. If the write throughput increases greatly (by 30% or more) after the GM relationships were stopped, the application host was attempting to perform more I/O than the remote link can sustain. While the GM relationships are active, the overloaded remote link causes higher response times to the application host, which in turn decreases the throughput of application host I/O at the source volume. Once the GM relationships have stopped, the application host I/O sees lower response times, and the true write throughput returns. To resolve this issue: (i) increase the remote link bandwidth, (ii) reduce the application host I/O, or (iii) reduce the number of GM relationships.
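The background copy rate referred to above is the partnership bandwidth parameter. A hedged example of reducing it follows; the partnership name and value are placeholders:

```shell
# Lower the background copy bandwidth for the partnership (value in MBps)
svctask chpartnership -bandwidth 20 REMOTE_CLUSTER
```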

Storage Controllers
Investigate the primary and remote storage controllers, starting at the remote site. If the back-end storage at the secondary cluster is overloaded, or another problem impacts the cache there, the GM protocol fails to keep up; this similarly exhausts the gmlinktolerance elasticity and has a similar impact at the primary cluster.
Are the storage controllers at the remote cluster overloaded (performing slowly)? Use Tivoli Storage Productivity Center to obtain the back-end write response time for each MDisk at the remote cluster. A response time for any individual MDisk that exhibits a sudden increase of 50 ms or more, or that is higher than 100 ms, generally indicates a problem with the back end.
Note: Any MDisk on the remote back-end storage controller that is providing poor response times may be the underlying cause of a 1920 error; that is, if the response is such that it prevents application I/O from proceeding at the rate required by the application host, the gmlinktolerance threshold is reached, causing the 1920 error.
However, if you have followed the specified back-end storage controller requirements and have been running without problems until recently, it is most likely that the error was caused by a decrease in controller performance due to maintenance actions or a hardware failure of the controller. Check the following:

- Is there an error condition on the storage controller, such as media errors, a failed physical disk, or a recovery activity (such as a RAID array rebuild) taking additional bandwidth? If there is an error, fix the problem and restart the Global Mirror relationships.
- If there is no error, consider whether the secondary controller is capable of processing the required level of application host I/O. It might be possible to improve the performance of the controller by:
  - Adding more, or faster, physical disks to a RAID array
  - Changing the RAID level of the array
  - Changing the controller's cache settings (and checking that the cache batteries are healthy, if applicable)
  - Changing other controller-specific configuration parameters

Are the storage controllers at the primary site overloaded? Analyze the performance of the primary back-end storage using the same steps that you use for the remote back-end storage. The main effect of poor performance is to limit the amount of I/O that can be performed by application hosts. Therefore, back-end storage at the primary site must be monitored regardless of Global Mirror. However, if poor performance continues for a prolonged period, a false 1920 error might be flagged. For example, the algorithms that assess the impact of running Global Mirror can incorrectly interpret slow foreground write activity, and the slow background write activity associated with it, as a consequence of running Global Mirror, and the Global Mirror relationships will stop.

SVC node hardware


Regarding the SVC node hardware, investigate the following items as possible causes of 1920 errors.

Heavily loaded primary cluster: If the nodes at the primary cluster are heavily loaded, the internal Global Mirror lock sequence messaging between nodes, which is used to assess the additional impact of running Global Mirror, will exceed the gm_max_host_delay threshold (default 5 ms). If this condition persists, a 1920 error is posted.

Note: For analysis of a 1920 error with respect to SVC node hardware and loading, contact your IBM service support representative (IBM SSR). Analysis of SVC clusters for overloading is a Level 3 engagement.

Use Tivoli Storage Productivity Center and I/O statistics to check:
- Port to local node send response time and port to local node send queue time. A high response time (greater than 1 ms) indicates a high load and a possible contribution to the 1920 error.
- SVC node CPU utilization. Utilization in excess of 50% is a higher than average load and a possible contribution.

Chapter 7. Remote Copy services


SVC Volume states


Check whether the Global Mirror target volumes are the sources of a FlashCopy mapping and whether that mapping has been in the prepared state for an extended time. Volumes in this state are cache-disabled, so their performance is impacted. Resolve this problem by starting the FlashCopy mapping, which re-enables the cache and improves the performance of the volumes and of the Global Mirror relationship.

7.10.3 Recovery
After a 1920 error has occurred, the Global Mirror auxiliary VDisks are no longer in the consistent_synchronized state. The cause of the problem must be established and fixed before the relationship can be restarted. Once restarted, the relationship needs to resynchronize. During this period, the data on the Metro Mirror or Global Mirror auxiliary VDisks on the secondary cluster is inconsistent, and the VDisks cannot be used as backup disks by your applications.

Note: If the relationship stopped in a consistent state, it is possible to use the data on the auxiliary volume at the remote cluster as a backup. Creating a FlashCopy of this volume before restarting the relationship gives additional data protection, because the FlashCopy volume maintains the current, consistent image until the Metro Mirror or Global Mirror relationship is again synchronized and back in a consistent state.

To ensure that the system has the capacity to handle the background copy load, you might want to delay restarting the Metro Mirror or Global Mirror relationship until a quiet period. If the required link capacity is not available, you might experience another 1920 error, and the Metro Mirror or Global Mirror relationship will stop in an inconsistent state.

Restarting after 1920


The following script has been produced to assist in restarting Global Mirror consistency groups and relationships that have stopped after a 1920 error was posted.
Example 7-2 Script for restarting Global Mirror

svcinfo lsrcconsistgrp -filtervalue state=consistent_stopped -nohdr -delim : | \
while IFS=: read id name mci mcn aci acn p state junk; do
  echo "Restarting group: $name ($id)"
  svctask startrcconsistgrp -force $name
  echo "Clearing errors..."
  svcinfo lserrlogbyrcconsistgrp -unfixed $name | \
  while read id type fixed snmp err_type node seq_num junk; do
    if [ "$id" != "id" ]; then
      echo "Marking $seq_num as fixed"
      svctask cherrstate -sequencenumber $seq_num
    fi
  done
done

svcinfo lsrcrelationship -filtervalue state=consistent_stopped -nohdr -delim : | \
while IFS=: read id name mci mcn mvi mvn aci acn avi avn p cg_id cg_name state junk; do
  if [ "$cg_id" == "" ]; then
    echo "Restarting relationship: $name ($id)"
    svctask startrcrelationship -force $name
    echo "Clearing errors..."
    svcinfo lserrlogbyrcrelationship -unfixed $name | \
    while read id type fixed snmp err_type node seq_num junk; do
      if [ "$id" != "id" ]; then
        echo "Marking $seq_num as fixed"
        svctask cherrstate -sequencenumber $seq_num
      fi
    done
  fi
done

7.10.4 Disabling gmlinktolerance feature


You can disable the gmlinktolerance feature by setting the gmlinktolerance value to 0 (zero). However, when disabled, the gmlinktolerance feature cannot protect applications from extended response times. It might be appropriate to disable the gmlinktolerance feature in the following circumstances:
- During SAN maintenance windows, where degraded performance is expected from SAN components and application hosts can withstand extended response times from Global Mirror VDisks.
- During periods when application hosts can tolerate extended response times and it is expected that the gmlinktolerance feature might stop the Global Mirror relationships. For example, if you are testing with an I/O generator that is configured to stress the back-end storage, the gmlinktolerance feature might detect the high latency and stop the Global Mirror relationships. Disabling the gmlinktolerance feature prevents this at the risk of exposing the test host to extended response times.
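To make the setting change concrete, the following sketch builds the CLI command as it is typically issued on the SVC 6.x command line; the helper function and the SSH usage note are illustrative only, and you should verify the chcluster parameter name against your code level.

```shell
# Sketch: build the CLI command that changes the cluster-wide
# gmlinktolerance value. A value of 0 disables the feature; 300 seconds
# is the default. (The helper function is illustrative, not an SVC command.)
build_gmlinktolerance_cmd() {
  local value="$1"
  echo "svctask chcluster -gmlinktolerance ${value}"
}

# Disable before a maintenance window, restore the default afterward,
# for example over the usual SSH session to the cluster:
#   ssh admin@cluster "$(build_gmlinktolerance_cmd 0)"
build_gmlinktolerance_cmd 0
build_gmlinktolerance_cmd 300
```

Remember to restore the previous value after the maintenance window; leaving the feature disabled removes the protection described above.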

7.10.5 Cluster error code 1920: checklist for diagnosis


Description: Metro Mirror (Remote Copy) stopped due to a persistent I/O error.

Possible cause: This error might be caused by a problem on the primary cluster (including the primary storage), a problem on the secondary cluster (including the secondary storage), or a problem on the intercluster link. The problem might be a failure of a component, a component becoming unavailable or having reduced performance due to a service action, or the performance of a component dropping to a level where the Metro Mirror or Global Mirror relationship cannot be maintained. Alternatively, the error might be caused by a change in the performance requirements of the applications using Metro Mirror or Global Mirror.

This error is reported on the primary cluster when the copy relationship has not progressed sufficiently over a period of time. Therefore, if the relationship is restarted before all of the problems are fixed, the error might be reported again when the time period next expires (the default period is five minutes).

On the primary cluster reporting the error, correct any higher priority errors.

- On the secondary cluster, review the maintenance logs to determine whether the cluster was operating with reduced capability at the time the error was reported. The reduced capability might be due to a software upgrade, hardware maintenance to a 2145 node, maintenance to a back-end disk subsystem, or maintenance to the SAN.
- On the secondary 2145 cluster, correct any errors that are not fixed.
- On the intercluster link, review the logs of each link component for any incidents that would have caused reduced capability at the time of the error. Ensure that the problems are fixed.
- On the primary and secondary clusters reporting the error, examine the internal I/O statistics.
- On the intercluster link, examine the performance of each component using an appropriate SAN performance monitoring tool to ensure that the components are operating as expected. Resolve any issues.

7.11 Monitoring Remote Copy relationships


In this section, we provide a description of monitoring your Remote Copy relationships using Tivoli Storage Productivity Center. For a detailed description of using Tivoli Storage Productivity Center, refer to Chapter 13, Monitoring on page 311.

It is important to use a SAN performance monitoring tool to ensure that all SAN components perform correctly. While a SAN performance monitoring tool is useful in any SAN environment, it is particularly important when using an asynchronous mirroring solution, such as SVC Global Mirror. Performance statistics must be gathered at the highest possible frequency. Note that if your VDisk or MDisk configuration is changed, you must restart your Tivoli Storage Productivity Center performance report to ensure that performance is correctly monitored for the new configuration. If you are using Tivoli Storage Productivity Center, monitor the following metrics:

- Global Mirror Secondary Write Lag. Monitor this metric to identify mirror delays.
- Port to Remote Node Send Response Time. This value needs to be less than 80 ms (the maximum latency supported by SVC Global Mirror). A number in excess of 80 ms suggests that the long-distance link has excessive latency, which needs to be rectified. One possibility to investigate is that the link is operating at its maximum bandwidth.
- Sum of Port to Local Node Send Response Time and Port to Local Node Send Queue Time. This sum must be less than 1 ms for the primary cluster. A number in excess of 1 ms might indicate that an I/O Group is reaching its I/O throughput limit, which can limit performance.
- CPU Utilization Percentage. CPU utilization must be below 50%.
- Sum of Backend Write Response Time and Write Queue Time for Global Mirror MDisks at the remote cluster. This sum needs to be less than 100 ms. A longer response time can indicate that the storage controller is overloaded. If the response time for a specific storage controller is outside of its specified operating range, investigate for the same reason.
- Sum of Backend Write Response Time and Write Queue Time for Global Mirror MDisks at the primary cluster. This sum must also be less than 100 ms. If the response time is greater than 100 ms, application hosts might see extended response times if the SVC's cache becomes full.
- Write Data Rate for Global Mirror MDisk groups at the remote cluster. This data rate indicates the amount of data that is being written by Global Mirror. If this number approaches either the intercluster link bandwidth or the storage controller throughput limit, be aware that further increases can cause overloading of the system, and monitor this number appropriately.
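The 100 ms guideline lends itself to a simple post-processing check. The sketch below scans an exported per-MDisk CSV for rows that break the threshold; the sample file name, its columns, and the position of the back-end write response time value are assumptions, so adjust them to match your actual export.

```shell
# Create a tiny sample export (header plus two rows) purely to illustrate.
cat > /tmp/mdisk_stats_sample.csv <<'EOF'
mdisk,interval,read_rt_ms,write_iops,backend_write_rt_ms
mdisk0,10:00,4.1,820,12.5
mdisk1,10:00,6.0,640,142.7
EOF

# Flag any MDisk whose assumed back-end write response time (column 5)
# exceeds the 100 ms guideline.
awk -F, 'NR > 1 && $5 + 0 > 100 { print "WARN: " $1 " at " $5 " ms" }' \
  /tmp/mdisk_stats_sample.csv
```

For the sample data, only mdisk1 is flagged.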

Note: IBM support has a number of automated systems that support analysis of Tivoli Storage Productivity Center data. These systems rely on the default naming conventions (file names) being used. The default names for Tivoli Storage Productivity Center files are StorageSubsystemPerformanceByXXXXXX.csv, where XXXXXX is IOGroup, ManagedDiskGroup, ManagedDisk, Node, or Volume.

Hints and Tips for Tivoli Storage Productivity Center stats collection
Analysis requires either Tivoli Storage Productivity Center statistics (CSV) or SVC raw statistics (XML). You can export statistics from your Tivoli Storage Productivity Center instance. Because these files get large very quickly, you might want to limit their size. For instance, you can filter the statistics files so that individual records that are below a certain threshold are not exported.
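As one way to keep exports small, the following sketch drops data rows whose activity falls below a threshold while preserving the header line. The sample file, its columns, and the choice of column 4 as the activity metric are illustrative assumptions.

```shell
# Tiny sample export, purely to illustrate the filtering idea.
cat > /tmp/tpc_export_sample.csv <<'EOF'
volume,interval,read_iops,write_iops
vdisk10,10:00,0.2,0.1
vdisk11,10:00,350.0,120.4
EOF

# Keep the header (NR == 1) plus only rows with meaningful activity
# in the assumed write-rate column (column 4).
awk -F, 'NR == 1 || $4 + 0 >= 1.0' /tmp/tpc_export_sample.csv
```

For the sample data, the nearly idle vdisk10 row is dropped and vdisk11 is kept.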

7521Hosts.fm

Chapter 8.

Hosts
This chapter describes best practices for monitoring host systems attached to the SAN Volume Controller (SVC). A host system is an Open Systems computer that is connected to the switch through a Fibre Channel (FC) interface. The most important part of tuning, troubleshooting, and performance considerations for a host attached to an SVC is the host itself. There are three major areas of concern:
- Using multipathing and bandwidth (the physical capability of the SAN and back-end storage)
- Understanding how your host performs I/O and the types of I/O
- Using measurement and test tools to determine host performance and for tuning
This chapter supplements the IBM System Storage SAN Volume Controller V6.2.0 Information Center and guides at:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S4000968
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp

Copyright IBM Corp. 2011. All rights reserved.


8.1 Configuration recommendations


There are basic configuration recommendations when using the SVC to manage storage that is connected to any host. The considerations include how many paths through the fabric are allocated to the host, how many host ports to use, how to spread the hosts across I/O Groups, logical unit number (LUN) mapping, and the correct size of virtual disks (volumes) to use.

8.1.1 Recommended host levels and host object name


When configuring a new host on the SVC, the first step is to determine the recommended operating system, driver, firmware, and supported host bus adapters in order to prevent unanticipated problems due to untested levels. Consult the following document for the recommended levels before bringing a new host into the SVC:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797
When creating the host, use the actual hostname from the host as the host object name in the SVC to aid in future configuration updates or problem determination. If multiple hosts share the exact same set of disks, they can be created as a single host object with multiple ports (WWPNs) or as multiple host objects.
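For example, a small helper can build the mkhost invocation from the server's hostname and its WWPNs, so the host object name always matches the server. The host name and WWPN values below are hypothetical, and you should check the mkhost syntax against your SVC code level.

```shell
# Sketch: emit an SVC mkhost command whose -name matches the server's
# hostname. WWPNs are joined with ":" as the CLI expects a colon-separated
# list. (The helper function itself is illustrative, not an SVC command.)
make_mkhost_cmd() {
  local hostname="$1"
  shift
  local wwpns
  wwpns=$(IFS=: ; echo "$*")   # join remaining arguments with ":"
  echo "svctask mkhost -name ${hostname} -hbawwpn ${wwpns}"
}

# Hypothetical host with two HBA ports:
make_mkhost_cmd aixhost01 10000000C938CFDF 10000000C938D01F
```

The emitted command can then be pasted into, or piped over SSH to, the cluster CLI session.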

8.1.2 The number of paths


From general experience, we have determined that it is best to limit the total number of paths from any host to the SVC. We recommend that you limit the total number of paths that the multipathing software on each host manages to four paths, even though the maximum supported is eight paths. Following these rules solves many issues with high port fanouts, fabric state changes, and host memory management, and improves performance. Refer to the following Web site for the latest maximum host configurations and restrictions:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003800
The major reasons to limit the number of paths available to a host from the SVC are error recovery, failover, and failback. The overall time for handling errors by a host is significantly reduced. Additionally, resource usage within the host is greatly reduced each time that you remove a path from multipathing management. Two-path configurations have just one path to each node, which is a supported configuration but is not recommended for most configurations. In previous SVC releases, host configuration information was available in the host attachment guide:
http://www-1.ibm.com/support/docview.wss?rs=591&context=STCCCXR&context=STCCCYH&dc=DA400&q1=english&q2=-Japanese&uid=ssg1S7002159&loc=en_US&cs=utf-8&lang=en
For release 6.1 and higher, this information is consolidated into the IBM System Storage SAN Volume Controller V6.2.0 Information Center:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp
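The path arithmetic follows directly from the zoning: the path count per volume is the number of host ports multiplied by the number of SVC target ports each host port is zoned to. A sketch, assuming the recommended two host ports, each zoned to one target port per node in a two-node I/O Group:

```shell
# Paths per volume = host ports x zoned SVC ports per node x nodes per I/O Group.
host_ports=2
zoned_svc_ports_per_node=1
nodes_in_iogrp=2

echo $(( host_ports * zoned_svc_ports_per_node * nodes_in_iogrp ))   # prints 4
```

Zoning each of four host ports to one target port per node doubles the first factor and yields the eight-path maximum discussed above.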

We have measured the effect of multipathing on performance. The differences in performance are generally minimal, but they can reduce performance by almost 10% for specific workloads. These numbers were produced with an AIX host running IBM Subsystem Device Driver (SDD) against the SVC. The host was tuned specifically for performance by adjusting queue depths and buffers. We tested a range of reads and writes, random and sequential, cache hits and misses, at 512 byte, 4 KB, and 64 KB transfer sizes. Table 8-1 shows the effects of multipathing.
Table 8-1   4.3.0 Effect of multipathing on write performance

R/W test                           Four paths    Eight paths   Difference
Write Hit 512 b Sequential IOPS    81 877        74 909        -8.6%
Write Miss 512 b Random IOPS       60 510.4      57 567.1      -5.0%
70/30 R/W Miss 4K Random IOPS      130 445.3     124 547.9     -5.6%
70/30 R/W Miss 64K Random MBps     1 810.8138    1 834.2696    1.3%
50/50 R/W Miss 4K Random IOPS      97 822.6      98 427.8      0.6%
50/50 R/W Miss 64K Random MBps     1 674.5727    1 678.1815    0.2%

While these measurements were taken with 4.3.0 SVC code, the effect of the number of paths on performance does not change with subsequent SVC versions.

8.1.3 Host ports


The general recommendation for utilizing host ports connected to the SVC is to limit the number of physical ports to two ports on two different physical adapters. Each of these ports will be zoned to one target port in each SVC node, thus limiting the number of total paths to four, preferably on totally separate redundant SAN fabrics. If four host ports are preferred for maximum redundant paths, the requirement is to zone each host adapter to one SVC target port on each node (for a maximum of eight paths). The benefits of path redundancy are outweighed by the host memory resource utilization required for more paths. Use one host object to represent a cluster of hosts and use multiple worldwide port names (WWPNs) to represent the ports from all the hosts that will share the same set of volumes. Best practice: Though it is supported in theory, we strongly recommend that you keep Fibre Channel tape and Fibre Channel disks on separate host bus adapters (HBAs). These devices have two extremely different data patterns when operating in their optimum mode, and the switching between them can cause undesired overhead and performance slowdown for the applications.


8.1.4 Port masking


You can use a port mask to control the node target ports that a host can access. The port mask applies to logins from the host port that are associated with the host object. You can use this capability to simplify the switch zoning by limiting the SVC ports within the SVC configuration, rather than using direct one-to-one zoning within the switch, which simplifies zone management. The port mask is a four-bit field that applies to all nodes in the cluster for the particular host. For example, a port mask of 0001 allows a host to log in to a single port on every SVC node in the cluster, if the switch zone also includes both host and SVC node ports.
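To make the bit positions concrete, the following sketch decodes a four-bit mask, assuming (consistent with the 0001 example above) that the rightmost bit corresponds to node port 1; confirm the bit ordering against your SVC documentation before relying on it.

```shell
# Sketch: decode a 4-character SVC host port mask, assuming the rightmost
# character is node port 1. A "1" permits host logins on that port.
decode_port_mask() {
  local mask="$1" port bit
  for port in 1 2 3 4; do
    # Character (5 - port) of the 4-character mask is the bit for this port.
    bit=$(printf '%s' "$mask" | cut -c$((5 - port)))
    if [ "$bit" = "1" ]; then
      echo "node port $port: logins allowed"
    else
      echo "node port $port: logins blocked"
    fi
  done
}

decode_port_mask 0001
```

For the mask 0001 this reports port 1 as allowed and ports 2 through 4 as blocked.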

8.1.5 Host to I/O Group mapping


An I/O Group consists of two SVC nodes that share management of volumes within a cluster. The recommendation is to use a single I/O Group (iogrp) for all volumes allocated to a particular host. This recommendation has many benefits. One major benefit is the minimization of port fanouts within the SAN fabric. Another benefit is to maximize the potential host attachments to the SVC, because maximums are based on I/O Groups. A third benefit is within the host itself, having fewer target ports to manage. The number of host ports and host objects allowed per I/O Group depends upon the switch fabric type. Refer to the maximum configurations document for these maximums:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003800
Occasionally, an extremely powerful host can benefit from spreading its volumes across I/O Groups for load balancing. Our recommendation is to start with a single I/O Group and use the performance monitoring tools, such as TotalStorage Productivity Center (TPC), to determine whether the host is I/O Group-limited. If additional I/O Groups are needed for the bandwidth, it is possible to use more host ports to allocate to the other I/O Group. For example, start with two HBAs zoned to one I/O Group. To add bandwidth, add two more HBAs and zone them to the other I/O Group. The host object in the SVC will contain both sets of HBAs. The load can be balanced by selecting which host volumes are allocated to each I/O Group. Because volumes are allocated to only a single I/O Group, the load will then be spread across both I/O Groups based on the volume allocation spread.

8.1.6 Volume size as opposed to quantity


In general, host resources, such as memory and processing time, are consumed by each storage LUN that is mapped to the host. For each additional path, more memory can be used, and a portion of additional processing time is also required. You can control this effect by using fewer, larger LUNs rather than many small LUNs; however, doing so might require tuning of queue depths and I/O buffers to support the configuration efficiently. If a host does not have tunable parameters (for example, Windows), it does not benefit from large volume sizes. AIX greatly benefits from larger volumes, with a smaller number of volumes and paths presented to it.

8.1.7 Host volume mapping


When you create a host mapping, the host ports that are associated with the host object can see the LUN that represents the volume on up to eight Fibre Channel ports (the four ports on each node in an I/O Group). Nodes always present the logical unit (LU) that represents a specific volume with the same LUN on all ports in an I/O Group.


This LUN mapping is called the Small Computer System Interface (SCSI) ID, and the SVC software automatically assigns the next available ID if none is specified. There is also a unique identifier on each volume, called the LUN serial number. The best practice recommendation is to allocate the SAN boot OS volume as the lowest SCSI ID (zero for most hosts) and then allocate the various data disks. While not required, if you share a volume among multiple hosts, control the SCSI IDs so that they are identical across the hosts. This consistency ensures ease of management at the host level. If you are using image mode to migrate a host into the SVC, allocate the volumes in the same order that they were originally assigned on the host from the back-end storage.

An invocation example:

svcinfo lshostvdiskmap -delim :

The resulting output:

id:name:SCSI_id:vdisk_id:vdisk_name:wwpn:vdisk_UID
2:host2:0:10:vdisk10:0000000000000ACA:6005076801958001500000000000000A
2:host2:1:11:vdisk11:0000000000000ACA:6005076801958001500000000000000B
2:host2:2:12:vdisk12:0000000000000ACA:6005076801958001500000000000000C
2:host2:3:13:vdisk13:0000000000000ACA:6005076801958001500000000000000D
2:host2:4:14:vdisk14:0000000000000ACA:6005076801958001500000000000000E

For example, VDisk 10 has a unique device identifier (UID) of 6005076801958001500000000000000A, while the SCSI ID that host2 uses for access is 0.
svcinfo lsvdiskhostmap -delim : EEXCLS_HBin01
id:name:SCSI_id:host_id:host_name:wwpn:vdisk_UID
950:EEXCLS_HBin01:14:109:HDMCENTEX1N1:10000000C938CFDF:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:109:HDMCENTEX1N1:10000000C938D01F:600507680191011D4800000000000466
950:EEXCLS_HBin01:13:110:HDMCENTEX1N2:10000000C938D65B:600507680191011D4800000000000466
950:EEXCLS_HBin01:13:110:HDMCENTEX1N2:10000000C938D3D3:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:111:HDMCENTEX1N3:10000000C938D615:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:111:HDMCENTEX1N3:10000000C938D612:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:112:HDMCENTEX1N4:10000000C938CFBD:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:112:HDMCENTEX1N4:10000000C938CE29:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:113:HDMCENTEX1N5:10000000C92EE1D8:600507680191011D4800000000000466
950:EEXCLS_HBin01:14:113:HDMCENTEX1N5:10000000C92EDFFE:600507680191011D4800000000000466

If you are using IBM multipathing software (IBM Subsystem Device Driver (SDD) or SDDDSM), the datapath query device command shows the vdisk_UID (unique identifier) and therefore enables easier management of volumes. The SDDPCM equivalent command is pcmpath query device.
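The "identical SCSI ID across hosts" recommendation can be verified mechanically from saved lsvdiskhostmap output. A sketch, using the colon-delimited format shown above, where field 3 is the SCSI_id; the temporary file name is illustrative:

```shell
# Sample of "svcinfo lsvdiskhostmap -delim :" output for one shared volume,
# taken from two of the hosts in the listing above (SCSI IDs 14 and 13).
cat > /tmp/vdiskmap_sample.txt <<'EOF'
id:name:SCSI_id:host_id:host_name:wwpn:vdisk_UID
950:EEXCLS_HBin01:14:109:HDMCENTEX1N1:10000000C938CFDF:600507680191011D4800000000000466
950:EEXCLS_HBin01:13:110:HDMCENTEX1N2:10000000C938D65B:600507680191011D4800000000000466
EOF

# Count the distinct SCSI IDs; more than one means the hosts disagree.
distinct=$(awk -F: 'NR > 1 && !seen[$3]++ { n++ } END { print n }' /tmp/vdiskmap_sample.txt)
if [ "$distinct" -eq 1 ]; then
  echo "SCSI IDs consistent"
else
  echo "SCSI ID mismatch across hosts"
fi
```

For this sample the check reports a mismatch, because hosts HDMCENTEX1N1 and HDMCENTEX1N2 access the shared volume with different SCSI IDs.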

Host mapping from more than one I/O Group


The SCSI ID field in the host mapping might not be unique for a volume for a host, because it does not completely define the uniqueness of the LUN. The target port is also used as part of the identification. If two I/O Groups of volumes are assigned to a host port, one set will start with SCSI ID 0 and then increment (given the default), and the SCSI IDs for the second I/O Group will also start at zero and then increment by default. Refer to Example 8-1 on page 196 for a sample of this type of host map. Volume s-0-6-4 and volume s-1-8-2 both have a SCSI ID of 1, yet they have different LUN serial numbers.
Example 8-1 Host mapping for one host from two I/O Groups

IBM_2145:ITSOCL1:admin>svcinfo lshostvdiskmap senegal
id name    SCSI_id vdisk_id vdisk_name wwpn             vdisk_UID
0  senegal 1       60       s-0-6-4    210000E08B89CCC2 60050768018101BF28000000000000A8
0  senegal 2       58       s-0-6-5    210000E08B89CCC2 60050768018101BF28000000000000A9
0  senegal 3       57       s-0-5-1    210000E08B89CCC2 60050768018101BF28000000000000AA
0  senegal 4       56       s-0-5-2    210000E08B89CCC2 60050768018101BF28000000000000AB
0  senegal 5       61       s-0-6-3    210000E08B89CCC2 60050768018101BF28000000000000A7
0  senegal 6       36       big-0-1    210000E08B89CCC2 60050768018101BF28000000000000B9
0  senegal 7       34       big-0-2    210000E08B89CCC2 60050768018101BF28000000000000BA
0  senegal 1       40       s-1-8-2    210000E08B89CCC2 60050768018101BF28000000000000B5
0  senegal 2       50       s-1-4-3    210000E08B89CCC2 60050768018101BF28000000000000B1
0  senegal 3       49       s-1-4-4    210000E08B89CCC2 60050768018101BF28000000000000B2
0  senegal 4       42       s-1-4-5    210000E08B89CCC2 60050768018101BF28000000000000B3
0  senegal 5       41       s-1-8-1    210000E08B89CCC2 60050768018101BF28000000000000B4

Example 8-2 shows the datapath query device output of this Windows host. Note that the order of the two I/O Groups' volumes is reversed from the host map. Volume s-1-8-2 is first, followed by the rest of the LUNs from the second I/O Group, then volume s-0-6-4 and the rest of the LUNs from the first I/O Group. Most likely, Windows discovered the second set of LUNs first. However, the relative order within an I/O Group is maintained.
Example 8-2 Using datapath query device for the host map

C:\Program Files\IBM\Subsystem Device Driver>datapath query device

Total Devices : 12

DEV#:   0  DEVICE NAME: Disk1 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B5
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk1 Part0    OPEN   NORMAL        0        0
    1    Scsi Port2 Bus0/Disk1 Part0    OPEN   NORMAL     1342        0
    2    Scsi Port3 Bus0/Disk1 Part0    OPEN   NORMAL        0        0
    3    Scsi Port3 Bus0/Disk1 Part0    OPEN   NORMAL     1444        0

DEV#:   1  DEVICE NAME: Disk2 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B1
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk2 Part0    OPEN   NORMAL     1405        0
    1    Scsi Port2 Bus0/Disk2 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk2 Part0    OPEN   NORMAL     1387        0
    3    Scsi Port3 Bus0/Disk2 Part0    OPEN   NORMAL        0        0

DEV#:   2  DEVICE NAME: Disk3 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B2
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk3 Part0    OPEN   NORMAL     1398        0
    1    Scsi Port2 Bus0/Disk3 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk3 Part0    OPEN   NORMAL     1407        0
    3    Scsi Port3 Bus0/Disk3 Part0    OPEN   NORMAL        0        0

DEV#:   3  DEVICE NAME: Disk4 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B3
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk4 Part0    OPEN   NORMAL     1504        0
    1    Scsi Port2 Bus0/Disk4 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk4 Part0    OPEN   NORMAL     1281        0
    3    Scsi Port3 Bus0/Disk4 Part0    OPEN   NORMAL        0        0

DEV#:   4  DEVICE NAME: Disk5 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B4
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk5 Part0    OPEN   NORMAL        0        0
    1    Scsi Port2 Bus0/Disk5 Part0    OPEN   NORMAL     1399        0
    2    Scsi Port3 Bus0/Disk5 Part0    OPEN   NORMAL        0        0
    3    Scsi Port3 Bus0/Disk5 Part0    OPEN   NORMAL     1391        0

DEV#:   5  DEVICE NAME: Disk6 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000A8
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk6 Part0    OPEN   NORMAL     1400        0
    1    Scsi Port2 Bus0/Disk6 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk6 Part0    OPEN   NORMAL     1390        0
    3    Scsi Port3 Bus0/Disk6 Part0    OPEN   NORMAL        0        0

DEV#:   6  DEVICE NAME: Disk7 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000A9
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk7 Part0    OPEN   NORMAL     1379        0
    1    Scsi Port2 Bus0/Disk7 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk7 Part0    OPEN   NORMAL     1412        0
    3    Scsi Port3 Bus0/Disk7 Part0    OPEN   NORMAL        0        0

DEV#:   7  DEVICE NAME: Disk8 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000AA
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk8 Part0    OPEN   NORMAL        0        0
    1    Scsi Port2 Bus0/Disk8 Part0    OPEN   NORMAL     1417        0
    2    Scsi Port3 Bus0/Disk8 Part0    OPEN   NORMAL        0        0
    3    Scsi Port3 Bus0/Disk8 Part0    OPEN   NORMAL     1381        0

DEV#:   8  DEVICE NAME: Disk9 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000AB
============================================================================
Path#             Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk9 Part0    OPEN   NORMAL        0        0
    1    Scsi Port2 Bus0/Disk9 Part0    OPEN   NORMAL     1388        0
    2    Scsi Port3 Bus0/Disk9 Part0    OPEN   NORMAL        0        0
    3    Scsi Port3 Bus0/Disk9 Part0    OPEN   NORMAL     1413        0

DEV#:   9  DEVICE NAME: Disk10 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000A7
=============================================================================
Path#              Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk10 Part0    OPEN   NORMAL     1293        0
    1    Scsi Port2 Bus0/Disk10 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk10 Part0    OPEN   NORMAL     1477        0
    3    Scsi Port3 Bus0/Disk10 Part0    OPEN   NORMAL        0        0

DEV#:  10  DEVICE NAME: Disk11 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000B9
=============================================================================
Path#              Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk11 Part0    OPEN   NORMAL        0        0
    1    Scsi Port2 Bus0/Disk11 Part0    OPEN   NORMAL    59981        0
    2    Scsi Port3 Bus0/Disk11 Part0    OPEN   NORMAL        0        0
    3    Scsi Port3 Bus0/Disk11 Part0    OPEN   NORMAL    60179        0

DEV#:  11  DEVICE NAME: Disk12 Part0   TYPE: 2145   POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000BA
=============================================================================
Path#              Adapter/Hard Disk    State   Mode     Select   Errors
    0    Scsi Port2 Bus0/Disk12 Part0    OPEN   NORMAL    28324        0
    1    Scsi Port2 Bus0/Disk12 Part0    OPEN   NORMAL        0        0
    2    Scsi Port3 Bus0/Disk12 Part0    OPEN   NORMAL    27111        0
    3    Scsi Port3 Bus0/Disk12 Part0    OPEN   NORMAL        0        0

Sometimes, a host might discover everything correctly at its initial configuration, but it does not keep up with dynamic changes in the configuration. The SCSI ID is therefore extremely important. For more discussion about this topic, refer to 8.2.4, Dynamic reconfiguration on page 201.

8.1.8 Server adapter layout


If your host system has multiple internal I/O busses, place the two adapters used for SVC cluster access on two different I/O busses to maximize availability and performance.

8.1.9 Availability as opposed to error isolation


It is important to balance availability through multiple paths across the SAN to the two SVC nodes against error isolation. Normally, people add more paths to a SAN to increase availability, which leads to the conclusion that you want all four ports in each node zoned to each port in the host. However, our experience has shown that it is better to limit the number of paths so that the error recovery software within a switch or a host can manage the loss of paths quickly and efficiently. Therefore, it is beneficial to keep the fan-out from the host port through the SAN to an SVC port as close to one-to-one as possible. Limit each host port to a different set of SVC ports on each node, which keeps the errors within a host isolated to a single adapter if the errors are coming from a single SVC port or from one fabric, making isolation to a failing port or switch easier.
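A quick way to see why limited zoning matters is to count the paths that each volume presents to the host. This sketch uses assumed values (2 host ports, an I/O Group of 2 nodes with 4 ports per node) and simple arithmetic; it is illustrative only:

```shell
# Assumed configuration: 2 host ports, 2 SVC nodes, 4 ports per node.
host_ports=2
node_ports=4
nodes=2

# Zoning every node port to every host port:
all_paths=$(( host_ports * node_ports * nodes ))

# Limiting each host port to one port on each node (one-to-one fan-out):
limited_paths=$(( host_ports * 1 * nodes ))

echo "all-ports zoning: $all_paths paths per volume"
echo "limited zoning:   $limited_paths paths per volume"
```

With four paths per volume instead of sixteen, the multipathing error recovery has far fewer combinations to walk through when a port or fabric fails.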

8.2 Host pathing


Each host mapping associates a volume with a host object and allows all HBA ports in the host object to access the volume. You can map a volume to multiple host objects. When a mapping is created, multiple paths might exist across the SAN fabric from the hosts to the SVC nodes that are presenting the volume. Most operating systems present each path to a volume as a separate storage device. The SVC, therefore, requires that multipathing software is running on the host. The multipathing software manages the many paths that are available to the volume and presents a single storage device to the operating system.

8.2.1 Preferred path algorithm


I/O traffic for a particular volume is, at any one time, managed exclusively by the nodes in a single I/O Group. The distributed cache in the SAN Volume Controller is two-way. When a volume is created, a preferred node is chosen; you can control this choice at creation time. The owner node for a volume is the preferred node when both nodes are available. When I/O is performed to a volume, the node that processes the I/O duplicates the data onto the partner node in the I/O Group. A write from the SVC node to the back-end managed disk (MDisk) is only destaged via the owner node (normally, the preferred node). Therefore, when a new write or read comes in on the non-owner node, it has to send extra messages to the owner node to check whether it has the data in cache, or whether it is in the middle of destaging that data. Performance is therefore enhanced by accessing the volume through the preferred node. IBM multipathing software (SDD, SDDPCM, or SDDDSM) checks the preferred path setting during initial configuration for each volume and manages path usage accordingly:
   Non-preferred paths: Failover only
   Preferred paths: Chosen multipath algorithm (default: load balance)

Chapter 8. Hosts

8.2.2 Path selection


There are many algorithms used by multipathing software to select the paths for an individual I/O to each volume. For enhanced performance with most host types, the recommendation is to load balance the I/O across only the preferred node paths under normal conditions. The load across the host adapters and the SAN paths is balanced by alternating the preferred node choice for each volume. Take care when allocating volumes with the SVC Console GUI to ensure adequate dispersion of the preferred node among the volumes. If the preferred node is offline, all I/O goes through the non-preferred node in write-through mode. Certain multipathing software, such as Veritas DMP, does not utilize the preferred node information, so it might balance the I/O load for a host differently. Table 8-2 shows the effect on response time of using the preferred node contrasted with the non-preferred node for 16 devices with read misses. The effect is significant.
Table 8-2   The 16 device random 4 KB read miss response time (4.2 nodes, usecs)

   Preferred node (owner)    Non-preferred node    Delta
   18 227                    21 256                3 029

Table 8-3 shows the corresponding change in throughput for the same 16 device random 4 KB read miss case, using the preferred node as opposed to a non-preferred node.
Table 8-3   The 16 device random 4 KB read miss throughput (IOPS)

   Preferred node (owner)    Non-preferred node    Delta
   105 274.3                 90 292.3              14 982

In Table 8-4, we show the effect of using the non-preferred paths compared to the preferred paths on read performance.
Table 8-4   Random (1 TB) 4 KB read response time (4.1 nodes, usecs)

   Preferred node (owner)    Non-preferred node    Delta
   5 074                     5 147                 73

Table 8-5 shows the effect of using non-preferred nodes on write performance.
Table 8-5   Random (1 TB) 4 KB write response time (4.2 nodes, usecs)

   Preferred node (owner)    Non-preferred node    Delta
   5 346                     5 433                 87

IBM SDD software, SDDDSM software, and SDDPCM software recognize the preferred nodes and utilize the preferred paths.
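The relative cost of non-preferred access can be read straight off the tables. This sketch recomputes the percentage penalties from the response times in Tables 8-2, 8-4, and 8-5 (integer shell arithmetic, so the results are truncated):

```shell
# Response-time values copied from Tables 8-2, 8-4, and 8-5 (usecs).
miss_pct=$(( (21256 - 18227) * 100 / 18227 ))   # Table 8-2: read-miss
read_pct=$(( (5147 - 5074) * 100 / 5074 ))      # Table 8-4: read
write_pct=$(( (5433 - 5346) * 100 / 5346 ))     # Table 8-5: write

echo "read-miss penalty: ${miss_pct}%"
echo "read penalty:      ${read_pct}%"
echo "write penalty:     ${write_pct}%"
```

In other words, the cache-miss case pays roughly a 16% response-time penalty on the non-preferred node, while the cached cases pay only about 1%.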

8.2.3 Path management


The SVC design is based on multiple path access from the host to both SVC nodes. Multipathing software is expected to retry down multiple paths upon detection of an error. We recommend that you actively check the multipathing software display of paths that are available and currently in use, both periodically and just before any SAN maintenance or software upgrades. IBM multipathing software (SDD, SDDPCM, and SDDDSM) makes this monitoring easy through the datapath query device or pcmpath query device commands.
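As a sketch of what such a periodic check can look like, this fragment counts the OPEN/NORMAL paths per device in a saved datapath query device listing. The sample lines are abridged from the examples in this chapter; the awk filter is an assumption about the exact output format of your SDD level:

```shell
# Count open, normal paths per device from captured SDD output.
open_summary=$(awk '
  $1 == "DEV#:"      { dev = $2 }
  $0 ~ /OPEN NORMAL/ { open[dev]++ }
  END { for (d in open) printf "DEV %s: %d open paths\n", d, open[d] }
' <<'EOF'
DEV#: 0 DEVICE NAME: Disk1 Part0 TYPE: 2145 POLICY: OPTIMIZED
0 Scsi Port2 Bus0/Disk1 Part0 OPEN NORMAL 0 0
1 Scsi Port2 Bus0/Disk1 Part0 OPEN NORMAL 1873173 0
2 Scsi Port3 Bus0/Disk1 Part0 OPEN NORMAL 0 0
3 Scsi Port3 Bus0/Disk1 Part0 CLOSE OFFLINE 0 0
EOF
)
echo "$open_summary"
```

A device reporting fewer open paths than expected is worth investigating before, not after, the maintenance starts.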

Fast node reset


There was a major improvement in SVC 4.2 in software error recovery. Fast node reset restarts a node following a software failure before the host fails I/O to applications. This node reset time improved from several minutes for standard node reset in previous SVC versions to about thirty seconds for SVC 4.2.

Pre-SVC 4.2.0 node reset behavior


When an SVC node is reset, it will disappear from the fabric. So from a host perspective, a few seconds of non-response from the SVC node will be followed by receipt of a registered state change notification (RSCN) from the switch. Any query to the switch name server will find that the SVC ports for the node are no longer present. The SVC ports/node will be gone from the name server for around 60 seconds.

SVC 4.2.0 node reset behavior


When an SVC node is reset, the node ports will not disappear from the fabric. Instead, the node will keep the ports alive. So from a host perspective, SVC will simply stop responding to any SCSI traffic. Any query to the switch name server will find that the SVC ports for the node are still present, but any FC login attempts (for example, PLOGI) will be ignored. This state will persist for around 30-45 seconds. This improvement is a major enhancement for host path management of potential double failures, such as a software failure of one node while the other node in the I/O Group is being serviced, and software failures during a code upgrade. This new feature will also enhance path management when host paths are misconfigured and include only a single SVC node.

8.2.4 Dynamic reconfiguration


Many users want to dynamically reconfigure the storage connected to their hosts. The SVC gives you this capability by virtualizing the storage behind the SVC so that a host sees only the SVC volumes presented to it. The host can then add or remove storage dynamically, and capacity can be reallocated by using volume-to-MDisk changes. After you decide to virtualize your storage behind an SVC, an image mode migration is used to move the existing back-end storage behind the SVC. This process is simple and seamless, but it requires the host to be gracefully shut down. The SAN must then be rezoned so that the SVC appears to the back-end storage as a host, the back-end storage LUNs must be mapped to the SVC, and the SAN rezoned so that the SVC presents storage to the host. The host is brought back up with the appropriate multipathing software, and the LUNs are now managed as SVC image mode volumes. These volumes can then be migrated to new storage or moved to striped storage at any time in the future with no host impact whatsoever. There are times, however, when users want to change the SVC volume presentation to the host. The process to change the SVC volume presentation to the host dynamically is

error-prone and not recommended. However, it is possible to change the SVC volume presentation to the host if you keep several key issues in mind. Hosts do not dynamically reprobe storage unless prompted by an external change or unless the user manually causes rediscovery. Most operating systems do not notice a change in a disk allocation automatically, because saved device database information, such as the Windows registry or the AIX Object Data Manager (ODM) database, is used.

Add new volumes or paths


Normally, adding new storage to a host and running the discovery methods (such as cfgmgr) are safe, because there is no old, leftover information that is required to be removed. Simply scan for new disks or run cfgmgr several times if necessary to see the new disks.

Removing volumes and then later allocating new volumes to the host
The problem surfaces when a user removes a host map on the SVC during the process of removing a volume. After a volume is unmapped from the host, the device becomes unavailable and the SVC reports that there is no such disk on this port. Usage of datapath query device after the removal will show a closed, offline, invalid, or dead state as shown here: Windows host: DEV#: 0 DEVICE NAME: Disk1 Part0 TYPE: 2145 POLICY: OPTIMIZED SERIAL: 60050768018201BEE000000000000041 ============================================================================ Path# Adapter/Hard Disk State Mode Select Errors 0 Scsi Port2 Bus0/Disk1 Part0 CLOSE OFFLINE 0 0 1 Scsi Port3 Bus0/Disk1 Part0 CLOSE OFFLINE 263 0 AIX host: DEV#: 189 DEVICE NAME: vpath189 TYPE: 2145 POLICY: Optimized SERIAL: 600507680000009E68000000000007E6 ============================================================================ Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk1654 DEAD OFFLINE 0 0 1 fscsi0/hdisk1655 DEAD OFFLINE 2 0 2 fscsi1/hdisk1658 INVALID NORMAL 0 0 3 fscsi1/hdisk1659 INVALID NORMAL 1 0 The next time that a new volume is allocated and mapped to that host, the SCSI ID will be reused if it is allowed to set to the default value, and the host can possibly confuse the new device with the old device definition that is still left over in the device database or system memory. It is possible to get two devices that use identical device definitions in the device database, such as in this example. Note that both vpath189 and vpath190 have the same hdisk definitions while they actually contain different device serial numbers. The path fscsi0/hdisk1654 exists in both vpaths. 
DEV#: 189 DEVICE NAME: vpath189 TYPE: 2145 POLICY: Optimized SERIAL: 600507680000009E68000000000007E6 ============================================================================ Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk1654 CLOSE NORMAL 0 0 1 fscsi0/hdisk1655 CLOSE NORMAL 2 0 2 fscsi1/hdisk1658 CLOSE NORMAL 0 0 202
SAN Volume Controller Best Practices and Performance Guidelines

Draft Document for Review February 16, 2012 3:49 pm

7521Hosts.fm

3 fscsi1/hdisk1659 CLOSE NORMAL 1 0 DEV#: 190 DEVICE NAME: vpath190 TYPE: 2145 POLICY: Optimized SERIAL: 600507680000009E68000000000007F4 ============================================================================ Path# Adapter/Hard Disk State Mode Select Errors 0 fscsi0/hdisk1654 OPEN NORMAL 0 0 1 fscsi0/hdisk1655 OPEN NORMAL 6336260 0 2 fscsi1/hdisk1658 OPEN NORMAL 0 0 3 fscsi1/hdisk1659 OPEN NORMAL 6326954 0 The multipathing software (SDD) recognizes that there is a new device, because at configuration time, it issues an inquiry command and reads the mode pages. However, if the user did not remove the stale configuration data, the Object Data Manager (ODM) for the old hdisks and vpaths still remains and confuses the host, because the SCSI ID as opposed to the device serial number mapping has changed. You can avoid this situation if you remove the hdisk and vpath information from the device configuration database (rmdev -dl vpath189, rmdev -dl hdisk1654, and so forth) prior to mapping new devices to the host and running discovery. Removing the stale configuration and rebooting the host is the recommended procedure for reconfiguring the volumes mapped to a host. Another process that might cause host confusion is expanding a volume. The SVC will tell a host through the scsi check condition mode parameters changed, but not all hosts are able to automatically discover the change and might confuse LUNs or continue to use the old size. Review the IBM System Storage SAN Volume Controller V6.2.0 - Software Installation and Configuration Guide, GC27-2286, for more details and supported hosts: https://www-304.ibm.com/support/docview.wss?uid=ssg1S7003570
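The stale-configuration symptom above (one hdisk path listed under two different serial numbers) can be spotted mechanically. A minimal sketch, run here against sample lines abridged from the SDD output in this section:

```shell
# Flag hdisk paths that appear under more than one device serial number.
stale=$(awk '
  $1 == "SERIAL:" { serial = $2 }
  $2 ~ /hdisk/ {
    if (($2 in seen) && seen[$2] != serial)
      printf "%s listed under %s and %s\n", $2, seen[$2], serial
    else
      seen[$2] = serial
  }
' <<'EOF'
SERIAL: 600507680000009E68000000000007E6
0 fscsi0/hdisk1654 CLOSE NORMAL 0 0
SERIAL: 600507680000009E68000000000007F4
0 fscsi0/hdisk1654 OPEN NORMAL 0 0
EOF
)
echo "$stale"
```

Any line printed here is a candidate for the rmdev cleanup described above, followed by a reboot.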

8.2.5 Volume migration between I/O Groups


Migrating volumes between I/O Groups is another potential issue if the old definitions of the volumes are not removed from the configuration. Migrating volumes between I/O Groups is not a dynamic configuration change, because each node has its own worldwide node name (WWNN); therefore, the host will see the new nodes as a different SCSI target. This process causes major configuration changes. If the stale configuration data is still known by the host, the host might continue to attempt I/O to the old I/O node targets during multipathing selection. Example 8-3 shows the Windows SDD host display prior to I/O Group migration.
Example 8-3 Windows SDD host display prior to I/O Group migration

C:\Program Files\IBM\Subsystem Device Driver>datapath query device

DEV#: 0 DEVICE NAME: Disk1 Part0 TYPE: 2145 POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000A0
============================================================================
Path#   Adapter/Hard Disk             State   Mode     Select    Errors
  0     Scsi Port2 Bus0/Disk1 Part0   OPEN    NORMAL   0         0
  1     Scsi Port2 Bus0/Disk1 Part0   OPEN    NORMAL   1873173   0
  2     Scsi Port3 Bus0/Disk1 Part0   OPEN    NORMAL   0         0
  3     Scsi Port3 Bus0/Disk1 Part0   OPEN    NORMAL   1884768   0

DEV#: 1 DEVICE NAME: Disk2 Part0 TYPE: 2145 POLICY: OPTIMIZED
SERIAL: 60050768018101BF280000000000009F
============================================================================
Path#   Adapter/Hard Disk             State   Mode     Select    Errors
  0     Scsi Port2 Bus0/Disk2 Part0   OPEN    NORMAL   0         0
  1     Scsi Port2 Bus0/Disk2 Part0   OPEN    NORMAL   1863138   0
  2     Scsi Port3 Bus0/Disk2 Part0   OPEN    NORMAL   0         0
  3     Scsi Port3 Bus0/Disk2 Part0   OPEN    NORMAL   1839632   0

If you just quiesce the host I/O and then migrate the volumes to the new I/O Group, you get closed offline paths for the old I/O Group and open normal paths to the new I/O Group. However, these devices do not work correctly, and there is no way to remove the stale paths without rebooting. Note the change in the pathing in Example 8-4 for device 0, SERIAL: 60050768018101BF28000000000000A0.

Example 8-4 Windows volume moved to a new I/O Group dynamically, showing the closed offline paths

C:\Program Files\IBM\Subsystem Device Driver>datapath query device
Total Devices : 12

DEV#: 0 DEVICE NAME: Disk1 Part0 TYPE: 2145 POLICY: OPTIMIZED
SERIAL: 60050768018101BF28000000000000A0
============================================================================
Path#   Adapter/Hard Disk             State    Mode      Select    Errors
  0     Scsi Port2 Bus0/Disk1 Part0   CLOSED   OFFLINE   0         0
  1     Scsi Port2 Bus0/Disk1 Part0   CLOSED   OFFLINE   1873173   0
  2     Scsi Port3 Bus0/Disk1 Part0   CLOSED   OFFLINE   0         0
  3     Scsi Port3 Bus0/Disk1 Part0   CLOSED   OFFLINE   1884768   0
  4     Scsi Port2 Bus0/Disk1 Part0   OPEN     NORMAL    0         0
  5     Scsi Port2 Bus0/Disk1 Part0   OPEN     NORMAL    45        0
  6     Scsi Port3 Bus0/Disk1 Part0   OPEN     NORMAL    0         0
  7     Scsi Port3 Bus0/Disk1 Part0   OPEN     NORMAL    54        0

DEV#: 1 DEVICE NAME: Disk2 Part0 TYPE: 2145 POLICY: OPTIMIZED
SERIAL: 60050768018101BF280000000000009F
============================================================================
Path#   Adapter/Hard Disk             State   Mode     Select    Errors
  0     Scsi Port2 Bus0/Disk2 Part0   OPEN    NORMAL   0         0
  1     Scsi Port2 Bus0/Disk2 Part0   OPEN    NORMAL   1863138   0
  2     Scsi Port3 Bus0/Disk2 Part0   OPEN    NORMAL   0         0
  3     Scsi Port3 Bus0/Disk2 Part0   OPEN    NORMAL   1839632   0

To change the I/O Group, you must first flush the cache within the nodes in the current I/O Group to ensure that all data is written to disk. The SVC command-line interface (CLI) guide recommends that you suspend I/O operations at the host level. The recommended way to quiesce the I/O is to take the volume groups offline, remove the saved configuration (AIX ODM) entries, such as the hdisks and vpaths that are planned for removal, and then gracefully shut down the hosts. Migrate the volume to the new I/O Group and power up the host, which discovers the new I/O Group. If the stale configuration data was not removed before the shutdown, remove it from the stored host device databases (such as ODM on an AIX host) at this point. For Windows hosts, the stale registry information is normally ignored after reboot. Performing volume migrations in this way prevents stale configuration issues.

8.3 I/O queues


Host operating system and host bus adapter software must have a way to fairly prioritize I/O to the storage. The host bus might run significantly faster than the I/O bus or external storage; therefore, there must be a way to queue I/O to the devices. Each operating system and host adapter have unique methods to control the I/O queue. It can be host adapter-based or memory and thread resources-based, or based on how many commands are outstanding for a particular device. You have several configuration parameters available to control the I/O queue for your configuration. There are host adapter parameters and also queue depth parameters for the various storage devices (volumes on the SVC). There are also algorithms within multipathing software, such as qdepth_enable.

8.3.1 Queue depths


Queue depth is used to control the number of concurrent operations that occur on different storage resources. It is the number of I/O operations that can be run in parallel on a device. The section on limiting queue depths in large SANs that appeared in previous documentation has been replaced with a calculation for homogeneous and non-homogeneous Fibre Channel hosts. This calculation gives an overall queue depth per I/O Group, which you can use to reduce queue depths below the recommendations or defaults for individual host adapters. Refer to Queue depth in Fibre Channel hosts in the IBM SAN Volume Controller V6.2.0 Information Center for more information:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc.console.doc/svc_FCqueuedepth.html
Queue depth control must be considered for the overall SVC I/O Group to maintain performance within the SVC. It must also be controlled on a per host adapter and per LUN basis to avoid taxing the host memory or physical adapter resources. The AIX host attachment scripts set the initial queue depth setting for AIX. Other operating system queue depth settings are specified for each host type in the information center when they differ from the defaults. There is also an overall requirement per SVC I/O Group. Host attachment settings are available in the information center here:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc.console.doc/svc_hostattachmentmain.html
AIX host attachment scripts are available here:
http://www-1.ibm.com/support/dlsearch.wss?rs=540&q=host+attachment&tc=ST52G7&dc=D410
Queue depth control within the host is accomplished through limits placed by the adapter resources for handling I/Os and by setting a queue depth maximum per LUN. Multipathing software also controls queue depth using different algorithms. SDD recently made an algorithm change in this area to limit queue depth individually by LUN as opposed to an overall system queue depth limitation.
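As a worked example of an overall per-I/O-Group calculation, the following sketch applies the homogeneous-host formula as we understand it from the information center page referenced above; verify the formula and the constant against that page before applying it, and note that all of the configuration values here are assumptions:

```shell
# q = (n * 7000) / (v * p * c), per the referenced information center page.
n=2     # nodes in the I/O Group
v=100   # volumes mapped from the I/O Group
p=4     # paths per volume per host
c=4     # hosts accessing the I/O Group
q=$(( n * 7000 / (v * p * c) ))
echo "suggested per-device queue depth: $q"
```

If the result is lower than the adapter or operating system default for a device, the calculated value is the one to use.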
The host I/O will be converted to MDisk I/O as needed. The SVC submits I/O to the back-end (MDisk) storage as any host normally does. The host allows user control of the queue depth that is maintained on a disk; SVC controls queue depth for MDisk I/O without any user intervention. After SVC has submitted I/Os and has Q I/Os outstanding for a single MDisk (that is, it is waiting for Q I/Os to complete), it will not submit any more I/O until some I/O completes. That is, any new I/O requests for that MDisk are queued inside SVC. The graph in Figure 8-1 on page 206 illustrates the effect of host volume queue depth for a simple configuration of 32 volumes and one host.

Figure 8-1 (4.3.0) IOPS compared to queue depth for 32 volumes tests on a single host

Figure 8-2 shows another example of queue depth sensitivity for 32 volumes on a single host.

Figure 8-2 (4.3.0) MBps compared to queue depth for 32 volume tests on a single host

While these measurements were taken with 4.3.0 code, the effect that queue depth will have on performance is the same regardless of SVC code version.

8.4 Multipathing software


The SVC requires the use of multipathing software on connected hosts. The latest recommended levels for each host operating system and multipath software package are documented on the SVC support website:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797
Note that the previously recommended levels of host software packages are also tested for SVC 4.3.0, which allows flexibility in maintaining the host software levels with respect to the SVC software version. In other words, you can upgrade the SVC before or after upgrading the host software levels, depending on your maintenance schedule.

8.5 Host clustering and reserves


To prevent hosts from sharing storage inadvertently, it is prudent to establish a storage reservation mechanism. The mechanisms for restricting access to SVC volumes utilize the Small Computer Systems Interface-3 (SCSI-3) persistent reserve commands or the SCSI-2 legacy reserve and release commands. There are several methods that the host software uses for implementing host clusters. They require sharing the volumes on the SVC between hosts. In order to share storage between hosts, control must be maintained over accessing the volumes. Certain clustering software uses software locking methods. Other methods of control can be chosen by the clustering software or by the device drivers to utilize the SCSI architecture reserve/release mechanisms. The multipathing software can change the type of reserve used from a legacy reserve to persistent reserve, or remove the reserve.

Persistent reserve refers to a set of Small Computer Systems Interface-3 (SCSI-3) standard commands and command options that provide SCSI initiators with the ability to establish, preempt, query, and reset a reservation policy with a specified target device. The functionality provided by the persistent reserve commands is a superset of the legacy reserve/release commands. The persistent reserve commands are incompatible with the legacy reserve/release mechanism, and target devices can only support reservations from either the legacy mechanism or the new mechanism. Attempting to mix persistent reserve commands with legacy reserve/release commands will result in the target device returning a reservation conflict error.
Legacy reserve and release mechanisms (SCSI-2) reserve the entire LUN (volume) for exclusive use down a single path, which prevents access from any other host, or even access from the same host through a different host adapter. The persistent reserve design establishes a method and interface, through a reserve policy attribute for SCSI disks, that specifies the type of reservation (if any) that the OS device driver will establish before accessing data on the disk. Four possible values are supported for the reserve policy:
   No_reserve: No reservations are used on the disk.
   Single_path: Legacy reserve/release commands are used on the disk.
   PR_exclusive: Persistent reservation is used to establish exclusive host access to the disk.

   PR_shared: Persistent reservation is used to establish shared host access to the disk.

When a device is opened (for example, when the AIX varyonvg command opens the underlying hdisks), the device driver checks the ODM for a reserve_policy and a PR_key_value and opens the device appropriately. For persistent reserve, each host attached to the shared disk must use a unique registration key value.
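On AIX, the reserve policy is an ODM attribute that can be set with chdev, like the other hdisk attributes discussed in this chapter. A dry-run sketch for AIX MPIO hdisks (the hdisk name is a placeholder, and the attribute usage here is an assumption to verify against your device driver documentation; remove the echo to apply the change on a real host):

```shell
# Print, rather than run, a chdev that sets the reserve policy.
d=hdisk10
policy=no_reserve
cmd="chdev -l $d -a reserve_policy=$policy -P"
echo "$cmd"
```

The -P flag records the change in the ODM so that it takes effect at the next device configuration.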

Clearing reserves
It is possible to accidentally leave a reserve on the SVC volume, or even on the SVC MDisk, during migration into the SVC or when reusing disks for another purpose. Several tools are available from the hosts to clear these reserves. The easiest tools to use are the lquerypr (AIX SDD host) and pcmquerypr (AIX SDDPCM host) commands. There is also a menu-driven Windows SDD/SDDDSM tool. The Windows Persistent Reserve Tool, PRTool.exe, is installed automatically when SDD or SDDDSM is installed:
C:\Program Files\IBM\Subsystem Device Driver>PRTool.exe
It is possible to clear SVC volume reserves by removing all the host mappings when the SVC code is at 4.1.0 or higher. Example 8-5 shows how to determine whether there is a reserve on a device by using the AIX SDD lquerypr command on a reserved hdisk.
Example 8-5 The lquerypr command

[root@ktazp5033]/reserve-checker-> lquerypr -vVh /dev/hdisk5
connection type: fscsi0
open dev: /dev/hdisk5
Attempt to read reservation key...
Attempt to read registration keys...
Read Keys parameter
Generation : 935
Additional Length: 32
Key0 : 7702785F
Key1 : 7702785F
Key2 : 770378DF
Key3 : 770378DF
Reserve Key provided by current host = 7702785F
Reserve Key on the device: 770378DF

This example shows that the device is reserved by a different host. The advantage of using the vV parameter is that the full persistent reserve keys on the device are shown, as well as the errors if the command fails. This example of a failing pcmquerypr command to clear the reserve shows the error:

# pcmquerypr -ph /dev/hdisk232 -V
connection type: fscsi0
open dev: /dev/hdisk232
couldn't open /dev/hdisk232, errno=16

Use the AIX include file errno.h to find out what the 16 indicates. This error indicates a busy condition, which can indicate a legacy reserve or a persistent reserve from another host (or from this host through a different adapter). However, certain AIX technology levels (TLs) have a diagnostic open issue, which prevents the pcmquerypr command from opening the device to display the status or to clear a reserve.

The following hint and tip gives more information about some older AIX TL levels that break the pcmquerypr command:
http://www-1.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S1003122&loc=en_US&cs=utf-8&lang=en
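The decision in Example 8-5 comes down to comparing two key values. A trivial sketch using the keys from that example:

```shell
# Key values copied from the Example 8-5 output.
mine=7702785F   # Reserve Key provided by current host
held=770378DF   # Reserve Key on the device
if [ "$mine" = "$held" ]; then
  verdict="reserve held by this host"
else
  verdict="reserve held by another host ($held)"
fi
echo "$verdict"
```

When the keys differ, identify which host registered the holding key before clearing anything, so that you do not break a live cluster reservation.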

SVC MDisk reserves


Sometimes, a host image mode migration will appear to succeed, but when the volume is actually opened for read or write I/O, problems occur. The problems can result from not removing the reserve on the MDisk before using image mode migration into the SVC. There is no way to clear a leftover reserve on an SVC MDisk from the SVC. The reserve will have to be cleared by mapping the MDisk back to the owning host and clearing it through host commands or through back-end storage commands as advised by IBM technical support.

8.5.1 AIX
The following topics describe items specific to AIX.

HBA parameters for performance tuning


The following example settings can be used to start off your configuration in the specific workload environment. These settings are suggestions, and they are not guaranteed to be the answer to all configurations. Always try to set up a test of your data with your configuration to see if there is further tuning that can help. Again, knowledge of your specific data I/O pattern is extremely helpful.

AIX operating system settings


The following section outlines the settings that can affect performance on an AIX host. We look at these settings in relation to how they impact the two workload types.

Transaction-based settings
The host attachment script (devices.fcp.disk.IBM.rte or devices.fcp.disk.IBM.mpio.rte) sets the default values of the attributes for the SVC hdisks. You can modify these values, but they are an extremely good place to start. There are also HBA parameters that are useful to set for higher performance or for configurations with large numbers of hdisks. All changeable attribute values can be changed using the chdev command on AIX. The AIX settings that can directly affect transaction performance are the queue_depth hdisk attribute and num_cmd_elem in the HBA attributes.

The queue_depth hdisk attribute


For the logical drive, known as the hdisk in AIX, the setting is the queue_depth attribute:

# chdev -l hdiskX -a queue_depth=Y -P

In this example, X is the hdisk number, and Y is the value to which you are setting the queue depth. For a high transaction workload of small random transfers, try a queue_depth of 25 or more; for large sequential workloads, performance is better with shallow queue depths, such as 4.
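For a group of hdisks, the chdev calls are usually scripted. This dry run prints the commands for a transaction-heavy set of disks rather than running them (the hdisk names are placeholders; drop the echo to apply the commands on AIX, and note that -P defers the change to the ODM until the device is reconfigured):

```shell
# Print the chdev commands for several hdisks instead of running them.
queue_depth=25
cmds=$(for d in hdisk4 hdisk5 hdisk6; do
  echo "chdev -l $d -a queue_depth=$queue_depth -P"
done)
echo "$cmds"
```

Reviewing the generated list before running it is a cheap safeguard against touching the wrong disks.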

The num_cmd_elem attribute


For the HBA settings, the num_cmd_elem attribute for the fcs device represents the number of commands that can be queued to the adapter:

chdev -l fcsX -a num_cmd_elem=1024 -P

The default value is 200, and the maximum value is:
   LP9000 adapters: 2048
   LP10000 adapters: 2048
   LP11000 adapters: 2048
   LP7000 adapters: 1024

Best practice: For a high volume of transactions on AIX, or for a large number of hdisks on the fcs adapter, we recommend that you increase num_cmd_elem to 1 024 for the fcs devices being used.

The AIX settings that can directly affect throughput performance with large I/O block sizes are the lg_term_dma and max_xfer_size parameters for the fcs device.

The lg_term_dma attribute


This AIX Fibre Channel adapter attribute controls the direct memory access (DMA) memory resource that an adapter driver can use. The default value of lg_term_dma is 0x200000, and the maximum value is 0x8000000. A recommended change is to increase the value of lg_term_dma to 0x400000. If you still experience poor I/O performance after changing the value to 0x400000, you can increase the value of this attribute again. If you have a dual-port Fibre Channel adapter, the maximum value of the lg_term_dma attribute is divided between the two adapter ports. Therefore, never increase lg_term_dma to the maximum value for a dual-port Fibre Channel adapter, because this value will cause the configuration of the second adapter port to fail.

The max_xfer_size attribute


This AIX Fibre Channel adapter attribute controls the maximum transfer size of the Fibre Channel adapter. Its default value is 0x100000, and the maximum value is 0x1000000. You can increase this attribute to improve performance; it can be changed only with AIX 5.2.0 or higher. Note that the max_xfer_size setting also affects the size of a memory area used for data transfer by the adapter. With the default value of max_xfer_size=0x100000, the area is 16 MB in size; for the other allowable values of max_xfer_size, the memory area is 128 MB in size.
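The fcs attributes discussed above can be staged together for each adapter. This sketch is a dry run that only prints the chdev commands (the helper function and adapter name are illustrative); the values are the starting points given in the text, so verify them against your adapter type with lsattr -El fcsX before and after applying.

```shell
# Dry-run sketch: print the chdev commands that stage the recommended
# high-throughput fcs adapter attributes. -P defers the change until the
# adapter is reconfigured. Values are starting points from the text.
emit_fcs_tuning_cmds() {
    adapter=$1
    echo "chdev -l $adapter -a lg_term_dma=0x400000 -P"
    echo "chdev -l $adapter -a max_xfer_size=0x200000 -P"
    echo "chdev -l $adapter -a num_cmd_elem=1024 -P"
}

emit_fcs_tuning_cmds fcs0
```

Remember the dual-port caution above: do not push lg_term_dma to its maximum on a dual-port adapter.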

Throughput-based settings
In the throughput-based environment, you might want to decrease the queue depth setting to a value smaller than the default from the host attachment. In a mixed application environment, you do not want to lower the num_cmd_elem setting, because other logical drives might need this higher value to perform. In a purely high-throughput workload, this value has no effect.

Best practice: The recommended starting values for high-throughput sequential I/O environments are lg_term_dma = 0x400000 or 0x800000 (depending on the adapter type) and max_xfer_size = 0x200000. We recommend that you test your host with the default settings first and then make these tuning changes to the host parameters, to verify whether the suggested changes actually enhance performance for your specific host configuration and workload.

SAN Volume Controller Best Practices and Performance Guidelines

Configuring for fast fail and dynamic tracking


For host systems that run AIX 5.2 or higher, you can achieve the best results by using the fast fail and dynamic tracking attributes. Before configuring your host system to use these attributes, ensure that the host is running AIX Version 5.2 or higher. Perform the following steps to configure your host system to use the fast fail and dynamic tracking attributes:
1. Issue the following command to set the Fibre Channel SCSI I/O Controller Protocol Device event error recovery policy to fast_fail for each Fibre Channel adapter:
chdev -l fscsi0 -a fc_err_recov=fast_fail
This example command is for adapter fscsi0.
2. Issue the following command to enable dynamic tracking for each Fibre Channel device:
chdev -l fscsi0 -a dyntrk=yes
This example command is for adapter fscsi0.
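Both steps can be applied to every fscsi device in one pass. The following is a dry-run sketch that only prints the commands (the fscsi device names are illustrative; on a live host, take the list from lsdev output, for example lsdev | grep fscsi), so the result can be reviewed before execution.

```shell
# Dry-run sketch: print a combined fast-fail and dynamic-tracking chdev
# command for each fscsi device given. chdev accepts multiple -a flags,
# so both attributes can be set in one invocation per device.
emit_fscsi_recovery_cmds() {
    for f in "$@"; do
        echo "chdev -l $f -a fc_err_recov=fast_fail -a dyntrk=yes"
    done
}

emit_fscsi_recovery_cmds fscsi0 fscsi1
```

On a busy device, the change may need to be deferred with -P and applied at the next reconfiguration.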

Multipathing
When the AIX operating system was first developed, multipathing was not embedded within the device drivers. Therefore, each path to an SVC volume was represented by an AIX hdisk. The SVC host attachment script devices.fcp.disk.ibm.rte sets up the predefined attributes within the AIX database for SVC disks, and these attributes have changed with each iteration of host attachment and AIX technology levels. Both SDD and Veritas DMP utilize the hdisks for multipathing control. The host attachment is also used for other IBM storage devices. The Host Attachment allows AIX device driver configuration methods to properly identify and configure SVC (2145), IBM DS6000 (1750), and IBM DS8000 (2107) LUNs: http://www-1.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=D410&q1=host+att achment&uid=ssg1S4000106&loc=en_US&cs=utf-8&lang=en

SDD
IBM Subsystem Device Driver (SDD) multipathing software has been designed and updated consistently over the last decade and is an extremely mature multipathing technology. The SDD software also supports many other IBM storage types directly connected to AIX, such as the 2107.

SDD algorithms for handling multipathing have also evolved. There are throttling mechanisms within SDD that controlled overall I/O bandwidth in SDD releases 1.6.1.0 and lower. This throttling mechanism has evolved to be specific to a single vpath and is called qdepth_enable in later releases.

SDD utilizes persistent reserve functions, placing a persistent reserve on the device in place of the legacy reserve when the volume group is varied on. However, if IBM HACMP is installed, HACMP controls the persistent reserve usage, depending on the type of varyon used. Also, enhanced concurrent volume groups (VGs) have no reserves: varyonvg -c is used for enhanced concurrent VGs, and varyonvg for regular VGs that utilize the persistent reserve.

Datapath commands are an extremely powerful method for managing the SVC storage and pathing. The output shows the LUN serial number of the SVC volume and which vpath and hdisk represent that SVC LUN. Datapath commands can also change the multipath selection algorithm. The default is load balance, but the multipath selection algorithm is programmable. The recommended best practice when using SDD is also load balance, using four paths. The datapath query device output shows a somewhat balanced number of selects on each preferred path to the SVC:

DEV#: 12  DEVICE NAME: vpath12  TYPE: 2145  POLICY: Optimized
SERIAL: 60050768018B810A88000000000000E0
====================================================================


Path#    Adapter/Hard Disk    State    Mode       Select    Errors
    0    fscsi0/hdisk55       OPEN     NORMAL    1390209         0
    1    fscsi0/hdisk65       OPEN     NORMAL          0         0
    2    fscsi0/hdisk75       OPEN     NORMAL    1391852         0
    3    fscsi0/hdisk85       OPEN     NORMAL          0         0

We recommend that you verify that the selects during normal operation are occurring on the preferred paths (use datapath query device -l). Also, verify that you have the correct connectivity.

SDDPCM
As Fibre Channel technologies matured, AIX was enhanced by adding native multipathing support called Multipath I/O (MPIO). This structure allows a storage manufacturer to create software plug-ins for its specific storage. The IBM SVC version of this plug-in is called SDDPCM, which requires a host attachment script called devices.fcp.disk.ibm.mpio.rte:
http://www-1.ibm.com/support/docview.wss?rs=540&context=ST52G7&dc=D410&q1=host+attachment&uid=ssg1S4000203&loc=en_US&cs=utf-8&lang=en
SDDPCM and AIX MPIO have been continually improved since their release. We recommend that you run the latest release levels of this software. The preferred path indicator for SDDPCM does not display until after the device has been opened for the first time, which differs from SDD, which displays the preferred path immediately after being configured.

SDDPCM features four types of reserve policies:
- No_reserve policy
- Exclusive host access single path policy
- Persistent reserve exclusive host policy
- Persistent reserve shared host access policy
The usage of the persistent reserve now depends on the hdisk attribute reserve_policy. Change this policy to match your storage security requirements.

There are three path selection algorithms:
- Failover
- Round-robin
- Load balancing
The latest SDDPCM code (2.1.3.0 and later) has improvements in failed path reclamation by a health checker, a failback error recovery algorithm, Fibre Channel dynamic device tracking, and support for SAN boot devices on MPIO-supported storage devices.
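Because the persistent reserve behavior follows the hdisk reserve_policy attribute, changing the policy is a per-hdisk chdev operation. The sketch below is a dry run that only prints the command; the hdisk name and the chosen policy value are illustrative, and you should confirm the valid values for your SDDPCM level with lsattr -Rl hdiskX -a reserve_policy.

```shell
# Dry-run sketch: print the chdev command that sets the reserve policy
# on an SDDPCM-managed hdisk. The policy value (here no_reserve) must be
# one supported by your SDDPCM level; verify with lsattr before use.
emit_reserve_policy_cmd() {
    echo "chdev -l $1 -a reserve_policy=$2"
}

emit_reserve_policy_cmd hdisk4 no_reserve
```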

8.5.2 SDD compared to SDDPCM


There are several reasons for choosing SDDPCM over SDD. SAN boot is much improved with the native MPIO/SDDPCM software. Multiple Virtual I/O Servers (VIOSs) are supported. Certain applications, such as Oracle ASM, do not work with SDD. It is also worth noting that with SDD, all paths can go into the Dead state, which speeds up HACMP and Logical Volume Manager (LVM) mirroring failovers. With SDDPCM, one path always remains open even if the LUN is dead; this design causes longer failovers.


When SDDPCM is used with HACMP, enhanced concurrent volume groups require the no_reserve policy for both concurrent and non-concurrent resource groups. Therefore, HACMP uses a software locking mechanism instead of implementing persistent reserves. HACMP used with SDD does utilize persistent reserves, based on the type of varyonvg that was executed.

SDDPCM pathing
SDDPCM pcmpath commands are the best way to understand configuration information about the SVC storage allocation. The following example shows how much can be determined from this command, pcmpath query device, about the connections to the SVC from this host.

DEV#: 0  DEVICE NAME: hdisk0  TYPE: 2145  ALGORITHM: Load Balance
SERIAL: 6005076801808101400000000000037B
======================================================================
Path#    Adapter/Path Name    State    Mode      Select    Errors
    0    fscsi0/path0         OPEN     NORMAL    155009         0
    1    fscsi1/path1         OPEN     NORMAL    155156         0

In this example, both paths are being used for the SVC connections. These counts are not the normal select counts for a properly mapped SVC, and two paths are not an adequate number of paths. Use the -l option on pcmpath query device to check whether these paths are both preferred paths. If they are both preferred paths, one SVC node must be missing from the host view. Using the -l option shows an asterisk on both paths, indicating that a single node is visible to the host (and it is the non-preferred node for this volume):

    0*   fscsi0/path0         OPEN     NORMAL      9795         0
    1*   fscsi1/path1         OPEN     NORMAL      9558         0

This information indicates a problem that needs to be corrected. If zoning in the switch is correct, perhaps this host was rebooted while one SVC node was missing from the fabric.
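The asterisk check described above lends itself to scripting. The sketch below is not an official tool: it assumes the path-listing format shown in the example (path number, optionally suffixed with an asterisk for non-preferred paths) and scans saved pcmpath query device -l output from stdin, warning when every path is non-preferred.

```shell
# Sketch: flag cases where all listed paths are non-preferred (marked
# with "*" by pcmpath query device -l). Reads the saved command output
# from stdin; the line format is assumed from the example in the text.
check_preferred_paths() {
    awk '
        /^[0-9]+\*?[[:space:]]/ {
            total++
            if ($1 ~ /\*$/) starred++
        }
        END {
            if (total > 0 && starred == total)
                print "WARNING: all " total " paths are non-preferred; check SVC node visibility"
            else
                print "OK: " (total - starred) " preferred path(s) in use"
        }'
}

printf '0*  fscsi0/path0  OPEN  NORMAL  9795  0\n1*  fscsi1/path1  OPEN  NORMAL  9558  0\n' \
    | check_preferred_paths
```

Run against the saved output of pcmpath query device -l for each device; a WARNING suggests checking zoning or whether the host was rebooted while an SVC node was absent.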

Veritas
Veritas DMP multipathing is also supported for the SVC. It requires certain AIX APARs and the Veritas Array Support Library, as well as a particular version of the host attachment script devices.fcp.disk.ibm.rte that recognizes the 2145 devices as hdisks rather than MPIO hdisks. In addition to the normal ODM databases that contain hdisk attributes, there are several Veritas filesets that contain configuration data:
- /dev/vx/dmp
- /dev/vx/rdmp
- /etc/vxX.info
Storage reconfiguration of volumes presented to an AIX host requires cleanup of the AIX hdisks and these Veritas filesets.

8.5.3 Virtual I/O server


Virtual SCSI is based on a client/server relationship. The Virtual I/O Server (VIOS) owns the physical resources and acts as the server, or target, device. Physical adapters with attached disks (volumes on the SVC, in our case) on the Virtual I/O Server partition can be shared by one or more partitions. These partitions contain a virtual SCSI client adapter that sees these virtual devices as standard SCSI compliant devices and LUNs.


There are two types of volumes that you can create on a VIOS: physical volume (PV) VSCSI hdisks and logical volume (LV) VSCSI hdisks.

PV VSCSI hdisks are entire LUNs from the VIOS point of view. If you are concerned about failure of a VIOS and have configured redundant VIOSs for that reason, you must use PV VSCSI hdisks, because an LV VSCSI hdisk cannot be served up from multiple VIOSs. PV VSCSI hdisks are entire LUNs that are volumes from the virtual I/O client (VIOC) point of view. LV VSCSI hdisks reside in LVM volume groups (VGs) on the VIOS; they cannot span PVs in that VG, nor can they be striped LVs. Due to these restrictions, we recommend using PV VSCSI hdisks.

Multipath support for SVC attachment to the Virtual I/O Server is provided by either SDD or MPIO with SDDPCM. Where Virtual I/O Server SAN boot or dual Virtual I/O Server configurations are required, only MPIO with SDDPCM is supported. Due to this restriction, we recommend using MPIO with SDDPCM at the latest SVC-supported levels, as shown at:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797#_VIOS

There are many questions answered on the following Web site about usage of the VIOS:
http://www14.software.ibm.com/webapp/set2/sas/f/vios/documentation/faq.html
One common question, addressed at the previous link, is how to migrate data into a VIO environment or how to reconfigure storage on a VIOS. Many clients want to know whether SCSI LUNs can be moved between the physical and virtual environments as is. That is, given a physical SCSI device (LUN) with user data on it that resides in a SAN environment, can this device be allocated to a VIOS, provisioned to a client partition, and used by the client as is? The answer is no; this function is not supported at this time. The device cannot be used as is. Virtual SCSI devices are new devices when created, and the data must be put on them after creation, which typically requires a backup of the data in the physical SAN environment with a restoration of the data onto the volume.

Why do we have this limitation


The VIOS uses several methods to uniquely identify a disk for use as a virtual SCSI disk:
- Unique device identifier (UDID)
- IEEE volume identifier
- Physical volume identifier (PVID)
Each of these methods can result in different data formats on the disk. The preferred disk identification method for volumes is the use of UDIDs.

MPIO uses the UDID method


Most non-MPIO disk storage multipathing software products use the PVID method instead of the UDID method. Because of the different data format associated with the PVID method, clients with non-MPIO environments need to be aware that certain future actions performed in the VIOS logical partition (LPAR) can require data migration, that is, a type of backup and restoration of the attached disks. These actions can include, but are not limited to:
- Conversion from a non-MPIO environment to MPIO
- Conversion from the PVID to the UDID method of disk identification
- Removal and rediscovery of the Disk Storage ODM entries
- Updating non-MPIO multipathing software under certain circumstances


Possible future enhancements to VIO
Due in part to the differences in disk format that we just described, VIO is currently supported for new disk installations only. AIX, VIO, and SDD development are working on changes to make this migration easier in the future. One enhancement is to use the UDID or IEEE method of disk identification. If you use the UDID method, it might be possible to contact IBM technical support for a migration method that does not require restoration.

A quick and simple method to determine whether a backup and restoration is necessary is to run the command lquerypv -h /dev/hdisk## 80 10 to read the PVID off the disk. If the output differs between the VIOS and the VIOC, you must use backup and restore.
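The lquerypv comparison can be scripted once the output of lquerypv -h is saved from each side. The sketch below assumes the usual hex-dump line layout (offset column followed by four 4-byte hex words); the sample dumps are fabricated placeholders for illustration, not real PVIDs.

```shell
# Sketch: extract the PVID (first 8 bytes at offset 0x80) from saved
# "lquerypv -h /dev/hdiskN 80 10" output, then compare VIOS vs VIOC.
# The dump layout (offset + four hex words per line) is assumed.
extract_pvid() {
    awk '$1 == "00000080" { print $2 $3; exit }'
}

vios_dump='00000080   00C1234D 8F9A2B10 00000000 00000000  |...M....|'
vioc_dump='00000080   00C1234D 8F9A2B10 00000000 00000000  |...M....|'

a=$(printf '%s\n' "$vios_dump" | extract_pvid)
b=$(printf '%s\n' "$vioc_dump" | extract_pvid)
if [ "$a" = "$b" ]; then
    echo "PVIDs match ($a): no backup/restore migration needed"
else
    echo "PVIDs differ: backup and restore is required"
fi
```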

How to back up the VIO configuration


To back up the VIO configuration:
1. Save the volume group information from the VIOC (PVIDs and VG names).
2. Save the disk mapping, PVID, and LUN ID information from all VIOSs. This step includes mapping the VIOS hdisk to the VIOC hdisk, and you must save at least the PVID information.
3. Save the physical LUN to host LUN ID information on the storage subsystem, for when you reconfigure the hdisks.
After all the pertinent mapping data has been collected and saved, it is possible to back up and reconfigure your storage and then restore it using the AIX commands. Back up the VG data on the VIOC: for rootvg, the supported method is a mksysb and an install; for non-rootvg, use savevg and restvg.

8.5.4 Windows
There are two multipathing driver options released for Windows 2003 Server hosts. Windows 2003 Server device driver development has concentrated on the storport.sys driver, which has significant interoperability differences from the older scsiport driver set. Additionally, Windows has released a native multipathing I/O option with a storage-specific plug-in. SDDDSM was designed to support these newer methods of interfacing with Windows 2003 Server. In order to release new enhancements more quickly, the newer hardware architectures (64-bit EM64T and so forth) are tested only on the SDDDSM code stream; therefore, only SDDDSM packages are available for them. The older version of the SDD multipathing driver works with the scsiport drivers. This version is required for Windows 2000 servers, because storport.sys is not available there. The SDD software is also available for Windows 2003 Server when the scsiport HBA drivers are used.

Clustering and reserves


Windows SDD or SDDDSM utilizes the persistent reserve functions to implement Windows Clustering. A stand-alone Windows host will not utilize reserves. Review this Microsoft article about clustering to understand how a cluster works: http://support.microsoft.com/kb/309186/


When SDD or SDDDSM is installed, the reserve and release functions described in this article are translated into proper persistent reserve and release equivalents to allow load balancing and multipathing from each host.

SDD versus SDDDSM


All new installations should use SDDDSM unless the Windows OS is a legacy version (Windows 2000 or NT). The major requirement when choosing SDD or SDDDSM is to ensure that the matching host bus adapter driver type is also loaded on the system: choose the storport driver for SDDDSM and the scsiport versions for SDD. Future enhancements to multipathing will concentrate on SDDDSM within the Windows MPIO framework.

Tunable parameters
With Windows operating systems, the queue depth settings are the responsibility of the host adapters and are configured through the BIOS settings. Configuring the queue depth settings varies from vendor to vendor. Refer to your manufacturer's instructions about how to configure your specific cards, and to the IBM SAN Volume Controller Information Center (Host Attachment chapter):
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc.console.doc/svc_FChostswindows_cover.html
Queue depth is also controlled by the Windows application program, which controls how many I/O commands it allows to be outstanding before waiting for completion. The queue depth might have to be adjusted based on the overall I/O group queue depth calculation in 8.3.1, Queue depths. For the IBM FAStT FC2-133 (and QLogic-based HBAs), the queue depth is known as the execution throttle, which can be set with either the QLogic SANSurfer tool or in the BIOS of the QLogic-based HBA by pressing Ctrl+Q during the startup process.

Changing back-end storage LUN mappings dynamically


Unmapping a LUN from a Windows SDD or SDDDSM server and then mapping a different LUN using the same SCSI ID can cause data corruption and loss of access. The procedure for reconfiguration is documented at the following Web site: http://www-1.ibm.com/support/docview.wss?rs=540&context=ST52G7&uid=ssg1S1003316&lo c=en_US&cs=utf-8&lang=en

Recommendations for Disk Alignment using Windows with SVC volumes


The recommended settings for the best performance with SVC when you use Microsoft Windows operating systems and applications with a significant amount of I/O can be found at the following Web site: http://www-1.ibm.com/support/docview.wss?rs=591&context=STPVGU&context=STPVFV&q1=m icrosoft&uid=ssg1S1003291&loc=en_US&cs=utf-8&lang=en

8.5.5 Linux
IBM has decided to transition SVC multipathing support from IBM SDD to Linux native DM-MPIO multipathing (listed as Device Mapper Multipath in the table). Veritas DMP is also


available for certain kernels. Refer to the SAN Volume Controller Supported Hardware List, Device Driver, Firmware and Recommended Software Levels V6.2 to see which versions of each Linux kernel require SDD, DM-MPIO, or Veritas DMP support:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797#_RH21
Some kernels allow a choice of multipathing driver, indicated by a horizontal bar between the multipathing driver choices for the specific kernel shown to the left. If your kernel is not listed for support, contact your IBM marketing representative to request a Request for Price Quotation (RPQ) for your specific configuration.

Certain types of clustering are now supported; however, the multipathing software choice is tied to the type of cluster and HBA driver. For example, Veritas Storage Foundation is supported for certain hardware/kernel combinations, but it also requires Veritas DMP multipathing. Contact IBM marketing for RPQ support if you need Linux clustering in your specific environment and it is not listed.

SDD compared to DM-MPIO


For reference on the multipathing choices for Linux operating systems, SDD development has provided the white paper, Considerations and Comparisons between IBM SDD for Linux and DM-MPIO, which is available at: http://www-1.ibm.com/support/docview.wss?rs=540&context=ST52G7&q1=linux&uid=ssg1S7 001664&loc=en_US&cs=utf-8&lang=en

Tunable parameters
Linux performance is influenced by HBA parameter settings and queue depth. Aside from the overall queue depth calculation for the I/O group mentioned in 8.3.1, Queue depths, there are also maximums per HBA adapter and type, with settings recommended in the SVC 6.2.0 Information Center. Refer to the settings for each specific HBA type and the general Linux OS tunable parameters in the IBM SAN Volume Controller Information Center (Host Attachment chapter) at:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc431.console.doc/svc_linover_1dcv35.html
In addition to the I/O and OS parameters, Linux also has tunable file system parameters. You can use the tune2fs command to increase file system performance based on your specific configuration: the journal mode and size can be changed, and directories can be indexed. For more information, see:
http://www.ibm.com/developerworks/linux/library/l-lpic1-v3-104-2/index.html?ca=dgr-lnxw06TracjLXFilesystems
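As an example of the tune2fs adjustments mentioned above, the sketch below prints (rather than executes) the commands for enabling directory indexing and changing the default journal data mode on an ext3 file system. The device name is a placeholder; these commands modify on-disk metadata, so review them against the tune2fs man page for your distribution and run them only on an unmounted or freshly checked file system.

```shell
# Dry-run sketch: print tune2fs commands for common ext3 tuning steps.
# /dev/sdb1 is a placeholder device name.
emit_tune2fs_cmds() {
    dev=$1
    echo "tune2fs -O dir_index $dev"               # enable hashed b-tree directory lookups
    echo "tune2fs -o journal_data_writeback $dev"  # set default journal data mode
}

emit_tune2fs_cmds /dev/sdb1
```

After enabling dir_index on an existing file system, existing directories are only indexed after an e2fsck -D pass.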

8.5.6 Solaris
There are two options for multipathing support on Solaris hosts. You choose between the Symantec/VERITAS Volume Manager and Solaris MPxIO, depending on your file system requirements and the OS levels in the latest interoperability matrix:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797#_Sun58


IBM SDD is no longer supported, because its features are now available natively in the Solaris MPxIO multipathing driver. If SDD support is still needed, contact your IBM marketing representative to request a Request for Price Quotation (RPQ) for your specific configuration.

Solaris MPxIO
SAN boot and clustering support are available for the 5.9 and 5.10 OS, depending on the multipathing driver and HBA choices. Releases of SVC code prior to 4.3.0 did not support load balancing of the MPxIO software. Configure your SVC host object with the type attribute set to tpgs if you want to run MPxIO on your Sun SPARC host. For example:
svctask mkhost -name new_name_arg -hbawwpn wwpn_list -type tpgs
In this command, -type specifies the type of host. Valid entries are hpux, tpgs, or generic. The tpgs option enables an extra target port unit. The default is generic. For guidance on configuring the MPxIO software for OS 5.10 and using SVC volumes, refer to the following document:
http://download.oracle.com/docs/cd/E19957-01/819-0139/ch_3_admin_multi_devices.html

Symantec/VERITAS Volume Manager


When managing IBM SVC storage in Symantec's volume manager products, you must install an array support library (ASL) on the host so that the volume manager is aware of the storage subsystem properties (active/active, active/passive). If the appropriate ASL is not installed, the volume manager does not claim the LUNs. Usage of the ASL is required to enable the special failover/failback multipathing that SVC requires for error recovery.

Use the following commands to determine the basic configuration of a Symantec/Veritas server:
pkginfo -l (lists all installed packages)
showrev -p | grep vxvm (obtains the version of the volume manager)
vxddladm listsupport (shows which ASLs are configured)
vxdisk list
vxdmpadm listctlr all (shows all attached subsystems and provides a type where possible)
vxdmpadm getsubpaths ctlr=cX (lists paths by controller)
vxdmpadm getsubpaths dmpnodename=cxtxdxs2 (lists paths by LUN)

The following commands determine whether the SVC is properly connected and show at a glance which ASL library is used (native DMP ASL or SDD ASL). Here is an example of what you see when the Symantec volume manager correctly sees our SVC, using the SDD passthrough mode ASL:

# vxdmpadm listenclosure all
ENCLR_NAME     ENCLR_TYPE    ENCLR_SNO           STATUS
============================================================
OTHER_DISKS    OTHER_DISKS   OTHER_DISKS         CONNECTED
VPATH_SANVC0   VPATH_SANVC   0200628002faXX00    CONNECTED

Here is an example of what we see when SVC is configured using the native DMP ASL:


# vxdmpadm listenclosure all
ENCLR_NAME     ENCLR_TYPE    ENCLR_SNO           STATUS
============================================================
OTHER_DISKS    OTHER_DISKS   OTHER_DISKS         CONNECTED
SAN_VC0        SAN_VC        0200628002faXX00    CONNECTED

ASL specifics for SVC


For SVC, ASLs have been developed using both DMP multipathing and SDD passthrough multipathing. SDD passthrough is documented here for legacy purposes only. For SDD passthrough, see:
http://www.symantec.com/business/support/index?page=content&id=TECH45863
# pkginfo -l VRTSsanvc
PKG=VRTSsanvc
BASEDIR=/etc/vx
NAME=Array Support Library for IBM SAN.VC with SDD.
PRODNAME=VERITAS ASL for IBM SAN.VC with SDD.
Using SDD is no longer a best practice. We recommend that SDD configurations be replaced with native DMP. For the latest ASL levels to use with native DMP, see:
https://sort.symantec.com/asl
The following command displays the current ASL level:
pkginfo -l VRTSsanvc
PKGINST: VRTSsanvc
NAME: Array Support Library for IBM SAN.VC in NATIVE DMP mode

For the latest Veritas patch levels, refer to:
https://sort.symantec.com/patch/matrix
To check the installed Symantec/VERITAS version:
showrev -p | grep vxvm
To check which IBM ASLs are configured into the volume manager:
vxddladm listsupport | grep -i ibm
Following the installation of a new ASL using pkgadd, you need to either reboot or issue vxdctl enable. To list the ASLs that are active, run vxddladm listsupport.

How to troubleshoot configuration issues


Here is an example of the appropriate ASL not being installed, or the system not enabling the ASL. The key is the enclosure type OTHER_DISKS:

vxdmpadm listctlr all
CTLR-NAME    ENCLR-TYPE     STATE      ENCLR-NAME
=====================================================
c0           OTHER_DISKS    ENABLED    OTHER_DISKS
c2           OTHER_DISKS    ENABLED    OTHER_DISKS
c3           OTHER_DISKS    ENABLED    OTHER_DISKS


vxdmpadm listenclosure all
ENCLR_NAME     ENCLR_TYPE    ENCLR_SNO      STATUS
============================================================
OTHER_DISKS    OTHER_DISKS   OTHER_DISKS    CONNECTED
Disk           Disk          DISKS          DISCONNECTED

8.5.7 VMware
Review the SAN Volume Controller Supported Hardware List, Device Driver, Firmware and Recommended Software Levels V6.2 to determine the various ESX levels that are supported, and whether you plan to utilize the newly available 6.2 support of the VMware vStorage API for Array Integration (VAAI):
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797#_VMVAAI

SVC 6.2.0 adds support for the VMware vStorage APIs. SVC implements storage-related tasks that were previously performed by VMware, which helps improve efficiency and frees up server resources for other, more mission-critical tasks. The new functions include full copy, block zeroing, and hardware-assisted locking.

If you are not using the new API functionality, the minimum recommended and supported VMware level is now 3.5. If lower versions are required, contact your IBM marketing representative and ask about the submission of an RPQ for support. The necessary patches and procedures will be supplied after the specific configuration has been reviewed and approved.

Host attachment recommendations are now available in the IBM System Storage SAN Volume Controller V6.2.0 Information Center and guides at:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc.console.doc/svc_over_1dcur0.html
The specific chapter is here:
http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp?topic=/com.ibm.storage.svc.console.doc/svc_vmwrequiremnts_21layq.html

Multipathing solutions supported


Multipathing is supported at ESX level 2.5.x and higher; therefore, installing multipathing software is not required. There are two multipathing algorithms available:
- Fixed Path
- Round Robin
VMware multipathing was improved to utilize the SVC preferred node algorithms starting with ESX 4.0; preferred paths are ignored in VMware versions prior to 4.0. The VMware multipathing software performs static load balancing for I/O, which defines the fixed path for a given volume. Round Robin rotates path selection for a given volume through all paths. For any given volume using the fixed path policy, the first discovered preferred node path is chosen. Both the fixed path and Round Robin algorithms have been modified with 4.0 and higher

to honor the SVC preferred node, which is discovered via the TPGS command. Path failover is automatic in both cases. If Round Robin is used, path failback might not return to a preferred node path; therefore, we recommend manually checking pathing after any maintenance or problems have occurred.

Multipathing configuration maximums
The maximum supported configuration for the VMware multipathing software is:
- A total of 256 SCSI devices
- Four paths to each volume
Note: Each path to a volume equates to a single SCSI device.
For more information about VMware and SVC, VMware storage and zoning recommendations, HBA settings, and attaching volumes to VMware, refer to Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933:
http://www.redbooks.ibm.com/redpieces/abstracts/sg247933.html

8.6 Mirroring considerations


As you plan how to fully utilize the various options to back up your data through mirroring functions, consider how to keep a consistent set of data for your application. A consistent set of data implies a level of control by the application or host scripts to start and stop mirroring, with both host-based mirroring and back-end storage mirroring features. It also implies a group of disks that must be kept consistent.

Host applications have a certain granularity to their storage writes. The data presents a consistent view to the host application only at certain times. This level of granularity is at the file system level, as opposed to the SCSI read/write level. The SVC guarantees consistency at the SCSI read/write level when its mirroring features are in use. However, a host file system write might require multiple SCSI writes; therefore, without a method of controlling when the mirroring stops, the resulting mirror can be missing a portion of a write and look corrupted.

Normally, a database application has methods to recover the mirrored data and to back up to a consistent view, which is applicable in the case of a disaster that breaks the mirror. However, we recommend that you have a normal procedure of stopping at a consistent view for each mirror, in order to be able to easily start up the backup copy for non-disaster scenarios.

8.6.1 Host-based mirroring


Host-based mirroring is a fully redundant method of mirroring that uses two mirrored copies of the data, with the mirroring done by the host software. If you use this method of mirroring, we recommend that each copy is placed on a separate SVC cluster. SVC-based volume mirroring is also available; if you use SVC mirrors, ensure that each copy is in an Mdiskgrp on a different back-end controller.
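When SVC volume mirroring is used, the second copy is added with the SVC CLI. The sketch below is a dry run that only prints the command; the pool name Pool_CTRL_B and volume name vdisk7 are placeholders for a storage pool on a different back-end controller and an existing volume.

```shell
# Dry-run sketch: print the SVC CLI command that adds a mirrored copy of
# an existing volume into a second storage pool (Mdiskgrp). Names are
# placeholders; run the printed command in an SVC CLI (ssh) session.
emit_addvdiskcopy_cmd() {
    echo "svctask addvdiskcopy -mdiskgrp $1 $2"
}

emit_addvdiskcopy_cmd Pool_CTRL_B vdisk7
```

Choosing a pool backed by a different controller for the second copy keeps the two copies from sharing a back-end failure domain.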

8.7 Monitoring
A consistent set of monitoring tools is available when IBM SDD, SDDDSM, or SDDPCM is used as the multipathing software on the various OS environments. Examples earlier in this chapter showed how the datapath query device and datapath query adapter commands can be used for path monitoring. Path performance can also be monitored with the datapath query devstats command (or pcmpath query devstats), which shows performance information for a single device, all devices, or a range of devices. Example 8-6 shows the output of datapath query devstats for two devices.
Example 8-6 The datapath query devstats command output

C:\Program Files\IBM\Subsystem Device Driver>datapath query devstats

Total Devices : 2

Device #: 0
=============
             Total Read  Total Write  Active Read  Active Write  Maximum
    I/O:        1755189      1749581            0             0        3
  SECTOR:      14168026    153842715            0             0      256

Transfer Size:  <= 512       <= 4k       <= 16K       <= 64K       > 64K
                   271      2337858         104      1166537           0

Device #: 1
=============
             Total Read  Total Write  Active Read  Active Write  Maximum
    I/O:       20353800      9883944            0             1        4
  SECTOR:     162956588    451987840            0           128      256

Transfer Size:  <= 512       <= 4k       <= 16K       <= 64K       > 64K
                   296     27128331         215      3108902           0

An adapter-level statistics command is also available: datapath query adaptstats (also mapped to pcmpath query adaptstats). Refer to Example 8-7 for a two-adapter example.
Example 8-7 The datapath query adaptstats output

C:\Program Files\IBM\Subsystem Device Driver>datapath query adaptstats

Adapter #: 0
=============
             Total Read  Total Write  Active Read  Active Write  Maximum
    I/O:       11060574      5936795            0             0        2
  SECTOR:      88611927    317987806            0             0      256

Adapter #: 1

=============
             Total Read  Total Write  Active Read  Active Write  Maximum
    I/O:       11048415      5930291            0             1        2
  SECTOR:      88512687    317726325            0           128      256

You can clear these counters so that you can script the usage to cover a precise amount of time. The commands also allow you to select the devices to return as a range, a single device, or all devices. The command to clear the counters is datapath clear device count.

8.7.1 Automated path monitoring


There are many situations in which a host can lose one or more paths to storage. If the problem is isolated to that one host, it might go unnoticed until a SAN issue takes the remaining paths offline, such as a switch failure or even a routine code upgrade. The result is a loss-of-access event that can seriously affect your business.

To prevent this loss-of-access event from happening, many clients have found it useful to implement automated path monitoring using SDD commands and common system utilities. For instance, a simple command string on a UNIX system can count the number of failed paths:

datapath query device | grep -i dead | wc -l

This command can be combined with a scheduler, such as cron, and a notification system, such as e-mail, to notify the SAN administrators and system administrators if the number of paths to the system changes.
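As an illustrative sketch of such automated monitoring (our own code, not part of SDD), the following Python script counts dead paths in datapath query device output. The embedded sample output stands in for the real command, and its device details are invented for the example; hook your own notification mechanism where indicated.

```python
import subprocess

# Sample "datapath query device" output, used here in place of calling the
# real IBM SDD CLI so the parsing logic can be demonstrated end to end.
# The device and path details are illustrative, not from a real system.
SAMPLE_OUTPUT = """\
DEV#:   0  DEVICE NAME: Disk1 Part0  TYPE: 2145  POLICY: OPTIMIZED
   Path#          Adapter/Hard Disk        State   Mode     Select  Errors
       0   Scsi Port2 Bus0/Disk1 Part0     OPEN    NORMAL  1755189       0
       1   Scsi Port3 Bus0/Disk1 Part0     DEAD    NORMAL        0       1
"""

def get_datapath_output(use_sample=True):
    """Return multipath status text; in production, call the SDD CLI."""
    if use_sample:
        return SAMPLE_OUTPUT
    result = subprocess.run(["datapath", "query", "device"],
                            capture_output=True, text=True)
    return result.stdout

def count_dead_paths(output):
    """Count path lines whose State column reports DEAD."""
    return sum(1 for line in output.splitlines() if " DEAD " in line.upper())

def check_paths():
    """Report the number of dead paths; hook alerting (mail, SNMP) here."""
    dead = count_dead_paths(get_datapath_output())
    if dead:
        print("WARNING: %d dead path(s) detected" % dead)
    return dead
```

Scheduled from cron, a script like this can alert administrators as soon as the path count changes, rather than at the next SAN incident.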

8.7.2 Load measurement and stress tools


Load measurement tools are generally specific to each host operating system. For example, AIX has iostat, and Windows has perfmon.msc /s.

Industry-standard performance benchmarking tools are available by joining the Storage Performance Council. Information about how to join is available here:
http://www.storageperformance.org/home
These tools both create stress and measure the stress that was created in a standardized way, and we highly recommend them for generating stress in your test environments so that you can compare your results against the industry measurements.

Another recommended stress tool is Iometer, for Windows and Linux hosts:
http://www.iometer.org

IBM System p has wikis on performance tools and has made a set available for their users:
http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/Performance+Monitoring+Tools
http://www-941.ibm.com/collaboration/wiki/display/WikiPtype/nstress

Xdd is a tool for measuring and analyzing disk performance characteristics on single systems or clusters of systems. It was designed by Thomas M. Ruwart of I/O Performance, Inc. to provide consistent and reproducible measurements of the sustained transfer rate of an I/O subsystem. It is a command line-based tool that grew out of the UNIX community and has been ported to run in Windows environments as well. Xdd is free software distributed under the GNU General Public License, and is available for download at:
http://www.ioperformance.com/products.htm
The Xdd distribution comes with all the source code necessary to install Xdd and the companion timeserver and gettime utility programs.

DS4000 Best Practices and Performance Tuning Guide, SG24-6363-02, has detailed descriptions of how to use these measurement and test tools:
http://www.redbooks.ibm.com/abstracts/sg246363.html?Open

Part 1


Performance best practices


In this part, we describe best practices for obtaining the best performance from your SAN Volume Controller environment.

Chapter 9.

SVC 6.2 performance highlights


In this chapter, we discuss the latest performance improvements achieved by SAN Volume Controller (SVC) code release 6.2, the new SVC node hardware models CF8 and CG8, and the new SVC Performance Monitoring Tool.

9.1 SVC continuing performance enhancements


Since IBM first introduced the SVC in May 2003, its performance has been continually improved to meet increasing client demands. The SVC hardware architecture, based on IBM x-Series servers, allows fast adoption of the latest technological improvements available, such as multi-core processors, increased memory, faster Fibre Channel interfaces, and optional features. Table 9-1 shows the main specifications of each SVC node model for comparison.
Table 9-1   SVC node models specifications

SVC node  x-Series  Processors               Memory  FC ports   SSDs             iSCSI
model     model                                      and speed
4F2       x335      2 Xeon                   4GB     4@2Gbps    -                -
8F2       x336      2 Xeon                   8GB     4@2Gbps    -                -
8F4       x336      2 Xeon                   8GB     4@4Gbps    -                -
8G4       x3550     2 Xeon 5160              8GB     4@4Gbps    -                -
8A4       x3250M2   1 dual-core Xeon 3100    8GB     4@4Gbps    -                -
CF8       x3550M2   1 quad-core Xeon E5500   24GB    4@8Gbps    up to 4x 146GB*  2x 1Gbps
CG8       x3550M3   1 quad-core Xeon E5600   24GB    4@8Gbps    up to 4x 146GB*  2x 1Gbps, 2x 10Gbps*

Note: Items marked with an asterisk (*) are optional. In the CG8 model, a node can have either SSD drives or the 10 Gbps iSCSI interfaces, but not both.

In July 2007, an SVC cluster with eight 8G4 nodes running code version 4.2 delivered 272,505.19 SPC-1 IOPS. In February 2010, an SVC cluster with six CF8 nodes running code version 5.1 delivered 380,489.30 SPC-1 IOPS. For details about each of these benchmarks, see the documents posted at the URLs below. Also check the Storage Performance Council web site for the latest published SVC benchmarks.
http://www.storageperformance.org/benchmark_results_files/SPC-1/IBM/A00087_IBM_DS8700_SVC-5.1-6node/a00087_IBM_DS8700_SVC5.1-6node_full-disclosure-r1.pdf
http://www.storageperformance.org/results/a00052_IBM-SVC4.2_SPC1_full-disclosure.pdf

Figure 9-1 on page 229 compares the performance of two SVC clusters, each with a single I/O group, under a series of different workloads. The first case is a two-node 8G4 cluster running SVC version 4.3, and the second is a two-node CF8 cluster running SVC version 5.1. The workload labels are as follows:
- SR / SW: sequential read / sequential write
- RH / RM / WH / WM: read or write, cache hit / cache miss
- 512b / 4K / 64K: block size
- 70/30: mixed profile, 70% read and 30% write

Figure 9-1

When discussing enterprise storage solutions, raw I/O performance is important, but it is not everything: to date, IBM has shipped more than 22,500 SVC engines, running in more than 7,200 SVC systems. In 2008 and 2009, across the entire installed base, SVC delivered better than five nines (99.999%) availability. Check the IBM SVC web site for the latest information on SVC:
http://www.ibm.com/systems/storage/software/virtualization/svc

9.2 Solid State Drives (SSDs) and Easy Tier


SVC version 6.2 radically increased the number of possible approaches you can take with your managed storage, among other things by introducing the use of Solid State Drives (SSDs), both internally in the SVC nodes and in the managed array controllers, along with Easy Tier to automatically analyze and make the best possible use of your fastest storage tier. Solid State Drives are much faster than conventional disks, but they are also much more expensive. SVC node model CF8 already supported internal SSDs in code version 5.1. Figure 9-2 shows throughput figures for SVC version 5.1 with SSDs alone.

Figure 9-2   SVC 5.1 2-node cluster with internal SSDs: throughput for a variety of workloads

The recommended configuration and use of SSDs in SVC 6.2, either installed internally in the SVC nodes or in the managed storage controllers, is covered in other chapters of this book; see Chapters 10, 11, and 12 for details.

Note: While we provide in this book several recommendations on how to fine-tune your existing SVC and get the best from it, not only in I/Os per second but also in ease of management, there are many more possible scenarios than we can cover here. We strongly encourage you to contact your IBM Representative and Storage Techline for advice if you have a highly demanding storage environment. They have the knowledge and tools to provide the best-fitting, tailor-made SVC solution for your needs.

9.2.1 Internal SSDs Redundancy


To achieve internal SSD redundancy with SVC version 5.1 in case of node failure, you needed to use a scheme in which the SSDs in one node were mirrored by a corresponding set of SSDs in the partner node. The recommended way of accomplishing this was to define a striped MDisk group to contain the SSDs of a given node, supporting an equal number of primary and secondary VDisk copies. The physical node location of each primary VDisk copy had to match the node assignment of that copy and the node assignment of the VDisk itself. This arrangement ensures minimal traffic between nodes and a balanced load across the mirrored SSDs.

SVC version 6.2 introduced the use of arrays for the internal SSDs, which can be configured according to the use you intend for them. Table 9-2 on page 231 shows the possible RAID levels in which you can configure your internal SSD arrays.

Note: SVC version 5.1 supports the use of internal SSDs as managed disks, whereas SVC 6.2 uses them as array members. Internal SSDs are not supported in SVC version 6.1. Refer to Chapter 16, SVC scenarios on page 453 for an upgrade approach when SSDs are already in use with SVC version 5.1.

Table 9-2   RAID levels for internal SSDs

RAID level     What you will need         When to use it?           For best performance
(GUI Preset)
RAID-0         1-4 drives, all in a       When Volume Mirror is     A pool should only contain
(Striped)      single node.               on external MDisks.       arrays from a single IO Group.

RAID-1         2 drives, one in each      When using Easy Tier      An Easy Tier pool should only
(Easy Tier)    node of the IO Group.      and/or both mirrors       contain arrays from a single
                                          are on SSDs.              IO Group. The external MDisks
                                                                    in this pool should only be
                                                                    used by the same IO Group.

RAID-10        4-8 drives, equally        When using multiple       A pool should only contain
(Mirrored)     distributed amongst each   drives for a Volume.      arrays from a single IO Group.
               node of the IO Group.                                Recommended over Volume
                                                                    Mirroring.

9.2.2 Performance scalability and I/O Groups


Because an SVC cluster handles a particular volume's I/O in the pair of nodes (I/O group) that the volume belongs to, its performance scalability when adding nodes is close to linear: under normal circumstances, you can expect a four-node cluster to drive about twice as much I/O or throughput as a two-node cluster. This, of course, is valid provided you do not reach contention or a bottleneck in other components, such as the back-end storage controllers or SAN links.

It is important to keep your I/O workload balanced across your SVC nodes and I/O groups as evenly as possible, so that you do not end up with one I/O group experiencing contention while another has idle capacity. If you have a cluster with different node models, you can expect the I/O group with the newer node models to handle more I/O than the other ones, but exactly how much more is a difficult question, so we recommend that you do not count on it. Try to keep your SVC cluster built from similar node models, and refer to Chapter 14, Maintenance on page 395 for the different approaches on how to upgrade them.

Plan carefully the distribution of your servers across your SVC I/O groups, and of the volumes of one I/O group across its nodes. Re-evaluate this distribution whenever you attach more servers to your SVC. Use the tool described in 9.3, Real Time Performance Monitor on page 232 to help with this task.

9.3 Real Time Performance Monitor


SVC code version 6.2 includes a Real Time Performance Monitor screen. It shows the main performance indicators, including CPU utilization and throughput at the interfaces, volumes, and MDisks. Figure 9-3 shows an example of a nearly idle SVC cluster that, at the moment, was performing a single volume migration across storage pools.

Figure 9-3 SVC Real Time Performance Monitor

We recommend that you check this screen periodically for possible hot spots that might be developing in your SVC environment. To view this screen in the GUI, go to the Home page and then select Performance in the top left menu. The SVC GUI starts plotting the charts at that moment, so give it a few moments until you can see the graphs being drawn. You can position your cursor over a particular point in a curve to see details, such as the actual value and time for that point. SVC plots a new point every five seconds and shows you the last five minutes of data. Also try changing the System Statistics setting in the top left corner to see details for a particular node.

The SVC Performance Monitor does not store performance data for later analysis; its display only shows what happened in the last five minutes. While it can provide valuable input to help diagnose a performance problem in real time, it does not trigger performance alerts or provide the long-term trends required for capacity planning. For that, you need a tool capable of collecting and storing performance data for long periods and presenting the corresponding reports, such as IBM TotalStorage Productivity Center (TPC). See Chapter 14, Monitoring for details.

Chapter 10.

Backend performance considerations


In this chapter we discuss the performance considerations for back-end storage in an IBM System Storage SAN Volume Controller (SVC) implementation. We cover the configuration aspects of the back-end storage to optimize it for use with SVC, addressing both generic aspects and storage subsystem specifics. Proper back-end sizing and configuration is essential to achieve optimal performance from the SVC environment.

10.1 Workload considerations


Most applications meet performance objectives when average response times for random I/O are in the 2 - 15 millisecond range; however, there are response-time sensitive applications (typically transaction-oriented) that cannot tolerate maximum response times of more than a few milliseconds. You must consider availability in the design of these applications; however, be careful to ensure that sufficient back-end storage subsystem capacity is available to prevent elevated maximum response times.

Note: We recommend that you use the Disk Magic application to size the performance demand for specific workloads. You can obtain a copy of Disk Magic, which can assist you with this effort, from:
http://www.intellimagic.net

Batch and OLTP workloads


Clients often want to know whether to mix their batch and online transaction processing (OLTP) workloads in the same Managed Disk Group (MDG). Batch and OLTP workloads might both require the same tier of storage, but in many SVC installations, there are multiple MDGs in the same storage tier so that the workloads can be separated.

We usually recommend mixing workloads so that the maximum resources are available to any workload when needed. However, batch workloads are a good example of the opposing point of view. There is a fundamental problem with letting batch and online work share resources: the amount of I/O resources that a batch job can consume is often limited only by the amount of I/O resources available. To address this problem, it can help to segregate the batch workload to its own MDG, but doing so does not necessarily prevent node or path resources from being overrun. Those resources might also need to be considered if you implement a policy of batch isolation.

For SVC, an interesting alternative is to cap the data rate at which batch volumes are allowed to run by limiting the maximum throughput of a volume; refer to 6.5.1, Governing of volumes on page 112. Capping the data rate of batch volumes can potentially let online work benefit from periods when the batch load is light, while limiting the damage when the batch load is heavy.

A lot depends on when the workloads run. If you have mainly OLTP during the day shift and the batch workloads run at night, there is normally no problem with mixing the workloads in the same MDG. But if you run the two workloads concurrently, and the batch workload runs with no cap or throttling and requires high levels of I/O throughput, we recommend that, wherever possible, you segregate the workloads onto different MDGs that are supported by different back-end storage resources.
Note: SVC can greatly improve overall capacity and performance utilization of the back-end storage subsystem by balancing the workload across parts of it, or across the whole subsystem. It is important to remember that the SVC environment has to be sized properly at the back-end storage level, because virtualizing the environment cannot provide more than is available on the back-end storage. This is especially true with cache-unfriendly workloads.

10.2 Tiering
You can use the SVC to create tiers of storage, in which each tier has different performance characteristics, by including only MDisks that have the same performance characteristics within an MDG. So, if you have a storage infrastructure with, for example, three classes of storage, you create each volume from the MDG that has the class of storage most closely matching the volume's expected performance characteristics. Because migrating between storage pools (MDGs) is non-disruptive to the users, it is an easy task to migrate a volume to another storage pool if the actual performance differs from what was expected.

Note: If there is uncertainty about which storage pool (SP) to create a volume in, initially use the pool with the lowest performance, and then move the volume up to a higher performing pool later if required.

10.3 Storage controller considerations


Storage virtualization provides greater flexibility in managing the storage environment and, in general, gives the opportunity to utilize storage subsystems better than when they are used alone. SVC achieves this better and more balanced utilization by striping across back-end storage subsystem resources. Striping can be done across the entire storage subsystem, across part of the storage subsystem, or across several storage subsystems.

Tip: We recommend striping across back-end disks of the same characteristics. For example, if the storage subsystem has 100 15K FC drives and 200 7.2K SATA drives, do not stripe across all 300 drives; instead, have two striping groups, one with the 15K FC drives and the other with the 7.2K SATA drives.

Because SVC sits in the middle of the I/O path between the hosts and the storage subsystem, acting as a storage subsystem for the hosts, it can also improve the performance of the whole environment because of the additional cache. This is especially true for cache-friendly workloads.

SVC acts as a host towards the storage subsystems, so all standard host considerations apply. The main difference between SVC's use of the storage subsystem and a host's use of it is that, in the case of SVC, only one device is accessing it. With the use of striping, this access utilizes the storage subsystems evenly. The even utilization of a storage subsystem is only achievable with a proper setup: storage pools have to be distributed across all available storage subsystem resources - drives, I/O buses, and RAID controllers. It is important to understand that the SVC environment can only serve to the hosts the I/O capacity that is provided by the back-end storage subsystems and its internal SSD drives.

10.3.1 Backend IO capacity


To calculate what the SVC environment can deliver in terms of I/O performance, several factors have to be considered. The following steps illustrate how to calculate the I/O capacity of the SVC back end.

RAID array IO performance
RAID arrays are created on the storage subsystem as the placement for LUNs, which are assigned to the SVC as managed disks. The performance of a particular RAID array depends on the following factors:
- The type of drives used in the array (for example, 15K FC, 10K SAS, 7.2K SATA, SSD)
- The number of drives used in the array
- The type of RAID used (for example, RAID 10, RAID 5, RAID 6)
Table 10-1 shows conservative rule-of-thumb numbers for random I/O performance, which can be used in the calculations.
Table 10-1   Disk IO rates

Disk type          Number of IOps
FC 15K / SAS 15K   160
FC 10K / SAS 10K   120
SATA 7.2K          75

The next important parameter to consider when calculating the I/O capacity of a RAID array is the write penalty. The write penalty for various RAID array types is shown in Table 10-2.
Table 10-2   RAID write penalty

RAID type   Number of sustained failures   Number of disks   Write penalty
RAID 5      1                              N+1               4
RAID 10     minimum 1                      2xN               2
RAID 6      2                              N+2               6

RAID 5 and RAID 6 do not suffer from the write penalty when full stripe writes (also called stride writes) are performed; in that case, the write penalty is 1. With this information and the number of disks in each array, we are able to calculate the read and write I/O capacity of a particular array. In Table 10-3 we calculate the I/O capacity for an example RAID array with eight 15K FC drives.
Table 10-3   RAID array (8 drives) IO capacity

RAID type   Read-only IO capacity (IOps)   Write-only IO capacity (IOps)
RAID 5      7 x 160 = 1120                 (8 x 160)/4 = 320
RAID 10     8 x 160 = 1280                 (8 x 160)/2 = 640
RAID 6      6 x 160 = 960                  (8 x 160)/6 = 213
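As a minimal sketch of our own (not an SVC tool), the calculations behind Table 10-3 can be expressed in Python. The per-drive IOPS and write penalties follow Tables 10-1 and 10-2, reads are served by the data drives only for RAID 5 and RAID 6, and the function and dictionary names are our own invention:

```python
# Rule-of-thumb per-drive IOPS (Table 10-1) and write penalties (Table 10-2).
DRIVE_IOPS = {"FC/SAS 15K": 160, "FC/SAS 10K": 120, "SATA 7.2K": 75}
WRITE_PENALTY = {"RAID5": 4, "RAID10": 2, "RAID6": 6}
# Drives excluded from the read calculation (parity capacity).
PARITY_DRIVES = {"RAID5": 1, "RAID10": 0, "RAID6": 2}

def array_read_iops(raid, drives, drive_type="FC/SAS 15K"):
    """Read-only IO capacity: data drives times per-drive IOPS."""
    return (drives - PARITY_DRIVES[raid]) * DRIVE_IOPS[drive_type]

def array_write_iops(raid, drives, drive_type="FC/SAS 15K"):
    """Write-only IO capacity: all drives, divided by the write penalty."""
    return drives * DRIVE_IOPS[drive_type] // WRITE_PENALTY[raid]

def pool_iops(per_array_iops, mdisks, luns_per_array=1):
    """Storage pool capacity: sum over MDisks, shared between LUNs per array."""
    return mdisks * per_array_iops // luns_per_array
```

For example, array_read_iops("RAID5", 8) gives the 1120 IOps of Table 10-3, and pool_iops(array_read_iops("RAID5", 8), 10) the 11200 IOps of a 10-MDisk pool.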

In most current-generation storage subsystems, write operations are cached and handled asynchronously, meaning that the write penalty is hidden from the user. Of course, heavy and steady random writes can mean that the write cache destage is not fast enough, and in this situation the speed of the array will be limited to the rate defined by the number of drives and the RAID array type. The numbers from Table 10-3 on page 236 are

covering the worst-case scenario and are not taking into account any read or write cache efficiency.

Storage pool IO capacity
If we use a 1:1 LUN (SVC managed disk) to array mapping, then the array I/O capacity is also the I/O capacity of the managed disk. The I/O capacity of an SVC storage pool is the sum of the I/O capacities of all managed disks in that pool. For example, if we have 10 managed disks from RAID arrays with eight disks each, as used in our example, then the I/O capacity of the storage pool is as shown in Table 10-4.
Table 10-4   Storage pool IO capacity

RAID type   Read-only IO capacity (IOps)   Write-only IO capacity (IOps)
RAID 5      10 x 1120 = 11200              10 x 320 = 3200
RAID 10     10 x 1280 = 12800              10 x 640 = 6400
RAID 6      10 x 960 = 9600                10 x 213 = 2130

The I/O capacity of the RAID 5 storage pool therefore ranges from 3200 IOps, when the workload pattern at the RAID array level is 100% write, to 11200 IOps, when the workload pattern is 100% read. It is important to understand that this is the workload pattern generated by SVC towards the storage subsystem, which is not necessarily the same as the pattern from the host to the SVC, because of the SVC cache.

If more than one managed disk (LUN) is used per array, then each managed disk gets a portion of the array's I/O capacity. For example, if we have two LUNs per eight-disk array, and only one of the managed disks from each array is used in the storage pool, the I/O capacity for 10 managed disks is as shown in Table 10-5.
Table 10-5   Storage pool IO capacity with two LUNs per array

RAID type   Read-only IO capacity (IOps)   Write-only IO capacity (IOps)
RAID 5      10 x 1120/2 = 5600             10 x 320/2 = 1600
RAID 10     10 x 1280/2 = 6400             10 x 640/2 = 3200
RAID 6      10 x 960/2 = 4800              10 x 213/2 = 1065

The numbers shown in Table 10-5 are valid when both LUNs on each array are evenly utilized. If the second LUN on each of the arrays participating in the storage pool is idle, the storage pool can achieve the numbers shown in Table 10-4 on page 237. In an environment with two LUNs per array, the second LUN can also utilize the entire I/O capacity of the array and cause the LUN used for the SVC storage pool to get fewer available IOps. If the second LUN on those arrays is also used for an SVC storage pool, the cumulative I/O capacity of the two storage pools is equal to that of one storage pool with one LUN per array.

Storage subsystem cache influence
The numbers for SVC storage pool I/O capacity calculated in Table 10-5 did not take into account caching at the storage subsystem level, only the raw RAID array performance. Just as the hosts using SVC have a read/write pattern and cache efficiency in their workload, the SVC also has a read/write pattern and cache efficiency towards the storage subsystem. Consider the following example of a host-to-SVC IO pattern:

70:30:50 - 70% reads, 30% writes, 50% read cache hits

Read-related IOps generated by the host IO = Host IOps x 0.7 x 0.5
Write-related IOps generated by the host IO = Host IOps x 0.3

Table 10-6 shows the relation of host IOps to SVC back-end IOps.
Table 10-6   Host to SVC backend IO map

Host IOps   Pattern    Read IOps   Write IOps   Total IOps
2000        70:30:50   700         600          1300

The total IOps from Table 10-6 is the number of IOps that will be sent from the SVC to the storage pool on the storage subsystem. Because the SVC acts as a host towards the storage subsystem, we can also assume some read/write pattern and read cache hit ratio on this traffic. As we can see from the table above, the 70:30 read/write pattern with the 50% cache hit from the host to the SVC results in an approximate 54:46 read/write pattern for the SVC traffic to the storage subsystem. If we apply the same 50% read cache hit ratio, we get the 950 IOps that will be sent to the RAID arrays forming the storage pool inside the storage subsystem, as shown in Table 10-7.
Table 10-7   SVC to storage subsystem IO map

SVC IOps   Pattern    Read IOps   Write IOps   Total IOps
1300       54:46:50   350         600          950

Note: These calculations are valid only when each I/O generated from the host to the SVC generates exactly one I/O from the SVC to the storage subsystem. If, for example, the SVC combines several host I/Os into one storage subsystem I/O, a higher I/O capacity can be achieved. It is also important to understand that I/O with a larger block size decreases the RAID array I/O capacity, so it is quite possible that combining the I/Os will not increase the total array I/O capacity as viewed from the host perspective. The drive I/O capacity numbers used in the above calculations are for small block sizes (for example, 4K - 32K).

To simplify this example, we assume that the number of IOps generated on the path from the host to the SVC and from the SVC to the storage subsystem remains the same. If we apply the write penalty, the total IOps towards the RAID arrays for the above host example is as shown in Table 10-8.
Table 10-8   RAID array total utilization

RAID type   Host IOps   SVC IOps   RAID array IOps   RAID array IOps with write penalty
RAID 5      2000        1300       950               350 + 4 x 600 = 2750
RAID 10     2000        1300       950               350 + 2 x 600 = 1550
RAID 6      2000        1300       950               350 + 6 x 600 = 3950
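As an illustrative sketch of our own, the chain of calculations in Tables 10-6 through 10-8 can be written as two small functions. Note that with a 50% read cache hit ratio, the hit and miss fractions coincide, which matches the Host IOps x 0.7 x 0.5 expression used above; the function names are ours:

```python
def backend_ios(iops, read_pct, write_pct, read_hit_pct):
    """Reads that miss the read cache continue downstream; writes are destaged."""
    reads = int(iops * read_pct / 100 * (100 - read_hit_pct) / 100)
    writes = int(iops * write_pct / 100)
    return reads, writes

def array_iops_with_penalty(reads, writes, write_penalty):
    """Physical disk operations after applying the RAID write penalty."""
    return reads + write_penalty * writes

# Stage 1: host -> SVC, 70:30 pattern with a 50% SVC read cache hit.
svc_reads, svc_writes = backend_ios(2000, 70, 30, 50)   # 700 reads, 600 writes
# Stage 2: SVC -> storage subsystem, 50% subsystem read cache hit;
# writes pass through to the arrays.
sub_reads = int(svc_reads * (100 - 50) / 100)           # 350 reads
```

With these values, array_iops_with_penalty(sub_reads, svc_writes, 4) yields the 2750 IOps of the RAID 5 row in Table 10-8.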

Based on the above calculations, we can make a generic formula to calculate the available host I/O capacity from a given RAID array or storage pool I/O capacity. Assume the following parameters:

R - Host read ratio (%)

W - Host write ratio (%)
C1 - SVC read cache hits (%)
C2 - Storage subsystem read cache hits (%)
WP - Write penalty for the RAID array
XIO - RAID array/storage pool IO capacity

The host IO capacity (HIO) can then be calculated using the following formula:

HIO = XIO / (R*C1*C2/1000000 + W*WP/100)

As we can see, the host I/O capacity is lower than the storage pool I/O capacity when the denominator in the above formula is greater than 1. If we want to calculate at which write percentage (W) in the I/O pattern the host I/O capacity becomes lower than the storage pool capacity, we can use the following formula:

W =< 99.9 / (WP - C1*C2/10000)

We can see that the break-even write percentage (W) mainly depends on the write penalty of the RAID array. Table 10-9 shows the break-even value for W with a read cache hit of 50% at both the SVC and storage subsystem levels.
Table 10-9   W % break-even

RAID type   Write penalty (WP)   W % break-even
RAID 5      4                    26.64%
RAID 10     2                    57.08%
RAID 6      6                    17.37%

The W % break-even value from Table 10-9 is a good reference for which RAID level to use if we want to utilize the storage subsystem back-end RAID arrays as fully as possible from a write workload perspective. With the above formulas, we can also calculate the host I/O capacity for our example storage pool from Table 10-4 on page 237 with the 70:30:50 I/O pattern (read:write:cache hit) from the host side and a 50% read cache hit on the storage subsystem. The results are shown in Table 10-10.
Table 10-10   Host IO example capacity

  RAID type   Storage pool IO capacity (IOps)   Host IO capacity (IOps)
  RAID 5      11200                             8145
  RAID 10     12800                             16516
  RAID 6      9600                              4860

As already mentioned, the above formula assumes that there is no IO grouping at the SVC level. With SVC code 6.x, the default backend read and write IO size is 256 KB. A possible scenario is that a host reads or writes multiple (for example, eight) aligned 32 KB blocks from or to the SVC, which the SVC combines into one IO on the backend side. In such cases the above formulas would need to be adjusted, and the available host IO capacity for the particular storage pool would increase.
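The calculation above can be sketched in a few lines of Python. This is a sketch only; the function names are ours, and all ratios are whole-number percentages, as in the text.

```python
# Sketch of the host IO capacity estimate described above.
# All ratios are whole-number percentages (e.g. 70 for 70%).

def host_io_capacity(xio, r, w, c1, c2, wp):
    """xio: RAID array/storage pool IO capacity; r/w: host read/write ratio (%);
    c1/c2: SVC / storage subsystem read cache hit rate (%); wp: write penalty."""
    return xio / (r * c1 * c2 / 1000000 + w * wp / 100)

def write_break_even(wp, c1, c2):
    """Write percentage at which host capacity drops below pool capacity."""
    return 99.9 / (wp - c1 * c2 / 10000)

# 70:30 read:write pattern, 50% cache hit at both levels, RAID 5 pool of 11200 IOps
print(round(host_io_capacity(11200, 70, 30, 50, 50, 4)))  # 8145
print(round(write_break_even(4, 50, 50), 2))              # 26.64
```

Plugging in the other two RAID levels reproduces the remaining values of Table 10-9 and Table 10-10.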

Chapter 10. Backend performance considerations

FlashCopy
The use of FlashCopy on a volume can generate additional load on the backend. It is important to understand that until the FlashCopy (FC) target is fully copied, or when a copy rate of 0 is used, IO to the FC target causes IO load on the FC source. Once the FC target is fully copied, read/write IOs are served independently of the source read/write IO requests. The combinations shown in Table 10-11 are possible when a copy rate of 0 is used, or when the target FC volume is not fully copied and IOs are executed in an uncopied area.
Table 10-11   FlashCopy IO operations

  IO operation                                           Source     Source      Target     Target
                                                         read IOs   write IOs   read IOs   write IOs
  1x read IO from source                                 1          0           0          0
  1x write IO to source                                  1          1           0          1
  1x write IO to source to the already copied
  area (copy rate > 0)                                   0          1           0          0
  1x read IO from target (redirected to the source)      1          0           0          0
  1x read IO from target from the already copied
  area (copy rate > 0)                                   0          0           1          0
  1x write IO to target                                  1          0           0          1
  1x write IO to target to the already copied
  area (copy rate > 0)                                   0          0           0          1
As we can see, certain IO operations incur multiple backend IOs, which can degrade the performance of the source and target volumes. It is especially important to understand that if the source and the target FC volume share the same backend storage pool, as shown in Figure 10-1, the impact on performance is compounded.
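The amplification can be captured in a small lookup table. This is a sketch assuming the standard copy-on-write semantics described above; the names are ours.

```python
# Backend IOs generated by one host IO against a FlashCopy mapping, as a
# (source reads, source writes, target reads, target writes) tuple.
# "True" in the key means the affected grain has already been copied to the target.
FC_IOS = {
    ('source', 'read',  False): (1, 0, 0, 0),
    ('source', 'write', False): (1, 1, 0, 1),  # grain copied to target first
    ('source', 'write', True):  (0, 1, 0, 0),
    ('target', 'read',  False): (1, 0, 0, 0),  # redirected to the source
    ('target', 'read',  True):  (0, 0, 1, 0),
    ('target', 'write', False): (1, 0, 0, 1),  # grain read from the source
    ('target', 'write', True):  (0, 0, 0, 1),
}

# One host write to an uncopied source grain costs three backend IOs:
print(sum(FC_IOS[('source', 'write', False)]))  # 3
```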


Figure 10-1 FlashCopy source and target volume in the same storage pool

When frequent FC operations are executed and we do not want too much impact on the performance of the source FC volumes, it is recommended to put the target FC volumes in a different storage pool that does not share the same backend disks, or even, if possible, on a separate backend controller, as shown in Figure 10-2.

Figure 10-2 Source and target FlashCopy volumes on different storage pools

When heavy IO on the target FC volume is expected, for example when the FC target of a database is used for data mining, it is recommended to wait until the FC copy is complete before the target volume is used.


When the volumes participating in FC operations are large, the time required for a full copy may not be acceptable. In this situation it is recommended to use an incremental FC approach. In this setup the initial copy lasts longer, but all subsequent copies only copy the changes, because of the FC change tracking on the source and target volumes. This incremental copying is performed much faster and usually completes in an acceptable time frame, so that there is no need to utilize the target volumes during the copy operation. An example of this approach is shown in Figure 10-3.

Figure 10-3 Incremental Flash Copy for performance optimization

With this approach we will achieve minimal impact on the source FC volume.

Thin provisioning
The thin provisioning (TP) function also affects the performance of a volume, because it generates additional IOs. TP is implemented using a B-tree directory, which is stored in the storage pool in the same way as the actual data. The real capacity of the volume consists of the space allocated for data and the space used for the directory, as shown in Figure 10-4.

Figure 10-4 Thin provisioning volume

There are four possible IO scenarios for TP volumes:

Write to an unallocated region
a. Directory lookup indicates the region is unallocated
b. SVC allocates space and updates the directory
c. Data and directory are written to disk

Write to an allocated region
a. Directory lookup indicates the region is already allocated
b. Data is written to disk

Read to an unallocated region (unusual)
a. Directory lookup indicates the region is unallocated
b. SVC returns a buffer of 0x00s

Read to an allocated region
a. Directory lookup indicates the region has been allocated
b. Data is read from disk

As we can see from the list above, a single host IO request to a TP volume can result in multiple IOs on the backend side because of the related directory lookup. The following are key elements to consider when using TP volumes:

1. Use striping for all TP volumes, if possible across many backend disks. If TP volumes are used to reduce the number of required disks, this can also result in a performance penalty on those TP volumes.
2. Do not use TP volumes where high I/O performance is required.
3. TP volumes require more I/O capacity because of the directory lookups; for truly random workloads this can generate twice the workload on the backend disks. The directory I/O requests are two-way write-back cached, the same as the fastwrite cache, which means that some applications perform better because the directory lookup is served from the cache.
4. TP volumes require more CPU processing on the SVC nodes, so the performance per I/O group will be lower. A rule of thumb is that the I/O capacity of the I/O group can be only 50% when using only TP volumes.
5. A smaller grain size can have more influence on performance, as it requires more directory I/O. It is recommended that a bigger grain size (for example, 256K) is used where larger amounts of write data are expected.
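The four scenarios can be sketched as a simplified IO-counting model. It assumes one backend IO per directory lookup or update (in practice the directory is write-back cached, so lookups often hit); the function name is ours.

```python
# Simplified count of backend IOs for a single host IO to a TP volume.
def tp_backend_ios(op, allocated, directory_cached=False):
    """op: 'read' or 'write'; returns (backend reads, backend writes)."""
    reads = 0 if directory_cached else 1   # directory lookup
    writes = 0
    if op == 'write':
        if not allocated:
            writes += 1                    # directory update for the new grain
        writes += 1                        # the data itself
    elif allocated:
        reads += 1                         # data read
    # a read of an unallocated region just returns a buffer of 0x00s
    return reads, writes

print(tp_backend_ios('write', allocated=False))  # (1, 2)
print(tp_backend_ios('read', allocated=True))    # (2, 0)
```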

Thin Provisioning and FlashCopy


Thin provisioned (TP) volumes can be used in FlashCopy relationships, a combination often referred to as Space-Efficient FlashCopy. This relationship is shown in Figure 10-5.


Figure 10-5 SVC I/O facilities

For certain workloads the combination of the TP and FlashCopy (FC) functions can have a significant impact on the performance of target FC volumes. This is related to the fact that FlashCopy starts to copy the volume from its end; when the target FC volume is thin provisioned, this means that the last block will physically be at the beginning of the volume allocation on the backend storage, as shown in Figure 10-6.

Figure 10-6 FlashCopy thin provisioned target volume

In the case of a sequential workload, as shown in Figure 10-6, the data is read/written at the physical level (backend storage) from the end to the beginning. The underlying storage subsystem is then not able to recognize the sequential operation, which causes performance degradation for the particular I/O operations.

10.4 Array considerations


To achieve optimal performance in an SVC environment, the selection of the array layout is very important.


10.4.1 Selecting the number of LUNs per array


We generally recommend that you configure LUNs to use the entire array, which is especially true for midrange storage subsystems, where multiple LUNs configured on an array have been shown to result in significant performance degradation. The performance degradation is attributed mainly to smaller cache sizes and the inefficient use of available cache, defeating the subsystem's ability to perform full stride writes for Redundant Array of Independent Disks 5 (RAID 5) arrays. Additionally, I/O queues for multiple LUNs directed at the same array can have a tendency to overdrive the array.

Higher end storage controllers, such as the IBM System Storage DS8000 series, make this much less of an issue through the use of large cache sizes, and on these controllers most workloads show a negligible difference between a single LUN per array and multiple LUNs per array. However, large array sizes might require that multiple LUNs are created due to LUN size limitations.

For midrange storage controllers it is recommended to have one LUN per array, as this provides the optimal performance configuration. In midrange storage controllers, LUNs are usually owned by one controller, and having one LUN per array minimizes the impact of IO collisions at the drive level, which can happen with more LUNs per array, especially if those LUNs are not owned by the same controller and the drive access pattern on the LUNs is not the same.

Consider the manageability aspects of creating multiple LUNs per array. Be careful with the placement of these LUNs so that you do not create conditions where over-driving an array can occur. Additionally, placing these LUNs in multiple storage pools expands failure domains considerably, as we discussed in 5.1, Availability considerations for Storage Pools on page 72. Table 10-12 provides our recommended guidelines for array provisioning on IBM storage subsystems.
Table 10-12   Array provisioning

  Controller type                        LUNs (Managed disks) per array
  IBM System Storage DS3000/4000/5000    1
  IBM Storwize V7000                     1
  IBM System Storage DS6000              1
  IBM System Storage DS8000              1-2
  IBM XIV Storage System series          N/A

10.4.2 Selecting the number of arrays per storage pool


The capability to stripe across disk arrays is one of the most important performance advantages of the SVC; however, striping across more arrays is not necessarily better. The objective here is to only add as many arrays to a single storage pool as required to meet the performance objectives. Because it is usually difficult to determine what is required in terms of performance, the tendency is to add far too many arrays to a single storage pool, which again increases the failure domain as we discussed previously in 5.1, Availability considerations for Storage Pools on page 72.

Chapter 10. Backend performance considerations

245

7521BEPerfConsid.fm

Draft Document for Review February 16, 2012 3:49 pm

It is also worthwhile to consider the effect of aggregate load across multiple storage pools. Striping a workload across multiple arrays clearly has a positive effect on performance when the resources are dedicated, but the performance gains diminish as the aggregate load increases across all available arrays. For example, if you have a total of eight arrays and are striping across all eight arrays, your performance is much better than if you were striping across only four arrays. However, if the eight arrays are divided into two LUNs each and are also included in another storage pool (SP), the performance advantage drops as the load of SP2 approaches that of SP1, which means that when the workload is spread evenly across all SPs, there is no difference in performance.

More arrays in the storage pool have more of an effect with lower performing storage controllers, due to cache and RAID calculation constraints, because RAID is usually calculated in the main processor rather than on dedicated processors. So, for example, we require fewer arrays from a DS8000 than from, for example, a DS5000 to achieve the same performance objectives. This difference is primarily related to the internal capabilities of each storage subsystem and will vary based on the workload.

Table 10-13 on page 246 shows the recommended number of arrays per storage pool that is appropriate for general cases. Again, when it comes to performance, there can always be exceptions.
Table 10-13   Recommended number of arrays per storage pool

  Controller type                        Arrays per storage pool
  IBM System Storage DS3000/4000/5000    4 - 24
  IBM Storwize V7000                     4 - 24
  IBM System Storage DS6000              4 - 24
  IBM System Storage DS8000              4 - 12
  IBM XIV Storage System series          4 - 12

As seen in Table 10-13, the recommended number of arrays per storage pool is smaller for high end storage subsystems. This is related to the fact that those subsystems can deliver higher performance per array, even if the number of disks in the array is the same. The performance difference results from better multilayer caching and from specialized processors for RAID calculations. It is important to understand the following:

- You must consider the number of MDisks per array along with the number of arrays per MDG to understand aggregate MDG loading effects.
- You can achieve availability improvements without compromising performance objectives.
- Prior to version 6.2 of the SVC code, the SVC cluster used only one path to a managed disk; all other paths were standby paths. When managed disks are recognized by the cluster, active paths are assigned in a round-robin fashion. If we want to utilize all 8 ports in one IO group, we should have at least 8 managed disks from a particular backend storage subsystem. In a setup with one managed disk per array, this means there should be at least 8 arrays from each backend storage subsystem.


10.5 I/O ports, cache, throughput considerations


When configuring backend storage subsystems for an SVC environment, it is very important to provide enough IO ports on the backend storage subsystems to access the LUNs (managed disks). The backend storage must have adequate IOps and throughput capacity to achieve the right performance level on the host side. Although the SVC greatly improves the utilization of the storage subsystem and increases performance, the backend storage subsystems must be capable of handling the load. It is also very important that the backend storage has enough cache for the installed capacity, because write performance especially depends on a correctly sized write cache.

10.5.1 Back-end queue depth


SVC submits I/O to the back-end (MDisk) storage in the same fashion as any direct-attached host. For direct-attached storage, the queue depth is tunable at the host and is often optimized based on the specific storage type as well as various other parameters, such as the number of initiators. For the SVC, the queue depth is also tuned; however, the optimal value used is calculated internally. Note that the exact algorithm used to calculate queue depth is subject to change, so do not rely upon the following details staying the same. However, this summary is true of SVC 4.3.0. There are two parts to the algorithm: a per MDisk limit and a per controller port limit.

Q = ((P x C) / N) / M

If Q > 60, then Q = 60 (maximum queue depth is 60)
If Q < 3, then Q = 3 (minimum queue depth is 3)

In this algorithm:
Q = The queue depth for any MDisk in a specific controller
P = Number of WWPNs visible to SVC in a specific controller
N = Number of nodes in the cluster
M = Number of MDisks provided by the specific controller
C = A constant that varies by controller type:
  - FAStT200, 500, DS4100, and EMC CLARiiON = 200
  - DS4700, DS4800, DS6K, and DS8K = 1000
  - Any other controller = 500

When SVC has submitted and has Q I/Os outstanding for a single MDisk (that is, it is waiting for Q I/Os to complete), it will not submit any more I/O until part of the I/O completes. That is, any new I/O requests for that MDisk will be queued inside the SVC, which is undesirable and indicates that the back-end storage is overloaded. The following example shows how a 4-node SVC cluster calculates the queue depth for 150 LUNs on a DS8000 storage controller using six target ports:


Q = ((6 ports x 1000/port) / 4 nodes) / 150 MDisks = 10

With the sample configuration, each MDisk has a queue depth of 10. SVC 4.3.1 introduced dynamic sharing of queue resources based on workload: MDisks with a high workload can now borrow unused queue allocation from less busy MDisks on the same storage system. While the values are calculated internally and this enhancement provides for better sharing, it is important to consider queue depth when deciding how many MDisks to create.
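The calculation can be sketched as follows. This is a sketch of the SVC 4.3.0-era behavior described above; the function name is ours.

```python
# Per-MDisk queue depth, clamped to the [3, 60] range described above.
def mdisk_queue_depth(p, c, n, m):
    """p: controller WWPNs visible to SVC, c: per-controller constant,
    n: nodes in the cluster, m: MDisks provided by the controller."""
    q = ((p * c) / n) / m
    return max(3, min(60, int(q)))

# 4-node cluster, 150 DS8000 LUNs (C = 1000), six target ports:
print(mdisk_queue_depth(6, 1000, 4, 150))  # 10
```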

10.5.2 MDisk transfer size


The size of I/O that the SVC performs to the MDisk depends on where the I/O originated.

Host I/O
In SVC versions prior to 6.x, the maximum backend transfer size resulting from host IO under normal I/O is 32 KB. This means that a host IO bigger than 32 KB is broken into several IOs sent to the backend storage, as shown in Figure 10-7. In this example the transfer size of the IO from the host side is 256 KB.

Figure 10-7 SVC backend IO before version 6.x

In such a case the IO utilization of the backend storage ports can be a multiple of the number of IOs coming from the host side. This is especially true for sequential workloads, where the IO block size tends to be bigger than in traditional random IO. To address this, the backend block IO size for reads and writes was increased to 256 KB in SVC versions 6.x, as shown in Figure 10-8.


Figure 10-8 SVC backend IO after version 6.x

The internal cache track size is 32 KB; therefore, when an IO comes to the SVC it is split across the adequate number of cache tracks. For the above example this would be eight 32 KB cache tracks. Although the backend IO block size can be up to 256 KB, the particular host IO can be smaller, so read or write operations to the backend managed disks can range from 512 bytes to 256 KB. The same is true for the cache, as the tracks are populated to the size of the IO. For example, a 60 KB IO fits in two tracks, where the first track is fully populated with 32 KB and the second one holds only 28 KB. If the host IO request is bigger than 256 KB, it is split into 256 KB chunks, where the last chunk can be partial depending on the size of the host IO.
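The splitting described above can be sketched as follows (the helper name is ours):

```python
import math

# Map a host IO (in KB) to 32 KB cache tracks and 256 KB backend IOs (SVC 6.x).
def split_host_io(size_kb, track_kb=32, backend_kb=256):
    tracks = math.ceil(size_kb / track_kb)
    backend_ios = math.ceil(size_kb / backend_kb)
    return tracks, backend_ios

print(split_host_io(256))  # (8, 1): eight 32 KB tracks, one backend IO
print(split_host_io(60))   # (2, 1): one full 32 KB track plus a 28 KB track
```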

FlashCopy I/O
The transfer size for FlashCopy can be 64 KB or 256 KB, because the FlashCopy grain size is 64 KB or 256 KB, and a write of any size that changes data within a grain results in a single read of that grain from the source and a write of the grain to the target.

Thin provisioning I/O


The use of thin provisioning also affects the backend transfer size, which depends on the granularity at which space is allocated. The granularity can be 32 KB, 64 KB, 128 KB, or 256 KB. When a grain is initially allocated, it is always formatted by writing 0x00s.


Coalescing writes
The SVC coalesces writes up to the 32 KB track size if the writes reside in the same track prior to destage; for example, 4 KB is written into a track, and another 4 KB is written to another location in the same track. This track moves to the bottom of the least recently used (LRU) list in the cache upon the second write, and the track now contains 8 KB of actual data. This can continue until the track reaches the top of the LRU list and is then destaged; the data is written to the back-end disk and removed from the cache. Any contiguous data within the track is coalesced for the destage.

Sequential writes
The SVC does not employ a caching algorithm for explicit sequential detect, which means that the coalescing of writes in the SVC cache has a random component to it. For example, 4 KB writes to VDisks translate to a mix of 4 KB, 8 KB, 16 KB, 24 KB, and 32 KB transfers to the MDisks, with reducing probability as the transfer size grows. Although larger transfer sizes tend to be more efficient, this varying transfer size has no effect on the controller's ability to detect and coalesce sequential content to achieve full stride writes.

Sequential reads
The SVC uses prefetch logic for staging reads based on statistics maintained on 128 MB regions. If the sequential content is sufficiently high within a region, prefetch occurs with 32 KB reads.

10.6 SVC extent size


The SVC extent size defines several important parameters of the virtualized environment:

- The maximum size of a volume
- The maximum capacity of a single managed disk from the backend systems
- The maximum capacity that can be virtualized by the SVC cluster

Table 10-14 depicts the possible values in conjunction with the extent size.
Table 10-14   SVC extent sizes

  Extent size   Maximum non thin-    Maximum thin-        Maximum MDisk         Total storage capacity
  (MB)          provisioned volume   provisioned volume   capacity in GB        manageable per system
                capacity in GB       capacity in GB
  16            2048 (2 TB)          2000                 2048 (2 TB)           64 TB
  32            4096 (4 TB)          4000                 4096 (4 TB)           128 TB
  64            8192 (8 TB)          8000                 8192 (8 TB)           256 TB
  128           16,384 (16 TB)       16,000               16,384 (16 TB)        512 TB
  256           32,768 (32 TB)       32,000               32,768 (32 TB)        1 PB
  512           65,536 (64 TB)       65,000               65,536 (64 TB)        2 PB
  1024          131,072 (128 TB)     130,000              131,072 (128 TB)      4 PB
  2048          262,144 (256 TB)     260,000              262,144 (256 TB)      8 PB
  4096          262,144 (256 TB)     262,144              524,288 (512 TB)      16 PB
  8192          262,144 (256 TB)     262,144              1,048,576 (1024 TB)   32 PB

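The limits in Table 10-14 follow a simple pattern: an MDisk (or a non thin-provisioned volume, capped at 256 TB) can contain 2^17 extents, and a cluster 2^22 extents. A sketch of that pattern (the helper is ours, sizes returned in TB):

```python
TB = 1024 * 1024  # MB per TB

def svc_limits(extent_mb):
    """Return (max volume, max MDisk, total manageable capacity) in TB."""
    max_mdisk = extent_mb * (128 * 1024)        # 2**17 extents per MDisk
    max_volume = min(max_mdisk, 256 * TB)       # volume size caps at 256 TB
    total = extent_mb * (4 * 1024 * 1024)       # 2**22 extents per cluster
    return max_volume // TB, max_mdisk // TB, total // TB

print(svc_limits(16))    # (2, 2, 64)
print(svc_limits(8192))  # (256, 1024, 32768)
```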
The SVC extent size also defines how many extents are used for a particular volume. The example of two different extent sizes shown in Figure 10-9 shows that with the larger extent size, fewer extents are required.

Figure 10-9 Different extent sizes for the same volume

The extent size and the number of managed disks in the storage pool define the extent distribution for striped volumes. In the example shown in Figure 10-10 we can see two different cases: in the first case the ratio of volume size to extent size equals the number of managed disks in the storage pool, and in the second case this ratio is not equal to the number of managed disks.

Figure 10-10 SVC extents distribution


For even storage pool utilization, it is recommended to align the size of volumes and extents so that an even extent distribution can be achieved. Because volumes are typically used from the beginning of the volume, this does not bring performance improvements by itself, and it is only valid for non thin provisioned volumes.

Tip: It is recommended to align the extent size to the underlying backend storage, for example to an internal array stride size, if this is possible in relation to the whole cluster size.

10.7 SVC cache partitioning


In a situation where more I/O is driven to an SVC node than can be sustained by the back-end storage, the SVC cache can become exhausted. This situation can happen even if only one storage controller is struggling to cope with the I/O load, but it impacts traffic to the others as well. To avoid this situation, SVC cache partitioning provides a mechanism to protect the SVC cache from overloaded controllers as well as misbehaving controllers. The SVC cache partitioning function is implemented on a per storage pool (SP) basis; that is, the cache automatically partitions the available resources on a per SP basis. The overall strategy is to protect the individual controller from overloading or faults. If many controllers (or in this case, SPs) are overloaded, the overall cache can still suffer. Table 10-15 shows the upper limit of write cache data that any one partition, or SP, can occupy.
Table 10-15   Upper limit of write cache data

  Number of SPs   Upper limit
  1               100%
  2               66%
  3               40%
  4               30%
  5 or more       25%
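These limits can be expressed as a small helper (a sketch; the function name is ours):

```python
# Upper limit (%) of write cache any single storage pool partition may occupy.
def write_cache_upper_limit(num_pools):
    limits = {1: 100, 2: 66, 3: 40, 4: 30}
    return limits.get(num_pools, 25)  # five or more pools: 25% each

print([write_cache_upper_limit(n) for n in (1, 2, 4, 8)])  # [100, 66, 30, 25]
```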

The effect of the SVC cache partitioning is that no single SP occupies more than its upper limit of cache capacity with write data. Upper limits are the point at which the SVC cache starts to limit incoming I/O rates for volumes created from the SP. If a particular SP reaches the upper limit, it will experience the same result as a global cache resource that is full. That is, the host writes are serviced on a one-out one-in basis - as the cache destages writes to the back-end storage. However, only writes targeted at the full SP are limited, all I/O destined for other (non-limited) SPs continues normally. Read I/O requests for the limited SP also continue normally. However, because the SVC is destaging write data at a rate that is obviously greater than the controller can actually sustain (otherwise, the partition does not reach the upper limit), reads are serviced equally as slowly. The main thing to remember is that the partitioning is only limited on write I/Os. In general, a 70/30 or 50/50 ratio of read to write operations is observed. Of course, there are applications, or workloads, that perform 100% writes; however, write cache hits are much less of a benefit than read cache hits. A write always hits the cache. If modified data already resides in the
cache, it is overwritten, which might save a single destage operation. However, read cache hits provide a much more noticeable benefit, saving seek and latency time at the disk layer. In all benchmarking tests performed, even with a single active SP, good path SVC I/O group throughput remains the same as it was before the introduction of SVC cache partitioning. For in-depth information about SVC cache partitioning, we recommend the following IBM Redpaper publication: IBM SAN Volume Controller 4.2.1 Cache Partitioning, REDP-4426-00

10.8 DS8000 considerations


In this section we look at SVC performance considerations when using the DS8000 as backend storage.

10.8.1 Volume layout


Volume layout considerations related to SVC performance are described here.

Ranks to extent pools mapping


When configuring the DS8000, two different approaches for mapping ranks to extent pools exist:

- One rank per extent pool
- Multiple ranks per extent pool using DS8000 Storage Pool Striping (SPS)

The most common approach is to map one rank to one extent pool, which provides good control over volume creation, because it ensures that all volume allocations from the selected extent pool come from the same rank.

The SPS feature became available with the R3 microcode release for the DS8000 series and effectively means that a single DS8000 volume can be striped across all the ranks in an extent pool (therefore, the functionality is often referred to as extent pool striping). So, if a given extent pool includes more than one rank, a volume can be allocated using free space from several ranks (which also means that SPS can only be enabled at volume creation; no reallocation is possible). The SPS feature requires that your DS8000 layout is well thought-out from the beginning to utilize all resources in the DS8000. If this is not done, SPS might cause severe performance problems (for example, if a heavily loaded extent pool is configured with multiple ranks from the same DA pair). Because the SVC itself stripes across MDisks, the SPS feature is not as relevant here as when accessing the DS8000 directly.

Regardless of which approach is used, a minimum of two extent pools must be used to fully and evenly utilize the DS8000, because of the extent pool affinity to the two internal servers (server0 and server1). The decision about which type of rank to extent pool mapping to use depends mainly on the following factors:

1. The model of DS8000 used as backend storage (DS8100/DS8300/DS8700/DS8800)
2. How stable the DS8000 configuration is
3. Which microcode is installed or can be installed on the DS8000


One rank to one extent pool


When the DS8000 physical configuration is static from the beginning, or when microcode 6.1 or above is not available, it is recommended to use one rank to one extent pool mapping. In such a configuration it is also recommended to define one LUN per extent pool, if possible. The DS8100 and DS8300 do not support LUNs larger than 2 TB, so if a rank is bigger than 2 TB, more than one LUN must be defined on that rank. This means that two LUNs share the same backend disks (spindles), which has to be taken into account for performance planning. An example of such a configuration is shown in Figure 10-11.

Figure 10-11 Two LUNs per DS8300 rank

The DS8700 and DS8800 models do not have the 2 TB limit, so it is recommended to use a single LUN to rank mapping, as shown in Figure 10-12.

Figure 10-12 One LUN per DS8800 rank

In this setup we have as many extent pools as there are ranks, and the extent pools are evenly divided between both internal servers (server0 and server1). With both approaches, the SVC is used to distribute the workload evenly across ranks by striping the volumes across LUNs. One of the benefits of one rank to one extent pool is that the physical LUN placement can easily be determined when this is required, for example in performance analysis.


The drawback of such a setup is that when additional ranks are added and integrated into existing SVC storage pools, the existing volumes have to be restriped either manually or with scripts.

Multiple ranks in one extent pool


When DS8000 microcode level 6.1 or higher is installed or available, and the physical configuration of the DS8000 will change during its lifecycle (for example, additional capacity will be installed), it is recommended to use SPS with two extent pools for each disk type. Two extent pools are required to balance the use of processor resources. An example of such a setup is shown in Figure 10-13.

Figure 10-13 Multiple ranks in extent pool

With this design it is important that the LUN size is defined in such a way that each LUN has the same number of extents on each rank (the extent size is 1 GB). In the example above this means that a LUN would have a size of N x 10 GB. With this approach the utilization of the DS8000 at the rank level is balanced. If an additional rank is added to the configuration, the existing DS8000 LUNs (SVC managed disks) can be rebalanced using a DS8000 Easy Tier manual operation so that optimal resource utilization of the DS8000 is achieved. With this there is no need to restripe the volumes at the SVC level.

Extent pools
The number of extent pools on the DS8000 depends on the rank setup. As described above, a minimum of two extent pools is required to evenly utilize both servers inside the DS8000. In all cases an even number of extent pools provides the most even distribution of resources.

Device adapter pair considerations for selecting DS8000 arrays


The DS8000 storage architecture accesses disks through pairs of device adapters (DA pairs), with one adapter in each storage subsystem controller. The DS8000 scales from two to eight DA pairs. When possible, consider adding arrays to storage pools (SPs) based on multiples of the installed DA pairs. For example, if the storage controller contains six DA pairs, use either six or 12 arrays in an SP, with arrays from all DA pairs in a given pool.

Balancing workload across DS8000 controllers


When configuring storage on the IBM System Storage DS8000 disk storage subsystem, it is important to ensure that ranks on a device adapter (DA) pair are evenly balanced between odd and even extent pools. Failing to do this can result in considerable performance degradation due to uneven device adapter loading. The DS8000 assigns server (controller) affinity to ranks when they are added to an extent pool. Ranks that belong to an even-numbered extent pool have an affinity to server0, and ranks that belong to an odd-numbered extent pool have an affinity to server1. Figure 10-14 shows an example of a configuration that will result in a 50% reduction in available bandwidth. Notice how arrays on each of the DA pairs are only being accessed by one of the adapters. In this case, all ranks on DA pair 0 have been added to even-numbered extent pools, which means that they all have an affinity to server0, and therefore the adapter in server1 sits idle. Because this condition is true for all four DA pairs, only half of the adapters are actively performing work. This condition can also occur on a subset of the configured DA pairs.

Figure 10-14 DA pair reduced bandwidth configuration

Example 10-1 shows what this invalid configuration looks like from the CLI output of the lsarray and lsrank commands. The important thing to notice here is that arrays residing on the same DA pair contain the same group number (0 or 1), meaning that they have affinity to the same DS8000 server (server0 is represented by group0 and server1 is represented by group1). As an example of this situation, arrays A0 and A4 can be considered. They are both attached to DA pair 0, and in this example, both arrays are added to an even-numbered extent pool (P0 and P4). Doing so means that both ranks have affinity to server0 (represented by group0), leaving the DA in server1 idle.
Example 10-1 Command output - lsarray and lsrank

dscli> lsarray -l
Date/Time: Aug 8, 2008 8:54:58 AM CEST IBM DSCLI Version: 5.2.410.299 DS: IBM.2107-75L2321
Array State  Data   RAIDtype  arsite Rank DA Pair DDMcap(10^9B) diskclass
===================================================================================
A0    Assign Normal 5 (6+P+S) S1     R0   0       146.0         ENT
A1    Assign Normal 5 (6+P+S) S9     R1   1       146.0         ENT
A2    Assign Normal 5 (6+P+S) S17    R2   2       146.0         ENT
A3    Assign Normal 5 (6+P+S) S25    R3   3       146.0         ENT
A4    Assign Normal 5 (6+P+S) S2     R4   0       146.0         ENT
A5    Assign Normal 5 (6+P+S) S10    R5   1       146.0         ENT
A6    Assign Normal 5 (6+P+S) S18    R6   2       146.0         ENT
A7    Assign Normal 5 (6+P+S) S26    R7   3       146.0         ENT

dscli> lsrank -l
Date/Time: Aug 8, 2008 8:52:33 AM CEST IBM DSCLI Version: 5.2.410.299 DS: IBM.2107-75L2321
ID Group State  datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts
======================================================================================
R0 0     Normal Normal    A0    5        P0        extpool0   fb      779  779
R1 1     Normal Normal    A1    5        P1        extpool1   fb      779  779
R2 0     Normal Normal    A2    5        P2        extpool2   fb      779  779
R3 1     Normal Normal    A3    5        P3        extpool3   fb      779  779
R4 0     Normal Normal    A4    5        P4        extpool4   fb      779  779
R5 1     Normal Normal    A5    5        P5        extpool5   fb      779  779
R6 0     Normal Normal    A6    5        P6        extpool6   fb      779  779
R7 1     Normal Normal    A7    5        P7        extpool7   fb      779  779

Figure 10-15 shows an example of a correct configuration that balances the workload across all four DA pairs.

Figure 10-15 DA pair correct configuration

Example 10-2 shows what this correct configuration looks like from the CLI output of the lsrank command. The lsarray output remains unchanged. Notice that arrays residing on the same DA pair are now split between groups 0 and 1. Looking at arrays A0 and A4 once again shows that they now have different affinities (A0 to group0, A4 to group1). To achieve this correct configuration, the change compared to Example 10-1 is that array A4 now belongs to an odd-numbered extent pool (P5).
Example 10-2 Command output

dscli> lsrank -l
Date/Time: Aug 9, 2008 2:23:18 AM CEST IBM DSCLI Version: 5.2.410.299 DS: IBM.2107-75L2321
ID Group State  datastate Array RAIDtype extpoolID extpoolnam stgtype exts usedexts
======================================================================================
R0 0     Normal Normal    A0    5        P0        extpool0   fb      779  779
R1 1     Normal Normal    A1    5        P1        extpool1   fb      779  779
R2 0     Normal Normal    A2    5        P2        extpool2   fb      779  779
R3 1     Normal Normal    A3    5        P3        extpool3   fb      779  779
R4 1     Normal Normal    A4    5        P5        extpool5   fb      779  779
R5 0     Normal Normal    A5    5        P4        extpool4   fb      779  779
R6 1     Normal Normal    A6    5        P7        extpool7   fb      779  779
R7 0     Normal Normal    A7    5        P6        extpool6   fb      779  779
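The affinity check illustrated by these two examples can be automated. The following is a hypothetical sketch (not a DSCLI feature): it takes (rank, DA pair, group) tuples, as read from the lsarray and lsrank output, and reports any DA pair whose ranks all have affinity to the same server.

```python
from collections import defaultdict

def check_da_pair_balance(ranks):
    """ranks: (rank_id, da_pair, group) tuples taken from the lsarray and
    lsrank output. Returns the DA pairs whose ranks all sit on one server."""
    groups_seen = defaultdict(set)
    for _rank_id, da_pair, group in ranks:
        groups_seen[da_pair].add(group)
    return sorted(da for da, groups in groups_seen.items() if len(groups) < 2)

# Invalid layout (Example 10-1): both ranks of each DA pair on one server.
bad = [("R0", 0, 0), ("R4", 0, 0), ("R1", 1, 1), ("R5", 1, 1)]
# Corrected layout (Example 10-2): R4 and R5 moved to the other server.
good = [("R0", 0, 0), ("R4", 0, 1), ("R1", 1, 1), ("R5", 1, 0)]
print(check_da_pair_balance(bad))   # [0, 1]: both DA pairs unbalanced
print(check_da_pair_balance(good))  # []: balanced
```

A balanced configuration returns an empty list; any DA pair in the result has an idle adapter on one server.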

10.8.2 Cache
For the DS8000, you cannot tune the array and cache parameters. The arrays are either 6+P or 7+P, depending on whether the array site contains a spare. The segment size (the contiguous amount of data that is written to a single disk) is 256 KB for fixed block volumes. Caching for the DS8000 is done on a 64 KB track boundary.

10.8.3 Determining the number of controller ports for DS8000


Configure a minimum of four controller ports to the SVC per controller, regardless of the number of nodes in the cluster. Configure up to 16 controller ports for large controller configurations where more than 48 ranks are being presented to the SVC cluster. Currently, 16 ports per storage subsystem is the maximum supported by the SVC. For smaller DS8000 configurations, four controller ports are sufficient. Additionally, we recommend that no more than two ports of each of the DS8000's 4-port adapters are used; when the DS8000's 8-port adapters are used, no more than four ports per adapter should be used. Table 10-16 shows the recommended number of DS8000 ports and adapters based on rank count and adapter type.
Table 10-16 Recommended number of ports and adapters

Ranks     Ports   Adapters
2 - 16    4       2 - 4 (2/4-port adapters)
16 - 48   8       4 - 8 (2/4-port adapters), 2 - 4 (8-port adapters)
> 48      16      8 - 16 (2/4-port adapters), 4 - 8 (8-port adapters)
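Table 10-16 can be expressed as a simple lookup. This is an illustrative sketch (the function name is ours), with the assumption that a rank count on a range boundary (for example, exactly 16) uses the smaller port count:

```python
def recommended_ds8000_ports(ranks):
    """Recommended SVC-facing DS8000 port count per Table 10-16 (sketch).
    Assumption: boundary rank counts (16, 48) take the smaller value."""
    if ranks <= 16:
        return 4
    if ranks <= 48:
        return 8
    return 16

print([recommended_ds8000_ports(r) for r in (8, 32, 64)])  # [4, 8, 16]
```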

The DS8000 populates Fibre Channel (FC) adapters across two to eight I/O enclosures, depending on the configuration. Each I/O enclosure represents a separate hardware domain. Ensure that adapters configured to different SAN networks do not share the same I/O enclosure, as part of the goal of keeping redundant SAN networks isolated from each other. An example of DS8800 connections with 16 I/O ports on eight 8-port adapters is shown in Figure 10-16. In this case two ports per adapter are used.

Figure 10-16 DS8800 with 16 IO ports

An example of DS8800 connections with four I/O ports on two 4-port adapters is shown in Figure 10-17. In this case two ports per adapter are used.

Figure 10-17 DS8000 with four IO ports

Best practices that we recommend:
- Configure a minimum of four ports per DS8000.
- Configure 16 ports per DS8000 when more than 48 ranks are presented to the SVC cluster.
- Configure a maximum of two ports per 4-port DS8000 adapter and four ports per 8-port DS8000 adapter.
- Configure adapters across redundant SAN networks from different I/O enclosures.

10.8.4 Storage pool layout


The number of SVC storage pools (SPs) from the DS8000 primarily depends on the following factors:
- The types of disks installed in the DS8000
- The number of disks in the array:
  - RAID5 - 6+P+S
  - RAID5 - 7+P
  - RAID10 - 2+2+2P+2S
  - RAID10 - 3+3+2P

These factors define the performance and size attributes of the DS8000 LUNs that act as managed disks for SVC storage pools. As discussed in previous sections, an SVC storage pool should contain managed disks (MDisks) with the same performance and capacity characteristics, as this is required for even DS8000 utilization.

Tip: It is recommended that the main characteristics of the storage pool are described in its name. For example, a pool on a DS8800 with 146GB 15K FC disks in RAID5 could be named DS8800_146G15KFCR5.

Figure 10-18 shows an example of a DS8700 storage pool layout based on disk type and RAID level. In this case, ranks with RAID5 6+P+S and 7+P are combined in the same storage pool, and likewise RAID10 2+2+2P+2S and 3+3+2P ranks are combined in the same storage pool. With this approach, some volumes, or parts of volumes, are striped only over the MDisks (LUNs) that reside on arrays/ranks without a spare disk. Because those MDisks effectively have one more spindle, the additional performance capability compensates for the extra extents placed on them. This approach simplifies management of the SPs, as it allows a smaller number of SPs to be used. Four SPs are defined in this scenario:
- 146GB 15K R5 - DS8700_146G15KFCR5
- 300GB 10K R5 - DS8700_300G10KFCR5
- 450GB 15K R10 - DS8700_450G15KFCR10
- 450GB 15K R5 - DS8700_450G15KFCR5

Figure 10-18 DS8700 storage pools based on disk type and RAID level

To achieve a configuration that is fully optimized from the RAID perspective, the SPs would instead be based on the exact number of disks in each array/rank, as shown in Figure 10-19.

Figure 10-19 DS8700 storage pools with exact number of disks in the array/rank

With this setup, seven SPs are defined instead of four. Management complexity increases because more pools must be managed, but from the performance perspective the backend is completely balanced at the RAID level. Configurations with so many different disk types in one storage subsystem are not common; usually a single DS8000 system has a maximum of two disk types, and other disk types are installed in different systems. An example of such a setup on a DS8800 is shown in Figure 10-20.

Figure 10-20 DS8800 storage pool setup with two types of disks

Although it is possible to span a storage pool across multiple backend systems, as shown in Figure 10-21, it is recommended to keep storage pools bound inside a single DS8000 for availability reasons.

Figure 10-21 DS8000 spanned storage pool

Best practices that we recommend:
- Use the same type of arrays (disk and RAID type) in the storage pool.
- Minimize the number of storage pools. If one or two disk types are used, two storage pools can be used per DS8000: one for 6+P+S arrays and one for 7+P arrays if RAID5 is used, and likewise for RAID10 with 2+2+2P+2S and 3+3+2P arrays.
- Spread the storage pool across both internal servers (server0 and server1). This means using LUNs from extent pools with affinity to server0 together with LUNs from extent pools with affinity to server1 in the same storage pool.
- Where performance is not the main goal, a single storage pool can be used, mixing LUNs from arrays with different numbers of disks (spindles).

An example of a DS8800 with two storage pools, one for 6+P+S RAID5 arrays and one for 7+P arrays, is shown in Figure 10-22.

Figure 10-22 Three frame DS8800 with RAID5 arrays

10.8.5 Extent size


The SVC extent size should be aligned with the internal DS8000 extent size, which is 1GB. If the required cluster capacity calls for a different extent size, that requirement prevails.

10.9 XIV considerations


In this section we look at SVC performance considerations when using the XIV as backend storage.

10.9.1 LUN size


The main advantage of the XIV storage system is that all LUNs are distributed across all physical disks. This means that there is no attribute other than the volume size to consider when maximizing space usage and minimizing the number of LUNs. An XIV system can grow from 6 to 15 installed modules, and it can have 1TB, 2TB, or 3TB disks. The maximum LUN size that can be used on the SVC is 2TB, and a maximum of 511 LUNs can be presented from a single XIV system to the SVC cluster. Currently the SVC does not support dynamic expansion of LUNs on the XIV. The following LUN sizes are recommended:
- 1TB disks - 1632GB (Table 10-17)
- 2TB disks (Gen3) - 1669GB (Table 10-18)
- 3TB disks (Gen3) - 2185GB (Table 10-19)
Table 10-17, Table 10-18, and Table 10-19 show the number of managed disks and the capacity available based on the number of installed modules.
Table 10-17 XIV with 1TB disks and 1632GB LUNs

Number of XIV modules   LUNs (MDisks) at 1632GB   TB used   TB capacity available
6                       16                        26.1      27
9                       26                        42.4      43
10                      30                        48.9      50
11                      33                        53.9      54
12                      37                        60.4      61
13                      40                        65.3      66
14                      44                        71.8      73
15                      48                        78.3      79

Table 10-18 XIV with 2TB disks and 1669GB LUNs (Gen3)

Number of XIV modules   LUNs (MDisks) at 1669GB   TB used   TB capacity available
6                       33                        55.1      55.7
9                       52                        86.8      88
10                      61                        101.8     102.6
11                      66                        110.1     111.5
12                      75                        125.2     125.9
13                      80                        133.5     134.9
14                      89                        148.5     149.3
15                      96                        160.2     161.3

Table 10-19 XIV with 3TB disks and 2185GB LUNs (Gen3)

Number of XIV modules   LUNs (MDisks) at 2185GB   TB used   TB capacity available
6                       38                        83        84.1
9                       60                        131.1     132.8
10                      70                        152.9     154.9
11                      77                        168.2     168.3
12                      86                        187.9     190.0
13                      93                        203.2     203.6
14                      103                       225.0     225.3
15                      111                       242.5     243.3

If the XIV is initially not configured with its full capacity, the SVC rebalancing script can be used to optimize volume placement when additional capacity is added to the XIV.
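The LUN counts in Tables 10-17 through 10-19 follow from dividing the usable XIV capacity (decimal TB) by the recommended LUN size. A small sketch (the function name is ours, not an XIV or SVC tool):

```python
def xiv_mdisk_plan(usable_tb, lun_gb):
    """How many equal-size LUNs (SVC MDisks) fit into the XIV usable
    capacity, and the decimal TB they consume (as in Tables 10-17 to 10-19)."""
    luns = int(usable_tb * 1000 // lun_gb)   # decimal units: 1 TB = 1000 GB
    used_tb = round(luns * lun_gb / 1000, 1)
    return luns, used_tb

# 15-module XIV with 1TB disks: 79TB usable, 1632GB LUNs.
print(xiv_mdisk_plan(79, 1632))  # (48, 78.3), matching Table 10-17
```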

10.9.2 IO ports
The XIV supports from 8 to 24 FC ports, depending on the number of modules installed. Each FC-equipped module has two dual-port FC cards. It is recommended to use one port per card for SVC use. With this setup the number of ports available for SVC use ranges from 4 to 12, as shown in Table 10-20.
Table 10-20 XIV FC ports for SVC

Number of XIV modules   Modules with FC ports   Total available FC ports   Ports used per FC card   Ports available for the SVC
6                       4,5                     8                          1                        4
9                       4,5,7,8                 16                         1                        8
10                      4,5,7,8                 16                         1                        8
11                      4,5,7,8,9               20                         1                        10
12                      4,5,7,8,9               20                         1                        10
13                      4,5,6,7,8,9             24                         1                        12
14                      4,5,6,7,8,9             24                         1                        12
15                      4,5,6,7,8,9             24                         1                        12
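The port counts in Table 10-20 follow directly from the module layout: each FC-equipped interface module has two dual-port cards (four ports), and using one port per card leaves two ports per module for the SVC. An illustrative sketch (the function name is ours):

```python
def xiv_ports_for_svc(fc_modules):
    """Total FC ports, and ports left for the SVC when one port of each
    dual-port card is used (two cards per FC-equipped module)."""
    total_ports = 4 * len(fc_modules)  # 2 cards x 2 ports per module
    svc_ports = 2 * len(fc_modules)    # 1 port per card
    return total_ports, svc_ports

# 15-module XIV: interface modules 4-9 carry FC ports.
print(xiv_ports_for_svc([4, 5, 6, 7, 8, 9]))  # (24, 12), matching Table 10-20
```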

As the table shows, the SVC limit of 16 ports per storage subsystem is not reached. The ports available for SVC use should be connected to dual fabrics to provide redundancy, and each module should be connected to separate fabrics. An example of best-practice SAN connectivity is shown in Figure 10-23.

Figure 10-23 XIV SAN connectivity

Host definition for SVC on XIV system


It is recommended to use one host definition for the whole SVC cluster, defining all SVC WWPNs to this host and mapping all LUNs to it.

It is possible to use a cluster definition with each SVC node as a separate host, but then it is important that the mapped LUNs preserve their LUN IDs when mapped to the SVC.

10.9.3 Storage pool layout


Because all LUNs on a single XIV system share the same performance and capacity characteristics, it is recommended to use a single storage pool for a single XIV system.

Tip: It is recommended to use a single storage pool for a single XIV system.

10.9.4 Extent size


The recommended extent size is 1GB. If the required cluster capacity calls for a different extent size, that requirement prevails.

10.10 Storwize V7000 considerations


The Storwize V7000 (V7000) provides the same virtualization capabilities as the SVC, with the addition that it can also utilize internal disks. Like the SVC, the V7000 can virtualize external storage systems, and in many cases the V7000 alone can satisfy performance and capacity requirements. The V7000 is mainly used in combination with the SVC for the following reasons:
- To consolidate multiple V7000s into a single, bigger environment for scalability reasons.
- Where the SVC is already virtualizing other storage systems and additional capacity is provided by a V7000.
- Up to version 6.2, remote replication was not possible between the SVC and the V7000, so if the SVC was used in the primary data center and a V7000 was chosen for the secondary data center, an SVC at the secondary site was required for replication compatibility.
- The SVC in current versions provides more cache (24GB per node versus 8GB per V7000 node), so adding an SVC on top can provide additional caching capability, which is very beneficial for cache-friendly workloads.
- A V7000 with solid-state disks (SSDs) can be added to an SVC setup to provide Easy Tier capabilities at capacities bigger than possible with internal SVC SSDs. This is, for example, a common setup with backend storage that does not provide SSD capacity, or when too many internal SVC resources would be consumed by internal SSDs.

10.10.1 Volume setup


When the V7000 is used as a backend storage system for the SVC, its main function is to provide the RAID capability. The main recommendation for the V7000 setup in an SVC environment is to define one storage pool with one volume per V7000 array. This avoids striping over striping; striping is performed only at the SVC level. Each volume is then presented to the SVC as a managed disk (MDisk), and all MDisks from the same type of disks in the V7000 should be used in one storage pool at the SVC level.

The optimal array sizes for SAS disks are 6+1, 7+1, and 8+1. The smaller array sizes are mainly recommended for shorter RAID rebuild times; there are no other performance implications with bigger array sizes such as 10+1 or 11+1. An example of a V7000 configuration with optimal smaller arrays and non-optimal bigger arrays is shown in Figure 10-24.

Figure 10-24 V7000 array for SAS disks

As we can see in this example, one hotspare disk was used per enclosure. This is not a requirement, but it is a good practice, as it gives symmetrical use of the enclosures. At a minimum, it is recommended to use one hotspare (HS) disk per SAS chain for each type of disk in the V7000; if more than two enclosures are present, it is recommended to have at least two HS disks per SAS chain per disk type when those disks occupy more than two enclosures. An example of multiple disk types in the V7000 is shown in Figure 10-25.

Figure 10-25 V7000 with multiple disk types

When defining a volume at the V7000 level, the default values should be used. These defaults give a 256KB strip size (the size of the RAID chunk on a particular disk). This is in line with the SVC backend I/O size, which is 256KB in versions 6.1 and above. For example, a 256KB strip size gives a 2MB stride size (the size of a full RAID stripe) in an 8+1 array.
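The stride arithmetic can be checked quickly (illustrative helper only, not a V7000 command):

```python
def stride_size_kb(strip_kb, data_disks):
    """Full-stride (stripe) size: strip size times the number of data disks."""
    return strip_kb * data_disks

# Default 256KB strip on an 8+1 RAID 5 array gives a 2MB stride.
print(stride_size_kb(256, 8))  # 2048 KB == 2 MB
```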

The V7000 also supports large NL-SAS drives (2TB and 3TB). Using those drives in RAID5 arrays can produce significant RAID rebuild times of several hours. It is therefore recommended to use RAID6 to avoid a double failure during the rebuild period. An example of such a setup is shown in Figure 10-26.

Figure 10-26 V7000 RAID6 arrays

Tip: Make sure that volumes defined on V7000 are evenly distributed across all nodes.

10.10.2 IO ports
Each V7000 node canister has four FC ports for host access. These ports are used by the SVC to access the volumes on the V7000. The minimum configuration is to connect each V7000 node canister to two independent fabrics, as shown in Figure 10-27.

Figure 10-27 V7000 two connections per node

In this setup the SVC accesses a two-node V7000 configuration over four ports. Such connectivity is sufficient for lightly loaded V7000 environments. If the V7000 hosts capacity that requires more than two connections per node, four connections per node should be used, as shown in Figure 10-28.

Figure 10-28 V7000 four connections per node

With a two-node V7000 setup this gives eight target connections from the SVC perspective, which is well below the current SVC limit of 16 target ports per backend storage subsystem. The current V7000 configuration limit is a four-node cluster. With that configuration and four connections per node to the SAN, the 16 target port limit is reached, so this is still a supported configuration. Such an example is shown in Figure 10-29.

Figure 10-29 Four node V7000 setup

Important: It is very important that a minimum of two ports per node are connected to the SAN, with connections to two redundant fabrics.

10.10.3 Storage pool layout


As with any other storage subsystem where different disk types can be installed, it is recommended that volumes with the same characteristics (size, RAID level, rotational speed) are used in a single storage pool at the SVC level. It is also recommended that a single storage pool is used for all volumes of the same characteristics. For an optimal configuration, arrays with the same number of disks should be used in a storage pool. For example, if we have both 7+1 and 6+1 arrays, two pools would be used, as shown in Figure 10-30.

Figure 10-30 V7000 storage pool example with two pools

The example above has a hotspare disk in every enclosure, which is not a requirement. To avoid two pools for the same disk type, it is recommended to create the array configuration based on the following rules:
- Number of disks in the array: 6+1, 7+1, or 8+1
- Number of hotspare disks: minimum 2
Based on the array size, the following symmetrical array configurations are possible for a five-enclosure V7000:
- 6+1 - 17 arrays (119 disks) + 1 hotspare disk
- 7+1 - 15 arrays (120 disks) + 0 hotspare disks
- 8+1 - 13 arrays (117 disks) + 3 hotspare disks
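The array and hotspare arithmetic above is easy to verify for the 120 disks of a five-enclosure V7000 (illustrative sketch; not a V7000 tool):

```python
def v7000_symmetric_layout(total_disks, array_width):
    """Split disks of one type into arrays of `array_width` members
    (data + parity); the remainder becomes hotspare disks."""
    arrays, hotspares = divmod(total_disks, array_width)
    return arrays, hotspares

# Five enclosures x 24 disks = 120 disks of a single type.
for label, width in (("6+1", 7), ("7+1", 8), ("8+1", 9)):
    arrays, spares = v7000_symmetric_layout(120, width)
    print(label, "->", arrays, "arrays,", spares, "hotspares")
```

Only the 8+1 layout leaves at least the recommended minimum of two hotspare disks.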

The 7+1 array configuration does not provide any hotspare disks in the symmetrical array configuration, as shown in Figure 10-31.

Figure 10-31 V7000 7+1 symmetrical array configuration

The 6+1 array configuration provides a single hotspare disk in the symmetrical array configuration, as shown in Figure 10-32, which is below the recommended number of hotspare disks.

Figure 10-32 V7000 6+1 symmetrical array configuration

The 8+1 array configuration provides three hotspare disks in the symmetrical array configuration, as shown in Figure 10-33. This is within the recommended range for the number of hotspare disks (a minimum of two).

Figure 10-33 V7000 8+1 symmetrical array configuration

As you can see, the best configuration for a single storage pool with the same type of disks in a five-enclosure V7000 is the 8+1 array configuration.

Note: A symmetrical array configuration for the same disk type provides the least possible complexity in the storage pool configuration.

10.10.4 Extent size


The recommended extent size is 1GB. If the required cluster capacity calls for a different extent size, that requirement prevails.

10.11 DS5000 considerations


This section discusses DS5000 considerations, which also apply to the DS3000 and DS4000 models.

10.11.1 Selecting array and cache parameters


In this section, we describe the optimum array and cache parameters.

DS5000 array width


With Redundant Array of Independent Disks 5 (RAID 5) arrays, determining the number of physical drives to put into an array always presents a compromise. Striping across a larger number of drives can improve performance for transaction-based workloads. However, striping can also have a negative effect on sequential workloads. A common mistake that people make when selecting array width is the tendency to focus only on the capability of a single array to perform various workloads. However, you must also consider in this decision the aggregate throughput requirements of the entire storage server. A large number of physical disks in an array can create a workload imbalance between the controllers, because only one controller of the DS5000 actively accesses a specific array. When selecting array width, you must also consider its effect on rebuild time and availability.
A larger number of disks in an array increases the rebuild time for disk failures, which can have a negative effect on performance. Additionally, more disks in an array increase the probability of having a second drive fail within the same array prior to the rebuild completion of an initial drive failure, which is an inherent exposure to the RAID 5 architecture. Best practice: For the DS5000, we recommend array widths of 4+p and 8+p.

Segment size
With direct-attached hosts, considerations are often made to align device data partitions to physical drive boundaries within the storage controller. For the SVC, aligning device data partitions to physical drive boundaries within the storage controller is less critical, based on the caching that the SVC provides and the fact that there is less variation in the I/O profile used to access back-end disks. Because the maximum destage size for the SVC is 256KB, it is impossible to achieve full stride writes for random workloads. For the SVC, the only opportunity for full stride writes occurs with large sequential workloads, and in that case, the larger the segment size is, the better. Larger segment sizes can adversely affect random I/O, however. The SVC and controller cache do a good job of hiding the RAID 5 write penalty for random I/O, and therefore, larger segment sizes can be accommodated. The primary consideration for selecting segment size is to ensure that a single host I/O fits within a single segment, to prevent accessing multiple physical drives. Testing has shown that the best compromise for handling all workloads is to use a segment size of 256 KB.

Best practice: We recommend a segment size of 256 KB as the best compromise for all workloads.
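The two sizing rules in this section, that a host I/O should fit in one segment and that a full-stride write spans the segment size times the number of data disks, can be sketched as follows (illustrative helpers only):

```python
def fits_single_segment(host_io_kb, segment_kb=256):
    """True if a single host I/O fits within one segment (one drive access)."""
    return host_io_kb <= segment_kb

def full_stride_kb(segment_kb, data_disks):
    """Sequential I/O size needed for a full-stride write."""
    return segment_kb * data_disks

print(fits_single_segment(256))  # True: the SVC maximum destage size fits
print(full_stride_kb(256, 8))    # 2048 KB (2MB) for an 8+p RAID 5 array
```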

Cache block size


The DS4000 uses a 4 KB cache block size by default; however, it can be changed to 16 KB. For the earlier models of the DS4000 using 2 Gb Fibre Channel (FC) adapters, the 4 KB block size performed better for random I/O, and 16 KB performed better for sequential I/O. However, because most workloads contain a mix of random and sequential I/O, the default values have proven to be the best choice. For the higher performing DS4700 and DS4800, the 4 KB block size advantage for random I/O has become harder to see. Because most client workloads involve at least some sequential workload, the best overall choice for these models is the 16 KB block size.

Best practice: For the DS5/4/3000, set the cache block size to 16 KB.

Table 10-21 is a summary of the recommended SVC and DS5000 values.
Table 10-21 Recommended SVC values

Model    Attribute               Value
SVC      Extent size (MB)        256
SVC      Managed mode            Striped
DS5000   Segment size (KB)       256
DS5000   Cache block size (KB)   16 KB
DS5000   Cache flush control     80/80 (default)
DS5000   Readahead               1
DS5000   RAID 5                  4+p, 8+p

10.11.2 Considerations for controller configuration


In this section, we discuss controller configuration considerations.

Balancing workload across DS5000 controllers


A best practice when creating arrays is to spread the disks across multiple controllers, as well as alternating slots, within the enclosures. This practice improves the availability of the array by protecting against enclosure failures that affect multiple members within the array, as well as improving performance by distributing the disks within an array across drive loops. You spread the disks across multiple controllers, as well as alternating slots, within the enclosures by using the manual method for array creation. Figure 10-34 shows a Storage Manager view of a 2+p array that is configured across enclosures. Here, we can see that each disk of the three disks is represented in a separate physical enclosure and that slot positions alternate from enclosure to enclosure.

Figure 10-34 Storage Manager


10.11.3 Mixing array sizes within the storage pool


Mixing array sizes within the storage pool in general is not of concern. Testing has shown no measurable performance differences between selecting all 6+p arrays and all 7+p arrays as opposed to mixing 6+p arrays and 7+p arrays. In fact, mixing array sizes can actually help balance workload, because it places more data on the ranks that have the extra performance capability provided by the eighth disk. There is one small exposure here in the case where an insufficient number of the larger arrays are available to handle access to the higher capacity. In order to avoid this situation, ensure that the smaller capacity arrays do not represent more than 50% of the total number of arrays within the storage pool. Best practice: When mixing 6+p arrays and 7+p arrays in the same storage pool, avoid having smaller capacity arrays comprise more than 50% of the arrays.

10.11.4 Determining the number of controller ports for DS4000


The DS4000 must be configured with two ports per controller for a total of four ports per DS4000.

11

Chapter 11.

Easy Tier
In this chapter we describe the function provided by the EasyTier disk performance optimization feature of the SAN Volume Controller. We also explain how to activate the EasyTier process for both evaluation purposes and for automatic extent migration.


11.1 Overview of Easy Tier


Determining the amount of I/O activity occurring on an SVC extent, and when to move the extent to an appropriate storage performance tier, is usually too complex a task to manage manually. Easy Tier is a performance optimization function that overcomes this issue by automatically migrating extents belonging to a volume between MDisk storage tiers.

Easy Tier monitors the I/O activity and latency of the extents on all volumes with the Easy Tier function turned on in a multitier storage pool over a 24-hour period. It then creates an extent migration plan based on this activity and dynamically moves high activity, or hot, extents to a higher disk tier within the storage pool. It also moves extents whose activity has dropped off, or cooled, from the high-tier MDisks back to a lower-tier MDisk. Because this migration works at the extent level, it is often referred to as sub-LUN migration.

The Easy Tier function can be turned on or off at the storage pool level and at the volume level. To experience the potential benefits of using Easy Tier in your environment before actually installing expensive solid-state disks (SSDs), you can turn on the Easy Tier function for a single tier storage pool. Next, also turn on the Easy Tier function for the volumes within that pool. This starts monitoring activity on the volume extents in the pool. Easy Tier creates a migration report every 24 hours on the number of extents that would be moved if the pool were a multitier storage pool. So even though Easy Tier extent migration is not possible within a single tier pool, the Easy Tier statistical measurement function is available.

Note: Image mode and sequential volumes are not candidates for Easy Tier automatic data placement.

11.2 Easy Tier concepts


This section explains the concepts underpinning Easy Tier functionality.

11.2.1 SSD arrays and MDisks


The SSD drives are treated no differently by the SVC than HDDs with respect to RAID arrays or MDisks. The individual SSD drives in the storage managed by the SVC are combined into an array, usually in RAID10 or RAID5 format. It is unlikely that RAID6 SSD arrays will be used due to the double parity overhead, with two SSD logical drives used for parity only. A LUN is created on the array, which is then presented to the SVC as a normal managed disk (MDisk). As is the case for HDDs, the SSD RAID array format helps protect against individual SSD failures. Depending on your requirements, additional high availability protection, above the RAID level, can be achieved by using volume mirroring. In the example disk tier pool shown in Figure 11-2 on page 282, you can see the SSD MDisks presented from the SSD disk arrays.


11.2.2 Disk tiers


It is likely that the MDisks (LUNs) presented to the SVC cluster have different performance attributes because of the type of disk or RAID array on which they reside. The MDisks can be on 15K RPM Fibre Channel or SAS disks, Nearline SAS or SATA disks, or even solid-state drives (SSDs). Thus, a storage tier attribute is assigned to each MDisk; the default is generic_hdd. With SVC 6.1, a new disk tier attribute, generic_ssd, is available for SSDs. Note that the SVC does not automatically detect SSD MDisks. Instead, all external MDisks are initially put into the generic_hdd tier by default. The administrator must then manually change the tier of SSD MDisks to generic_ssd by using the CLI or GUI.

11.2.3 Single tier storage pools


Figure 11-1 shows a scenario in which a single storage pool is populated with MDisks presented by an external storage controller. In this solution, the striped or mirrored volume can be measured by Easy Tier, but no action is taken to optimize performance.

Figure 11-1 Single tier storage pool with striped volume

MDisks that are used in a single tier storage pool should have the same hardware characteristics, for example, the same RAID type, RAID array size, disk type, and disk revolutions per minute (RPMs) and controller performance characteristics.

11.2.4 Multiple tier storage pools


A multiple tiered storage pool will have a mix of MDisks with more than one type of disk tier attribute, for example, a storage pool containing a mix of generic_hdd and generic_ssd MDisks.


Figure 11-2 shows a scenario in which a storage pool is populated with two different MDisk types: one belonging to an SSD array, and one belonging to an HDD array. Although this example shows RAID5 arrays, other RAID types can be used.

Figure 11-2 Multitier storage pool with striped volume

Adding SSD to the pool means additional space is also now available for new volumes, or volume expansion.

11.2.5 Easy Tier process


The Easy Tier function has four main processes:

1. I/O Monitoring
This process operates continuously and monitors volumes for host I/O activity. It collects performance statistics for each extent and derives averages for a rolling 24-hour period of I/O activity. Easy Tier makes allowances for large block I/Os and thus only considers I/Os of up to 64 KB as migration candidates. This is an efficient process and adds negligible processing overhead to the SVC nodes.

2. Data Placement Advisor
The Data Placement Advisor uses workload statistics to make a cost benefit decision as to which extents are to be candidates for migration to a higher performance (SSD) tier. This process also identifies extents that need to be migrated back to a lower (HDD) tier.


3. Data Migration Planner
Using the extents previously identified, the Data Migration Planner builds the extent migration plan for the storage pool.

4. Data Migrator
The Data Migrator performs the actual movement, or migration, of the volume's extents up to, or down from, the high disk tier. The extent migration rate is capped at a maximum of 30 MBps, which equates to around 2.5 TB of data migrated between disk tiers per day.

When relocating volume extents, Easy Tier performs these actions:
- It attempts to migrate the most active volume extents up to SSD first.
- To ensure that a free extent is available, a less frequently accessed extent might first need to be migrated back to HDD.
- A previous migration plan and any queued extents that are not yet relocated are abandoned.
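The cycle described above can be sketched in a few lines of Python. This is a conceptual illustration only, not the SVC implementation: the data structures, the heat scoring, and the function names are our own assumptions. Only the 64 KB candidate filter and the 30 MBps migration cap come from the text.

```python
MAX_CANDIDATE_IO = 64 * 1024        # only I/Os up to 64 KB count toward heat
MIGRATION_CAP_MBPS = 30             # extent migration rate cap

def record_io(heat, extent_id, io_bytes):
    """I/O Monitoring: accumulate per-extent activity, ignoring large I/Os."""
    if io_bytes <= MAX_CANDIDATE_IO:
        heat[extent_id] = heat.get(extent_id, 0) + 1

def plan_migrations(heat, ssd_free_extents, hot_threshold):
    """Data Placement Advisor + Data Migration Planner: pick the hottest
    extents, up to the number of free SSD extents, as the migration plan."""
    hot = [e for e, score in heat.items() if score >= hot_threshold]
    hot.sort(key=lambda e: heat[e], reverse=True)   # most active first
    return hot[:ssd_free_extents]

def daily_migration_capacity_tb(cap_mbps=MIGRATION_CAP_MBPS):
    """Data Migrator: the daily cap implied by a 30 MBps migration rate."""
    return cap_mbps * 86_400 / 1_000_000            # MB/s * s/day -> TB

heat = {}
for extent, size in [("e1", 8_192), ("e2", 8_192), ("e2", 8_192),
                     ("e3", 1_048_576)]:            # the 1 MiB I/O is ignored
    record_io(heat, extent, size)
print(plan_migrations(heat, ssd_free_extents=1, hot_threshold=1))  # ['e2']
print(round(daily_migration_capacity_tb(), 2))                     # 2.59
```

The last function also shows where the "around 2.5 TB a day" figure comes from: 30 MBps sustained for 86,400 seconds.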

11.2.6 Easy Tier operating modes


There are three main operating modes for Easy Tier:
- Off mode
- Evaluation or measurement only mode
- Automatic Data Placement or extent migration mode

Easy Tier - Off mode


With Easy Tier turned off, there are no statistics recorded and no extent migration.

Evaluation or measurement only mode


Easy Tier evaluation or measurement only mode collects usage statistics for each extent in a single tier storage pool where the Easy Tier value is set to on for both the volume and the pool. This is typically done for a single tier pool containing only HDDs, so that the benefits of adding SSDs to the pool can be evaluated prior to any major hardware acquisition. A statistics summary file named dpa_heat.nodeid.yymmdd.hhmmss.data is created in the /dumps directory of the SVC nodes. This file can be offloaded from the SVC nodes with pscp -load or by using the GUI, as shown in 11.4.1, Measuring by using the Storage Advisor Tool on page 286. A web browser is used to view the report created by the tool.

Auto Data Placement or extent migration mode


In Automatic Data Placement or extent migration operating mode, the storage pool parameter -easytier must be set to on or auto, and the volumes in the pool must have -easytier on. The storage pool must also contain MDisks with different disk tiers; thus, it is a multitier storage pool. Dynamic data movement is transparent to the host server and application users of the data, other than providing improved performance. Extents are automatically migrated according to the rules in 11.3.2, Implementation rules on page 284. The statistics summary file is also created in this mode. This file can be offloaded for input to the advisor tool. The tool produces a report on the extents moved to SSD and a prediction of the performance improvement that could be gained if more SSD arrays were available.


11.2.7 Easy Tier activation


To activate Easy Tier, set the Easy Tier value on the pool and volumes as shown in Table 11-1. The defaults favor Easy Tier: if you create a new storage pool, the -easytier value is auto; if you create a new volume, the value is on.
Table 11-1 Easy Tier parameter settings

Examples of the use of these parameters are shown in 11.5, Using Easy Tier with the SVC CLI on page 287 and 11.6, Using Easy Tier with the SVC GUI on page 293.

11.3 Easy Tier implementation considerations


In this section we describe considerations to keep in mind before implementing Easy Tier.

11.3.1 Prerequisites
There is no Easy Tier license required for the SVC; the function comes as part of the V6.1 code. For Easy Tier to migrate extents, you need disk storage with different tiers available, for example, a mix of SSD and HDD.

11.3.2 Implementation rules


Keep the following implementation and operation rules in mind when you use the IBM System Storage Easy Tier function on the SAN Volume Controller:
- Easy Tier automatic data placement is not supported on image mode or sequential volumes. I/O monitoring for such volumes is supported, but you cannot migrate extents on such volumes unless you convert image or sequential volume copies to striped volumes.

- Automatic data placement and extent I/O activity monitors are supported on each copy of a mirrored volume. Easy Tier works with each copy independently of the other copy.

Note: Volume mirroring can have different workload characteristics on each copy of the data, because reads are normally directed to the primary copy and writes occur to both copies. Thus, the number of extents that Easy Tier migrates to the SSD tier will probably differ for each copy.

- If possible, the SAN Volume Controller creates new volumes or volume expansions by using extents from MDisks in the HDD tier, but it uses extents from MDisks in the SSD tier if necessary.
- When a volume is migrated out of a storage pool that is managed with Easy Tier, Easy Tier automatic data placement mode is no longer active on that volume. Automatic data placement is also turned off while a volume is being migrated, even if it is between pools that both have Easy Tier automatic data placement enabled. Automatic data placement for the volume is re-enabled when the migration is complete.

11.3.3 Limitations
Limitations exist when using IBM System Storage Easy Tier on the SAN Volume Controller:

- Limitations when removing an MDisk by using the -force parameter
When an MDisk is deleted from a storage pool with the -force parameter, extents in use are migrated to MDisks in the same tier as the MDisk being removed, if possible. If insufficient extents exist in that tier, extents from the other tier are used.

- Limitations when migrating extents
When Easy Tier automatic data placement is enabled for a volume, the svctask migrateexts command-line interface (CLI) command cannot be used on that volume.

- Limitations when migrating a volume to another storage pool
When the SAN Volume Controller migrates a volume to a new storage pool, Easy Tier automatic data placement between the two tiers is temporarily suspended. After the volume is migrated to its new storage pool, Easy Tier automatic data placement between the generic SSD tier and the generic HDD tier resumes for the moved volume, if appropriate. When the SAN Volume Controller migrates a volume from one storage pool to another, it attempts to migrate each extent to an extent in the new storage pool from the same tier as the original extent. In several cases, such as a target tier being unavailable, the other tier is used. For example, the generic SSD tier might be unavailable in the new storage pool.

- Limitations when migrating a volume to image mode
Easy Tier automatic data placement does not support image mode. When a volume with Easy Tier automatic data placement mode active is migrated to image mode, Easy Tier automatic data placement mode is no longer active on that volume. Image mode and sequential volumes cannot be candidates for automatic data placement. Easy Tier does support evaluation mode for image mode volumes.

Best practices
Always set the Storage Pool -easytier value to on rather than to the default value auto. This makes it easier to turn on evaluation mode for existing single tier pools, and

no further changes will be needed when you move to multitier pools. See Easy Tier activation on page 284 for more information about the mix of pool and volume settings.
Using Easy Tier can make it more appropriate to use smaller storage pool extent sizes.

11.4 Measuring and activating Easy Tier


In the following sections we describe how to measure using Easy Tier and how to activate it.

11.4.1 Measuring by using the Storage Advisor Tool


The IBM Storage Advisor Tool is a command-line tool that runs on Windows systems. It takes as input the dpa_heat files created on the SVC nodes and produces a set of Hypertext Markup Language (HTML) files containing activity reports. For more information, visit the following website:
http://www-01.ibm.com/support/docview.wss?uid=ssg1S4000935
Contact your IBM Representative or IBM Business Partner for further details about the Storage Advisor Tool.

Offloading statistics
To extract the summary performance data, use one of these methods.

Using the command-line interface (CLI)


Find the most recent dpa_heat.node_name.date.time.data file in the cluster by entering the following CLI command:
svcinfo lsdumps node_id | node_name
where node_id | node_name is the node ID or name for which to list the available dpa_heat data files. Next, perform the normal PSCP download process:
pscp -unsafe -load saved_putty_configuration admin@cluster_ip_address:/dumps/dpa_heat.node_name.date.time.data your_local_directory
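When several dpa_heat files are present, the newest one can be identified from the yymmdd.hhmmss stamp embedded in the file name. The following Python sketch picks the latest file from a list of names such as lsdumps might return; the sample node names and the helper function are our own, not part of any IBM tool.

```python
import re

# dpa_heat.node_name.yymmdd.hhmmss.data, per the naming convention above
DPA_HEAT = re.compile(
    r"^dpa_heat\.(?P<node>.+)\.(?P<date>\d{6})\.(?P<time>\d{6})\.data$")

def newest_dpa_heat(filenames):
    """Return the dpa_heat file with the latest date/time stamp, or None."""
    best, best_key = None, None
    for name in filenames:
        m = DPA_HEAT.match(name)
        if not m:
            continue                       # skip unrelated dump files
        key = (m.group("date"), m.group("time"))
        if best_key is None or key > best_key:
            best, best_key = name, key
    return best

files = [
    "dpa_heat.node1.110214.090000.data",
    "dpa_heat.node1.110215.233000.data",
    "svc.config.backup.xml",               # other dump files are ignored
]
print(newest_dpa_heat(files))   # dpa_heat.node1.110215.233000.data
```

Because the stamps are zero-padded, a simple string comparison of the (date, time) pair is enough to order the files chronologically.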

Using the GUI


If you prefer using the GUI, then navigate to the Troubleshooting Support page, as shown in Figure 11-3.

Figure 11-3 dpa_heat File Download


Running the tool


You run the tool from a command line or terminal session by specifying up to two input dpa_heat file names and directory paths, for example:
C:\Program Files\IBM\STAT>STAT dpa_heat.nodenumber.yymmdd.hhmmss.data
A file called index.html is then created in the STAT base directory. When opened with your browser, it displays a summary page as shown in Figure 11-4.

Figure 11-4 Example of STAT Summary

The distribution of hot data and cold data for each volume is shown in the volume heat distribution report. The report displays the portion of the capacity of each volume on SSD (red), and HDD (blue), as shown in Figure 11-5.

Figure 11-5 STAT Volume Heatmap Distribution sample

11.5 Using Easy Tier with the SVC CLI


This section describes the basic steps for activating Easy Tier by using the SVC command-line interface (CLI). Our example is based on the storage pool configurations shown in Figure 11-1 on page 281 and Figure 11-2 on page 282. Our environment is an SVC cluster with the following resources available:
- 1 x I/O group with two 2145-CF8 nodes
- 8 x external 73 GB SSD drives (4 x SSD per RAID5 array)
- 1 x external storage subsystem with HDDs


Deleted lines: Many lines that are not related to Easy Tier have been deleted from the command output in the examples shown in the following sections, to enable you to focus on the Easy Tier-related information only.

11.5.1 Initial cluster status


Example 11-1 displays the SVC cluster characteristics prior to adding multitier storage (SSD with HDD) and commencing the Easy Tier process. The example shows the two different tiers available in our SVC cluster, generic_ssd and generic_hdd. At this time, no disk is allocated to the generic_ssd tier, and it therefore shows 0.00 MB capacity.
Example 11-1 SVC cluster IBM_2145:ITSO-CLS5:admin>svcinfo lscluster id name location partnership bandwidth id_alias 0000020060800004 ITSO-CLS5 local 0000020060800004 IBM_2145:ITSO-CLS5:admin>svcinfo lscluster 0000020060800004 id 0000020060800004 name ITSO-CLS5 . tier generic_ssd tier_capacity 0.00MB tier_free_capacity 0.00MB tier generic_hdd tier_capacity 18.85TB tier_free_capacity 18.43TB

11.5.2 Turning on Easy Tier evaluation mode


Figure 11-1 on page 281 shows an existing single tier storage pool. To turn on Easy Tier evaluation mode, we need to set -easytier on for both the storage pool and the volumes in the pool. Refer to Table 11-1 on page 284 to check the required mix of parameters needed to set the volume Easy Tier status to measured. As shown in Example 11-2, we turn Easy Tier on for both the pool and volume so that the extent workload measurement is enabled. We first check and then change the pool. Then we repeat the steps for the volume.
Example 11-2 Turning on Easy Tier evaluation mode

IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp -filtervalue "name=Single*" id name status mdisk_count vdisk_count easy_tier easy_tier_status 27 Single_Tier_Storage_Pool online 3 1 off inactive IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp Single_Tier_Storage_Pool id 27 name Single_Tier_Storage_Pool status online mdisk_count 3 vdisk_count 1 . easy_tier off easy_tier_status inactive . tier generic_ssd 288

tier_mdisk_count 0 . tier generic_hdd tier_mdisk_count 3 tier_capacity 200.25GB IBM_2145:ITSO-CLS5:admin>svctask chmdiskgrp -easytier on Single_Tier_Storage_Pool IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp Single_Tier_Storage_Pool id 27 name Single_Tier_Storage_Pool status online mdisk_count 3 vdisk_count 1 . easy_tier on easy_tier_status active . tier generic_ssd tier_mdisk_count 0 . tier generic_hdd tier_mdisk_count 3 tier_capacity 200.25GB

------------ Now Repeat for the Volume ------------
IBM_2145:ITSO-CLS5:admin>svcinfo lsvdisk -filtervalue "mdisk_grp_name=Single*" id name status mdisk_grp_id mdisk_grp_name capacity type 27 ITSO_Volume_1 online 27 Single_Tier_Storage_Pool 10.00GB striped IBM_2145:ITSO-CLS5:admin>svcinfo lsvdisk ITSO_Volume_1 id 27 name ITSO_Volume_1 . easy_tier off easy_tier_status inactive . tier generic_ssd tier_capacity 0.00MB . tier generic_hdd tier_capacity 10.00GB

IBM_2145:ITSO-CLS5:admin>svctask chvdisk -easytier on ITSO_Volume_1 IBM_2145:ITSO-CLS5:admin>svcinfo lsvdisk ITSO_Volume_1 id 27 name ITSO_Volume_1 . easy_tier on easy_tier_status measured . tier generic_ssd tier_capacity 0.00MB


. tier generic_hdd tier_capacity 10.00GB

11.5.3 Creating a multitier storage pool


With the SSD drive candidates placed into an array, we now need a pool into which the two tiers of disk storage will be placed. If you already have an HDD single tier pool, a traditional pre-SVC V6.1 pool, then all you need to know is the existing MDisk group ID or name. In this example, we have a storage pool, Multi_Tier_Storage_Pool, into which we want to place our SSD arrays. After creating the SSD arrays, which appear as MDisks, they are placed into the storage pool as shown in Example 11-3. Note that the storage pool easy_tier value is set to auto, because auto is the default value assigned when you create a new storage pool. Also note that the default tier value of the SSD MDisks is set to generic_hdd, not generic_ssd.
Example 11-3 Multitier pool creation IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp -filtervalue "name=Multi*" id name status mdisk_count vdisk_count capacity easy_tier easy_tier_status 28 Multi_Tier_Storage_Pool online 3 1 200.25GB auto inactive IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp Multi_Tier_Storage_Pool id 28 name Multi_Tier_Storage_Pool status online mdisk_count 3 vdisk_count 1 . easy_tier auto easy_tier_status inactive . tier generic_ssd tier_mdisk_count 0 . tier generic_hdd tier_mdisk_count 3

IBM_2145:ITSO-CLS5:admin>svcinfo lsmdisk mdisk_id mdisk_name status mdisk_grp_name capacity raid_level tier 299 SSD_Array_RAID5_1 online Multi_Tier_Storage_Pool 203.6GB raid5 generic_hdd 300 SSD_Array_RAID5_2 online Multi_Tier_Storage_Pool 203.6GB raid5 generic_hdd IBM_2145:ITSO-CLS5:admin>svcinfo lsmdisk SSD_Array_RAID5_2 mdisk_id 300 mdisk_name SSD_Array_RAID5_2 status online mdisk_grp_id 28 mdisk_grp_name Multi_Tier_Storage_Pool capacity 203.6GB

. raid_level raid5 tier generic_hdd


IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp -filtervalue "name=Multi*" id name mdisk_count vdisk_count capacity easy_tier easy_tier_status 28 Multi_Tier_Storage_Pool 5 1 606.00GB auto inactive IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp Multi_Tier_Storage_Pool id 28 name Multi_Tier_Storage_Pool status online mdisk_count 5 vdisk_count 1 . easy_tier auto easy_tier_status inactive . tier generic_ssd tier_mdisk_count 0 . tier generic_hdd tier_mdisk_count 5

11.5.4 Setting the disk tier


As shown in Example 11-3 on page 290, MDisks that are detected have a default disk tier of generic_hdd. Easy Tier is also still inactive for the storage pool, because we do not yet have a true multitier pool. To activate the pool, we have to change the SSD MDisks to their correct generic_ssd tier. Example 11-4 shows how to modify the SSD disk tier.
Example 11-4 Changing an SSD disk tier to generic_ssd

IBM_2145:ITSO-CLS5:admin>svcinfo lsmdisk SSD_Array_RAID5_1 id 299 name SSD_Array_RAID5_1 status online . tier generic_hdd IBM_2145:ITSO-CLS5:admin>svctask chmdisk -tier generic_ssd SSD_Array_RAID5_1 IBM_2145:ITSO-CLS5:admin>svctask chmdisk -tier generic_ssd SSD_Array_RAID5_2

IBM_2145:ITSO-CLS5:admin>svcinfo lsmdisk SSD_Array_RAID5_1 id 299 name SSD_Array_RAID5_1 status online . tier generic_ssd IBM_2145:ITSO-CLS5:admin>svcinfo lsmdiskgrp Multi_Tier_Storage_Pool id 28 name Multi_Tier_Storage_Pool status online mdisk_count 5 vdisk_count 1 . easy_tier auto

easy_tier_status active . tier generic_ssd tier_mdisk_count 2 tier_capacity 407.00GB . tier generic_hdd tier_mdisk_count 3

11.5.5 Checking a volume's Easy Tier mode


To check the Easy Tier operating mode on a volume, we display its properties by using the lsvdisk command. An automatic data placement mode volume has its pool value set to ON or AUTO and its volume value set to ON; the CLI volume easy_tier_status is displayed as active, as shown in Example 11-5. An evaluation mode volume has both the pool and volume values set to ON; however, the CLI volume easy_tier_status is shown as measured, as seen in Example 11-2 on page 288.
Example 11-5 Checking a volume's easy_tier_status

IBM_2145:ITSO-CLS5:admin>svcinfo lsvdisk ITSO_Volume_10 id 28 name ITSO_Volume_10 mdisk_grp_name Multi_Tier_Storage_Pool capacity 10.00GB type striped . easy_tier on easy_tier_status active . tier generic_ssd tier_capacity 0.00MB tier generic_hdd tier_capacity 10.00GB

The volume in the example is measured by Easy Tier, and hot extents are migrated from the hdd tier MDisks to the ssd tier MDisks. Also note that the generic_hdd tier still holds the entire capacity of the volume, because the generic_ssd capacity value is 0.00 MB. The allocated capacity on the generic_hdd tier gradually changes as Easy Tier optimizes performance by moving extents into the generic_ssd tier.
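The status values described in this section can be summarized as a small lookup. This Python sketch encodes only the combinations stated in this chapter (pool on or auto with the volume on, single tier versus multitier); it is our own summary for illustration, not SVC code, and any combination the chapter does not describe simply reports inactive here.

```python
def easy_tier_status(pool_setting, volume_setting, multitier):
    """Derive the volume easy_tier_status as described in this chapter.

    pool_setting: "on", "auto", or "off" (storage pool -easytier value)
    volume_setting: "on" or "off" (volume -easytier value)
    multitier: True when the pool contains more than one disk tier
    """
    if volume_setting != "on":
        return "inactive"
    if multitier and pool_setting in ("on", "auto"):
        return "active"      # automatic data placement mode
    if not multitier and pool_setting == "on":
        return "measured"    # evaluation mode: statistics only
    return "inactive"

# Single tier pool with pool and volume set to on: evaluation mode.
print(easy_tier_status("on", "on", multitier=False))   # measured
# Multitier pool with the default pool value auto: data placement.
print(easy_tier_status("auto", "on", multitier=True))  # active
```

This also reflects what Example 11-3 showed: a pool with -easytier auto stays inactive until the SSD MDisks are moved to the generic_ssd tier, which makes the pool truly multitier.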

11.5.6 Final cluster status


Example 11-6 shows the SVC cluster characteristics after adding multitiered storage (SSD with HDD).
Example 11-6 SVC Multi-Tier cluster

IBM_2145:ITSO-CLS5:admin>svcinfo lscluster ITSO-CLS5 id 000002006A800002 name ITSO-CLS5 .



tier generic_ssd tier_capacity 407.00GB tier_free_capacity 100.00GB tier generic_hdd tier_capacity 18.85TB tier_free_capacity 10.40TB

As you can now see, we have two different tiers available in our SVC cluster: generic_ssd and generic_hdd. Extents are now in use on both the generic_ssd tier and the generic_hdd tier; see the tier_free_capacity values. However, this command does not tell us whether the SSD storage is being used by the Easy Tier process. To determine whether Easy Tier is actively measuring or migrating extents within the cluster, view the volume status as shown previously in Example 11-5 on page 292.

11.6 Using Easy Tier with the SVC GUI


This section describes the basic steps to activate Easy Tier by using the web interface or GUI. Our example is based on the storage pool configurations shown in Figure 11-1 on page 281 and Figure 11-2 on page 282. Our environment is an SVC cluster with the following resources available:
- 1 x I/O group with two 2145-CF8 nodes
- 8 x external 73 GB SSD drives (4 x SSD per RAID5 array)
- 1 x external storage subsystem with HDDs

11.6.1 Setting the disk tier on MDisks


When displaying the storage pool you can see that Easy Tier is inactive, even though there are SSD MDisks in the pool as shown in Figure 11-6.

Figure 11-6 GUI select MDisk to change tier

This is because, by default, all MDisks are initially discovered as Hard Disk Drives (HDDs); see the MDisk properties panel in Figure 11-7 on page 294.


Figure 11-7 MDisk default tier is Hard Disk Drive

Therefore, for Easy Tier to take effect, you need to change the disk tier. Right-click the selected MDisk and choose Select Tier, as shown in Figure 11-8.

Figure 11-8 Select the Tier

Now set the MDisk Tier to Solid-State Drive, as shown in Figure 11-9 on page 295.


Figure 11-9 GUI Setting Solid-State Drive tier

The MDisk now has the correct tier, so the properties value is correct for a multitier pool, as shown in Figure 11-10.

Figure 11-10 Show MDisk details Tier and RAID level

11.6.2 Checking Easy Tier status


Now that the SSDs are known to the pool as Solid-State Drives, the Easy Tier function becomes active, as shown in Figure 11-11 on page 296. After the pool has an active Easy Tier status, the automatic data relocation process begins for the volumes in the pool, because the default Easy Tier setting for volumes is ON.


Figure 11-11 Storage Pool with Easy Tier active

11.7 Solid State Drives


12

Chapter 12.

Applications
In this chapter, we provide information about laying out storage for the best performance for general applications, IBM AIX Virtual I/O (VIO) servers, and IBM DB2 databases specifically. While most of the specific information is directed to hosts running the IBM AIX operating system, the information is also relevant to other host types.


12.1 Application workloads


In general, there are two types of data workload (data processing):
- Transaction-based
- Throughput-based

These workloads are different by nature and must be planned for in quite different ways. Knowing and understanding how your host servers and applications handle their workload is an important part of being successful with your storage configuration efforts and the resulting performance.

A workload that is characterized by a high number of transactions per second and a high number of I/Os per second (IOPS) is called a transaction-based workload. A workload that is characterized by a large amount of data transferred, normally with large I/O sizes, is called a throughput-based workload. These two workload types are conflicting in nature and consequently require different configuration settings across all components comprising the storage infrastructure. Generally, I/O (and therefore application) performance is best when the I/O activity is evenly spread across the entire I/O subsystem. But first, let us describe each type of workload in greater detail and explain what you can expect to encounter in each case.
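The contrast between the two workload types is easiest to see with some simple arithmetic: throughput is just the I/O rate multiplied by the transfer size. The figures below are invented examples for illustration, not measurements from any particular subsystem.

```python
def throughput_mbps(iops, io_size_kb):
    """Approximate throughput: I/Os per second times I/O size in KB,
    converted to MBps."""
    return iops * io_size_kb / 1024

# A transaction-based workload: many small random I/Os, modest throughput.
print(throughput_mbps(10_000, 8))    # 10,000 IOPS at 8 KB is about 78 MBps
# A throughput-based workload: few large sequential I/Os, high throughput.
print(throughput_mbps(400, 1024))    # 400 IOPS at 1 MB is 400 MBps
```

The same subsystem can therefore look "fast" by one metric and "slow" by the other, which is why the two workload types must be planned for separately.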

12.1.1 Transaction-based workloads


High performance transaction-based environments cannot be created with a low-cost model of a storage server. Transaction process rates are heavily dependent on the number of back-end physical drives that are available for the storage subsystem controllers to use for parallel processing of host I/Os, so deciding how many physical drives you need is a frequent design question. Transaction intense applications also generally use a small, random data block pattern to transfer data. With this type of data pattern, having more back-end drives enables more host I/Os to be processed simultaneously, because read cache is far less effective than write cache, and the misses need to be retrieved from the physical disks.

In many cases, slow transaction performance problems can be traced directly to hot files that cause a bottleneck on a critical component (such as a single physical disk). This situation can occur even when the overall storage subsystem sees a fairly light workload. When bottlenecks occur, they can be extremely difficult and frustrating to resolve. Because workload content can continually change throughout the course of the day, these bottlenecks can be mysterious in nature and appear, disappear, or move over time from one location to another.

12.1.2 Throughput-based workloads


Throughput-based workloads are seen with applications or processes that need to transfer massive amounts of data and generally use large sequential blocks to reduce disk latency.


Generally, fewer physical drives are needed to reach adequate I/O performance than with transaction-based workloads. For instance, 20 - 28 physical drives are normally enough to reach maximum I/O throughput rates with the IBM System Storage DS4000 series of storage subsystems. In a throughput-based environment, read operations make use of the storage subsystem cache to stage greater chunks of data at a time to improve the overall performance. Throughput rates are heavily dependent on the storage subsystem's internal bandwidth. Newer storage subsystems with broader bandwidths are able to reach higher throughput rates.

12.1.3 Storage subsystem considerations


It is of great importance that the selected storage subsystem model is able to support the required I/O workload. Besides availability concerns, adequate performance must be ensured to meet the requirements of the applications, which includes evaluating the disk drive modules (DDMs) used and whether the internal architecture of the storage subsystem is sufficient.

With today's mechanically based DDMs, it is important that the DDM characteristics match the needs. In general, a high rotation speed of the DDM platters is needed for transaction-based throughput, where the DDM head continuously moves across the platters to read and write random I/Os. For throughput-based workloads, a lower rotation speed might be sufficient because of the sequential I/O nature. As for the subsystem architecture, newer generations of storage subsystems have larger internal caches, higher bandwidth buses, and more powerful storage controllers.

12.1.4 Host considerations


When discussing performance, we need to consider far more than just the performance of the I/O workload itself. Many settings within the host frequently affect the overall performance of the system and its applications. All areas must be checked to ensure that we are not focusing on a symptom rather than the cause. However, in this book, we focus on the I/O subsystem part of the performance puzzle, so we discuss the items that affect its operation.

Several of the settings and parameters that we discussed in Chapter 8, Hosts on page 191 must match for both the host operating system (OS) and the host bus adapters (HBAs) being used. Many operating systems have built-in definitions that can be changed to set the HBAs to the new values.

12.2 Application considerations


When gathering data for planning from the application side, it is important to first consider the workload type of the application. If multiple applications or workload types share the system, you need to know the workload type of each application and, if the applications are mixed (transaction-based and throughput-based), which workload is the most critical.

Many environments have a mix of transaction-based and throughput-based workloads, and generally, the transaction performance is considered the most critical. However, in some environments, for example, a Tivoli Storage Manager backup environment, the streaming, high-throughput workload of the backup itself is the critical part of the operation. The backup database, although a transaction-centered workload, is the less critical workload.

Chapter 12. Applications

12.2.1 Transaction environments


Applications that use high transaction workloads are known as Online Transaction Processing (OLTP) systems. Examples of these systems are database servers and mail servers. If you have a database, you tune the server-type parameters, as well as the database's logical drives, to meet the needs of the database application. If the host server has a secondary role of performing nightly backups for the business, you need another set of logical drives, which are tuned for high throughput, for the best backup performance you can get within the limitations of the mixed storage subsystem's parameters.

So, what are the traits of a transaction-based application? As mentioned earlier, you can expect to see a high number of transactions and a fairly small I/O size. Different databases use different I/O sizes for their logs, and these sizes vary from vendor to vendor. In all cases, the logs carry a generally write-oriented workload. For table spaces, most databases use between a 4 KB and a 16 KB I/O size. In certain applications, larger chunks (for example, 64 KB) are moved to the host application cache memory for processing. Understanding how your application handles its I/O is critical to laying out the data properly on the storage server.

In many cases, the table space is a large file made up of small blocks of data records. The records are normally accessed using small I/Os of a random nature, which can result in about a 50% cache miss ratio. For this reason, and to not waste space with unused data, plan for the SAN Volume Controller (SVC) to read and write data into cache in small chunks (use striped volumes with smaller extent sizes).

Another point to consider is whether the typical I/O is a read or a write. In most OLTP environments, there is generally a mix of about 70% reads and 30% writes.
However, the transaction logs of a database application have a much higher write ratio and, therefore, perform better in a different storage pool. Place the logs on a separate virtual disk (volume) that, for best performance, is located in a storage pool that is defined to better support the heavy write workload. Mail servers also frequently have a higher write ratio than read ratio.

Best practice: Database table spaces, journals, and logs must never be collocated on the same MDisk or storage pool, in order to avoid placing them on the same back-end storage logical unit number (LUN) or Redundant Array of Independent Disks (RAID) array.
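The 70/30 read/write mix and the roughly 50% read-cache miss ratio discussed above translate into an average host response time as follows. The per-operation latencies are illustrative assumptions:

```python
def avg_response_ms(read_pct, read_miss, t_cache_ms, t_disk_ms, t_write_ms):
    """Weighted average host response time for an OLTP mix.

    Reads hit cache with probability (1 - read_miss); writes are
    assumed to complete in (mirrored) write cache."""
    t_read = (1 - read_miss) * t_cache_ms + read_miss * t_disk_ms
    return read_pct * t_read + (1 - read_pct) * t_write_ms

# 70% reads, 50% read-cache miss ratio, assumed latencies of
# 0.5 ms (cache hit), 8 ms (disk read), 1 ms (cached write):
print(round(avg_response_ms(0.70, 0.50, 0.5, 8.0, 1.0), 3))  # → 3.275
```

The weighting shows why the read-miss path dominates OLTP response time, and why spreading the table space over more back-end drives (which lowers the disk read latency under load) helps far more than tuning the cached-write path.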

12.2.2 Throughput environments


With throughput workloads, you have fewer transactions but much larger I/Os. I/O sizes of 128 KB or greater are normal, and these I/Os are generally sequential in nature. Applications that typify this type of workload are imaging, video servers, seismic processing, high-performance computing (HPC), and backup servers.

With large I/Os, it is better to use large cache blocks so that larger chunks can be written into cache with each operation. Generally, you want the sequential I/Os to take as few back-end I/Os as possible and to get maximum throughput from them. So, carefully decide how the logical drive will be defined and how the volumes are dispersed on the back-end storage MDisks.

Many environments have a mix of transaction-oriented and throughput-oriented workloads. Unless you have measured your workloads, assume that the host workload is mixed, and use SVC striped volumes over several MDisks in a storage pool in order to have the best performance and eliminate trouble spots or hot spots.

12.3 Data layout overview


In this section, we document data layout from an AIX point of view. Our objective is to help ensure that AIX and storage administrators, specifically those responsible for allocating storage, know enough to lay out the storage data, consider the virtualization layers, and avoid the performance problems and hot spots that come with poor data layout. The goal is to balance I/Os evenly across the physical disks in the back-end storage subsystems.

We specifically show you how to lay out storage for DB2 applications as a good example of how an application might balance its I/Os within the application. There are also various implications for the host data layout based on whether you utilize SVC image mode or SVC striped mode volumes.

12.3.1 Layers of volume abstraction


Back-end storage is laid out into RAID arrays by RAID type, the number of disks in the array, and the LUN allocation to the SVC or host. A RAID array is a certain number of disk drive modules (DDMs), usually containing from 2 to 32 disks and most often around 10 disks, in a RAID configuration (typically RAID 0, RAID 1, RAID 5, or RAID 10), although certain vendors call their entire disk subsystem an array.

Use of an SVC adds another layer of virtualization to understand, because there are volumes, which are LUNs served from the SVC to a host, and MDisks, which are LUNs served from back-end storage to the SVC. The SVC volumes are presented to the host as LUNs. These LUNs are then mapped as physical volumes on the host, which might build logical volumes out of the physical volumes. Figure 12-1 on page 302 shows the layers of storage virtualization.

Figure 12-1 Layers of storage virtualization

12.3.2 Storage administrator and AIX LVM administrator roles


Storage administrators control the configuration of the back-end storage subsystems and their RAID arrays (the RAID type and the number of disks in the array, although there are restrictions on the number of disks in the array and other restrictions, depending upon the disk subsystem). They normally also decide the layout of the back-end storage LUNs (MDisks), SVC storage pools, and SVC volumes, and which volumes are assigned to which hosts.

The AIX administrators control the AIX Logical Volume Manager (LVM) and decide in which volume group (VG) the SVC volumes (LUNs) are placed. They also create logical volumes (LVs) and file systems within the VGs. These administrators have no control over where multiple files or directories reside in an LV, unless there is only one file or directory in the LV. There is also an application administrator for applications, such as DB2, that balance their I/Os by striping directly across the LVs. Together, the storage administrator, the LVM administrator, and the application administrator control on which physical disks the LVs reside.

12.3.3 General data layout recommendations


Our primary recommendation for laying out data on SVC back-end storage for general applications is to use striped volumes across storage pools consisting of similar-type MDisks, with as few MDisks as possible per RAID array. This general-purpose rule is applicable to most SVC back-end storage configurations and removes a significant data layout burden from the storage administrators.

Consider where the failure boundaries are in the back-end storage and take them into account when locating application data. A failure boundary is defined as what will be affected if a RAID array (an SVC MDisk) is lost: all the volumes and servers striped on that MDisk are affected, together with all other volumes in that storage pool. Consider also that spreading the I/Os evenly across the back-end storage has both a performance benefit and a management benefit.

We recommend that an entire set of back-end storage is managed together with the failure boundary in mind. If a company has several lines of business (LOBs), it might decide to manage the storage along each LOB so that each LOB has a unique set of back-end storage. So, for each set of back-end storage (a group of storage pools or, perhaps better, just one storage pool), we create only striped volumes across all the back-end storage arrays. This approach is beneficial because the failure boundary is limited to a LOB, and performance and storage management are handled as a unit for each LOB independently.

What we do not recommend is to create striped volumes that are striped across different sets of back-end storage, because doing so makes the failure boundaries difficult to determine, unbalances the I/O, and might limit the performance of those striped volumes to the slowest back-end device.

For SVC configurations where SVC image mode volumes must be used, we recommend that the back-end storage configuration for the database consists of one LUN (and therefore one image mode volume) per array, or an equal number of LUNs per array, so that the database administrator (DBA) can guarantee that the I/O workload is distributed evenly across the underlying physical disks of the arrays. Refer to Figure 12-2 on page 304.
Use striped mode volumes for applications that do not already stripe their data across physical disks. Striped volumes are the all-purpose volumes for most applications; use them if you need to manage a diversity of growing applications and balance the I/O performance based on probability.

If you understand your application storage requirements, you might instead take an approach that explicitly balances the I/O rather than a probabilistic approach. However, explicitly balancing the I/O requires either testing or a good understanding of both the application and the storage mapping and striping to know which approach works better.

Examples of applications that stripe their data across the underlying disks are DB2, IBM GPFS, and Oracle ASM. These types of applications might require additional data layout considerations, as described in 12.3.5, LVM volume groups and logical volumes on page 305.

General data layout recommendation for AIX:


- Evenly balance I/Os across all physical disks (one method is by striping the volumes).
- To maximize sequential throughput, use a maximum range of physical disks (AIX command mklv -e x) for each LV.
- MDisk and volume sizes:
  - Create one MDisk per RAID array.
  - Create volumes based on the space needed, which overcomes disk subsystems that do not allow dynamic LUN detection. When you need more space on the server, dynamically extend the volume on the SVC, and then use the AIX command chvg -g to see the increased size on the host.
Figure 12-2 General data layout recommendations for AIX storage

SVC striped mode volumes


We recommend striped mode volumes for applications that do not already stripe their data across disks. Creating volumes that are striped across all RAID arrays in a storage pool ensures that the AIX LVM setup does not matter. This is an excellent approach for most general applications and eliminates data layout considerations for the physical disks. Use striped volumes with the following considerations:
- Use extent sizes of 64 MB to maximize sequential throughput when it is important. Refer to Table 12-1 for extent size compared to capacity.
- Use striped volumes when the number of volumes does not matter.
- Use striped volumes when the number of VGs does not affect performance.
- Use striped volumes when sequential I/O rates are greater than the sequential rate for a single RAID array on the back-end storage. Extremely high sequential I/O rates might require a different layout strategy.
- Use striped volumes when you prefer the use of extremely large LUNs on the host. Refer to Volume size on page 307 for details about how to utilize large volumes.
Table 12-1 Extent size compared to maximum storage capacity

Extent size    Maximum storage capacity of SVC cluster
16 MB          64 TB
32 MB          128 TB
64 MB          256 TB
128 MB         512 TB
256 MB         1 PB
512 MB         2 PB
1 GB           4 PB
2 GB           8 PB
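The capacities in Table 12-1 follow from the fixed maximum of 4,194,304 (2^22) extents that an SVC cluster can manage; each row is simply the extent size multiplied by that extent count. A quick check:

```python
MAX_EXTENTS = 4 * 1024 * 1024  # 2**22 extents per SVC cluster

def max_capacity_tb(extent_mb):
    """Maximum manageable capacity (in TB) for a given extent size."""
    return extent_mb * MAX_EXTENTS / (1024 * 1024)

for extent_mb in (16, 32, 64, 128, 256, 512, 1024, 2048):
    print(f"{extent_mb:>5} MB extent -> {max_capacity_tb(extent_mb):>5.0f} TB")
# Reproduces the table: 16 MB -> 64 TB up to 2 GB -> 8192 TB (8 PB)
```

This is why the 64 MB recommendation above is a compromise: it keeps sequential throughput high while still allowing a 256 TB cluster.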

12.3.4 Database strip size considerations (throughput workload)


It is also worthwhile thinking about the relative strip sizes (a strip is the amount of data written to one volume or container before moving to the next volume or container). Database strip sizes are typically small; let us assume 32 KB. The SVC strip size (called an extent) is user selectable in the range of 16 MB to 2 GB. The back-end RAID arrays have strip sizes in the neighborhood of 64 - 512 KB.

Then, there is the number of threads performing I/O (assume they are sequential, because if they are random, the layout does not matter). The number of sequential I/O threads is extremely important and often overlooked, but it is a key part of the design to get performance from applications that perform their own striping. Comparing striping schemes for a single sequential I/O thread might be appropriate for certain applications, such as backups, extract, transform, and load (ETL) applications, and several scientific and engineering applications, but typically, it is not appropriate for DB2 or Tivoli Storage Manager.

If we have one thread per volume or container performing sequential I/O, using SVC image mode volumes ensures that the I/O is done sequentially with full-stripe writes (assuming RAID 5). With SVC striped volumes, we might run into situations where two threads are doing I/O to the same back-end RAID array, or run into convoy effects that temporarily reduce performance (convoy effects result in longer periods of lower throughput).

Tivoli Storage Manager uses a similar scheme as DB2 to spread out its I/O, but it also depends on ensuring that the number of client backup sessions is equal to the number of Tivoli Storage Manager storage volumes or containers. Because it is difficult to control the number of client backup sessions, Tivoli Storage Manager performance can be improved by using LVM to spread out the I/Os (called PP striping). For this situation, a good approach is to use SVC striped volumes rather than SVC image mode volumes.
The perfect situation for Tivoli Storage Manager is n client backup sessions going to n containers (each container on a separate RAID array).

To summarize, if you are well aware of the application's I/O characteristics and the storage mapping (from the application all the way down to the physical disks), you might want to consider explicitly balancing the I/Os by using SVC image mode volumes to maximize the application's striping performance. Normally, using SVC striped volumes makes sense, balances the I/O well for most situations, and is significantly easier to manage.
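The full-stripe write behavior mentioned above depends on how the application strip size lines up with the back-end RAID 5 stripe. The 7+P array geometry and the 64 KB strip below are illustrative assumptions:

```python
def full_stripe_kb(data_disks, strip_kb):
    """Size of one RAID 5 full-stripe write: a write of exactly this
    size (and alignment) needs no read-modify-write of parity."""
    return data_disks * strip_kb

# An assumed 7+P RAID 5 array with a 64 KB strip:
stripe = full_stripe_kb(7, 64)
print(stripe)        # → 448

# A database issuing 32 KB strips must accumulate 14 consecutive
# strips before the controller can coalesce one full-stripe write:
print(stripe // 32)  # → 14
```

When the sequential stream is split across arrays by SVC striping or interleaved with a second thread, these runs of consecutive strips are broken up, which is one way the convoy effects described above reduce throughput.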

12.3.5 LVM volume groups and logical volumes


Without an SVC managing the back-end storage, the administrator must ensure that the host operating system aligns its device data partitions or slices with those of the logical drive. Misalignment can result in numerous boundary crossings that cause unnecessary multiple drive I/Os. Certain operating systems do this automatically, and you just need to know the alignment boundary that they use. Other operating systems, however, might require manual intervention to set their start point to a value that aligns them. With an SVC managing the storage for the host as striped volumes, aligning the partitions is easier, because the extents of the volume are spread across the MDisks in the storage pool. The storage administrator must ensure an adequate distribution.

Understanding how your host-based volume manager (if used) defines and makes use of the logical drives when they are presented is also an important part of the data layout. Volume managers are generally set up to place logical drives into usage groups. The volume manager then creates volumes by carving the logical drives into partitions (sometimes referred to as slices) and building a volume from them, by either striping or concatenating them to form the desired volume size.

How the partitions are selected for use and laid out can vary from system to system. In all cases, you need to ensure that spreading the partitions is done in a manner to achieve maximum I/Os available to the logical drives in the group. Generally, large volumes are built across a number of different logical drives to bring more resources to bear. You must be careful when selecting logical drives when you do this in order to not use logical drives that will compete for resources and degrade performance.

12.4 Database storage


In a world with networked and highly virtualized storage, database storage design can seem like a dauntingly complex task for a DBA or system architect to get right. Poor database storage design can have a significant negative impact on a database server. CPUs are so much faster than physical disks that it is not uncommon to find poorly performing database servers that are significantly I/O bound and underperforming by many times their potential.

The good news is that it is not necessary to get database storage design perfectly right. Understanding the innards of the storage stack and manually tuning the location of database tables and indexes on particular parts of different physical disks is neither generally achievable nor generally maintainable by the average DBA in today's virtualized storage world. Simplicity is the key to ensuring good database storage design. The basics involve ensuring an adequate number of physical disks to keep the system from becoming I/O bound.

More information, including easy-to-follow best practices, guidelines, and recommendations for a healthy database server, can be found in the Best Practices: Database Storage document at the following website:
http://www.ibm.com/developerworks/data/bestpractices/databasestorage/

12.5 Data layout with the AIX virtual I/O (VIO) server
The purpose of this section is to describe strategies to get the best I/O performance by evenly balancing I/Os across physical disks when using the VIO Server.

12.5.1 Overview
In setting up storage at a VIO server (VIOS), a broad range of possibilities exists for creating volumes and serving them up to VIO clients (VIOCs). The obvious consideration is to create sufficient storage for each VIOC. Less obvious, but equally important, is getting the best use of the storage: performance and availability are of paramount importance.

There are typically internal Small Computer System Interface (SCSI) disks (usually used for the VIOS operating system) and SAN disks. Availability for disks is usually handled by RAID on the SAN or by SCSI RAID adapters on the VIOS. We assume here that any internal SCSI disks are used for the VIOS operating system and possibly for the VIOCs' operating systems. Furthermore, we assume that the applications are configured so that only limited I/O occurs to the internal SCSI disks on the VIOS and to the VIOCs' rootvgs. If you expect your rootvg to have a significant IOPS rate, you can configure it in the same fashion as we recommend for other application VGs later.

VIOS restrictions
There are two types of volumes that you can create on a VIOS: physical volume (PV) VSCSI hdisks and logical volume (LV) VSCSI hdisks. PV VSCSI hdisks are entire LUNs from the VIOS point of view and are presented as whole volumes from the VIOC point of view; if you are concerned about failure of a VIOS and have configured redundant VIOSs for that reason, you must use PV VSCSI hdisks. An LV VSCSI hdisk cannot be served up from multiple VIOSs. LV VSCSI hdisks reside in LVM VGs on the VIOS; they cannot span PVs in that VG, nor can they be striped LVs.

VIOS queue depth


From a performance point of view, the queue_depth of VSCSI hdisks was originally limited to 3 at the VIOC, which limits the IOPS bandwidth to approximately 300 IOPS (assuming an average I/O service time of 10 ms). Thus, you need to configure a sufficient number of VSCSI hdisks to get the IOPS bandwidth needed. The queue depth limit changed to 256 in Version 1.3 of the VIOS (August 2006), although you still need to consider the IOPS bandwidth of the back-end disks. When possible, set the queue depth of each VIOC hdisk to match that of the VIOS hdisk to which it maps.
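The 300 IOPS figure follows directly from the queue depth and the service time: with at most queue_depth I/Os in flight and each taking service_time on average, throughput is bounded by queue_depth / service_time. A sketch:

```python
def max_iops(queue_depth, service_time_ms):
    """Upper bound on IOPS for one hdisk: at most queue_depth I/Os
    in flight, each taking service_time_ms on average."""
    return queue_depth * 1000.0 / service_time_ms

print(max_iops(3, 10))    # → 300.0   (pre-1.3 VSCSI queue_depth limit)
print(max_iops(256, 10))  # → 25600.0 (VIOS Version 1.3 and later)
```

In practice, the back-end disks become the limit long before 25,600 IOPS, which is why the back-end IOPS bandwidth still needs attention even with the larger queue depth.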

12.5.2 Data layout strategies


You can use the SVC or AIX LVM (with appropriate configuration of VSCSI disks at the VIOS) to balance the I/Os across the back-end physical disks. When using an SVC, balance the I/Os evenly across all arrays on the back-end storage subsystems as follows:
- Create just a few LUNs per array on the back-end disk in each storage pool (the normal practice is to have RAID arrays of the same type, the same or nearly the same size, and the same performance characteristics in a storage pool).
- Create striped volumes on the SVC that are striped across all back-end LUNs.
When you do this, the LVM setup does not matter, and you can use PV VSCSI hdisks with redundant VIOSs, or LV VSCSI hdisks (if you are not worried about VIOS failure).

12.6 Volume size


Larger volumes might need more disk buffers and larger queue_depths, depending on the I/O rates; however, they bring a large benefit in reduced AIX memory use and fewer path management resources. It is worthwhile to tune the queue_depths and adapter resources for this purpose. It is preferable to use fewer, large LUNs, because it is easy to increase the queue_depth (which does require application downtime) and the disk buffers, and because handling more AIX LUNs requires a considerable amount of OS resources.

12.7 Failure boundaries


As mentioned in 12.3.3, General data layout recommendations on page 302, it is important to consider failure boundaries in the back-end storage configuration. If all of the LUNs are spread across all physical disks (either by LVM or SVC volume striping) and you experience a single RAID array failure, you might lose all your data. So, there are situations in which you probably want to limit the spread for certain applications or groups of applications. For example, you might have a group of applications where, if one application fails, none of the applications can perform any productive work. When implementing the SVC, limiting the spread can be accomplished through the storage pool layout. Refer to Chapter 5, Storage pools and Managed Disks on page 71 for more information about failure boundaries in the back-end storage configuration.


Part 2


Management, monitoring and troubleshooting


In this part of the book, we provide information about best practices for monitoring, managing, and troubleshooting your SVC. Practical examples are also included.

Chapter 13. Monitoring
In this chapter, we discuss Tivoli Storage Productivity Center reports and how to use them to monitor your SVC and Storwize V7000 and identify performance problems. We then show several examples of misconfigurations and failures, and how they can be identified in Tivoli Storage Productivity Center by using the Topology Viewer and performance reports. We also show how to collect and view performance data directly from the SVC.

Always use the latest version of Tivoli Storage Productivity Center that is supported by your SVC code; Tivoli Storage Productivity Center is often updated to support new SVC features. If you have an earlier version of Tivoli Storage Productivity Center installed, you might still be able to reproduce the reports described here, but certain data might not be available.

13.1 Using Tivoli Storage Productivity Center to analyze the SVC


In this chapter, we guide you through the reports that are available to monitor your SVC and through the basic steps for resolving performance problems.

13.1.1 IBM SAN Volume Controller (SVC) or Storwize V7000


Tivoli Storage Productivity Center provides several reports that are specific to the SVC and Storwize V7000:

Managed Disk Group (SVC/Storwize V7000 Storage Pool): No additional information is provided here that you need for performance problem determination (see Figure 13-1). The report reflects whether EZ-Tier was introduced into the storage pool.

Figure 13-1 Asset Report: Manage Disk Group (SVC Storage Pool) Detail

Managed Disks: Figure 13-2 shows the managed disks for the selected SVC. No additional information is provided here that you need for performance problem determination. The report was enhanced in Version 4.2.1 to also reflect whether the MDisk is a solid-state disk (SSD). The SVC does not automatically detect SSD MDisks; to mark them as SSD candidates for EZ-Tier, the managed disk tier attribute must be manually changed from generic_hdd to generic_ssd.

Figure 13-2 Tivoli Storage Productivity Center Asset Report: Managed Disk Detail

Virtual Disks: Figure 13-3 shows the virtual disks for the selected SVC, or in this case, a virtual disk (volume) from a Storwize V7000. Note: Virtual disks for either the Storwize V7000 or the SVC are presented identically within Tivoli Storage Productivity Center in this report, so only Storwize V7000 screens were selected; they reflect SVC Version 6.2 as seen by Tivoli Storage Productivity Center V4.2.1.

Figure 13-3 Tivoli Storage Productivity Center Asset Report: Virtual Disk Detail

The virtual disks are referred to as volumes in other performance reports. For the volumes, you see the managed disk (MDisk) on which the virtual disks are allocated, but you do not see the correct RAID level. From an SVC perspective, the data is often striped across the MDisks within a storage pool, so Tivoli Storage Productivity Center displays RAID 0 as the RAID level.

As with many other reports, this one was also enhanced to report on EZ-Tier and space-efficient usage. In this example screen capture, you see that EZ-Tier is enabled for this volume, but still in inactive status. In addition, this report was enhanced to show the amount of storage assigned to this volume from the different tiers (ssd and hdd).

There is another report that can help you see the actual configuration of the volume, including the MDG or storage pool, backend controller, and MDisks, among other details; unfortunately, this information is not available in the asset reports on the MDisks. Volume to Backend Volume Assignment: Figure 13-4 shows the location of the Volume to Backend Volume Assignment report within the Navigation Tree.

Figure 13-4 Volume to Backend Volume Assignment Navigation Location

Figure 13-5 shows the report. Notice that the virtual disks are referred to as volumes in the report.

Figure 13-5 Asset Report: Volume to Backend Volume Assignment

This report provides many details about the volume. While specifics of the RAID configuration of the actual MDisks are not presented, the report is quite useful, because all aspects, from the host perspective to the backend storage, are placed together in one report. The following details are available:
- Storage Subsystem containing the disk in view; for this report, this is the SVC
- Storage Subsystem type; for this report, this is the SVC
- User-Defined Volume Name
- Volume Name
- Volume Space: the total usable capacity of the volume
  Tip: For space-efficient volumes, this value is the amount of storage space requested for these volumes, not the actual allocated amount. This can result in discrepancies in the overall storage space reported for a storage subsystem using space-efficient volumes. This also applies to other space calculations, such as the calculations for the storage subsystem's Consumable Volume Space and FlashCopy Target Volume Space.
- Storage Pool associated with this volume
- Disk: the MDisk that the volume is placed upon
  Note: For SVC or Storwize V7000 volumes spanning multiple MDisks, this report has multiple entries for that volume to reflect the actual MDisks the volume is using.
- Disk Space: the total disk space available on the MDisk
- Available Disk Space: the remaining space available on the MDisk
- Backend Storage Subsystem: the name of the storage subsystem that this MDisk is from
- Backend Storage Subsystem type: the type of storage subsystem
- Backend Volume Name: the volume name for this MDisk as known by the backend storage subsystem (a big time saver)
- Backend Volume Space
- Copy ID
- Copy Type: the type of copy this volume is being used for, such as Primary or Copy, for SVC versions 4.3 and newer. Primary is the source volume, and Copy is the target volume.
- Backend Volume Real Space: for fully allocated backend volumes, this is the actual space; for space-efficient backend volumes, this is the real capacity being allocated
- Easy Tier: indicates whether EZ-Tier is enabled on the volume
- Easy Tier status: active or inactive
- Tiers
- Tier Capacity

Chapter 13. Monitoring

315


13.2 SVC considerations


When you start to analyze the performance of the SVC environment to identify a performance problem, we recommend that you identify all of the components between the host and the backend storage and then verify the performance of each individual component.

13.2.1 SVC traffic


Traffic between a host, the SVC nodes, and a storage controller goes through these paths:
1. The host generates the I/O and transmits it on the fabric.
2. The I/O is received on the SVC node ports.
3. If the I/O is a write I/O:
   a. The SVC node writes the I/O to the SVC node cache.
   b. The SVC node sends a copy to its partner node to write to the partner node's cache.
   c. If the I/O is part of a Metro Mirror or Global Mirror, a copy needs to go to the secondary VDisk of the relationship.
   d. If the I/O is part of a FlashCopy and the FlashCopy block has not been copied to the target VDisk, this action needs to be scheduled.
4. If the I/O is a read I/O:
   a. The SVC needs to check the cache to see if the read I/O is already there.
   b. If the I/O is not in the cache, the SVC needs to read the data from the physical LUNs (managed disks).
5. At some point, write I/Os are sent to the storage controller.
6. The SVC might also do some read-ahead I/Os to load the cache in order to reduce latency on subsequent read commands.
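The write and read paths above can be sketched as simplified logic. This is a conceptual illustration only; the function names, the dict-based caches, and all parameters are hypothetical, not SVC code.

```python
# Conceptual sketch of the SVC write and read paths described above.
# All names here are illustrative; this is not actual SVC code.

def handle_write(local_cache, partner_cache, lba, data,
                 mirror=None, flashcopy_pending=None, copy_scheduler=None):
    """Steps 3a-3d: cache the write locally, mirror it to the partner node,
    and honor any Metro/Global Mirror or FlashCopy relationships."""
    local_cache[lba] = data              # 3a: write into the local node cache
    partner_cache[lba] = data            # 3b: copy to the partner node's cache
    if mirror is not None:               # 3c: forward to the secondary VDisk
        mirror.append((lba, data))
    if flashcopy_pending and lba in flashcopy_pending and copy_scheduler:
        copy_scheduler(lba)              # 3d: schedule copy of the uncopied block
    # Step 5: dirty cache data is destaged to the storage controller later.

def handle_read(local_cache, backend, lba):
    """Steps 4a-4b: serve from cache if possible, else stage from MDisks."""
    if lba in local_cache:               # 4a: cache hit
        return local_cache[lba]
    data = backend[lba]                  # 4b: miss; read from the managed disks
    local_cache[lba] = data              # (step 6 would also prefetch ahead here)
    return data
```

A hypothetical FlashCopy scheduler or mirror list can be passed in to see steps 3c and 3d fire; omitting them models a plain write.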

13.2.2 SVC best practice recommendations for performance


We recommend that you have at least two MDisk groups: one for key applications, and another for everything else. You might want more MDisk groups if you have different device types to be separated, for example, RAID 5 versus RAID 10, or SAS versus Near Line (NL)-SAS. The development recommendations for SVC with DS8000 backends are summarized here:
- One MDisk per extent pool
- One MDisk per storage cluster
- One MDisk group per storage subsystem
- One MDisk group per RAID array type (RAID 5 versus RAID 10)
- One MDisk and MDisk group per disk type (10K versus 15K RPM, or 146 GB versus 300 GB)
There are situations where multiple MDisk groups are desirable:
- Workload isolation
- Short-stroking a production MDisk group
- Managing different workloads in different groups
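The grouping rule above (separate MDisk groups per RAID array type and per disk type) can be sketched as follows. The sample MDisk data and the group-naming scheme are hypothetical.

```python
# Sketch of the recommendation above: keep MDisks of different RAID array
# types and disk types in separate MDisk groups. Sample data is hypothetical.

def group_mdisks(mdisks):
    """Map each (raid_type, disk_type) combination to its own MDisk group."""
    groups = {}
    for name, raid_type, disk_type in mdisks:
        key = f"{raid_type}_{disk_type}"      # one group per combination
        groups.setdefault(key, []).append(name)
    return groups

mdisks = [
    ("mdisk0", "RAID5", "15K_146GB"),
    ("mdisk1", "RAID5", "15K_146GB"),
    ("mdisk2", "RAID10", "15K_146GB"),
    ("mdisk3", "RAID5", "10K_300GB"),
]
```

With this sample data, the four MDisks fall into three separate groups, which mirrors the "one group per RAID type and per disk type" guidance.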

316

SAN Volume Controller Best Practices and Performance Guidelines


13.3 Storwize V7000 considerations


When you start to analyze the performance of the Storwize V7000 environment to identify a performance problem, we recommend that you identify all of the components between the Storwize V7000 and the server, and the backend storage subsystem if one is configured. Then verify the performance of all of the components.

13.3.1 Storwize V7000 traffic


Traffic between a host, the Storwize V7000 node canisters with direct attached storage, and/or a backend storage controller traverses the same paths:
1. The host generates the I/O and transmits it on the fabric.
2. The I/O is received on the Storwize V7000 canister ports.
3. If the I/O is a write I/O:
   a. The Storwize V7000 node canister writes the I/O to its cache.
   b. The preferred canister sends a copy to its partner canister to update the partner canister's cache.
   c. If the I/O is part of a Metro or Global Mirror, a copy needs to go to the secondary volume of the relationship.
   d. If the I/O is part of a FlashCopy and the FlashCopy block has not been copied to the target volume, this action needs to be scheduled.
4. If the I/O is a read I/O:
   a. The Storwize V7000 needs to check the cache to see if the read I/O is already there.
   b. If the I/O is not in the cache, the Storwize V7000 needs to read the data from the physical MDisks.
5. At some point, write I/Os are destaged to Storwize V7000 managed MDisks or sent to the backend SAN attached storage controller(s).
6. The Storwize V7000 might also perform sequential-detect prefetch cache I/Os to preload the cache when its cache algorithms determine that the next read I/O is part of a sequential stream. This approach benefits sequential I/O when compared with the more common Least Recently Used (LRU) method used for nonsequential I/O.

13.3.2 Storwize V7000 best practice recommendations for performance


We recommend that you have at least two storage pools for internal MDisks, and two for external MDisks from external storage subsystems. Each of these storage pools, whether built from internal or external MDisks, provides the basis for either a general-purpose class of storage or for a higher performance or high availability class of storage. You might want more storage pools if you have different device types to be separated, for example, RAID 5 versus RAID 10, or SAS versus Near Line (NL)-SAS. The development recommendations for Storwize V7000 are summarized below:
- One MDisk group per storage subsystem
- One MDisk group per RAID array type (RAID 5 versus RAID 10)
- One MDisk and MDisk group per disk type (10K versus 15K RPM, or 146 GB versus 300 GB)


There are situations where multiple MDisk groups are desirable:
- Workload isolation
- Short-stroking a production MDisk group
- Managing different workloads in different groups

13.4 Top 10 reports for SVC and Storwize V7000


Top 10 reports from Tivoli Storage Productivity Center are a very common request. In this section, we summarize which reports you should create to begin your performance analysis of an SVC or Storwize V7000 virtualized storage environment. Figure 13-6 on page 318 is numbered with our recommended sequence. In other cases, such as performance analysis for a particular server, we might recommend that you follow another sequence, starting with Managed Disk Group performance. This allows you to quickly identify the Managed Disks and VDisks belonging to the server you are analyzing.

Figure 13-6 Top 10 reports - sequence to proceed

Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk to view system reports that are relevant to SVC and Storwize V7000. I/O Group Performance and Managed Disk Group Performance are reports specific to SVC and Storwize V7000, while Module/Node Cache Performance is also available for the IBM XIV. In Figure 13-7 those reports are highlighted:


Figure 13-7 System reports for SVC and Storwize V7000

Figure 13-8 shows a sample structure for reviewing basic SVC structural concepts before proceeding with performance analysis at the different component levels.

Figure 13-8 SVC and Storwize V7000 sample structure (three 1 TB VDisks, or 3 TB of virtualized storage, served by an I/O group of two SVC nodes; four 2 TB MDisks, or 8 TB of managed storage, used to determine SVC storage software usage; raw storage provided by DS4000, 5000, 6000, 8000, XIV, and so on, or by internal storage on the Storwize V7000 only)

13.4.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports
Note: For SVCs with multiple I/O groups, a separate row is generated for every I/O group within each SVC.


In our lab environment, data was collected for an SVC with a single I/O group. The scroll bar at the bottom of the table indicates that additional metrics can be viewed, as shown in Figure 13-9.

Figure 13-9 I/O group performance

Important: The data displayed in a performance report is the last collected value at the time the report is generated. It is not an average of the last hours or days; it simply shows the last data collected.

Click the drill-down icon next to the SVC io_grp0 entry to drill down and view the statistics by nodes within the selected I/O group. Notice that a new tab, Drill down from io_grp0, is created containing the report for the nodes within the SVC. See Figure 13-10.

Figure 13-10 Drill down from io_grp0

To view a historical chart of one or more specific metrics for the resources, click the icon. A list of metrics is displayed, as shown in Figure 13-11. You can select one or more metrics that use the same measurement unit. If you select metrics that use different measurement units, you will receive an error message.

CPU Utilization percentage


The CPU Utilization reports give you an indication of how busy the cluster nodes are. To generate a graph of CPU utilization by node, select the CPU Utilization Percentage metric and click Ok.


Figure 13-11 SVC CPU utilization selection

You can change the reporting time range and click the Generate Chart button to re-generate the graph, as shown in Figure 13-12 on page 322. A continually high node CPU utilization rate indicates a busy I/O group; in our environment, CPU utilization does not rise above 24%, which is a more than acceptable value.

Recommendations (SVC only)


If the CPU utilization for the SVC nodes remains consistently above 70%, it might be time to increase the number of I/O Groups in the cluster. You can also redistribute workload to other I/O groups in the SVC cluster, if available. You can add I/O Groups up to the maximum of four I/O Groups per SVC cluster. If there are already four I/O Groups in a cluster (with the latest firmware installed) and you are still seeing high SVC node CPU utilization in the reports, it is time to build a new cluster and consider either migrating some storage to the new cluster or, if the existing SVC nodes are not 2145-CG8 nodes, upgrading them to CG8 nodes.
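The 70% guideline above can be checked programmatically against exported samples. This is a sketch with hypothetical sample data and function names, not a Tivoli Storage Productivity Center API.

```python
# Sketch of the 70% CPU guideline above: flag nodes whose average CPU
# utilization exceeds the threshold. Sample data is hypothetical.

def overloaded_nodes(samples, threshold=70.0):
    """Return node names whose average CPU utilization exceeds the threshold."""
    return sorted(
        node for node, values in samples.items()
        if sum(values) / len(values) > threshold
    )

cpu_samples = {
    "node1": [24.0, 18.5, 22.0, 19.0],   # healthy, like our lab environment
    "node2": [82.0, 76.5, 79.0, 88.0],   # candidate for another I/O group
}
```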


Figure 13-12 SVC CPU utilization graph

I/O Rate (overall)


To view the overall total I/O rate, click the Drill down from io_grp0 tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Total I/O Rate (overall) metric. Then click Ok. See Figure 13-13 on page 322.

Figure 13-13 I/O rate


Notice that the I/Os are only present on node 2. So, in Figure 13-15 on page 324, you can see a configuration problem where the workload is not well balanced, at least during this time frame (this is the reason for the red traffic light shown in that figure).

Recommendations
To interpret your performance results, the first recommendation is to always go back to your baseline. For information about creating a baseline, refer to SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364. Moreover, some industry benchmarks for the SVC and Storwize V7000 are available. SVC 4.2 and the 8G4 node brought a dramatic increase in performance, as demonstrated by the results in the Storage Performance Council (SPC) benchmarks, SPC-1 and SPC-2. The benchmark number, 272,505.19 SPC-1 IOPS, is the industry-leading OLTP result, and the PDF is available at the following URL:
http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf
An SPC Benchmark 2 was also performed for the Storwize V7000; the Executive Summary PDF is available at the following URL:
http://www.storageperformance.org/benchmark_results_files/SPC-2/IBM_SPC-2/B00052_IBM_Storwize-V7000/b00052_IBM_Storwize-V7000_SPC2_executive-summary.pdf
Figure 13-14 on page 324 shows numbers for maximum I/Os and MB/s per I/O group. Realize that the SVC performance you actually obtain will be based upon multiple factors, including:
- The specific SVC nodes in your configuration
- The type of managed disks (volumes) in the Managed Disk Group (MDG)
- The application I/O workloads using the MDG
- The paths to the backend storage
These are all factors that ultimately determine the performance realized. In reviewing the SPC benchmark (see Figure 13-14), depending upon the transfer block size used, the results for the I/O and data rates obtained are quite different. Looking at the two-node I/O group used, you might see 122,000 I/Os if all of the transfer blocks were 4K. In typical environments, they rarely are. With larger transfer sizes of 64K, or with anything over about 32K, you might realize a result more typical of the 29,000 seen by the SPC benchmark.
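The relationship behind these figures is simply that data rate equals IOPS multiplied by transfer size. A quick worked sketch (the function name is ours; the IOPS values are the 2145-8G4 numbers quoted above):

```python
# Data rate is IOPS multiplied by transfer size; the 2145-8G4 numbers above
# fall out directly. The helper name is illustrative.

def data_rate_mbs(iops, transfer_bytes):
    """Return throughput in MB/s (decimal) for a given IOPS and block size."""
    return iops * transfer_bytes / 1_000_000

# ~122K IOPS at 4K transfers is about 500 MB/s ...
small = data_rate_mbs(122_000, 4 * 1024)
# ... while ~29K IOPS at 64K transfers already moves about 1.9 GB/s.
large = data_rate_mbs(29_000, 64 * 1024)
```

This is why a workload with larger transfer sizes can post far lower IOPS yet still saturate the I/O group's bandwidth.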


Max I/Os and MB/s per I/O group (70/30 R/W miss):

Node model   4K transfer size       64K transfer size
2145-8G4     122K IOPS, 500 MB/s    29K IOPS, 1.8 GB/s
2145-8F4     72K IOPS, 300 MB/s     23K IOPS, 1.4 GB/s
2145-4F2     38K IOPS, 156 MB/s     11K IOPS, 700 MB/s
2145-8F2     72K IOPS, 300 MB/s     15K IOPS, 1 GB/s

Figure 13-14 SPC SVC benchmark Max I/Os and MB/s per I/O group

As mentioned before, in the I/O rate graph shown in Figure 13-15, you can see a configuration problem indicated by the red traffic light in the lower right corner.

Figure 13-15 I/O rate graph


Response time
To view the read and write response time at Node level, click the Drill down from io_grp0 tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Backend Read Response Time and Backend Write Response Time metrics, as shown in Figure 13-16.

Figure 13-16 SVC Node Response time selection

Click Ok to generate the report, as shown in Figure 13-17 on page 326. We see acceptable backend response time values for both read and write operations, and these are consistent across both of our I/O groups.

Recommendations
For random read I/O, the backend rank (disk) read response times should seldom exceed 25 msec, unless the read hit ratio is near 99%. Backend write response times will be higher because of RAID 5 (or RAID 10) algorithms, but should seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines. In case of poor response time, you should investigate using all available information from the SVC and the backend storage controller. Possible causes for a large change in response times from the backend storage that might be visible using the storage controller management tool include:
- Physical array drive failure leading to an array rebuild. This drives additional backend storage subsystem internal read/write workload while the rebuild is in progress. If this is causing poor latency, it might be desirable to adjust the array rebuild priority to lessen the load. However, this must be balanced with the increased risk of a second drive failure during the rebuild, which would cause data loss in a RAID 5 array.
- Cache battery failure leading to cache being disabled by the controller. This can usually be resolved simply by replacing the failed battery.
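The rules of thumb above (reads seldom over 25 msec, writes seldom over 80 msec) can be applied to exported interval data to spot the intervals worth investigating. This is a sketch with hypothetical sample data, not a Tivoli Storage Productivity Center API.

```python
# Sketch of the response-time rules of thumb above: flag intervals whose
# backend read or write response time breaks the guideline. Sample data
# and names are hypothetical.

READ_LIMIT_MS = 25.0    # backend read guideline from the text
WRITE_LIMIT_MS = 80.0   # backend write guideline from the text

def intervals_exceeding(samples):
    """Return timestamps whose backend response times break the guidelines."""
    return [
        ts for ts, read_ms, write_ms in samples
        if read_ms > READ_LIMIT_MS or write_ms > WRITE_LIMIT_MS
    ]

samples = [
    ("12:00", 8.2, 21.0),
    ("12:05", 31.4, 25.0),   # read above 25 msec: investigate the backend
    ("12:10", 9.1, 95.0),    # write above 80 msec: check cache/rebuild state
]
```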

Figure 13-17 SVC Node Response time report

For further details about Rules of Thumb and how to interpret these values, consult the SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364-02: http://www.redbooks.ibm.com/redbooks/pdfs/sg247364.pdf

Data Rate
To look at the read data rate, click the Drill down from io_grp0 tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Read Data Rate metric. Hold down the Shift key and also select Write Data Rate and Total Data Rate. Then click Ok to generate the chart, shown in Figure 13-18 on page 327.


Figure 13-18 SVC Data Rate graph

To interpret your performance results, the first recommendation is to always go back to your baseline. For information about creating a baseline, refer to SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364. Moreover, a benchmark is available. The throughput benchmark, 7,084.44 SPC-2 MBPS, is the industry-leading throughput result, and the PDF is available here:
http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf

13.4.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports
Efficient use of cache can help enhance virtual disk I/O response time. The Node Cache Performance report displays a list of cache-related metrics, such as Read and Write Cache Hits percentage and Readahead percentage of cache hits. The cache memory resource reports provide an understanding of the utilization of the SVC or Storwize V7000 cache. These reports give you an indication of whether the cache is able to service and buffer the current workload. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Select the Module/Node Cache performance report. Notice that this report is generated at the SVC and Storwize V7000 node level (there is also an entry that refers to an IBM XIV storage device), as shown in Figure 13-19 on page 328.


Figure 13-19 SVC and Storwize V7000 Node cache performance report

Cache Hit percentage


Total Cache Hit percentage is the percentage of reads and writes that are handled by the cache without needing immediate access to the backend disk arrays. Read Cache Hit percentage focuses on reads, because writes are almost always recorded as cache hits. If the cache is full, a write might be delayed while some changed data is destaged to the disk arrays to make room for the new write data. The Read and Write Transfer Sizes are the average number of bytes transferred per I/O operation. To look at the read cache hit percentage for the Storwize V7000 nodes, select both nodes, click the icon, and select Read Cache Hits percentage (overall). Then click Ok to generate the chart, shown in Figure 13-20.

Figure 13-20 Storwize V7000 Cache Hits percentage - no traffic on node1

Important: The flat line for node1 does not mean that the read requests for that node cannot be handled by the cache; it means that there is no traffic at all on that node, as illustrated in Figure 13-21 on page 329 and Figure 13-22 on page 329, where Read Cache Hit Percentage and Read I/O Rates are compared over the same time interval.
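For reference, the hit-percentage arithmetic behind these charts reduces to hit operations divided by total operations. The helper names below are ours, not Tivoli Storage Productivity Center field names; a node with no traffic simply has no operations to divide by.

```python
# Sketch of the cache hit percentage arithmetic described above.
# Function and counter names are illustrative only.

def hit_percentage(hits, total_ops):
    """Percentage of operations satisfied from cache; 0 if there was no I/O."""
    return 100.0 * hits / total_ops if total_ops else 0.0

def total_cache_hit_percentage(read_hits, reads, write_hits, writes):
    """Overall hit rate, weighted by the read and write operation counts."""
    return hit_percentage(read_hits + write_hits, reads + writes)
```

With 30 read hits out of 100 reads and 100 write hits out of 100 writes, the overall figure is 65%, even though the write hit rate alone is 100%.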


Figure 13-21 Storwize V7000 Read Cache Hit Percentage

Figure 13-22 Storwize V7000 Read I/O Rate


This might not be a good configuration, because the two nodes are not balanced. In our lab environment, the volumes defined on the Storwize V7000 were all defined with node2 as the preferred node. After moving the preferred node for volume tpcblade3-7-ko from node2 to node1, we obtained the graphs shown in Figure 13-23 and Figure 13-24 on page 331 for Read Cache Hit percentage and Read I/O Rates:

Figure 13-23 Storwize V7000 Cache Hit Percentage after reassignment


Figure 13-24 Storwize V7000 Read I/O rate after reassignment

Recommendations
Read hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good backend response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the backend ranks can be driven a little harder, to higher utilizations. If you need to analyze cache performance further and determine whether the cache is sufficient for your workload, you can chart multiple metrics. Select the metrics named percentage, because you can have multiple metrics with the same unit type in one chart. In the Selection panel, move the percentage metrics you want to include from the Available Column to the Included Column; then, under the Selection button, check only the Storwize V7000 entries. Figure 13-25 on page 333 shows an example where several percentage metrics are chosen for the Storwize V7000. The complete list of metrics is as follows:
- CPU utilization percentage: The average utilization of the node controllers in this I/O group during the sample interval.
- Dirty Write percentage of Cache Hits: The percentage of write cache hits which modified only data that was already marked dirty in the cache; re-written data. This is an obscure measurement of how effectively writes are coalesced before destaging.


- Read/Write/Total Cache Hits percentage (overall): The percentage of reads/writes/total during the sample interval that are found in cache. This is an important metric. The write cache hit percentage should be very nearly 100%.
- Readahead percentage of Cache Hits: An obscure measurement of cache hits involving data that has been prestaged for one reason or another.
- Write Cache Flush-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Flush-through write mode during the sample interval.
- Write Cache Overflow percentage: For SVC and Storwize V7000, the percentage of write operations that were delayed due to lack of write-cache space during the sample interval.
- Write Cache Write-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Write-through write mode during the sample interval.
- Write Cache Delay percentage: The percentage of all I/O operations that were delayed due to write-cache space constraints or other conditions during the sample interval. Only writes can be delayed, but the percentage is of all I/O.
- Small Transfers I/O percentage: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are <= 8 KB.
- Small Transfers Data percentage: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are <= 8 KB.
- Medium Transfers I/O percentage: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 8 KB and <= 64 KB.
- Medium Transfers Data percentage: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 8 KB and <= 64 KB.
- Large Transfers I/O percentage: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 64 KB and <= 512 KB.
- Large Transfers Data percentage: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 64 KB and <= 512 KB.
- Very Large Transfers I/O percentage: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 512 KB.
- Very Large Transfers Data percentage: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 512 KB.
- Overall Host Attributed Response Time Percentage: The percentage of the average response time, both read response time and write response time, that can be attributed to delays from host systems. This metric is provided to help diagnose slow hosts and poorly performing fabrics. The value is based on the time taken for hosts to respond to transfer-ready notifications from the SVC nodes (for read), and the time taken for hosts to send the write data after the node has responded to a transfer-ready notification (for write).

The following metric is only applicable in a Global Mirror session:
- Global Mirror Overlapping Write Percentage: Average percentage of write operations issued by the Global Mirror primary site which were serialized overlapping writes, for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes were serialized.
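The transfer-size metrics above all share the same bucket boundaries, so the classification can be sketched directly (the function name is ours):

```python
# Sketch of the transfer-size buckets used by the metrics above: small is
# <= 8 KB, medium <= 64 KB, large <= 512 KB, and very large anything bigger.

KB = 1024

def transfer_bucket(size_bytes):
    """Classify one I/O's data transfer size into its reporting bucket."""
    if size_bytes <= 8 * KB:
        return "small"
    if size_bytes <= 64 * KB:
        return "medium"
    if size_bytes <= 512 * KB:
        return "large"
    return "very large"
```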

Figure 13-25 Storwize V7000 multiple metrics Cache selection

Then select all the metrics in the Select charting option pop-up window and click Ok to generate the chart. In our test, we notice in Figure 13-26 on page 334 that there is a drop in the Cache Hits percentage. Although the drop is not dramatic, it can be considered an example of a symptom warranting further investigation. Changes in these performance metrics, together with an increase in backend response time (see Figure 13-27 on page 334), show that the storage controller is heavily burdened with I/O, and the Storwize V7000 cache could become full of outstanding write I/Os. Host I/O activity will be impacted by the backlog of data in the Storwize V7000 cache and by any other Storwize V7000 workload going to the same MDisks.


Note: For SVC only: if cache utilization is a problem, you can add additional cache to the cluster by adding an I/O Group and moving VDisks to the new I/O Group. Also, note that adding an I/O Group and moving VDisks from one I/O group to another is still a disruptive action, so proper planning to manage this disruption is required. You cannot add an I/O Group to a Storwize V7000.

Figure 13-26 Storwize V7000 Multiple nodes resource performance metrics

Figure 13-27 Storwize V7000 increased Overall Backend response Time


For further details about Rules of Thumb and how to interpret these values, consult the SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364-02.

13.4.3 Top 10 for SVC #3: Managed Disk Group Performance reports
The Managed Disk Group performance report provides disk performance information at the managed disk group level. It summarizes read and write transfer size, and backend read, write, and total I/O rates. From this report, you can easily drill up to see the statistics of the virtual disks supported by a managed disk group, or drill down to view the data for the individual MDisks that make up the managed disk group. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk and select Managed Disk Group performance. A table is displayed listing all the known managed disk groups and their last collected statistics, based on the latest performance data collection. See Figure 13-28.

Figure 13-28 Managed Disk Group performance

One of the Managed Disk Groups is CET_DS8K1901mdg. Click the drill down icon on the entry CET_DS8K1901mdg to drill down. A new tab is created, containing the Managed Disks in the Managed Disk Group. See Figure 13-29.

Figure 13-29 Drill down from Managed Disk Group Performance report


Click the drill down icon on the entry mdisk61 to drill down. A new tab is created, containing the Volumes in the Managed Disk. See Figure 13-30.

Figure 13-30 Drill down from Managed Disk performance report

I/O rate
We recommend that you analyze how the I/O workload is split between Managed Disk Groups to determine whether it is well balanced. Click the Managed Disk Groups tab, select all Managed Disk Groups, click the icon, and select Total Backend I/O Rate, as shown in Figure 13-31.

Figure 13-31 Top 10 SVC - Managed Disk Group I/O rate selection

Click Ok to generate the next chart, as shown in Figure 13-32 on page 337. When reviewing this general chart, you must understand that it reflects all I/O to the backend storage from the MDisks included within this MDG. The key for this report is a general understanding of backend I/O rate usage, not whether there is outright balance.


While the SVC and Storwize V7000 by default stripe write and read I/Os across all MDisks, the striping is not a RAID 0 type of stripe. Rather, because the VDisk is a concatenated volume, the striping injected by the SVC and Storwize V7000 is only in how the extents to be used are identified when the VDisk is created. Until host I/O write actions fill up the first extent, the remaining extents in the block VDisk provided by the SVC will not be used. It is very likely that, when you look at the MDG backend I/O report, you will not see a balance of write activity, even for a single MDG. In the report shown in Figure 13-32, for the time frame specified, we see that at one point we have a maximum of nearly 8,200 IOPS.

Figure 13-32 Top 10 SVC - Managed Disk Group I/O rate report
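The extent behavior described above can be sketched as a mapping from VDisk offset to MDisk: extents are assigned to MDisks round-robin at creation time, but host writes consume the VDisk address space from the beginning, so early writes land on only the first extent. The extent size and all names below are illustrative only.

```python
# Sketch of the extent layout described above: extents map round-robin to
# MDisks, but a host filling the VDisk from offset 0 touches only the first
# extent(s) at first. Extent size and MDisk names are hypothetical.

EXTENT_MB = 256   # assumed extent size for this illustration

def extent_map(num_extents, mdisks):
    """Round-robin mapping of VDisk extent index to MDisk name."""
    return [mdisks[i % len(mdisks)] for i in range(num_extents)]

def mdisk_for_offset(offset_mb, mapping):
    """Which MDisk receives a host I/O at the given VDisk offset (in MB)."""
    return mapping[offset_mb // EXTENT_MB]

mapping = extent_map(8, ["mdisk0", "mdisk1", "mdisk2", "mdisk3"])
```

Every write below 256 MB lands on mdisk0 in this sketch, which is why early write activity looks unbalanced in the backend report.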

Response Time
Now you can get back to the list of MDisks by moving to the Drill down from CET_DS8K1901mdg tab (see Figure 13-29 on page 335). Select all the Managed Disk entries, click the icon, and select the Backend Read Response Time metric, as shown in Figure 13-33 on page 338.


Figure 13-33 Managed disk Backend Read Response Time

Then click Ok to generate the chart, as shown in Figure 13-34 on page 339.

Recommendations
For random read I/O, the backend rank (disk) read response time should seldom exceed 25 msec, unless the read hit ratio is near 99%. Backend Write Response Time will be higher because of RAID5 (or RAID10) algorithms, but should seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines.


Figure 13-34 Backend response time

Backend Data Rates


Backend throughput and response time depend on the actual DDMs in use by the storage subsystem from which the LUN or volume was created, and on the specific RAID type in use. With this report, you can also check how the MDisk workload is distributed. Select all the Managed Disks from the Drill down from CET_DS8K1901mdg tab, click the icon, and select the Backend Data Rates, as shown in Figure 13-35 on page 340.


Figure 13-35 MDisk Backend Data Rates selection

Click Ok to generate the report shown in Figure 13-36. Here, the workload is not balanced across the MDisks.

Figure 13-36 MDisk Backend Data Rates report


13.4.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports
Tivoli Storage Productivity Center provides five reports on top volume performance:
- Top Volume Cache performance: Prioritized by the Total Cache Hits percentage (overall) metric.
- Top Volume Data Rate performance: Prioritized by the Total Data Rate metric.
- Top Volume Disk performance: Prioritized by the Disk to Cache Transfer Rate metric.
- Top Volume I/O Rate performance: Prioritized by the Total I/O Rate (overall) metric.
- Top Volume Response performance: Prioritized by the overall response time metric.
The volumes referenced in these reports correspond to the VDisks in SVC.

Important: The last collected performance data on volumes is used for the reports. The report creates a ranked list of volumes based on the metric used to prioritize the performance data. You can customize these reports according to the needs of your environment. To limit these system reports to just SVC subsystems, you have to specify a filter, as shown in Figure 13-37. Click the Selection tab, then click Filter. Click Add to specify another condition to be met. This has to be done for all five reports.

Figure 13-37 SVC Top Volumes Filter selection
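The ranking that these reports perform amounts to a top-N sort over the last collected performance records. The sketch below illustrates the idea in Python; the record layout and field names are invented for illustration and are not the actual Tivoli Storage Productivity Center data model.

```python
def top_volumes(records, metric, n=25):
    """Rank volume performance records by one metric, highest first,
    mimicking how the Top Volume reports prioritize their output."""
    return sorted(records, key=lambda r: r[metric], reverse=True)[:n]

# Hypothetical last-collected records for three volumes
records = [
    {"volume": "vdisk1", "total_io_rate": 950.0, "response_ms": 4.0},
    {"volume": "vdisk2", "total_io_rate": 1200.0, "response_ms": 12.0},
    {"volume": "vdisk3", "total_io_rate": 300.0, "response_ms": 30.0},
]
by_io = top_volumes(records, "total_io_rate", n=2)
```

Sorting the same records by a different metric (for example, response time) produces a different top list, which is why the five reports can surface different volumes.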

Top Volume Cache performance


This report shows the cache statistics for the top 25 volumes, prioritized by the Total Cache Hits percentage (overall) metric, as shown in Figure 13-38 on page 342. This is the weighted average of read cache hits and write cache hits. The percentage of writes that are handled in cache should be 100% for most enterprise storage. An important metric is the percentage of reads during the sample interval that are found in cache.


Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good backend response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the backend ranks can be driven a little harder, to higher utilizations.
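The weighted-average idea behind the overall hit metric can be illustrated with a small sketch. The function and field names are hypothetical; they only mirror the arithmetic described above, not the product's actual implementation.

```python
def overall_cache_hit_pct(read_ops, read_hits, write_ops, write_hits):
    """Overall hit percentage as an I/O-weighted average of read and
    write cache hits (an illustrative sketch of the metric)."""
    total_ops = read_ops + write_ops
    if total_ops == 0:
        return 0.0
    return 100.0 * (read_hits + write_hits) / total_ops

# Example: a 30% read hit ratio combined with 100% write hits
pct = overall_cache_hit_pct(read_ops=8000, read_hits=2400,
                            write_ops=2000, write_hits=2000)  # 44.0
```

Even with writes hitting cache 100% of the time, a low read hit ratio pulls the overall percentage down, which is why the read hit ratio deserves separate attention.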

Figure 13-38 SVC Top volume Cache Hit performance report

Top Volume Data Rate performance


To find out the top five volumes with the highest total data rate during the last data collection time interval, expand IBM Tivoli Storage Productivity Center Reporting System Reports Disk. Click Top Volumes Data Rate Performance. By default, the scope of the report is not limited to a single storage subsystem. Tivoli Storage Productivity Center interrogates the data collected for all the storage subsystems that it has statistics for and creates the report with a list of 25 volumes that have the highest total Data Rate. To limit the output, click the Selection tab to enter 5 as the maximum number of rows to be displayed on the report, as shown in Figure 13-39 on page 343.


Figure 13-39 Top Volume Data rate selection

Click Generate Report on the Selection panel to regenerate the report, shown next in Figure 13-40. If this report is generated during peak run time periods, the volumes with the highest total data rate at that time are listed on the report.

Figure 13-40 SVC Top Volume Data rate report


Top Volume Disk Performance


This report includes many cache-related and volume-related metrics. Figure 13-41 shows the list of the top 25 volumes, prioritized by the Disk to Cache Transfer Rate metric. This metric indicates the average number of track transfers per second from disk to cache during the sample interval.

Figure 13-41 SVC Top Volume Disk performance

Top Volume I/O Rate Performance


The top volume data rate performance, top volume I/O rate performance and top volume response performance reports include the same type of information, but due to different sorting, other volumes might be included as the top volumes. Figure 13-42 on page 345 shows the top 25 volumes, prioritized by the Total I/O Rate (overall) metrics.

Recommendations
The throughput for storage volumes can range from fairly small numbers (1 to 10 I/Os per second) to very large values (more than 1000 I/Os per second), depending a lot on the nature of the application. When the I/O rate approaches 1000 IOPS per volume, it is usually because the volume is getting very good performance from very good cache behavior; otherwise, it is not possible to sustain that many IOPS on a single volume.


Figure 13-42 SVC Top Volume I/O Rate performances

Top Volume Response performance


The top volume data rate performance, top volume I/O rate performance and top volume response performance reports include the same type of information, but due to different sorting, other volumes might be included as the top volumes in this report. Figure 13-43 on page 346 shows the top 25 volumes, prioritized by the Overall Response Time metrics.

Recommendations
Typical response time ranges are only slightly more predictable. In the absence of additional information, we often assume (and our performance models assume) that 10 milliseconds is pretty high. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with 20 msec response time. The appropriate value might also change between shifts or on the weekend. A response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent.

The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days can range from 4 to 8 msec, though in practice, many workloads do better than nominal. It is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time for the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect 5-7 msec service time for a simple disk access.

Under ordinary queueing assumptions, a disk operating at 50% utilization would have a wait time roughly equal to the service time. So 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.
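The service-time arithmetic above can be sketched as a toy model. The seek, RPM, and transfer figures below are hypothetical, and the wait-time formula is the standard single-server queueing approximation, under which a disk at 50% utilization waits roughly one service time.

```python
def disk_service_time_ms(seek_ms, rpm, transfer_ms):
    # Latency is taken as half a rotation, per the discussion above.
    latency_ms = 0.5 * 60000.0 / rpm
    return seek_ms + latency_ms + transfer_ms

def disk_response_time_ms(service_ms, utilization):
    # M/M/1-style wait: service * u / (1 - u). At u = 0.5 the wait
    # equals the service time, doubling the response time.
    return service_ms + service_ms * utilization / (1.0 - utilization)

service = disk_service_time_ms(seek_ms=4.0, rpm=10000, transfer_ms=0.5)  # 7.5 ms
response = disk_response_time_ms(service, utilization=0.5)               # 15.0 ms
```

With these illustrative inputs the model lands in the same 10-15 msec neighborhood discussed above, which is why 10 msec is a reasonable, if arbitrary, reference point for spinning disks.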


For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the backend storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which would produce high rank response times, which are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require pretty good response times from the backend disk ranks in order to produce an acceptable overall average response time.

To simplify, we can assume that (front end) response times probably need to be in the 5-15 msec range. The rank (backend) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Backend write response times can be even higher, generally up to 80 msec.

Important: None of the above considerations is valid for SSDs, where seek time and rotational latency do not apply. We should expect much better performance from these disks and therefore very short response times (less than 4 ms).

Figure 13-43 SVC TOP volume Response performance report

Refer to 13.8, Case study: Top volumes response time and I/O rate performance report on page 368 to create a tailored report for your environment.

13.4.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports
The SVC and Storwize V7000 port performance reports help you understand the SVC and Storwize V7000 impact on the fabric and give you an indication of the traffic between:
- SVC (or Storwize V7000) and hosts that receive storage
- SVC (or Storwize V7000) and backend storage
- Nodes in the SVC (or Storwize V7000) cluster


These reports can help you understand whether the fabric might be a performance bottleneck and whether upgrading the fabric can lead to a performance improvement. The port performance report summarizes the various send, receive, and total port I/O rates and data rates. Expand IBM Tivoli Storage Productivity Center → My Reports → System Reports → Disk and click Port Performance. In order to display only SVC and Storwize V7000 ports, click Filter to produce a report for the ports belonging to SVC or Storwize V7000 subsystems, as shown in Figure 13-44:

Figure 13-44 Port Performance Report - Subsystems filters

A separate row is generated for each subsystem port. The information displayed in each row reflects the data last collected for that port. Notice that the Time column displays the last collection time, which might differ between subsystem ports. Not all the metrics in the Port Performance report are applicable to all ports. For example, the Port Send Utilization Percentage, Port Receive Utilization Percentage, and Overall Port Utilization Percentage data are not available on SVC or Storwize V7000 ports. N/A is displayed when data is not available, as shown in Figure 13-45. By clicking Total Port I/O Rate you get a list prioritized by I/O rate.

Figure 13-45 Port Performance report

At this point you can verify whether the data rates seen on the backend ports are beyond the normal range expected for the speed of your fibre links, as shown in Figure 13-46 on page 348. This report is typically generated to support problem determination, capacity management, or SLA reviews. On this 8 Gb per second fabric, these rates are well below the throughput capability of the fabric, so the fabric is not a bottleneck here.


Figure 13-46 SVC and Storwize V7000 Port I/O rate report


Then generate another historical chart with the Port Send Data Rate and Port Receive Data Rate metric, as shown in Figure 13-47, which confirms the unbalanced workload for one port.

Recommendations
Based on the nominal speed of each FC port, which could be 4 Gbit, 8 Gbit or more, we recommend not to exceed 50-60% of that value as a data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you should generate an alert when it exceeds 400 MB/sec.
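As a sketch, this alerting rule can be expressed as a simple calculation. The helper names are invented, and the "N Gbit is roughly N x 100 MB/sec of payload" conversion is the rough rule used in this chapter, not an exact FC line-rate figure.

```python
def port_alert_threshold_mb_s(speed_gbit, fraction=0.5):
    """Alert threshold for an FC port: an N Gbit link moves roughly
    N * 100 MB/sec of payload, and we alert beyond ~50% of that."""
    return speed_gbit * 100.0 * fraction

def port_over_threshold(observed_mb_s, speed_gbit):
    return observed_mb_s > port_alert_threshold_mb_s(speed_gbit)
```

For an 8 Gbit port this yields the 400 MB/sec alert level quoted above; raising `fraction` to 0.6 gives the upper end of the 50-60% guidance.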

Figure 13-47 SVC and Storwize V7000 Port Data Rate report

To investigate further using the port performance report, go back to the I/O group performance report. Expand IBM Tivoli Storage Productivity Center → My Reports → System Reports → Disk. Click I/O Group Performance and drill down to the node level. In the example in Figure 13-48 we choose node 1 of the SVC subsystem:

Figure 13-48 SVC node port selection


Then click the icon and select Port to Local Node Send Queue Time, Port to Local Node Receive Queue Time, Port to Local Node Receive Response Time and Port to Local Node Send Response Time, as shown in Figure 13-49:

Figure 13-49 SVC Node port selection queue time

Look at port rates between SVC nodes, hosts, and disk storage controllers. Figure 13-50 shows low queue and response times, indicating that the nodes do not have a problem communicating with each other.

Figure 13-50 SVC Node ports report


If this report shows high queue and response times, the write activity (because each node communicates to each other node over the fabric) is affected. Unusually high numbers in this report indicate:
- SVC (or Storwize V7000) node or port problem (unlikely)
- Fabric switch congestion (more likely)
- Faulty fabric ports or cables (most likely)

Identify over-utilized ports


You can verify whether any Host Adapter or SVC (or Storwize V7000) ports are heavily loaded and whether the workload is balanced between the specific ports of a subsystem that your application server is using. If you identify an imbalance, you then need to review whether the imbalance is a problem or not. If there is an imbalance, but the response times and data rates are acceptable, then taking a note of the impact might be the only action required. If there is a problem at the application level, then a review of the volumes using these ports, and of their I/O and data rates, will determine whether redistribution is required. To support this review, generate a port chart using a date range that covers the specific time frame when you know the I/O load was in place. Then select the Total Port I/O Rate metric on all of the SVC (or Storwize V7000) ports, or on the specific Host Adapter ports in question. The graphical report shown in Figure 13-51 on page 351 refers to all the Storwize ports:

Figure 13-51 SVC Port I/O Send/Receive Rate
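One simple way to quantify such an imbalance, once you have exported the per-port rates from the chart data, is to compare the busiest port against the average. This is an illustrative calculation, not a Tivoli Storage Productivity Center feature; the sample rates are invented.

```python
from statistics import mean

def port_imbalance(rates_mb_s):
    """Ratio of the busiest port's rate to the average rate across all
    ports; values well above 1.0 suggest an unbalanced workload."""
    avg = mean(rates_mb_s)
    return max(rates_mb_s) / avg if avg > 0 else 0.0

rates = [110.0, 95.0, 105.0, 420.0]  # hypothetical per-port data rates
ratio = port_imbalance(rates)        # one port carries well over 2x the average
```

A ratio near 1.0 means the ports share the load evenly; a ratio of 2 or more is worth the review described above.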

After you have the I/O rate review chart, you also need to generate a data rate chart for the same time frame. This will support a review of your HA ports for this application. Generate another historical chart with the Total Port Data Rate metric, as shown in Figure 13-52 on page 352, which confirms the unbalanced workload for the one port shown in the foregoing report.

Recommendations
According to the nominal speed of each FC port, which could be 4 Gbit, 8 Gbit or more, we recommend not to exceed 50-60% of that value as a data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you should generate an alert when it exceeds 400 MB/sec.

Figure 13-52 Port Data Rate report

13.5 Reports for Fabric and Switches


Fabric and switches provide metrics for which you cannot create a Top 10 report list. Tivoli Storage Productivity Center provides the most important metrics so that you can create reports against them. Figure 13-53 on page 353 shows the list of System Reports available for your fabric.

13.5.1 Switches reports: Overview


The first four reports shown in Figure 13-53 on page 353 provide asset information in a tabular view. You can see the same information in a graphical view using the Topology Viewer. Tip: We recommend that you use the Topology Viewer to get asset information.


Figure 13-53 Fabric list of reports

Tip: Rather than using a specific report to monitor Switch Port Errors, we recommend that you use the Constraint Violation report. By setting an Alert for the number of errors at the switch port level, the Constraint Violation report becomes a direct tool to monitor the errors in your fabric. For more information on Constraint Violation reports refer to SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364.

13.5.2 Switch Port Data Rate performance


For the TOP report, we recommend that you analyze the Switch Ports Data Rate report. Total Port Data Rate shows the average number of megabytes (2^20 bytes) per second that were transferred for send and receive operations on a particular port during the sample interval. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Fabric and select Top Switch Ports Data Rates Performance. Click the icon and select Total Port Data Rate, as shown in Figure 13-54 on page 354.


Figure 13-54 Fabric report - Port Data Rate selection

Click Ok to generate the chart shown next in Figure 13-55 on page 355. In this case, the port data rates do not reach a warning level, given that the FC port speed is 8 Gbits/sec.

Recommendations
Use this report to monitor whether some switch ports are overloaded. According to the FC port nominal speed (2 Gbit, 4 Gbit or more), as shown in Table 13-1, you have to establish the maximum workload a switch port can reach. We recommend not to exceed 50-70% of the nominal data rate.
Table 13-1   Switch port data rates

  FC port speed (Gbits/sec)   FC port speed (MBytes/sec)   Recommended port data rate threshold
  1 Gbits/sec                 100 MB/sec                   50 MB/sec
  2 Gbits/sec                 200 MB/sec                   100 MB/sec
  4 Gbits/sec                 400 MB/sec                   200 MB/sec
  8 Gbits/sec                 800 MB/sec                   400 MB/sec
  10 Gbits/sec                1000 MB/sec                  500 MB/sec
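Table 13-1 can be encoded directly as a lookup, for example in a small monitoring script. This is an illustrative sketch; the dictionary simply restates the table and the function name is invented.

```python
# Recommended thresholds from Table 13-1 (MB/sec), keyed by nominal
# port speed in Gbits/sec.
PORT_DATA_RATE_THRESHOLD_MB_S = {1: 50, 2: 100, 4: 200, 8: 400, 10: 500}

def switch_port_needs_attention(speed_gbps, observed_mb_s):
    """True when a switch port's observed data rate exceeds the
    recommended threshold for its nominal speed."""
    return observed_mb_s > PORT_DATA_RATE_THRESHOLD_MB_S[speed_gbps]
```

For example, an 8 Gbit switch port sustaining 450 MB/sec would be flagged, while 350 MB/sec would not.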


Figure 13-55 Fabric report - Port Data Rate report.

13.6 Case study: Server - performance problem with one server


Often a problem is reported as a server suffering poor performance, and usually the storage disk subsystem is the first suspect. In this case study we show how Tivoli Storage Productivity Center can help you debug this problem by verifying whether it is a storage problem or an issue outside the storage, providing the volume mapping for this server, and identifying which storage components are involved in the path. Tivoli Storage Productivity Center provides reports that show the storage assigned to the computers within your environment. To display one of the reports, expand Disk Manager → Reporting → Storage Subsystem → Computer Views → By Computer. Click the Selection button to select the particular resources to be included in the report (in our case, server tpcblade3-7), as shown in Figure 13-56 on page 356.


Figure 13-56 Computer case study - selection

Click Generate Report to get the output shown in Figure 13-57. Scrolling to the right of the table, more information is listed, such as the volume names, volume capacity, and allocated and unallocated volume space.

Figure 13-57 Computer case study - volume list


Data on the report can be exported by selecting File → Export Data to a comma-delimited file, a comma-delimited file with headers, a formatted report file, or an HTML file. You can start from this volume list to analyze performance data and workload I/O rates. Tivoli Storage Productivity Center provides a report that shows volume to backend volume assignments. To display the report, expand Disk Manager → Reporting → Storage Subsystem → Volume to Backend Volume Assignment → By Volume. Click Filter to limit the list of volumes to the ones belonging to server tpcblade3-7, as shown in Figure 13-58.

Figure 13-58 Computer case study - volume to backend filter

Click Generate Report to get the list shown in Figure 13-59 on page 358.


Figure 13-59 Computer case-study - volume to backend list

Scroll to the right to see the SVC managed disks and backend volumes on the DS8000, as shown in Figure 13-60. Note: The highlighted lines with N/A values are related to a backend storage subsystem that is not defined in our Tivoli Storage Productivity Center environment. To obtain the information on a backend storage subsystem, it has to be added to the Tivoli Storage Productivity Center environment, together with the corresponding probe job (see the first line in the report in Figure 13-60, where the backend storage subsystem is part of our Tivoli Storage Productivity Center environment and therefore the volume is correctly shown in all its details).

Figure 13-60 Backend Storage Subsystems

With this information and the list of volumes mapped to this computer, you can start to run a Performance Report to understand where the problem for this server could be.


13.7 Case study: Storwize V7000- disk performance problem


In this case study we look at a problem reported by a customer: one disk volume has shown varying and degraded performance during the last period. At times it gets good response time, and at other times the response time is unacceptable. Throughput is also changing. The customer specified the name of the affected volume: tpcblade3-7-ko2, a VDisk in a Storwize V7000 subsystem.

Recommendations
When looking at disk performance problems, you need to check the overall response time as well as the overall I/O rate. If they are both high, there might be a problem. If the overall response time is high but the I/O rate is trivial, the impact of the high overall response time might be inconsequential. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Then click Filter to produce a report for all the volumes belonging to Storwize V7000 subsystems, as shown in Figure 13-61.

Figure 13-61 SVC performance report by Volume

Click the volume you need to investigate, click the icon and select Total I/O Rate (overall). Then click Ok to produce the graph, as shown in Figure 13-62 on page 360.


Figure 13-62 Storwize V7000 performance report - volume selection

The chart in Figure 13-63 shows that the I/O rate had been around 900 operations per second, suddenly declined to around 400 operations per second, and then went back to 900 operations per second. In this case study we limited the days to the time frame in which the customer reported noticing the problem.

Figure 13-63 Storwize V7000 volume - Total I/O rate chart
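If you export the per-sample I/O rates from such a chart, a sudden decline like this one can also be spotted programmatically. The sketch below flags large sample-to-sample changes; the series values are invented to mirror the shape of the chart, and this is not a Tivoli Storage Productivity Center function.

```python
def find_rate_shifts(samples, threshold=0.3):
    """Flag consecutive samples whose I/O rate changed by more than
    `threshold` (as a fraction of the earlier sample).
    samples is a list of (time_label, io_rate) tuples."""
    shifts = []
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        if r0 > 0 and abs(r1 - r0) / r0 > threshold:
            shifts.append((t1, r0, r1))
    return shifts

series = [("17:00", 900), ("17:30", 910), ("18:00", 400),
          ("18:30", 420), ("19:00", 900)]
shifts = find_rate_shifts(series)  # flags the 18:00 drop and the 19:00 recovery
```

Flagging the shift times narrows the window you need to investigate in the detailed reports that follow.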


Select again the Volumes tab, click the volume you need to investigate, click the icon and scroll down to select Overall Response Time. Then click Ok to produce the chart, as shown in Figure 13-64.

Figure 13-64 Storwize V7000 performance report - volume selection

The chart in Figure 13-65 indicates the increase in response time from a few milliseconds to around 30 milliseconds. This information, combined with the high I/O rate, indicates there is a significant problem and further investigation is appropriate.

Figure 13-65 Storwize V7000 Volume - response time


The next step is to look at the performance of MDisks in the MDisk group. To identify to which MDisk the VDisk tpcblade3-7-ko2 belongs, go back to Volumes tab and click the drill up icon, as shown in Figure 13-66.

Figure 13-66 SVC Volume and MDisk selection

Figure 13-67 shows the Managed Disks where tpcblade3-7-ko2 extents reside:

Figure 13-67 Storwize V7000 Volume and MDisk selection - 2

Select all the MDisks. Click the icon and select Overall Backend Response Time. Click Ok as shown in Figure 13-68 on page 363.


Figure 13-68 Storwize V7000 metric selection

Limit the generated charts to the time range relevant to this scenario. You can see from the chart in Figure 13-69 that something happened around May 26 at 6:00 pm that probably caused the backend response time for all MDisks to dramatically increase.

Figure 13-69 Overall Backend Response Time

If you take a look at the chart for the Total Backend I/O Rate for these two MDisks during the same time period, you will see that their I/O rates all remained in a similar overlapping pattern, even after the introduction of the problem. This is as expected, because tpcblade3-7-ko2 is evenly striped across the two MDisks. The I/O rate for these MDisks is only as high as the slowest MDisk, as shown in Figure 13-70.

Figure 13-70 Backend I/O Rate

At this point, we have identified that the backend response time for all MDisks dramatically increased. The next step is to generate a report showing the volumes that have an overall I/O rate equal to or greater than 1000 ops/sec, and then generate a chart showing which of those volumes' I/O rates changed around 6:00 pm on May 26. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Click Display historic performance data using absolute time and limit the time period to 1 hour before and 1 hour after the event reported in Figure 13-69 on page 363. Click Filter to limit to the Storwize V7000 subsystem and Add a second filter to select Total I/O Rate (overall) greater than 1000 (that is, a high I/O rate). Click Ok, as shown in Figure 13-71 on page 365.
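The filtering that this report performs, a time window combined with an I/O rate cutoff, can be sketched as follows. The record layout, volume names, and values are illustrative only.

```python
from datetime import datetime

def filter_hot_volumes(records, start, end, min_io_rate=1000.0):
    """Keep performance records inside the time window whose overall
    I/O rate meets the cutoff, mirroring the report filter above."""
    return [r for r in records
            if start <= r["time"] <= end and r["total_io_rate"] >= min_io_rate]

records = [
    {"volume": "vol-a", "time": datetime(2011, 5, 26, 17, 30), "total_io_rate": 950.0},
    {"volume": "vol-b", "time": datetime(2011, 5, 26, 18, 0), "total_io_rate": 1200.0},
    {"volume": "vol-c", "time": datetime(2011, 5, 26, 22, 0), "total_io_rate": 1500.0},
]
hot = filter_hot_volumes(records,
                         start=datetime(2011, 5, 26, 17, 0),
                         end=datetime(2011, 5, 26, 19, 0))
```

Only records that satisfy both conditions survive; a busy volume outside the window, or a quiet volume inside it, is excluded.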


Figure 13-71 Display historic performance data

The report in Figure 13-72 shows all the performance records of the volumes filtered above. In the Volume column there are only three volumes that meet these criteria: tpcblade3-7-ko2, tpcblade3-7-ko3 and tpcblade3-ko4. There are multiple rows for each, because there is a row for each performance data record. Look for the volumes whose I/O rate changed around 6:00 pm on May 26. You can click the Time column to sort.

Figure 13-72 Volumes I/O rate changed

Now we have to compare the Total I/O Rate (overall) metric for the above volumes and the volume that is the subject of the case study, tpcblade3-7-ko2. To do so, remove the filtering condition on the Total I/O Rate defined in Figure 13-71 and generate the report again. Then select one row for each of these volumes and select Total I/O Rate (overall). Then click Ok to generate the chart, as shown in Figure 13-73 on page 366.


Figure 13-73 Total I/O rate selection for three volumes

For Limit days From, insert the time frame we are investigating. Results: Figure 13-74 on page 367 shows the root cause. Volume tpcblade3-7-ko2 (the blue line in the screen capture) started around 5:00 pm and has a Total I/O Rate of around 1000 IOPS. When the new workloads (generated by tpcblade3-7-ko3 and tpcblade3-ko4) started together, the Total I/O Rate for volume tpcblade3-7-ko2 fell from around 1000 IOPS to less than 500 IOPS, and then rose again to about 1000 IOPS when one of the two loads decreased. The hardware has physical limitations on the number of IOPS that it can handle, and this limit was reached at 6:00 pm.


Figure 13-74 Total I/O rate chart for three volumes

To confirm this behavior, you can generate a chart by selecting Response Time. The chart shown in Figure 13-75 confirms that as soon as the new workload started, the response time for tpcblade3-7-ko2 got worse.

Figure 13-75 Response time chart for three volumes

The easy solution is to split this workload, moving one VDisk to another Managed Disk Group.


13.8 Case study: Top volumes response time and I/O rate performance report
The default Top Volumes Response Performance report can be useful for identifying problem performance areas. However, a long response time is not necessarily indicative of a problem. It is possible to have volumes with long response times but very low (trivial) I/O rates; these situations might not pose a performance problem. In this section we tailor the Top Volumes Response Performance report to identify volumes with both long response times and high I/O rates. The report can be tailored for your environment; it is also possible to update your filters to exclude volumes or subsystems you no longer want in this report. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume, as shown in Figure 13-76, and keep only the desired metrics as Included Columns, moving all the others to Available Columns. You can save this report to be referenced in the future from IBM Tivoli Storage Productivity Center → My Reports → your user's Reports.

Figure 13-76 TOP Volumes tailored reports - metrics

You have to specify the filters to limit the report, as shown in Figure 13-77 on page 369. Click Filter and then Add the conditions. In our example we limit the report to subsystems SVC* and DS8* and to the volumes that have an I/O rate greater than 100 ops/sec and a response time greater than 5 msec.
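The combination of a wildcard subsystem filter with two metric cutoffs can be sketched in a few lines. The field names are invented for illustration; `fnmatch` supplies the same `*` wildcard semantics as the filter patterns above.

```python
from fnmatch import fnmatch

def matches_report_filter(rec, patterns=("SVC*", "DS8*"),
                          min_io_rate=100.0, min_resp_ms=5.0):
    """True when the record's subsystem name matches one of the
    wildcard patterns AND both metric cutoffs are exceeded."""
    return (any(fnmatch(rec["subsystem"], pat) for pat in patterns)
            and rec["total_io_rate"] > min_io_rate
            and rec["response_ms"] > min_resp_ms)

rec = {"subsystem": "SVC-CF8", "total_io_rate": 250.0, "response_ms": 18.0}
```

A record is kept only when all three conditions hold, which is exactly why this tailored report surfaces volumes that are both busy and slow.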


Figure 13-77 TOP volumes tailored reports - filters

Prior to generating the report, specify the date and time of the period for which you want to make the inquiry. Important: Specifying large intervals might require intensive processing and a long time to complete. Then click Generate Report, as shown in Figure 13-78.

Figure 13-78 TOP Volume tailored report - limiting days

Figure 13-79 on page 370 shows the resulting Volume list. Sorting by response time or by I/O Rate columns (by clicking the column header), you can easily identify which entries have both interesting total I/O Rates and Overall Response Times.


Recommendations
We suggest that in a production environment you initially specify a Total I/O Rate (overall) somewhere between 1 and 100 ops/sec and an Overall Response Time greater than or equal to 15 msec, and adjust those numbers to suit your needs as you gain more experience.

Figure 13-79 TOP Volume tailored report - volumes list

13.9 Case study: SVC and Storwize V7000 performance constraint alerts
Along with reporting on SVC and Storwize V7000 performance, Tivoli Storage Productivity Center can generate alerts when performance has not met, or has exceeded, a defined threshold. Like most Tivoli Storage Productivity Center tasks, the alerting can report to:
- SNMP:


Enables you to send an SNMP trap to an upstream systems management application. The SNMP trap can then be correlated with other events occurring within the environment to help determine the root cause of the trap, which in this case was generated by the SVC. For example, if the SVC or Storwize V7000 reported to Tivoli Storage Productivity Center that a Fibre Channel port went offline, the real cause might be that a switch failed. This port failed trap, together with the switch offline trap, could be analyzed by a systems management tool and diagnosed as a switch problem, not an SVC (or Storwize V7000) problem, so that the switch technicians are called.
- Tivoli Omnibus Event: Select to send a Tivoli Omnibus event.
- Login Notification: Select to send the alert to a Tivoli Storage Productivity Center user. The user receives the alert upon logging in to Tivoli Storage Productivity Center. In the Login ID field, type the user ID.
- UNIX or Windows NT system event logger: Select to write the alert to the system event log.
- Script: The script option enables you to run a predefined set of commands that can help address this event, for example, simply opening a trouble ticket in your helpdesk ticket system.
- Email: Tivoli Storage Productivity Center sends an e-mail to each person listed.
Tip: Remember that for Tivoli Storage Productivity Center to be able to send e-mail, an e-mail relay must be identified in Administrative Services → Configuration → Alert Disposition, under the Email settings.
These are some useful alert events that you should set:
- CPU utilization threshold: The CPU utilization alert tells you when your SVC or Storwize V7000 nodes become too busy. If this alert is generated too often, it might be time to upgrade your cluster with additional resources. Development recommends setting this threshold to 75% for a warning and 90% for critical; these are the defaults that come with Tivoli Storage Productivity Center 4.2.1. To enable this function, simply create an alert selecting CPU Utilization. Then define the alert actions to be performed. Next, on the Storage Subsystem tab, select the SVC or Storwize V7000 cluster for which this alert is to be set.
- Overall port response time threshold: The port response time alert can let you know when the SAN fabric is becoming a bottleneck. If the response times are consistently bad, perform additional analysis of your SAN fabric.
- Overall backend response time threshold: An increase in backend response time might indicate that you are overloading your backend storage. Because backend response times can vary depending on what I/O workloads are in place, capture 1 to 4 weeks of data to baseline your environment before setting this value. Then set the response time values.
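The baselining step above can be sketched in code. The following is an illustrative approach only, not a Tivoli Storage Productivity Center feature; the `baseline_threshold` helper, the sample values, and the 1.5x headroom factor are our own assumptions:

```python
# Illustrative only: derive a backend response time alert threshold from
# baseline samples captured over 1 to 4 weeks. Names and values are hypothetical.
def baseline_threshold(samples_ms, headroom=1.5):
    """Set the alert threshold at `headroom` times the 95th-percentile baseline."""
    ordered = sorted(samples_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 * headroom

# Baseline backend response times (ms) collected for one subsystem
print(baseline_threshold([4, 5, 5, 6, 6, 7, 8, 9, 10, 30]))  # 15.0
```

Using a high percentile rather than the average keeps a handful of workload spikes in the baseline from inflating the threshold, while the headroom factor avoids alerts on normal day-to-day variation.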

Chapter 13. Monitoring

371


Because you can select the storage subsystem for this alert, you are able to set different alerts based upon the baselines that you have captured. Our recommendation is to start with your mission-critical Tier 1 storage subsystems. To create an alert, as shown in Figure 13-80, expand Disk Manager → Alerting → Storage Subsystem Alerts and right-click to create a storage subsystem alert. On the right, a pull-down menu opens where you can choose which alert you would like to set.

Figure 13-80 SVC constraints alert definition

Tip: The best place to verify which thresholds are currently enabled, and at what values, is at the beginning of a performance collection job log. Expand Tivoli Storage Productivity Center → Job Management and select, in the Schedule table, the latest performance collection job that is running or has run for your subsystem. In the Job for Selected Schedule part of the panel (lower part), expand the corresponding job and select the instance, as shown in Figure 13-81 on page 373.


Figure 13-81 Job management panel - SVC performance job log selection

By clicking the View Log File(s) button, you access the corresponding log file, where you can see the thresholds defined, as shown in Figure 13-82 on page 373. Tip: To go to the beginning of the log file, click the Top button.

Figure 13-82 SVC constraint threshold enabled


Expand IBM Tivoli Storage Productivity Center → Alerting → Alert Log → Storage Subsystem to list all the alerts that occurred. Look for your SVC subsystem, as shown in Figure 13-83 on page 374.

Figure 13-83 SVC constraints alerts history

By clicking the icon next to the alert that you want to examine, you get detailed information, as shown in Figure 13-84.

Figure 13-84 SVC constraints - alert details

For more information about defining alerts, refer to SAN Storage Performance Management Using Tivoli Storage Productivity Center, SG24-7364.

13.10 Case study: Fabric - monitor and diagnose performance


In this case study, we try to find a fabric port bottleneck that exceeds 50% port utilization. We are using 50% for lab purposes only. Tip: In a production environment, it would be more realistic to monitor for 80% port utilization. The ports on the switches in this SAN are 8 Gbps; therefore, 50% utilization is approximately 400 MBps. You can create a performance collection job specifying filters, as shown in Figure 13-85. Expand Fabric Manager → Reporting → Switch Performance → By Port. Click Filter in the upper right corner and specify the conditions shown in Figure 13-85 on page 375. Important: At least one condition option has to be turned on. This results in the report identifying switch ports that satisfy either filter parameter.
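The arithmetic behind the 50% figure above can be sketched as follows. This uses the common rule of thumb that, with 8b/10b encoding, each Gbps of Fibre Channel line rate carries roughly 100 MBps of data; the function name is our own:

```python
def port_threshold_mbps(link_gbps, utilization):
    # With 8b/10b encoding, 1 Gbps of line rate moves roughly 100 MBps of data,
    # so an 8 Gbps port tops out near 800 MBps.
    return link_gbps * 100 * utilization

print(port_threshold_mbps(8, 0.50))  # 400.0 MBps, the filter value used in the lab
print(port_threshold_mbps(8, 0.80))  # roughly 640 MBps, a more realistic production alert
```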


Figure 13-85 Filter for fabric performance reports

After generating this report, you use the Topology Viewer to identify which device is being impacted and to identify a possible solution. Figure 13-86 shows the result that we obtained in our lab.

Figure 13-86 Ports exceeding filters set for switch performance report

Click the icon and, holding the Ctrl key, select Port Send Data Rate, Port Receive Data Rate, and Total Port Data Rate. Click OK to generate the chart shown in Figure 13-87 on page 376. Tip: This chart gives you an indication of how persistent the high utilization of this port is. This is an important consideration in order to establish the importance and the impact of this bottleneck.

Important: To get all the values in the selected interval, you have to remove the filters defined in Figure 13-85. The chart shows a consistent throughput higher than 300 MBps in the selected time period. You can change the dates by extending the Limit days value.


Figure 13-87 Switch ports Data rate

To identify which device is connected to port 7 on this switch, expand IBM Tivoli Storage Productivity Center → Topology → Switches. Right-click, select Expand all Groups, and look for your switch, as shown in Figure 13-88 on page 377.


Figure 13-88 Topology Viewer for switches

Tip: To navigate in the Topology Viewer, press and hold the Alt key and press and hold the left mouse button to anchor your cursor. With these keys held down, you can use the mouse to drag the screen to quickly move to the information that you need. Find and click port 7. The line shows that it is connected to computer tpcblade3-7, as shown in Figure 13-89 on page 378. Note that in the tabular view at the bottom, you can see port details. If you scroll right, you can check the port speed, too.


Figure 13-89 Switch port and computer

Double-click this computer to highlight it. Click Datapath Explorer (see the DataPath Explorer shortcut highlighted in the minimap in Figure 13-89) to get a view of the paths between servers and storage subsystems, or between storage subsystems (for example, SVC to backend storage, or server to storage subsystem). The view consists of three panels (host information, fabric information, and subsystem information) that show the path through a fabric, or set of fabrics, for the endpoint devices, as shown in Figure 13-90 on page 379. Tip: A possible scenario utilizing Data Path Explorer is an application on a host that is running slowly. The system administrator wants to determine the health status of all associated I/O path components for this application: Are all components along that path healthy? Are there any component-level performance problems that might be causing the slow application response? Looking at the data paths for computer tpcblade3-7, we see that it has a single-port HBA connection to the SAN. A possible solution to improve the SAN performance for computer tpcblade3-7 is to upgrade it to a dual-port HBA.


Figure 13-90 Data Path Explorer


13.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration
After Tivoli Storage Productivity Center has probed the SAN environment, it takes the information from all the SAN components (switches, storage controllers, and hosts) and automatically builds a graphical display of the SAN environment. This graphical display is available via the Topology Viewer option in the Tivoli Storage Productivity Center Navigation Tree. The information on the Topology Viewer panel is current as of the successful completion of the last probe. By default, Tivoli Storage Productivity Center probes the environment daily; however, you can execute an unplanned or immediate probe at any time. Tip: If you are analyzing the environment for problem determination, we recommend that you execute an ad hoc probe to ensure that you have the latest information about the SAN environment. Make sure that the probe completes successfully.

13.11.1 Ensuring that all SVC ports are online


Information in the Topology Viewer can also confirm the health and status of the SVC and switch ports. When you look at the Topology Viewer, Tivoli Storage Productivity Center shows a Fibre Channel port with a box next to the WWPN. If this box has a black line in it, the port is connected to another device. Table 13-2 shows an example of the ports with their connected status.
Table 13-2 Tivoli Storage Productivity Center port connection status

Port view            Status
(box with line)      This is a port that is connected.
(empty box)          This is a port that is not connected.

Figure 13-91 on page 381 shows the SVC ports connected and the switch ports.


Figure 13-91 SVC connection

Important: Figure 13-91 shows an incorrect configuration for the SVC connections; it was implemented for lab purposes only. In real environments, it is important that the ports of each SVC (or Storwize V7000) node are connected to two separate fabrics. If any SVC (or Storwize V7000) node port is not connected, each node in the cluster displays an error on its LCD display. Tivoli Storage Productivity Center also shows the health of the cluster as a warning in the Topology Viewer, as shown in Figure 13-91. It is also important that:
- You have at least one port from each node in each fabric.
- You have an equal number of ports in each fabric from each node; that is, do not have three ports in Fabric 1 and only one port in Fabric 2 for an SVC (or Storwize V7000) node.


Note: In our example, the connected SVC ports are both online. When an SVC port is not healthy, a black line is drawn between the switch and the SVC node. Because Tivoli Storage Productivity Center knew from a previous probe where the unhealthy ports were connected (and, thus, they were previously shown with a green line), a later probe discovered that these ports were no longer connected, which resulted in the green line becoming a black line. If these ports had never been connected to the switch, no lines would show for them.


13.11.2 Verifying SVC port zones


When Tivoli Storage Productivity Center probes the SAN environment to obtain information about SAN connectivity, it also collects information about the SAN zoning that is currently active. The SAN zoning information is also available in the Topology Viewer via the Zone tab. By opening the Zone tab and clicking both the switch and the zone configuration for the SVC, we can confirm that all of the SVC node ports are correctly included in the zone configuration. Attention: By default, the Zone tab is not enabled. To enable the Zone tab, you must turn it on in the Global Settings. To get to the Global Settings list, open the Topology Viewer screen, and then right-click in any white space. From the pop-up window, select Global Settings from the list. Within the Global Settings box, place a check mark in the Show Zone Tab box. This enables you to see SAN zoning details for your switch fabrics. Figure 13-92 shows that we have defined an SVC node zone called SVC_CL1_NODE in our FABRIC-2GBS and that we have correctly included all of the SVC node ports.

Figure 13-92 Topology Viewer - SVC zoning

13.11.3 Verifying paths to storage


The Data Path Explorer function in the Topology Viewer can be used to see the path between two objects; it shows the objects and the switch fabric in one view. Using Data Path Explorer, we can see, for example, that mdisk1 in Storwize V7000-2076-ford1_tbird-IBM is available through two Storwize V7000 ports, and trace that connectivity to its logical unit number (LUN) rad (ID:009f). This is shown in Figure 13-93 on page 385. What Figure 13-93 does not show is that you can hover over the MDisk, LUN, and switch ports to get both health and performance information about these components. This enables you to verify the status of each component and see how well it is performing.


Figure 13-93 Topology Viewer - Data Path Explorer


13.11.4 Verifying host paths to the Storwize V7000


By using the computer display in Tivoli Storage Productivity Center, you can see all the fabric and storage information for the computer that you select. Figure 13-94 shows the host tpcblade3-11, which has two host bus adapters (HBAs), but only one is active and connected to the SAN. This host has been configured to access Storwize V7000 storage, as you can see in the top-right part of the panel. Our Topology Viewer shows that tpcblade3-11 is physically connected to a single fabric. By using the Zone tab, we can see the single zone configuration applied to tpcblade3-11 for the 100000051E90199D zone. This means that tpcblade3-11 does not have redundant paths; thus, if switch mini went offline, tpcblade3-11 would lose access to its SAN storage. By clicking the zone configuration, we can see which port is included in a zone configuration and thus which switch has the zone configuration. A port that has no zone configuration is not surrounded by a gray box.

Figure 13-94 tpcblade3-11 has only one active HBA

The Data Path Viewer in Tivoli Storage Productivity Center can also be used to check and confirm path connectivity between a disk that an operating system sees and the VDisk that the Storwize V7000 provides. Figure 13-95 on page 387 shows the path information relating to host tpcblade3-11 and its VDisks.


Figure 13-95 does not show that you can hover over each component to also get health and performance information, which might be useful when you perform problem determination and analysis.

Figure 13-95 Viewing host paths to the Storwize V7000


13.12 Using SVC or Storwize V7000 GUI for real-time monitoring


The SVC or Storwize V7000 GUI enables you to monitor the CPU usage and the volume, interface, and MDisk bandwidth of your system and nodes. You can use system statistics to monitor the bandwidth of all the volumes, interfaces, and MDisks that are being used on your system. You can also monitor the overall CPU utilization for the system. These statistics summarize the overall performance health of the system and can be used to monitor trends in bandwidth and CPU utilization. You can monitor changes to stable values or differences between related statistics, such as the latency between volumes and MDisks. These differences can then be further evaluated by performance diagnostic tools. To launch the performance monitor, start your GUI session through a web browser pointing to the following URL: https://<system ip address>/ Then select Home → Performance, as shown in Figure 13-96.

Figure 13-96 Launching the performance monitor panel

The performance monitor panel shown in Figure 13-97 on page 389 presents the graphs in four quadrants:
- The top left quadrant shows CPU utilization as a percentage.
- The top right quadrant shows volume throughput in MBps, as well as current volume latency and current IOPS.
- The bottom left quadrant shows interface throughput (FC, SAS, and iSCSI).
- The bottom right quadrant shows MDisk throughput in MBps, as well as current MDisk latency and current IOPS.


Figure 13-97 performance monitor panel

Each graph represents five minutes of collected statistics and provides a means of assessing the overall performance of your system. For example, CPU utilization shows the current percentage of CPU usage, as well as specific data points on the graph showing peaks in utilization. With this real-time performance monitor, you can quickly view the bandwidth of volumes, interfaces, and MDisks. Each of these graphs displays the current bandwidth in megabytes per second, as well as a view of bandwidth over time. Each data point can be accessed to determine its individual bandwidth utilization and to evaluate whether a specific data point might represent performance impacts. For example, you can monitor the interfaces, such as Fibre Channel or SAS interfaces, to determine whether the host data-transfer rate differs from the expected rate. The volumes and MDisks graphs also show the IOPS and latency values. In the pop-up menu, you can switch from system statistics to statistics by node, selecting a specific node to get its real-time performance graphs. Figure 13-98 shows the CPU usage and the volume, interface, and MDisk bandwidth for a specific node.

Figure 13-98 Node level performance monitor panel.


With this panel, you can easily find unbalanced usage of your system nodes. You can also run the real-time performance monitoring while you are performing other GUI operations by selecting the Run in Background option.


13.13 Gathering manually the SVC statistics


SVC collects three types of statistics: MDisk, VDisk, and node statistics. The statistics are collected on a per-node basis; this means that the statistics for a VDisk reflect its usage through that particular node. With SVC version 6 code, you do not need to start the statistics collection, because it is already enabled by default. The lscluster <clustername> command shows you the statistics_status. The default statistics frequency is 15 minutes, and you can adjust it with the startstats -interval <minutes> command. For each collection interval, the SVC creates three statistics files: one for managed disks (MDisks), named Nm_stat; one for virtual disks (VDisks), named Nv_stats; and one for nodes, named Nn_stats. The files are written to the /dumps/iostats directory on each node. A maximum of 16 files of each type can be created for the node. When the 17th file is created, the oldest file for the node is overwritten. In order to also retrieve the statistics files from the non-configuration nodes, they first need to be copied onto the configuration node with the cpdumps -prefix /dumps/iostats <non_config node id> command. To retrieve those statistics files from the SVC, you can use secure copy: scp -i <private key file> admin@clustername:/dumps/iostats/* <local destination dir> If you do not use TPC, you need to retrieve and parse these XML files in order to analyze the long-term statistics. The counters in the files are posted as absolute values; therefore, the application that processes the performance statistics must compare two samples, from two separate files, to calculate differences. An easy way to gather and store the performance statistics data and generate useful graphs is to use svcmon. It collects SVC / Storwize V7000 performance data every 1 to 60 minutes, then creates spreadsheet files in CSV format and graph files in GIF format. Taking advantage of a database, svcmon manages SVC / Storwize V7000 performance statistics from minutes to years.
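Because the counters are absolute, a post-processing tool must subtract consecutive samples. The following sketch assumes that you have already parsed two consecutive Nv_stats XML samples into dictionaries of counters for one VDisk; the counter key names ("ro" for read operations, "wo" for write operations) and the sample values are illustrative only, because the XML element and attribute names vary by code level:

```python
# Hypothetical counters parsed from two consecutive Nv_stats samples for one vdisk;
# the key names ("ro" = read ops, "wo" = write ops) are illustrative only.
sample_t0 = {"ro": 1000, "wo": 400}
sample_t1 = {"ro": 1600, "wo": 700}

def counter_delta(prev, curr):
    """Difference between two absolute-counter samples."""
    return {k: curr[k] - prev[k] for k in curr if k in prev}

delta = counter_delta(sample_t0, sample_t1)           # {'ro': 600, 'wo': 300}
interval_s = 15 * 60                                  # default 15-minute interval
iops = {k: v / interval_s for k, v in delta.items()}  # average ops per second
```

Dividing each delta by the collection interval converts the raw counter differences into average rates (for example, IOPS), which is exactly the comparison of two samples that the text above describes.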
If you are interested in svcmon, visit the following blog: https://www.ibm.com/developerworks/mydeveloperworks/blogs/svcmon svcmon works in either online or standalone mode. Here we briefly describe how to use it in standalone mode. The package is well documented to run on Windows or Linux workstations; for other platforms, you need to adjust the svcmon scripts. For a Microsoft Windows workstation, you need to install ActivePerl, PostgreSQL, and the Command Line Transformation Utility (msxsl.exe). PuTTY is required if you want to run in online mode, but even in standalone mode you might need it to secure copy the /dumps/iostats/ files and the /tmp/svc.config.backup.xml file, and to access the SVC through the command line. Follow the installation guide available on the svcmon IBM developerWorks blog page. To run svcmon in standalone mode, you need to convert the XML config backup file into HTML format using the svcconfig.pl script. Then, you need to copy the performance files to the iostats directory and create and populate the svcmon database with svcdb.pl --create and svcperf.pl --offline, respectively. The last step is the report generation, which is executed with the svcreport.pl script. The reporting function generates multiple GIF files per object (MDisk, VDisk, and node) in conjunction with aggregated CSV files. We found these CSV files very useful, because they allow us to generate customized charts based upon spreadsheet functions, such as Pivot Tables or DataPilot, and search (xLOOKUP) operations. The backup config file converted to HTML is a good source from which to create an additional spreadsheet tab in order to relate, for instance, VDisks with their I/O group and preferred node. Figure 13-99 shows a spreadsheet chart generated from the <system_name>__vdisk.csv file, filtered for I/O group 2. The VDisks for this I/O group were selected using a secondary spreadsheet tab populated with the vdisk section of the config backup HTML file.

Figure 13-99 Total ops per vdisk for I/O group 2. Vdisk37 is by far the busiest volume

By default, the svcreport.pl script generates GIF charts and CSV files with one hour of data. While the CSV files aggregate a large amount of data, the GIF charts are presented by VDisk, MDisk, and node, as described in Table 13-3. To generate a 24-hour chart, you need to specify the --for 1440 option. The --for option specifies the time range, in minutes, for which you want to generate the SVC/Storwize V7000 performance report files (CSV and GIF); the default value is 60 minutes.
Table 13-3 Spreadsheets and gif chart types produced by svcreport

spreadsheets (csv):  cache_node, cache_vdisk, cpu, drive, mdisk, node, vdisk
charts per vdisk:    cache.hits, cache.stage, cache.throughput, cache.usage, vdisk.response.tx, vdisk.response.wr, vdisk.throughput, vdisk.transaction
charts per mdisk:    mdisk.response.worst.resp, mdisk.response, mdisk.throughput, mdisk.transaction
charts per node:     cache.usage.node, cpu.usage.node

Figure 13-100 is an example of a chart automatically generated by the svcperf.pl script for vdisk37. We chose to present this chart for vdisk37 because Figure 13-99 shows that this VDisk is the one that reaches the highest IOPS values.


Figure 13-100 Number of read and write ops for vdisk37

svcmon is not intended to replace TPC; however, it helps a lot when TPC is not available, allowing an easy interpretation of the SVC performance XML data. This set of Perl scripts is designed and programmed by Yoshimichi Kosuge personally. It is not an IBM product, and it is provided without any warranty; hence, you can use svcmon, but at your own risk.


Figure 13-101 Read and Write throughput for vdisk37 in bytes per second


7521Maintaining.fm

14

Chapter 14.

Maintenance
Among the many benefits that SVC provides is greatly simplifying the storage management tasks that system administrators need to perform. However, as the IT environment grows and gets renewed, so must the storage infrastructure. In this chapter, we discuss some best practices in the day-to-day activities of storage administration using SVC that can help you keep your storage infrastructure at the levels of availability, reliability, and resiliency demanded by today's applications, and yet keep up with their storage growth needs. You will find in this chapter tips and recommendations that might have already been made in this and other Redbooks, in some cases with more details; do not hesitate to refer back to them. The idea here is to put the most important topics that you should consider in SVC administration in one place, so that you can use this chapter as a checklist. You will also find practical examples of the procedures described here in Chapter 16, SVC scenarios on page 453. Note: The practices described here have proven to be effective in many SVC installations worldwide for organizations in several different areas, with one thing in common: a need to easily, effectively, and reliably manage their SAN disk storage environment. Nevertheless, whenever you have a choice between two possible implementations or configurations, if you look deep enough, one will always have both advantages and disadvantages over the other. We expect that you do not take these practices as absolute truth, but rather use them as a guide. The choice of which approach to use is ultimately yours.

Copyright IBM Corp. 2011. All rights reserved.


14.1 Automating SVC and SAN environment documentation


Before you start complaining (yes, we have heard it all too many times before: documentation is boring, difficult to produce at first, and nobody updates it later, so it is outdated and useless when the time comes), let us tell you that:
- There are today a number of ways and tools to automate the creation and update of this documentation, so the IT infrastructure itself can take care of keeping it updated; we discuss some of them later in this chapter.
- Planning is the key element of sustained, organized growth, and good documentation of your storage environment is the blueprint that allows you to plan your approach to future storage growth, both in the short and the long term.
- Good documentation should be handy and easy to consult for whatever need, whether to decide how to replace your core SAN directors with newer ones or how to fix a single server's disk path problems. Therefore, good documentation is typically the opposite of long documentation; in most cases, we are talking about just a few spreadsheets and one diagram.
Note: Do NOT store your SVC and SAN environment documentation only in the SAN itself. If your company or organization has a disaster recovery plan, include this storage documentation in it and follow its guidelines on how to update and store this data. If not, try to keep at least one updated copy off-site, provided you have the proper security authorization. In theory, this SVC and SAN environment documentation should be sufficient for any system administrator, with average skills in the products included, to take a copy of all of your configuration information and use it to create a functionally equivalent copy of the environment using nothing but similar hardware without any configuration whatsoever, media off the shelf, and configuration backup files. This is exactly what you might have to do should you ever face a disaster recovery scenario, and it is also why it is so important to run periodic disaster recovery tests.
Best practice is to create the first version of this documentation as you install your solution. In fact, IBM probably asked you to fill in some forms in order to plan the installation of your SVC, and they might be useful to document how your SVC was first configured. In the following sections, we give suggestions about what we think is the minimum documentation needed for an SVC solution. Do not view it as an exhaustive list; you might have additional business requirements that require other data to be tracked.

14.1.1 Naming Conventions


Whether you will start your SAN and SVC environment documentation from scratch or update the one that you have in place today, we recommend that you first take a moment to evaluate whether you have a good naming convention in place. A good naming convention allows you to quickly and uniquely identify the components of your SVC and SAN environment, so that system administrators can tell whether a given name belongs to a volume, a storage pool, a host HBA, and so on, just by looking at it. Because error messages typically point to the device that generated the error, a good naming convention quickly tells you where to start your investigation should an error occur.


Typical SAN and SVC component names have a limit on the number and type of characters that you can use; SVC names, for instance, are limited to 15 characters. This is typically what makes creating a naming convention a bit tricky. Most (if not all) names in SAN storage and SVC can be modified online, so you do not need to worry about planning outages in order to implement your new naming convention. Server names are the exception, and we discuss them later in this chapter. Keep in mind that the following examples are just suggestions that have proven to be effective in most cases, but they might not be fully adequate for your particular environment or needs. The choice of naming convention to use is yours, but once you choose it, you should implement it in the whole environment.

Storage Controllers
SVC names the storage controllers simply controllerX, with X being a sequential decimal number. If you have multiple controllers attached to your SVC, you should change that name so that it includes, for instance, the vendor name, the model, or simply its serial number. That way, if you receive an error message pointing to controllerX, you do not need to log in to the SVC to know which storage controller to check.

MDisks and Storage Pools


When SVC detects new MDisks, it names them by default mdiskXX, with XX a sequential number. We suggest that you change this name into something more meaningful that includes, for instance:
- a reference to the storage controller it belongs to (such as its serial number or last digits),
- the extpool, array, or RAID group it belongs to in the storage controller,
- the LUN number or name it has in the storage controller.
Examples of MDisk names with this convention:
- 23K45_A7V10 - Serial 23K45, Array 7, Volume 10
- 75VXYZ1_02_0206 - Serial 75VXYZ1, extpool 02, LUN 0206
Storage pools have a few different possibilities. One is to include the storage controller, the type of backend disks, the RAID type, and sequential digits. Another, if you have dedicated pools for specific applications or servers, is to use that information instead. Examples:
- P05XYZ1_3GR5 - Pool 05 from serial 75VXYZ1, LUNs with 300 GB FC DDMs and RAID 5
- P16XYZ1_EX01 - Pool 16 from serial 75VXYZ1, pool 01 dedicated to Exchange mail servers

Volumes (formerly VDisks)


Volume names should include:
- the host, or cluster, that the volume is mapped to,
- a single letter indicating its usage by the host, such as:
  B for a boot disk, or R for a rootvg disk (if the server boots from SAN)
  D for a regular data disk
  Q for a cluster quorum disk (do not confuse these with SVC quorum disks)
  L for a database logs disk
  T for a database table disk
- a couple of sequential digits, for uniqueness.
Example: ERPNY01_T03 - volume mapped to server ERPNY01, database table disk 03
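A convention like this lends itself to automated checking. The following sketch is our own illustration of how the example volume convention above might be validated with a regular expression; the exact pattern (a host part of up to 11 characters, one usage letter, two digits) is an assumption that keeps names within the SVC 15-character limit:

```python
import re

# Hypothetical validator for the volume convention above: <host>_<usage letter><2 digits>
VOLUME_NAME = re.compile(r"^(?P<host>[A-Z0-9]{1,11})_(?P<usage>[BRDQLT])(?P<seq>\d{2})$")

def parse_volume_name(name):
    """Return the convention fields of a volume name, or None if it does not comply."""
    m = VOLUME_NAME.match(name)
    return m.groupdict() if m else None

print(parse_volume_name("ERPNY01_T03"))  # {'host': 'ERPNY01', 'usage': 'T', 'seq': '03'}
print(parse_volume_name("randomdisk"))   # None: does not follow the convention
```

Running such a check periodically against the output of the volume listing commands makes it easy to spot names that drifted away from the convention.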

Chapter 14. Maintenance

397

7521Maintaining.fm

Draft Document for Review February 16, 2012 3:49 pm

Hosts
However cool it was in the past to name servers after cartoon or movie characters, today we face large networks, the Internet, and cloud computing. A good server naming convention allows you, even in a very large network, to quickly spot a server and tell:
- Where it is (so you know how to access it)
- What kind of server it is (so you can tell the vendor and the support group in charge)
- What it does (to engage the proper application support and notify its owner)
- Its importance (so you know the severity of a problem, should one occur)
Changing a server's name might have implications for application configurations and require a server reboot, so you might want to prepare a detailed plan if you decide to rename several servers in your network. An example server naming convention is LLAATRFFNN, where:
- LL - Location: might designate the city, data center, building floor or room, and so on
- AA - Major application: examples are billing, ERP, and Data Warehouse
- T - Type: Unix (which one), Windows, VMware
- R - Role: production, test, QA, development
- FF - Function: DB server, application server, web server, file server
- NN - Numeric sequence
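A fixed-width convention such as LLAATRFFNN also lends itself to mechanical decoding, which is handy for inventory scripts. The following Python sketch is a hypothetical illustration; the field widths match the convention above, but the codes themselves are site-specific:

```python
def parse_server_name(name):
    """Split a 10-character LLAATRFFNN server name into its fields."""
    if len(name) != 10:
        raise ValueError("expected a 10-character name, got %r" % name)
    return {
        "location": name[0:2],     # LL - city, data center, floor, room
        "application": name[2:4],  # AA - billing, ERP, Data Warehouse...
        "type": name[4],           # T  - Unix flavor, Windows, VMware
        "role": name[5],           # R  - production, test, QA, development
        "function": name[6:8],     # FF - DB, application, web, file server
        "number": name[8:10],      # NN - sequence number
    }
```

Applied to a name such as NYBIXTDB02, it yields location NY, application BI, type X, role T, function DB, and number 02.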

SAN Aliases and Zones


SAN aliases typically only need to reflect the device and port associated with them. Including information about where one particular device port is physically attached in the SAN might lead to inconsistencies if you perform a change or maintenance and forget to update the alias. Create one alias for each device port WWPN in your SAN and use these aliases in your zoning configuration. Examples:
NYBIXTDB02_FC2: Interface fcs2 of AIX server NYBIXTDB02 (WWPN)
SVC02_N2P4: SVC cluster SVC02, port 4 of node 2 (WWPN format 5005076801PXXXXX)
Be mindful of the SVC port aliases: the 11th digit of the port WWPN (P) reflects the SVC node FC port, but not directly, as follows:
Table 14-1 WWPNs for the SVC node ports

  Value of P   SVC Physical Port
  4            1
  3            2
  1            3
  2            4
  0            none - SVC Node WWNN
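Because the mapping in Table 14-1 is fixed, a port WWPN can be derived from the node WWNN by substituting the 11th digit. The Python sketch below illustrates the idea with a hypothetical WWNN; in practice, take the WWPNs from the svcinfo lsnode output rather than computing them:

```python
# Table 14-1 mapping: physical FC port -> value of the 11th digit (P)
# in the WWPN 5005076801PXXXXX; P = 0 is the node WWNN itself.
PORT_TO_P = {1: "4", 2: "3", 3: "1", 4: "2"}

def svc_port_wwpn(node_wwnn, port):
    """Derive a port WWPN from a 16-digit node WWNN (P digit = 0)
    by substituting the 11th digit according to Table 14-1."""
    if port not in PORT_TO_P:
        raise ValueError("SVC nodes have FC ports 1 to 4")
    return node_wwnn[:10] + PORT_TO_P[port] + node_wwnn[11:]
```

For a hypothetical WWNN of 500507680100001A, port 1 maps to 500507680140001A and port 3 maps to 500507680110001A.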

SVC02_IO2_A: SVC cluster SVC02, port group A for iogrp 2 (aliases SVC02_N3P1, SVC02_N3P3, SVC02_N4P1, and SVC02_N4P4)
D8KXYZ1_I0301: DS8000 serial number 75VXYZ1, port I0301 (WWPN)
TL01_TD06: Tape library 01, tape drive 06 (WWPN)
If by any chance your SAN does not support aliases, as in heterogeneous fabrics with switches in some interop modes, use WWPNs in your zones all across - just do not forget to update every zone that uses a given WWPN if you ever change it. A SAN zone name should reflect the devices in the SAN that it includes, normally in a one-to-one relationship, for example:


servername_svcclustername (from a server to the SVC)
svcclustername_storagename (from the SVC cluster to its back-end storage)
svccluster1_svccluster2 (for remote copy services)

14.1.2 SAN Fabrics documentation


The most basic piece of SAN documentation is a SAN diagram. It is likely one of the first things you are asked to produce if you ever request support from your SAN switch vendor. Additionally, a good spreadsheet with port and zoning information eases the task of searching for detailed information that, if included in the diagram itself, would make the diagram difficult to use.

Automated free tool - Brocade SAN Health


Brocade has a free, automated tool called SAN Health that can help you keep this documentation current. It consists of a data collection tool that logs in to the SAN switches you indicate and collects data using standard SAN switch commands, then creates a compressed file with the collected data, which is sent to a Brocade automated machine for processing, either by secure web or by e-mail. After a while (typically a few hours), the user receives an e-mail with instructions about how to download the report, which includes a Visio diagram of your SAN and a well-organized Microsoft Excel spreadsheet with all your SAN information. For additional information and the download, refer to the following URL:
http://www.brocade.com/sanhealth
The first time, you might need to experiment with the options in the SAN Health data collection tool so that the resulting diagram comes back well organized. You can see an example of a poorly formatted diagram in Figure 14-1.

Figure 14-1 A poorly formatted SAN diagram

Figure 14-2 on page 400 depicts one of the SAN Health Options screens, where you can choose the format of your SAN diagram that best suits your needs. Depending on the topology and size of your SAN fabrics, you might want to experiment with the options in the Diagram Format or Report Format tabs.


SAN Health supports switches from manufacturers other than Brocade, such as McDATA and Cisco. Both the data collection tool download and the processing of files are free, and you can download Microsoft Visio and Excel viewers at no cost from the Microsoft website. An additional free tool, called SAN Health Professional, is also available for download; it enables you to audit the reports in detail, utilizing advanced search functionality and inventory tracking. It is also possible to configure the SAN Health data collection tool as a Windows scheduled task. Note: Whichever method you choose, we recommend that you generate a fresh report at least once a month and keep previous versions so that you can track the evolution of your SAN.

Figure 14-2 Brocade SAN Health Options screen

TPC Reporting
If you have TPC running in your environment, you can use it to generate reports about your SAN. Details about how to configure and schedule TPC reports can be found in the TPC documentation. Make sure that the reports you generate include all the information you need, and schedule them with a periodicity that allows you to track back the changes you make.

14.1.3 SVC
For SVC, you should periodically collect at least the output of the following commands and import them into a spreadsheet, preferably with each command output in a separate sheet:
svcinfo lsfabric
svcinfo lsvdisk
svcinfo lshost


svcinfo lshostvdiskmap X, with X ranging over all defined host numbers in your SVC cluster
You might want to store the output of additional commands as well, for instance, if you have SVC Copy Services configured or MDGs dedicated to specific applications or servers. One way to automate this task is to create a batch file (Windows) or shell script (Unix or Linux) that runs these commands and stores their output in temporary files, and then use spreadsheet macros to import these temporary files into your SVC documentation spreadsheet:
- With MS Windows, use PuTTY's PLINK utility to create a batch session that runs these commands and stores their output. With Unix or Linux, you can use the standard SSH utility.
- Create an SVC user with Monitor privilege just to run these batches - do not grant it Administrator privilege. Create and configure an SSH key specifically for it.
- Use the -delim option of these commands to delimit their output with a character other than Tab, such as a comma or colon. Using a comma even allows you to initially import the temporary files into your spreadsheet in CSV format.
- To make your spreadsheet macros simpler, you might want to preprocess the temporary output files and remove any garbage or undesired lines or columns. With Unix or Linux, you can use text editing commands, such as grep, sed, and awk. Freeware implementations of the same commands exist for Windows, or you can use any batch text editing utility you like.
Remember that the objective is to fully automate this procedure so that you can schedule it to run automatically from time to time, and the resulting spreadsheet should be easy to consult and contain only the relevant information you use most frequently. We discuss the automated collection and storage of configuration and support data (which is typically much more extensive and difficult to use) later in this chapter.
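If you prefer to post-process the command output in a script instead of with spreadsheet macros, output collected with -delim : parses directly with standard CSV handling. The following Python sketch is a minimal illustration; the sample columns are abbreviations, not the full lsvdisk output:

```python
import csv
import io

def parse_svcinfo(output, delim=":"):
    """Parse 'svcinfo ... -delim :' output into a list of dictionaries,
    one per object. The first line of the output is the column header."""
    return list(csv.DictReader(io.StringIO(output), delimiter=delim))
```

For example, parsing "id:name:status\n0:vdisk0:online\n" yields one row whose name field is vdisk0; each sheet of the documentation spreadsheet can then be written out with csv.DictWriter.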

14.1.4 Storage
We recommend that you fully allocate all the space available in whatever storage controllers serve as the SVC back end to the SVC itself, so that you can perform all your disk storage management tasks using just the SVC. If that is the case, you only need to generate (by hand) the documentation of your back-end storage controllers once, after you configure them, with updates whenever these controllers receive hardware or code upgrades. As such, there is not much point in automating this back-end storage controller documentation. However, if you are using split controllers, you might want to reconsider, because the portion of your storage controllers being used outside the SVC might have its configuration changed frequently. In this case, consult your back-end storage controller documentation about how to gather and store the documentation you might need.

14.1.5 Technical Support Information


We recommend that you create, and keep available to all storage administrators, a spreadsheet with all the relevant information that you use or need to provide when you open a technical support incident for your storage and SAN components, such as:
Hardware information:
- Vendor, machine and model number, serial number (for example, IBM 2145-CF8 S/N 75ABCDE)
- Configuration, if applicable
- Current code level


Physical location:
- Data center complete street address and phone number
- Equipment physical location: room number, floor, tile location, and rack number
- Vendor's security access information or procedure, if applicable
- On-site contact name and phone or pager number
Support contract information:
- Vendor contact phone numbers and website - keep them both
- Customer's contact name and phone or pager number
- User ID for the support website, if applicable (do NOT store the password in the spreadsheet unless the spreadsheet itself is password-protected)
- Support contract number and expiration date
With this information, everything you need to fill in a web support request form, or to give to a vendor's call center support representative, is already at hand. Typically, you are first asked for a brief description of the problem, and later for a detailed description and a support data collection.

14.1.6 Tracking Incident & Change tickets


If your organization uses an Incident and Change management and tracking tool (such as TSRM), you or the Storage Administration team might need to gain some proficiency in its use, for several reasons:
- If your storage and SAN equipment is not configured to send SNMP traps to this Incident management tool, you need to open Incidents manually whenever an error is detected.
- Disk storage allocation and deallocation, and SAN zoning configuration modifications, should be handled under properly submitted and approved Change tickets.
- If you are handling a problem yourself, or calling your vendor's technical support, you might need to produce a list of the Changes recently implemented in your SAN, or of those made since the documentation reports were last produced or updated.
The following best practices apply to SVC and SAN storage administration with Incident and Change Management tracking tools:
1. Whenever possible, configure your storage and SAN equipment to send SNMP traps to the Incident monitoring tool, so that an Incident ticket is opened automatically and the proper alert notifications are sent. If you do not use a monitoring tool in your environment, you might want to configure e-mail alerts that are sent automatically to the cell phones or pagers of the storage administrators on duty or on call.
2. Discuss within your organization which risk classification a storage allocation or deallocation Change ticket should have. These activities are typically safe and nondisruptive to other services and applications when properly handled, but they have the potential to cause collateral damage if a human error or unexpected failure occurs during implementation. Your organization might decide to assume the additional costs of overtime and limit such activities to off-business hours, weekends, or maintenance windows if it assesses that the risks to other critical applications are too high.
3. Use templates for your most common Change tickets, such as storage allocation or SAN zoning modification, to facilitate and speed up their submission.
4. Do not open Change tickets in advance to replace failed, redundant, hot-pluggable parts, such as disk drive modules (DDMs) in storage controllers with hot spares, or SFPs in SAN switches or servers with path redundancy. Typically, these fixes do not change anything in your SAN storage topology or configuration and do not cause any more service disruption or degradation than you already had when the part failed. They should be handled within the associated Incident ticket, because it takes longer to replace the part if you first need

to submit, schedule, and approve a non-emergency Change ticket. One exception to this rule is when you need to interrupt additional servers or applications in order to replace the part, in which case you need to schedule the activity and coordinate the support groups. Use your good judgment and avoid unnecessary exposure and delays.
5. Keep handy the procedures to generate reports of the latest Incidents and implemented Changes in your SAN storage environment. Typically, there is no need to generate these reports periodically, because your organization probably already has a Problem and Change Management group doing that for trend analysis purposes.

14.1.7 Automated Support Data collection


Along with the easier-to-use documentation of your SVC and SAN storage environment, we recommend that you collect, and store for some time, the configuration files and technical support data collections for all your SAN equipment. These include:
- supportSave and configSave files on Brocade switches
- Output of the show tech-support details command on Cisco switches
- Data collections from Brocade's DCFM software
- SVC snap
- DS4x00 subsystem profiles
- DS8x00 LUN inventory commands: lsfbvol, lshostconnect, lsarray, lsrank, lsioports, and lsvolgrp

Again, you can create procedures that automatically create and store this data on scheduled dates, delete old collections, or even transfer them to tape.

14.1.8 Subscribing for SVC support information


This is probably the most overlooked best practice in IT administration, and yet the most efficient way to stay ahead of problems: being notified of potential threats before they hit you and cause a severe services outage. We strongly advise you to access the URLs below and subscribe to receive support alerts and notifications for your products - you can select the products about which you want to be notified. You can use the same IBM ID that you created to access the Electronic Service Call website (ESC+). If you do not have an IBM ID yet, create one.
http://www.ibm.com/support
http://www.ibm.com/support/esc
Do the same for all your vendors of storage and SAN equipment, not only IBM. Typically, with a quick glance you can tell whether an alert or notification is applicable to your SAN storage, so open them as soon as you receive them and keep them, at least for a while, in a folder of your mailbox.


14.2 Storage Management IDs


Practically all organizations have IT security policies enforcing the use of password-protected user IDs for their IT assets and tools, but sadly we still find storage administrators using generic, shared IDs, such as superuser, admin, or root, in their management consoles to perform their tasks, sometimes even with the factory-set default password. Justifications for this behavior vary from a lack of time to certain SAN equipment not supporting the organization's particular authentication tool. Typically, SAN storage equipment management consoles do not provide access to the stored data, but one can easily shut down a shared storage controller, and any number of critical applications along with it. Moreover, having individual user IDs set up for your storage administrators allows much better back-tracking of modifications, should you need to analyze your logs. SVC release 6.2 supports new features in user authentication, including a remote authentication service, namely the Tivoli Embedded Security Services (ESS) server component level 6.2. This component is embedded in TPC 4.11 and can also be found in other products. Regardless of the authentication method you choose:
- Create individual user IDs for your storage administration staff. Each user ID should easily identify the user; follow your organization's security standards.
- Include each individual user ID in the UserGroup with just enough privileges to perform its required tasks.
- If required, create generic user IDs for your batch tasks, such as Copy Services or reporting. Include them in either the CopyOperator or Monitor UserGroup. Do not use generic user IDs with the SecurityAdmin privilege in batch tasks.
- Create unique SSH public and private keys for each of your administrators.
- Store your superuser password in a safe location, in accordance with your organization's security guidelines, and use it only in emergencies.
Figure 14-3 shows the SVC 6.2 GUI user ID creation screen.

Figure 14-3 New userid creation using the GUI


14.3 Standard operating procedures


You can make your most frequent SAN storage administration tasks, such as SAN storage allocation or removal, or adding or removing a host from the SAN, much easier by creating step-by-step, predefined standard procedures for them. Here, we present several recommendations that your procedures should follow in order to keep your SVC environment healthy and reliable. Practical examples are shown in Chapter 16, SVC scenarios on page 453.

14.3.1 Allocate and de-allocate volumes to hosts


Before allocating new volumes to a server with redundant disk paths, check whether these paths are working properly and the multipath software is free of errors. Fix any disk path errors that you find in your server before proceeding.
When planning for the future growth of space-efficient VDisks, be mindful of whether your server's operating system supports the online expansion of that particular volume - for instance, previous AIX releases do not support the online expansion of rootvg LUNs. We recommend that you test the procedure first in a non-production server.
Always cross-check the host LUN ID information with the SVC vdisk_UID. Do not trust that the operating system will recognize, create, and number the disk devices in the same sequence or with the same numbers with which you created them in SVC.
Make sure that you delete any volume or LUN definition in the server before unmapping it in SVC. For instance, in AIX, remove the hdisk from the volume group (reducevg) and delete the associated hdisk device (rmdev).
Make sure that you explicitly remove a volume from any volume-to-host mappings and any copy services relationships to which it belongs before deleting it. Avoid at all costs using the -force parameter in rmvdisk. If you issue svctask rmvdisk and the volume still has pending mappings, SVC asks you to confirm, and that request for confirmation is the hint that you might be doing something wrong.
When deallocating volumes, plan for an interval between unmapping them from hosts (rmvdiskhostmap) and destroying them (rmvdisk) - IBM's internal Storage Technical Quality Review Process (STQRP) asks for a minimum interval of 48 hours. This interval allows you a quick back-out if you later realize that you still need some of the data in that volume.
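The cross-check of host LUN IDs against the SVC vdisk_UID values is easy to script once both lists are collected (for example, from the multipath software on the host and from svcinfo lsvdisk). This Python sketch is a hypothetical illustration of the comparison logic only:

```python
def crosscheck_uids(host_luns, svc_vdisk_uids):
    """Compare the UIDs seen by the host with the vdisk_UID values
    reported by SVC; the comparison is case-insensitive."""
    host = set(uid.lower() for uid in host_luns)
    svc = set(uid.lower() for uid in svc_vdisk_uids)
    return {
        "unknown_to_svc": sorted(host - svc),    # investigate before use
        "not_seen_on_host": sorted(svc - host),  # mapping or path problem
    }
```

Any UID left in either list is a hint that a mapping, zoning, or device discovery step was missed and should be resolved before the volume is used.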

14.3.2 Add and remove hosts in SVC


Before mapping new servers to SVC, check that they are all free of errors. Fix any errors that you find in your server and in SVC before proceeding. In SVC, pay special attention to anything inactive in the svcinfo lsfabric output.
Plan for an interval between updating the zoning in each of your redundant SAN fabrics - at the very least, 30 minutes. This interval allows failover to take place and stabilize, and allows you to be notified should any unexpected errors occur.
After you perform the SAN zoning from one server's HBA to SVC, you should be able to list its WWPN with svcinfo lshbaportcandidate. Use svcinfo lsfabric to verify that it has been seen by the SVC nodes and ports that you expected.
When creating the host definition in SVC (svctask mkhost), try to avoid the -force parameter.


14.4 SVC Code upgrade


Because SVC is at the very core of your disk and SAN storage environment, its upgrade requires a bit of planning, preparation, and verification. However, with the proper precautions, it can be conducted easily and transparently for your servers and applications. At the time of writing, SVC version 4.3 is approaching its End of Support date, so your SVC should already be at least at version 5.1. In this topic, we discuss the generally applicable recommendations for an SVC upgrade, with a special case scenario for upgrading SVC from version 5.1 to 6.X.

14.4.1 Prepare for upgrade


Current and Target SVC code level
The first step is to determine your current and target SVC code levels. To determine your current code level, either:
- Log in to your SVC Console GUI and see the version in the Clusters tab, or
- Using the CLI, run the svcinfo lsnodevpd command.
SVC code levels are specified by four digits in the format V.R.M.F, where:
- V is the major Version number
- R is the Release level
- M is the Modification level
- F is the Fix level
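Because levels compare field by field, a V.R.M.F string can be treated as a tuple of integers when scripting compatibility checks. A minimal Python sketch:

```python
def vrmf(level):
    """Turn an SVC code level string such as '6.2.0.2' into a
    comparable (Version, Release, Modification, Fix) tuple."""
    parts = tuple(int(field) for field in level.split("."))
    if len(parts) != 4:
        raise ValueError("expected V.R.M.F, got %r" % level)
    return parts
```

For example, vrmf("5.1.0.10") > vrmf("5.1.0.9") is True, whereas a plain string comparison would order those two levels the wrong way around.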

If you run SVC release 5.1 or earlier, you also need to check the SVC Console version. You can see it in the SVC Console Welcome screen, in the upper-right corner, or in the Windows Control Panel Add or Remove Software screen. As for the SVC target code level, we recommend that you set it to the latest generally available (GA) release, unless you have a specific reason not to. Examples of such reasons are: a known problem with the particular version of some application or other component of your SAN storage environment; the latest SVC GA release not yet being cross-certified as compatible with another key component of your SAN storage environment; or internal policies in your organization, such as using the latest release minus 1, or requiring some seasoning in the field before implementation. As such, you need to check the compatibility of your target SVC code level with all the components of your SAN storage environment (SAN switches, storage controllers, and server HBAs) and its attached servers (operating systems and, eventually, applications). Typically, applications only certify the operating system under which they run and leave to the OS provider the task of certifying its compatibility with attached components (such as SAN storage). Certain applications, however, might make use of special hardware features or raw devices and certify the attached SAN storage as well - if this is your case, consult your application's compatibility matrix to verify that your SVC target code level is compatible. Review the SAN Volume Controller and SVC Console GUI Compatibility web page, and the SAN Volume Controller Concurrent Compatibility and Code Cross-Reference web page:
http://www-1.ibm.com/support/docview.wss?rs=591&uid=ssg1S1002888

http://www-1.ibm.com/support/docview.wss?rs=591&uid=ssg1S1001707

SVC Upgrade Test Utility


We strongly recommend that you install and run the latest SVC Upgrade Test Utility before any SVC code upgrade. You can download the SVC Upgrade Test Utility from the following URL:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S4000585
Figure 14-4 on page 407 shows its installation using the GUI - it is uploaded and installed like any other software upgrade. This tool verifies the health of your SVC for the upgrade process, checking for unfixed errors, degraded MDisks, inactive fabric connections, configuration conflicts, hardware compatibility, and many other conditions that otherwise require a series of command outputs to be cross-checked. Note: The SVC Upgrade Test Utility does not log in to the storage controllers or SAN switches that it checks for errors, but rather reports the status of its connections to these devices as it sees them. We still recommend that you check these components for errors as well. Moreover, it is still very important to read the SVC code version release notes carefully before running the upgrade procedure. Figure 14-4 shows the SVC 5.1 GUI screen to install the utility.

Figure 14-4 SVC Upgrade Test Utility installation using the GUI

While you can use either the GUI or the CLI to upload and install the SVC Upgrade Test Utility, you can run it only from the CLI, as shown in Example 14-1.
Example 14-1
IBM_2145:svccf8:admin>svcupgradetest -v 6.2.0.2 -d
svcupgradetest version 6.6
Please wait while the tool tests for issues that may prevent a
software upgrade from completing successfully. The test may take
several minutes to complete.


Checking 32 mdisks:
Results of running svcupgradetest:
==================================
The tool has found 0 errors and 0 warnings
The test has not found any problems with the cluster.
Please proceed with the software upgrade.
IBM_2145:svccf8:admin>

SVC Hardware Considerations


The release of SVC version 5.1 and of the new node models CF8 and CG8 brought another consideration into the SVC upgrade process: whether your SVC node hardware and target code level are compatible. Figure 14-5 on page 408 shows the compatibility matrix between the latest SVC hardware node models and code versions. If your SVC cluster has nodes of model 4F2, you need to replace them with newer models before upgrading their code. On the other hand, if you plan to add or replace nodes with the new models CF8 or CG8 in an existing cluster, you need to upgrade your SVC code first.

Figure 14-5 SVC node models and code versions relationship

Attached Hosts Preparation


As we mentioned before, with the proper precautions taken, the SVC upgrade is transparent to the attached servers and their applications; the automated upgrade procedure updates one SVC node at a time, while the other node in the I/O group covers for its designated volumes. For this statement to hold true, however, the failover capability of your servers' multipath software must be working properly. Before you start the SVC upgrade preparation, check the following items for each and every one of the servers attached to the SVC cluster that you will upgrade:
- Operating system type, version, and maintenance or fix level
- HBA make, model, and microcode version
- Multipath software type, version, and error log


Also check IBM's support page for SVC flashes and alerts (Troubleshooting):
http://www-947.ibm.com/support/entry/portal/Troubleshooting/Hardware/System_Storage/Storage_software/Storage_virtualization/SAN_Volume_Controller_(2145)
Fix every problem or suspect condition that you find with the disk path failover capability. Because a typical SVC environment has from many dozens to a few hundred servers attached to it, a spreadsheet might help you track the attached host preparation process. If you have any kind of host virtualization in your environment, such as VMware ESX, AIX LPARs and VIOs, or Solaris containers, you need to verify the redundancy and failover capability in these virtualization layers as well.

Storage Controllers Preparation


Just as critical as the attached hosts, the attached storage controllers must be able to correctly handle the failover of MDisk paths. Therefore, they should run supported microcode versions, and their own SAN paths to SVC should be free of errors.

SAN Fabrics preparation


If you use symmetrical, redundant, independent SAN fabrics, the preparation of these fabrics for the SVC upgrade is safer than for the components discussed previously, provided that you follow the recommendation mentioned in 15.2.2 of a minimum interval of 30 minutes between whatever modifications you make in one fabric and the next. Even if one unexpected error manages to bring down an entire SAN fabric, the SVC environment can continue working through the other fabric, and your applications remain unaffected. We suggest that, because you are going to upgrade your SVC anyway, you take the opportunity to upgrade your SAN switch code as well to the latest supported level. Start with your principal core switch or director, continue by upgrading the other core switches, and leave the edge switches for last. Upgrade one whole fabric (all switches) before moving to the next - that way, whatever problems you might eventually face affect only the first fabric. Only start the upgrade of your other fabric after you have verified that there were no problems in the first one. If you are still not running symmetrical, redundant, independent SAN fabrics, we suggest that you fix this situation at your highest priority: you have a single point of failure (SPoF).

Upgrade sequence
The ultimate guide for the order in which your SVC SAN storage environment components should be upgraded is the SVC Supported Hardware List. The following link points to the version 6.2 list:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797
By cross-checking which version of SVC is compatible with, say, which versions of your SAN directors, you can tell which component must be upgraded first. By checking the upgrade path of each individual component, you can tell whether that particular component requires a multi-step upgrade. Typically, if you are not going to make a major version or multi-step upgrade in any component, the order that has shown itself to be less prone to eventual problems is:
1. SAN switches or directors
2. Storage controllers
3. Server HBA microcode and multipath software
4. SVC cluster


Note: UNDER NO CIRCUMSTANCES upgrade two components of your SVC SAN storage environment simultaneously, such as the SVC and one storage controller, even if you intend to do it with your system offline. Doing so might lead to very unpredictable results, and an unexpected problem will be much more difficult to debug.

14.4.2 SVC Upgrade from 5.1 to 6.2


SVC incorporated several new features in version 6 compared to its previous versions, but the most significant differences with regard to the upgrade process are those concerning the SVC Console and the new configuration and use of internal SSDs with Easy Tier. While we discuss these two topics here, see Chapter 16, SVC scenarios on page 453 for a practical example of this upgrade.

SVC Console
SVC 6.1 no longer requires separate hardware with the specific function of its Console. The SVC Console software was incorporated into the nodes, so in order to access the SVC management GUI, you simply use the cluster IP address. If you purchased your SVC with a console or SSPC server, and you no longer have any SVC clusters running SVC release 5.1 or earlier, you can remove the SVC Console software from this server. In fact, SVC Console versions 6.1 and 6.2 are utilities that remove the previous SVC Console GUI software and create desktop shortcuts to the new Console GUI. Check the following URL for details and the download:
https://www-304.ibm.com/support/docview.wss?uid=ssg1S4000918

Easy Tier with SVC internal SSDs


SVC 6.2 introduced support for Easy Tier using internal SSDs in node models CF8 and CG8. If you are already using internal SSDs with an SVC release prior to 6.1, these SSDs must be removed from whatever MDisk group (MDG) they belong to and put into unmanaged state before you can upgrade to release 6.2. Example 14-2 shows what happens if you run svcupgradetest on a cluster with internal SSDs in managed state. If the internal SSDs share an MDG with MDisks from external storage controllers, you can remove them from the MDG by using rmmdisk with the -force option. Check that you have enough available space in the MDG before removing the MDisk: the command fails if it cannot move all extents from the SSD onto the other MDisks in the MDG. You will not lose data, but you will waste time. If the internal SSDs are alone in an MDG of their own (as they should be), you can migrate all volumes in this MDG to other MDGs and then remove the MDG entirely. After the SVC upgrade, you can re-create the SSD MDG, but we recommend that you use the SSDs with Easy Tier instead.
Example 14-2   Running svcupgradetest with internal SSDs in managed state
IBM_2145:svccf8:admin>svcinfo lsmdiskgrp
id name          status mdisk_count ...
...
2  MDG3SVCCF8SSD online 2           ...
3  MDG4DS8KL3331 online 8           ...
...
IBM_2145:svccf8:admin>svcinfo lsmdisk -filtervalue mdisk_grp_name=MDG3SVCCF8SSD
id name   status mode    mdisk_grp_id mdisk_grp_name capacity ctrl_LUN_#       controller_name UID
0  mdisk0 online managed 2            MDG3SVCCF8SSD  136.7GB  0000000000000000 controller0     5000a7203003190c000000000000000000000000000000000000000000000000
1  mdisk1 online managed 2            MDG3SVCCF8SSD  136.7GB  0000000000000000 controller3     5000a72030032820000000000000000000000000000000000000000000000000
IBM_2145:svccf8:admin>
IBM_2145:svccf8:admin>svcinfo lscontroller
id controller_name ctrl_s/n    vendor_id product_id_low product_id_high
0  controller0                 IBM       2145           Internal
1  controller1     75L3001FFFF IBM       2107900
2  controller2     75L3331FFFF IBM       2107900
3  controller3                 IBM       2145           Internal
IBM_2145:svccf8:admin>
IBM_2145:svccf8:admin>svcupgradetest -v 6.2.0.2 -d
svcupgradetest version 6.6
Please wait while the tool tests for issues that may prevent a
software upgrade from completing successfully. The test may take
several minutes to complete.
Checking 34 mdisks:
******************** Error found ********************
The requested upgrade from 5.1.0.10 to 6.2.0.2 cannot be completed
as there are internal SSDs in use. Please refer to the following flash:
http://www.ibm.com/support/docview.wss?rs=591&uid=ssg1S1003707
Results of running svcupgradetest:
==================================
The tool has found errors which will prevent a software upgrade
from completing successfully. For each error above, follow the
instructions given.
The tool has found 1 errors and 0 warnings
IBM_2145:svccf8:admin>
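The managed-internal-SSD condition that svcupgradetest flags can also be checked by hand before you run the tool. The sketch below works on canned, simplified lsmdisk output (the column layout is abbreviated and the controller names are taken from the example; adapt both to your cluster):

```shell
# Canned, simplified "svcinfo lsmdisk" lines:
# id name status mode mdisk_grp_id mdisk_grp_name controller_name
lsmdisk_output="0 mdisk0 online managed 2 MDG3SVCCF8SSD controller0
1 mdisk1 online managed 2 MDG3SVCCF8SSD controller3
2 mdisk2 online managed 3 MDG4DS8KL3331 controller1"

# controller0 and controller3 are the internal-SSD "controllers" (the nodes).
# Print any MDisk still in managed mode on those controllers:
for c in controller0 controller3; do
    echo "$lsmdisk_output" | awk -v c="$c" '$4 == "managed" && $7 == c {print $2}'
done
```

Once identified, the SSD MDisks can be freed with, for example, svctask rmmdisk -mdisk mdisk0:mdisk1 -force MDG3SVCCF8SSD (the group and MDisk names here are those of the example).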

After you upgrade your SVC cluster from release 5.1 to 6.2, your internal SSDs no longer appear as MDisks presented by storage controllers that are, in fact, the SVC nodes. Instead, they appear as drives that you configure into arrays, which can then be used in storage pools (formerly MDisk groups). Example 14-3 shows this change.
Example 14-3   Internal SSDs before and after the upgrade to version 6.2
### Previous configuration in SVC version 5.1:
IBM_2145:svccf8:admin>svcinfo lscontroller
id controller_name ctrl_s/n    vendor_id product_id_low product_id_high
0  controller0                 IBM       2145           Internal
1  controller1     75L3001FFFF IBM       2107900
2  controller2     75L3331FFFF IBM       2107900
3  controller3                 IBM       2145           Internal
IBM_2145:svccf8:admin>
### After upgrading SVC to version 6.2:
IBM_2145:svccf8:admin>lscontroller
id controller_name ctrl_s/n    vendor_id product_id_low product_id_high
1  DS8K75L3001     75L3001FFFF IBM       2107900
2  DS8K75L3331     75L3331FFFF IBM       2107900
IBM_2145:svccf8:admin>
IBM_2145:svccf8:admin>lsdrive
id status error_sequence_number use    tech_type capacity mdisk_id mdisk_name member_id enclosure_id slot_id node_id node_name
0  online                       unused sas_ssd   136.2GB                                             0       2       node2
1  online                       unused sas_ssd   136.2GB                                             0       1       node1
IBM_2145:svccf8:admin>

You need to decide which RAID level to configure for the new arrays of SSD drives, depending on the purpose you intend for them and the level of redundancy necessary to protect your data in case of hardware failure. Table 14-2 lists the factors to consider in each case. Again, we recommend that you use your internal SSD drives for Easy Tier; in most cases, this is how you get the best overall performance gain.


Table 14-2   RAID levels for internal SSDs

RAID-0 (GUI preset: Striped)
  What you need: 1-4 drives, all in a single node.
  When to use it: When VDisk Mirroring is on external MDisks.
  For best performance: A pool should only contain arrays from a single I/O group.

RAID-1 (GUI preset: Easy Tier)
  What you need: 2 drives, one in each node of the I/O group.
  When to use it: When using Easy Tier and/or keeping both mirrors on SSDs.
  For best performance: An Easy Tier pool should only contain arrays from a single I/O group, and the external MDisks in this pool should only be used by the same I/O group.

RAID-10 (GUI preset: Mirrored)
  What you need: 4-8 drives, equally distributed between the nodes of the I/O group.
  When to use it: When using multiple drives for a VDisk.
  For best performance: A pool should only contain arrays from a single I/O group. Recommended over VDisk Mirroring.

14.4.3 Upgrade SVC clusters participating in MM or GM


When upgrading an SVC cluster that participates in an intercluster Copy Services relationship, make sure that you do not upgrade both clusters in the relationship simultaneously. This situation is not verified or monitored by the automatic upgrade process and might lead to a loss of synchronization and unavailability. Make sure that the upgrade finished successfully on one cluster before you start it on the other. Upgrade the second cluster to the same code level as the first as soon as possible; avoid running the clusters at different code levels for extended periods. If possible, we recommend that you stop all intercluster relationships for the duration of the upgrade and restart them after the upgrade is completed.

14.4.4 SVC upgrade


Some generic (version-independent) recommendations for your SVC code upgrade:
- Schedule the SVC code upgrade for a time of low I/O activity. The upgrade process takes one node offline at a time and disables the write cache in that node's I/O group until both nodes have been upgraded. With lower I/O, you are less likely to notice any performance degradation during the upgrade.
- Never power off an SVC node during a code upgrade unless IBM support has instructed you to do so. Typically, if the upgrade process encounters a problem and fails, it backs out by itself.
- Check that every computer you intend to use to manage your SVC, including the SVC Console, runs a web browser type and version supported by the SVC target code level.
- If you are planning a major SVC version upgrade (such as version 5 to version 6), first update your current version to its latest fix level before running the major upgrade.


14.5 SAN modifications


Sadly, a frequent problem in the administration of shared storage environments is human error: while fixing a failure or making a change that affects only one or a few servers or applications, the administrator ends up affecting other servers or applications because proper precautions were not taken. By making a habit of using best practices, you can uniquely and correctly identify the components of your SAN; use the proper failover commands to disable only the failed parts; understand which modifications are necessarily disruptive and which can be performed online with little or no performance degradation; avoid unintended disruption of servers and applications; and dramatically increase the overall availability of your IT infrastructure. Examples of such human mistakes include:
- Removing the mapping of a LUN (volume, or VDisk) still in use by a server
- Disrupting or disabling a server's working disk paths while trying to fix failed ones
- Disrupting a neighboring SAN switch port while inserting or pulling out an FC cable or SFP
- Disabling or removing the working part of a redundant set instead of the failed one
- Making modifications that affect both parts of a redundant set without an interval that allows for automatic failover in case of unexpected problems

14.5.1 Cross-referencing HBA WWPNs


The one thing that uniquely identifies a server in the SAN is the WWPNs of its HBAs. If a server's name is changed at the operating system level but not in the SVC host definitions, the server continues to access its previously mapped volumes precisely because its HBA WWPNs have not changed. On the other hand, if an HBA is removed from one server and installed in another, and the previous server's SAN zones and SVC host definitions are not updated, the next server will be able to access volumes that it probably should not.
1. Verify in your server the WWPNs of the HBAs being used for disk access. Typically, you can obtain them from your server's SAN disk multipath software. If you are using SDD, run datapath query wwpn, which returns output similar to:
[root@nybixtdb02]> datapath query wwpn
Adapter Name  PortWWN
fscsi0        10000000C925F5B0
fscsi1        10000000C9266FD1
If you are using server virtualization, verify the WWPNs in the server that is actually attached to the SAN, such as AIX VIO or VMware ESX.
2. Next, cross-reference them with the output of the SVC lshost <hostname> command:
IBM_2145:svccf8:admin>svcinfo lshost NYBIXTDB02
id 0
name NYBIXTDB02
port_count 2
type generic
mask 1111
iogrp_count 1
WWPN 10000000C925F5B0
node_logged_in_count 2
state active
WWPN 10000000C9266FD1
node_logged_in_count 2


state active
IBM_2145:svccf8:admin>
3. If necessary, cross-reference with your SAN switches' information. On Brocade switches, use nodefind <WWPN>:
blg32sw1_B64:admin> nodefind 10:00:00:00:C9:25:F5:B0
Local:
 Type Pid     COS  PortName                 NodeName                 SCR
 N    401000; 2,3; 10:00:00:00:C9:25:F5:B0; 20:00:00:00:C9:25:F5:B0; 3
 Fabric Port Name: 20:10:00:05:1e:04:16:a9
 Permanent Port Name: 10:00:00:00:C9:25:F5:B0
 Device type: Physical Unknown(initiator/target)
 Port Index: 16
 Share Area: No
 Device Shared in Other AD: No
 Redirect: No
 Partial: No
 Aliases: nybixtdb02_fcs0
blg32sw1_B64:admin>
Best practice requires that storage allocation requests submitted by the server support or application support teams to the storage administration team always include the WWPNs of the server HBAs to which the new LUNs or volumes are supposed to be mapped. For instance, a server might use separate HBAs for disk and tape access, or distribute its mapped LUNs across different HBAs for performance, so you cannot assume that any given new volume is supposed to be mapped to every WWPN that the server has logged in to the SAN. If your organization uses a change management tracking tool, perform all your SAN storage allocations under approved change tickets with the server's WWPNs listed in the Description and Implementation sections.
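Because different tools print WWPNs in different formats (the SVC and SDD without separators, Brocade switches with colons and often lowercase hex), it helps to normalize them before comparing. A small sketch using canned sample values taken from the listings in this section:

```shell
# Normalize a WWPN: strip colons, force hex digits to uppercase.
normalize() { tr -d ':' | tr 'a-f' 'A-F'; }

server_wwpns="10000000C925F5B0
10000000C9266FD1"                      # as reported by "datapath query wwpn"
switch_wwpn="10:00:00:00:c9:25:f5:b0"  # as reported by a Brocade switch

norm=$(printf '%s' "$switch_wwpn" | normalize)
if printf '%s\n' "$server_wwpns" | grep -q "^$norm$"; then
    echo "match: $norm is one of this server's HBAs"
fi
```

The anchored grep pattern avoids false positives from partial matches when one WWPN is a substring of another.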

14.5.2 Cross-referencing LUNids


Always cross-reference the SVC vdisk_UID with the server's LUN ID before making any modifications involving SVC volumes. Below is an example using an AIX server running SDDPCM. Notice that there is no relationship between the SVC vdisk_name and the AIX device name, and that the very first SAN LUN mapped to the server (SCSI_id 0) showed up as hdisk4 on the server because the server already had four internal disks (hdisk0 - hdisk3).

IBM_2145:svccf8:admin>lshostvdiskmap NYBIXTDB03
id name       SCSI_id vdisk_id vdisk_name     vdisk_UID
0  NYBIXTDB03 0       0        NYBIXTDB03_T01 60050768018205E12000000000000000
IBM_2145:svccf8:admin>

root@nybixtdb03::/> pcmpath query device
Total Dual Active and Active/Asymmetric Devices : 1
DEV#: 4  DEVICE NAME: hdisk4  TYPE: 2145  ALGORITHM: Load Balance
SERIAL: 60050768018205E12000000000000000
==========================================================================
Path#    Adapter/Path Name    State    Mode      Select    Errors
   0*    fscsi0/path0         OPEN     NORMAL    7         0
   1     fscsi0/path1         OPEN     NORMAL    5597      0
   2*    fscsi2/path2         OPEN     NORMAL    8         0
   3     fscsi2/path3         OPEN     NORMAL    5890      0
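This cross-reference can be scripted. The sketch below uses canned one-line extracts of the two outputs; the field positions are assumptions based on the listings shown above:

```shell
# Last field of the lshostvdiskmap line is the vdisk_UID:
svc_line="0 NYBIXTDB03 0 0 NYBIXTDB03_T01 60050768018205E12000000000000000"
vdisk_uid=$(echo "$svc_line" | awk '{print $NF}')

# Second field of the SDDPCM "SERIAL:" line is the LUN serial:
aix_line="SERIAL: 60050768018205E12000000000000000"
lun_serial=$(echo "$aix_line" | awk '{print $2}')

if [ "$vdisk_uid" = "$lun_serial" ]; then
    echo "hdisk4 maps to SVC volume NYBIXTDB03_T01"
fi
```

In practice you would feed the real command outputs through the same awk extractions rather than canned strings.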

If your organization uses a change management tracking tool, include LUN ID information in every change ticket that performs SAN storage allocation or reclamation.

14.5.3 HBA replacement


Replacing a failed HBA is a fairly trivial and safe operation if done correctly. However, some additional precautions are required if your server has redundant HBAs and its hardware permits you to hot-swap them (with the server still powered up and running):
1. In your server, using the multipath software, identify the failed HBA, write down its WWPN (see 14.5.1, Cross-referencing HBA WWPNs on page 413), and take this HBA and its associated paths offline, gracefully if possible. This step is important so that the multipath software stops trying to recover it; your server might even show degraded performance until you do so.
2. Some HBAs come with a label showing the WWPN. If this is your case, make a note of it before you install the new HBA in the server.
3. If your server does not support HBA hot-swap, power off your system, replace the HBA, connect the previously used FC cable into the new HBA, and power the system back on. If your server does support hot-swap, follow the proper procedures to replace the HBA hot, and be very careful not to disable or disrupt your good HBA in the process.
4. Verify that the new HBA has successfully logged in to the SAN switch; you should see its WWPN logged in on the SAN switch port. If it has not, fix this issue before continuing to the next step. Cross-check the WWPN you see in the SAN switch with the one you noted from the new HBA, and make sure you did not pick up the WWNN by mistake.
5. In your SAN zoning configuration tool, replace the old HBA's WWPN with the new one in every alias and zone it belongs to. Be very careful not to touch the other SAN fabric (the one with the good HBA) while you do this. If you are already applying best practices to your SAN, there should be only one alias using this WWPN, and zones should reference this alias. If you are using SAN port zoning (though you should not be) and you did not move the new HBA's FC cable to another SAN switch port, you do not need to reconfigure zoning.
6. Verify that the new HBA's WWPN shows up in the SVC by using lshbaportcandidate. Troubleshoot your SAN connections and zoning if it does not.
7. Add the new HBA's WWPN to the SVC host definition by using addhostport; do not remove the old one just yet. Run lshost <servername> and verify that the good HBA shows as active, while the failed and new HBAs show as either inactive or offline.
8. Go back to the server and reconfigure the multipath software to recognize the new HBA and its associated SAN disk paths. Verify that all SAN LUNs have redundant, healthy disk paths through both the good and the new HBAs.
9. Now, go back to the SVC and verify again with lshost <servername> that both the good and the new HBA WWPNs are active. If they are, you can remove the old HBA's WWPN from the host definition by using rmhostport. If they are not, troubleshoot your SAN connections and zoning, and do not remove any HBA WWPN from the host definition until you are sure you have at least two healthy, active ones; this way, you do not risk removing your only good one by mistake.
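The safeguard in step 9 (never remove a port unless enough other ports are active) can be checked mechanically. A sketch against canned, simplified lshost output; the WWPN of the failed HBA and the one-line-per-port format are assumptions for illustration:

```shell
# Simplified "lshost" extract, one port per line: WWPN <wwpn> state <state>
lshost_out="WWPN 10000000C925F5B0 state active
WWPN 10000000C9266FD1 state active
WWPN 10000000C9AAAA01 state offline"   # hypothetical old, failed HBA

old_wwpn="10000000C9AAAA01"
# Count active ports OTHER than the one we want to remove:
active_others=$(echo "$lshost_out" | awk -v o="$old_wwpn" '$2 != o && $4 == "active"' | grep -c .)

if [ "$active_others" -ge 2 ]; then
    echo "OK to run: svctask rmhostport -hbawwpn $old_wwpn NYBIXTDB02"
else
    echo "do NOT remove $old_wwpn yet"
fi
```

Only when at least two other active ports remain does the sketch propose the rmhostport command, mirroring the rule in the text.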


14.6 SVC Hardware Upgrades


SVC's scalability features allow a great deal of flexibility in its configuration; as a consequence, there are a number of possible scenarios for its growth. We grouped these scenarios into three categories, with suggestions below for how to approach each of them.

14.6.1 Add SVC nodes to an existing cluster


If your existing SVC cluster has fewer than four I/O groups and you intend to upgrade it, you will probably find yourself installing newer, more powerful nodes than your existing ones, so your cluster will have different node models in different I/O groups. Before you install these newer nodes, check whether your SVC code level must be upgraded first; see SVC Hardware Considerations on page 408 for details. After you install them, however, you will probably need to redistribute your servers across the different I/O groups, as follows:
1. Keep in mind that moving a server's volumes to a different I/O group cannot be done online, so you need to schedule a brief outage: you export the server's SVC volumes and re-import them. In AIX, for instance, this means varyoffvg and exportvg, then changing the volumes' I/O group in the SVC, and then importvg back on the server.
2. If each of your servers is zoned to just one I/O group, you need to modify your SAN zoning configuration as you move its volumes to another I/O group. Balance the distribution of your servers across I/O groups according to I/O workload as best you can.
3. Use the -iogrp parameter of the mkhost command to define in the SVC which servers use which I/O groups. If you do not, the SVC by default maps the host to all I/O groups, even if they do not exist and regardless of your zoning configuration. Example 14-4 shows this situation and how to resolve it.
Example 14-4   Restricting a host definition to the I/O groups it actually uses
IBM_2145:svccf8:admin>lshost NYBIXTDB02
id 0
name NYBIXTDB02
port_count 2
type generic
mask 1111
iogrp_count 4
WWPN 10000000C9648274
node_logged_in_count 2
state active
WWPN 10000000C96470CE
node_logged_in_count 2
state active
IBM_2145:svccf8:admin>lsiogrp
id name            node_count vdisk_count host_count
0  io_grp0         2          32          1
1  io_grp1         0          0           1
2  io_grp2         0          0           1
3  io_grp3         0          0           1
4  recovery_io_grp 0          0           0
IBM_2145:svccf8:admin>lshostiogrp NYBIXTDB02
id name
0  io_grp0
1  io_grp1
2  io_grp2
3  io_grp3
IBM_2145:svccf8:admin>rmhostiogrp -iogrp 1:2:3 NYBIXTDB02
IBM_2145:svccf8:admin>lshostiogrp NYBIXTDB02
id name
0  io_grp0
IBM_2145:svccf8:admin>lsiogrp
id name            node_count vdisk_count host_count
0  io_grp0         2          32          1
1  io_grp1         0          0           0
2  io_grp2         0          0           0
3  io_grp3         0          0           0
4  recovery_io_grp 0          0           0
IBM_2145:svccf8:admin>
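The export-and-reimport move described in step 1 can be outlined as the following command sequence. The volume group, volume, disk, and I/O group names are assumptions, and the exact SVC command for changing a volume's I/O group varies by code level (chvdisk -iogrp on older releases, movevdisk on later ones), so verify it against your release's documentation:

```
# On the AIX host: quiesce and export the volume group.
varyoffvg datavg
exportvg datavg

# On the SVC: move the volume to the target I/O group
# (hypothetical names; confirm the command for your code level).
svctask chvdisk -iogrp io_grp1 NYBIXTDB03_T01

# Back on the AIX host: rediscover paths and re-import the volume group.
cfgmgr
importvg -y datavg hdisk4
varyonvg datavg
```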


4. If possible, avoid setting a server to use volumes from I/O groups with very different node types, at least not as a permanent arrangement. If you do, as the server's storage capacity grows, you might experience a performance difference between volumes from different I/O groups, making an eventual performance problem very tricky to spot and solve.

14.6.2 Upgrade SVC nodes in an existing cluster


If you are going to replace the nodes of your existing SVC cluster with newer ones, the replacement can be performed nondisruptively: the new node can assume the WWNN of the node you are replacing, thus requiring no changes in host configuration or multipath software. See the IBM SAN Volume Controller Information Center at the following URL for details of the procedure: http://publib.boulder.ibm.com/infocenter/svc/ic/index.jsp The nondisruptive node replacement makes use of failover capabilities to replace one node of an I/O group at a time. An alternative to this procedure is to replace the nodes disruptively by moving volumes to a new I/O group. The disruptive procedure, however, requires additional work on the servers.

14.6.3 Move to a new SVC cluster


If you already have a highly populated, intensively used SVC cluster that you want to upgrade, and you want to use the opportunity to give your SVC and SAN storage environment an overhaul, one scenario that might ease your effort is to replace the cluster entirely with a newer, bigger, more powerful one. In this case:
1. Install your new SVC cluster.
2. Create a replica of your data on the new cluster.
3. Migrate your servers to the new SVC cluster at their best convenience.
If your servers can tolerate a short, scheduled outage to switch over from one SVC to the other, you can use the SVC Remote Copy Services (Metro Mirror or Global Mirror) to create your data replicas, and moving your servers is then no different from what was discussed in 14.6.1, Add SVC nodes to an existing cluster on page 416. If you need to migrate a server online, you can modify its zoning so that it uses volumes from both SVC clusters, and use host-based mirroring (such as AIX mirrorvg) to move your data from the old SVC to the new one. This last approach uses the server's computing resources (CPU, memory, I/O) to replicate the data, so before you begin, make sure the server has such resources to spare. The biggest advantage of this approach is that it easily accommodates, if necessary, the replacement of your SAN switches or your back-end storage controllers. You can upgrade the capacity of your back-end storage controllers or replace them entirely, just as you can replace your SAN switches with bigger or faster ones. The disadvantage is that you need some spare resources during the migration, such as floor space, electricity, cables, and storage capacity. In Chapter 16, SVC scenarios on page 453, we show one possible approach for this scenario that replaces the SVC, the switches, and the back-end storage.
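The online, host-based-mirroring variant can be outlined as follows for AIX. The disk and volume group names are assumptions: hdisk4 stands for the old SVC's volume and hdisk5 for the new one.

```
extendvg datavg hdisk5     # add the new SVC's volume to the volume group
mirrorvg datavg hdisk5     # create and synchronize the second copy
# ...wait for synchronization to complete (check with lsvg datavg)...
unmirrorvg datavg hdisk4   # drop the copy that lives on the old SVC
reducevg datavg hdisk4     # remove the old disk from the volume group
rmdev -dl hdisk4           # delete the device before unzoning the old SVC
```

Because the host drives the copy, throttle or schedule the synchronization if the server is short on CPU or I/O bandwidth.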


14.7 Wrap up
There are, of course, many more practices that can be applied to SAN storage environment management that would benefit its administrators and users. You can see the practices we just reviewed, and more, being applied in Chapter 16, SVC scenarios on page 453.


15

Chapter 15.

Troubleshooting and diagnostics


The SAN Volume Controller (SVC) has proven to be a robust and reliable virtualization engine that has demonstrated excellent availability in the field. Nevertheless, from time to time, problems occur. In this chapter, we provide an overview about common problems that can occur in your environment. We discuss and explain problems related to the SVC, the Storage Area Network (SAN) environment, storage subsystems, hosts, and multipathing drivers. Furthermore, we explain how to collect the necessary problem determination data and how to overcome these problems.

Copyright IBM Corp. 2011. All rights reserved.


15.1 Common problems


Today's SANs, storage subsystems, and host systems are complicated, often consisting of hundreds or thousands of disks, multiple redundant subsystem controllers, virtualization engines, and different types of Storage Area Network (SAN) switches. All of these components have to be configured, monitored, and managed properly, and in the case of an error, the administrator needs to know what to look for and where to look. The SVC is a great tool for isolating problems in the storage infrastructure. With the functions found in the SVC, the administrator can more easily locate any problem areas and take the necessary steps to fix the problems. In many cases, the SVC and its service and maintenance features guide the administrator directly, provide help, and suggest remedial action. Furthermore, the SVC verifies whether the problem still persists. When you experience problems with the SVC environment, it is important to ensure that all components comprising the storage infrastructure are interoperable. In an SVC environment, the SVC support matrix is the main source for this information. Visit the following link for the latest SVC Version 6.2 support matrix: https://www-304.ibm.com/support/docview.wss?uid=ssg1S1003797 Although the latest SVC code level is supported to run with older HBAs, storage subsystem drivers, and code levels, we recommend that you use the latest tested levels.

15.1.1 Host problems


From the host point of view, you can experience a variety of problems, ranging from performance degradation to inaccessible disks. There are a few things that you can check on the host itself before drilling down into the SAN, SVC, and storage subsystems. Areas to check on the host:
- Any special software that you are using
- Operating system version and maintenance or service pack level
- Multipathing type and driver level
- Host bus adapter (HBA) model, firmware, and driver level
- Fibre Channel SAN connectivity
Based on this list, the host administrator needs to check and correct any problems. You can obtain more information about managing hosts on the SVC in Chapter 8, Hosts on page 191.

15.1.2 SVC problems


The SVC has good error logging mechanisms. It not only keeps track of its internal problems, but it also tells the user about problems in the SAN or storage subsystem. It also helps to isolate problems with the attached host systems. Every SVC node maintains a database of other devices that are visible in the SAN fabrics. This database is updated as devices appear and disappear.

Fast node reset


The SVC cluster software incorporates a fast node reset function. The intention of a fast node reset is to avoid I/O errors and path changes from the host's point of view if a software problem occurs in one of the SVC nodes. The fast node reset function means that SVC software problems can be recovered without the host experiencing an I/O error and without requiring the multipathing driver to fail over to an alternative path. The fast node reset is performed automatically by the SVC node, which informs the other members of the cluster that it is resetting.
Other than SVC node hardware and software problems, failures in the SAN zoning configuration are a common problem. A misconfiguration of the SAN zoning might prevent the SVC cluster from working, because the SVC cluster nodes communicate with each other by using the Fibre Channel SAN fabrics. You must check the following areas from the SVC perspective:
- The attached hosts: refer to 15.1.1, Host problems on page 420.
- The SAN: refer to 15.1.3, SAN problems on page 422.
- The attached storage subsystems: refer to 15.1.4, Storage subsystem problems on page 422.
There are several SVC command-line interface (CLI) commands with which you can check the current status of the SVC and the attached storage subsystems. Before starting the complete data collection or the problem isolation at the SAN or subsystem level, we recommend that you run the following commands first and check the status from the SVC perspective. You can use these helpful CLI commands to check the environment from the SVC perspective:
svcinfo lscontroller controllerid
Check that multiple worldwide port names (WWPNs) matching the back-end storage subsystem controller ports are available. Check that the path_counts are evenly distributed across each storage subsystem controller or that they are distributed correctly based on the preferred contr