
Recommended Best Practices: Considerations for High Availability on IBM System Storage DS8000 and DS6000 and IBM TotalStorage ESS

Speaker notes

Copyright 2007 IBM Corporation. All rights reserved.


Chart 3: IBM System Storage Enterprise Disk

The IBM System Storage DS6000 and DS8000, together with the IBM TotalStorage Enterprise Storage Server models 800 and 750, are the members of IBM's current enterprise disk family. IBM has a long history of delivering enterprise storage systems designed for performance and reliability.

The IBM System Storage DS8000 is IBM's current high-performance, high-capacity series of enterprise disk storage systems. With the implementation of POWER5 server technology in the DS8000, it is possible to create storage system logical partitions (LPARs) that can be used for completely separate production, test, or other unique storage environments. The DS8000 is designed to provide a flexible and extendable disk storage subsystem. The IBM System Storage DS8000 is designed to:
- Deliver robust, flexible, and cost-effective disk storage for mission-critical workloads
- Support storage sharing and consolidation for a wide variety of operating systems and mixed server environments
- Help increase storage administration productivity with centralized and simplified management
- Provide the creation of multiple storage system LPARs that can be used for completely separate production, test, or other unique storage environments
- Occupy 20 percent less floor space than the earlier ESS Model 800's base frame, while holding even more capacity

The DS6000 series is a member of the System Storage DS family. The DS6000 series is designed to support enterprise-class data backup and disaster recovery capabilities such as IBM Copy Services. In a small 3U footprint, the DS6000 provides performance and functions for business continuity, disaster recovery, and resiliency that were previously only available in expensive high-end storage servers. The DS6000 series is also Copy Services compatible with the previous Enterprise Storage Server (ESS) Models 800 and 750, as well as the DS8000 series. The DS6000 introduced a new standard in pricing and packaging.

This document provides a summary of recommended high availability best practice considerations for the DS8000, DS6000, and Enterprise Storage Server disk subsystems. The reader is assumed to have a baseline understanding of the concepts and facilities of these products.


Chart 4: Configuration

RAID-5 overview
RAID-5 is one of the most commonly used forms of RAID protection. RAID-5 is a method of spreading volume data plus parity data across multiple disk drives. RAID-5 can help provide fast performance by striping data across a defined set of disk drive modules (DDMs). Data protection is provided by the generation of parity information for every stripe of data. If an array member fails, its contents can be regenerated by using the parity data.

RAID-5 implementation in the DS8000
In a DS8000, a RAID-5 array contains either seven or eight disks, depending on whether the array is supplying a spare. A seven-disk array effectively uses one disk for parity, so it is referred to as a 6+P array (where the P stands for parity). The reason only seven disks are available to a 6+P array is that the eighth disk in the array site used to build the array was used as a spare. We then refer to this as a 6+P+S array site (where the S stands for spare). An eight-disk array also effectively uses one disk for parity, so it is referred to as a 7+P array.

RAID-10 overview
RAID-10 was designed to provide high availability by combining features of RAID-0 and RAID-1. RAID-0 seeks to improve performance by striping volume data across multiple disk drives at a time. RAID-1 provides disk mirroring, which duplicates data between two disk drives. By combining the features of RAID-0 and RAID-1, RAID-10 provides additional capabilities for fault tolerance. Access to data is preserved if one disk in each mirrored pair remains available. RAID-10 typically offers faster data reads and writes than RAID-5 because it does not need to manage parity. However, with half of the DDMs in the group used for data and the other half used to mirror that data, RAID-10 offers less usable storage capacity than a RAID-5 configuration.

RAID-10 implementation in the DS8000
In the DS8000, the RAID-10 implementation is achieved using either six or eight DDMs. If spares exist on the array site, then six DDMs are used to make a three-disk RAID-0 array, which is then mirrored. If spares do not exist on the array site, then eight DDMs are used to make a four-disk RAID-0 array, which is then mirrored.

Summary of potential benefits and drawbacks of RAID 5 and RAID 10

RAID 5 potential benefits:
-- Lower overall costs, higher usable capacity
RAID 5 benefits from the distribution of parity across all drives. In a RAID 5 configuration of 8 drives (6+P+S), the capacity efficiency is 6/8 or 75%, meaning that the usable capacity is 75% of the raw capacity. This can provide significant overall savings in absolute storage costs, while offering acceptable levels of redundancy for most applications.
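The capacity-efficiency arithmetic above generalizes to the other DS8000 array layouts described in this chart. The following minimal sketch recomputes usable capacity and efficiency per 8-drive array site; the 300 GB drive size is an illustrative input, not a requirement.

```python
# Minimal sketch: usable capacity and efficiency for the DS8000 array
# layouts described above (6+P+S, 7+P, RAID-10 3+3 with spares, RAID-10 4+4).
# The drive capacity is an illustrative assumption.

def usable_capacity(drive_gb, data_drives, total_drives=8):
    """Usable GB and capacity efficiency for one 8-drive array site."""
    usable = data_drives * drive_gb
    efficiency = data_drives / total_drives
    return usable, efficiency

layouts = {
    "RAID-5 6+P+S":      6,   # 6 data, 1 parity, 1 spare
    "RAID-5 7+P":        7,   # 7 data, 1 parity
    "RAID-10 3+3 (+2S)": 3,   # 3 data drives mirrored, 2 spares
    "RAID-10 4+4":       4,   # 4 data drives mirrored
}

drive_gb = 300  # example DDM size
for name, data_drives in layouts.items():
    usable, eff = usable_capacity(drive_gb, data_drives)
    print(f"{name:18s} usable {usable:5d} GB  efficiency {eff:.0%}")
```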


RAID 5 potential drawbacks:
-- Performance
When a block of data is read, the RAID algorithm determines which stripe holds that data block and its associated parity block, and reads the data. When data is modified, RAID 5 must recalculate the parity by subtracting the old data block and adding in the new version. This requires two reads (the original data block being updated and its associated parity block) followed by two writes (the new data block and the new parity block). This read + read + write + write sequence can create a slower response due to the additional operations.
-- Performance in a RAID 5 rebuild
When a RAID 5 drive fails, data is recovered by reading the blocks from the remaining drives and calculating the lost data using the parity information. This requires up to six reads (in a 6+P+S configuration) to reconstruct the data of the lost drive before it can be delivered. Performance is also degraded further during recovery, because all drives have to be actively accessed in order to rebuild the replacement (spare) drive while still servicing I/O requests.
-- Redundancy in a RAID 5 rebuild
While disk drives have improved in reliability in recent years, there still exists a small but real probability that a second drive could fail during the rebuild cycle. If this occurred, data would have to be retrieved from other backup sources, such as a disaster recovery site or tape. The exposure is increased by the fact that if any of the remaining six drives fails during the rebuild, data is lost.

RAID 10 potential benefits:
-- Performance in normal operation
RAID 10 benefits from the fact that when a read command is issued, either of the mirrored drives can retrieve the data, resulting in a statistically faster time to data.
-- Performance in a RAID 10 rebuild
When a RAID 10 drive fails, data can typically be retrieved more quickly from the remaining mirrored pairs. Data is recovered by reading the blocks from the mirrored drives; no parity information has to be calculated, which reduces the number of reads necessary compared to RAID 5. Performance degradation on production I/O is also further reduced, because during recovery only the mirrored drives have to be read in order to rebuild the replacement (spare) drive while still servicing I/O requests.
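The comparison above implies a simple back-end I/O cost model: a small random write costs roughly four disk operations on RAID 5 (read data, read parity, write data, write parity) versus two mirrored writes on RAID 10, while reads cost one operation on either. The sketch below illustrates that arithmetic; the workload figures are illustrative assumptions, not measurements.

```python
# Back-end I/O estimate implied by the RAID 5 vs RAID 10 comparison above.
# Write penalty: 4 operations per random write on RAID 5, 2 on RAID 10.
# Front-end workload numbers are assumptions for illustration only.

def backend_iops(read_iops, write_iops, write_penalty):
    """Approximate disk operations per second behind the cache."""
    return read_iops + write_iops * write_penalty

front_end = {"read_iops": 3000, "write_iops": 1000}   # assumed workload

raid5  = backend_iops(**front_end, write_penalty=4)
raid10 = backend_iops(**front_end, write_penalty=2)

print(f"RAID 5  back-end ops/s: {raid5}")
print(f"RAID 10 back-end ops/s: {raid10}")
```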


RAID 10 potential drawbacks:
-- Cost
In a RAID 10 configuration of 8 drives (4+4), the capacity efficiency is 4/8 or 50%, meaning that the usable capacity is only 50% of the raw capacity. This is further decreased when spares are included. For example, in the option above using two mirrored three-disk RAID-0 arrays with two spares in an 8-drive configuration, the capacity efficiency drops to 3/8 or about 38%.

Exploit Available Hardware Options

- Server and storage failover and failback in a Metro Mirror environment
To support continuous availability, users can take advantage of the server and storage failover and failback in a Metro Mirror environment offered with the DS8000. This function requires host software such as GDPS for z/OS, HACMP/XD for System p, or MSCS for System x (the latter is a lab services offering). IBM Implementation Services for Geographically Dispersed Open Clusters (GDOC) provides the automation, testing, and management needed to support high availability and near-transparent application recovery. It is a multi-vendor solution for keeping applications operational in mirrored environments running on IBM AIX, HP-UX, Sun Solaris, Linux, and Microsoft Windows operating systems. Additional information on GDOC can be found on the IBM internal SalesOne site: http://w303.ibm.com/services/salesone/

- Concurrent Maintenance
Users can replace adapter cards and other hardware components while the box is up and running, as well as upgrade microcode concurrently.

- Minimize single-frame DS8300 purchases, as the first expansion frame upgrade is disruptive.

- Load fixes on a test system prior to the production system
When applying fixes, whether critical or not, IBM strongly recommends that the fix be verified on a test system prior to the production system. We recommend that clients consider implementing an environment that supports a Development - QA - Production progression.

- Dynamic Logical Volume Expansion
Where previously we might have needed to define new volumes, migrate the data, and delete and redefine the old volumes, we can now dynamically increase logical volume size to address the need for more space or different volume sizes. Dynamic volume expansion is available for both z/OS count key data (CKD) volumes and open systems fixed block (FB) volumes. Note that typically actions will need to be taken by the host operating systems in order to access the additional space, and those actions will vary according to the operating system. The maximum DS8000 volume sizes remain the same: 2 TB for open systems fixed block volumes, and 65,520 cylinders for z/OS count key data volumes. We still recommend that, for best space utilization, volume capacity should be a multiple of the extent size.
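As a small planning aid for the extent-size recommendation above, the sketch below rounds a requested capacity up to a whole number of extents. The extent sizes used (1 GB for fixed block volumes, 1113 cylinders for CKD volumes) are the commonly documented DS8000 values and are stated here as assumptions; verify them against your configuration.

```python
import math

# Round a requested volume size up to a whole number of extents so that no
# allocated extent is partially used. Extent sizes below are assumed DS8000
# values (1 GB FB extents, 1113-cylinder CKD extents); confirm for your box.

FB_EXTENT_GB = 1
CKD_EXTENT_CYLS = 1113

def round_to_extents(requested, extent_size):
    """Smallest capacity >= requested that is a multiple of extent_size."""
    return math.ceil(requested / extent_size) * extent_size

print(round_to_extents(500, FB_EXTENT_GB))        # 500 GB, already aligned
print(round_to_extents(30000, CKD_EXTENT_CYLS))   # 30051 cylinders
```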

Logical Partitioning (LPAR) capability to distribute workloads (DS8300-9A2, 9B2)


Select DS8000 models can provide storage system LPARs. This means that you can run two completely independent virtual storage images, with differing workloads and different operating environments, within a single physical DS8000 storage subsystem. The DS8000 partitions the subsystem into two virtual storage system images. The processors, memory, adapters, and disk drives are split between the images. There is robust isolation between the two images via hardware and the POWER5 Hypervisor firmware. Initially, each storage system LPAR has access to:
- 50 percent of the processors
- 50 percent of the processor memory
- Up to 16 host adapters
- Up to 320 disk drives (up to 96 TB of capacity)
With these separate resources, each storage system LPAR can run the same or different versions of microcode, and can be used for completely separate production, test, or other unique storage environments within this single physical system. This may enable storage consolidations where separate storage subsystems were previously required, helping to increase management efficiency and cost effectiveness. Chapter 3 of the IBM TotalStorage DS8000 Series: Concepts and Architecture Redbook (SG24-6452) has a detailed description of the LPAR implementation in the DS8000 series.


Chart 5: Management Console
The Management Console is used to perform configuration, management, and maintenance activities on the DS8000. It can be ordered to be located either physically inside the base frame or external for mounting in a customer-supplied rack. If the S-HMC is not operational, it is not possible to perform maintenance, power the DS8000 up or down, or perform Copy Services tasks such as the establishment of FlashCopies. It is therefore recommended to order two management consoles to act as a redundant pair.

Uninterruptible power supply
Each frame has two primary power supplies (PPS). Each PPS produces voltages for two different areas of the machine:
- 208V is produced to be supplied to each I/O enclosure and each processor complex. This voltage is placed by each supply onto two redundant power buses.
- 12V and 5V are produced to be supplied to the disk enclosures.
If either PPS fails, the other can continue to supply all required voltage to all power buses in that frame. The PPS can be replaced concurrently.
Important: If the DS8000 is installed such that both primary power supplies are attached to the same circuit breaker or the same switchboard, then the DS8000 will not be well protected from external power failures. This is a very common cause of unplanned outages. It is recommended that the DS8000 power supplies be attached to different power sources.

Earthquake Resistant Kit (where applicable)
Helps to maintain system availability following an earthquake or other instability:
- Can help prevent injury to people and damage to equipment
- Provides earthquake resistance bracing/hardware
- Includes a rack with cross-braces on the front and rear to help prevent the rack from twisting
- Hardware secures the rack to the floor
- Feature Code 1906, installed non-concurrently (because of power/safety)
- Designed to provide Earthquake Zone 4 support

IBM Standby Capacity on Demand
Standby Capacity on Demand allows you to access extra storage capacity when you need it through a nondisruptive activity. It is a way of having more disks installed in your DS8000 than you have paid for, and it is a great option for capacity planning.


Enable Call Home and remote support

Call Home
Call Home is the capability of the DS HMC to contact IBM support services to report a problem. This is referred to as Call Home for service. The DS HMC also provides machine-reported product data (MRPD) information to IBM by way of the Call Home facility. The MRPD information includes installed hardware, configurations, and features. The storage complex uses the Call Home method to send heartbeat information to IBM, which ensures that the DS HMC will be able to initiate a Call Home to IBM in the case of an error. Should the heartbeat information not reach IBM, IBM will initiate a service call to the client to investigate the status of the DS8000. Call Home can be configured for either modem or Internet setup. The Call Home service can only be initiated by the DS HMC.

Remote support
IBM service personnel located outside of the client facility log in to the DS HMC to provide service and support. The methods available for IBM to connect to the DS HMC are configured by the IBM SSR at the direction of the client, and may include dial-up only access or access through a high-speed Internet connection. For each of the connection types, a Virtual Private Network (VPN) tunnel is established using the IP Security (IPSec) protocol, providing an end-to-end secure connection between the HMC and the IBM network.

Dial-up connection
This is a low-speed asynchronous modem connection to a telephone line. This connection typically favors small amounts of data transfer.
Note: IBM recommends that the DS HMC be connected to the client's network over a secure VPN connection, instead of a dial-up connection. Refer to Chapter 5 of the IBM TotalStorage DS8000 Series: Implementation Redbook (SG24-6786) for additional information on DS8000 remote support.

Monitor the storage subsystem status
The DS8000 has options that allow the user to receive a notification in case of a hardware problem inside the DS8000. Notification methods are e-mail, Simple Network Management Protocol (SNMP), and Service Information Message (SIM).
1.) E-mail notification: In case of a serviceable event, an e-mail is sent to all e-mail addresses you have specified in the e-mail notification worksheet.
2.) SNMP notification: In case of a serviceable event, an SNMP trap is sent to a server that you have specified in the SNMP trap notification worksheet. The service personnel can configure the HMC to either send a notification for every serviceable event, or send a notification only for those events that Call Home to IBM.
3.) SIM notification: SIM is only applicable to zSeries servers. It allows you to receive a notification on the system console in case of a serviceable event.
4.) Event log: With the user ID customer and the password cust0mer, the client's operating personnel can log on to the HMC and view the problem log of the DS8000. To navigate in the HMC GUI to the event log: Management Environment > Management Console (HMC) Hostname > Service Applications > Service Focal Point > Manage Serviceable Events.
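When setting up SNMP notification, it can be useful to confirm that traps from the HMC actually reach the designated monitoring server. The sketch below is only a throwaway connectivity check run on that server under the assumption that UDP port 162 is reachable; it does not decode the trap contents, which is the job of a real SNMP manager.

```python
import socket

# Minimal connectivity check: listen on the standard SNMP trap port and log
# the source of anything received. Binding to port 162 usually requires
# elevated privileges; use a real SNMP manager for production monitoring.

TRAP_PORT = 162

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", TRAP_PORT))
print(f"Waiting for SNMP traps on UDP port {TRAP_PORT} ...")

while True:
    data, (sender, port) = sock.recvfrom(4096)
    # The payload is BER-encoded SNMP; here we only confirm it arrived.
    print(f"Received {len(data)} bytes from {sender}:{port}")
```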

System Storage Productivity Center (SSPC)
The underlying goal of the SSPC is to consolidate the different storage management interfaces into a unified central point of control. In doing so, customers have one starting point for configuration and management of IBM storage systems. Simplifying and bringing things together at one console may help users avoid overlooking something, and from that perspective it can improve availability. SSPC provides:
- A single point of entry for element and enterprise storage management
- A reduced number of servers required to manage the storage infrastructure
- A move from element management to enterprise storage management
SSPC will be required for new DS8000 systems beginning with Release 3 (R3). For the existing install base, SSPC is optional but recommended.


Chart 6: Disk Encryption
With the DS8000 R4.2 Licensed Internal Code (LIC), the DS8000 supports the AES 128 encryption standard for all data at rest. Encryption is done at the drive level, on the DS8000 enterprise-class 146, 300, and 450 GB 15K RPM drives. Data is encrypted when it is written to the physical disk and decrypted when it is read from the physical disk. Data is not encrypted over any links when it is involved in DS8000 Copy Services; customers need to continue to use external encryption appliances or host data encryption to encrypt data replication environments. DS8000 disk encryption is fully integrated with the Tivoli Key Lifecycle Manager (TKLM). The DS8000 network attachment to the TKLM server is via TCP/IP. The DS8000 MUST be attached to at least two TKLM servers (a primary and a backup).

NTP External Time Server Support
Users can set up the HMC to synchronize to an external time server using the HMC Management -> Change Date and Time WUI task. One requirement is that it must synchronize to time servers that are stable and accurate; otherwise, large time jumps could occur and propagate through the storage system and cause many failures. NTP external time server support helps close the time gap between server and storage, which in turn helps shorten the time required for problem determination.
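Before pointing the HMC at a time server, the stability and accuracy requirement above can be spot-checked from any host that can reach the server. The sketch below sends a raw SNTP query and reports the approximate offset from local time; "ntp.example.com" is an illustrative placeholder, and offsets that jump around between runs suggest the server is not a good reference.

```python
import socket
import struct
import time

# Spot-check an NTP server's offset from local time using a raw SNTP query.
# "ntp.example.com" is a placeholder; substitute the server the HMC will use.

NTP_SERVER = "ntp.example.com"
NTP_EPOCH_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def ntp_offset_seconds(server, timeout=5):
    packet = b"\x1b" + 47 * b"\0"          # SNTP v3 client request
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
    transmit_secs = struct.unpack("!I", data[40:44])[0] - NTP_EPOCH_DELTA
    return transmit_secs - time.time()

print(f"Offset vs {NTP_SERVER}: {ntp_offset_seconds(NTP_SERVER):+.1f} s")
```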

HMC Critical Data Backup to LPARs
Backups are performed automatically every week, and 30 copies are saved. The backup includes RMC object data (SFG data) and HMC configurations; it no longer includes the code image, as was the case with DVD backup. A backup can be forced using the HMC Management -> Backup Critical Data WUI task (make sure you use the task under the Storage HMC Tasks category). A restore is started using the HMC Management -> HMC Rebuild and Recovery WUI task after the HMC has been rebuilt and updated to the desired code level.


Chart 7: Maintain Currency
A well-documented preventive maintenance strategy is an important part of maintaining and managing your storage. Regular maintenance of your storage and application of the latest fixes help to maximize storage performance, and may reduce the impact of problems if they arise. IBM recommends the following:
1) Stay as current as possible to reduce exposure to known defects
2) Understand what fixes/upgrades are in a firmware update
3) Install first on less critical systems, prior to the production system
4) Roll the new level of microcode into the environment one system at a time
5) Monitor high-impact fixes/flashes regularly
6) Install critical fixes as soon as possible
7) Test the software stack (SAN supported level, host adapter driver) when upgrading microcode, and verify against the supported interoperability matrix
8) Integrate the maintenance schedule into change control management
9) Subscribe to MySupport. Go to http://www.ibm.com/support/mySupport to register and fill out a profile for the products/services you would like to be kept informed about.
By applying maintenance frequently, one may obtain the following benefits: fewer unique LIC images to support, reduced exposure to problems fixed in non-activated LIC, and emergency fixes that can be supplied more expeditiously.
- Verify that all host paths are available before upgrading software
It is a good practice to ensure that all host paths are available before upgrading software, as missing paths may cause an application outage. Special caution needs to be taken when performing concurrent maintenance.
- Verify changes on a test system prior to the production system
When applying fixes, whether critical or not, IBM strongly recommends that the fix be verified on a test system prior to the production system. We recommend that clients consider implementing an environment that supports a Development - QA - Production progression.

Concurrent Maintenance
Concurrent Maintenance provides replacement of the following parts while the processor complex remains running:
- Disk drives
- Cooling fans
- Power subsystems
- PCI-X adapter cards
Perform concurrent maintenance operations on the storage subsystem during times of low activity. Microcode upgrades are performed by IBM support personnel.

Chart 8: Host Based Monitors and Alerts

GDPS/PPRC HyperSwap Manager
GDPS/PPRC HM is effectively a subset of GDPS/PPRC, with the emphasis being more on the remote copy and disk management aspects, but without the automation and resource management capabilities of the full GDPS/PPRC offering. GDPS/PPRC HM is designed to manage the remote copy configuration and storage subsystem(s), and to protect against data loss due to planned or unplanned primary disk outages. GDPS/PPRC HM masks disk subsystem failures by dynamically swapping from the primary disk subsystem to the secondary disk subsystem, keeping data available to end-user applications. In the event of a site disaster, the features of the PPRC (Metro Mirror) technology along with the GDPS/PPRC HM capabilities provide a consistent copy of the data. In addition to managing Metro Mirror operations, GDPS/PPRC HM also provides support for FlashCopy. When you are ready to resynchronize your primary and secondary disks following a PPRC suspension event, GDPS/PPRC HM can be set up to automatically take a FlashCopy of the secondary disks, helping to ensure that a consistent set of disks is preserved should there be a disaster during the resynchronization operation. GDPS/PPRC HM is also designed to provide users with the ability to perform planned site switches. Using the panels provided, one can control the swapping of primary and secondary disks in preparation for a planned site switch. GDPS/PPRC HM offers flexibility because it is application and data independent; as a result, users can avoid having different and complex recovery procedures for each of the database managers. GDPS/PPRC HyperSwap Manager is an IBM Implementation Services offering.
- Hardware requirements: PPRC-capable disk subsystems at PPRC Level 3 technology (disk subsystems that support extended CQUERY); 9672-G5, 9672-G6, or eServer zSeries depending on the level of z/OS
- Software requirements: z/OS with IBM Tivoli System Automation for GDPS/PPRC HM with NetView, V1.1 or later

GDPS/PPRC HyperSwap Monitors and Alerts
There is an additional aspect of remote copy management that is available with GDPS/PPRC HM, namely the ability to query and manage the remote copy environment using the GDPS panels.


GDPS/PPRC provides facilities to let you:
- Be alerted to any changes in the remote copy environment
- Display the remote copy configuration
- Stop, start, and change the direction of remote copy
- Stop and start FlashCopy
GDPS/PPRC HM does not provide script support, so all of these functions are only available through the GDPS NetView interface.

Refer to Chapter 4 of the GDPS Family: An Introduction to Concepts and Capabilities Redbook (SG24-6374) for additional information on GDPS/PPRC HyperSwap Manager.

TPC-R
The TPC for Replication component is intended to provide a single point of control for point-in-time and remote volume replication services. This module offers automated source-target matching for the volumes to be replicated, support for cross-device consistency groups, and failover and failback automation. This feature operates out-of-band, with no data passing through it. Replication management can be a complex task: between all the variants of FlashCopy and the different types of remote copy technologies, it is important to set it up correctly, especially when a customer might be setting up hundreds, if not thousands, of pairs of volumes to be replicated. Monitoring and handling error states can also be a demanding task. Using automation technology is a must in complex environments to ease setup and management of the processes. Additionally, Copy Services can be implemented differently on different platforms; automation can mask that difference to the human operator. TPC for Replication provides the ability to create and monitor thousands of copy sets in an automated fashion for DS8000s, DS6000s, SVCs, and ESS 800s. It also coordinates instantaneous FlashCopies inside supported IBM disk solutions. Customers can automatically fail over operations and then later fail back their operations. Starting with V3.3, the TPC for Replication family is designed to provide the following values:
- Administrative enhancements (role-based authorization) and several operational enhancements (site awareness and volume protection).
- Three Site Metro Global Mirror (MGM) support for the IBM DS8000. This allows for synchronous copy between two sites combined with an asynchronous copy link to a distant site (perhaps outside the regional power grid).
- A new System z-based family product designed to provide all the functions of the base TPC for Replication product and the Two Site and Three Site products, but packaged to run on System z using a mixture of FICON and TCP/IP communications to provide replication management of DS8000 and DS6000 series machines.
In today's environment, customers use many command line interface (CLI) solutions to manage their copy services. There are also various graphical user interfaces (GUIs)


to help out as well. Additionally, there are several automated solutions that could be consolidated into a single solution, reducing costs and confusion. Today, TPC for Replication is a replacement for eRCMF and Global Mirror Utilities (GMU). In the future, TPC for Replication may add other environments to its scope.

Host Based Collection Facilities

z/OS LOGREC
The z/OS error log contains data related to hardware and software errors. This data is written to the SYS1.LOGREC data set and is also written to internal storage that is included in a dump. The SYS1.LOGREC data set can be interrogated using the IFCEREP1 program, or, if the abend has triggered a dump, the EREP data can be reviewed using the IPCS VERBX LOGDATA command. Generally, the error log entries at the end of the display, if they have an influence on the problem being reviewed, will have time stamps that relate to (or immediately precede) the actual abend.

Host Based High Availability Options for Data

DFSMS data set name separation
Data set separation allows your storage administrator to designate groups of data sets to be kept separate at the physical control unit (PCU) level. Failure isolation means separate volumes, control units, storage subsystems, and paths to the controllers. The feature has been fully incorporated into z/OS starting with DFSMS V1R3. Users need to determine which critical IBM and OEM data sets should be kept separate, and create a data set separation profile for SMS. System-critical data, such as system configuration data sets, JES2 checkpoint data sets, and logging for subsystems such as DB2 or IMS, is often held in two or more copies to protect against failure. Having both data sets on the same physical control unit introduces a single point of failure. To control this, a SEPARATIONGROUP keyword is introduced when you code your SMS routines. Important: By using this function, users can ensure that if one control unit fails, the sysplex can access the data sets via another control unit. Application data may also benefit from being separated across PCUs to reduce recovery time in the event of a subsystem failure.

Distribute host connections across multiple physical adapters on the DS8000
When determining where to connect a host system to I/O ports on a host adapter of the storage facility image, the following considerations apply:
1.) Choose the attached I/O ports on different host adapters.
2.) Spread the attached I/O ports evenly between the four I/O enclosure groups.
3.) Spread the I/O ports evenly between the different RIO-G loops.
Host connections should provide multiple paths from each host to the storage.

Open Systems - Subsystem Device Driver (SDD)
The IBM Subsystem Device Driver is the standard multipathing tool for many open systems environments. SDD is a no-charge addition to the DS8000 for the Windows, AIX, HP-UX, Sun Solaris, Novell NetWare, and Linux operating systems. The IBM Subsystem Device Driver provides:
- Enhanced data availability for clients who have more than one path from their server to the DS8000. It is intended to eliminate a potential single point of failure by automatically rerouting I/O operations when a path failure occurs.
- Load balancing across the paths when there is more than one path from a system server to the DS8000. This may eliminate I/O bottlenecks that occur when many I/O operations are directed to common devices on the same I/O path.
Some operating systems and file systems natively provide similar benefits, for example, z/OS and i5/OS. At least two paths between the DS8000 and the system server are recommended. It is critical to enable the Subsystem Device Driver for its I/O load balancing capabilities when more than one path from any one server connects to the DS8000. Failure to enable this support will significantly reduce the performance and availability that can be achieved with the use of multiple paths. Downloads for the IBM Subsystem Device Driver can be obtained from the Subsystem Device Driver Web site at: http://www.ibm.com/servers/storage/support/software/sdd.html
For performance reasons, when doing multipathing, it is a recommended practice to distribute the attachment of each switch (or server, when attaching directly) across as many I/O enclosures, and within each I/O enclosure across as many DS8000 host adapters, as possible, to balance load across your I/O enclosures and adapters. AIX and Windows support MPIO, the native operating system multipath frameworks:
- AIX: SDDPCM (SDD Path Control Module), starting with 2.0.0.0
- Windows: SDDDSM (SDD Device Specific Module), starting with 2.0.0.0

z/OS - Dynamic Path Selection (DPS) / Dynamic Path Reconnect (DPR)
In the zSeries environment, the normal practice is to provide multiple paths from each host to a disk subsystem; typically, four paths are installed. The channels in each host that can access each logical control unit (LCU) in the DS8000 are defined in the HCD (hardware configuration definition) or IOCDS (I/O configuration data set) for that host. Dynamic Path Selection (DPS) allows the channel subsystem to select any available (non-busy) path to initiate an operation to the disk subsystem. Dynamic Path Reconnect (DPR) allows the DS8000 to select any available path to a host to reconnect and resume a disconnected operation, for example, to transfer data after disconnection due to a cache miss. These functions are part of the zSeries architecture and are managed by the channel subsystem in the host and by the DS8000. A physical FICON/ESCON path is established when the DS8000 port sees light on the fiber (for example, a cable is plugged in to a DS8000 host adapter, a processor or the DS8000 is powered on, or a path is configured online by OS/390). At this time, logical paths are established through the port between the host and some or all of the LCUs in the DS8000, controlled by the HCD definition for that host. This happens for each physical path between a zSeries CPU and the DS8000. There may be multiple system images in a CPU, and logical paths are established for each system image. The DS8000 then knows which paths can be used to communicate between each LCU and each host.
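The host connection distribution guidance earlier in this chart (spread I/O ports across host adapters, enclosure groups, and RIO-G loops) amounts to a round-robin assignment of host connections across the available hardware. The sketch below illustrates one way to plan such a spread; the enclosure and adapter counts and the host names are illustrative assumptions, not a DS8000 inventory.

```python
from itertools import cycle

# Illustrative planning aid: spread host connections round-robin across
# I/O enclosure groups first, then across the adapters within them, so no
# single enclosure or adapter carries a disproportionate share of the paths.
# Counts and host names below are assumptions for this sketch.

ENCLOSURE_GROUPS = 4
ADAPTERS_PER_GROUP = 2

slots = cycle(
    (group, adapter)
    for adapter in range(ADAPTERS_PER_GROUP)
    for group in range(ENCLOSURE_GROUPS)
)

host_connections = [f"host{h}_path{p}" for h in range(1, 4) for p in range(2)]

for connection, (group, adapter) in zip(host_connections, slots):
    print(f"{connection:12s} -> I/O enclosure group {group}, adapter {adapter}")
```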


System i
The following are available to users in the System i environment:
- DSCLI commands executed through the i5/OS interface
- Copy Services for System i Toolkit
- A combination of iSeries Navigator and the 5250 interface

Chart 9: Duplicate Storage Subsystems in a Campus or on the Same Data Center Floor
- Metro Mirror can be used for data redundancy, to enable a quick re-IPL.
- This requires automation software such as TPC-R or GDPS.
- IBM Softek TDMF can be used to move data around in real time.

Transparent Data Migration Facility (TDMF) migrates data globally and locally at the block and volume level, in both mainframe and open systems environments, without disruption. TDMF:
- Migrates data in the background while maintaining application availability
- Helps maintain data integrity and allows point-in-time copies of data to be available at any time after initial synchronization
- Enables non-disruptive dynamic switchover
- Operates without prerequisite software or hardware
- Is available in the form of a software license, a project service engagement, or an element of a managed service agreement

z/OS Softek TDMF
TDMF provides concurrent full-volume migration capabilities, which are best described as software-based remote copy functions for migration that allow a controlled switchover to the new target volume. As a general rule, these might be considered when the number of volumes to be migrated is in the hundreds rather than in the range of thousands. With large migration tasks, the number of volumes has to be broken down into smaller volume sets so that the migration can happen in a controlled fashion. This lengthens the migration period, so if possible, other approaches might be considered. The software product is usually associated with license fees or service-based fees. When the number of volumes is in the range of up to a few hundred, standard software like DFSMSdss is an option. However, DFSMSdss-based migration does not automatically switch over to the target volumes and usually requires some weekend effort to complete. DFSMSdss is standard software and part of z/OS, so there are no extra software costs.

Softek TDMF Unix
Softek Transparent Data Migration Facility Unix is a host-based data migration product designed to move data locally. Softek TDMF Unix is excellent for local data migrations. It supports IBM AIX, Sun Solaris, and HP-UX environments.


Softek TDMF UNIX (IP)
Softek Transparent Data Migration Facility UNIX (IP) is a host-based data migration product designed to move data without disruption over distance in UNIX environments. It supports the IBM AIX, HP-UX, Sun Solaris, and Red Hat Enterprise Linux operating systems.

Softek TDMF Windows (IP)
Softek Transparent Data Migration Facility Windows (IP) is a host-based data migration product designed to move data over TCP/IP connections. Softek TDMF Windows (IP) is excellent for local or distance data migrations across the network.

For more information visit: ibm.com/services/storage/migration

Chart 10

System Storage Enterprise Disk Practices Advanced Functions Overview for High Availability
Disk storage replication is used to protect critical business data. This chart provides a high-level overview of the IBM DS8000 Advanced Functions known as FlashCopy, Metro Mirror, and Global Mirror. We will discuss these types of mirroring in more detail in the subsequent charts. Fundamentally, clients use DS8000 Advanced Copy Services for two types of protection:
- Data corruption (human or application errors), forward recovery: Point-in-time Copy
- Data loss (physical destruction, accidental erasure): Remote Copy or Point-in-time Copy
Detailed planning information for the usage of IBM Advanced Copy Services on the DS8000, DS6000, and ESS can be found by downloading the following IBM System Storage Redbooks from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com
SG24-6787 DS8000 Copy Services in System z Environment
SG24-6788 DS8000 Copy Services in Open Environment
Also, a detailed PowerPoint tutorial with animation on the functionality of the DS8000, DS6000, and ESS Copy Services can be found on the following WW Technical Support internal IBM web site: http://snjgsa.ibm.com/~singj/public/Tutorials
Look for the file "How Disk Mirroring Really Works." The author is John Sing, John Sing/San Jose/IBM, singj@us.ibm.com.


Chart 11

System Storage Enterprise Disk Practices Point in Time Internal Data Replication
Here are considerations relating to IBM Point-in-time Copy (FlashCopy):
- FlashCopy provides fast volume copy (production resumes quickly) by making a copy of the pointers (track tables) in storage controller cache memory.
- It is generally used to back up data to tape, clone a production database, or provide a fallback position for changes.
- The FlashCopy target volume must be available and fully provisioned with storage. In other words, if the FlashCopy source volume is 3 GB in size, then an equivalent 3 GB target volume / LUN must be allocated and fully provisioned as well. Note that Space Efficient FlashCopy is not available on the SVC, DS6000, and ESS.
- To support data integrity of the point-in-time copy, databases and applications require a temporary quiesce or write suspension (flushing of buffers to disk) prior to taking the disk point-in-time copy. This supports proper updating of internal database/application logical metadata.
- Appropriate planning should be done to ensure that the storage controller can concurrently handle both the incoming I/O workload and the necessary FlashCopy background activity.
- FlashCopy onto in-session remote mirrored volumes should be used with appropriate planning and appropriate caution. Typically, remote mirrored volumes are used for unplanned outage protection; using FlashCopy onto a remote mirrored volume introduces planning considerations, as there is a time lag before the DS8000 point-in-time copy can be copied to the remote site. If there is an unplanned outage before the point-in-time copy is forwarded to the remote site, the remote site will be out of sync with the local site. Procedures to avoid or circumvent this situation should be planned and implemented.
For more information on planning the usage of IBM FlashCopy on the DS8000, DS6000, and ESS, download the following IBM System Storage Redbooks from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com
SG24-6787 DS8000 Copy Services in System z Environment


SG24-6788 DS8000 Copy Services in Open Environment Also, a detailed PowerPoint Tutorial with animation, on the functionality of the DS8000, DS6000, and ESS Copy Services can be found on the following WW Technical Support internal IBM web site, at: http://snjgsa.ibm.com/~singj/public/Tutorials Look for the file How Disk Mirroring Really Works. The author is John Sing, John Sing/San Jose/IBM, singj@us.ibm.com

Chart 12: FlashCopy Space Efficient (SE) - Point in Time Internal Data Replication
Space Efficient volumes do not require physical storage until updates occur to the FlashCopy source volumes.
- A small amount of space is used for each volume (0.5%), plus space for a VTOC and index for zSeries volumes.
- Updates to the source are copied to the repository that is shared by the SE volumes in the same extent pool. This is appropriate if only small amounts of data will be written to the source, for example a backup taken before dumping to tape.
- Since all the volumes use or reuse the same physical repository space, this results in a more efficient use of available physical storage than using normally provisioned volumes.
- The repository must have enough physical space to contain all updates that occur during the existence of all the FlashCopy relationships.
- Space Efficient volumes should be used for FlashCopy relationships that are temporary (less than 24 hours) with limited update activity (less than 20%).
- The physical space is made available for reuse by other volumes when the FlashCopy relationship is withdrawn.
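As a rough planning aid for the repository sizing point above, the sketch below estimates the space consumed by a set of SE relationships from the expected update rate of each source volume plus the 0.5% per-volume overhead mentioned in these notes. The volume sizes and update fractions are illustrative assumptions.

```python
# Rough repository sizing sketch for Space Efficient FlashCopy: the shared
# repository must hold the tracks updated on each source while its
# relationship exists, plus roughly 0.5% per volume of overhead (per the
# notes above). Volume sizes and update fractions are assumptions.

PER_VOLUME_OVERHEAD = 0.005   # 0.5% of each volume

source_volumes_gb = {          # name: (size in GB, expected update fraction)
    "DB_LOGS_01": (100, 0.15),
    "DB_DATA_01": (400, 0.05),
    "APP_BIN_01": (50, 0.02),
}

repository_gb = sum(
    size * (update_fraction + PER_VOLUME_OVERHEAD)
    for size, update_fraction in source_volumes_gb.values()
)

print(f"Estimated repository space needed: {repository_gb:.1f} GB")
```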


Chart 13

System Storage Enterprise Disk Practices Metro Mirror - Synchronous Data Replication
Metro Mirror is fully synchronized disk mirroring at the volume level. Because the mirroring is fully synchronous, and both sides are therefore in lock step with each other, it is possible to implement hot-swap types of functions on top of the synchronized mirroring; System z GDPS HyperSwap is one such function. In any disk mirroring environment, it is essential to perform an analysis of the write workload in order to properly configure and balance the controller and networking infrastructure that will connect the two disk mirroring subsystems.
DS8000 Metro Mirror Consistency Groups must have external automation and control software in order to function properly. This external automation and control software runs on a server attached to the Metro Mirror configuration, and can be either custom-written DS8000 command line interface scripts or IBM-supplied external automation and control software such as TotalStorage Productivity Center for Replication.
When doing disk mirroring, it is customary best practice to have an additional copy of storage at the remote site, for the following reasons:
- Provide storage for an ongoing testing environment that can be used concurrently while the mirroring continues to run
- Preserve a golden copy for protection against errors during a restart
- Preserve a golden copy for protection against errors during a resync
Best practice would be that for every production TB being mirrored, one would have 2 TB of storage at the remote site: 1 TB would be the receiving mirrored volume, providing unplanned outage protection; the other 1 TB would be used for testing and golden copy preservation.
Some notes about Metro Mirror on the DS8000:
- Synchronous writes provide zero data loss and faster recovery at the remote site


- Adds protocol overhead (<1 ms) plus 1 ms per 100 km to each write operation
- Used for critical applications that require no data loss
- Up to 300 km distance on the DS8000, DS6000, and ESS
- Up to 100 km on the SVC
- The average reasonable acceptable maximum distance on the DS4000 and N series is in the range of 10-50 km
- Note that DS4000 Metro Mirror does not provide consistency group support (if the DS4000 requires consistency groups, use DS4000 Global Mirror)
For more information on planning the usage of IBM Metro Mirror on the DS8000, DS6000, and ESS, download the following IBM System Storage Redbooks from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com
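The rule of thumb above (protocol overhead of under 1 ms plus roughly 1 ms per 100 km) can be turned into a quick estimate of synchronous write elapsed time at a given distance, as in the sketch below. The local write service time and the overhead value used are illustrative assumptions.

```python
# Quick estimate of per-write latency added by Metro Mirror, using the rule
# of thumb stated above: protocol overhead (<1 ms, assumed 0.5 ms here) plus
# about 1 ms for every 100 km of distance. The 1.0 ms local write service
# time is an illustrative assumption.

LOCAL_WRITE_MS = 1.0        # assumed local (unmirrored) write service time
PROTOCOL_OVERHEAD_MS = 0.5  # assumed; the notes state it is below 1 ms

def metro_mirror_write_ms(distance_km):
    added = PROTOCOL_OVERHEAD_MS + distance_km / 100.0
    return LOCAL_WRITE_MS + added

for km in (10, 50, 100, 300):
    print(f"{km:3d} km: ~{metro_mirror_write_ms(km):.2f} ms per write")
```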

Chart 14: System Storage Enterprise Disk Practices Global Mirror - Asynchronous Data Replication
DS8000 Global Mirror is asynchronous disk mirroring at the volume level. Because the mirroring is asynchronous, the secondary site data will lag the primary by some number of seconds or minutes, depending on the capacity of the network. DS8000 Global Mirror has the requirement that for every volume that is being Global Mirrored, there must be two remote volumes: a receiving volume and a Consistency Group volume. Therefore, the minimum requirement is to have 2x the number of volumes available at the secondary. In addition, if the DS8000 user wishes to do concurrent testing while Global Mirror is running, a third volume is needed for that testing space. So, remember to configure enough remote site disk!
In any disk mirroring environment, it is essential to perform an analysis of the write workload in order to properly configure and balance the controller and networking infrastructure that will connect the two disk mirroring subsystems. At the same time, similar speed and throughput characteristics on source and target volumes can provide optimum performance.
It is highly recommended to have external automation and control software to best manage the IBM DS8000 Global Mirror configuration. This external automation and control software runs on a server attached to the Global Mirror configuration, and can be either custom-written DS8000 command line interface scripts or IBM-supplied external automation and control software such as TotalStorage Productivity Center for Replication.
When doing disk mirroring, it is customary best practice to have an additional copy of storage at the remote site, for the following reasons:
- Provide storage for an ongoing testing environment that can be used concurrently while the mirroring continues to run


- Preserve a golden copy for protection against errors during a restart
- Preserve a golden copy for protection against errors during a resync
Best practice would be that for every production TB being Global Mirrored, one would have 3 TB of storage at the remote site: 1 TB would be the receiving mirrored volume, providing unplanned outage protection; 1 TB would be used for the data consistent (Consistency Group) volumes; and the final 1 TB would be used for testing and golden copy preservation.
For more information on planning the usage of IBM Global Mirror on the DS8000, DS6000, and ESS, download the following IBM System Storage Redbooks from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com
SG24-6787 DS8000 Copy Services in System z Environment
SG24-6788 DS8000 Copy Services in Open Environment
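The 2 TB-per-production-TB rule for Metro Mirror (Chart 13) and the 3 TB-per-production-TB rule for Global Mirror above can be wrapped into a small capacity-planning helper like the sketch below. The production capacity figure is an illustrative assumption.

```python
# Remote-site capacity estimate based on the best-practice rules in these
# notes: Metro Mirror needs ~2 TB at the remote site per production TB
# (receiving copy + test/golden copy); Global Mirror needs ~3 TB (receiving
# copy + consistency group copy + test/golden copy). The production
# capacity below is an illustrative assumption.

REMOTE_TB_PER_PRODUCTION_TB = {
    "Metro Mirror": 2,
    "Global Mirror": 3,
}

production_tb = 40  # assumed mirrored production capacity

for technology, factor in REMOTE_TB_PER_PRODUCTION_TB.items():
    print(f"{technology}: plan ~{production_tb * factor} TB at the remote site")
```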

Chart 15

System Storage Enterprise Disk Practices Global Mirror (XRC) - Asynchronous Data Replication
z/OS Global Mirror (XRC) is a premium, System z-based storage replication capability. z/OS Global Mirror (XRC) uses System z server cycles to move the data to the remote site, and supports only System z-based data. XRC provides a unique, large-scale, mission-critical proven capability to do asynchronous disk mirroring over long distances, and is typically employed for large System z mission-critical workloads. Many of the planning considerations are similar to those for Metro Mirror and Global Mirror. In addition, z/OS Global Mirror (XRC) requires proper planning of the System z cycles for data movement. For more information on planning the usage of IBM z/OS Global Mirror (XRC) on the DS8000, DS6000, and ESS, download the following IBM System Storage Redbook from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com
SG24-6787 DS8000 Copy Services in System z Environment
Also, a detailed PowerPoint tutorial with animation on the functionality of the DS8000, DS6000, and ESS Copy Services can be found on the following WW Technical Support internal IBM web site: http://snjgsa.ibm.com/~singj/public/Tutorials
Look for the file "How Disk Mirroring Really Works." The author is John Sing, John Sing/San Jose/IBM, singj@us.ibm.com.


Chart 16: System Storage Enterprise Disk Practices Three Site Replication
As the need for more resiliency has accelerated, many clients have requirements for the next advanced level above two-site recovery. There are several main scenarios for which three-site replication is needed:
- The customer desires to avoid any data loss, yet requires out-of-region recovery capability as well. In this case, the client will bunker to an intermediate local synchronous site, and then go from there out to a long-distance remote site. In this scenario, if the production site fails, the theory is that the intermediate site will survive and thus will be able to propagate the zero-data-loss image of the data to the remote long-distance site.
- An existing metro-distance two-site recovery configuration wishes to maintain the hot-swap capability of two-site Metro Mirror, but also needs to add out-of-region recovery capability.
- An existing out-of-region asynchronous two-site recovery configuration wishes to add local Metro Mirror high availability.

For all of the above cases, the fundamental requirements for three-site replication then become:
- Fast failover / failback to any site
- Fast re-establishment of three-site recovery, without production outages
- Quick resynchronization of any site with incremental changes only

Links and bandwidth are assumed between all sites. As two-site recoveries are not inexpensive, in order to cost-justify three-site recovery, we should assume the following as co-requisites:
- Automated failover/failback is already in place for the two-site configuration


- Full tertiary copy capability for testing, problem determination, validation, and automation is already implemented
- Ongoing WAN / bandwidth / workload capacity planning is in place to allow proper sizing of the three-site environment

For more information on planning the usage of IBM Three Site replication on the DS8000, DS6000, ESS, download the following IBM System Storage Redbooks from the IBM ITSO Redbook web site: http://www.redbooks.ibm.com SG24-6787 DS8000 Copy Services in System z Environment SG24-6788 DS8000 Copy Services in Open Environment

Chart 17

System Storage Enterprise Disk Practices Management of Replication


Achieving cost-effective, provable, testable, reliable, repeatable recovery in any of these disk mirroring environments fundamentally requires good automation. Automation provides the ability to combine multiple HA technologies and practices (data management, availability management, hardware and software management, clustering, and network management) into a single point of control. Without this single point of control, achieving provable, testable, reliable, repeatable recovery is difficult at best and impossible in many circumstances. The recommended automation software for the three major environments is listed on this page.
For more information, the following are GDPS URLs:
External web site: http://www-03.ibm.com/systems/z/gdps/
Internal IBM web site: http://bvrgsa.ibm.com/projects/g/gdpsweb/index.html?0
For information on GDOC, see the following internal IBM System Sales URL: http://w31.ibm.com/sales/systems/portal/_s.155/254?navID=f320s260&geoID=All&prodID=System Storage&docID=sstlOpenSysBusContSolFAQ
For information on TPC for Replication, see the following URL for the TPC Sales Kit: http://w31.ibm.com/sales/systems/portal/_s.155/254?navID=f220s240&geoID=AM&prodID=System Storage&docID=totstorproductcentsk.skit&docType=SalesKit&skCat=DocumentType


The following URL is for the TPC-R technical support external web page: http://www-304.ibm.com/jct01004c/systems/support/supportsite.wss/supportresources?taskind=1&brandind=5000033&familyind=5329733
Also, a self-running Windows .avi demonstration of TPC for Replication V3R3 may be found at: http://snjgsa.ibm.com/~singj/

