ibm.com/redbooks
SAN Storage Performance Management Using Tivoli Storage Productivity Center
International Technical Support Organization
September 2011
SG24-7364-02
Note: Before using this information and the product it supports, read the information in "Notices" on page ix.
Third Edition (September 2011) This edition applies to Version 4, Release 2 Modification 3 of IBM Tivoli Storage Productivity Center (product number 5608-VC0).
Copyright International Business Machines Corporation 2009, 2011. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Notices
Trademarks
Preface
The team that wrote this book
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Summary of changes
September 2011, Third Edition

Part 1. Storage performance management concepts
Chapter 1. Performance management concepts
1.1 Performance management fundamentals
1.2 Environmental norms
1.3 Storage subsystem architecture
1.3.1 High-level component diagram of storage devices
1.3.2 Disk storage subsystem
1.3.3 Storage virtualization device
1.3.4 Comparison of a disk storage device and a virtualization device
1.3.5 Data path from your application to the storage
1.4 Native Storage System Interface (Native API)
1.5 Standards
1.5.1 SNIA
1.5.2 CIMOM and CIM agent
1.5.3 SMI-S standards
1.6 Performance issue factors
1.6.1 Types of problems
1.6.2 Workloads
1.6.3 Server types
1.6.4 Running servers in a virtualized environment
1.6.5 Understanding basic performance configuration

Part 2. Sizing and scoping your Tivoli Storage Productivity Center environment for performance management
Chapter 2. Tivoli Storage Productivity Center requirements for performance management
2.1 Determining what Tivoli Storage Productivity Center needs
2.1.1 Tivoli Storage Productivity Center licensing options
2.1.2 Tivoli Storage Productivity Center components
2.1.3 Tivoli Storage Productivity Center server recommendations
2.1.4 Tivoli Storage Productivity Center database considerations
2.1.5 Tivoli Storage Productivity Center database repository sizing formulas
2.1.6 Database placement
2.1.7 Selecting an SMS or DMS table space
2.1.8 Best practice recommendations for the TPCDB design
2.1.9 GUI versus CLI
2.1.10 Tivoli Storage Productivity Center instance guidelines
2.2 SSPC considerations
2.3 Configuration data collection methods
2.3.1 Storage Resource Agents
2.3.2 Storage Server Native API
2.3.3 CIMOMs
2.3.4 Version control for fabric, agents, subsystems, and CIMOMs
2.4 Performance data collection
2.4.1 Performance data collection tasks: Overview
2.4.2 Performance data collection tasks: Considerations
2.5 Case study: Defining the environment
2.5.1 Case study 1: Basics
2.5.2 Tivoli Storage Productivity Center basics
Part 3. Performance management with Tivoli Storage Productivity Center
Chapter 3. General performance management methodology
3.1 Overview and summary of performance evaluation
3.1.1 Main objectives of performance management
3.1.2 Metrics
3.2 Performance management approach
3.2.1 Performance data classification
3.2.2 Rules of Thumb
3.2.3 Quickstart performance metrics
3.2.4 Performance metric guidelines
3.3 Creating a baseline with Tivoli Storage Productivity Center
3.4 Performance data collection
3.4.1 Planning
3.4.2 Prerequisite tasks
3.4.3 Defining the performance data collection jobs
3.4.4 Defining the alerts
3.4.5 Defining the data retention
3.4.6 Running performance data collection
3.5 Tivoli Storage Productivity Center performance reporting capabilities
3.5.1 Reporting compared to monitoring
3.5.2 Predefined performance reports
3.5.3 Customized reports
3.5.4 Batch reports
3.5.5 Constraint Violations reports
3.6 Tivoli Storage Productivity Center configuration history
3.6.1 Viewing configuration changes in the graphical view
3.6.2 Viewing configuration changes in the table view
3.7 Tivoli Storage Productivity Center administrator tasks
3.7.1 Using Configuration Utility to verify everything is running as expected
3.7.2 Verifying that Discovery, probes, and performance monitors are running
3.7.3 Setting system-wide thresholds
3.7.4 Defining additional reports and thresholds
3.7.5 Regularly reviewing the incoming alerts
3.7.6 Using constraint violation reports
3.7.7 Using the Topology Viewer
3.7.8 Using the Data Path Explorer
3.7.9 Configuring automatic snapshots, then exploring Change History
Chapter 4. Using Tivoli Storage Productivity Center for problem determination
4.1 Problem determination lifecycle
4.2 Problem determination steps
4.2.1 Identifying acceptable base performance levels
4.2.2 Understanding your configuration
4.3 Volume information
4.3.1 Determining the subsystem configuration
4.3.2 DS8000 information
4.3.3 IBM SAN Volume Controller (SVC) or Storwize V7000
4.3.4 DS5000 information
4.3.5 XIV information
4.3.6 Determining what your baselines are
4.3.7 Determining what your SLAs are
4.3.8 General considerations about the environment
4.3.9 Problem perception considerations
4.3.10 Keeping track of the changes
4.4 Common performance problems
4.5 Deciding what can be done to prevent or solve issues
4.5.1 Dedicating plenty of resources, with storage isolation
4.5.2 Spreading work across many resources
4.5.3 Choosing the proper disk type and sizing
4.5.4 Monitoring performance
4.6 SVC considerations
4.6.1 SVC traffic
4.6.2 SVC best practice recommendations for performance
4.7 Storwize V7000 considerations
4.7.1 Storwize V7000 traffic
4.7.2 Storwize V7000 best practice recommendations for performance

Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
5.1 Data analysis: Top 10 reports
5.2 Top 10 reports for disk subsystems
5.2.1 Top 10 for Disk #1: Subsystem Performance report
5.2.2 Top 10 for Disk #2: Controller Performance reports
5.2.3 Top 10 for Disk #3: Controller Cache Performance reports
5.2.4 Top 10 for Disk #4: Array Performance reports
5.2.5 Top 10 for Disk #5-9: Top Volume Performance reports
5.2.6 Top 10 for Disk #10: Port Performance reports
5.2.7 IBM XIV Module Cache Performance report
5.3 Top 10 reports for SVC and Storwize V7000
5.3.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports
5.3.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports
5.3.3 Top 10 for SVC #3: Managed Disk Group performance reports
5.3.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports
5.3.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports
5.4 Reports for fabric and switches
5.4.1 Switches reports: Overview
5.4.2 Top Switch Port Data Rate performance
5.5 Case study: Server - performance problem with one server
5.6 Case study: Storwize V7000 - disk performance problem
5.7 Case study: Top volumes response time and I/O rate performance report
5.8 Case study: SVC and Storwize V7000 performance constraint alerts
5.9 Case study: IBM XIV Storage System workload analysis
5.10 Case study: Fabric - monitor and diagnose performance
5.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration
5.11.1 Ensuring that all SVC ports are online
5.11.2 Verifying SVC port zones
5.11.3 Verifying paths to storage
5.11.4 Verifying host paths to the Storwize V7000

Chapter 6. Using Tivoli Storage Productivity Center for capacity planning management
6.1 Capacity planning and performance management
6.1.1 Capacity planning overview
6.1.2 Performance management overview
6.1.3 Capacity planning reporting
6.2 Performance of a storage subsystem
6.2.1 SVC and Storwize V7000
6.2.2 Storage subsystems
6.2.3 Fabric

Appendix A. Rules of Thumb and suggested thresholds
Rules of Thumb summary
Response Time Threshold
CPU Utilization Percentage Threshold
Disk Utilization Threshold
FC: Total Port Data Rate Threshold
Overall Port Response Time Threshold
Cache Holding Time Threshold
Write-Cache Delay Percentage Threshold
Back-End Read and Write Queue Time Threshold
Port to Local Node Send/Receive Response Time Thresholds
Port to Local Node Send/Receive Queue Time Threshold
Non-Preferred Node Usage
CRC Error Rate Threshold
Zero Buffer Credit Threshold
Link Failure Rate and Error Frame Rate Threshold
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Performance metric collection
Counters
Essential metrics
Reports under the Disk Manager
Reports under the Fabric Manager
New FC port performance metrics and thresholds in the Tivoli Storage Productivity Center 4.2.1 release
Metrics
Thresholds
Tivoli Storage Productivity Center performance metrics
Common columns
XIV system metrics
Volume-based metrics
Back-end-based metrics
Front-end and fabric-based metrics
Tivoli Storage Productivity Center performance thresholds
Threshold boundaries
Setting the thresholds
Array thresholds
Controller thresholds
Port thresholds

Appendix C. Reporting with Tivoli Storage Productivity Center
Using SQL
SQL: Table views
SQL: Example query of the XIV performance table view
CLI: TPCTOOL as a reporting tool
Tivoli Storage Productivity Center: Batch report
Tivoli Storage Productivity Center: Batch report example

Related publications
IBM Redbooks publications
Other publications
Online resources
How to get Redbooks publications
Help from IBM

Index
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX, Cognos, DB2, DS4000, DS6000, DS8000, Enterprise Storage Server, FICON, FlashCopy, IBM, Lotus, POWER4, PowerVM, Redbooks, Redpaper, Redbooks (logo), Symphony, System p, System Storage, Tivoli Enterprise Console, Tivoli, TotalStorage, XIV
The following terms are trademarks of other companies:

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Microsoft, Windows NT, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Snapshot, NetApp, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.

Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Disk Magic, and the IntelliMagic logo are trademarks of IntelliMagic BV in the United States, other countries, or both.

Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates.

QLogic, and the QLogic logo are registered trademarks of QLogic Corporation. SANblade is a registered trademark in the United States.

VMware, the VMware "boxes" logo and design are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
Preface
IBM Tivoli Storage Productivity Center is an open storage infrastructure management solution designed to help reduce the effort of managing complex storage infrastructures, to help improve storage capacity utilization, and to help improve administrative efficiency. Tivoli Storage Productivity Center can manage performance and connectivity from the host file system to the physical disk, including in-depth performance monitoring and analysis on SAN fabric performance. In this IBM Redbooks publication, we show you how to use Tivoli Storage Productivity Center reporting capabilities to manage performance in your storage infrastructure.
Daniel Frueh is an Advisory IT Specialist in GTS Services Delivery Austria & Switzerland. He has nine years of experience in the Open Storage field. He holds a degree in Computer Science from the University of Rapperswil. His areas of expertise include Tivoli Storage Manager, Tivoli Storage Productivity Center, SVC, IBM DS8000, IBM DS6000, SAN, and N series.
Paolo D'Angelo is a Certified IT Architect working in Global Technology Services in Rome, Italy. He has worked at IBM for 12 years, and has 10 years of experience in the Open Storage and Storage Management areas. Paolo's areas of expertise, both in design and implementation, include Storage Area Network, Data Migration, Storage Virtualization, Tivoli Storage Manager, Tivoli Storage Productivity Center, and Open Storage design and implementation.
Lloyd Dean is an IBM Senior Certified IT Architect in IBM S&D, and a Distinguished Chief/Lead Certified Open Group IT Architect. He provides pre-sales technical support within S&D as a Storage Solution Lead Architect throughout the Eastern United States. Lloyd has over 31 years of IT experience, with over 15 years in the storage field. He has held many leadership positions within IBM, focused on storage solution design, implementation, and storage service management. Lloyd has over eight years of extensive experience with both the SAN Volume Controller and Tivoli Storage Productivity Center. He has written a number of white papers on using Tivoli Storage Productivity Center to support SVC performance management, has authored several presentations on Tivoli Storage Productivity Center best practices, and has presented sessions at many IBM storage conferences, including STGU and the IBM System Storage Storage and Networking Symposium.

Thanks to the following people for their contributions to this project:

Alex Osuna, Mary Lovelace, Bertrand Dufrasne, Sangam Racherla, Ann Lund
International Technical Support Organization (ITSO)

John Hollis, Brian Smith
Advanced Technical Support, United States

Brian De Guia, Hope Rodriquez, Jeffrey McCallum, Nitu Shinde
Tivoli Storage Software Test, United States

Gary Williams, Stefan Jaquet
IBM Software Group, Tivoli, United States

Katherine Keaney
Tivoli Storage Productivity Center, Software Development, Project Manager

Xin Wang
Tivoli Storage Productivity Center, Software Development, Product Manager

Barry Whyte
IBM Systems & Technology Group, Virtual Storage Performance Architect

David Whitworth, Sonny Williams
IBM Storage Performance
Thanks also to the authors of the previous editions of this book. The authors of the second edition, SAN Storage Performance Management Using Tivoli Storage Productivity Center, published in June 2009, were: Mary Lovelace, Mark Blunden, Lloyd Dean, Paolo D'Angelo, and Massimo Mastrorilli.
Comments welcome
Your comments are important to us! We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
- Use the online Contact us review Redbooks publications form found at: ibm.com/redbooks
- Send your comments in an email to: redbooks@us.ibm.com
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYTD Mail Station P099, 2455 South Road, Poughkeepsie, NY 12601-5400
Summary of changes
This section describes the technical changes made in this edition of the book since the second edition, which was published in June 2009 and covered Tivoli Storage Productivity Center V3.3.2. This edition might also include minor corrections and editorial changes that are not identified. Summary of Changes for SG24-7364-02, SAN Storage Performance Management Using Tivoli Storage Productivity Center, as created or updated on September 7, 2011.
New information
The following new information is provided. This book has been updated to the Tivoli Storage Productivity Center V4.2.1 level. Documentation and case studies have been added and updated to guide you through the problem determination process using standard Tivoli Storage Productivity Center reports, including new storage subsystems and functionality added since V3.3.2. Some of the key highlights are:
- Support for new storage subsystems:
  - IBM Storwize V7000: Storwize V7000 offers IBM storage virtualization, SSD optimization, and thin provisioning technologies built in to improve storage utilization.
  - IBM XIV Storage System: Tivoli Storage Productivity Center supports performance monitoring and provisioning for XIV storage systems through the native interface.
  - IBM System Storage SAN Volume Controller Version 6.1.
- With the Block Server Performance (BSP) subprofile, Tivoli Storage Productivity Center is additionally able to identify SMI-S certified disk storage subsystems from vendors other than IBM. For a complete list of supported storage, see the IBM Support Portal website for the latest Tivoli Storage Productivity Center interoperability matrix: https://www-01.ibm.com/support/docview.wss?uid=swg21386446
- Native storage system interfaces provided for DS8000, SAN Volume Controller, IBM Storwize V7000, and XIV storage systems.
- Storage Resource agents: The Storage Resource agents now perform the functions of the Data agents and Fabric agents. Out-of-band Fabric agents are still supported and their function has not changed.
Performance Manager enhancements:
- New performance metrics, counters, and thresholds for DS8000, SAN Volume Controller, and Storwize V7000
- XIV storage system enhancements

For Tivoli Storage Productivity Center help and release details, see:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp?topic=/com.ibm.itpc.doc_2.1/tpc_infocenter_home.htm
Part 1. Storage performance management concepts

Chapter 1. Performance management concepts
Each of the following components affects the performance, and therefore the baseline, of your environment:
- Cache
- HBA
- Bandwidth
- Firmware
The different RAID types that you use also affect your subsystem performance. RAID types affect these areas:
- LUNs
- Parity penalties
- Zoning types (soft versus hard, ISL)

The use of multipath software is critical to both performance and availability in a Fibre Channel environment, because it helps provide better throughput and redundancy for your application I/Os. Multipath software products vary in the way they perform, so it is critical that you understand your multipath software features to determine what is appropriate for your setup to get maximum performance. In 1.6.2, Workloads on page 17, we describe the different application workloads that you might have in your environment. See this section for an understanding of the implications. After a baseline has been generated, you can set your thresholds within Tivoli Storage Productivity Center so that if there is any exceptional behavior, Tivoli Storage Productivity Center can trigger an alert and notify you that something has happened. In 3.7.3, Setting system-wide thresholds on page 128, we show you how to customize your thresholds and alerts. In Known limitations on page 84, we explain some of the threshold limitations.
We do not include diagrams for a SAN switch, because Tivoli Storage Productivity Center really only monitors one component, the ports. Currently, Tivoli Storage Productivity Center cannot monitor the performance of a tape library or tape drives, therefore, we do not show any diagrams for these devices. At the present time, your only option is to monitor the SAN switch ports connected to a tape drive.
(Figure 1-1: High-level component diagram of a storage subsystem, showing the reporting levels By Subsystem, By Controller, By Array, and By Port, with the front-end ports, two controllers with read and write cache plus mirrored write cache, and the arrays.)
Subsystem
On the subsystem level, you see metrics that have been aggregated from multiple records into a single value per metric, giving you a high-level view of the performance of your storage subsystem based on the metrics of its other components. Depending on the metric, aggregation is done either by summing values or by calculating derived values.
Cache
In Figure 1-1, we point out the cache and we call this a subcomponent of the subsystem, because the cache plays a crucial role in the performance of any storage subsystem. You do not find the cache as a selection in the Navigation Tree in Tivoli Storage Productivity Center, but there are available metrics that give you information about your cache. The amount of available information, or available metrics, depends on the type of subsystem involved, as well as the information provided by the native storage system interfaces (NAPI) or by the CIM agent (SMI-S agent) if the subsystem uses that interface. See 1.4, Native Storage System Interface (NAPI) on page 12 for details on NAPI. See 1.5, Standards on page 13 for details on the standards that determine the performance data that is collected and used by Tivoli Storage Productivity Center, for SNIA and for the CIMOM and CIM agents. Cache metrics are available in the following report types and levels:
- Subsystem
- Controller
- I/O group
- Node
- Array
- Volume
Ports
The port information is for the front-end ports to which the hosts or SAN attach. Certain subsystems might aggregate multiple ports onto one port card. The SMI-S standards do not reflect this aggregation, and therefore, Tivoli Storage Productivity Center does not show any grouping of ports. This is important to know, because port cards can sometimes be a bottleneck: their bandwidth does not always scale with the number of ports and their speeds. When you look at the report, the numbers per port might not seem to indicate a problem, but if you total the numbers for all ports on one port card, you might see it differently. Details for individual ports are available for viewing under Disk Manager → Reporting → Storage Subsystem Performance → By Port in the Tivoli Storage Productivity Center Navigation Tree.

Ports: Tivoli Storage Productivity Center reports on many port metrics; be aware that the ports on the DS8000, DS6000, IBM DS4000, XIV, and ESS are the front-end part of the storage device. For the SVC and Storwize V7000, the ports are part of the virtualization engine and, therefore, are used for both front-end and back-end I/O.
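Because Tivoli Storage Productivity Center reports only per-port numbers, totals per port card must be computed outside the tool. The following Python sketch illustrates the idea; the port names, throughput figures, and port-to-card mapping are illustrative assumptions, not values or an API from Tivoli Storage Productivity Center:

```python
# Sum per-port throughput (MBps) into per-port-card totals to spot
# port cards whose aggregate load approaches the card's bandwidth.
# The figures and the port-to-card mapping below are illustrative only.

port_throughput = {               # per-port totals, as exported from a
    "P0": 180.0, "P1": 210.0,     # "By Port" performance report
    "P2": 195.0, "P3": 205.0,
    "P4": 40.0,  "P5": 55.0,
}

port_card_of = {                  # physical layout: which card owns which port
    "P0": "card1", "P1": "card1", "P2": "card1", "P3": "card1",
    "P4": "card2", "P5": "card2",
}

card_totals: dict[str, float] = {}
for port, mbps in port_throughput.items():
    card = port_card_of[port]
    card_totals[card] = card_totals.get(card, 0.0) + mbps

for card, total in sorted(card_totals.items()):
    print(f"{card}: {total:.1f} MBps")
```

Here no single port looks overloaded, but card1 carries 790 MBps in total, which is the kind of aggregate that the per-port report does not surface.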
Controller
Almost all subsystems have multiple controllers (usually two) for redundancy, whether as a dual-controller, dual-cluster, or grid design; these are the components that expose the volumes and manage the cache. To analyze performance data, you need to know that most volumes can be assigned to and used by only one controller at a time. With this in mind, you can understand why a single volume is unlikely to ever get the full performance out of a subsystem.
Array
When used in this context, the term array describes the physical group of disk drive modules that are formatted with a certain RAID level. For example, for the DS8000, this is RAID 10, RAID 6, or RAID 5. The number of disks that are included in an array depends on the subsystem type and for certain disks, the actual implementation.
Volumes
The volumes, which are also called logical unit numbers (LUNs), are not shown in Figure 1-1 on page 6; we show the logical view in Figure 1-2. The host server sees the volumes as physical disk drives and treats them as such.
Array site

An array is created from one or more array sites, depending on the subsystem. Forming an array means defining it for a specific RAID type. The supported RAID types are RAID 5, RAID 6, and RAID 10. You can select a RAID type for each array site. The process of selecting the RAID type for an array is also called defining an array. Array sites are the building blocks that are used to define arrays. An array site is a group of eight disk drive modules (DDMs). Which DDMs make up an array site is predetermined by the DS8000, but note that there is no predetermined server affinity for array sites. The DDMs selected for an array site are chosen from two disk enclosures on different loops. The DDMs in an array site are the same DDM type; therefore, they have the same capacity and the same speed or revolutions per minute (RPM). In Figure 1-2, we have included only the volumes. We did not include components such as host objects or volume groups, because most of the other logical components vary from vendor to vendor.
Rank
In the DS8000 or DS6000 virtualization hierarchy, there is another logical construct, a rank. A rank is defined by the user: the user selects an array and defines the storage format for the rank, which is either count key data (CKD) or fixed block (FB) data. One rank is assigned to one extent pool by the user. Currently on the DS8000, a rank is built using only one array; on the DS6000, a rank can be built from multiple array sites. With the introduction of the DS8800, Tivoli Storage Productivity Center 4.2.1 now has the ability to expose multiple ranks in a single extent pool on the DS8000. For details on added DS8800 functionality, see 4.3.2, DS8000 information on page 162. Figure 1-3 shows the relationship of an array site, an array, and a rank.
Extents
The available space on each rank is divided into extents. The extents are the building blocks of the logical volumes. The characteristic of an extent is its size, which depends on the device type specified when defining a rank:
- For FB format, the extent size is 1 GB.
- For CKD format, the extent size is 0.94 GB for model 1.
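As a worked example of these extent sizes (our own arithmetic, not taken from DS8000 documentation), the number of extents a volume consumes is its requested capacity divided by the extent size, rounded up:

```python
import math

def extents_needed(capacity_gb: float, storage_type: str) -> int:
    """Return the number of rank extents a volume of the given size consumes.

    Extent sizes follow the text above: 1 GB for fixed block (FB),
    0.94 GB for count key data (CKD) model 1.
    """
    extent_gb = {"FB": 1.0, "CKD": 0.94}[storage_type]
    return math.ceil(capacity_gb / extent_gb)

# A 25 GB FB volume consumes 25 extents; the same capacity in CKD
# format needs 27 extents (25 / 0.94 = 26.6, rounded up).
print(extents_needed(25, "FB"))   # 25
print(extents_needed(25, "CKD"))  # 27
```

The rounding up matters for capacity planning: a volume whose size is not a multiple of the extent size leaves part of its last extent unused.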
Extent pools

An extent pool is a logical construct used to manage a set of extents. The user defines extent pools by selecting one to n ranks managed by one storage facility image, and defines which storage facility image server (Server 0 or Server 1) manages the extent pool. All extents in an extent pool must be of the same storage type (CKD or FB). Extents in an extent pool can come from ranks defined with arrays of different RAID formats, but we recommend having the same RAID configuration within an extent pool. The minimum number of extent pools in a storage facility image is two (each storage facility image server manages a minimum of one extent pool).
Subsystem
Figure 1-4 shows an overview of an SVC as compared to Storwize V7000. Currently, an SVC is composed of up to four I/O groups, each of which has two nodes. The IBM Storwize V7000 consists of a Control enclosure that contains two node canisters and disk drives. The pair of node canisters is known as the I/O Group. Optionally, up to nine Expansion enclosures that each contain two expansion canisters and drives can be added. For more information about SVC functionality, see the Redbooks publication, Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933. For more information about Storwize V7000 functionality, see the Redbooks publication, Implementing the IBM Storwize V7000, SG24-7938.
(Figure 1-4: An SVC with four I/O groups of two nodes each, compared with a Storwize V7000 control enclosure containing two node canisters forming one I/O group, plus expansion enclosures with expansion canisters.)
This is different from the way that a typical subsystem looks, because each I/O group can be considered a subsystem with two controllers. Because of so many differences, we show a comparison between a disk storage device and a virtualization device in 1.3.4, Comparison of a disk storage device and a virtualization device on page 13.
I/O group
An input/output (I/O) group contains two SVC nodes or Storwize V7000 node canisters that have been defined by the configuration process. Each SVC node or Storwize V7000 canister node is associated with exactly one I/O group. The nodes in the I/O group provide access to the volumes in the I/O group.
Figure 1-5 shows the relationship between two nodes in a virtualization device from a physical view.
Node
For I/O purposes, SVC nodes or Storwize V7000 node canisters within the cluster are grouped into pairs, called I/O groups, and a single pair is responsible for serving I/O on a particular volume. One node within the I/O group represents the preferred path for I/O to a particular volume; the other node represents the non-preferred path. This preference alternates between the nodes as each volume is created within an I/O group, to balance the workload evenly between the two nodes. We show the relationship of the volume to back-end storage in Figure 1-6. Support: The preferred path for a node is honored by SVC or Storwize V7000 when the multipath driver supports it. Otherwise, SVC and Storwize V7000 present a volume on all available node ports in the I/O group, unless port binding is used to present the volume only on selected ports.
Virtual volume
The virtual volume that is presented to a host system is called a volume. The host system treats this virtual volume as a physical disk. Starting with the smallest unit, we explain and list the components that make up a volume.
MDisk
A managed disk (MDisk) is a LUN or volume presented by a RAID controller and managed by the SVC or Storwize V7000.
(Figure: Hosts with two HBAs each, connected through two SAN switches to the subsystem's two controllers; shared components along these paths are potential bottlenecks.)
For more information about the credential migration tool, see Chapter 5, Credentials Migration Tool, in the Redbooks publication IBM Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894. The native interfaces are supported for the following release levels:
- DS8000: Release 2.4.2 or later
- SAN Volume Controller: Version 4.2 or later
- XIV storage systems: Version 10.1 or later
- Storwize V7000: Version 6.1.0 or later

For more information about NAPI, see Chapter 7, Native API, in the IBM Redbooks publication IBM Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
1.5 Standards
In this section, we briefly review the standards that determine the performance data that is collected and used by Tivoli Storage Productivity Center.
1.5.1 SNIA
The Storage Networking Industry Association (SNIA) is an international computer system industry forum of developers, integrators, and IT professionals, who evolve and promote storage networking technology and solutions. SNIA was formed to ensure that storage networks become efficient, complete, and trusted solutions across the IT community. IBM is one of the founding members of this organization. SNIA is uniquely committed to disseminating networking solutions into a broader market. SNIA is using its Storage Management Initiative (SMI) and its Storage Management Initiative-Specification (SMI-S) to create and promote the adoption of a highly functional interoperable management interface for multivendor storage networking products. SMI-S makes multivendor storage networks simpler to implement and easier to manage. IBM has led the industry in not only supporting the SMI-S initiative, but also, in using it across its hardware and software product lines. The specification covers fundamental operations of communications between management console clients and devices, auto-discovery, access, security, the ability to provision volumes and disk resources, LUN mapping and masking, and other management operations. For more information about SNIA, see its official Web site: http://www.snia.org
The CIM is an open approach to the management of systems and networks. The CIM provides a common conceptual framework applicable to all areas of management, including systems, applications, databases, networks, and devices. The CIM specification provides the language and the methodology that are used to describe management data. A CIM agent provides a way for a device to be managed by common building blocks rather than proprietary software. If a device is CIM-compliant, software that is also CIM-compliant can manage the device. Vendor applications can benefit from adopting the Common Information Model, because the vendors can manage CIM-compliant devices in a common way, rather than using device-specific programming interfaces. Using CIM, you can perform tasks in a consistent manner across devices and vendor applications. CIMOM is one of the major functional components of the CIM agent. But, in many cases, we call the CIM agent a CIMOM.
These reports also help you determine where a problem might occur in the storage subsystem. In Chapter 5, Using Tivoli Storage Productivity Center for performance management reports on page 185, we show you how to generate performance reports for SLA generation and problem determination. When determining performance issues, consider these factors:
- Workloads
- Performance capabilities of the storage subsystem
- SAN
- Server types
- Network
- Configuration of applications

Batch windows:
- Backups not completing
- Application database updates not completing
- Data warehousing
1.6.2 Workloads
Generally, you can break storage workloads into two categories with the following characteristics:
- Transaction-based: I/O intensive, with small records (4 KB) that are either sequential or random
- Throughput-based: High throughput or large data transfers using high bandwidth

The characteristics of these workloads are quite different; arrays configured for one type of workload might perform poorly for the other. The server application determines the type of workload, and server applications can generally be placed in one of these categories. If you have only one server with slow performance, you might need to investigate all the factors that can influence that application type. In the remainder of this section, we describe different workloads and show sample Tivoli Storage Productivity Center reports that display each workload.
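As a rough illustration of this two-way split, the following sketch classifies a workload sample from its average transfer size and rates. The 16 KB cutoff and the sample figures are our own illustrative choices, not values used by Tivoli Storage Productivity Center:

```python
def classify_workload(avg_transfer_kb: float, iops: float, mbps: float) -> str:
    """Roughly classify a workload as transaction- or throughput-based.

    Small transfers at high I/O rates suggest a transaction-based
    workload; large transfers moving lots of data suggest a
    throughput-based one. The 16 KB cutoff is illustrative only.
    """
    if avg_transfer_kb <= 16 and iops > mbps:
        return "transaction-based"
    return "throughput-based"

# An OLTP-like load: 4 KB records at 5000 I/Os per second (~20 MBps)
print(classify_workload(4, 5000, 20))    # transaction-based
# A backup-like load: 256 KB transfers at 400 I/Os per second (~100 MBps)
print(classify_workload(256, 400, 100))  # throughput-based
```

A real classification would use the measured metrics for the volumes that an application touches, but the principle is the same: look at transfer size and I/O rate together, not in isolation.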
Transaction-based throughput
You can characterize a transaction-based workload as I/O intensive, usually with a small block size (4 KB). Such a workload is often described as either cache friendly or cache hostile.
Cache friendly

Cache friendly workloads consist of mostly sequential access with a high read-to-write ratio, 80% or more reads, which is often expressed as 80% read, 20% write, and 10% random (80/20/10). In a cache friendly write, the data is written to cache and later destaged from cache to disk, which gives the best I/O response times. Figure 1-8 is a sample view of a cache friendly I/O load displayed by volume. The I/O load was created using IOmeter with few writes. In Figure 1-9, we show a cache friendly I/O load from the Volume view using the Volume name filter that is shown in Figure 1-8.
Tip: The filter uses a case-sensitive search. For a case-insensitive search, remove the check mark.
Cache hostile

Cache hostile workloads consist mostly of random access with a low read-to-write ratio, 25% or fewer reads, which is often expressed as 25% read, 75% write, and 0% random (25/75/0).
Cache hostile disk activity is indicated by write cache overflow, a metric available for the DS8000, DS6000, and ESS. In a cache hostile condition, the read percentage is low (around 30%), which means that the cache must be destaged to disk frequently, leading to longer I/O response times. Figure 1-10 shows cache hostile I/O.
To compare the effect of cache friendly and cache hostile workloads, we created three volumes (all are 16 GB in size) in the XIV storage system. We then set up one cache friendly read I/O thread and one cache friendly read/write I/O thread. This produced the workload shown in Figure 1-11 with total write cache hit rate above 99%.
Figure 1-11 XIV Cache friendly I/O with cache hit percentage above 99%
We then added a cache hostile I/O thread with an overall cache hit percentage at 77% and a response time of 10 msec as seen in Figure 1-12.
Figure 1-12 Cache hostile I/O, cache hit at 77%, response time 10 msec
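The link between cache hit percentage and response time can be approximated with a simple weighted average of cache-hit and cache-miss service times. The service times in this sketch are illustrative assumptions, not values measured in the scenarios above:

```python
def avg_response_ms(hit_pct: float,
                    cache_ms: float = 0.5,
                    disk_ms: float = 10.0) -> float:
    """Approximate average I/O response time as a weighted average of the
    cache-hit service time and the cache-miss (back-end disk) service time.

    cache_ms and disk_ms are illustrative assumptions, not measurements.
    """
    hit = hit_pct / 100.0
    return hit * cache_ms + (1.0 - hit) * disk_ms

print(avg_response_ms(99))  # near the cache service time: cache friendly
print(avg_response_ms(77))  # more misses pull the average toward disk time
```

Even a modest drop in cache hit percentage shifts the average sharply toward the much slower disk service time, which is why the cache hostile thread degraded the overall response time so visibly.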
Tip: Observe the low cache hits on the volumes with writes.

We have seen that changing the characteristics of the workload affects the performance of volumes. In most instances, we recommend that you practice workload separation. In Table 1-2, we show a possible breakdown of several storage workloads. Consult with your database administrator (DBA) or application owner for the exact profile of your applications. You can use these performance requirements to arrive at a suitable SLA.
Table 1-2 Storage workload characteristics

                      File server  OLTP  Warehousing  Multimedia  Backup
I/O intensive         N            Y     Y            N           N
Throughput intensive  Y            N     N            Y           Y
Read intensive        Y            Y     Y            Y           N
Write intensive       Y            Y     N            N           Y
Sequential            Y            N     Y            Y           Y
Random                Y            Y     Y            N           N
File server
The role of the file server is to store, retrieve, and update data in response to client requests. Therefore, the critical areas that impact performance are the speed of the data transfer and the networking subsystems. The amount of memory that is available to resources, such as network buffers and disk I/O caching, also greatly influences performance. Processor speed or quantity typically has little impact on file server performance. In larger environments, you must consider where the file servers are located within the networking environment; we advise you to locate them on a high-speed backbone, as close to the core switches as possible. The following subsystems have the greatest impact on file server performance, in this order:
1. Network
2. Memory
3. Disk

The network subsystem, particularly the network interface card or the bandwidth of the LAN, can create a bottleneck due to heavy workload or latency. Insufficient memory can limit the ability to cache files and, therefore, cause more disk activity, which can result in performance degradation.
When a client requests a file, the server must initially locate it, then read it, and forward the requested data back to the client. The reverse of this sequence applies when the client is updating a file. Therefore, the number of host bus adapters (HBAs) that are installed and the way that they are configured can cause the disk subsystem to be a potential bottleneck. Generally, a file server requires higher throughput to satisfy the users, and I/O response time is not as critical.
Database server
The database server's primary function is to store, search, retrieve, and update data from disk. Examples of database engines include IBM DB2, Microsoft SQL Server, and Oracle. Due to the high number of random I/O requests that database servers are required to perform, and the computation-intensive activities that occur, the following areas can potentially impact performance:
- Memory
- Disk
- Processor
- Network

The server subsystems that have the most impact on database server performance are the memory, disk, CPU, and network subsystems, described next.
Memory subsystem
Buffer caches are one of the most important components in the server, and both memory quantity and memory configuration are critical factors. If the server has insufficient memory, paging occurs, resulting in excessive disk I/O (to the server's internal disk drives), which generates latencies.
Disk subsystem
Even with sufficient memory, most database servers perform large amounts of disk I/O to bring data records into memory and flush modified data to disk. When the data and log files are located on external storage subsystems, there are additional considerations:
- The number of HBAs
- The type of RAID
- The number of disk drives that are used

Database performance is also affected by whether the database is configured to use system-managed space (SMS) or database-managed space (DMS). The storage administrator needs to plan and implement a well-designed storage subsystem to ensure that it is not a potential bottleneck.
CPU subsystem
Processing power is another important factor for database servers, because database queries and update operations require intensive CPU time. The database replication process also requires a considerable number of CPU cycles. Database servers are multi-threaded applications, so symmetric multiprocessor (SMP) capable systems provide improved performance scaling to 16-way and beyond. L2 cache size is also important due to the high hit ratio, that is, the proportion of memory requests that fill from the much faster cache instead of from memory.
Network subsystem
The networking subsystem tends to be the least important component of an application or database server, because the amount of data that is returned to the client is a small subset of the total database. The network can be important, however, if the application and the database are on separate servers. A balanced system is especially important: for example, if you add CPUs, consider upgrading other subsystems as well, such as increasing memory, and ensure that disk resources are adequate. In database servers, the design of the application is critical (for example, database design and index design).
Terminal server
Windows Server Terminal Services enables a variety of desktops to access Windows applications through terminal emulation. In essence, the application is hosted and executed on the terminal server, and only window updates are forwarded to the client. The following subsystems are the most probable sources of bottlenecks:
- Memory
- CPU
- Network

The disk subsystem has very little effect on performance.
Multimedia server
Multimedia servers provide the tools and support to prepare and publish streaming multimedia presentations on your intranet or the Internet. They require high-bandwidth networking and high-speed disk I/O because of the large data transfers. If you are streaming audio, the most probable sources of bottlenecks are in these areas:
- Network
- Memory
- Disk

If you are streaming video, the following subsystems are most important:
- Network
- Disk I/O
- Memory

Disk is more important than memory for a video server due to the volume of data being transmitted and the large amount of data that is read. If the data is stored on disk, the disk speed is also an important factor in performance. If compression and decompression of the streaming data is required, then CPU speed and the amount of memory are important factors as well.
Web servers
Today, a Web server is responsible for hosting Web pages and running server-intensive Web applications. If the Web site content is static, the following subsystems can be sources of bottlenecks:
- Network
- Memory
- CPU
If the Web server is computation-intensive (such as with dynamically created pages), the following subsystems might be sources of bottlenecks:
- Memory
- Network
- CPU
- Disk

The performance of Web servers depends on the site content. Sites that use dynamic content connect to databases for transactions and queries, and this connection requires additional CPU cycles. It is important for this type of server that there is adequate RAM for caching and for managing the processing of dynamic pages; additional RAM is also required for the Web server service. The operating system automatically adjusts the size of the cache depending on the requirements. Because of the high hit ratio and the transfer of large amounts of dynamic data, the network can be another potential bottleneck.
Backup servers
In today's world of continuity, data recreation, and availability, a backup server is responsible for an ever-increasing amount of data movement across all networks, including LAN, WAN, and SAN. Because of the increasing traffic to and from the backup server, the following subsystems might be sources of bottlenecks:
- Network
- Memory
- CPU
- Disk I/O

The network subsystem, particularly the network interface card or the bandwidth of the LAN, can create a bottleneck due to heavy workload or latency. The performance of a backup server varies at different times, based on the functions that are being performed. Traffic loads across the networks also change, depending on whether LAN-free backup agents are deployed. LAN-free agents transfer data directly across the Fibre Channel network, straight to the tape drives. This reduces the CPU utilization of the server, but increases the load on the FC network. It is important for this type of server that there is adequate RAM for caching and managing the processing of metadata from the agents.
Workload resource sharing means multiple workloads use a common set of storage subsystem resources, such as ranks, device adapters, and I/O ports. Multiple resource-sharing workloads can have logical volumes on the same ranks and can access the same host adapters or even the same I/O ports. Resource sharing allows a workload to access more hardware resources than are dedicated to it, thereby providing greater potential performance. However, this hardware sharing can result in resource contention between applications that impacts performance at times. It is important to allow resource sharing only for workloads that do not consume all of the hardware resources available to them. Workload spreading means balancing and distributing workload evenly across all of the storage subsystem hardware resources that are available. Spreading applies to both isolated workloads and resource-sharing workloads. For detailed descriptions of configuration and solution design for performance optimization, see the IBM Redbooks publications that are written specifically for your hardware:
- For DS8000: DS8000 Performance Monitoring and Tuning, SG24-7146
- For DS6000: IBM TotalStorage DS6000 Series: Performance Monitoring and Tuning, SG24-7145
- For DS3000, DS4000, and DS5000: IBM Virtual Disk System Quickstart Guide, SG24-7794
- For Storwize V7000: Implementing the IBM Storwize V7000, SG24-7938
- For SVC: Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933
- For XIV: IBM XIV Storage System: Architecture, Implementation, and Usage, SG24-7659
Part 2
Sizing and scoping your Tivoli Storage Productivity Center environment for performance management
In this part of the book we take you through the customization of your Tivoli Storage Productivity Center environment to support performance management.
Chapter 2.
For a complete breakdown of features, functions, and capabilities by Tivoli Storage Productivity Center product, see the Tivoli Storage Productivity Center Information Center website:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_product_packages.html
Figure 2-1 on page 31 is a subset of the detailed information available at this website when you scroll through the Licenses for Tivoli Storage Productivity Center frame on the right side.
Figure 2-1 Tivoli Storage Productivity Center Function Breakdown by License Summary
Data Server
This component is the control point for product scheduling functions, configuration, event information, reporting, and graphical user interface (GUI) support. It coordinates communication with and data collection from agents that scan file systems and databases to gather storage demographics and populate the database with results. Automated actions can be defined to perform file system extension, data deletion, and Tivoli Storage Manager backup or archiving, or event reporting when defined thresholds are encountered. The Data server is the primary contact point for GUI user interface functions. It also includes functions that schedule data collection and discovery for the Device server.
Device Server
This component discovers, gathers information from, analyzes performance of, and controls storage subsystems and SAN fabrics. It coordinates communication with and data collection from agents that scan SAN fabrics and storage devices.
Single sign-on
Enables you to access Tivoli Storage Productivity Center and then Tivoli Storage Productivity Center for Replication using a single user ID and password.
For more information, see the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
DB2 database
A single database instance serves as the repository for all Tivoli Storage Productivity Center components other than Tivoli Storage Productivity Center for Replication. The default database for Tivoli Storage Productivity Center for Replication is the open source Derby database, which is supplied on the Tivoli Storage Productivity Center installation DVD or PPA download package. You are given the option to use DB2 instead, but the default is Derby. The TPCDB database installed on DB2 is the central repository where all of your storage information and usage statistics are stored. All agent and user interface access to the central repository is done through a series of calls and requests made to the server. All database access is done using the server component to maximize performance and to eliminate the need to install database connectivity software on your agent and UI machines.
Agents
Outside of the server, several interfaces are used to gather information about the environment. The most important sources of information are the Tivoli Storage Productivity Center agents (Storage Resource agent, Data agent, and Fabric agent) for servers, and either Native Application Interface (NAPI) or SMI-S enabled storage and switch devices that use a CIMOM agent (either embedded or as a proxy agent).
Agents: In Tivoli Storage Productivity Center V4.2 and above, you can deploy Storage Resource agents only. If you want to install a Data agent, you must own a previous version of the product. For information about how to install a Data agent, see the Redbooks publication for the previous version of the product, Tivoli Storage Productivity Center 4.1, SG24-7809.
Storage Resource agents, CIM agents, and Out of Band fabric agents gather host, application, storage system, and SAN fabric information and send that information to the Data server or Device server.
Tip: Data agents and Fabric agents are supported in V4.2. However, no new functions were added to those agents for that release. For optimal results when using Tivoli Storage Productivity Center, migrate the Data agents and Fabric agents to Storage Resource agents.
Interfaces
As Tivoli Storage Productivity Center gathers information from your storage (servers, subsystems, and switches) across your enterprise, it accumulates a repository of knowledge about your storage assets and how they are used. You can use the reports provided in the user interface to view and analyze that repository of information from various perspectives to gain insight into the use of storage across your enterprise. The user interfaces (UIs) enable users to request information and then generate and display reports based on that information. Certain user interfaces can also be used for configuration of Tivoli Storage Productivity Center or storage provisioning for supported devices. The following interfaces are available for Tivoli Storage Productivity Center:
- Tivoli Storage Productivity Center GUI: This is the central point of Tivoli Storage Productivity Center administration. Here you can configure Tivoli Storage Productivity Center after installation, define jobs to gather information, initiate provisioning functions, view reports, and work with the advanced analytics functions.
- Java Web Start GUI: When you use Java Web Start, the regular Tivoli Storage Productivity Center GUI is downloaded to your workstation and started automatically, so you do not have to install the GUI separately. The main reason for using Java Web Start is that it can be integrated into other products (for example, TIP). By using Launch in Context from those products, you are guided directly to the selected panel. The Launch in Context URLs can also be assembled manually and used as bookmarks.
- TPCTOOL: TPCTOOL is a command-line (CLI) program that interacts with the Tivoli Storage Productivity Center Device server.
Most frequently it is used to extract performance data from the Tivoli Storage Productivity Center repository database in order to create graphs and charts with multiple metrics, with various unit types, and for multiple entities (for example, subsystems, volumes, controllers, and arrays) using charting software. Commands are entered as lines of text (that is, sequences of characters) and output is received as text. The tool also provides query, management, and reporting capabilities, but you cannot initiate Discoveries, Probes, or performance collections from the tool.
- Database access: Starting with Tivoli Storage Productivity Center V4, the Tivoli Storage Productivity Center database provides views for access to the data stored in the repository, which allows you to create customized reports. The views and the required functions are grouped together into a database schema called TPCREPORT. To use this interface, you need sufficient knowledge of SQL. To access the views, DB2 supports various interfaces, for example, JDBC and ODBC.
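Because TPCTOOL output is plain text, it is straightforward to post-process before charting. The following sketch parses a hypothetical extract of such output (the column names, volume IDs, and values are illustrative assumptions, not actual TPCTOOL output):

```python
import csv
import io

# Hypothetical extract of performance report output, reformatted as CSV.
# Real column names and component IDs will differ in your environment.
SAMPLE = """Timestamp,Component,Read I/O Rate (overall),Write I/O Rate (overall)
2011-09-01 10:00:00,vol001,250.0,80.0
2011-09-01 10:05:00,vol001,310.0,95.0
2011-09-01 10:10:00,vol001,290.0,88.0
"""

def load_samples(text):
    """Parse CSV-style report text into a list of dicts with numeric metrics."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {"Timestamp": row["Timestamp"], "Component": row["Component"]}
        for key, value in row.items():
            if key not in parsed:
                parsed[key] = float(value)
        rows.append(parsed)
    return rows

def peak(rows, metric):
    """Return the sample with the highest value for the given metric."""
    return max(rows, key=lambda r: r[metric])

samples = load_samples(SAMPLE)
busiest = peak(samples, "Read I/O Rate (overall)")
print(busiest["Timestamp"], busiest["Read I/O Rate (overall)"])
```

The same parsed rows can then be fed to any charting software to build the multi-metric graphs described above.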
Sample configuration
Your configuration depends on the storage and SAN configuration that Tivoli Storage Productivity Center will be managing. An example might be a currently installed Tivoli Storage Productivity Center server supporting six IBM System Storage DS8000s and two IBM System Storage SAN Volume Controller (SVC) clusters installed on an AIX LPAR with four p5 processors and 16 GB of RAM. With that configuration, the processors average 80% utilization.
Database size
The size of the Tivoli Storage Productivity Center database is directly proportional to the number of records that are retained. These records include asset data from devices, file data from servers, volume data from local drives and subsystems, and performance data from subsystems and fabrics. As the number of servers, subsystems, volumes, and switches grows, more and more records are created and stored within the repository. Over time, as storage capacity grows and new volumes are constantly added, your Tivoli Storage Productivity Center database repository grows. For performance data, you are gathering data over different time intervals and expiring that data at different times. Because of differences in the types of data collected for storage devices, fabric devices, and servers, there is no straightforward formula available to size a Tivoli Storage Productivity Center database. See Chapter 16 of the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894, to get a more precise understanding of your own data size and growth requirements. The size and duration of your performance monitors can add significant quantities of data to your Tivoli Storage Productivity Center repository. For example, a small sample interval stores more data than a large one. An example of not monitoring the database size was recently seen by a Tivoli Storage Productivity Center customer whose DB2 file system for the database was found with less than 3 GB of free space. This left Tivoli Storage Productivity Center without enough space to recover disk space through normal Resource History Retention period actions. The only option available was to use DB2 administrative commands provided directly by IBM DB2 support. This activity required loss of access to the Tivoli Storage Productivity Center server during this time, and a loss of some historical data was required to shrink the space needed to support restoration of the Tivoli Storage Productivity Center environment.
Tip: The best practice for managing Tivoli Storage Productivity Center database space is to set the Resource History Retention settings, then monitor the space growth and adjust the settings as appropriate.
Data retention
The amount of time that you store and retain your information also has a significant bearing on your repository size. We recommend the following changes to the default 14-day retention values:
- Sample: Change to 30 days.
- Hourly: Change to 180 days (history for trending).
- Daily: Change to 365 days.
These values give you the capability of producing significant and detailed trending reports.
The sum of the subsystem data and switch data gives us a total of 4,916,296,000 bytes, or 4.9 GB. This is the number of bytes that are used after a year of data collection. You must remember that, in addition to this capacity, there is also an amount used for normal Tivoli Storage Productivity Center data. This number is insignificant compared to the amount of records used for performance data collections.
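The arithmetic behind such an estimate can be sketched as follows. The record size and component counts below are illustrative assumptions, not figures from this book; derive real values for your environment from the sizing guidance referenced above:

```python
# Rough repository-size estimator for one retention tier of performance data.
# The 200-byte record size and the component counts are assumptions only.

SECONDS_PER_DAY = 24 * 60 * 60

def performance_records_per_day(components, interval_minutes):
    """Number of sample records written per day for a set of components."""
    samples = SECONDS_PER_DAY // (interval_minutes * 60)
    return components * samples

def repository_bytes(components, interval_minutes, retention_days,
                     bytes_per_record=200):
    """Estimated bytes held for one retention tier (sample, hourly, or daily)."""
    per_day = performance_records_per_day(components, interval_minutes)
    return per_day * retention_days * bytes_per_record

# Example: 2,000 volumes sampled every 15 minutes, kept for 30 days,
# plus hourly rollups for the same volumes kept for 180 days.
sample_tier = repository_bytes(2000, 15, 30)
hourly_tier = repository_bytes(2000, 60, 180)
print(f"sample tier: {sample_tier / 1e9:.2f} GB")
print(f"hourly tier: {hourly_tier / 1e9:.2f} GB")
```

Summing the tiers for all monitored subsystems and switches gives a rough lower bound on repository growth, which you can then compare against actual DB2 file system usage.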
A container is a physical storage device and is assigned to a table space. A single table space can span many containers, but each container can belong to only one table space.
Basic Edition to Tivoli Storage Productivity Center Standard Edition so that you can do performance data collections and reporting, then it might be worth installing your own HBA(s) into the box, or pre-ordering the 2805-MC5 version with the internal HBA already included. You can upgrade from Tivoli Storage Productivity Center Basic Edition to Tivoli Storage Productivity Center Standard Edition for use as your main Tivoli Storage Productivity Center performance management server. For more information about SSPC, see the Redbooks publication, IBM System Storage Productivity Center, SG24-7560.
The SRA drastically simplifies the agents needed by Tivoli Storage Productivity Center to gather server information. Whether or not you plan to gather file system data or use any of the enhanced agent features, deploying the SRA provides tremendous value for the storage administrator when using the Tivoli Storage Productivity Center Topology Viewer, Data Path Explorer, or storage performance management. The key is in the server hardware platform detail, the operating system detail, and the Fibre Channel Host Bus Adapter (HBA) detail provided, which makes the visualized data meaningful. As a comparison, consider seeing only an HBA address versus seeing a server object identified as a Windows 2008 server, running SP2, with an Emulex HBA.
Important: The SRA can be used without a Tivoli Storage Productivity Center Data license, but with limited functionality. The Scan function is not available, but the Data Path Explorer can be used. This is important to know for end users that have only Tivoli Storage Productivity Center Basic Edition installed, as delivered with the SSPC. See SSPC considerations on page 40 for more details on SSPC.
Figure 2-3 shows a Tivoli Storage Productivity Center topology table view without the SRA or any Tivoli Storage Productivity Center agent installed on the attached server. While Tivoli Storage Productivity Center can visualize this server, few details are available for identification other than an HBA WWPN.
Figure 2-3 Tivoli Storage Productivity Center Computer View without an SRA Agent deployed
Figure 2-4 shows a Tivoli Storage Productivity Center topology table view with the SRA installed. In this view, many server details are exposed, such as the operating system and the installed service pack. In addition, Figure 2-5 reveals the HBA details that are available by clicking the HBA tab.
Figure 2-4 Tivoli Storage Productivity Center Computer View with SRA Agent deployed
Figure 2-5 Tivoli Storage Productivity Center Computer View with SRA Agent deployed and HBA details shown
Full details about the new Tivoli Storage Productivity Center SRA can be found in Chapter 8 of the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
2.3.3 CIMOMs
CIMOMs are pieces of code that act as a proxy agent to communicate and transfer data to and from different devices. These devices can be either storage devices or fabric devices. It is important for you to understand where to get CIMOMs, which ones to use, when to use them, how to use them, where to deploy them, and how many to use, so that you get the optimal data to match your configuration. The following sections provide some recommendations and assistance.
Providers
CIMOMs are provided by the manufacturing vendor of the device. For example, a CIMOM for an IBM DS6000 storage subsystem is provided by IBM, but a CIMOM for an IBM DS5300 is provided by Engenio, as Engenio is the manufacturer of the control units in the DS5300.
As a result, it is imperative that the correct CIMOM is obtained from the device vendor. For your level of Tivoli Storage Productivity Center, refer to the compatibility matrix to check the level of CIMOM that you need. Each vendor has multiple versions of CIMOMs. There can be different releases or versions of the CIMOM, providing different functions, or there can be different CIMOMs designed for different software products, such as Tivoli Storage Productivity Center. For Tivoli Storage Productivity Center, it is important to get the correct version of the CIMOM for the specific version of Tivoli Storage Productivity Center and for the specific device type you are monitoring. CIMOMs are very easy to use, but read any available release notes or documentation to ensure that there are no conditions or restrictions with that version. Most vendors provide installation documentation for each version of their code. See this website for the CIMOM compatibility matrix for Tivoli Storage Productivity Center:
http://www-01.ibm.com/support/docview.wss?rs=1134&context=SS8JFM&context=SSESLZ&dc=DB500&uid=swg21265379&loc=en_US&cs=utf-8&lang=en
CIMOM deployment
The deployment of CIMOMs can be important, because port conflicts can occur if similar CIMOMs are installed on the same box. We recommend that each type of CIMOM be deployed on its own box to prevent not only port conflicts, but also traffic collisions as data is transmitted back to the Tivoli Storage Productivity Center server. Some vendors require the CIMOM to be placed on a server that has an HBA installed, with Fibre Channel (FC) disk volumes allocated to that server, because the CIMOM and the managed device communicate through the FC data path. Read each vendor's documentation for instructions and requirements. Some CIMOMs are provided as part of the firmware or microcode of the device. These are called embedded CIMOMs; Cisco switches are one example. If you are monitoring these devices, there is no need to install any external CIMOMs; you only have to enable the CIMOM by the vendor-provided method.
Compatibility matrix
The Tivoli Storage Productivity Center compatibility matrix is the starting point to see which devices are supported, which operating systems are supported, and which CIMOM must be used. There is a matrix for subsystem support, and another one for fabric support, including HBAs. As is the case for any software or hardware implementation, we always recommend that you get the latest version. At the time of writing, we are using Tivoli Storage Productivity Center Version 4.2.1. Following are the websites for the two compatibility matrices:
- For fabric management (supports Tivoli Storage Productivity Center V4.2.1): https://www-01.ibm.com/support/docview.wss?uid=swg27019378
- For storage device management (supports Tivoli Storage Productivity Center V4.2.1): https://www-01.ibm.com/support/docview.wss?uid=swg27019305
Duration
Within Tivoli Storage Productivity Center, you can set the duration of the performance data collection job; that is, when the job starts collecting data from the device and when it stops. You can also set the collection to run indefinitely, in which case the job continues running unless manually stopped. When a performance data collection job starts, Tivoli Storage Productivity Center queries the device, creates a table of valid resources, such as volumes, and stores it in Tivoli Storage Productivity Center memory. If you start a collection job with the indefinite value, and subsequently a volume is added to, or removed from, the resource list, Tivoli Storage Productivity Center does not know this unless one of two things occurs: either a Probe job is run, or the CIMOM is advanced enough to recognize that a new volume was created or an existing volume deleted, in which case it tells Tivoli Storage Productivity Center to do a mini internal probe to update its list. See Performance data collection on page 70 for specific information. To overcome this issue, you can set the data collection to run daily, but the duration in past releases was 23 hours. With Version 4.2.1, the new supported value is 24 hours. When the data collection automatically starts again, a new table is built with the changed devices. The reason that we can now run a 24-hour performance monitor is that the Tivoli Storage Productivity Center for Disk code was enhanced in 4.2.1 to support shutdown and restart without the long delays seen in prior versions. This allows for true 24-hour-a-day performance data collection, and the ability to recover from configuration changes that are introduced into a storage environment outside of the Tivoli Storage Productivity Center provisioning management interface.
Collection intervals
Storage subsystems from different vendors, or even systems from the same vendor, can have different capabilities or limits as to the sample interval they can support. For your subsystems, check which collection intervals are available. Set your interval time according to the purpose of the data report:
- If you are producing reports over a long period of time, you might want to select a large interval, for example, 30 minutes or 1 hour. This gives you fewer data points on your reports and thus does not make the report look too busy.
- If you are recording the data for problem determination, you need as small an interval as possible, to give you a detailed report to help you analyze the problem.
46
If you are monitoring the environment to help you set your SLAs, you will probably set the interval at 15 minutes. When you are creating your original measurement for your baseline, you can start at 15 minutes, and then change to 5 minutes to refine your true baseline value. See Creating a baseline with Tivoli Storage Productivity Center on page 68 for specifics on baseline creation.
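The trade-off between interval length and report density is simple arithmetic; the following sketch shows how many data points each interval produces per component per day (the interval values are examples, not limits of any particular subsystem):

```python
# Data points produced per component per day for various sample intervals.
MINUTES_PER_DAY = 24 * 60

def points_per_day(interval_minutes):
    """Samples collected per component in one day at a given interval."""
    return MINUTES_PER_DAY // interval_minutes

for interval in (5, 15, 30, 60):
    print(f"{interval:>2}-minute interval: {points_per_day(interval):>4} points/day")
```

A 5-minute interval yields 288 points per day per component, versus 24 at 1 hour, which is why small intervals suit problem determination and large intervals suit long-term trend reports.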
Fabric considerations
When you create performance data collection jobs for your fabric, there is not the same level of complexity or flexibility as with subsystem performance data collection. In a fabric collection, there are no dynamic changes that need to be refreshed, so the collection job period can be set to indefinite. One point has been raised with modern fabric directors: it is quite common to hot-insert fabric blades into running directors, and when this happens, Tivoli Storage Productivity Center needs to become aware of the change. If you are running your fabric monitor indefinitely, you have to stop and restart the monitor after you run a configuration probe on the fabric involved. Alternatively, you can change the fabric monitors to behave like the storage performance monitors and have them run on a 24-hour basis as well. The selection is yours.
Attention: When SNMP has been set up on a fabric switch and the alerts are sent to Tivoli Storage Productivity Center, a Discovery process is initiated to update the switch status and record switch changes. This Discovery job refreshes the topology view with the changed information, as well as updating any out-of-band agents.
Change in environment
Because Tivoli Storage Productivity Center reports on device resources at a volume or port level, if a volume is removed or a port is unplugged, records showing zero activity are put in the repository. This is an accurate representation of the data, but for removed volumes, it can give your reports a misleadingly negative look. When a probe or a CIM indication (change in status) is actioned, the reporting then removes the old volume from reports. If you see this as a problem, you can set up scripts to troll through the Tivoli Storage Productivity Center server logs and look for zero-performance devices. You can then initiate a manual Probe to remove the volume from Tivoli Storage Productivity Center.
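Such a script can be quite small. The sketch below scans log-like text for volumes whose every sample shows zero activity; the log format, field names, and volume IDs are hypothetical assumptions, so adapt the pattern to what your Tivoli Storage Productivity Center server logs actually contain:

```python
import re

# Hypothetical log extract: the real server log format differs;
# adapt PATTERN to your own logs.
LOG = """\
2011-09-01 10:00 PERF vol0001 totalIORate=512.0
2011-09-01 10:00 PERF vol0002 totalIORate=0.0
2011-09-01 10:05 PERF vol0001 totalIORate=498.5
2011-09-01 10:05 PERF vol0002 totalIORate=0.0
"""

PATTERN = re.compile(r"PERF\s+(\S+)\s+totalIORate=([\d.]+)")

def zero_activity_volumes(log_text):
    """Return volumes whose every sample in the log shows a zero I/O rate."""
    seen, nonzero = set(), set()
    for match in PATTERN.finditer(log_text):
        volume, rate = match.group(1), float(match.group(2))
        seen.add(volume)
        if rate > 0:
            nonzero.add(volume)
    return sorted(seen - nonzero)

print(zero_activity_volumes(LOG))  # candidates for a manual Probe
```

Volumes that the script flags are the candidates for the manual Probe described above.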
[Figure: example Tivoli Storage Productivity Center lab environment topology. Servers: tpcblade3-7, tpcblade3-11 (SRA agent), brees (CIMOM agent), texas (DCFM), jumbo, rocky. Storage virtualizers: Storwize V7000-2076 Ford1_tbird-IBM and SVC-2145-svc1-IBM, with volumes and back-end volumes. Storage subsystems: DS8000-21071301901-IBM (NAPI), DS8000-21071302541-IBM (NAPI), XIV-281060000646-IBM (NAPI), and DS5300 tpc5k (LSI CIMOM).]
The sequence of steps is important for a successful Tivoli Storage Productivity Center implementation. We recommend the following process:
1. Understand your environment; it is important to understand what hardware components you have, and to set them up correctly:
   a. Plan the Tivoli Storage Productivity Center installation.
   b. Plan your implementation.
   c. Configure the server.
   d. Install the Tivoli Storage Productivity Center Standard Edition components.
2. Ensure that you have the correct firmware, microcode, or CIMOM levels. Many functions are only supported by specific levels. The Tivoli Storage Productivity Center compatibility matrix website is shown in Compatibility matrix on page 45.
3. Install NAPI-attached storage devices by using Disk Manager. After installation, run a configuration probe to gather configuration data.
   Tip: Create a separate configuration probe per storage or fabric device, because this allows Tivoli Storage Productivity Center to utilize multiple job threads for this task; a large storage or switch device can take considerable time to complete.
4. Install the CIMOMs for your devices as needed; these are in no specific order. As part of the CIMOM installation, register the devices for which you will collect performance data to the CIMOM. See CIMOM recommended capabilities on page 44 for the recommended number of CIMOMs per Tivoli Storage Productivity Center instance:
   - Engenio
   - Brocade/DCFM
   - McData
   - Cisco
   - NetApp
   - Other vendors
5. Register the CIMOMs to Tivoli Storage Productivity Center. This can be performed manually, or in some cases, Tivoli Storage Productivity Center can discover them using the Autodiscovery feature.
6. Install SRAs, or utilize the older Data and Fabric agents if you are upgrading from an old Tivoli Storage Productivity Center version. These can communicate in-band or out-of-band, depending on your Fibre Channel connections.
   Communications: In-band communication means that device communications to the network management facility travel directly across the Fibre Channel transport, most commonly by using the Small Computer System Interface (SCSI) Enclosure Services (SES), and they require no LAN connections.
7. Discover the fabric.
8. Probe your devices.
9. Set up your storage and fabric performance monitors.
10. Set up your alerts and thresholds based on your initial performance collection, according to your SLAs.
11. Look at the results of your data collection jobs, and compare them to your expectations and your SLAs.
Tivoli Storage Productivity Center instance guidelines on page 40 shows the recommended limits to use when defining the number of CIMOMs per Tivoli Storage Productivity Center instance. If you have an existing Tivoli Storage Productivity Center implementation, we recommend that you review your current repository size, and then calculate your expected growth to make sure you have enough space. Use the formulas given in Database size on page 36 to help you understand this.
Part 3
Chapter 3. General performance management methodology
3.1.2 Metrics
In this section, we illustrate in detail how Tivoli Storage Productivity Center gathers and processes performance data collected through the Native Application Interface (NAPI) or a Common Information Model Object Manager (CIMOM), and how this information becomes available for your reports.
Overview of metrics
A metric in this context is a unit of measurement. The device maintains the statistics counters so that Tivoli Storage Productivity Center can gather them using the NAPI or a CIMOM agent. These counters are then used to calculate new values, which are called metrics. Technically, the counters are in the microcode of the storage subsystem or SAN switches. The counters are usually monotonically increasing, so it is necessary to take the delta between two sets of counters (combined with the time) to convert the counters into values, such as I/O rates. These become the metrics. It is also possible to use two or more metrics to derive other metrics.
Even so, many people call the metrics counters. In most cases, this term is acceptable. Figure 3-1 shows the value calculations for metrics in general. The value is always an average over a period of time.
Example 3-1 shows a simple equation for this metric: the delta of the counter that counts the number of I/Os is divided by the interval length to calculate the I/Os per second (IOPS).
Example 3-1 I/O rate equation
I/O Rate = (Number of I/Os at T2 - Number of I/Os at T1) / Interval length
This equation already shows one potential problem: a counter cannot increase indefinitely, so eventually every counter wraps and starts at 0 again. This creates a spike in the data, because the deltas become too large for Tivoli Storage Productivity Center to manage when the counter resets from a relatively large value to zero. Other events can also lead to spikes, for example, a restart of the subsystem or the CIMOM/NAPI, bugs in the subsystem or the CIMOM/NAPI, or a failover of the controllers of a subsystem. Tivoli Storage Productivity Center tries to detect these situations and discard all the data from the affected sample. This is necessary because the false values would otherwise aggregate into the hourly and daily values, which is far worse than simply disregarding suspicious data.
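The delta logic, including the handling of counter wrap and suspicious samples, can be sketched as follows. The 32-bit counter width and the plausibility threshold are illustrative assumptions; the product's actual detection logic is internal to Tivoli Storage Productivity Center:

```python
COUNTER_MAX = 2**32  # assumed counter width; real devices vary

def io_rate(count_t1, count_t2, interval_seconds, max_plausible_rate=1_000_000):
    """Convert two monotonically increasing I/O counters into an I/O rate.

    Returns None when the sample looks suspicious (counter reset,
    subsystem or CIMOM restart), so it can be discarded rather than
    aggregated into the hourly and daily values.
    """
    delta = count_t2 - count_t1
    if delta < 0:
        # Counter wrapped (or was reset); assume a single 32-bit wrap.
        delta += COUNTER_MAX
    rate = delta / interval_seconds
    if rate > max_plausible_rate:
        return None  # implausible spike: discard the sample
    return rate

print(io_rate(1_000, 76_000, 300))        # normal 5-minute sample: 250.0 IOPS
print(io_rate(2**32 - 500, 74_500, 300))  # wrapped counter, still 250.0 IOPS
print(io_rate(4_000_000_000, 10, 60))     # mid-interval reset: discarded (None)
```

Note that a genuine wrap and a mid-interval reset look identical in the raw counters, which is why a plausibility check on the resulting rate is needed at all.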
Tivoli Storage Productivity Center can report on many different performance metrics, which indicate the particular performance characteristics of the monitored devices. See Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331 for a complete list of performance metrics that Tivoli Storage Productivity Center supports.
Throughput
Throughput is measured and reported in several different ways. There is throughput of an entire box (subsystem), or controller (ESS/DS6000/DS8000/XIV/SMI-S Block Server Performance), or module (XIV), or of each I/O group or node (SVC/IBM Storwize V7000). There are throughputs measured for each volume (LUN), throughputs measured at the Fibre Channel interfaces (ports) and on Fibre Channel switches, and throughputs measured at the RAID array after cache hits have been filtered out. These are the main front-end throughput metrics:
- Total I/O Rate (overall)
- Read I/O Rate (overall)
- Write I/O Rate (overall)
- Total Data Rate (overall)
- Read Data Rate (overall)
- Write Data Rate (overall)
These are the main back-end throughput metrics:
- Total Back-End I/O Rate (overall)
- Back-End Read I/O Rate (overall)
- Back-End Write I/O Rate (overall)
- Back-End Total Data Rate (overall)
- Back-End Read Data Rate (overall)
- Back-End Write Data Rate (overall)
Response time
Response time is closely related to throughput and cache hits. It is desirable to track any growth or change in the rates and response times. Frequently, the I/O rate grows over time, and response time increases as the I/O rates increase. This relationship is what capacity planning is all about. As I/O rates increase, and as response times increase, you can use
these trends to project when additional storage performance (as well as capacity) is required. In Chapter 6, Using Tivoli Storage Productivity Center for capacity planning management on page 305, we discuss the approach of using Tivoli Storage Productivity Center for capacity planning. These are the corresponding front-end response time metrics:
- Overall Response Time
- Read Response Time
- Write Response Time
These are the corresponding back-end response time metrics:
- Overall Back-End Response Time
- Back-End Read Response Time
- Back-End Write Response Time
Depending on the particular storage environment, the throughput or response times might change drastically from hour to hour, or day to day. There can be periods when the values fall outside the expected range. In that case, the metrics related to cache hit rate can be used to understand what is happening. Cache hit rate is the number of times that an I/O request, either read or write, was satisfied from the device cache or memory (typically shown as a percentage). In addition, you might find that the storage transfer size for the application I/O has changed, due to an application or database tuning activity. This can alter the application workload and, if corresponding tuning on the storage subsystem was not accounted for, can be the cause of reduced application response time performance.
Tip: Large transfer sizes usually indicate more of a batch workload, in which case the overall data rates are more important than the I/O rates and the response times.
There are a few metrics for which thresholds must be monitored:
- Total I/O Rate Threshold
- Total Back-End I/O Rate Threshold
- Overall Back-End Response Time Threshold
- Write-cache Delay Percentage Threshold
SAN switch
For switches, the important metrics are Total Port Packet Rate, Total Port Data Rate, Port Send Data Rate, and Port Receive Data Rate, which show the traffic pattern over a particular switch port. When frames are lost between the host and the switch port, or between the switch port and a storage device, the dumped frame rate on the port can be monitored.
Utilization metrics
To monitor the environment, some additional useful utilization metrics exist. These metrics are expressed as percentages:
- CPU Utilization (SVC and Storwize V7000 only)
- Volume Utilization
- Disk Utilization Percentage (ESS, DS6000, and DS8000 only)
- Port Send Utilization Percentage
- Port Receive Utilization Percentage
- Port Send Bandwidth Percentage (also available for SAN switches)
- Port Receive Bandwidth Percentage (also available for SAN switches)

Important: Monitor the relevant patterns over time for your environment, develop an understanding of expected behaviors, and investigate deviations from normal patterns to get early warning of abnormal behavior or to identify trends in workload changes. This is possible only if a solid baseline is available. In 3.3, Creating a baseline with Tivoli Storage Productivity Center on page 68, we describe what creating a baseline means, and how Tivoli Storage Productivity Center helps you to define it.
The appropriate value might also change between shifts or on the weekend. A response time of 5 milliseconds might be required from 8 a.m. until 5 p.m., while 50 milliseconds is perfectly acceptable near midnight. It is all customer and application dependent. The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days range from 4 to 8 msec, though in practice, many workloads do better than nominal. It is not uncommon for applications to experience from 1/3 to 1/2 of the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect a 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization has a wait time roughly equal to its service time, so a 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications. With solid state disks (SSDs), response times below 2 msec can be expected. Because SSDs are still very expensive, their use is currently reserved for the most demanding environments. Alternatively, current storage subsystems offer a mix of traditional disks and SSDs and use sophisticated software to optimize the utilization of these components, which is a complex process. In this way, fast response times and large capacity can be achieved at moderate cost.
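The service time arithmetic above can be sketched numerically. The seek, RPM, and transfer values below are illustrative assumptions, and the queueing step uses a simple M/M/1 approximation rather than anything specific to a given subsystem.

```python
# Illustrative numbers only: approximate disk response time from the
# seek + latency + transfer model in the text, plus an M/M/1 queueing estimate.
def disk_service_time_ms(seek_ms, rpm, transfer_ms):
    latency_ms = 0.5 * (60000.0 / rpm)  # half a rotation, in milliseconds
    return seek_ms + latency_ms + transfer_ms

def response_time_ms(service_ms, utilization):
    # M/M/1: wait = service * util / (1 - util); at 50% util, wait == service
    return service_ms * (1.0 + utilization / (1.0 - utilization))

svc = disk_service_time_ms(seek_ms=4.0, rpm=15000, transfer_ms=0.5)  # 6.5 ms
resp = response_time_ms(svc, 0.5)                                    # 13.0 ms
```

With an effective 4 msec seek, a 15K RPM disk lands at about 6.5 msec service time, and at 50% utilization the queueing estimate doubles that to 13 msec, consistent with the 10-14 msec guideline above.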
Application response
There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric can drive response times much higher than 30 msec. Appendix A, Rules of Thumb and suggested thresholds on page 327 summarizes some Rules of Thumb that can be used as a basis for performance problem determination.
This situation can happen with all storage subsystems or virtualization engines (IBM or non-IBM). The reason is often related to the internal cache management of the device. When I/O rates are low for a particular volume, it is usually the case that the volume is completely idle for an extended period of time, perhaps a minute or more. This can cause the device to flush the cache of that volume to disk, to free up valuable cache space for other volumes that are not idle. However, the first I/O that arrives for such a volume after an idle period requires the cache to be re-initialized and requires proper cache synchronization to be achieved across the redundant controllers or nodes of the storage subsystem. This process can be expensive in terms of performance, and can cause significant delays (sometimes multiple seconds) for that first I/O. In addition, for write I/O, the volume can operate in write-through mode until the cache has been fully synchronized, which causes a further slowdown because each write is reported as complete only after the update has been written to the back-end disk(s). Normally, each write is reported as complete as soon as the update has been written to cache, which is of course several orders of magnitude faster. As a result, depending on the caching scheme of the storage subsystem, idle or almost idle volumes can have extremely high response times. This is generally nothing to worry about unless the application performance is affected.

Important: If a high Response Time with a low I/O Rate is detected frequently by your threshold alert configuration, consider modifying the alert definition to define a Response Time threshold with an additional filtering option on I/O Rate. See 3.4.4, Defining the alerts on page 80 for details. This prevents your Response Time alert from triggering when the corresponding I/O Rate is very low, avoiding too many false-positive alert notifications.
In addition to throughput graphs, you can also produce graphs (or tabular data) for any of the metrics that you might have selected. For example, if the Write Response Time becomes high, you might want to look at the NVS Full metric for various components, such as the Volume or Disk Array. The Read, Write, and Overall Transfer Sizes are useful for understanding throughput and response times, and provide useful information for modeling tools like Disk Magic. In fact, the main inputs for Disk Magic are readily available in various performance reports: the Total I/O Rate, Read Percentage, Read Sequential and Read Hit Percentages, and Average Transfer Size. The data rate information in the performance reports (as well as Response Times, if available) can be used to calibrate Disk Magic model results. For most components, whether subsystem, controller, array, or port, there can be expected limits to many of the performance metrics. But there are few Rules of Thumb, because so much depends on the nature of the workload. Online Transaction Processing (OLTP) is so different from backup (such as IBM Tivoli Storage Manager backup) that the expectations cannot be similar. OLTP is characterized by small transfers; consequently, data rates might be lower than the capability of the array or box hosting the data. TSM backup uses large transfer sizes, so the I/O rates might seem low, yet the data rates test the limits of individual arrays (RAID ranks). Each storage subsystem also has different performance characteristics: from XIV, Storwize V7000, SVC, N series, DS4000, DS5000, and DS6000 to DS8000 models, each box can have different expectations for each component. The best Rules of Thumb are derived from looking at current (and historical) data for configurations and workloads that are not getting complaints from their users.
From this performance base, you can do trending, and in the event of performance complaints, look for the changes in workload that can cause them.
Small block reads (4-8 KB/op) must have average response times in the 2 msec to 15 msec range. The low end of the range comes from a very good Read Hit Ratio, while the high end of the range can represent either a lower hit ratio or higher I/O rates. Average response times can also vary from time interval to time interval. It is not uncommon to see some intervals with higher response times.
Small block writes must have response times near 1 msec. These must all be writes to cache and NVS and be very fast, unless the write rate exceeds the NVS and rank capabilities. Later we discuss performance metrics for these considerations.
Large reads (32 KB or greater) and large writes often signify batch workloads or highly sequential access patterns. These environments often prefer high throughput to low response times, so there is no guideline for these I/O characteristics. Batch and overnight workloads can tolerate very high response times without indicating problems.
Read Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For enterprise database servers, cache hits typically come from sequential I/O workloads, where pre-fetch loads data into the cache ahead of the reads. For very low hit ratios, you need many ranks providing good back-end response time.
SAN Storage Performance Management Using Tivoli Storage Productivity Center
It is difficult to predict whether more cache might improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open System workloads), although larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations. For random read I/O, the back-end rank (disk) read response times should seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-End Write Response Times can be higher because of RAID 5, RAID 6, or RAID 10 algorithms, but should seldom exceed 80 msec. There can be some time intervals when response times exceed these guidelines.
RAID array
RAID arrays also have IO/sec limitations that depend on the type of RAID (for example, RAID 5 versus RAID 10), the disk type, and the number of disks in the array. Because of the different RAID algorithms, it is not easy to know how many I/Os are actually going to the back-end RAID arrays. For many RAID 5 subsystems, a worst case can be approximated as the back-end read rate plus 4 times the back-end write rate (R + 4 * W), where R and W are the back-end read and write rates. Sequential writes can behave considerably better than the worst case. Use care when trying to estimate the number of back-end operations to a RAID array. The performance metrics seldom report this number precisely; you have to use the number of back-end read and write operations to deduce an approximate back-end Ops/sec number. The RAID array I/O limit depends on many factors, chief among them the number of disks in the array and the speed (RPM) of the disks. But when the number of IO/sec to an array (array size 7 or 8 disks) is near or above 1000, the array can be considered very busy. For 15K RPM disks, the limit is a bit higher. But these high I/O rates to the back-end array are not consistent with good performance. They imply that the back-end arrays are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy arrays. For a little more precision (but dubious accuracy), Table 3-1 shows the upper limit of performance for 10K and 15K RPM enterprise class devices using RAID 5 with 7 or 8 disks per array. Be aware that different people have different opinions about these limits, but rest assured that all these numbers represent very busy DDMs.
Table 3-1 Disk performance limits

  DDM speed   Max Ops/sec   6+P Ops/sec   7+P Ops/sec
  10 K RPM    150 - 175     1050 - 1225   1200 - 1400
  15 K RPM    200 - 225     1400 - 1575   1600 - 1800
While disks can achieve these throughputs, they imply a lot of queueing delay and high response times. These ranges probably represent acceptable performance only for batch oriented applications, where throughput is the paramount performance metric. For Online Transaction Processing (OLTP) applications, these throughputs might already have unacceptably high response times. Because 15K RPM DDMs are most commonly used in OLTP environments where response time is at a premium, here is a simple Rule of Thumb. Rule of Thumb: If the array using RAID 5 with 7+P is doing more than 1000 Ops/sec, it is too busy, no matter what the RPM.
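The worst-case RAID 5 estimate described in the text (R + 4 * W) can be written as a small helper; the function name and the sample rates are invented for illustration.

```python
# Sketch of the worst-case RAID 5 back-end estimate from the text:
# reads pass through 1:1, and each destaged write can cost up to 4 disk
# operations (read data, read parity, write data, write parity).
def raid5_backend_ops(read_rate, write_rate):
    """Approximate physical disk ops/sec behind a RAID 5 array, given
    back-end read and write rates in ops/sec."""
    return read_rate + 4 * write_rate

ops = raid5_backend_ops(read_rate=600, write_rate=150)  # 1200 ops/sec
```

At 1200 estimated back-end ops/sec, a 7+P array would already exceed the 1000 ops/sec Rule of Thumb above and should be considered too busy.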
For batch applications, you can notice low I/O rates and high response times. In these cases, the response time is not the appropriate performance metric. Rather, the throughput in megabytes per second is most important, and maximizing this metric can drive response times much higher than 30 msec. If available, it is the average front-end response time that really matters.

Tip: A safe and sane limit for physical Ops to the RAID arrays is closer to 100 Ops/sec per disk, which for typical 6+P and 7+P RAID 5 arrays translates to 700 or 800 Ops/sec per RAID 5 array.

In addition to these enterprise class drives, near-line drives (formerly known as SATA drives) of high capacity (currently 1 or 2 TB) and somewhat lower performance capabilities are now becoming options in mixtures with higher performing, enterprise class drives. These are definitely considered lower performance, capacity oriented drives, and have their own limits, as shown in Table 3-2.
Table 3-2 High capacity disk performance limits

  DDM speed   Max Ops/sec   6+P Ops/sec   7+P Ops/sec
  7.2 K RPM   85 - 110      595 - 770     680 - 880
These drive types must have limited exposure to enterprise class workloads, unless included in storage subsystems such as the IBM XIV Storage System, and the guidelines might be subject to substantial revision based on field experience. Another newer disk type, especially for high IO/sec workloads, is the Solid State Drive (SSD). In addition to better IO/sec performance, Solid State Drives offer a number of potential benefits over electromechanical Hard Disk Drives, including better reliability, lower power consumption, less heat generation, and lower acoustical noise. From a cost perspective, SSDs are much more expensive per GB but cheaper per I/O than electromechanical Hard Disk Drives. An important observation from test results is that the performance improvement with SSDs for large block writes is not as remarkable as seen with reads or with small block I/O in general. For example, while SSDs provide about 20 times the throughput of 15K RPM HDDs for 4 KB reads, the difference is only about 2 times for large block writes. This is a property of enterprise SSDs and not specific to the DS8000. Thus the best use cases for SSDs tend to be small block I/Os with a higher percentage of reads. Be aware that by using SSD disks, the performance bottleneck might move to other components within the storage subsystem, such as device adapters, controllers, or SAN ports. Because different types of SSDs exist with different performance characteristics, as well as different implementations and usages of SSDs, it is difficult to recommend a general IO/sec limit.

Today's enterprise class storage subsystems normally balance performance requirements over several disk arrays. For example, an SVC cluster uses several Managed Disks within a Managed Disk Group to stripe volumes over the whole back-end storage. In that way, all disk arrays are utilized equally, which avoids hot arrays.
To calculate the total I/O rate capability of such a Managed Disk Group, based on the physical disks, you can use the following formula (see the white paper EMC Symmetrix or DMX storage Controller Best practices when attached to IBM System Storage SAN Volume Controller (SVC) v4.2.x or later clusters for more details):

Formula: P = n((D * Q)/(R+(W*4)))

Where:
- P = total I/O capability
- n = number of Managed Disks
- D = IOPS capability of a disk
- Q = quantity of physical disks per Managed Disk
- R = read workload percentage
- W = write workload percentage
- 4 = RAID 5 write penalty

For example, we have a DS8000 with 48 disk arrays as back-end storage: 24 arrays have 7 disks, and 24 have 8 disks. For the disk type, we use 15K RPM DDMs. We assume a total workload of 80% read and 20% write. By using this formula, we get a physical disk performance capability (back end) for the Managed Disk Group of 45,000 IO/sec (see Example 3-2).
Example 3-2 Calculate total back-end I/O rate capability
45000 = 24*((200*8)/(0.8+(0.2*4))) + 24*((200*7)/(0.8+(0.2*4)))

This calculated Total Back-End I/O Rate does not reflect any cache hits (on the SVC or on the DS8000). Therefore, the front end (SVC) can sustain more I/O than this number. Because cache hits cannot be determined in advance, we recommend using the Total Back-End I/O Rate for capacity planning.
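The same calculation can be expressed in code. This is a sketch of the formula from the text; the function name and the grouping of arrays into tuples are our own, and the 200 IOPS per 15K RPM DDM comes from Table 3-1.

```python
# The P = n((D * Q)/(R + (W * 4))) formula from the text, applied per group
# of identical arrays: D = per-disk IOPS, Q = disks per Managed Disk,
# R/W = read/write fractions, 4 = RAID 5 write penalty.
def mdisk_group_capability(groups, read_fraction, write_fraction, disk_iops):
    """groups: list of (number_of_mdisks, disks_per_mdisk) tuples.
    Returns the total back-end I/O rate capability in ops/sec."""
    denom = read_fraction + write_fraction * 4
    return sum(n * (disk_iops * q) / denom for n, q in groups)

# DS8000 example from the text: 24 arrays of 8 disks plus 24 arrays of
# 7 disks, 15K RPM DDMs (~200 IOPS each), 80% read / 20% write workload.
total = mdisk_group_capability([(24, 8), (24, 7)], 0.8, 0.2, 200)  # 45000
```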
Figure 3-3 Performance Report Option in Tivoli Storage Productivity Center for Disk
It can then be useful to track any growth or change in the rates and response times. Frequently, the I/O rate grows over time, and response time increases as the I/O rates increase. This relationship is what capacity planning is all about. After a baseline is defined, and you know the expectations for your environment (for example, the maximum total I/O rate expected on an MDisk), as I/O rates and response times increase, you can use these trends to project when additional storage performance (as well as capacity) is required, or when alternative application designs or data layouts are needed. Typically, throughput and response time change drastically from hour to hour, day to day, and week to week. This is usually a result of different workloads between first or third shift production, or business cycles like month-end processing versus normal production. There can be periods when the values lie outside the expected range of values, and the reasons are not clear. Then you can use the other performance metrics to try to understand what is happening. The following additional metrics can be used to make sense of throughput and response time:
- Total Cache Hit Percentage: the percentage of reads and writes that are handled by the cache without needing immediate access to the back-end disk arrays.
- Read Cache Hit Percentage: focuses on reads, because writes are almost always recorded as cache hits. If Non-Volatile Storage (NVS) is full, a write can be delayed while some changed data is destaged to the disk arrays to make room for the new write data in NVS.
- Write-cache Delay Percentage: refers to Non-Volatile Storage for writes.
- Read Transfer Size (KB/op): the average number of bytes transferred per read operation.
- Write Transfer Size (KB/op): the average number of bytes transferred per write operation.
Utilization metrics
In addition, Tivoli Storage Productivity Center offers different utilization metrics that help to identify the current and historical utilization of a device. These are the most important metrics:
- CPU Utilization (%): average utilization percentage of the processors (SVC, Storwize V7000).
- Disk Utilization (%): the approximate utilization percentage of a rank over a specified time interval (the average percentage of time that the disks associated with the array were busy).
- Port Send Utilization Percentage (%): average amount of time that the port was busy sending data over a specified time interval.
- Port Receive Utilization Percentage (%): average amount of time that the port was busy receiving data over a specified time interval.
- Volume Utilization (%): the approximate utilization percentage of a volume over a specified time interval (the average percentage of time that the volume was busy). Tivoli Storage Productivity Center calculates this value in two steps:
  1. Population = Average I/O Rate * Average Response Time / 1000
  2. Utilization = 100 * Population / (1 + Population)

Tip: The new Volume Utilization metric, which is available for all storage subsystems, can provide a quick view into hot volumes as seen by servers and can be used as a starting point for performance analysis. This metric allows you to display a combination of two important metrics in a single report.

There are many more metrics available through Tivoli Storage Productivity Center, but these are the important ones for understanding throughput and response time.
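The two-step Volume Utilization calculation can be sketched directly; the function name and sample values are ours, but the arithmetic follows the text (the population term is effectively Little's law: the mean number of I/Os in flight).

```python
# The Volume Utilization calculation described in the text, as a sketch.
def volume_utilization(avg_io_rate, avg_response_ms):
    """avg_io_rate in ops/sec, avg_response_ms in milliseconds.
    Population = rate * response / 1000 (mean I/Os in flight);
    Utilization = 100 * population / (1 + population), as a percentage."""
    population = avg_io_rate * avg_response_ms / 1000.0
    return 100.0 * population / (1.0 + population)

# A volume doing 200 ops/sec at 5 ms average response time:
util = volume_utilization(avg_io_rate=200, avg_response_ms=5)  # 50.0 (%)
```

Note how the combination works: a volume can reach the same utilization through a high rate with fast responses or a low rate with slow responses, which is why this single metric is a useful hot-volume screen.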
Constraint violations
Another way to use Tivoli Storage Productivity Center to monitor performance is through the use of constraint (or threshold) violations. In 3.5.5, Constraint Violations reports on page 113, we illustrate in detail how to manage this special kind of report. There are a limited number of performance metrics for which you can set constraints. Several very useful throughput metrics can be monitored. Back-End Response Time (available for most IBM storage boxes) is the time to do staging or destaging between cache and disk arrays. Particularly useful thresholds include these:
- Total I/O Rate Threshold
- Total Back-End I/O Rate Threshold
- Write-cache Delay Percentage Threshold
Chapter 3. General performance management methodology
- Overall Back-End Response Time Threshold (a very important threshold)
- CPU Utilization Threshold (for SVC and Storwize V7000)
- Disk Utilization Percentage Threshold
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold

Remember that the back-end I/O rate is the rate of I/O between the cache and the disk RAID ranks in the back end of the storage. For storage boxes that support this metric, it typically includes disk reads from the array to cache caused by a read miss in the cache. Disk write activity from cache to disk array is normally an asynchronous operation that moves changed data from cache to disk, freeing up space in the NVS. The Back-End Response Time is averaged together with the response time for cache hits to give the Overall Response Time mentioned earlier. Always be clear whether you are looking at throughput and response time at the front end (very close to system-level response time as measured from a server) or at the back end (just between cache and disk).

There are other useful constraint thresholds, such as the Port Response Time thresholds, but they do not usually impact the throughput and response time from disk storage. When these thresholds are triggered, it is usually from a problem in the path between the servers and storage. When there are throughput or response time anomalies, you can be led rather naturally to look at performance reports for other metrics and other resources, such as Write-cache Delay Percentage, or the performance of individual RAID ranks, or particular volumes in critical applications.

Important: There is more to performance management than defining absolute threshold values for performance metrics. The key is to monitor normal operations, develop an understanding of expected behavior, and then track the behavior for either performance anomalies or simple growth in the workload.
This historical performance information is the main source for an effective baseline, and is the best source of data for any Performance Management environment.
Daily analysis is one of the major tasks for performance management. The baseline is the foundation for daily analysis. Through careful daily analysis, you can get an in-depth view of how your storage subsystem performs: changes in performance status are revealed, and anomalies are discovered.

Before carrying out a daily analysis, have a clear idea about what type of performance status you expect to see. Only with this premise can you know how to compare the current data with the baseline, how to judge whether the current status is acceptable, and how to choose the direction for further analysis if there is a significant difference between the current status report and the baseline, or if constraint threshold violations occur on a regular basis.

Every storage administrator has an expectation for the performance status of their storage subsystem before putting it into production. In order to meet this expectation, different configurations are made and different solutions are designed according to the workloads that the devices support. After the configurations are implemented and repeatedly tuned, and after the performance status becomes stable, the baseline of the performance status is set. Then, further daily analysis can be carried out to check whether the performance status is as expected, whether it continues to follow the patterns of the baseline, and whether the original expectation for the performance configuration is still valid. In all, you must have a basic understanding of how to configure a storage subsystem to meet the performance expectation.

The general method is to store regular performance reports under normal working conditions when there are no complaints raised by users. Then, when problems occur, you can compare the report generated in the timeframe when the user complained with the reports you generate now to analyze what happened.
With Tivoli Storage Productivity Center, you need to set up a performance data collection job for the device and think about the polling intervals, intervals to be skipped, and the data retention period. Note that retention cannot be set per storage subsystem. For certain situations, you might be able to get around this limitation by using the Tivoli Storage Productivity Center batch reporting function and storing the data outside of Tivoli Storage Productivity Center as a comma-separated value (CSV) text file or HTML report. When you need to bring up the baseline, you can either use the Tivoli Storage Productivity Center GUI, or, if that does not provide the required graphical reports, you can use TPCTOOL together with Excel to extract and display the baseline. See Appendix C, Reporting with Tivoli Storage Productivity Center, CLI: TPCTOOL as a reporting tool on page 370.

The baseline implementation with Tivoli Storage Productivity Center passes through the following main steps:
1. Set up the performance collection tasks. In 3.4, Performance data collection on page 70, we explain how to plan, define, and run a performance data collection.
2. Analyze the data to get familiar with the workload of the subsystems. In 3.5, Tivoli Storage Productivity Center performance reporting capabilities on page 92, we show all the Tivoli Storage Productivity Center reporting capabilities and how to use them.
3. Finally, define alerts if required. In 3.4.4, Defining the alerts on page 80, we explain how to define alerts in Tivoli Storage Productivity Center performance data collection jobs.

In order to get familiar with the workload of the subsystem, you have to gather performance data for an extended period, so that you can detect certain workload patterns. We cannot give recommendations for how long you need to collect data.
As a starting point, let the collection run for at least a week in order to account for more accurate daily data as well as changes in the workload between weekdays and weekends.
Needless to say, if you run into a problem before you establish your baseline, you still can use Tivoli Storage Productivity Center to diagnose the problem, but you cannot be sure that the problem that you think you see is really something new.
3.4.1 Planning
In this planning section, we discuss what you need to consider in order to decide how you will use Tivoli Storage Productivity Center. In the following section, we show you how to set up Tivoli Storage Productivity Center. Tivoli Storage Productivity Center uses the new NAPI for some storage devices (DS8000, XIV, SVC, and Storwize V7000) and also still uses the SMI-S standard for getting performance data from other storage devices. The SMI-S standard defines a polling mechanism to gather the data from the CIMOMs (the standard does not define how the CIMOMs get the data from the devices). With this polling approach, it is important that you consider each of the topics in this section:
- Devices to include
- Polling interval
- Scheduling
- Data retention
- Alerting
If you need to know more about Tivoli Storage Productivity Center installation and configuration, see the following resources:
- Tivoli Storage Productivity Center Version 4.2.1: Installation and Configuration Guide, SC27-2337-04:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_installguide_v421.pdf
- Tivoli Storage Productivity Center Hints and Tips:
  http://www-01.ibm.com/support/docview.wss?uid=swg27008254&aid=1
CIMOM sizing
If you use an external CIMOM, it can collect performance data from multiple storage subsystems at the same time, through separate performance data collection jobs that can even have different interval settings. Obviously, if you use one CIMOM to monitor multiple storage subsystems, the computer where this CIMOM resides must handle more workload. The real workload is driven by the number of volumes on which each CIMOM reports. This is an important issue that you must address by proper sizing. Regarding the number of devices and volumes per CIMOM, follow the general rules described in CIMOM recommended capabilities on page 44.
We have listed the supported intervals for some subsystems and SAN switches in Table 3-3.
Table 3-3 Supported CIMOM/NAPI intervals (examples)

  Subsystem/Fabric CIM/NAPI                               Minimum interval
  NAPI (SVC, Storwize V7000, DS8000, XIV)a                5 minutes
  IBM CIM Agent for DS Open API (DS6000, ESS)a            5 minutes
  LSI SMI-S Provider 1.3 10.06.GG.33 (DS4000, DS5000)a    5 minutes
  Engenio Version 10.50.G0.04 (DS3000, DS4000)a           5 minutes
  Brocade DCFM 10.4.1b                                    5 minutes
As mentioned previously, the shorter the sampling interval, the more data is generated and must be stored in the database. With default settings in Tivoli Storage Productivity Center, all the sampling data is stored in the database. To avoid database size problems while still watching for potential performance issues, Tivoli Storage Productivity Center has the ability to skip inserting samples into the database. This skipping function is useful when you need to do SLA reporting and longer term capacity planning at the same time. It is important to understand that every time a defined alerting threshold is reached, the sample is stored in the database anyway.
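As a rough illustration (not an official Tivoli Storage Productivity Center sizing formula), the interplay of the sampling interval, retention, and the skipping function can be sketched; the function name and parameters are invented for the example.

```python
# Back-of-the-envelope sketch: how the sampling interval and the
# "skip inserting samples" option drive per-device rows kept in the database.
def samples_retained(interval_minutes, retention_days, keep_every_nth=1):
    """keep_every_nth models the skipping function: only one of every N
    collected samples is inserted into the database."""
    per_day = (24 * 60) // interval_minutes
    return (per_day // keep_every_nth) * retention_days

full = samples_retained(interval_minutes=5, retention_days=30)              # 8640
thinned = samples_retained(interval_minutes=5, retention_days=30,
                           keep_every_nth=6)                                # 1440
```

With a 5-minute interval and 30-day sample retention, keeping only every sixth sample (an effective 30-minute granularity) cuts the stored rows per device by a factor of six while the collection itself still runs at 5-minute resolution for threshold checking.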
Capacity planning
When you want to use the data for capacity planning, there is no need to run the collection all the time. Instead, you can run it at certain intervals and capture data for a specific period of time that is representative of the periods that you do not monitor. What is important here is that you still set the database retention periods accordingly, because database retention is time-based only, not job-based. For more details on how to use Tivoli Storage Productivity Center for capacity planning management, see Capacity Planning and Performance Management on page 306.
For more details on how to use Tivoli Storage Productivity Center for problem determination, see Chapter 5, Using Tivoli Storage Productivity Center for performance management reports on page 185.
Figure 3-4 Performance monitor running for 24 hours and restarted every day
Retention
Tivoli Storage Productivity Center allows you to specify a retention period for each of the three types of samples, which gives you a second point of control over the amount of performance data that is stored within Tivoli Storage Productivity Center. The smallest interval that you can specify is always a day. These are the recommended values for each sample type, in order to define a consistent baseline and long-term historical data:
- Samples: 30 days
- Hourly averages: 180 days
- Daily averages: 365 days

In 3.4.5, Defining the data retention on page 84, we explain how to set up the data retention values. For an analysis of the impact of the collected data on the database sizing, see 2.1.5, Tivoli Storage Productivity Center database repository sizing formulas on page 37.
The alert function is extremely useful for maintaining SLAs. Every time a defined threshold is exceeded, Tivoli Storage Productivity Center not only notifies you through your standard alerting mechanism (an e-mail, an SNMP trap, and so on), but also records a constraint violation. You can then look at a special report (the Constraint Violation report) to get a quick overview of when something happened, and you can even drill down to the sample data that triggered the alert. This function is also useful for long-term performance capacity planning: you can set up an alert level to inform you when the workload reaches that level. The more often you get this alert, the closer the workload is to the SLA level that you are defining. See the following link for information about threshold and performance related alerts:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_perf_thresh.html
2. On the next panel (see Figure 3-6), select the subsystem from which you want to collect the data. Remember that you can create only one job for each subsystem. When you include a subsystem in a job, it is no longer available in the left column.
3. On the next panel (Figure 3-7), which is the most important one, specify when the job starts and finishes, how often data is collected, and whether the job is repeated. We explained these options in the previous sections.
4. Click the Advanced button to the right of the interval length field to open the dialog shown in Figure 3-8. This dialog is where you configure the skipping function described previously, which defines how much data is actually stored in the database.
5. On the last panel, define the alerts for a monitoring failure. In this example, an e-mail is sent to the Tivoli Storage Productivity Center administrator if the Monitor Failed condition occurs.
Stress alerts
Figure 3-10 shows a diagram that illustrates the four thresholds, which create five regions. Stress alerts define levels that, when exceeded, trigger an alert. An idle threshold triggers an alert when the data value drops below the defined idle boundary. There are two types of alerts for the stress category, two for the idle category, and one type for normal conditions:
1. Critical Stress: No Warning Stress alert is created, because both the warning and the critical levels are exceeded within the interval.
2. Warning Stress: It does not matter that the metric shows a lower value than in the last interval; an alert is triggered because the value is still above the warning stress level.
3. Normal workload and performance: No alerts are generated.
4. Warning Idle: The workload drops significantly, and this drop might indicate a problem (not necessarily performance-related).
5. Critical Idle: The same applies as for Critical Stress.
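The five regions can be expressed as a small classification function. This is an illustrative sketch, not TPC code; the function and region names are invented. A disabled boundary, modeled here as None, is simply not checked, which mirrors leaving a boundary field blank:

```python
def classify(value, critical_stress, warning_stress, warning_idle, critical_idle):
    # Map a metric value onto the five regions created by four thresholds.
    # A boundary set to None is disabled and never checked.
    if critical_stress is not None and value >= critical_stress:
        return "critical stress"
    if warning_stress is not None and value >= warning_stress:
        return "warning stress"
    if critical_idle is not None and value <= critical_idle:
        return "critical idle"
    if warning_idle is not None and value <= warning_idle:
        return "warning idle"
    return "normal"
```

For example, with boundaries of 85, 75, 10, and 5 percent, a value of 80 falls into the warning stress region, and a value of 3 falls into the critical idle region.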
If you do not want to be notified of threshold violations for any boundaries, you can leave the boundary field blank and the performance data will not be checked against any value. For example, if the Critical Idle and Warning Idle fields are left blank, no alerts will be sent for any idle conditions.
Alert suppression
If you selected a threshold as a triggering condition for alerts, you can specify the conditions under which alerts are triggered and choose whether you want to suppress repeating alerts. Alerts can be suppressed to avoid generating too many alert log entries or too many actions when the triggering condition occurs often. You can view suppressed alerts in the constraint violation reports. You can define the following options, which enable you to specify conditions that trigger and suppress alerts. If a threshold is not selected as a triggering condition, these options are not available.
Trigger alerts for both critical and warning conditions: Generates an alert upon the violation of either critical or warning threshold boundaries. This is the default.
Trigger alerts for critical conditions only: Generates alerts only upon violation of one of the critical threshold boundaries. Violation of a warning boundary creates an entry in the constraint violation report, but does not result in an entry in the alert log or an action being triggered.
Trigger no alerts: Does not generate an alert upon violation of any threshold boundaries. Creates entries only in the constraint violation report.
Do not suppress repeating alerts: Does not suppress any repeating alerts. This is the default.
Suppress alerts unless the triggering condition has been violated continuously for a specified length of time: Generates alerts only if the triggering condition has occurred continuously for the length of time specified in the Length of time field. Alerts for the first and any subsequent occurrences of the triggering condition within the specified time in minutes are suppressed. When there have been consecutive occurrences for the specified number of minutes, an alert is generated. When the specified suppression period has expired, the cycle starts again. Note that the timing for this feature is based on the IBM Tivoli Storage Productivity Center server clock rather than the various system clocks. This option is useful for cases where a single occurrence of the triggering condition might be insignificant, but repeated occurrences can signal a potential problem.
Suppress alerts if a repeat violation has occurred within a specified length of time after the initial violation of the triggering condition: Generates alerts only for the first occurrence of the triggering condition. Alerts for repeated occurrences of the triggering condition within the length of time specified in the Length of time field are suppressed. When the specified suppression period has expired, the cycle starts again. Note that the timing for this feature is based on the IBM Tivoli Storage Productivity Center server clock rather than the various system clocks. This option is useful for avoiding e-mail messages or similar disruptive alerts when the same triggering condition occurs repeatedly in successive sample passes, and it is generally useful for all threshold types.
Figure 3-11 illustrates the suppression of alerts. The CPU utilization of an SVC triggers an alert only if the node CPU stays continuously busy (Warning Stress) for at least 20 minutes.
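The first suppression option can be sketched as a small state machine. This is a hypothetical class, not TPC code; as in the product, a single (server-side) clock drives the timing:

```python
class ContinuousViolationSuppressor:
    # Emit an alert only after the triggering condition has held
    # continuously for `minutes`. Hypothetical sketch of the
    # "violated continuously for a specified length of time" option.
    def __init__(self, minutes):
        self.minutes = minutes
        self.violation_started = None

    def observe(self, timestamp_min, violated):
        # timestamp_min: minutes on a single (server) clock.
        if not violated:
            self.violation_started = None   # condition cleared; reset
            return False
        if self.violation_started is None:
            self.violation_started = timestamp_min
        return timestamp_min - self.violation_started >= self.minutes

# As in the Figure 3-11 example: CPU busy (Warning Stress) must persist
# for 20 minutes before alerting; samples arrive every 5 minutes.
s = ContinuousViolationSuppressor(20)
alerts = [s.observe(t, violated=True) for t in range(0, 30, 5)]
```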
Write Cache Delay, Back-End Queue Time, and Back-End Response Time Threshold Filtering
A filtering option is also available for the following metrics:
Write Cache Delay Percentage Threshold
Back-End Write Queue Time Threshold
Back-End Read Queue Time Threshold
Back-End Write Response Time Threshold
Back-End Read Response Time Threshold
Overall Back-End Response Time Threshold
By using this option, it is possible to ignore response time or queue time samples where the I/O rate was less than x, where x is entered by the user. This option allows you to address those circumstances where, even when no performance issue exists, low I/O rates together with high response times are measured by Tivoli Storage Productivity Center, as described in Low I/O rates and high response time considerations on page 60. Figure 3-12 shows an example of alert filtering, where the Back-End Response Time Threshold triggering condition is ignored when the Back-End Read I/O rate is less than 5 ops/s (the default value when this option is checked):
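The filtering rule amounts to a simple guard. This is an illustrative sketch; the 5 ops/s default comes from the example above, but the function itself is invented:

```python
def effective_violation(response_time_ms, threshold_ms, io_rate, min_io_rate=5):
    # Ignore a response-time threshold violation when the I/O rate is
    # below the configured minimum, because samples taken at very low
    # I/O rates produce misleading response times.
    if io_rate < min_io_rate:
        return False
    return response_time_ms > threshold_ms
```

A 40 ms response time against a 25 ms threshold is flagged only when the I/O rate is at or above the minimum.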
Known limitations
After thresholds are defined, alerts are always generated when the thresholds are reached. There is no way to specify a period of time during which alerts are or are not generated. As stated in 3.2.1, Performance data classification on page 56, depending on the particular storage environment, throughput or response times might change drastically from hour to hour or day to day (for example, because of backup sessions or batch operations). This means that there might be periods when the values of several metrics fall outside the expected range and an alert is triggered even though the behavior is normal for that period. This is a known Tivoli Storage Productivity Center limitation in threshold definition and alert generation. The only way to avoid false alarms is to be as aware as possible of the workload distribution in your environment and to understand the expected behavior in different hours or days. When an alert is triggered, consider when it happened in order to verify whether it actually indicates a real problem.
History Aggregation: Tivoli Storage Productivity Center has a configuration panel that controls how much history is kept over time.
Important: The history aggregation process is a global setting, which means that the values set for history retention are applied to all performance data from all devices. It is not possible to set history retention on an individual device basis.
Figure 3-13 shows the Tivoli Storage Productivity Center panel for setting the history retention for performance monitors as well as other types of collected statistics.
Figure 3-14 shows retention periods for the performance monitors, where you can see the recommended values as discussed in 2.1.4, Tivoli Storage Productivity Center database considerations on page 36.
The descriptions of the performance monitor values shown in Figure 3-14 are as follows:
Per performance monitoring task: The value set here defines the number of days that Tivoli Storage Productivity Center keeps individual data samples for all devices sending performance data. The example shows 30 days, as recommended in How long do you retain the data on page 75. When per-sample data reaches this age, Tivoli Storage Productivity Center permanently deletes it from the database. Increasing this value allows you to look back at device performance at the most granular level, at the expense of consuming more storage space in the Tivoli Storage Productivity Center repository database. Data held at this level is good for plotting performance over small time periods, but not for plotting data over many days or weeks, because of the number of data points. Consider keeping more data in the hourly and daily sections for longer time period reports. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for per-sample data.
Hourly: This value defines the number of days that Tivoli Storage Productivity Center holds performance data that has been grouped into hourly averages. Hourly average data consumes less space in the database. For example, if you collect performance data from an IBM SAN Volume Controller at 15-minute intervals, the hourly averages require only a quarter of the space in the database. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for hourly data.
Daily: This value defines the number of days that Tivoli Storage Productivity Center holds performance data that has been grouped into daily averages. After the defined number of days, Tivoli Storage Productivity Center permanently deletes the daily history records from the repository. Daily averaged data requires one twenty-fourth of the space needed to store hourly data. This savings comes at the expense of granularity; however, plotting performance over a longer period (perhaps weeks or months) becomes more meaningful. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for daily data.
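The space savings at the hourly and daily levels come from simple averaging, which can be sketched as follows (hypothetical helper; TPC performs this aggregation internally):

```python
from collections import defaultdict

def hourly_averages(samples):
    # samples: iterable of (hour, value) pairs.
    # Returns one averaged record per hour.
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: sum(values) / len(values) for hour, values in buckets.items()}

# Four 15-minute samples per hour collapse into a single hourly record,
# which is why hourly history needs roughly a quarter of the space.
avg = hourly_averages([(0, 10), (0, 20), (0, 30), (0, 40),
                       (1, 8), (1, 8), (1, 8), (1, 8)])
```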
Figure 3-15 Starting the performance data collection manually using Job definition
Figure 3-16 Starting the performance data collection manually using central job management
After you start the monitor, the job is visible only in the IBM Tivoli Storage Productivity Center Job Management panel (see Figure 3-17).
The message in the log file indicates whether it was a manual stop, as shown in Example 3-3.
Example 3-3 Logfile for manual job stop
2011-06-01 10:13:50.159 HWNPM2127I The performance monitor for device DS8000-2107-1301901-IBM (2107.1301901) is stopping due to a user request.
HWNPM2022E A performance monitor for device DS8000-2107-1301901-IBM (2107.1301901) is already active. A new monitor for the same device cannot be started until the previous monitor completes or is cancelled. This is the only situation in which the job entry of the currently running job is not in the last position in the job list. Note the red circle next to the failed job in Figure 3-19, which indicates that the job is not running and was not successful.
During the last interval of Job 15, the performance data collection job had problems gathering data. For details, you need to look into the job's log file.
If you end the job in this state, it does not end with a failed condition, provided that the job collected at least one sample.
State and description:
Finished successfully (green square): The duration time has elapsed or the user has stopped the job manually. If you want to know the details of the job after it completes, select the job log and click View Log File(s).
Finished with an error (red circle): An alert is created, if alerts have been set up. In this case, the reason is that a job that was started earlier was already running.
2011-05-31 21:29:09.069 HWNPM2113I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is starting in an active state.
2011-05-31 21:29:09.069 HWNPM2115I Monitor Policy: name="XIV-0646", creator="administrator", description="XIV-0646"
2011-05-31 21:29:09.069 HWNPM2116I Monitor Policy: retention period: sample data=14 days, hourly data=30 days, daily data=90 days.
2011-05-31 21:29:09.069 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
2011-05-31 21:29:09.084 HWNPM2118I Threshold Policy: name="Default Threshold Policy for XIV", creator="System", description="Current default performance threshold policy for XIV devices. This default policy can be overridden for individual devices."
2011-05-31 21:29:09.084 HWNPM2119I Threshold Policy: retention period: exception data=14 days.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Port IO Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Port Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Port Send Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Port Receive Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-05-31 21:29:09.209 HWNPM2112I Agent 9.11.123.69/9.11.123.70 has been selected for performance data collection from device XIV-2810-6000646-IBM (2810.6000646).
2011-05-31 21:29:10.163 HWNPM2203I Successfully retrieved the configuration data for the storage subsystem. Found 15 modules, 24 ports, and 921 volumes.
2011-05-31 21:30:11.819 HWNPM2123I Performance data for timestamp 05/31/11 08:54:39 PM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 21:35:11.475 HWNPM2123I Performance data for timestamp 05/31/11 08:59:39 PM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 21:40:13.069 HWNPM2123I Performance data for timestamp 05/31/11 09:04:39 PM was collected and processed successfully. 460 performance data records were inserted into the database.
2011-05-31 21:45:11.444 HWNPM2123I Performance data for timestamp 05/31/11 09:09:40 PM was collected and processed successfully. 306 performance data records were inserted into the database.
Device configuration changes are displayed in the performance job log file. This is normal behavior. Depending on the storage device type, those changes are handled differently. For SVC, the change is recognized at the next restart of the performance monitor and by running a probe. We restart the performance monitor every day; therefore, we miss at most the first 24 hours of performance data for a new object, such as a volume (Example 3-6).
Example 3-6 Device configuration changes during performance monitoring job
2011-06-01 11:20:52.084 HWNPM2123I Performance data for timestamp 06/01/11 11:10:02 AM was collected and processed successfully. 4981 performance data records were inserted into the database.
2011-06-01 11:25:38.975 HWNPM4189W 2 of the MDisk statistics from the device agent were unrecognized and were not included in this sample interval.
2011-06-01 11:25:38.975 HWNPM4182W 6 of the volume statistics from the device agent were unrecognized and were not included in this sample interval.
Every hour, the number of performance data records is much higher than in the previous collections. This indicates the collection of the hourly data, which is done at the same time every hour (Example 3-7).
Example 3-7 Samples and hourly data records
2011-05-31 03:20:11.460 HWNPM2123I Performance data for timestamp 05/31/11 02:44:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:25:12.132 HWNPM2123I Performance data for timestamp 05/31/11 02:49:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:30:12.132 HWNPM2123I Performance data for timestamp 05/31/11 02:54:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:35:11.116 HWNPM2123I Performance data for timestamp 05/31/11 02:59:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:40:12.382 HWNPM2123I Performance data for timestamp 05/31/11 03:04:41 AM was collected and processed successfully. 460 performance data records were inserted into the database.
2011-05-31 03:45:12.023 HWNPM2123I Performance data for timestamp 05/31/11 03:09:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
There are many messages, and we cannot give examples for all of them. See the following link for more information about messages:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/tpcmsg42122.html
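Because the HWNPM2123I messages have a regular shape, you can extract the inserted record counts from a job log with a short script, for example to spot the larger hourly-aggregation inserts. This is an illustrative sketch; the message format is assumed to match the excerpts shown above:

```python
import re

# Matches the HWNPM2123I success messages shown in the job log excerpts.
PATTERN = re.compile(r"HWNPM2123I .*?(\d+) performance data records were inserted")

def records_inserted(log_lines):
    # Return the record count from each HWNPM2123I line, in order.
    return [int(m.group(1)) for line in log_lines
            if (m := PATTERN.search(line))]

counts = records_inserted([
    "2011-05-31 03:30:12.132 HWNPM2123I Performance data for timestamp "
    "05/31/11 02:54:41 AM was collected and processed successfully. "
    "306 performance data records were inserted into the database.",
    "2011-05-31 03:40:12.382 HWNPM2123I Performance data for timestamp "
    "05/31/11 03:04:41 AM was collected and processed successfully. "
    "460 performance data records were inserted into the database.",
])
```

A count that jumps well above the others (here 460 versus 306) marks the interval in which the hourly records were also written.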
Tivoli Storage Productivity Center server restarts while performance data collection jobs run
If performance data collection jobs were running when the server was stopped, they are restarted when the server is up again, as long as the specified duration has not been reached. This situation is indicated by the log message HWNPM2113I in Example 3-8.
Example 3-8 Job start after Tivoli Storage Productivity Center restart: Active state
...
2011-06-01 12:30:11.076 HWNPM2123I Performance data for timestamp 06/01/11 11:54:38 AM was collected and processed successfully. 308 performance data records were inserted into the database.
2011-06-01 12:30:42.951 HWNPM2129I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is stopping because of a shutdown request.
2011-06-01 12:35:37.390 HWNPM2113I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is starting in an active state.
2011-06-01 12:35:37.406 HWNPM2115I Monitor Policy: name="XIV-2210-6000646", creator="administrator", description="XIV-2210-6000646"
2011-06-01 12:35:37.406 HWNPM2116I Monitor Policy: retention period: sample data=30 days, hourly data=180 days, daily data=365 days.
2011-06-01 12:35:37.406 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
...
Note that a new log file has not been created.
In Table 3-5, we list the logical points at which metrics are collected for several IBM systems. Other storage subsystems likely provide information similar to the DS4000; however, this depends on their conformance to the SMI-S standard, in which certain metrics are required and other metrics are optional. Not all vendors provide identical data.
Table 3-5 Logical reporting levels by device type
ESS, DS6000, and DS8000: Subsystem, Controller, Array, Volume, Port
DS4000 and DS5000: Subsystem, Controller, Volume, Port
SVC and Storwize V7000: Subsystem, I/O Group, Node/Node Canister, Managed Disk Group (Storage Pool), Volume, Managed Disk, Port
XIV: Subsystem, Module, Volume, Port
Switch: Port
It is useful to learn what data is available for a certain subsystem; this is typically the data listed in Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331. One advantage of using Tivoli Storage Productivity Center instead of other tools is that some vendor-provided low-level tools do not report values based on standard units, such as GB and MB, but instead use more hardware-related units. If you try to compare the results of those tools with the numbers displayed in Tivoli Storage Productivity Center, you must convert the numbers of the other tools into the units used by Tivoli Storage Productivity Center. This conversion is especially important when you compare capacities. Basically, Tivoli Storage Productivity Center provides three types of reports:
Predefined performance reports: These are the standard reports that ship as part of the Tivoli Storage Productivity Center package.
Customized reports: In addition to the standard reports, Tivoli Storage Productivity Center gives you the option to create and save your own reports. These reports can be configured to run on a regular basis and to be saved to a file; in this case, they are called batch reports.
Constraint report: This is a special report, available in the Reporting navigation subtree, that lists all the threshold-based alerts.
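As an example of the unit conversions mentioned above, converting between binary and decimal capacity units can be done as follows. This is a sketch; which direction of conversion you actually need depends on the units each tool reports, so check each tool's documentation:

```python
def gib_to_gb(gib):
    # Convert binary gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes).
    return gib * 2**30 / 10**9

# A tool reporting 100 in binary units describes about 107.37 decimal GB,
# a difference of more than 7% that comes from the units alone.
converted = gib_to_gb(100)
```

Such a discrepancy, if left unconverted, is easily mistaken for missing or phantom capacity when comparing two tools.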
In the following sections, we illustrate each of these reporting capabilities in detail, describing how to display and work with the data. Moreover, in Appendix C, Reporting with Tivoli Storage Productivity Center on page 365, we discuss additional reporting capabilities and tools that can be used with Tivoli Storage Productivity Center to generate and export reports based on the Tivoli Storage Productivity Center data that you have collected.
The predefined Tivoli Storage Productivity Center performance reports are customized reports that include only specific metrics. In contrast, the reports that you can generate in Disk Manager contain, by default, all the metrics that apply to a given component of the subsystem (for example, controller, array, or volume). You must use the Selection and Filter buttons to reduce the size of the report to suit your requirements. We recommend that you create and save your own reports so that you do not have to do this every time you open a report. In the following sections, we describe the Tivoli Storage Productivity Center standard reports.
Array performance
The Array performance report (Figure 3-21) is useful when you want to see whether the workload is evenly distributed across the arrays in your environment. Do this check on a regular basis. Currently, array-level information is available only for DS6000, DS8000, and ESS subsystems.
Port performance
This report includes many metrics that are specific to the SVC/Storwize V7000, such as port-to-host or port-to-disk traffic, as shown in Figure 3-27. It is also a useful tool to confirm that SAN traffic is balanced across all the front-end ports of the storage subsystem.
Subsystem performance
In the Subsystem performance report, the metrics are aggregated into the overall performance data for the reported metrics, as shown in Figure 3-28. This report gives a high-level administrative view to gauge your subsystem's overall performance.
For Disk Manager, there are also additional capacity-related reports, which are indicated by the selection Storage Subsystems just above the Storage Subsystem Performance selection. In the IBM Tivoli Storage Productivity Center subtree, you see preconfigured reports and reports that you have saved (which we show you later). The Reporting navigation tree selections of the Disk Manager or Fabric Manager provide many more details. Because these additional reports can provide an overwhelming amount of detailed information, regard them primarily as the basis for creating your own reports. You often see not applicable (N/A) values, because the level of information provided by the storage subsystems varies greatly. The varying levels of information result from the Block Server Performance (BSP) subprofile still being at an early stage, and also from differences in storage subsystem architectures. Tivoli Storage Productivity Center groups the various metrics according to the levels at which the data was collected or aggregated, not by device type. The whole point of SMI-S is to standardize the information; otherwise, there would be too many individual reports if Tivoli Storage Productivity Center provided one report per storage subsystem and per level.
Therefore, we recommend that you use the standard reports in the Reporting navigation subtrees as a set of building blocks to create and save your own reports, so that you do not need to customize a standard Tivoli Storage Productivity Center report every time you open Tivoli Storage Productivity Center. The following items are most suited for customization:
Included columns
Filters
Unfortunately, the following options are currently not available:
Subsystem device type
Host names in the volume report
All the attributes that Tivoli Storage Productivity Center uses to drill down for more details (for example, array names and device adapter IDs), which are needed especially at the volume level
To overcome these limitations, implement a common naming schema for those items that are included in the reports and can be customized:
Storage subsystem names: Include the type so that you can filter on type.
Volume names: Include the host name.
Charts cannot be saved, only exported, by using File → Print (Printer, PDF, HTML). However, they can easily be recreated, because the charts are created from the data of the underlying tabular reports. After you save a report, you can still modify it. For example, if you saved a report containing only SVC volumes (based on a filter for the subsystem name), you can add more filter options to include just a certain volume.
Within the individual subtrees, you see the components on which Tivoli Storage Productivity Center can report, as well as the constraint statistics report. See Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331 for more information about the available metrics. If you select a report type, you see the same panel layout that is used for all of the reports, as shown on the right side of Figure 3-33. You can select the columns (metrics) that you want to include in the report. You can also click Selection to limit the information included in the report, and you can create additional filters. If you do not limit the records that are included in the report by using the Selection or Filter function, you might get an extremely long report.
Tip: If you want to reduce the number of records displayed, for some reports it is better to use the filter function instead of the selection function. Displaying the list of available components that you can select or deselect might itself take a long time, because a query has to be submitted to the database. Defining a filter, however, does not involve any database activity. This is especially true for volume reports, where the selection list might include hundreds of volume entries.
You can also specify the time range on the first panel so that less data is included in the report, which speeds up the following steps.
If you click the drill-down icon in Figure 3-34, you get a report containing all the volumes that are stored on that specific array. If you click the drill-up icon, you get a performance report at the controller level. Figure 3-35 shows the various components and levels to which you can drill up and down.
Figure 3-35 Various drill down possibilities: SVC/Storwize V7000 and DS8000/DS6000
See Figure 3-36 for DS4000, DS5000, and XIV drill up and drill down capabilities.
Reports: There is one drawback when you drill up or down to the next level from saved customized reports: after you drill up or down, the next-level reports include all columns again, so you cannot skip from one customized report to another. Drill-down also does not currently work for a DS4000, because no drill-down level reports are available from the controller; Tivoli Storage Productivity Center displays the error in Figure 3-37.
Creating a chart
From the tabular report, you can also create a chart to visualize the data, no matter whether you are looking at a predefined report or at one that you have customized and saved. There are two types of charts that Tivoli Storage Productivity Center can display: a bar chart and a line chart for displaying history information. To create a chart, select the records to display and click the pie chart icon in the upper left corner (see Figure 3-34 on page 104). Tivoli Storage Productivity Center displays the dialog box shown in Figure 3-38, where you can select the options for the chart. First, select the chart type, and then the metrics to display. You can select multiple components and multiple metrics at the same time, as long as those metrics use the same units. For example, you can display the Read I/O Rate (overall) and the Write I/O Rate (overall) at the same time (as shown in Figure 3-38), but you cannot display the Total Read I/O and the Read Response Time on one chart.
Tip: Whenever you use this function, we recommend that you set History Chart Ordering to By Components. By doing this, you get all metrics for a component at the same time, instead of all the components with the first metric and then all the components with the second metric on another page.
After you click OK, it might take a while for Tivoli Storage Productivity Center to query the data from the database and display it (see Figure 3-39).
If you right-click the chart, you see a context menu with a single entry, Customize this chart, which you can use to set additional options.
There are several considerations to keep in mind: If, for any reason, Tivoli Storage Productivity Center did not receive any samples for a certain time frame, the diagram does not indicate this gap, but simply draws a straight line between the last sample before the gap and the first sample after it. See Figure 3-40.
In Figure 3-40, given the short interval (see the left part of the diagram), it is obvious that there was an interruption in the performance collection, because the line in the illustrated gap runs straight from one point to another point that is multiple intervals later. If all of the records cannot be displayed on a single chart, you can use the Next and Prev buttons (on top of bar charts, or in the top right corner of line charts) to scroll through the pages.
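If you export the underlying tabular data, you can detect such gaps yourself by checking the spacing of the sample timestamps. This is a hypothetical helper; timestamps here are plain seconds for simplicity:

```python
def find_gaps(timestamps, interval_seconds, tolerance=1.5):
    # Return (prev, next) timestamp pairs whose spacing exceeds
    # tolerance * interval, i.e. places where a chart would silently
    # draw a straight line across missing samples.
    gaps = []
    for prev, nxt in zip(timestamps, timestamps[1:]):
        if nxt - prev > tolerance * interval_seconds:
            gaps.append((prev, nxt))
    return gaps

# 5-minute samples with one missing stretch between 600 and 1800 seconds:
gaps = find_gaps([0, 300, 600, 1800, 2100], 300)
```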
2. Continue with the batch report creation by selecting the performance report type from the Report tab as shown in Figure 3-42.
3. After selecting the performance report type, complete your customization by clicking the tabs:
a. Selection
b. Options
c. When to Run
d. Alerts
The following series of panels takes you through the steps to create a batch report.
4. From the Selection tab, specify the criteria for selecting and displaying report data (see Figure 3-43).
5. Click Selection in the upper right side to specify the resources (for example, all DS8000, SVC, and Storwize V7000 systems) to display in the report.
6. Click Filter, also in the upper side, to apply filters to the data that you want to display.
7. On the Options tab (see Figure 3-44), specify:
- The machine and file system location in which to save the report file
- The report format:
  - CSV File: this format can optionally include headers and totals
  - Formatted file
  - HTML file
  - History CSV File: this format can optionally include headers
  - PDF Chart
  - HTML Chart
- Whether to use classic column names
- Whether to run a script when the report process completes
- The format for the name of the batch report
8. On the When to Run tab (see Figure 3-45), specify when to run the batch report, how frequently to run it, and the time zone. The following options are available for the time zone:
- Local time in each time zone: Select this option to use the time zone of the location where the agent that runs the batch report is located.
- Same Global time across all time zones: Select this option to apply the same global time across all the time zones where the probe is run.
9. The Alert tab (see Figure 3-46) allows you to define an alert that is triggered if the report does not run on schedule.
10. After you have specified the options for the batch report, click File → Save As and type a name for the batch report. The batch report is saved with the user ID that you are logged on with as a prefix. In our case, we are logged on to Tivoli Storage Productivity Center as tpcadmin; therefore, the name of the batch report is administrator.Storage Subsystem Performance R1 (see Figure 3-47).
After you run this example reporting job, the output looks like Figure 3-48. Each sample (5 minute interval) represents a row. Each metric is displayed in a column.
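The batch-report CSV output described above (one row per 5-minute sample, one metric per column) lends itself to simple post-processing. The following sketch shows one way to average a metric column; the column headers and sample values here are illustrative, so substitute the headers from your own export.

```python
import csv
import io

# Illustrative excerpt of a batch-report CSV export; real exports
# contain the headers chosen when the batch report was defined.
sample_csv = """Time,Read I/O Rate (overall),Write I/O Rate (overall)
2011-06-01 00:00,120.4,80.2
2011-06-01 00:05,130.1,75.9
2011-06-01 00:10,110.8,90.3
"""

def average_metric(csv_text, metric):
    """Average one metric column across all 5-minute samples."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(r[metric]) for r in rows]
    return sum(values) / len(values)

print(round(average_metric(sample_csv, "Read I/O Rate (overall)"), 2))  # 120.43
```

The same approach scales to the History CSV format, where you would typically group the rows by hour or day before averaging.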
These are the most useful thresholds for storage subsystems for performance monitoring:
- Write-cache Delay Percentage Threshold
- Total I/O Rate Threshold
- Back-End Write Response Time Threshold
- Back-End Read Response Time Threshold
- Total Back-End I/O Rate Threshold
- Disk Utilization Percentage Threshold
- CPU Utilization Threshold
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold
These are the most useful thresholds for switches for performance monitoring:
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold
Tip: Tivoli Storage Productivity Center also offers several thresholds for error counters on SAN ports, which we highly recommend using if no other tool is already monitoring them. Errors on SAN ports, especially on Inter-Switch Links (ISLs), can heavily impact the performance of the SAN fabric.
For information about the exact meaning of these and the other thresholds, see Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331, where we explain the metrics and show which subsystems really support each metric. You can also see at which level each metric is available. For example, the Total I/O Rate is a controller and I/O Group metric; therefore, you need to specify the values for a controller or I/O Group and not for the whole subsystem, even though you need to select the name of the subsystem in the second panel. Also, not all metrics are supported by all subsystems. You might select a threshold, but later find that you cannot select your subsystem, because that subsystem does not support the selected metric (this is often the case with DS4000 systems, which have a limited number of metrics compared to the DS8000).
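Conceptually, each of these thresholds carries a pair of Critical Stress and Warning Stress boundaries. The following sketch is not TPC code; it simply illustrates how such a check behaves, borrowing the convention from the job logs later in this chapter that -1 marks a blank (disabled) boundary.

```python
# Hedged sketch of a Critical/Warning Stress threshold evaluation;
# a boundary of -1 is treated as blank (disabled), mirroring the
# "-1 indicates blank" convention in the performance job logs.
def classify(value, critical_stress, warning_stress):
    if critical_stress != -1 and value >= critical_stress:
        return "Critical Stress"
    if warning_stress != -1 and value >= warning_stress:
        return "Warning Stress"
    return "Normal"

# Disk Utilization Percentage Threshold with boundaries 80,50 (%):
print(classify(85, 80, 50))  # Critical Stress
print(classify(60, 80, 50))  # Warning Stress
print(classify(30, 80, 50))  # Normal
```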
Figure 3-50 shows an example of the constraint violation report using the Disk Utilization Percentage Threshold, limited to the time range of June 1 to June 2, 2011. This report shows that the DS8000.2107-1301901-IBM subsystem exceeded the Disk Utilization Percentage threshold 28 times:
By clicking the lens icon, you can see the details, as shown in Figure 3-51.
As you can see, the Critical Stress threshold was exceeded 28 times, always by array A14. The threshold values are those set in the performance monitor job, as you can see from the job log lines reported in Example 3-9 (see the bolded lines; -1 indicates blank):
Example 3-9 Performance collection job log
2011-06-01 19:29:05.500 HWNPM2113I The performance monitor for device DS8000-2107-1302541-IBM (2107.1302541) is starting in an active state.
2011-06-01 19:29:05.500 HWNPM2115I Monitor Policy: name="ds8k-1302541", creator="administrator", description="ds8k-1302541"
2011-06-01 19:29:05.500 HWNPM2116I Monitor Policy: retention period: sample data=30 days, hourly data=180 days, daily data=365 days.
2011-06-01 19:29:05.500 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
2011-06-01 19:29:05.500 HWNPM2118I Threshold Policy: name="Default Threshold Policy for DS8000", creator="System", description="Current default performance threshold policy for DS8000 devices. This default policy can be overridden for individual devices."
2011-06-01 19:29:05.500 HWNPM2119I Threshold Policy: retention period: exception data=14 days.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Port IO Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Port Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Overall Port Response Time Threshold, enabled=no , boundaries=-1,-1,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Error Frame Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Link Failure Rate Threshold, enabled=no , boundaries=0.0030,-1,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=CRC Error Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Send Utilization Percentage Threshold, enabled=no , boundaries=-1,-1,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Receive Utilization Percentage Threshold, enabled=no , boundaries=-1,-1,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Send Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Receive Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Invalid Transmission Word Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Write-cache Delay Percentage Threshold, enabled=yes, boundaries=10,3,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Cache Holding Time Threshold, enabled=yes, boundaries=30,60,-1,-1 s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Back-end I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Back-end Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Back-end Read Response Time Threshold, enabled=no , boundaries=35,25,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Back-end Write Response Time Threshold, enabled=no , boundaries=120,80,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Disk Utilization Percentage Threshold, enabled=yes, boundaries=80,50,-1,-1 %.
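If you want to audit which thresholds are enabled across many such job logs, the HWNPM2120I lines are regular enough to scrape. The following is only an illustrative log parser based on the format shown above, not an official Tivoli Storage Productivity Center interface:

```python
import re

# One HWNPM2120I line from a performance collection job log,
# copied from the format shown in Example 3-9.
line = ('2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: '
        'threshold name=Disk Utilization Percentage Threshold, '
        'enabled=yes, boundaries=80,50,-1,-1 %.')

# "enabled=no ," lines carry a stray space before the comma,
# so the pattern tolerates optional whitespace there.
pattern = re.compile(
    r'HWNPM2120I Threshold Policy: threshold name=(?P<name>[^,]+), '
    r'enabled=(?P<enabled>\w+)\s*, boundaries=(?P<bounds>[-\d.,]+)')

m = pattern.search(line)
name = m.group('name')
enabled = m.group('enabled') == 'yes'
bounds = [float(b) for b in m.group('bounds').split(',')]  # -1 = blank
print(name, enabled, bounds)
```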
By selecting the Storage Subsystem Performance - By Array chart and selecting the DS8000 arrays, we can see the related graphs. To do so, follow these steps: 1. Access the array storage subsystems performance panel (Disk Manager → Reporting → Storage Subsystems Reporting → By Array; see Figure 3-52).
2. Click the Generate Report button, and in the next panel, select the specific DS8000 lines, as shown in Figure 3-53.
3. Click the chart icon and select the Disk Utilization Percentage metric (Figure 3-54).
4. As shown in Figure 3-55, click the chart icon. In the next panel, select the time interval of interest (June 1 00:00 until June 2 24:00, that is, the time range in which the thresholds were exceeded).
As you can see from the chart, array A14 exceeded the Critical Stress value of 80% 28 times between June 1 and June 2, 2011.
In Figure 3-56 we show the panel with the recommended values, 12 hours for the create
In the Navigation Tree pane, expand IBM Tivoli Storage Productivity Center → Analytics → Configuration History, and click Configuration History. The software loads the snapshot data for the length of time that you specified. The Configuration History page (a variation of the Topology Viewer) displays the configuration entities and a floating snapshot selection panel. The panel allows you to define the time periods against which the configuration is compared to determine whether changes have occurred (see Figure 3-57). Use the thumb sliders to establish the time interval that you want to examine.
1. To define the time periods that you want to compare, perform the following tasks: a. Using the mouse, drag the two thumbs in the left Time Range slider to establish the desired time interval. The Time Range slider covers the range of time from the oldest snapshot in the system to the current time. It indicates the date as mm/dd/yy, where mm equals the month, dd equals the day, and yy equals the year.
Chapter 3. General performance management methodology
b. Drag the two thumbs in the right Snapshots in Range slider to indicate the two snapshots to compare. The Snapshots in Range slider allows you to select any two snapshots from the time interval specified by the Time Range slider. The value in parentheses beside the Snapshots in Range slider indicates the total snapshots in the currently selected time range.
The Snapshots in Range slider has one check mark for each snapshot from the time interval that you specified in the Time Range slider. Each snapshot in the Snapshots in Range slider is represented as a time stamp mm/dd/yy hh:mm, where the first mm equals the month, dd equals the day, yy equals the year, hh equals the hour, and the second mm equals the minute. The value in parentheses beside each snapshot indicates the number of changes that have occurred between this and the previous snapshot. Snapshots with zero changes are referred to as empty snapshots. If you provided a title while creating an on demand snapshot, the title displays after the time stamp. If you want to remove empty snapshots, click the check box to display a check mark in Hide Empty Snapshots. The Displaying Now box indicates the two snapshots that are currently active.
c. Click Apply to continue.
d. Determine the changes that have occurred to the entities by examining the icons and colors associated with them in the graphical and table views. For information about viewing the changes, see 3.6.1, Viewing configuration changes in the graphical view on page 122 and 3.6.2, Viewing configuration changes in the table view on page 124.
One single snapshot selection panel applies to all Configuration History views that are open at the same time. Any change that you make in this panel is applied to all of the Configuration History views.
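The comparison that the Configuration History view performs between two snapshots can be pictured with a small sketch; the entity names and attributes here are invented for illustration, and this is not the product's internal logic.

```python
# Two invented configuration snapshots, mapping entity name to
# a dictionary of its attributes at snapshot time.
older = {"switch01": {"ports": 16}, "vol_test": {"size_gb": 10},
         "ds8000_a14": {"raid": 5}}
newer = {"switch01": {"ports": 24}, "ds8000_a14": {"raid": 5},
         "vol_new": {"size_gb": 100}}

def classify_changes(old, new):
    """Classify every entity the way the change overlay does."""
    changes = {}
    for entity in set(old) | set(new):
        if entity not in old:
            changes[entity] = "created"      # green cross
        elif entity not in new:
            changes[entity] = "deleted"      # red minus sign
        elif old[entity] != new[entity]:
            changes[entity] = "changed"      # yellow pencil
        else:
            changes[entity] = "unchanged"
    return changes

print(classify_changes(older, newer))
```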
Table 3-6 Icons and colors of the change overlay:
- Yellow pencil, blue background: Entity changed between the time that the snapshot was taken and the time that a later snapshot was taken.
- No icon, dark gray background: Entity did not change between the time that the snapshot was taken and the time that a later snapshot was taken.
- Green cross: Entity was created or added between the time that the snapshot was taken and the time that a later snapshot was taken.
- Red minus sign, red background: Entity was deleted or removed between the time that the snapshot was taken and the time that a later snapshot was taken.
- Not applicable, light gray background: Entity did not exist at the time that the snapshot was taken or at the time that a later snapshot was taken.
Figure 3-58 shows the icons and colors of the change overlay. In the graphical view, the pencil icon beside the switches and storage entities and the blue background color indicate that change occurred to these entities. The pencil icon and blue background also appears for these entities in the table view. In the snapshot selection panel, use the Time Range and Snapshots in Range sliders to determine when the change occurred.
To distinguish them from tabs in the Topology Viewer page, tabs in the Configuration History page (Overview, Computers, Fabrics, Storage, and Other) have a light gray background and are outlined in orange. The minimap in the Configuration History page uses the following colors to indicate the aggregated change status of groups:
Blue: One or more entities in the group have changed. Note that the addition or removal of an entity is considered a change. Gray: All of the entities in the group are unchanged.
Entities in the graphical view can be active (they existed at one or both snapshots) or inactive (not yet created or deleted):
Active entities act as they normally do in the Topology Viewer; when you select them, all relevant information also appears in the table view. You can adjust a grouping of active entities, but you cannot perform actions that change the database, such as pinning.
Inactive entities do not exist in the selected snapshots, but exist in other snapshots.
They are shown with a light gray background and do not have a change icon associated with them. Inactive entities are displayed to keep the topology layout stable and to make it easier to follow what has changed (instead of having entities flicker in and out of existence when you change the snapshot selection). Inactive entities are not listed in the table view. An entity that is moved from one group to another group appears only once, in the new group, in the graphical view. For example, if the health status of a computer has changed from Normal to Warning, the Configuration History page displays the computer as changed in the Warning health group (and no longer displays the computer in the Normal health group).
Attention: In the Configuration History view, the performance and alert overlays are disabled, and the minimap's shortcut to the Data Path Explorer is not available.
Access this panel daily to verify that all the services are up and running and that the data sources (Fabric and Data/Storage Resource Agents, CIM Agents and NAPI, and Out-of-Band agents) are up and reachable.
3.7.2 Verifying that Discovery, probes, and performance monitors are running
The discovery, probe, and performance monitor jobs (and scan jobs as well, even though they are outside the scope of this book) must be configured to run daily. Figure 3-61 and Figure 3-62 show two examples: a CIMOM discovery job and an SVC probe job.
Also check that all planned performance monitors are running. As a best practice, restart the jobs daily (see Figure 3-63 for an example).
Also check within the IBM Tivoli Storage Productivity Center Job Management panel for jobs with a status of Warning or Failed (see Figure 3-64).
To check whether all storage subsystems are monitored, choose Entity Type → Storage Subsystem. A button appears with the label Show Recommendation; click it to see the recommendations from Tivoli Storage Productivity Center. By selecting an entry and clicking the Take Actions button, you can solve the issues directly from this panel (see Figure 3-65).
Tivoli Storage Productivity Center for Disk threshold: CPU utilization threshold
To create a storage subsystem alert, you can choose Disk Manager → Alerting → Storage Subsystem Alerts. In Figure 3-66, we show an alert setting for the SVC and Storwize V7000 subsystems.
The CPU Utilization Threshold takes each SVC node into account. If the CPU utilization of a node is higher than 50% (Warning) or 70% (Critical), an alert is generated, which triggers an email notification to the defined recipients. We recommend that you use reporting for performance analysis and capacity planning, and performance alerts for identifying performance problems within your environment. In this case, we recommend that you use high alert thresholds; otherwise, you might be overwhelmed with alerts. A good strategy is to start with high performance alerting thresholds, solve performance bottlenecks, and lower the thresholds over time to an accurate value. Keep in mind that the storage subsystem-attached hosts and the associated applications determine what constitutes valid subsystem performance.
Tivoli Storage Productivity Center for Fabric threshold: Port send bandwidth percentage threshold
To create a switch alert, you can choose Fabric Manager → Alerting → Switch Alerts. In Figure 3-67, we show the alert setting for the SAN ports in our fabric.
If the send bandwidth utilization of a SAN port exceeds 85% (independent of the port speed of 2, 4, or 8 Gbps), an alert is triggered. Further alerts are suppressed for one hour.
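The one-hour suppression behavior can be sketched as follows. This is not the product's implementation, and the port names are invented; it simply shows the idea of dropping repeat alerts inside the suppression window.

```python
from datetime import datetime, timedelta

SUPPRESSION = timedelta(hours=1)
last_alert = {}  # port name -> time of the last alert that fired

def should_alert(port, now):
    """Fire an alert unless one fired for this port within the hour."""
    previous = last_alert.get(port)
    if previous is not None and now - previous < SUPPRESSION:
        return False  # suppressed
    last_alert[port] = now
    return True

t0 = datetime(2011, 6, 1, 12, 0)
print(should_alert("port7", t0))                          # True
print(should_alert("port7", t0 + timedelta(minutes=30)))  # False
print(should_alert("port7", t0 + timedelta(minutes=61)))  # True
```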
By clicking Generate Report, we can see the metrics in a tabular view. From here, we can select some of the subsystems and generate a chart using all or some of the allowed metrics. In this case, we choose all storage subsystems and look at the Total I/O Rate (Figure 3-69).
The result is the chart shown in Figure 3-70. With this chart, we can see what happened to each storage subsystem in the last three days.
Frequently compare this short-term data with long-term data. Take data hourly or even daily, and adjust the timeline to a range of several days to months. The following examples of storage subsystem performance consumption show daily averages over several months (Figure 3-71).
Figure 3-71 Slowly increasing storage performance consumption (Total I/O Rate)
Figure 3-72 shows a slow reduction in performance consumption over the time period.
Figure 3-72 Slowly decreasing storage performance consumption (Total I/O Rate)
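Rolling short-term samples up into the daily averages used for this kind of long-term trending can be sketched as follows; the timestamps and I/O rates are made-up sample data.

```python
from collections import defaultdict
from datetime import datetime

# Invented (timestamp, Total I/O Rate) samples for two days.
samples = [
    ("2011-06-01 00:00", 1200.0), ("2011-06-01 12:00", 1800.0),
    ("2011-06-02 00:00", 1500.0), ("2011-06-02 12:00", 2100.0),
]

def daily_averages(samples):
    """Group samples by calendar day and average each day."""
    per_day = defaultdict(list)
    for stamp, rate in samples:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        per_day[day].append(rate)
    return {str(day): sum(v) / len(v) for day, v in sorted(per_day.items())}

print(daily_averages(samples))  # {'2011-06-01': 1500.0, '2011-06-02': 1800.0}
```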
From the Selection view, we now save the short-term report (see Figure 3-73).
The saved report is now accessible from the Tivoli Storage Productivity Center tree (IBM Tivoli Storage Productivity Center → Reporting → My Reports → administrator's Reports), as shown in Figure 3-74.
The subtree shows the reports defined by the logged-on user, in this case administrator (the same user that created the report).
In the Selection section, we choose the metrics of interest (in this case, Total I/O Rate), choose a time range of the last 7 days, and, if required, select a filtering option to limit the data to certain subsystems, as shown in Figure 3-76 and Figure 3-77.
In the Options section, we define where the batch report is to be run, its format (PDF in our case), and the metrics it contains, as shown in Figure 3-78.
Finally, in the When to Run section, we choose to run the batch report repeatedly, each Monday morning at 6 AM, as shown in Figure 3-79.
The result is a PDF file containing a report for all DS8000 storage subsystems. Each storage subsystem has its own history chart (see Figure 3-80).
This report can now be used as an entry point to spot any obvious abnormality in the storage environment. For detailed investigations, you still need to go into Tivoli Storage Productivity Center. In Figure 3-81, we show the subsystem performance reports implemented in our environment. As you can see, there is one performance collection job for each subsystem: to avoid CIM Agent overload, for a clear view, and for best alerting.
TIP: Note that all the jobs are defined with a duration of 24 hours (see Figure 3-81). This ensures that any communication error with the CIM Agent or NAPI is detected at the start of the subsequent collection.
The log files of a performance data collection job give you a lot of information about the job itself. To access the log file details, select a job instance, right-click, and choose Job History. This jumps to the Job Management panel. Figure 3-82 shows an example of a job log file, with the relevant information highlighted:
1. Serial number of the device
2. Defined thresholds for this performance job (-1 indicates that the value is disabled)
3. Assets of the device
4. Performance data collections
In Figure 3-83, we show an example of several alerts that frequently occur in our environment. These alerts are all Total I/O Rate threshold violations.
In Figure 3-84, you can see the detailed message for the alert on the Storwize V7000.
As you can see, the threshold value of 20% CPU utilization was exceeded by the node. This value is the defined Critical Stress CPU Utilization Threshold, as shown in the highlighted line in Figure 3-85.
This value is definitely too small for the Storwize V7000 (see Figure 3-86); we set it this low only to demonstrate the alerting. In Tivoli Storage Productivity Center for Disk threshold: CPU utilization threshold on page 129, we defined the Critical Stress and Warning Stress values for the SVC and Storwize V7000.
After clicking Generate Report (see Figure 3-88), we can see that the Disk Utilization Percentage Threshold was exceeded for the DS8000 subsystem several times.
The Disk Utilization Percentage is the approximate utilization percentage of a rank over a specified time interval (the average percentage of time that the disks associated with the array were busy). In our example, we set the Critical and Warning Stress thresholds to 80 and 50 percent, respectively (for details on the alert definition, see 3.4.4, Defining the alerts on page 80), and forced these thresholds to be exceeded on the DS8000 by simulating heavy traffic using Iometer. By clicking the lens icon, we get access to the constraint violation details for the metrics of interest, as shown in Figure 3-89.
The components affected by the constraint violations are on the DS8000: mostly array A10 and sometimes A9.
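The utilization definition above (the percentage of an interval during which the array's disks were busy) reduces to a simple ratio. A minimal sketch, with illustrative busy-time figures:

```python
# Utilization = busy time / interval length, as a percentage.
# The 300-second interval matches the 5-minute sample interval
# used in this chapter; the busy-time values are invented.
def disk_utilization_pct(busy_seconds, interval_seconds=300):
    return 100.0 * busy_seconds / interval_seconds

print(disk_utilization_pct(255))  # 85.0, above an 80% Critical Stress boundary
print(disk_utilization_pct(150))  # 50.0, right at a 50% Warning Stress boundary
```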
To go deeper in the analysis, you can click the lens icon of a violation entry; for example, entry number 3 from Jun 6, 2011 12:01:42 AM. Tivoli Storage Productivity Center automatically provides a new panel that includes the volumes located on that array. Click Generate Report to create a report (Figure 3-90).
The new report shows all volumes on that array, the last samples, and which hosts are affected (see Figure 3-91).
To find out which volume utilizes the array the most, we create a history graph with the I/O rate metrics over all volumes (Figure 3-92).
Figure 3-92 Constraint Violation: Total I/O Rate of the affected volumes
Figure 3-92 shows three volumes producing traffic to this array. One solution can be to move one of these volumes to another, less utilized array.
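Picking the migration candidate amounts to ranking the volumes by their average Total I/O Rate over the samples. A sketch with invented volume names and rates:

```python
# Invented per-volume Total I/O Rate samples (ops/s) for the
# three volumes on the hot array.
volume_io_rate = {
    "vol_db01": [950.0, 1020.0, 980.0],
    "vol_app02": [210.0, 190.0, 205.0],
    "vol_log03": [480.0, 510.0, 495.0],
}

# Average each volume's samples and pick the busiest one.
averages = {vol: sum(r) / len(r) for vol, r in volume_io_rate.items()}
busiest = max(averages, key=averages.get)
print(busiest)  # candidate to move to a less utilized array
```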
At this stage, you can further analyze the relationships to the Storwize V7000. There you see that mdisk0 and mdisk1 are in the Managed Disk Group (mdiskgroup) IBM Cognos. Furthermore, you can see which volumes are placed in this Managed Disk Group (see Figure 3-94). Now you can start analyzing which of those volumes produce the most load on this Managed Disk Group, called Cognos, on the Storwize V7000.
Important: To see the data path from a server to the end device, a Data agent or Storage Resource Agent must be installed on the server itself.
Viewing the data paths in a single view allows you to monitor performance status and pinpoint weaknesses without navigating the many entities typically seen in the Topology Viewer. You can do the following tasks, for example:
- Identify critical paths and potential performance bottlenecks
- Identify unexpectedly convoluted paths through the SAN
- Verify logical and efficient paths
Reminder: All topology views with performance values present only the performance data from the last performance monitor collection.
During the time interval, a configuration change occurred on the Storwize V7000 subsystem, as you can see both from the surrounding icon in the graphical view and the highlighted line in the tabular view in Figure 3-96. For detailed descriptions of the icons and colors of the overlay, see Table 3-6, icons and colors of the change overlay on page 122. In Figure 3-97, we go into further detail, also accessing the level 2 view for the Storwize V7000 subsystem (double-click the subsystem icon).
As you can see from both the graphical and tabular views, one change occurred in the Managed Disk Group Cognos: the volume test was created.
Chapter 4.
Performance analysis is usually triggered by one of two events:
- Response time problems: Users are complaining, or you want to tune the performance.
- System indicators: Ongoing monitoring shows signs of a problem.
With that said, in the following sections we describe some steps and considerations that we suggest as an approach to performance problem resolution.
[Figure: High-level storage subsystem architecture, showing the SAN and SVC layer, the subsystem host adapters, two N-way SMP processor complexes with cache and NVS, and the RAID adapters in front of the disk arrays]
Many performance problems are related to the performance of an array, because that is often the most restricting factor in a subsystem. Based on this fact, the arrays are a good point to start your investigation if you cannot determine specific volumes causing a problem or if the problem that you face is not related to more than a few volumes.
Generally, the techniques for understanding and remediating a performance problem apply to all types of enterprise storage at any level (see Figure 4-2). Hardware resources include these:
- Host Adapter (HA) ports
- Interconnect (PCI-e busses, RIO loops)
- Cache and NVS
- DA (RAID) adapters
- RAID ranks (the disk count)
SVC or other storage virtualization layers introduce additional layers in the data path:
- Front-end I/O is timed and counted at the fiber interfaces to the SVC, representing I/O to and from the servers.
- Back-end I/O is timed and counted again at the fiber adapters, representing I/O between the SVC and the backing storage.
Tivoli Storage Productivity Center can provide monitoring and configuration support for a variety of storage subsystems. To analyze storage subsystem performance, it is important to understand the storage environment, including the storage subsystem configuration and the storage assignments to the servers. This can be achieved using the Tivoli Storage Productivity Center Topology Viewer. SAN Planner assists the user in end-to-end planning involving fabrics, hosts, storage controllers, storage pools, volumes, paths, ports, zones, zone sets, storage resource groups (SRGs), and replication. Moreover, Tivoli Storage Productivity Center provides various reports to summarize and understand the current storage subsystem configuration and allocation.
Tivoli Storage Productivity Center can provide the information to understand your SAN and storage environment, both assets and relationships, through the following features:
- Topology Viewer: This is designed to provide an extended graphical topology view and the relationships among resources:
  - Synchronized graphical and tabular views that allow users to manipulate views by enlarging, reducing, or closing one of the views
  - A locate function to search and find entities, synchronized with the tabular view
  - Overlays that allow you to turn on or off aggregated status (for example, health and performance) and membership (for example, zone and zone set) information
  See 3.7.7, Using the Topology Viewer on page 147 for an example of the use of the Topology Viewer.
- Predefined and user-defined configuration data: This is useful for viewing information about your system in addition to the graphical depiction. You can use Tivoli Storage Productivity Center for Disk and Tivoli Storage Productivity Center for Data to get information such as the array where a specific volume is located, the other volumes on that array, the number of disk drives on that array, and the RAID level. You need at least the server name to get this information:
  - Disk asset and configuration data, accessible by expanding Disk Manager → Storage Subsystems, as shown in Figure 4-3
  - Fabric configuration (and zone configuration), accessible by expanding Fabric Manager → Fabrics, as shown in Figure 4-4
The second report, Volume HBA Assignment: Not Visible to Monitored Computer (Figure 4-6), provides the details for volumes allocated to a Tivoli Storage Productivity Center monitored server where the server has not yet allocated the volume to a file system. This is a great report for identifying orphaned storage, independent of the Tivoli Storage Productivity Center edition that you are licensed for, from Tivoli Storage Productivity Center Basic through Tivoli Storage Productivity Center Standard Edition.
Figure 4-6 Volume HBA Assignment: Not Visible to Monitored Computer Report
The third report, Volume HBA Assignment: By Storage Subsystem (Figure 4-7), reviews the volumes allocated to Tivoli Storage Productivity Center monitored servers with a Tivoli Storage Productivity Center agent installed.
The report in Figure 4-7 includes the following columns:
a. Storage Subsystem
b. Volume Name
c. Volume World Wide Name (WWN)
d. HBA Port WWN
e. SMIS Host Alias
f. Volume Space
g. Computer
h. Network Address
i. OS Type
j. Disk Path
k. Disk Space
l. Available Disk Space
m. WWN Match: This column indicates Yes if a Data agent is installed on the host machine and was able to match the HBA Port WWN that was returned by a storage subsystem probe job. This column displays a value for Windows, Solaris, and HP-UX machines if the HBA API client is installed on the host agent.
n. Volume Format: This column indicates the format of the storage volume. The valid values are: Unknown, Fixed Block, Block 512, Block 520 Protected, Block 520 Unprotected, 3380, 3390, 3390 Extended, and Count Key Data. The mainframe volumes are identified as 3380, 3390, 3390 Extended, or Count Key Data.1
o. Manufacturer
p. Model
q. Probe Last Run date
r. Volume Real Space: This column reflects the physical allocated space of the volume. For normal volumes, this is equal to the volume space. For space-efficient (thin-provisioned) volumes, this is equal to the real space allocated when data is written to the volume.1
This column explanation detail was captured from the Tivoli Storage Productivity Center Volume to HBA report F1 help screen.
The fourth report, Volume HBA Assignment: By Volume Space (Figure 4-8), is nearly identical to the third report. The only variation is the column on which the data is sorted: in this case, the volume space instead of the Storage Subsystem column.
Important: Depending on the implementation of the LUN masking of your storage subsystem and the way that the information is passed through the CIM or NAPI interface to Tivoli Storage Productivity Center, the SMI-S Host Alias might have a suffix that identifies the individual HBAs. Because the report can be very long, use the filter function and specify the server name (SMI-S Host Alias) so that the server is easy to find.
Tip: If you do not know the full name to search for, type the beginning of the name followed by an asterisk, together with the LIKE operator. This is especially useful if the host name includes a suffix for HBA identification.
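The prefix-plus-asterisk filtering that the tip describes behaves like shell-style pattern matching. As a sketch (with invented alias values), Python's fnmatch reproduces the effect of a LIKE filter with a trailing asterisk:

```python
from fnmatch import fnmatch

# Invented SMI-S host aliases; the suffixes identify individual HBAs.
aliases = ["tpcblade3-11_hba0", "tpcblade3-11_hba1", "tpcblade4-02_hba0"]

# The server-name prefix plus a trailing asterisk, as in the tip.
pattern = "tpcblade3-11*"

matches = [a for a in aliases if fnmatch(a, pattern)]
print(matches)  # both HBA aliases for the server, the other host excluded
```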
Figure 4-9 shows an example of the report that you get using the filter criteria available as an option when building a custom report. We chose the volumes for the tpcblade3-11.storage.tuscon.ibm.com server.
Figure 4-9 Volume HBA Assignment: By Storage Subsystem Custom Filter Report
With this report, you can easily identify all the subsystems and volumes that are assigned to the server you are investigating, regardless of whether the server has storage assigned from multiple subsystems. If the server has either the Tivoli Storage Productivity Center Storage Resource Agent (SRA) or Data agent installed, the disk path is also available if you scroll to the right (Figure 4-10). This might help you to discover the disk that is causing the problem, as you can now correlate the details provided by the system administrator with the block storage details seen at the storage subsystem.
Figure 4-10 Volume HBA Assignment: By Storage Subsystem with SRA Agent Disk Path Shown Report
Now that you have identified your subsystem and the volumes, you can use the asset reports of the Data Manager to learn more about the configuration of your subsystem.
Figure 4-11 Tivoli Storage Productivity Center Data Manager Navigation Tree
Depending on the type of the subsystem, you see the components that are available. Most of the time, the tree view is helpful for understanding the organization of the storage device quickly, but sometimes, there is also additional information available that you cannot find anywhere else in the product, such as the RAID level. The asset reports are useful, because you can look at the data from different angles and drill up and drill down where needed.
Array sites: Array sites are usually groups of eight single disk drive modules (DDMs). When you create an array in a DS8000, you assign a RAID level to that group of disks. Within Tivoli Storage Productivity Center, the terms array and array site are used interchangeably, even though there is a difference (see Figure 4-12).
Figure 4-12 Tivoli Storage Productivity Center Asset Report: DS8000 Array Site Detail
Important information here includes the number of disks, the RAID level, and the Device Adapter (DA) to which the array or array site is connected.

Ranks: Tivoli Storage Productivity Center actually gathers statistics at a rank level from the DS8000, DS6000, and ESS subsystems (see Figure 4-13). These values are directly available in TPCTOOL and are converted for the array reports in the Tivoli Storage Productivity Center GUI. For details on TPCTOOL, see Appendix C, Reporting with Tivoli Storage Productivity Center on page 365.
Figure 4-13 Tivoli Storage Productivity Center Asset Report: DS8000 Rank Detail
You can also see in this panel whether a rank is formatted for count key data (CKD) or fixed block (FB) data.
Storage pools: On the DS8000, these are called extent pools (see Figure 4-14).
Figure 4-14 Tivoli Storage Productivity Center Asset Report: DS8000 Storage Pool (Extent Pool) Details
As shown in Figure 4-14, Tivoli Storage Productivity Center 4.2.1 now exposes additional details in the report shown for the DS8000 storage pools. Here we can see, as simple examples, whether EZ-Tier is in use and whether Solid State Disk (SSD) drives are in use. Because thin-provisioned (Space Efficient) volumes are now an available feature in the DS8000 and several other storage servers, Tivoli Storage Productivity Center was enhanced to expose this feature through a simple check of the Is Space Efficient field. If it is Yes, then reviewing the Configured Real Space against the Available Space can show you what is really available in this Space Efficient storage pool.

Tip: With the introduction of the DS8800, and with Tivoli Storage Productivity Center 4.2.1 having the ability to expose multiple ranks in a single extent pool in the DS8000, the best practice for DS8000 volumes is now to use volume striping across multirank extent pools, both for servers directly attached to the DS8000 and for the SVC. While multirank extent pool volume striping is now a best practice, it does add complexity to the storage solution from a performance troubleshooting perspective. When you have multiple volume striping methods in the storage data path for a server, you must be able to review the per-rank or per-array details. Tivoli Storage Productivity Center is able to provide that detail, so the use of multirank extent pools is less of a challenge.

WARNING: Use caution when implementing a storage solution with multiple disk striping methods without a tool such as Tivoli Storage Productivity Center in the solution. Otherwise, the solution will have black boxes: areas of the solution without any details on performance. This can prolong, or even prevent, performance problem determination.
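The Space Efficient pool check described above amounts to simple arithmetic. Here is a minimal sketch with hypothetical field names mirroring the Is Space Efficient, Configured Real Space, and Available Space columns (the figures are invented, not from any real pool):

```python
# A minimal sketch of the space-efficient pool check described above.
# Field names and values are hypothetical, chosen to mirror the GUI
# columns "Is Space Efficient", "Configured Real Space", and
# "Available Space" in the storage pool report.
def pool_headroom(pool):
    """Return how much real capacity is still available in the pool."""
    if not pool["is_space_efficient"]:
        return pool["available_space_gb"]
    # For a space-efficient pool, what is really left is the physical
    # capacity minus the real space already allocated to written data.
    return pool["capacity_gb"] - pool["configured_real_space_gb"]

pool = {
    "is_space_efficient": True,
    "capacity_gb": 10_000,
    "configured_real_space_gb": 7_500,
    "available_space_gb": 4_000,   # virtual (requested) space still free
}
print(pool_headroom(pool))  # -> 2500
```

Note how the virtual Available Space (4,000 GB here) can exceed the real headroom (2,500 GB), which is exactly why the report check matters.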
For more information about DS8000 performance, see 4.9 Plan Extent Pools in IBM TotalStorage DS8000 Series: Performance Monitoring and Tuning, SG24-7146, at this website: http://www.redbooks.ibm.com/abstracts/sg247146.html?Open
SAN Storage Performance Management Using Tivoli Storage Productivity Center
Disks: There is little information that you can get from the disk panel (see Figure 4-15). With additional information from the disk vendor, you can determine if the disk is a Fibre Channel (FC) or Serial Advanced Technology Attachment (SATA) disk, and perhaps the revolutions per minute (RPMs).
Figure 4-15 Tivoli Storage Productivity Center Asset Report: DS8000 Disk Drive Detail
Tip: As seen in Figure 4-15, with Tivoli Storage Productivity Center 4.2.1, additional disk drive data fields were added. This is intended to support identifying if a Disk Drive Module (DDM) in the DS8000 is a Solid State Disk (SSD), is an encryptable candidate drive, or is encrypted.
Volumes: If you select a volume, you can see the number of disks across which the volume is spread, as well as the RAID level of the volume. One drawback is that you cannot see the array in which the volume was created. The array tree does show the volumes, so you would have to go through all the arrays to look for a particular volume (for a CKD volume example, see Figure 4-17; for an FB volume example, see Figure 4-16). However, during performance problem determination, you do not need to go through all the arrays, because you can drill down in the performance reports in the GUI.
Figure 4-16 Tivoli Storage Productivity Center Asset Report: DS8000 Fixed Block Volume
Figure 4-17 Tivoli Storage Productivity Center Asset Report: DS8000 Mainframe CKD Volume
Figure 4-18 Tivoli Storage Productivity Center Asset Report: Manage Disk Group (SVC Storage Pool) Detail
Managed Disks: Figure 4-19 shows the Managed Disks for the selected SVC. Little additional information that you need for performance problem determination is provided here. However, the report was enhanced in 4.2.1 to reflect whether an MDisk is a Solid State Disk (SSD). This is key because you must manually mark an MDisk in the SVC as an SSD candidate for EZ-Tier.
Figure 4-19 Tivoli Storage Productivity Center Asset Report: Managed Disk Detail
Virtual Disks (also called volumes): Figure 4-20 shows virtual disks for the selected SVC, or in this case, a virtual disk or volume from a Storwize V7000.

Tip: Virtual disks for either the Storwize V7000 or the SVC are presented identically within Tivoli Storage Productivity Center in this report, so only Storwize V7000 screens were selected; they also reflect the SVC version 6.1 impact with Tivoli Storage Productivity Center 4.2.1.
Figure 4-20 Tivoli Storage Productivity Center Asset Report: Virtual Disk Detail
The virtual disks are referred to as volumes in other performance reports. For the volumes, you see the managed disk (MDisk) on which the virtual disks are allocated, but you do not see the correct RAID level. From an SVC perspective, you often stripe the data across the MDisks, so Tivoli Storage Productivity Center displays RAID 0 as the RAID level. As with many other reports, this report was also enhanced to report on EZ-Tier and Space Efficient usage; these key value-add features were added to the SVC after the first release of this book. In this example screen capture, you see that EZ-Tier is enabled for this volume, yet it is inactive. In addition, this report was enhanced to show the quantity of storage for this volume in the EZ-Tier (SSD) tier compared to the HDD tier.

Tip: IBM EZ-Tier is a function that automatically removes hot spots for volumes through the migration of sub-volume extents from volumes built on HDD to SSD. This migration removes hot spots and can drastically increase application performance. While this automatic function is enabled, only sub-volume extents actually migrated from HDD to SSD disks show activity.

There is another report that can help you see the actual configuration of the volume. This report includes the MDG or Storage Pool, Back-End Controller, MDisks, and much more detail; however, this information is not available in the asset reports on the MDisks.
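The extent migration the Tip describes can be pictured as a simple heat-based promotion. The sketch below is only an illustration of the idea, not IBM's actual EZ-Tier algorithm; the extent records, I/O rates, and capacity figures are invented:

```python
def plan_migrations(extents, ssd_free_extents):
    """Pick the hottest HDD extents to promote to SSD, highest I/O first."""
    hdd = [e for e in extents if e["tier"] == "HDD"]
    hottest = sorted(hdd, key=lambda e: e["io_rate"], reverse=True)
    return [e["id"] for e in hottest[:ssd_free_extents]]

# Hypothetical per-extent I/O rates gathered over a monitoring window.
extents = [
    {"id": 0, "tier": "HDD", "io_rate": 850},
    {"id": 1, "tier": "HDD", "io_rate": 12},
    {"id": 2, "tier": "SSD", "io_rate": 900},   # already promoted
    {"id": 3, "tier": "HDD", "io_rate": 430},
]
print(plan_migrations(extents, ssd_free_extents=2))  # -> [0, 3]
```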
Volume to Back-End Volume Assignment: Figure 4-21 shows the location of the Volume to Back-End Volume Assignment report within the Navigation Tree.
Figure 4-22 shows the report, in which the virtual disks are referred to as volumes.
Figure 4-22 Tivoli Storage Productivity Center Asset Report: Volume to Back-End Volume Assignment
This report provides many details about the volume. While specifics of the RAID configuration of the actual MDisks are not presented, the report is quite useful in that all aspects from the host perspective to the back-end are in one report. The following details are available and quite useful:
- Storage Subsystem containing the disk in view; for this report, this is the SVC.
- Storage Subsystem type; for this report, this is SVC.
- User-Defined Volume Name.
- Volume Name.
- Volume Space: the total usable capacity of the volume.
  Tip: For space-efficient volumes, this value is the amount of storage space requested for these volumes, not the actual allocated amount. This can result in discrepancies in the overall storage space reported for a storage subsystem using space-efficient volumes. This also applies to other space calculations, such as the calculations for the Storage Subsystem's Consumable Volume Space and FlashCopy Target Volume Space.
- Storage Pool associated with this volume.
- Disk: the MDisk the volume is placed upon.
  Tip: For SVC or Storwize V7000 volumes spanning multiple MDisks, this report has multiple entries for that volume to reflect the actual MDisks the volume is using.
- Disk Space: the total disk space available on the MDisk.
- Available Disk Space: the remaining space available on the MDisk.
- Back-End Storage Subsystem: the name of the storage subsystem this MDisk is from.
- Back-End Storage Subsystem type: the type of storage subsystem this is.
- Back-End Volume Name: the volume name for this MDisk as known by the back-end storage subsystem (a big time saver).
- Back-End Volume Space.
- Copy ID.
- Copy Type: the type of copy this volume is being used for, such as Primary or Copy, for SVC versions 4.3 and newer. Primary is the source volume, and Copy is the target volume.
- Back-End Volume Real Space: for fully provisioned back-end volumes, this is the actual space; for Space Efficient back-end volumes, this is the real capacity allocated.
- Easy Tier: indicates whether EZ-Tier is enabled on the volume.
- Easy Tier status: active or inactive.
- Tiers.
- Tier Capacity.
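Because a volume that spans several MDisks appears once per MDisk in this report, a small script can collapse the rows to show each volume's MDisk footprint. A sketch with invented row data (the column names are hypothetical stand-ins for the report columns):

```python
from collections import defaultdict

# Hypothetical rows from the Volume to Back-End Volume Assignment report:
# a volume spanning several MDisks appears once per MDisk, as noted above.
rows = [
    {"volume": "vdisk7", "mdisk": "mdisk0", "backend_volume_space_gb": 128},
    {"volume": "vdisk7", "mdisk": "mdisk1", "backend_volume_space_gb": 128},
    {"volume": "vdisk9", "mdisk": "mdisk2", "backend_volume_space_gb": 256},
]

# Collapse the per-MDisk rows to see each volume's MDisk footprint.
footprint = defaultdict(list)
for row in rows:
    footprint[row["volume"]].append(row["mdisk"])

print(dict(footprint))
# -> {'vdisk7': ['mdisk0', 'mdisk1'], 'vdisk9': ['mdisk2']}
```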
The most important information in Figure 4-23 is the RAID level as provided under the Type field.
There are no reports about the storage pools in the performance reports. A DS5000 storage pool can be compared with the extent pools in a DS8000 or a DS6000. For the Storage Pools reports, you can drill down to the disks and LUNs in the storage pools, similarly to the reports on the other storage devices.

Disks: The Disks panel gives you information about the DDMs. Figure 4-24 shows the information that you can get for each DDM. However, you do not see the position of the disks in the enclosures and loops of the DS5000.
The best way to get information about the disks is to look at the LUNs report and then drill down (see lower right part of Figure 4-25).
LUNs: On the LUN report, you can also find the RAID level of a LUN, which is always the same as the RAID level of its array. The nice thing about this report is that you can actually see in which enclosure (tray) and slot the disks are located. You do not see the DDM's worldwide name (WWN), so it is hard to correlate this information with the information in the Disks report in Figure 4-24, but for performance problem determination, this is not that important.
You do not have to remember all the information that is in the reports. We just wanted to step you through the necessary information about your environment, and to show you the volume and server that currently face a performance problem.
Figure 4-26 Tivoli Storage Productivity Center Asset Report: XIV Detail for Storage Subsystem
Storage Pools: The XIV storage pools shown in Figure 4-27 are logical pools of combined capacity; a storage pool has nothing to do with the physical cabling or the location of disks within the enclosures.
Figure 4-27 Tivoli Storage Productivity Center Asset Report: XIV Storage Pools
Because EZ-Tier is not available on XIV, the fields reflect this through the N/A state.
Chapter 4. Using Tivoli Storage Productivity Center for problem determination
Disks: Each of the disks in the XIV storage system is shown, as displayed in Figure 4-28.
Figure 4-28 Tivoli Storage Productivity Center Asset Report: XIV Disk Details
Volumes: On the volume report (see Figure 4-29), you can also find the RAID level of a volume. While Tivoli Storage Productivity Center reports this as RAID 10, in reality XIV volumes do not use traditional RAID techniques at all. RAID 10 reflects the striping across numerous disks and the fact that the volume extents are mirrored.
Figure 4-29 Tivoli Storage Productivity Center Asset Report: XIV Volume Details
If a problem can be seen by an end user, ask yourself whether the following conditions exist:
- The problem is due to a perception that things are slower than yesterday.
- The performance result is still within the business-defined boundaries for the solution.
If so, the solution is working as designed, and resolving the complaint might require a solution redesign.
can be perceived as a problem that you cannot resolve unless a solution redesign is performed.
There are often scheduling issues that result in higher I/O and more problems at night than during the day, because batch, backup, and other applications all run at the same time.

Overloaded RAID ranks: There is too much I/O on too few disks. Rank skew occurs when there is too much I/O directed at too few arrays. The problem grows as physical disks get bigger and bigger, because users buy fewer spindles. That is usually the first place where we start seeing performance problems.
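Rank skew can be quantified with a simple ratio of the busiest rank's utilization to the average across ranks. This metric is a common convention for illustration here, not a Tivoli Storage Productivity Center report column:

```python
def rank_skew(utilizations):
    """Ratio of the busiest rank to the average rank: 1.0 means evenly
    spread I/O; larger values mean the load is skewed onto few ranks."""
    mean = sum(utilizations) / len(utilizations)
    return max(utilizations) / mean

# Invented percent-busy figures per rank, from a hypothetical interval.
balanced = [40, 42, 38, 40]
skewed = [85, 10, 12, 13]
print(round(rank_skew(balanced), 2))  # -> 1.05
print(round(rank_skew(skewed), 2))    # -> 2.83
```

A skew near 1.0 says the ranks share the load; a skew near the rank count says one rank is doing nearly all the work.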
Overloaded ports: This occurs rarely, but it can happen. The ports are driven too hard during a heavy workload.

Poor read hit ratio: Poor read hit ratios are common in an online transaction environment. There is not much you can do about it; read hit ratios are a characteristic of the applications, not the storage.
A hot array means a single array that experiences a disproportionate amount of I/O, resulting in high disk utilizations and long response times. This is bad because it results in unnecessarily poor performance; other arrays could share the load.
XIV storage
A unique storage server available today is the IBM XIV Storage System. This device, utilizing either 1 TB or 2 TB SATA disks in a configuration of 72 to 180 disk drives, uses grid technology and storage software to provide highly available, performance-rich storage to application servers. This class of storage server is self-tuning and provides high availability without typical RAID hardware or software techniques. Using a patented redundancy technique at a 1 MB level, the XIV storage system is able to withstand single or dual drive failures in a single module, or even the loss of a full drive module, without suffering a volume outage or the performance penalty normally incurred during RAID rebuilds for drive replacements. The XIV storage system can provide remarkable performance and availability for applications and enterprises where deep performance skills are not available, or where reduced staffing requires vendor technology to deliver performance and availability without the normal overhead of tuning or performance management.
Automatic tiering
Automatic tiering is the ability to move volumes, or subvolumes, dynamically between tiers of storage without causing an outage or application impact. In IBM technologies, this feature is known as EZ-Tier. This technology has currently been introduced in the IBM System Storage DS8800 and the SVC. Although both storage devices support EZ-Tier, the specifics are slightly different. In the SVC implementation, SSD disks can be utilized from within the SVC nodes, or from managed disks (MDisks) or LUNs from back-end storage servers that support SSD disks, such as the DS8800. With the SVC, after the SSD disks are identified and LUNs or MDisks are created, these MDisks can be added to storage pools managed by the SVC. Currently, the SVC supports a two-tier automatic tiering implementation. This means that through the SVC's EZ-Tier feature, after virtual volumes are enabled, the SVC manages, at an extent level, the migration of hot sub-volume extents from the HDD MDisks to the SSD MDisks within the same storage pool. This feature can dramatically enhance the performance of the virtual volumes supported by applications using the SVC. For further details on either the SVC and Storwize V7000 or DS8800 implementations of EZ-Tier, see one of the following topics. If storage from another vendor is being managed by Tivoli Storage Productivity Center, see that vendor's documentation for specifics. While several vendors have announced facilities similar to EZ-Tier, they each have their own unique characteristics.
4. If the I/O is a read I/O:
a. The SVC needs to check the cache to see if the read I/O is already there.
b. If the I/O is not in the cache, the SVC needs to read the data from the physical LUNs.
5. At some point, write I/Os are sent to the storage controller.
6. The SVC might also do some read-ahead I/Os to load the cache in case the next read I/O can be served from cache.

SVC striping for performance
c. If the I/O is part of a Metro or Global Mirror relationship, a copy needs to go to the target volume of the relationship.
d. If the I/O is part of a FlashCopy and the FlashCopy block has not been copied to the target volume, this action needs to be scheduled.
4. If the I/O is a read I/O:
a. The Storwize V7000 needs to check the cache to see if the read I/O is already there.
b. If the I/O is not in the cache, the Storwize V7000 needs to read the data from the physical MDisks.
5. At some point, write I/Os are destaged to Storwize V7000 managed MDisks or sent to the back-end SAN-attached storage controllers.
6. The Storwize V7000 might also do data-optimized, sequential-detect prefetch cache I/Os to pre-load the cache when its cache algorithms determine that the next read I/O will benefit from this approach over the more common Least Recently Used (LRU) method used for non-sequential I/O.
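The read path in steps 4 through 6 can be sketched as a toy cache model. This is only an illustration of the cache-check, back-end-read, and sequential-prefetch ideas, not the actual Storwize V7000 cache implementation:

```python
# A simplified model of the read path in steps 4-6 above: check the
# cache first, go to the back-end MDisks on a miss, and prefetch the
# next block when sequential access is detected.
class ReadCache:
    def __init__(self):
        self.cache = set()
        self.last_block = None
        self.backend_reads = 0

    def read(self, block):
        if block not in self.cache:          # step 4b: cache miss
            self.backend_reads += 1          # read from physical MDisks
            self.cache.add(block)
        if self.last_block is not None and block == self.last_block + 1:
            self.cache.add(block + 1)        # step 6: sequential prefetch
        self.last_block = block
        return block

c = ReadCache()
for b in [10, 11, 12, 13]:                   # a sequential workload
    c.read(b)
print(c.backend_reads)  # -> 2
```

Only the first two reads go to the back end; once the sequential pattern is detected, the prefetched blocks are already in the cache, which is the benefit step 6 describes.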
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
Figure 5-1 illustrates the Top 10 reports for SVC, Storwize V7000, and Disk subsystems.
Figure 5-1 Top 10 reports for SVC, Storwize V7000, and Disk

[The figure lists, per device type, the Top 10 report categories: Controller performance reports (*) covering Data Rates and balanced I/O rates; Managed Disk Group and Managed Disk reports covering Response Time and Backend Data Rates; Top Volumes reports covering Cache, Data rate, Disk, I/O rate, and Response performance; Ports reports (**); and Array performance reports (*) covering Disk Utilization, Total I/O rate, Backend I/O rate, Backend Data rate, Backend Response Time, and Write cache delay. XIV availability legend: (*) not available on XIV; (**) available only for XIV 10.2.4 or later; (***) also available on XIV, at the Module level.]
To interpret and evaluate these reports, we recommend that you first refer to your created baseline, as documented in Creating a baseline with Tivoli Storage Productivity Center on page 68. During each report review, we consider some Rule of Thumb (ROT) impacts. Considerations about ROT values are detailed in Chapter 3, General performance management methodology on page 53, and a summary of them is included in Appendix A, Rules of Thumb and suggested thresholds on page 327.
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
Figure 5-2 provides a view of the Top 10 reports that we are reviewing and a numbered prioritization approach showing how to walk through the reports.
Important: For the IBM XIV Storage System (not for DSx000 storage subsystems), the additional Module Cache Performance Report is available. See 5.2.7, IBM XIV Module Cache Performance Report on page 228 for details.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, as shown in Figure 5-3.
When the report is presented, you see a tabular report on the screen. This includes many columns that are commonly of interest, including these:
- Time Interval
- Read, Write, and Total I/O Rate
- Read, Write, and Total Data Rate
- Read, Write, and Overall Response Times
The reports included in this section have a larger purpose than just being a group of reports for disk. These reports are a set of foundation reports that you can use to create custom saved reports. Many additional columns are available through the selection tab on each of these reports. After you have selected the columns and placed them in the order you want, you can save this new tabular report for future reuse, for example, as an SLA report.
Figure 5-4 shows the pop-up window and the three metrics selected for all I/O rates to be seen on a graph.
After you have selected your metrics, click Ok to generate the chart. We recommend that you create three separate reports to provide a total I/O view of your storage environment: one report based upon Read I/O, the next on Write I/O, and the third on Total I/O. This provides a quick review of your subsystems by type of I/O and a quick reference regarding how your I/O workloads are distributed within your storage environment. Finally, these reports can be used to identify your most I/O-bound subsystems. Figure 5-5 shows a Total I/O rate report as an example. This is the report from which we expect you to start your analysis. (As you can see, there is a straight line for the Storwize that joins the samples from May 23, 00:15 to May 23, 08:40. This means that a connection problem with the device occurred during that time frame, and Tivoli Storage Productivity Center did not get any performance values.)
In Figure 5-5, which shows the Total I/O Rate report in our lab test case, you can see that DS8000-1302541 is receiving the highest workload.
The foregoing three reports provide a foundational view from an I/O perspective, which addresses one of the aspects of problem determination discussed in Chapter 4, Using Tivoli Storage Productivity Center for problem determination on page 153. We introduced this idea in 4.2.2, Understanding your configuration on page 155. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Data rates
In the same way that you reviewed I/O rates previously, you can look at Data Rates for your subsystems, starting with Total Data Rates and from there going deeper to analyze read and write workload. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, click the icon, and select Total Data Rates. Then click Ok, as shown in Figure 5-8.
Click Ok to generate the chart. This report provides an overview of a Subsystems Total Data Rate throughput, as shown in Figure 5-9.
In Figure 5-9, confirming what was seen in the previous section (I/O rates (overall) on page 189), we see that the highest throughput again belongs to DS8000-1302541. As our example shows, the data rate reports are able to identify the most heavily utilized storage devices. Through this report, you can understand the total data bandwidth usage. Although data bandwidth is not a typical metric used with SLA reporting, being aware of the bandwidth used by a subsystem can help you review whether you have a bottleneck or a lack of bandwidth in your SAN environment, both of which are elements of capacity planning. We review capacity planning in Chapter 6, Using Tivoli Storage Productivity Center for capacity planning management on page 305.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Response Time
To further understand the performance of your subsystem, you can produce a report about Response Times as provided for the entire subsystem. This metric is very high level, but is typically included in SLA reports. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, click the icon, and select the Overall Response Time. Then click Ok, as shown in Figure 5-10.
Figure 5-11 on page 195 shows the following response times:
- Good response time at the subsystem level for the XIV and DS8000-1901 devices (less than or equal to 10 msec) during the entire time interval analyzed
- Bad response time for the DS8000-2451 device (more than 20 msec) between May 23 at 10:30 and May 23 at 12:40
- Very bad response time for SVC-svc1 and V7000-2076 (peaks of 50 msec) between May 23 at 10:30 and May 23 at 17:30
This chart reflects an average of all activities on each storage subsystem. You have to perform a deep dive into your subsystem to identify bottlenecks. These bottlenecks can be introduced through improper planning, configuration, or overuse of an array. Reviewing your response times from the top down allows you to identify these bottlenecks. This deeper dive into reports is reviewed later in this chapter (see Top 10 for Disk #2: Controller Performance reports on page 197).
Recommendations
In storage performance management, we often assume that 10 msec is fairly high for Tier 1 class storage, and most disk modeling tools assume this. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with a 20 msec response time or higher. The appropriate value might also change between shifts or on the weekend: a response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent.

The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current magnetic disk products. In crude terms, the service time of a magnetic disk is composed of seek time, latency, and data transfer time. Nominal seek times these days range from 4 to 8 msec, though in practice, many workloads do better than nominal; it is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect a 5 to 7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time, so a 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.

For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up.
A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of the disks, we might push the disks in the ranks to 70% utilization; this produces high rank response times, which are averaged with the cache hits to yield acceptable overall response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
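The averaging described above, together with the earlier queueing rule of thumb, can be verified with a few lines of arithmetic (the hit ratios and service times below are assumed figures, not measurements):

```python
# The subsystem's average response time is the cache-hit response time
# and the back-end rank response time weighted by the read hit ratio.
def avg_response(hit_ratio, cache_ms, rank_ms):
    return hit_ratio * cache_ms + (1 - hit_ratio) * rank_ms

# High hit ratio: busy ranks (25 msec) still average out acceptably.
print(round(avg_response(0.80, 1.0, 25.0), 1))  # -> 5.8
# Poor hit ratio: the ranks must respond quickly to keep the average low.
print(round(avg_response(0.30, 1.0, 25.0), 1))  # -> 17.8

# Under simple M/M/1 queueing assumptions, wait time = service * u/(1-u),
# so a disk at 50% utilization waits about one service time, giving the
# 10-14 msec response times mentioned above for a 5-7 msec service time.
def disk_response(service_ms, utilization):
    return service_ms + service_ms * utilization / (1 - utilization)

print(disk_response(6.0, 0.50))  # -> 12.0
```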
To simplify, we can assume that front-end response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec. There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric will drive response times much higher than 30 msec.

Important: All of the above considerations are not valid for SSD disks, where seek time and latency are not applicable. We can expect much better performance from these disks, and therefore very short response times (less than 4 msec), especially when the I/O workloads are random in nature and use small-block I/O transfers. These are the types of workloads for which SSD, and specifically EZ-Tier, can provide large value opportunities. See page 64 for further details on SSD performance.

For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Data rates
You can use this chart to understand how the throughput is divided between the controllers of your subsystems, to determine whether a subsystem is well balanced. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Controller Performance. Separate rows are generated for Server 1 and Server 2 of DS8000 subsystems, as shown in Figure 5-12.
In our example, we focus on the DS8000-1302541, examining the Server 1 and Server 2 performance separately. To do this, we can highlight the related Server line in the Controller tab, or choose the desired Server by clicking the Selection... button in the Selection tab of the Controller Performance window, as shown in Figure 5-13.
To review the throughput of the DS8000 controllers, click the icon and select Read Data Rate, Write Data Rate, and Total Data Rate while holding down the Shift key. A line graph is generated, as shown in Figure 5-14.

Tip: To get a more readable picture, we recommend that you create a chart for one controller at a time.
You can read the maximum throughput reached by Server 1 and compare it to the value of Server 2, as shown in Figure 5-15. To create the following chart, repeat this step: click the icon and select Read Data Rate, Write Data Rate, and Total Data Rate while holding down the Shift key.
The throughput of Server 2 is higher than that of Server 1 for most of the time, although there are some time frames where Server 1 is much higher than Server 2. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53. In a real environment, the typical usage of this report is to determine whether your controllers are obtaining balanced usage based upon your current configuration. Best practice design expects that LUNs or volumes can obtain maximum performance only when the storage device uses all available resources in a balanced way. This report provides a direct method to measure this critical aspect of your storage environment.
I/O rates
As was seen with the Data Rates, the I/O Rates also require a review for balance. As shown before, we are providing a set of example reports typical for this type of request. Remember that the key here is to identify when a less than balanced distribution of I/O exists within the different storage subsystems. From the same starting point as with the Data Rates, we have the performance metrics pop-up displayed in Figure 5-16.
Click the Ok button to generate the performance chart report by controller with the selected metric, as shown in Figure 5-17. As for the Data Rates reports, we focus on DS8000-1302541.
The chart shown in Figure 5-17 confirms a poorly balanced I/O distribution within the DS8000. There are also some spikes representing large differences in I/O rates. Further digging into the specific read/write I/O patterns will provide additional input on where the imbalanced load is coming from. In a typical user environment, this is indicative of an error to investigate further.
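One way to quantify the imbalance described here is to compare the average I/O rate of the two controllers over the reporting window. This is a hypothetical sketch: the sample values are invented, and in practice the numbers would come from a Tivoli Storage Productivity Center report export:

```python
# Hypothetical sketch: quantify controller imbalance from exported samples.
# The data below is illustrative, not a real measurement.

def imbalance_ratio(server1_iops, server2_iops):
    """Ratio of the busier controller's average I/O rate to the quieter one's."""
    avg1 = sum(server1_iops) / len(server1_iops)
    avg2 = sum(server2_iops) / len(server2_iops)
    hi, lo = max(avg1, avg2), min(avg1, avg2)
    return hi / lo if lo > 0 else float("inf")

s1 = [1200, 1350, 900, 1100]   # Server 1 samples (IO/s)
s2 = [400, 380, 2100, 450]     # Server 2 samples (IO/s)
print(f"imbalance ratio: {imbalance_ratio(s1, s2):.2f}")
```

A ratio near 1.0 indicates a balanced subsystem; the larger the ratio, the more one controller is carrying the load.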
We now focus on DS8000-1301901. Click the DS8000-1301901 Server 1 entry to highlight it. Click the icon and select Write Cache Hit Percentage (normal) and Write Cache Hit Percentage (sequential). A line graph is generated, as shown in Figure 5-19.
Recommendations
Write Cache Delay percentage is the percentage of write requests coming from a server (or another subsystem when doing remote mirroring) that had to be delayed while existing cache pages were destaged to free up cache pages for the new writes. The optimum value is zero, or anything close to that. Tivoli Storage Productivity Center has built-in default alerts defined at 3% for warning stress and 10% for critical stress. To overcome a problem related to high Write Cache Delay percentage values, you have two options:
- Add more cache. Although this solution sounds logical, often the relief is only minimal, especially if the duration of the problem is not short but is seen in consecutive sample intervals.
- Distribute the load. This is the better solution, because in situations where you see high Write Cache Delay percentage values, the reason is that more write requests are incoming than can be written to the back-end in the same time. The real problem is that the destage process is too slow. To solve it, determine how you can distribute the load so that data can be destaged faster. How to achieve this depends on the subsystem, but generally you must try to spread the load onto more resources, which can be multiple disks, arrays, disk adapters, or loops.
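The default alert levels described above (3% warning, 10% critical) can be expressed as a small classifier. This is only a sketch of the thresholds, not product code; the function name is ours:

```python
# Sketch of the default Write Cache Delay alert levels (3% warning,
# 10% critical) described in the text. Illustrative only.

WARNING_PCT = 3.0
CRITICAL_PCT = 10.0

def cache_delay_status(delay_pct):
    """Classify a Write Cache Delay percentage sample."""
    if delay_pct >= CRITICAL_PCT:
        return "critical stress"
    if delay_pct >= WARNING_PCT:
        return "warning stress"
    return "normal"

for pct in (0.0, 2.5, 4.1, 12.7):
    print(pct, cache_delay_status(pct))
```

Consecutive samples in "warning stress" or worse are the situation where adding cache alone rarely helps and load distribution is needed.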
204
To analyze Controller cache read usage, expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Controller Cache Performance. Select the DS8000 Server entries to highlight them. Click the icon and select Read Cache Hit Percentage (overall), as shown in Figure 5-21.
Figure 5-21 DS8000 Controller Read Cache Hit selection
Recommendations
Read Cache Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. Typical cache usage for enterprise database servers involves sequential I/O workloads with pre-fetch cache loads. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open Systems workloads), but larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
When the number of IOPS to a rank is near or above 1000, the rank can be considered very busy. For 15K RPM disks, the limit is a bit higher. But these high I/O rates to the back-end ranks are not consistent with good performance; they imply that the back-end ranks are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy ranks. A hot array is a single array that experiences a disproportionate amount of I/O, resulting in high disk utilizations and long response times. This is unnecessary poor performance, because other arrays can share the load.
The Disk Utilization percentage metric available in the Array Performance report is normally used to understand the degree of utilization of the back-end arrays serviced by a subsystem. This is critical to understand, because when an array reaches 50% utilization, there is an impact to write and read response times. Typically an array with 70% or higher sustained utilization is a target for capacity management. It is very easy to reach nearly 100% utilization on an array for short periods, such as during a sequential batch import or export, or during a backup job; however, an array sustained above 70% will inject storage queuing into the volumes based on that array. With multiple arrays selected, the generated chart can be used to check whether one array is busier than the others and whether the workload is balanced. Select all the arrays you want to investigate and click the icon. Select the Disk Utilization percentage metric and click Ok. Tip: When creating a chart, you can always modify the Limit days From: and To: fields to zoom into the time period that you want to focus on. Then click Generate Chart to regenerate the graph.
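The 50% and 70% guidance above can be sketched as a simple flagging rule over exported utilization samples. The function and sample numbers are our own illustration, not a product feature:

```python
# Illustrative check of the array utilization guidance: response-time
# impact begins around 50%, and sustained 70%+ marks a capacity-
# management target. Sample data is invented.

def utilization_flag(samples_pct, sustained_limit=70.0):
    """Flag an array based on its average utilization over the window."""
    avg = sum(samples_pct) / len(samples_pct)
    if avg >= sustained_limit:
        return "capacity-management target"
    if avg >= 50.0:
        return "response-time impact likely"
    return "ok"

print(utilization_flag([30, 40, 35]))   # ok
print(utilization_flag([55, 60, 52]))   # response-time impact likely
print(utilization_flag([75, 82, 71]))   # capacity-management target
```

Averaging over the window matters: a single 100% spike during a backup job does not, by itself, make the array a capacity-management target.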
Figure 5-24 shows a report for the first 10 arrays. Array 14 on DS8000-1301901 had two peaks of disk workload that can be investigated further. This is a capacity planning alarm that must be raised, because any event such as the loss of a cache module, the rebuild of a RAID array from a hot spare, and so on, can cause large application impacts.
Recommendations
If there are a lot of cache hits, the subsystem response time might be well below 5 msec even with a high Disk Utilization percentage, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times that are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
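The averaging effect described here can be made concrete: with hit ratio h, a hit response R_hit, and a back-end (miss) response R_miss, the observed average is roughly h·R_hit + (1−h)·R_miss. The numbers below are assumed examples, not measured values:

```python
# Sketch of how cache hits average with back-end responses.
# All timing values here are assumed examples.

def avg_response(hit_ratio, hit_ms, miss_ms):
    """Blended front-end response time (msec) from hits and misses."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# High hit ratio: even a 25 msec back end yields a good front-end average.
print(avg_response(0.90, 1.0, 25.0))  # roughly 3.4 msec
# Poor hit ratio: the back end must be fast to keep the average acceptable.
print(avg_response(0.30, 1.0, 25.0))  # roughly 17.8 msec
```

This is why a 70% busy rank can coexist with acceptable front-end response times when the hit ratio is high, and why a poor hit ratio exposes every back-end millisecond to the application.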
You can run historical charts on the controller by clicking the icon and selecting any available metric. See 5.2.2, Top 10 for Disk #2: Controller Performance reports on page 197 for further details.
See Total I/O Rate (overall) on page 211 to investigate further how to determine which logical volume is the primary I/O-generating volume causing the excessive Disk Utilization percentage shown in the foregoing report.
Recommendations
The throughput for storage volumes can range from fairly small numbers (1 to 10 IOPS) to very large values (more than 1000 IOPS). This depends a lot on the nature of the application. When the I/O rates (throughput) approach 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume. I/O rates for disks and RAID ranks are discussed in the next section. For traditional volumes on a single array, the volume performance is mostly limited by the disk array performance.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to generate the chart shown in Figure 5-29. The red traffic light is displayed here because the report reflects a rate of more than 1,000 IOPS for one of the arrays included in this report. Because this is not a recommended workload for a DS8000 array, we draw your attention to this situation.
Recommendations
The rank I/O limit depends on many factors, chief among them the number of disks in the rank and the speed (RPM) of the disks. But when the number of I/Os per second to a rank is near or above 1000, the rank must be considered very busy. For 15K RPM disks, the limit is a bit higher. These high I/O rates to the back-end ranks are not consistent with good performance; they imply that the back-end ranks are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy ranks.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to generate the chart shown in Figure 5-31. Only array A14 reached a throughput of 143 MB/sec as the highest value, which is a high but still acceptable value.
Recommendations
The maximum bandwidth at the rank level for sequential read activity (64 KB I/O size) is 240 MB/sec, and 150 MB/sec for write. The Redbooks publication IBM TotalStorage DS8000 Series: Performance Monitoring and Tuning, SG24-7146, contains information about the maximum bandwidth. Limiting write workload to one rank can increase the persistent memory destaging execution time and so impact all write activities on the same DS8000 subsystem. To avoid this situation, spread write I/O across multiple ranks. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to produce the report shown in Figure 5-33. Notice that the response time shown at the back-end level reached values that are too high. This needs to be investigated at the volume level to determine the current impact to volumes using these arrays.
Recommendations
For random read I/O, the back-end rank (disk) read response time must seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-End Write Response Time will be higher because of RAID 5, RAID 6, or RAID 10 striping algorithms, but must seldom exceed 80 msec.
To investigate which volumes can be affected by poor Back-End Response Time, click the icon next to array 2, as shown in Figure 5-34.
You get the list of all volumes belonging to that array. Select them all, click the icon, and select Overall Response Time. Click Ok to generate the next chart. You get multiple screens, each one containing 10 volumes (or any other customized value). Look for the volumes with the highest response time. Figure 5-35 shows that the highest value is 36 msec, which sounds reasonable. A good cache hit percentage can justify this configuration, and it ought not to be a factor for applications. In a best practice environment, this report justifies a review of the storage configuration to obtain better balance; the expectation is a large reduction in these higher than normal response times.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
The DS8000 stores data in persistent memory before sending an acknowledgement to the host. If the persistent memory is full (no space available), the host receives a retry for its write request. In parallel, the server has to destage data stored in its persistent memory to the back-end disks before accepting new write operations from any host. If one of your volumes is facing delayed write operations due to a persistent memory constraint, you need to move the volume to a less-used rank, or spread the volume over multiple ranks (increasing the number of DDMs used) to avoid this situation. If this solution does not fix the persistent memory constraint problem, you can consider adding cache capacity to your DS8000.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Metrics: In Tivoli Storage Productivity Center 4.2, a new Volume Utilization metric is available for all storage devices. It can provide a quick view into hot volumes as seen by servers and can be used as a starting point for performance analysis. This metric allows you to display a combination of two important metrics in a single report. For details on this metric, see Table B-2 on page 337. Even though the Top Volume reports do not include the Volume Utilization metric, it can be quite helpful when reviewing this report to check highly utilized volumes and determine whether a volume needs attention. In effect, this metric gauges the potential for the volume to be out of gas.
This report provides a valuable tool for reviewing the impact of caching within your storage environment, because caching for both read and write I/Os has a direct relationship to the I/O response times seen by applications.
Click the icon and select Read cache Hits percentage (overall). Click Ok to generate the chart shown in Figure 5-38.
Recommendations
The read cache hit ratio shows how efficiently your cache works on the disk subsystems. For example, a value of 100% indicates that all read requests are satisfied from the cache. If the disk subsystem cannot complete an I/O request from the cache, it transfers the data from the DDMs, suspending the I/O request until it has read the data. This situation is called a cache miss. On a cache miss, the response time includes not only the data transfer time between host and cache, but also the time that it takes to read the data from the DDMs into cache before sending it to the host. An application can be cache-unfriendly by nature. An example is a large amount of sequential data written to a highly fragmented file system in an open systems environment: if an application reads this file, the cache hit ratio will be very low, because the application never reads the same data, due to the nature of sequential access. In this case, defragmentation of the file system will improve the performance. You cannot determine whether increasing the size of the cache improves I/O performance without knowing the characteristics of the data. We recommend that you monitor the read hit ratio over an extended period of time:
- If the cache hit ratio has been low historically, it is most likely due to the nature of your data, and you do not have much control over this.
- If you have a high cache hit ratio initially and it decreases as you load more data with the same characteristics, then adding cache, or moving some data to another cluster that uses the other cluster's cache, can improve the situation.
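The cache-miss cost described above can be sketched directly: a miss adds the DDM read time to the host transfer time. The timings below are assumed example values, not measurements from any subsystem:

```python
# Sketch of the cache-miss effect: a miss adds the DDM read time to
# the host transfer time. Timings are assumed examples.

def read_response(hit, transfer_ms=0.5, ddm_read_ms=8.0):
    """Host-observed read response (msec) for a cache hit vs a cache miss."""
    return transfer_ms if hit else transfer_ms + ddm_read_ms

print(read_response(hit=True))   # 0.5 msec: data served from cache
print(read_response(hit=False))  # 8.5 msec: miss, data read from DDMs first
```

The gap between the two cases is exactly why the read hit ratio dominates the average response time seen by the host.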
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
SAN Storage Performance Management Using Tivoli Storage Productivity Center
Click Generate Report on the Selection panel to regenerate the report, shown in Figure 5-40. The top five volumes with the highest total data rate at the last collection time are listed on the report.
Recommendations
The throughput for storage volumes can range from fairly small numbers (1 to 10 IOPS) to very large values (more than 1000 IOPS). This depends a lot on the nature of the application. When the I/O rates (throughput) approach 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
Typical response time ranges are only slightly more predictable. In the absence of additional information, we often assume that 10 milliseconds is pretty high. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with 20 msec response time. The appropriate value might also change between shifts or on the weekend. A response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent. The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days can range from 4 to 8 msec, though in practice, many workloads do better than nominal. It is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time for the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time. So 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.
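The queueing assumption in this paragraph — wait time roughly equal to service time at 50% utilization — follows from the simple M/M/1 approximation R = S / (1 − ρ). A sketch using the section's own example numbers (5-7 msec service time):

```python
# M/M/1 approximation of disk response time: R = S / (1 - utilization).
# At 50% utilization the wait equals the service time, matching the
# 10-14 msec figure quoted in the text for a 5-7 msec service time.

def disk_response(service_ms, utilization):
    """Approximate disk response time (msec) under M/M/1 assumptions."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

for s in (5.0, 7.0):
    print(f"service {s} msec at 50% busy -> {disk_response(s, 0.5):.1f} msec")
```

The same formula shows why pushing an array toward 70% utilization is costly: at ρ = 0.7 the response time is more than triple the bare service time.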
For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times that are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time. To make a long story short, (front-end) response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec. Important: All the above considerations are not valid for SSD disks, where seek time and latency are not applicable. Expect much better performance from these disks, and therefore very short response times (less than 4 ms), for workloads that can benefit from SSD disks. Today those are small block I/O workloads with random read patterns. See 3.2.4, Performance metric guidelines on page 62 for further details on SSD performance. There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric will drive response times much higher than 30 msec. For further details, refer to Chapter 3, General performance management methodology on page 53.
See 5.7, Case study: Top volumes response time and I/O rate performance report on page 280 to create a tailored report for your environment.
For more details regarding these Rules of Thumb and how to interpret these values, see Appendix A, Rules of Thumb and suggested thresholds on page 327.
This report is valuable because you can quickly determine whether you have a potential imbalance in your data rates for a storage subsystem, SVC I/O Group, or specific volumes of your mission critical applications. The first report presented is a general review of all ports, for all subsystems known by Tivoli Storage Productivity Center, sorted by subsystem name. This chart is valuable because you can see everything available, but in regard to SLA, problem determination, or change management, it is a starting or root report from which you can build custom reports. For SLA reporting, you have a set of questions that require reports to answer. One such question might be: Is your storage environment supporting the data rate required by your mission critical application? With this report you can use the selection button to specify the storage subsystem and FC ports that are being used by this specific application server. Then, by reviewing the Total I/O and Total Data Rate set of reports, you can present either a tabular view, a graphic view, or a combination of the two to answer this question. In our environment, the Storwize V7000 showed the highest Total Port I/O Rate during the time frame used in this report. See 5.3, Top 10 reports for SVC and Storwize V7000 for some example reports.
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk to view system reports that are relevant to SVC and Storwize V7000. I/O Group Performance and Managed Disk Group Performance are reports specific to SVC and Storwize V7000, while Module/Node Cache Performance is also available for IBM XIV. In Figure 5-48 those reports are highlighted.
Figure 5-49 shows a sample structure to review basic SVC / Storwize V7000 concepts about product structure and then proceed with performance analysis at the different component levels.
5.3.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports
Tip: For SVCs with multiple I/O groups, a separate row is generated for every I/O group within each SVC. In our lab environment, data was collected for an SVC that had a single I/O group. The scroll bar at the bottom of the table indicates that additional metrics can be viewed, as shown in Figure 5-50.
Important: The data displayed in a performance report is the last collected value at the time the report is generated. It is not an average of the last hours or days; it simply shows the last data collected. Click the icon next to the SVC io_grp0 entry to drill down and view the statistics by node within the selected I/O group. Notice that a new tab, Drill down from io_grp0, is created containing the report for the nodes within the SVC. See Figure 5-51.
To view a historical chart of one or more specific metrics for the resources, click the icon. A list of metrics is displayed, as shown in Figure 5-52. You can select one or more metrics that use the same measurement unit. If you select metrics that use different measurement units, you will receive an error message.
Restriction: To visualize multiple metrics with different measurement units (that is, MB/s, IO/s, percentages) in a single graphic, you need to generate the chart with an external tool such as Microsoft Excel. To get the data into Excel, use the export facility of Tivoli Storage Productivity Center, available within the GUI or through the CLI using TPCTOOL. See Appendix C, Reporting with Tivoli Storage Productivity Center on page 365, CLI: TPCTOOL as a reporting tool.
You can change the reporting time range and click the Generate Chart button to regenerate the graph, as shown in Figure 5-53. A continually high Node CPU Utilization rate indicates a busy I/O group; in our environment CPU utilization does not rise above 24%, which is a more than acceptable value.
Recommendations
If the CPU utilization for an SVC or Storwize V7000 version 6.2 node remains constantly above 70%, it might be time to increase the number of I/O Groups in the cluster. You can also redistribute workload to other I/O Groups in the cluster if available, or move volumes from one I/O Group to another if one is available. Remember that through SVC or Storwize V7000 version 6.2, moving a volume from one I/O Group to another still requires a volume outage. You can add I/O Groups to the cluster (up to the maximum of four I/O Groups per SVC cluster, or two I/O Groups per Storwize V7000 version 6.2). If there are already four I/O Groups in a cluster (with the latest firmware installed) and you are still seeing high SVC or Storwize V7000 node CPU utilization in the reports, it is time to build a new cluster and consider either migrating some storage to the new cluster or, if the existing SVC nodes are not 2145-CG8 nodes, upgrading them to CG8 nodes.
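The decision flow above can be summarized in a small sketch. The 70% trigger and the per-platform I/O Group maximums come from the text; the function, its wording, and the sample calls are our own illustration, and real decisions should of course weigh the factors discussed in this section:

```python
# Illustrative decision sketch for the node CPU guidance in the text.
# Limits come from the section (70% trigger; 4 I/O Groups per SVC
# cluster, 2 per Storwize V7000 6.2); the helper itself is ours.

MAX_IO_GROUPS = {"SVC": 4, "Storwize V7000": 2}

def cpu_advice(platform, avg_cpu_pct, io_groups):
    """Suggest a next step for sustained node CPU utilization."""
    if avg_cpu_pct <= 70.0:
        return "no action"
    if io_groups < MAX_IO_GROUPS[platform]:
        return "add an I/O group or redistribute volumes"
    return "build a new cluster or upgrade nodes"

print(cpu_advice("SVC", 24.0, 1))   # no action
print(cpu_advice("SVC", 85.0, 4))   # build a new cluster or upgrade nodes
```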
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Notice that the I/Os are only present on Node 2. So, in Figure 5-56, you can see a configuration problem, where workload is not well balanced, at least during this time frame (this is the reason for the red traffic light shown in that figure).
Recommendations
To interpret your performance results, the first recommendation is to always go back to your baseline (see 3.3, Creating a baseline with Tivoli Storage Productivity Center on page 68). Moreover, some industry benchmarks for the SVC and Storwize V7000 are available. SVC 4.2 and the 8G4 node brought a dramatic increase in performance, as demonstrated by the results in the Storage Performance Council (SPC) benchmarks, SPC-1 and SPC-2. The benchmark number, 272,505.19 SPC-1 IOPS, is the industry-leading OLTP result, and the PDF is available at the following website: http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf An SPC Benchmark 2 was also performed for Storwize V7000; the Executive Summary PDF is available at the following website: http://www.storageperformance.org/benchmark_results_files/SPC-2/IBM_SPC-2/B00052_IBM_Storwize-V7000/b00052_IBM_Storwize-V7000_SPC2_executive-summary.pdf Figure 5-55 shows numbers for maximum I/Os and MB/s per I/O group. The SVC performance you obtain will be based upon multiple factors such as these:
- The specific SVC nodes in your configuration
- The type of Managed Disks (volumes) in the Managed Disk Group (MDG)
- The application I/O workloads using the MDG
- The paths to the back-end storage
These are all factors that ultimately lead to the final performance realized. In reviewing the SPC benchmark (see Figure 5-55), the I/O and data rate results obtained differ considerably depending on the transfer block size used. Looking at the two-node I/O group used, you might see 122,000 I/Os if all of the transfer blocks were 4K. In typical environments, they rarely are. So if you jump down to 64K or bigger, with anything over about 32K you might realize a result more typical of the 29,000 seen in the SPC benchmark.
Max I/Os and MB/s per I/O group (70/30 R/W miss):

Node model   4K transfer size      64K transfer size
2145-8G4     122K I/Os, 500 MB/s   29K I/Os, 1.8 GB/s
2145-8F4     72K I/Os, 300 MB/s    23K I/Os, 1.4 GB/s
2145-4F2     38K I/Os, 156 MB/s    11K I/Os, 700 MB/s
2145-8F2     72K I/Os, 300 MB/s    15K I/Os, 1 GB/s
Figure 5-55 SPC SVC benchmark Max I/Os and MB/s per I/O group
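The transfer-size effect in Figure 5-55 is simply that data rate ≈ I/O rate × transfer size. A quick check of the 2145-8G4 rows (the arithmetic helper is our own sketch; only the row values come from the figure):

```python
# Verify the Figure 5-55 relationship: data rate ~= IOPS * transfer size.
# Row values are from the figure; the helper is an illustrative sketch.

def data_rate_mb_s(iops, transfer_kb):
    """Approximate data rate in MB/s for a given I/O rate and block size."""
    return iops * transfer_kb / 1024.0

print(data_rate_mb_s(122_000, 4))   # ~477 MB/s, close to the quoted 500 MB/s
print(data_rate_mb_s(29_000, 64))   # ~1813 MB/s, close to the quoted 1.8 GB/s
```

This is why larger transfer sizes yield far fewer I/Os per second but much higher throughput from the same I/O group.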
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53. As mentioned before, in the I/O rate graph shown in Figure 5-56, you can see a configuration problem indicated by the red traffic light in the lower right corner.
Response time
To view the read and write response time at Node level, click the Drill down from io_grp0 tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Backend Read Response Time and Backend Write Response Time metrics, as shown in Figure 5-57.
Click Ok to generate the report, as shown in Figure 5-58. We see acceptable back-end response time values for both read and write operations, and these are consistent across both of our I/O Groups.
Recommendations
For random read I/O, the back-end rank (disk) read response times must seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-end write response times will be higher because of RAID 5 (or RAID 10) algorithms, but must seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines. In case of poor response time, you have to investigate using all available information from the SVC and the back-end storage controller. Possible causes for a large change in response times from the back-end storage, which might be visible using the storage controller management tool, include these:
- Physical array drive failure leading to an array rebuild. This drives additional back-end storage subsystem internal read/write workload while the rebuild is in progress. If this is causing poor latency, it might be desirable to adjust the array rebuild priority to lessen the load. However, this must be balanced with the increased risk of a second drive failure during the rebuild, which will cause data loss in a RAID 5 array.
- Cache battery failure leading to cache being disabled by the controller. This can usually be resolved simply by replacing the failed battery.
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
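These rules of thumb can also be checked programmatically against performance data exported from Tivoli Storage Productivity Center. The following Python sketch is illustrative only: the record layout and field names (backend_read_ms, backend_write_ms) are hypothetical, not an actual TPC export format.

```python
# Hedged sketch: flag intervals whose back-end response times exceed the
# rule-of-thumb limits (25 ms random read, 80 ms write). The sample
# records and field names are hypothetical.

READ_LIMIT_MS = 25.0   # back-end read response time rule of thumb
WRITE_LIMIT_MS = 80.0  # back-end write response time rule of thumb

def flag_intervals(samples, read_hit_ratio=0.0):
    """Return the samples that violate the rules of thumb.

    The read limit is waived when the read hit ratio is near 99%,
    as noted in the text.
    """
    violations = []
    for s in samples:
        bad_read = s["backend_read_ms"] > READ_LIMIT_MS and read_hit_ratio < 0.99
        bad_write = s["backend_write_ms"] > WRITE_LIMIT_MS
        if bad_read or bad_write:
            violations.append(s)
    return violations

samples = [
    {"time": "10:00", "backend_read_ms": 12.0, "backend_write_ms": 40.0},
    {"time": "10:05", "backend_read_ms": 31.5, "backend_write_ms": 95.0},
]
print([v["time"] for v in flag_intervals(samples)])  # → ['10:05']
```

A check like this can run against each new performance data collection and feed an alerting script, complementing the built-in alerts discussed in Chapter 3.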
Data Rate
To look at the data rates, click the "Drill down from io_grp0" tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Read Data Rate metric. Hold down the Shift key and also select Write Data Rate and Total Data Rate. Then click Ok to generate the chart, shown in Figure 5-59.
To interpret your performance results, the first recommendation is always to go back to your baseline (see 3.3, "Creating a baseline with Tivoli Storage Productivity Center" on page 68). In addition, public benchmark results are available for comparison. The SVC throughput result of 7,084.44 SPC-2 MBPS was the industry-leading throughput benchmark, and the executive summary PDF is available here:
http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf
5.3.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports
Efficient use of cache can help enhance virtual disk I/O response time. The Node Cache Performance report displays cache-related metrics such as Read and Write Cache Hits percentage and Readahead percentage of cache hits. The cache memory resource reports provide an understanding of the utilization of the SVC or Storwize V7000 cache, and an indication of whether the cache is able to service and buffer the current workload. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Select the Module/Node Cache performance report. Notice that this report is generated at the SVC and Storwize V7000 node level (there is also an entry that refers to the IBM XIV Storage System; see 5.2.7, "IBM XIV Module Cache Performance Report" on page 228), as shown in Figure 5-60.
Figure 5-60 SVC and Storwize V7000 Node cache performance report
Important: The flat line for node1 does not mean that read requests for that node cannot be handled by the cache; it means that there is no traffic at all on that node, as illustrated in Figure 5-62 and Figure 5-63, where Read Cache Hit Percentage and Read I/O Rates are compared over the same time interval.
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open Systems workloads), but larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
If you need to analyze cache performance further and determine whether the cache is sufficient for your workload, you can run multiple-metric charts. Select only metrics measured as percentages, because a single chart can only combine metrics with the same unit type. In the Selection panel, move the percentage metrics you want to include from Available Columns to Included Columns; then, using the Selection... button, check only the Storwize V7000 entries. Figure 5-66 on page 245 shows an example where several percentage metrics are chosen for Storwize V7000. The complete list of metrics is as follows:
CPU Utilization percentage: The average utilization of the node controllers in this I/O group during the sample interval.
Dirty Write percentage of Cache Hits: The percentage of write cache hits that modified only data already marked dirty in the cache; that is, re-written data. This is an obscure measurement of how effectively writes are coalesced before destaging.
Read/Write/Total Cache Hits percentage (overall): The percentage of reads/writes/total I/Os during the sample interval that were found in cache. This is an important metric; the write cache hit percentage must be very nearly 100%.
Readahead percentage of Cache Hits: An obscure measurement of cache hits involving data that has been prestaged for one reason or another.
Write Cache Flush-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Flush-through write mode during the sample interval.
Write Cache Overflow percentage: For SVC and Storwize V7000, the percentage of write operations that were delayed due to lack of write-cache space during the sample interval.
Write Cache Write-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Write-through write mode during the sample interval.
Write Cache Delay percentage: The percentage of all I/O operations that were delayed due to write-cache space constraints or other conditions during the sample interval. Only writes can be delayed, but the percentage is of all I/O.
Small Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are <= 8 KB.
Small Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are <= 8 KB.
Medium Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 8 KB and <= 64 KB.
Medium Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 8 KB and <= 64 KB.
Large Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 64 KB and <= 512 KB.
Large Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 64 KB and <= 512 KB.
Very Large Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 512 KB.
Very Large Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 512 KB.
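The transfer-size buckets above partition I/O operations by data transfer size. As a quick illustration of the boundaries (the boundaries come from the metric definitions; the function name is our own):

```python
def transfer_size_bucket(kb):
    """Classify a transfer by size per the metric definitions above:
    small <= 8 KB, medium <= 64 KB, large <= 512 KB, very large > 512 KB."""
    if kb <= 8:
        return "small"
    if kb <= 64:
        return "medium"
    if kb <= 512:
        return "large"
    return "very large"

print([transfer_size_bucket(k) for k in (4, 32, 256, 1024)])
# → ['small', 'medium', 'large', 'very large']
```

Note that the boundaries are inclusive on the upper side, so an exactly 8 KB transfer counts as small and an exactly 64 KB transfer counts as medium.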
Overall Host Attributed Response Time Percentage: The percentage of the average response time, both read response time and write response time, that can be attributed to delays from host systems. This metric is provided to help diagnose slow hosts and poorly performing fabrics. The value is based on the time taken for hosts to respond to transfer-ready notifications from the SVC nodes (for read) and the time taken for hosts to send the write data after the node has responded to a transfer-ready notification (for write).
The following metric is only applicable in a Global Mirror session:
Global Mirror Overlapping Write Percentage: Average percentage of write operations issued by the Global Mirror primary site that were serialized overlapping writes for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes were serialized.
After selecting Storwize V7000 node1 and node2, select all the metrics in the Select charting option pop-up window and click Ok to generate the chart. In our test, as shown in Figure 5-67, we notice a drop in the Cache Hits percentage. Even though the drop is not dramatic, it can be taken as an example of a symptom that warrants further investigation. Changes in these performance metrics, together with an increase in back-end response time (see Figure 5-68), show that the storage controller is heavily burdened with I/O, and the Storwize V7000 cache can become full of outstanding write I/Os. Host I/O activity will be impacted by the backlog of data in the Storwize V7000 cache, as will any other Storwize V7000 workload going to the same MDisks.
I/O Groups: If cache utilization is a problem, in SVC and Storwize V7000 version 6.2 you can add cache to the cluster by adding an I/O Group and moving volumes to the new I/O Group. However, adding an I/O Group and moving a volume from one I/O Group to another are still disruptive actions, so proper planning to manage this disruption is required.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.3 Top 10 for SVC #3: Managed Disk Group performance reports
The Managed Disk Group performance report provides disk performance information at the managed disk group level. It summarizes read and write transfer sizes and back-end read, write, and total I/O rates. From this report you can easily drill up to see the statistics of the virtual disks supported by a managed disk group, or drill down to view the data for the individual MDisks that make up the managed disk group. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk and select Managed Disk Group performance. A table is displayed listing all the known Managed Disk Groups and their last collected statistics, based on the latest performance data collection. See Figure 5-69.
One of the Managed Disk Groups is CET_DS8K1901mdg. Click the drill down icon on the entry CET_DS8K1901mdg to drill down. A new tab is created, containing the Managed Disks in the Managed Disk Group. See Figure 5-70.
Figure 5-70 Drill down from Managed Disk Group Performance report
Click the drill down icon on the entry mdisk61 to drill down. A new tab is created, containing the Volumes in the Managed Disk. See Figure 5-71.
I/O rate
We recommend that you analyze how the I/O workload is split between Managed Disk Groups, to determine whether it is well balanced. Click the Managed Disk Groups tab, select all Managed Disk Groups, click the icon, and select Total Backend I/O Rate, as shown in Figure 5-72.
Figure 5-72 Top 10 SVC - Managed Disk Group I/O rate selection
Click Ok to generate the next chart, as shown in Figure 5-73. When reviewing this general chart, understand that it reflects all I/O to the back-end storage from the Managed Disks included within this Managed Disk Group. The key for this report is a general understanding of back-end I/O rate usage, not whether there is outright balance.
Although the SVC and Storwize V7000 by default stripe write and read I/Os across all Managed Disks, the striping is not a RAID 0 type of stripe. Rather, because the Virtual Disk is a concatenated volume, the striping injected by the SVC and Storwize V7000 lies only in how extents are identified for use when the Virtual Disk is created. Until host write actions fill up the first extent, the remaining extents in the block Virtual Disk provided by the SVC will not be used. It is therefore very likely, when you look at the Managed Disk Group Back-End I/O report, that you will not see balanced write activity even within a single Managed Disk Group. In the report shown in Figure 5-73, for the time frame specified, we see at one point a maximum of nearly 8200 IOPS.
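To illustrate the allocation behavior described above, here is a toy sketch of round-robin extent assignment at volume creation time. This is our own simplification for illustration, not the actual SVC allocation algorithm.

```python
def allocate_extents(mdisks, num_extents):
    """Assign volume extents round-robin across the MDisks in the group,
    a simplified sketch of the creation-time striping described above."""
    return [mdisks[i % len(mdisks)] for i in range(num_extents)]

# The extent map is striped, but a host filling the volume sequentially
# still touches extent 0 first, then extent 1, and so on, which is why
# write activity does not look balanced the way a RAID 0 stripe would.
layout = allocate_extents(["mdisk0", "mdisk1", "mdisk2"], 6)
print(layout)  # → ['mdisk0', 'mdisk1', 'mdisk2', 'mdisk0', 'mdisk1', 'mdisk2']
```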
Figure 5-73 Top 10 SVC - Managed Disk Group I/O rate report
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Response time
Now return to the list of Managed Disks by moving to the "Drill down from CET_DS8K1901mdg" tab (see Figure 5-70 on page 247). Select all the Managed Disk entries, click the icon, and select the Backend Read Response Time metric, as shown in Figure 5-74.
Recommendations
For random read I/O, the back-end rank (disk) read response time should seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-end write response times will be higher because of RAID 5, RAID 6, or RAID 10 algorithms, but should seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Select all the Managed Disks from the "Drill down from CET_DS8K1901mdg" tab, click the icon, and select the Backend Data Rates, as shown in Figure 5-76.
Click Ok to generate the report shown in Figure 5-77. Here the workload is not balanced across the Managed Disks.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports
Tivoli Storage Productivity Center provides five reports on Top Volume performance:
Top Volume Cache performance: Prioritized by the Total Cache Hits percentage (overall) metric.
Top Volume Data Rate performance: Prioritized by the Total Data Rate metric.
Top Volume Disk performance: Prioritized by the Disk to Cache Transfer Rate metric.
Top Volume I/O Rate performance: Prioritized by the Total I/O Rate (overall) metric.
Top Volume Response performance: Prioritized by the Overall Response Time metric.
Volumes referred to in these reports correspond to the Virtual Disks in SVC.
Important: The last collected performance data on volumes is used for these reports. Each report creates a ranked list of volumes based on the metric used to prioritize the performance data.
You can customize these reports according to the needs of your environment. To limit these system reports to just SVC subsystems, you have to specify a filter, as shown in Figure 5-78. Click the Selection tab, then click Filter. Click Add to specify another condition to be met. This has to be done for all five reports.
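Conceptually, each Top Volume report is just a ranked list on one metric. The following Python sketch shows the idea; the record layout and metric key names are hypothetical, not TPC's internal representation.

```python
def top_volumes(records, metric, n=10):
    """Rank volume performance records by one metric, highest first,
    as the Top Volume reports do."""
    return sorted(records, key=lambda r: r[metric], reverse=True)[:n]

# Hypothetical last-collected samples for three volumes.
records = [
    {"volume": "vdisk1", "total_io_rate": 950.0},
    {"volume": "vdisk2", "total_io_rate": 1200.0},
    {"volume": "vdisk3", "total_io_rate": 300.0},
]
top = top_volumes(records, "total_io_rate", n=2)
print([r["volume"] for r in top])  # → ['vdisk2', 'vdisk1']
```

Changing the metric argument (cache hit percentage, data rate, response time, and so on) reproduces the different report flavors from the same data.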
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Click Generate Report on the Selection panel to regenerate the report, shown next in Figure 5-81. If this report is generated during the workload run period, the volumes with the highest total data rate are listed on the report.
Recommendations
The throughput of storage volumes can range from fairly small numbers (1 to 10 I/Os per second) to very large values (more than 1000 I/Os per second), depending greatly on the nature of the application. When the I/O rate (throughput) approaches 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Recommendations
Typical response time ranges are only slightly more predictable. In the absence of additional information, we often assume (and our performance models assume) that 10 milliseconds is pretty high. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with 20 msec response time. The appropriate value might also change between shifts or on the weekend. A response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent.
The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days can range from 4 to 8 msec, though in practice, many workloads do better than nominal; it is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time. So 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.
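The arithmetic above can be written out as a worked example. We use the simple M/M/1 queueing formula, wait = service × U / (1 − U), which reduces to "wait roughly equals service time at 50% utilization" as stated in the text; the drive figures are illustrative assumptions.

```python
def disk_response_time_ms(seek_ms, rpm, transfer_ms, utilization):
    """Estimate disk response time: service = seek + latency + transfer,
    with latency taken as half a rotation, plus M/M/1 queueing wait."""
    latency_ms = 0.5 * 60000.0 / rpm          # half a rotation, in ms
    service_ms = seek_ms + latency_ms + transfer_ms
    wait_ms = service_ms * utilization / (1.0 - utilization)
    return service_ms + wait_ms

# Assumed 15K RPM drive: 4 ms effective seek, ~2 ms latency, 0.5 ms transfer.
# At 50% utilization the wait equals the service time, doubling it,
# which lands in the 10-14 msec range quoted in the text.
rt = disk_response_time_ms(seek_ms=4.0, rpm=15000, transfer_ms=0.5, utilization=0.5)
print(round(rt, 1))  # → 13.0
```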
For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times; these are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
To simplify, we can assume that (front-end) response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec.
Important: These considerations are not valid for SSDs, where seek time and rotational latency are not applicable. Expect much better performance from these disks, and therefore very short response times (less than 4 ms), for workloads that can benefit from SSDs; today these are small-block I/O workloads with random read patterns. See 3.2.4, "Performance metric guidelines" on page 62 for further details on SSD performance.
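The averaging effect described above is easy to quantify: the front-end response time is the hit ratio times the cache-hit time plus the miss ratio times the back-end time. A small sketch with assumed (not measured) timings:

```python
def avg_response_ms(hit_ratio, hit_ms, miss_ms):
    """Blend cache-hit and cache-miss (back-end) response times by hit ratio."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# A high hit ratio tolerates a slow back end: 90% hits at 1 ms with
# 25 ms misses still averages only ~3.4 ms at the front end.
print(round(avg_response_ms(0.90, 1.0, 25.0), 2))  # → 3.4
# A poor hit ratio needs a fast back end to stay in the 5-15 ms range:
print(round(avg_response_ms(0.30, 1.0, 15.0), 2))  # → 10.8
```

This is why the text allows back-end ranks in the 20-25 msec range only when the hit ratio is healthy.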
See 5.7, "Case study: Top volumes response time and I/O rate performance report" on page 280 to create a tailored report for your environment. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports
The SVC and Storwize V7000 port performance reports help you understand the SVC and Storwize V7000 impact on the fabric, and give you an indication of the traffic between the following systems:
SVC (or Storwize V7000) and hosts that receive storage
SVC (or Storwize V7000) and back-end storage
Nodes in the SVC (or Storwize V7000) cluster
These reports can help you understand whether the fabric might be a performance bottleneck and whether upgrading the fabric can lead to performance improvement. The Port Performance report summarizes the various send, receive, and total port I/O rates and data rates. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk and click Port Performance. In order to display only SVC and Storwize V7000 ports, click Filter to produce a report for all the ports belonging to SVC or Storwize V7000 subsystems, as shown in Figure 5-85.
A separate row is generated for each subsystem's ports. The information displayed in each row reflects the data last collected for the port. Notice that the Time column displays the last collection time, which might differ between subsystem ports. Not all the metrics in the Port Performance report are applicable to all ports. For example, the Port Send Utilization percentage, Port Receive Utilization percentage, and Overall Port Utilization percentage data are not available on SVC or Storwize V7000 ports. N/A is displayed when data is not available, as shown in Figure 5-86. By clicking Total Port I/O Rate you get a list prioritized by I/O rate.
Figure 5-87 SVC and Storwize V7000 Port I/O rate report
Recommendations
Based on the nominal speed of each FC port, which can be 4 Gbit, 8 Gbit, or more, we recommend not exceeding 50-60% of that value as the data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you need to generate an alert when it exceeds 400 MB/sec. See 3.4.4, "Defining the alerts" on page 80 for information about how to set up alerts.
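The rule of thumb above is simple arithmetic: an N Gbit FC port moves roughly N × 100 MB/sec, and the alert threshold is 50% of that. A small sketch (the function name is our own):

```python
def port_alert_threshold_mb(speed_gbit, fraction=0.5):
    """Alert threshold in MB/sec: an N Gbit FC port moves roughly
    N * 100 MB/sec, and we alert at 50% of that by default."""
    return speed_gbit * 100.0 * fraction

print(port_alert_threshold_mb(8))  # → 400.0
print(port_alert_threshold_mb(4))  # → 200.0
```

The same formula reproduces the recommended thresholds in Table 5-1 on page 266 for every port speed listed there.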
To investigate further using the Port Performance report, go back to the I/O Group Performance report. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click I/O Group Performance and drill down to the node level. In the example in Figure 5-88, we choose Node 1 of the SVC subsystem.
Then click the icon and select Port to Local Node Send Queue Time, Port to Local Node Receive Queue Time, Port to Local Node Receive Response Time and Port to Local Node Send Response Time, as shown in Figure 5-89.
Look at port rates between SVC nodes, hosts, and disk storage controllers. Figure 5-90 shows low queue and response times, indicating that the nodes do not have a problem communicating with each other.
If this report shows high queue and response times, write activity is affected, because each node communicates with each other node over the fabric. Unusually high numbers in this report indicate:
SVC (or Storwize V7000) node or port problem (unlikely)
Fabric switch congestion (more likely)
Faulty fabric ports or cables (most likely)
For further details about this SVC and Storwize V7000 Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
After you have the I/O rate review chart, you also need to generate a data rate chart for the same time frame. This will support a review of your HA ports for this application. Generate another historical chart with the Total Port Data Rate metric, as shown in Figure 5-92, which confirms the unbalanced workload for one port shown in the foregoing report.
Recommendations
According to the nominal speed of each FC port, which can be 4 Gbit, 8 Gbit, or more, we recommend not exceeding 50-60% of that value as the data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you need to generate an alert when it exceeds 400 MB/sec. See 3.4.4, "Defining the alerts" on page 80 for information about how to set up alerts.
For further details about this SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Tip: Rather than using a specific report to monitor Switch Port Errors, we recommend that you use the Constraint Violation report. By setting an Alert for the number of errors at the switch port level, the Constraint Violation report becomes a direct tool to monitor the errors in your fabric. For details, see 3.5.5, Constraint Violations reports on page 113.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Fabric and select Top Switch Ports Data Rates performance. Click the icon and select Total Port Data Rate, as shown in Figure 5-94.
Click Ok to generate the chart shown next in Figure 5-95. In this case, the port data rates do not reach a warning level, given that the FC port speed is 8 Gbits/sec.
Recommendations
Use this report to monitor whether any switch ports are overloaded. According to the FC port nominal speed (2 Gbit, 4 Gbit, or more), as shown in Table 5-1, establish the maximum workload a switch port can reach. We recommend not exceeding 50-70% of the nominal speed.
Table 5-1 Switch Port data rates

FC Port speed (Gbits/sec)   FC Port speed (MBytes/sec)   Recommended Port Data Rate threshold
1                           100 MB/sec                   50 MB/sec
2                           200 MB/sec                   100 MB/sec
4                           400 MB/sec                   200 MB/sec
8                           800 MB/sec                   400 MB/sec
10                          1000 MB/sec                  500 MB/sec
Click Generate Report to get the output shown in Figure 5-97. Scrolling to the right of the table, more information is available, such as the volume names, volume capacity, and allocated and unallocated volume space.
Data on the report can be exported by selecting File → Export Data, to a comma-delimited file, a comma-delimited file with headers, a formatted report file, or an HTML file. You can start from this volume list to analyze performance data and workload I/O rate. Tivoli Storage Productivity Center provides a report that shows volume to back-end volume assignments. To display the report, expand Disk Manager → Reporting → Storage Subsystem → Volume to Backend Volume Assignment → By Volume. Click Filter to limit the list of volumes to the ones belonging to server tpcblade3-7, as shown in Figure 5-98.
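After exporting a report as a comma-delimited file with headers, you can post-process it outside the GUI. Here is a hedged sketch using only the Python standard library; the column names and sample rows are hypothetical, so match them to your actual export.

```python
import csv
import io

# Hypothetical excerpt of an exported report; real column names will differ.
exported = """Volume,Capacity GB,Total I/O Rate
tpcblade3-7-ko2,100,950.0
tpcblade3-7-ko3,100,1200.0
vdisk9,50,12.5
"""

def busy_volumes(csv_text, min_io_rate):
    """Return volume names whose Total I/O Rate exceeds a threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Volume"] for row in reader
            if float(row["Total I/O Rate"]) > min_io_rate]

print(busy_volumes(exported, 900))  # → ['tpcblade3-7-ko2', 'tpcblade3-7-ko3']
```

In practice you would read the exported file with open() instead of an inline string; the filtering logic stays the same.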
Scroll to the right to see the SVC Managed Disks and back-end volumes on the DS8000, as shown in Figure 5-100.
Back-end storage: The highlighted lines with N/A values relate to a back-end storage subsystem that is not defined in our Tivoli Storage Productivity Center environment. To obtain information about a back-end storage subsystem, it has to be added to the Tivoli Storage Productivity Center environment, together with the corresponding probe job (see the first line in the report in Figure 5-100, where the back-end storage subsystem is part of our Tivoli Storage Productivity Center environment and therefore the volume is correctly shown in all its details).
With this information and the list of volumes mapped to this computer, you can start to run a performance report to understand where the problem for this server might be.
Recommendations
When looking at disk performance problems, you need to check the overall response time as well as the overall I/O rate. If both are high, there might be a problem. If the overall response time is high but the I/O rate is trivial, the impact of the high overall response time might be inconsequential. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Then click Filter to produce a report for all the volumes belonging to Storwize V7000 subsystems, as shown in Figure 5-101.
Click the volume you need to investigate, click the icon and select Total I/O Rate (overall). Then click Ok to produce the graph, as shown in Figure 5-102.
The chart in Figure 5-103 shows that the I/O rate had been around 900 operations per second, suddenly declined to around 400 operations per second, and then returned to 900 operations per second. In this case study, we limited the days to the time frame in which the customer reported noticing the problem.
Select the Volumes tab again, click the volume you need to investigate, click the icon, and scroll down to select Overall Response Time. Then click Ok to produce the chart, as shown in Figure 5-104.
The chart in Figure 5-105 indicates an increase in response time from a few milliseconds to around 30 milliseconds. This information, combined with the high I/O rate, indicates a significant problem, and further investigation is appropriate.
The next step is to look at the performance of MDisks in the MDisk group. To identify to which Managed Disk the Virtual Disk tpcblade3-7-ko2 belongs, go back to Volumes tab and click the drill up icon, as shown in Figure 5-106.
Figure 5-107 shows the Managed Disks where tpcblade3-7-ko2 extents reside:
Select all the MDisks. Click the icon and select Overall Backend Response Time. Click Ok as shown in Figure 5-108.
Limit the charts generated to the time range relevant to this scenario, using the charting time range. You can see from the chart in Figure 5-109 that something happened around May 26 at 6:00 pm that probably caused the back-end response time for all MDisks to increase dramatically.
If you take a look at the chart of the Total Back-End I/O Rate for these two MDisks during the same time period, you will see that their I/O rates remained in a similar overlapping pattern, even after the introduction of the problem. This is as expected, because tpcblade3-7-ko2 is evenly striped across the two MDisks. The I/O rate for these MDisks is only as high as the slowest MDisk, as shown in Figure 5-110.
At this point, we have identified that the response time for all Managed Disks increased dramatically. The next step is to generate a report showing the volumes that have an overall I/O rate equal to or greater than 1000 ops/sec, and then generate a chart to show which of those volumes' I/O rates changed around 6:00 pm on May 26.
Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Click Display historic performance data using absolute time and limit the time period to 1 hour before and 1 hour after the event reported in Figure 5-109. Click Filter to limit the report to the Storwize V7000 subsystem, and Add a second filter to select Total I/O Rate (overall) greater than 1000 (that is, a high I/O rate). Click Ok, as shown in Figure 5-111.
The report in Figure 5-112 shows all the performance records of the volumes filtered above. In the Volume column there are only three volumes that meet these criteria: tpcblade3-7-ko2, tpcblade3-7-ko3, and tpcblade3-7-ko4. There are multiple rows for each, because there is a row for each performance data record. Look for the volumes whose I/O rate changed around 6:00 pm on May 26. You can click the Time column to sort.
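Instead of sorting by the Time column and scanning by eye, the same comparison can be automated: compute each volume's average I/O rate before and after the event time. The record layout below is hypothetical, not an actual TPC export format.

```python
def rate_change(records, event_time):
    """Per volume, average I/O rate before vs. after an event time.
    Times are compared lexically, e.g. '17:30' < '18:00'."""
    buckets = {}
    for r in records:
        side = "after" if r["time"] >= event_time else "before"
        buckets.setdefault(r["volume"], {"before": [], "after": []})[side].append(r["io_rate"])
    return {vol: (sum(b["before"]) / len(b["before"]),
                  sum(b["after"]) / len(b["after"]))
            for vol, b in buckets.items() if b["before"] and b["after"]}

# Hypothetical records around the 6:00 pm event in this case study.
records = [
    {"volume": "tpcblade3-7-ko2", "time": "17:30", "io_rate": 1000.0},
    {"volume": "tpcblade3-7-ko2", "time": "18:30", "io_rate": 450.0},
]
print(rate_change(records, "18:00"))  # → {'tpcblade3-7-ko2': (1000.0, 450.0)}
```

A large before/after gap flags exactly the volumes the next step of the case study compares by hand.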
Now we have to compare the Total I/O Rate (overall) metric for the above volumes with that of the volume that is the subject of the case study, tpcblade3-7-ko2. To do so, remove the filtering condition on the Total I/O Rate defined in Figure 5-111 and generate the report again. Then select one row for each of these volumes and select Total I/O Rate (overall). Then click Ok to generate the chart, as shown in Figure 5-113.
For Limit days From, insert the time frame we are investigating. Results: Figure 5-114 shows the root cause. Volume tpcblade3-7-ko2 (the blue line in the screen capture) started around 5:00 pm with a Total I/O Rate of around 1000 IOPS. When the new workloads (generated by tpcblade3-7-ko3 and tpcblade3-7-ko4) started together, the Total I/O Rate for volume tpcblade3-7-ko2 fell from around 1000 IOPS to less than 500, and then climbed back to about 1000 when one of the two loads decreased. The hardware has physical limitations on the number of IOPS that it can handle, and this limit was reached at 6:00 pm.
To confirm this behavior, you can generate a chart by selecting Response Time. The chart shown in Figure 5-115 confirms that as soon as the new workload started, the response time for tpcblade3-7-ko2 worsened.
The easy solution is to split this workload by moving one Virtual Disk to another Managed Disk Group.
5.7 Case study: Top volumes response time and I/O rate performance report
The default Top Volumes Response Performance Report can be useful for identifying performance problem areas. A long response time is not necessarily indicative of a problem: it is possible to have volumes with long response times but very low (trivial) I/O rates. Volumes that combine long response times with high I/O rates, however, might pose a performance problem to be investigated further. In this section we tailor the Top Volumes Response Performance Report to identify volumes with both long response times and high I/O rates. The report can be tailored for your environment; it is also possible to update your filters to exclude volumes or subsystems you no longer want in this report. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume as shown in Figure 5-116 and keep only the desired metrics as Included Columns, moving all the others to Available Columns. You can save this report for future reference under IBM Tivoli Storage Productivity Center → My Reports → (your user)'s Reports.
You have to specify filters to limit the report, as shown in Figure 5-117. Click Filter and then Add to define the conditions. In our example, we limit the report to subsystems SVC* and DS8* and to volumes that have an I/O rate greater than 100 ops/sec and a response time greater than 5 msec.
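The filter logic amounts to a simple AND across the two metrics. As a minimal illustration in Python (hypothetical record values, not the Tivoli Storage Productivity Center API):

```python
# Hypothetical volume performance records; real values come from the
# Tivoli Storage Productivity Center report, not from this structure.
records = [
    {"volume": "vol01", "io_rate": 250.0, "resp_ms": 12.3},  # busy AND slow
    {"volume": "vol02", "io_rate": 40.0, "resp_ms": 22.0},   # slow, but trivial I/O
    {"volume": "vol03", "io_rate": 900.0, "resp_ms": 2.1},   # busy, but fast
]

# Keep only volumes whose I/O rate exceeds 100 ops/sec AND whose
# response time exceeds 5 msec; both conditions must hold.
suspects = [r["volume"] for r in records if r["io_rate"] > 100 and r["resp_ms"] > 5]
```

Only vol01 survives: vol02 is slow but nearly idle, and vol03 is busy but responsive, which is exactly why the report combines both filters.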
Prior to generating the report, you need to specify the date and time of the period for which you want to make the inquiry. Important: Specifying large intervals might require intensive processing and a long time to complete. As shown in Figure 5-118, click Generate Report.
Figure 5-119 shows the resulting Volume list. Sorting by response time or by I/O Rate columns (by clicking the column header), you can easily identify which entries have both interesting total I/O Rates and Overall Response Times.
Recommendations
In a production environment, we suggest that you initially specify a Total I/O Rate (overall) somewhere between 1 and 100 ops/sec and an Overall Response Time (msec) greater than or equal to 15 msec, and then adjust those numbers to suit your needs as you gain more experience.
5.8 Case study: SVC and Storwize V7000 performance constraint alerts
Along with reporting on SVC and Storwize V7000 performance, Tivoli Storage Productivity Center can generate alerts when performance falls below, or exceeds, a defined threshold. Like most Tivoli Storage Productivity Center tasks, the alerting can report through these mechanisms:
SNMP: Enables you to send an SNMP trap to an upstream systems management application, where it can be correlated with other events occurring within the environment to help determine the root cause. For example, if the SVC or Storwize V7000 reported to Tivoli Storage Productivity Center that a fibre port went offline, it might in fact be because a switch has failed. This port-failed trap, together with the switch-offline trap, can be analyzed by a systems management tool and diagnosed as a switch problem, not an SVC (or Storwize V7000) problem, so that the switch technicians are called.
Tivoli Omnibus Event: Select to send a Tivoli Omnibus event.
Login Notification: Select to send the alert to a Tivoli Storage Productivity Center user. The user receives the alert upon logging in to Tivoli Storage Productivity Center. In the Login ID field, type the user ID.
UNIX or Windows NT system event logger.
Script: The script option enables you to run a predefined set of commands that can help address the event, for example, opening a trouble ticket in your help desk ticket system.
Email: Tivoli Storage Productivity Center sends an e-mail to each person listed.
Tip: Remember that for Tivoli Storage Productivity Center to be able to send email, an email relay must be identified in Administrative Services → Configuration → Alert Disposition, under the Email settings.
These are some useful alert events to set:
CPU utilization threshold: The CPU utilization alert notifies you when your SVC or Storwize V7000 nodes become too busy.
If this alert is generated too often, it might be time to upgrade your cluster with additional resources. Development recommends setting this threshold to 75% for warning and 90% for critical; these are the defaults that come with Tivoli Storage Productivity Center 4.2.1. To enable this function, create an alert selecting CPU Utilization, then define the alert actions to be performed. Next, on the Storage Subsystem tab, select the SVC or Storwize V7000 cluster for which to set this alert.
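The warning and critical levels act as a simple two-level classifier. A sketch of that logic (hypothetical helper, using the 4.2.1 default levels named above):

```python
# Hypothetical helper illustrating the two-level CPU utilization alert logic;
# 75% and 90% are the Tivoli Storage Productivity Center 4.2.1 defaults.
def cpu_alert(util_pct, warning=75.0, critical=90.0):
    if util_pct >= critical:
        return "critical"
    if util_pct >= warning:
        return "warning"
    return None  # below both thresholds: no alert raised

print(cpu_alert(82.0))  # warning
```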
Overall port response time threshold: The port response time alert can let you know when the SAN fabric is becoming a bottleneck. If the response times are consistently bad, you must perform additional analysis of your SAN fabric.
Overall back-end response time threshold: An increase in back-end response time might indicate that you are overloading your back-end storage. Because back-end response times can vary depending on the I/O workloads in place, capture 1 to 4 weeks of data to baseline your environment before setting this value, and then set the response time values. Because you can select the storage subsystem for this alert, you can set different alerts based upon the baselines you have captured. Our recommendation is to start with your mission-critical Tier 1 storage subsystems.
To create an alert, as shown in Figure 5-120, expand Disk Manager → Alerting → Storage Subsystem Alerts and right-click to create a Storage Subsystems Alert. On the right you get a pull-down menu where you can choose which alert you want to set.
Tip: The best place to verify which thresholds are currently enabled, and at what values, is at the beginning of a performance collection job. Expand Tivoli Storage Productivity Center → Job Management and, in the Schedule table, select the latest performance collection job that is running or has run for your subsystem. In the Job for Selected Schedule part of the panel (lower part), expand the corresponding job and select the instance, as shown in Figure 5-121.
Figure 5-121 Job management panel - SVC performance job log selection
By clicking the View Log File(s) button, you can access the corresponding log file, where you can see the thresholds defined, as shown in Figure 5-122. Tip: To go to the beginning of the log file, click the Top button.
Expand IBM Tivoli Storage Productivity Center → Alerting → Alert Log → Storage Subsystem to list all the alerts that have occurred. Look for your SVC subsystem, as shown in Figure 5-123.
By clicking the icon next to the alert you want to inquire about, you get detailed information, as shown in Figure 5-124.
In this case study, we compare the overall I/O rates of some IBM XIV volumes. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, select the IBM XIV subsystems, click the icon, and select Read I/O Rate (overall) and Write I/O Rate (overall), as shown in Figure 5-125.
Click Ok to generate the graph shown in Figure 5-126. From the chart, we see that this subsystem has a high read I/O rate but a very low write I/O rate, which means that it has a more read-intensive workload.
This type of information can be used, for example, to do performance tuning on the application, operating system, or storage subsystem side, and it can be a starting point for further analysis. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Top Active Volume Cache Hit Performance, then click the Selection tab to specify additional filter options. Click Filter in the upper right corner and add a filter at the Subsystem level, as shown in Figure 5-127.
Then click Generate Report to get the volume list. Click the icon and select Read Cache Hits Percentage (overall) and Write Cache Hits Percentage (overall). Click Ok to generate the chart shown in Figure 5-128. In this case study, we notice that the IBM XIV volume tpcblade3-7_cet_1 makes good use of cache during read activity, while the others have low Read Cache Hits percentages. This can depend on the type of workload or application.
After generating this report, you use the Topology Viewer to identify what device is being impacted and to identify a possible solution. Figure 5-130 shows the result we obtained in our lab.
Figure 5-130 Ports exceeding filters set for switch performance report
Click the icon and, holding down the Ctrl key, select Port Send Data Rate, Port Receive Data Rate, and Total Port Data Rate. Click Ok to generate the chart shown in Figure 5-131. Tip: This chart gives you an indication of how persistent the high utilization of this port is, which is an important consideration in establishing the importance and the impact of this bottleneck. Important: To get all the values in the selected interval, you have to remove the filters defined in Figure 5-129. The chart shows a consistent throughput higher than 300 MB/sec in the selected time period. You can change the dates by extending the Limit days.
To identify what device is connected to port 7 on this switch, expand IBM Tivoli Storage Productivity Center → Topology → Switches. Right-click, select Expand all Groups, and look for your switch, as shown in Figure 5-132.
Tip: To navigate in the Topology Viewer, press and hold the Alt key and press and hold the left mouse button to anchor your cursor. With these keys all held down, you can use the mouse to drag the screen to show what you need.
Find and click port 7. The line shows that it is connected to computer tpcblade3-7, as shown in Figure 5-133. Note that in the tabular view on the bottom, you can see Port details. If you scroll right, you can check Port speed, too.
Double-click this computer to highlight it. Click Datapath Explorer (see the DataPath Explorer shortcut highlighted in the minimap in Figure 5-133) to get a view of the paths between servers and storage subsystems, or between storage subsystems (for example, SVC to back-end storage, or server to storage subsystem). The view consists of three panels (host information, fabric information, and subsystem information) that show the path through a fabric or set of fabrics for the endpoint devices, as shown in Figure 5-134. Tip: A possible scenario utilizing Data Path Explorer is an application on a host that is running slowly. The system administrator wants to determine the health status of all associated I/O path components for this application: Are all components along that path healthy? Are there any component-level performance problems that might be causing the slow application response? Looking at the data paths for computer tpcblade3-7, we see that it has a single-port HBA connection to the SAN. A possible solution to improve the SAN performance for computer tpcblade3-7 is to upgrade it to a dual-port HBA.
5.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration
After Tivoli Storage Productivity Center has probed the SAN environment, it takes the information from all the SAN components (switches, storage controllers, and hosts) and automatically builds a graphical display of the SAN environment. This graphical display is available through the Topology Viewer option in the Tivoli Storage Productivity Center navigation tree. The information in the Topology Viewer panel is current as of the last successful probe. By default, Tivoli Storage Productivity Center probes the environment daily; however, you can execute an unplanned or immediate probe at any time. Tip: If you are analyzing the environment for problem determination, we recommend that you execute an ad hoc probe to ensure that you have the latest information about the SAN environment. Make sure that the probe completes successfully.
Figure 5-135 shows the SVC ports connected and the switch ports.
Important: Figure 5-135 shows an incorrect configuration for the SVC connections; it was implemented for lab purposes only. In real environments it is important that each SVC (or Storwize V7000) node port is connected to two separate fabrics. If any SVC (or Storwize V7000) node port is not connected, each node in the cluster displays an error on its LCD display. Tivoli Storage Productivity Center also shows the health of the cluster as a warning in the Topology Viewer, as shown in Figure 5-135. It is also important that you have at least one port from each node in each fabric, and that you have an equal number of ports in each fabric from each node; that is, do not have three ports in Fabric 1 and only one port in Fabric 2 for an SVC (or Storwize V7000) node.
Ports: In our example, the connected SVC ports are both online. When an SVC port is not healthy, a black line is drawn between the switch and the SVC node. Because Tivoli Storage Productivity Center knew from a previous probe where the unhealthy ports were connected (and, thus, they were previously shown with a green line), a later probe that discovered these ports were no longer connected resulted in the green line becoming a black line. If these ports had never been connected to the switch, no lines would be shown for them.
The Data Path Viewer in Tivoli Storage Productivity Center can also be used to check and confirm path connectivity between a disk that an operating system sees and the volume that the Storwize V7000 provides.
Figure 5-139 shows the path information relating to host tpcblade3-11 and its volumes. What Figure 5-139 cannot show is that you can also hover over each component to get health and performance information, which might be useful when you perform problem determination and analysis.
Chapter 6.
From the foregoing list, you can see that you can view reports by Disk Space, Filesystem Space, Consumed Filesystem Space, and Available Filesystem Space. Each of these categories can then be broken down further. If you click Available Filesystem Space → By Computer and then click the Generate Report button in the right-hand window, you are presented with a window as shown in Figure 6-2.
You can then highlight the computer you want to monitor, or select all. You select all by selecting the top computer and then, while holding the Shift key down, left-clicking the bottom computer. After that is done, when you click the graph symbol, you are presented with another window that allows you to select which graphical report you want to view. This can be seen in Figure 6-3.
After selecting History Chart Used Space for the selected computers, the graph shown in Figure 6-4 is provided.
As you can see from the graph, ours is not a very dynamic environment; this is not typical of a full production configuration. The dashed (---) lines on the right show the expected trend for the near future. At first glance, the lines appear to be constant, but clicking each of the data collection points shows the values in that collection, and our data is changing slightly. You can use this procedure daily to see your trends for usage growth or free space.
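The dashed trend lines are in essence a linear projection of the collected samples. A sketch of that idea (illustrative only; this is not Tivoli Storage Productivity Center's actual forecasting code, and the sample values are hypothetical):

```python
# Fit used-space samples with a least-squares line and project it forward.
def linear_trend(samples):
    """samples: list of (day, used_gb); returns (slope_per_day, intercept)."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical daily collections: used space growing ~2 GB/day.
history = [(0, 100.0), (1, 102.0), (2, 104.0), (3, 106.0)]
slope, intercept = linear_trend(history)
forecast_day_30 = intercept + slope * 30  # projected used space on day 30
```

A steadily positive slope is the signal to watch: it tells you roughly when free space will run out at the current growth rate.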
SAN Storage Performance Management Using Tivoli Storage Productivity Center
One of the managed disk groups in Storwize V7000 is mdiskgrp1. Click the drill down icon on the row mdiskgrp1. A new tab is created, containing the Managed Disks within this Managed Disk Group. See Figure 6-6.
Figure 6-6 Drill down from managed disk group performance report
From here, click the drill down icon again to get to the Virtual Disks that reside on the Managed Disk. See Figure 6-7.
I/O Rate
We recommend that you start by analyzing how your workload is split between Managed Disk Groups, to understand whether it is well balanced. Click the Managed Disk Groups tab, select all Managed Disk Groups for the Storwize V7000, click the icon, and select Total Backend I/O Rate, as shown in Figure 6-8.
Figure 6-8 Storwize V7000 Managed Disk Group I/O rate selection
Click Ok to generate the chart shown in Figure 6-9. The back-end workload is not equally distributed: mdiskgrp2 is much less used than the other Managed Disk Groups, confirming an unbalanced workload distribution. This does not necessarily mean that a problem occurred, because there can be different back-end storage subsystems with different technologies and sizes, and therefore different workloads for the Managed Disk Groups. If this is a problem, you can look at moving some of the Virtual Disks into other Managed Disk Groups to balance the workload.
Figure 6-9 Storwize V7000 Managed Disk Group I/O rate report
For further details about Rules of Thumb and how to interpret these values, see 3.2.2, Rules of Thumb on page 59.
As you can see here, the workload is balanced across the Managed Disks. This generally happens when the Managed Disks in a Managed Disk Group are of the same size, so they sustain the same data rate. Figure 6-12 confirms this: the Storwize Element Manager shows that both MDisks mdisk0 and mdisk1 in Managed Disk Group cognos are of the same size.
Figure 6-13 represents an example of a poorly balanced data rate, in this case between Managed Disks mdisk61 and mdisk91 in Managed Disk Group CET_DS8K1901 on the SVC subsystem:
Looking at the SVC Element Manager, we can see that the two MDisks are not of the same size, which is most probably the reason for the poorly balanced configuration. See Figure 6-14.
For further details about SVC and Storwize V7000 Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
I/O Groups
For SVCs with multiple I/O Groups, a separate row is generated for every I/O Group within each SVC cluster. For capacity planning at the I/O Group level, monitor each node, the CPU utilization of those nodes, and the cache hit rates pertaining to those nodes, to determine whether the current configuration is sufficiently sized for the workload you currently have or are growing into. In our lab environment, data was collected from one SVC, which has only a single I/O Group (and from a Storwize V7000, which cannot have more than one I/O Group). The scroll bar at the bottom of the table indicates that additional metrics can be viewed, as shown in Figure 6-15.
Important: The data displayed in this performance report is the last collected value at the time the report is generated; it is not an average over the last hours or days.
Click the Drill Down button next to SVC io_grp0 entry to drill down and view the statistics by nodes within the selected I/O Group. Notice that a new tab, Drill down from io_grp0, is created containing the report for nodes within the SVC I/O Group (Figure 6-16).
To view a historical chart of one or more specific metrics for the resources, you can click the icon and select the metrics of interest. You can select one or more metrics that use the same measurement unit. If you select metrics that use different measurement units, you will receive an error message. Note: If you want to create graphs including metrics with different measurement units, you have to use TPCTOOL. See Appendix C., Reporting with Tivoli Storage Productivity Center on page 365.
A consistently high CPU utilization rate indicates a busy node in the cluster. If the CPU utilization remains high, it might be time to grow the cluster by adding more resources, or to migrate Virtual Disks to another I/O Group or SVC cluster. You can add cluster resources by adding another I/O Group (two nodes) to the cluster, up to the maximum of four I/O Groups per cluster (SVC only); alternatively, you might replace old nodes with new ones. If the cluster is already composed of four I/O Groups and CPU utilization is still high, it is time to build a new cluster and consider either migrating some storage to the new cluster or servicing new storage requests from it. Tip: We recommend that you plan additional resources for the cluster if your CPU utilization indicates a workload continually above 70%.
Total Cache Hit percentage is the percentage of reads and writes that are handled by the cache without needing immediate access to the back-end disk arrays. Read Cache Hit percentage focuses on Reads, because Writes are almost always recorded as cache hits. The Read and Write Transfer Sizes are the average number of bytes transferred per I/O operation.
To look at the read cache hits percentage by node for Storwize V7000 nodes, select the Storwize V7000 nodes, click the icon, and select Read Cache Hits Percentage (overall). Then click Ok to generate the chart, as shown in Figure 6-19.
Figure 6-19 Storwize V7000 Read Cache hits percentage - per node
Read Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data, than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations. It is not possible to increase the size of the cache in a particular SVC (or Storwize V7000) node. Therefore if you have a cache problem, it is important that you understand how the cache works and the implications of the structure at the back-end.
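As a small worked example of these ratios (hypothetical counter values, not data from our lab):

```python
# Overall cache hit percentage combines read and write hits over a sample interval.
def overall_hit_pct(read_hits, reads, write_hits, writes):
    total = reads + writes
    return 100.0 * (read_hits + write_hits) / total if total else 0.0

# A database-like workload: only 300 of 1000 reads hit cache (a 30% read hit
# ratio, low but common for databases), while 990 of 1000 writes are absorbed
# by cache, as is typical for writes.
pct = overall_hit_pct(300, 1000, 990, 1000)  # 64.5 overall
```

Note how the healthy-looking overall figure masks the poor read hit ratio, which is why Read Cache Hits Percentage is usually the more telling metric.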
If you need to analyze cache performance metrics further, to understand whether cache is sufficient for your workload, you can run multiple-metric charts. Select all the metrics whose names include "percentage"; metrics with the same unit type can be combined in one chart, as shown in Figure 6-20, where two percentage metrics are selected for a report on SVC1 node1.
In our example we compare the reports in the same time frame for SVC1 node1 and node2, selecting one node for each report. See Figure 6-21 and Figure 6-22.
We notice in Figure 6-21 on page 320 that node1 shows a high Write Cache Delay Percentage (almost 80%), a Write Cache Hits Percentage of almost 0, and a drop in Read Cache Hits Percentage. These values, together with an increase in back-end response time, show that node1 is heavily burdened with I/O; in this time interval, the SVC cache is probably full of outstanding write I/Os. Host I/O activity is now impacted by the backlog of data in the SVC cache, as is any other SVC workload going to the same Managed Disk Group. Figure 6-22 shows a completely different situation for node2, because there is no traffic stressing that node. Therefore, the two figures show a very poorly balanced configuration for SVC1.
6.2.3 Fabric
Monitoring your fabric environment is important so that you know how much data is transferring across your SAN. Tivoli Storage Productivity Center provides performance information by port across all monitored switches. When you have Inter-Switch Link (ISL) traffic between switches in the same fabric, it is critical to monitor the ports that carry it (named E_Ports), so that you have sufficient bandwidth to satisfy your application response time. For capacity planning, it is especially important that sufficient bandwidth is available in a Copy Services environment when you are mirroring between subsystems. You can identify the E_Ports by looking at the Fabric Topology view. In the tabular view of a switch, select the Switch Port tab, so you can see the port type in one of the displayed columns. As shown in Figure 6-23, you can see the port types for the jumbo switch. As you can see, there is an E_Port (port 12 in slot 7, index 76) connected to switch l3bumper.
Known limitation: At this stage, the monitoring of E_Ports is a two-step process. You need to identify the relevant E_Port number(s) and then use the Selection option in a report by port.
Click Ok, then Generate Report. You can then use this report to keep track of your E_Port performance. Figure 6-25 shows an example of a report for the Send and Received Bandwidth Percentage metrics, in which peaks are present in Send Bandwidth Percentage (in a production environment, the 80% peak should trigger an alert).
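Bandwidth Percentage simply relates the measured port data rate to the link's nominal capacity. A rough sketch, assuming the common rule of thumb of roughly 100 MB/s of payload per Gbps of Fibre Channel link speed (an approximation, not Tivoli Storage Productivity Center's exact formula):

```python
# Approximate percent utilization of a Fibre Channel port.
def bandwidth_pct(data_rate_mb_s, link_speed_gbps):
    capacity_mb_s = link_speed_gbps * 100.0  # ~100 MB/s of payload per Gbps
    return 100.0 * data_rate_mb_s / capacity_mb_s

pct = bandwidth_pct(320.0, 4)  # 320 MB/s on a 4 Gbps ISL -> 80 percent
```

At sustained utilization in this range, the ISL has little headroom left for bursts, which is why such a peak warrants an alert.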
Port throughput
You also need to monitor individual port throughput to ensure that your application has sufficient bandwidth available. If the switch or HBA ports are a bottleneck, additional ports or HBAs must be installed. You also need to install a multipath driver to be able to use the extra paths.
Appendix A.
Appendix B.
Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
This appendix contains a list of performance metrics and thresholds for IBM Tivoli Storage Productivity Center performance reports, with explanations of their meanings.
Counters
The counters in the firmware are usually unsigned 32-bit or 64-bit counters. Eventually, these counters wrap, meaning that the difference between the counters at T2 and T1 might be difficult to interpret. The Tivoli Storage Productivity Center Performance Manager adjusts for these wraps during its delta computations and stores the deltas in the database. Certain counters are also stored in the Tivoli Storage Productivity Center database, but the performance data is mostly comprised of rates and other calculated metrics that depend on the counter deltas and the sample interval, that is, the time between T1 and T2.
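The wrap adjustment and the subsequent rate calculation can be sketched as follows (illustrative only, not the actual Performance Manager code):

```python
# Modular subtraction yields the correct delta even when an unsigned
# 32-bit counter wrapped between samples T1 and T2.
def counter_delta(v1, v2, width_bits=32):
    return (v2 - v1) % (1 << width_bits)

def io_rate(v1, v2, interval_s, width_bits=32):
    """Operations per second over the sample interval (T2 - T1)."""
    return counter_delta(v1, v2, width_bits) / interval_s

# The counter was 10 below the 32-bit limit at T1 and reads 90 at T2:
delta = counter_delta(2**32 - 10, 90)  # 100 operations, not a huge negative number
ios = io_rate(2**32 - 10, 90, 300)     # rate over a 5-minute sample interval
```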
Essential metrics
The primary and essential performance metrics are few and simple: for example, Read I/O Rate, Write I/O Rate, Read Response Time, and Write Response Time. Also important are data rates and transfer sizes. Then come the cache behaviors, in the form of Read Hit Ratio and Write Cache Delays (percentages and rates). There are a myriad of additional metrics in the Tivoli Storage Productivity Center performance reports, but they need to be used as adjuncts to the primary metrics, sometimes helping you to understand why the primary metrics have the values they have. A very few metrics measure other kinds of values. For example, SVC and Storwize V7000 storage subsystems also report the maximum read and write response times that occurred between times T1 and T2; each time a sample of the counters is pulled, this type of counter is set back to zero. But the vast majority of counters are monotonically increasing, reset to zero only in very particular circumstances, such as hardware, software, or firmware resets. The design of the Tivoli Storage Productivity Center Performance Manager allows several storage subsystems to be included in a report (or individual subsystems by selection or filtering). But not all the metrics apply to every subsystem or component; in these cases, a -1 appears, indicating that no data is expected for the metric in this particular case.
In the remainder of this section, we look at the metrics that can be selected for each report. We examine the reports in the order in which they appear in the Tivoli Storage Productivity Center Navigation Tree.
New FC port performance metrics and thresholds in Tivoli Storage Productivity Center 4.2.1 release
The Performance Manager component in IBM Tivoli Storage Productivity Center collects, reports, and alerts users on various performance metrics for a variety of SAN devices. One request by customers is to provide more information regarding FC link problems in their SAN environment, particularly related to their DS8000 and SVC (or Storwize V7000) ports.
Metrics
Numerous metrics are already collected for DS8000, SVC, and Storwize V7000 ports; however, those pertaining to error counts were previously not tracked or reported by Tivoli Storage Productivity Center. For consistency, switch port counters that match counters for DS8000, SVC, or Storwize V7000 ports but were not previously exposed as metrics must be displayed in reports as well. The following error counters are provided by this work item:
Error frame rate for DS8000 ports: The number of frames per second that violated Fibre Channel protocol for a particular port over a particular time interval.
Link failure rate for SVC, Storwize V7000, and DS8000 ports: The average number of miscellaneous Fibre Channel link errors per second, such as an unexpected NOS received or a link state machine failure detected, experienced by a particular port over a particular time interval.
Loss-of-synchronization rate for SVC, Storwize V7000, and DS8000 ports: The average number of loss-of-synchronization errors per second, where there is a confirmed and persistent synchronization loss on the Fibre Channel link, for a particular port over a particular time interval.
Loss-of-signal rate for SVC, Storwize V7000, and DS8000 ports: The average number of times per second that a loss of signal was detected on the Fibre Channel link when a signal was previously detected, for a particular port over a particular time interval.
Invalid CRC rate for SVC, Storwize V7000, and DS8000 ports: The average number of frames received per second in which the CRC in the frame did not match the CRC computed by the receiver, for a particular component over a particular period of time.
Primitive Sequence protocol error rate for SVC, Storwize V7000, DS8000, and Switch ports: The average number of primitive sequence protocol errors per second, where an unexpected primitive sequence was received on a particular port over a particular time interval.
Invalid transmission word rate for SVC, Storwize V7000, DS8000, and Switch ports: The average number of times per second that bit errors were detected on a particular port over a particular time interval.
Zero buffer-buffer credit timer for SVC and Storwize V7000 ports: The number of microseconds for which the port has been unable to send frames due to lack of buffer credit since the last node reset.
334
Link Recovery (LR) sent rate for DS8000 and Switch ports
  The average number of times per second that a port transitioned from an active (AC) state to a Link Recovery (LR1) state over a particular time interval. Note: This is believed to be the same as Link Reset transmitted.

Link Recovery (LR) received rate for DS8000 and Switch ports
  The average number of times per second that a port transitioned from an active (AC) state to a Link Recovery (LR2) state over a particular time interval. Note: This is believed to be the same as Link Reset received.

Out of order data rate for DS8000 ports
  The average number of times per second that an out-of-order frame was detected for a particular port over a particular time interval.

Out of order ACK rate for DS8000 ports
  The average number of times per second that an out-of-order ACK frame was detected for a particular port over a particular time interval.

Duplicate frame rate for DS8000 ports
  The average number of times per second that a frame was received that had already been detected as processed, for a particular port over a particular time interval.

Invalid relative offset rate for DS8000 ports
  The average number of times per second that a frame was received with a bad relative offset in the frame header, for a particular port over a particular time interval.

Sequence timeout rate for DS8000 ports
  The average number of times per second that the port detected a timeout condition on receiving sequence initiative for a Fibre Channel exchange, for a particular port over a particular time interval.

Note: Bit error rate for DS8000 ports will not be supported. The metric is very similar to the invalid transmission word rate, which will be supported, and its 5-minute counting window makes the counter unreliable for collection intervals longer than 5 minutes.
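All of the rate metrics above share one derivation: the device exposes a cumulative error counter, and the reported metric is the counter delta between two collections divided by the interval length. A hedged sketch of that computation (the function name and wrap handling are illustrative):

```python
def error_rate_per_second(prev_count, curr_count, interval_seconds):
    """Average errors per second over one sample interval, derived from
    two readings of a cumulative error counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:  # counter wrapped or the device was reset
        delta = curr_count
    return delta / interval_seconds
```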
Thresholds
While it is preferable to be able to define thresholds for each of the new metrics being introduced, the following thresholds are currently deemed the most important to include at this time:
- Error (illegal) frame rate for DS8000 ports
- Link failure rate for SVC, Storwize V7000, and DS8000 ports
- Invalid CRC rate for SVC, Storwize V7000, DS8000, and Switch ports
- Invalid transmission word rate for SVC, Storwize V7000, DS8000, and Switch ports
- Zero buffer-buffer credit timer for SVC and Storwize V7000 ports
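Threshold support for these metrics amounts to comparing each collected sample against configured boundary values and raising an alert on violation. A simplified sketch (the two-level warning/critical scheme here is illustrative, not the product's exact alerting model):

```python
def classify_sample(value, warning_level, critical_level):
    """Classify one metric sample against warning/critical boundaries."""
    if value >= critical_level:
        return "critical"
    if value >= warning_level:
        return "warning"
    return "normal"
```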
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Common columns
Table B-1 contains information about the columns that are common among performance reports.
Table B-1 Common columns

Time
  Description: Date and time that the data was collected.

Interval
  Description: Size of the sample interval, in seconds. You can specify a minimum interval length of five minutes and a maximum interval length of sixty minutes for the following models:
  - Enterprise Storage Server
  - DS6000
  - DS8000
  - XIV storage system

  For SAN Volume Controller models earlier than V4.1, you can specify a minimum interval length of 15 minutes and a maximum interval length of 60 minutes. For SAN Volume Controller models V4.1 and later and for Storwize V7000, you can specify a minimum interval length of 5 minutes and a maximum interval length of 60 minutes.
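The interval rules above can be summarized as a small lookup. A sketch under the assumption that firmware versions are compared as tuples (the helper function is illustrative, not a product API):

```python
def interval_limits_minutes(device, svc_version=None):
    """Return the (minimum, maximum) sample-interval length in minutes
    for a device family, per the rules described in the text."""
    if device in ("Enterprise Storage Server", "DS6000", "DS8000", "XIV"):
        return (5, 60)
    if device == "SVC":
        # SAN Volume Controller earlier than V4.1 requires a 15-minute
        # minimum interval; V4.1 and later allow 5 minutes.
        if svc_version is not None and svc_version < (4, 1):
            return (15, 60)
        return (5, 60)
    if device == "Storwize V7000":
        return (5, 60)
    raise ValueError(f"unknown device family: {device}")
```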
Note: When you view metrics for the ESS and DS series of storage systems, you must take into account the following differences between Tivoli Storage Productivity Center reports and the native reports of those systems:
- Tivoli Storage Productivity Center reports display port performance metrics as send and receive metrics (for example, Send Data Rate and Receive Data Rate). Storage system native reports (for example, reports based on data collected by the DS CLI) display port performance metrics as read and write metrics (for example, Byteread and Bytewrite).
- When a host performs a read operation, the DS port sends data to the host. Therefore, "read" metrics in DS reports correspond to "send" metrics in Tivoli Storage Productivity Center reports.
- When a host performs a write operation, DS ports receive data from the host. Therefore, "write" metrics in DS reports correspond to "receive" metrics in Tivoli Storage Productivity Center reports.

When you view port Peer-to-Peer Remote Copy (PPRC) performance metrics, you must take into account the following additional differences between Tivoli Storage Productivity Center reports and native reports for storage systems:
- Metrics for PPRC reads in storage system native reports are represented as PPRC receives in Tivoli Storage Productivity Center (reads = receives).
- Metrics for PPRC writes in storage system native reports are represented as PPRC sends in Tivoli Storage Productivity Center (writes = sends).
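The correspondence in the note reduces to a fixed lookup from host-relative DS counter names to port-relative Tivoli Storage Productivity Center metric names. A sketch (the PPRC counter names here are illustrative placeholders, not exact DS CLI names):

```python
# Host-relative DS counters -> port-relative TPC metrics:
# a host read makes the port send data; a host write makes it receive.
DS_TO_TPC_METRIC = {
    "Byteread": "Send Data Rate",
    "Bytewrite": "Receive Data Rate",
    "PPRC read": "PPRC Receive",   # illustrative counter name
    "PPRC write": "PPRC Send",     # illustrative counter name
}
```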
The following notation is used in the metric tables:
- XIV1 is displayed next to metrics that are available in XIV storage system version 10.2.2 or later.
- XIV2 is displayed next to metrics that are available in XIV storage system version 10.2.4 or later.
For example: The Read I/O Rate (overall) metric is available for XIV storage systems version 10.2.2 and later. In the Devices: components column of the list of metrics, the entry for Read I/O Rate (overall) is displayed like this: XIV1 : volume, module, subsystem The Small Transfers Response Time metric is available for XIV storage systems version 10.2.4 and later. In the Devices: components column of the list of metrics, the entry for Small Transfers Response Time is displayed like this: XIV2 : volume, module, subsystem
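The XIV1/XIV2 notation is effectively a minimum-firmware gate on each metric. A sketch of the check, assuming version strings are parsed into integer tuples (the helper is illustrative):

```python
def xiv_metric_available(marker, xiv_version):
    """True when the XIV firmware level supports a metric tagged with
    the XIV1 (>= 10.2.2) or XIV2 (>= 10.2.4) marker."""
    minimum = {"XIV1": (10, 2, 2), "XIV2": (10, 2, 4)}[marker]
    return tuple(xiv_version) >= minimum
```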
Volume-based metrics
Table B-2 contains information about volume-based metrics. Note: Tivoli Storage Productivity Center does not calculate volume-based metrics if there are space efficient volumes allocated in an extent pool consisting of multiple ranks. In this case, the columns for volume-based metrics display the value N/A in the Storage Subsystem Performance By Array report for the arrays associated with that extent pool. However, if there are no space efficient volumes allocated in a multi-rank extent pool, or if the space efficient volumes are allocated in an extent pool consisting of a single rank, then this limitation does not apply and all volume-based metrics are displayed in the By Array report.
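The rule in the note can be stated as a single predicate: volume-based metrics are suppressed for an array only when its extent pool both spans multiple ranks and contains space-efficient volumes. A sketch (the function name is illustrative):

```python
def volume_metrics_reported(ranks_in_extent_pool, has_space_efficient_volumes):
    """False means the Storage Subsystem Performance By Array report
    shows N/A for the arrays associated with this extent pool."""
    return not (ranks_in_extent_pool > 1 and has_space_efficient_volumes)
```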
Table B-2 Volume-based metrics (Column / Devices: components / Description)

I/O Rates

Read I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential read operations for a component over a specified time interval.

Read I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential read operations for a component over a specified time interval.

Read I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential read operations for a component over a specified time interval.
Write I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential write operations for a component over a specified time interval.

Write I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential write operations for a component over a specified time interval.

Write I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential write operations for a component over a specified time interval.

Total I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential read and write operations for a component over a specified time interval.

Total I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential read and write operations for a component over a specified time interval.

Total I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential read and write operations for a component over a specified time interval.
Global Mirror Write I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of write operations per second issued to the Global Mirror secondary site for a component over a specified time interval.

Global Mirror Overlapping Write Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average percentage of write operations issued by the Global Mirror primary site which were serialized overlapping writes for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes were serialized.

Global Mirror Overlapping Write I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of serialized overlapping write operations per second encountered by the Global Mirror primary site for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes are serialized.

HPF Read I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of read operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

HPF Write I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of write operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

Total HPF I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of read and write operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

HPF I/O Percentage
  Devices: DS8000: volume, array, controller, subsystem
  Description: The percentage of all I/O operations that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

PPRC Transfer Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of track transfer operations per second for Peer-to-Peer Remote Copy (PPRC) usage for a component over a specified time interval. This metric shows the activity for the source of the PPRC relationship, but shows no activity for the target.

Small Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are <= 8 KB.

Medium Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 512 KB.

Cache hit percentages

Read Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential read operations for a component over a specified time interval.

Read Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential read operations for a component over a specified time interval.

Read Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential read operations for a component over a specified time interval.
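The transfer-size percentage metrics above partition I/Os into four buckets by data transfer size; the boundaries can be captured directly (a sketch, not product code):

```python
def transfer_size_bucket(transfer_kb):
    """Bucket an I/O by its data transfer size, using the boundaries
    from the Small/Medium/Large/Very Large Transfers metrics."""
    if transfer_kb <= 8:
        return "small"
    if transfer_kb <= 64:
        return "medium"
    if transfer_kb <= 512:
        return "large"
    return "very large"
```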
Write Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential write operations for a component over a specified time interval.

Write Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential write operations for a component over a specified time interval.
Write Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential write operations for a component over a specified time interval.
Total Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential read and write operations for a component over a specified time interval.

Total Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential read and write operations for a component over a specified time interval.

Total Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential read and write operations for a component over a specified time interval.
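Each cache-hit percentage is the hit count divided by the corresponding operation count for the interval; for the overall variants, the read and write counts are summed first. A sketch of the overall calculation (the function is illustrative):

```python
def overall_cache_hit_percentage(read_hits, write_hits, read_ios, write_ios):
    """Overall cache-hit percentage: combined hits over combined I/Os."""
    total_ios = read_ios + write_ios
    if total_ios == 0:
        return 0.0
    return 100.0 * (read_hits + write_hits) / total_ios
```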
Readahead Percentage of Cache Hits
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of all read cache hits which occurred on prestaged data.

Dirty Write Percentage of Cache Hits
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of all write cache hits which occurred on already dirty data in the cache.

Read Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of read data that was read from the cache over a specified time interval.

Write Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of write data that was written to the cache over a specified time interval.

Total Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of all data that was read from or written to the cache for a component over a specified time interval.

Data rates

Read Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for read operations for a component over a specified time interval.
Write Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for write operations for a component over a specified time interval.

Total Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for read and write operations for a component over a specified time interval.
Small Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are <= 8 KB.

Medium Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 512 KB.

Response times

Read Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each read operation for a component over a specified time interval.
Write Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each write operation for a component over a specified time interval.

Overall Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each I/O operation (read and write) for a component over a specified time interval.
Peak Read Response Time
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The peak (worst) response time among all read operations.

Peak Write Response Time
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The peak (worst) response time among all write operations.

Global Mirror Write Secondary Lag
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The average number of additional milliseconds it takes to service each secondary write operation for Global Mirror, over and above the time that is required to service primary writes.

Overall Host Attributed Response Time Percentage
  Description: The percentage of the average response time, both read response time and write response time, that can be attributed to delays from host systems. This metric is provided to help diagnose slow hosts and poorly performing fabrics. The value is based on the time taken for hosts to respond to transfer-ready notifications from the SVC nodes (for read) and the time taken for hosts to send the write data after the node has responded to a transfer-ready notification (for write).

Read Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each read cache hit operation over a specified time interval.

Write Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each write cache hit operation over a specified time interval.

Overall Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each read cache hit operation and each write cache hit operation over a specified time interval.

Read Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each read cache miss operation over a specified time interval.
Write Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each write cache miss operation over a specified time interval.

Overall Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each read cache miss operation and each write cache miss operation over a specified time interval.

Small Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are <= 8 KB.

Medium Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 512 KB.

Transfer sizes

Read Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for read operations for a component over a specified time interval.

Write Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for write operations for a component over a specified time interval.

Overall Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for read and write operations for a component over a specified time interval.
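A transfer-size column is simply the ratio of the data-rate and I/O-rate columns for the same component and interval (with 1 MB = 2^20 bytes = 1024 KB). A sketch of the derivation (the function is illustrative):

```python
def average_transfer_size_kb(data_rate_mb_per_sec, io_rate_per_sec):
    """Average KB per I/O, derived from a data rate in MB/s (1 MB =
    1024 KB) and an I/O rate in operations per second."""
    if io_rate_per_sec == 0:
        return 0.0
    return data_rate_mb_per_sec * 1024 / io_rate_per_sec
```

For example, 100 MB/s at 12,800 I/Os per second corresponds to an average transfer size of 8 KB.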
Write-cache constraints

Write-cache Delay Percentage
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of I/O operations that were delayed due to write-cache space constraints or other conditions for a component over a specified time interval. (The ratio of delayed operations to total I/Os.)

Write-cache Delayed I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations per second that were delayed due to write-cache space constraints or other conditions for a component over a specified time interval.

Write-cache Overflow Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were delayed due to lack of write-cache space for a component over a specified time interval.

Write-cache Overflow I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were delayed due to lack of write-cache space for a component over a specified time interval.

Write-cache Flush-through Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were processed in Flush-through write mode for a component over a specified time interval.

Write-cache Flush-through I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were processed in Flush-through write mode for a component over a specified time interval.

Write-cache Write-through Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were processed in Write-through write mode for a component over a specified time interval.

Write-cache Write-through I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were processed in Write-through write mode for a component over a specified time interval.

Record mode reads

Record Mode Read I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller
  Description: Average number of I/O operations per second for record mode read operations for a component over a specified time interval.

Record Mode Read Cache %
  Devices: ESS/DS6000/DS8000: volume, array, controller
  Description: Percentage of cache hits for record mode read operations for a component over a specified time interval.

Cache transfers

Disk to Cache I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations (track transfers) per second for disk to cache transfers for a component over a specified time interval.

Cache to Disk I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations (track transfers) per second for cache to disk transfers for a component over a specified time interval.

Miscellaneous computed values
Cache Holding Time
  Devices: ESS/DS6000/DS8000: controller, subsystem
  Description: Average cache holding time, in seconds, for I/O data in this subsystem controller (cluster). Shorter time periods indicate adverse performance.

CPU Utilization
  Devices: SVC, Storwize V7000: node, I/O group, subsystem
  Description: Average utilization percentage of the processors.

Non-Preferred Node Usage Percentage
  Devices: SVC, Storwize V7000: volume, I/O group
  Description: The overall percentage of I/O performed or data transferred by the non-preferred nodes of the volumes, for a component over a specified time interval.

Volume Utilization
  Devices: ESS/DS6000/DS8000: volume; SVC, Storwize V7000: volume; XIV1: volume
  Description: The approximate utilization percentage of a volume over a specified time interval (the average percent of time that the volume was busy).
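Utilization columns such as Volume Utilization and the back-end Disk Utilization Percentage express busy time as a fraction of the sample interval. A sketch of that approximation (illustrative, not the product's exact formula):

```python
def utilization_percentage(busy_ms, interval_ms):
    """Approximate utilization: the percent of the interval during
    which the component was busy, capped at 100."""
    return 100.0 * min(busy_ms, interval_ms) / interval_ms
```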
Back-end-based metrics
Table B-3 contains information about back-end-based metrics.
Table B-3 Back-end-based metrics (Column / Devices: components / Description)

I/O rates

Back-End Read I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for read operations.

Back-End Write I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for write operations.

Total Back-End I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for read and write operations.

Data rates

Back-End Read Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for read operations.
Back-End Write Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for write operations.

Total Back-End Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for read and write operations.
Response times

Back-End Read Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each read operation. For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Back-End Write Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each write operation. For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Overall Back-End Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each I/O operation (read and write). For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Back-End Read Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that each read operation spent on the queue before being issued to the back-end device.

Back-End Write Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that each write operation spent on the queue before being issued to the back-end device.

Overall Back-End Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that read and write operations spent on the queue before being issued to the back-end device.

Peak Back-End Read Response Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The peak (worst) response time among all read operations for a component over a specified time interval. For SAN Volume Controller, it represents the external response time of the MDisks.

Peak Back-End Write Response Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The peak (worst) response time among all write operations for a component over a specified time interval. For SAN Volume Controller, it represents the external response time of the MDisks.
Peak Back-End Read Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The lower bound on the peak (worst) queue time for read operations for a component over a specified time interval. The queue time is the amount of time that the read operation spent on the queue before being issued to the back-end device.

Peak Back-End Write Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The lower bound on the peak (worst) queue time for write operations for a component over a specified time interval. The queue time is the amount of time that the write operation spent on the queue before being issued to the back-end device.

Transfer sizes

Back-End Read Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for read operations for a component over a specified time interval.

Back-End Write Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for write operations for a component over a specified time interval.

Overall Back-End Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for read and write operations for a component over a specified time interval.

Disk utilization

Disk Utilization Percentage
  Devices: ESS/DS6000/DS8000: array
  Description: The approximate utilization percentage of a rank over a specified time interval (the average percent of time that the disks associated with the array were busy). Note: Tivoli Storage Productivity Center does not calculate a value for this column if there are multiple ranks in the extent pool where the space-efficient volumes are allocated. This column displays a value of N/A for the reports in which it appears. However, if there is only a single rank in the extent pool, Tivoli Storage Productivity Center does calculate the value for this column regardless of the space-efficient volumes.

Sequential I/O Percentage
  Devices: ESS/DS6000/DS8000: array
  Description: Percentage of all I/O operations performed for an array over a specified time interval that were sequential operations.
Port to Disk Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to storage subsystems by a component over a specified time interval.

Port to Disk Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from storage subsystems by a component over a specified time interval.

Total Port to Disk I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between storage subsystems and a component over a specified time interval.

Port to Local Node Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to other nodes in the local SAN Volume Controller cluster by a component over a specified time interval.

Port to Local Node Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from other nodes in the local SAN Volume Controller cluster by a component over a specified time interval.

Total Port to Local Node I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between other nodes in the local SAN Volume Controller cluster and a component over a specified time interval.

Port to Remote Node Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to nodes in the remote SAN Volume Controller cluster by a component over a specified time interval.

Port to Remote Node Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from nodes in the remote SAN Volume Controller cluster.

Total Port to Remote Node I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between nodes in the remote SAN Volume Controller cluster and a component over a specified time interval.

Port FCP Send I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second using the FCP protocol, for a port over a specified time interval.

Port FCP Receive I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of receive operations per second using the FCP protocol for a port over a specified time interval.

Total Port FCP I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send and receive operations per second using the FCP protocol for a port over a specified time interval.

Port FICON Send I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second using the FICON protocol for a port over a specified time interval.

Port FICON Receive I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of receive operations per second using the FICON protocol for a port over a specified time interval.

Total Port FICON I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send and receive operations per second using the FICON protocol for a port over a specified time interval.

Port PPRC Send I/O Rate
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Port PPRC Receive I/O Rate Total Port PPRC I/O Rate Data rates Port Send Data Rate
Average number of receive operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of send and receive operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for send (read) operations for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for receive (write) operations for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for send and receive operations for a port over a specified time interval.
Port Peak Send Data Rate Port Peak Receive Data Rate Port to Host Send Data Rate Port to Host Receive Data Rate
switch port switch port SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Peak number of megabytes (2^20 bytes) per second that were sent by a port over a specified time interval. Peak number of megabytes (2^20 bytes) per second that were received by a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to host computers by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from host computers by a component over a specified time interval.
Total Port to Host Data Rate Port to Disk Send Data Rate Port to Disk Receive Data Rate Total Port to Disk Data Rate Port to Local Node Send Data Rate Port to Local Node Receive Data Rate
SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Average number of megabytes (2^20 bytes) per second transmitted between host computers and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to storage subsystems by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from storage subsystems by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between storage subsystems and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to other nodes in the local SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from other nodes in the local SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between other nodes in the local SAN Volume Controller cluster and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to nodes in the remote SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from nodes in the remote SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between nodes in the remote SAN Volume Controller cluster and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent over the FCP protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received over the FCP protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent or received over the FCP protocol for a port over a specified time interval. 
Average number of megabytes (2^20 bytes) per second sent over the FICON protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received over the FICON protocol for a port over a specified time interval.
Port to Remote Node Send Data Rate Port to Remote Node Receive Data Rate Total Port to Remote Node Data Rate
SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Port FCP Send Data Rate* Port FCP Receive Data Rate* Total Port FCP Data Rate* Port FICON Send Data Rate* Port FICON Receive Data Rate*
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
Total Port FICON Data Rate* Port PPRC Send Data Rate Port PPRC Receive Data Rate Total Port PPRC Data Rate Response times Port Send Response Time
ESS/DS6000/DS8000: port
Average number of megabytes (2^20 bytes) per second sent or received over the FICON protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second transferred for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Average number of milliseconds that it took to service each send (read) operation for a port over a specified time interval. Average number of milliseconds that it took to service each receive (write) operation for a port over a specified time interval. Average number of milliseconds that it took to service each operation (send and receive) for a port over a specified time interval. Average number of milliseconds it took to service each send operation to another node in the local SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each receive operation from another node in the local SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each send or receive operation between another node in the local SAN Volume Controller cluster and a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds that each send operation issued to another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each receive operation from another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval.
Average number of milliseconds that each operation issued to another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds it took to service each send operation to a node in the remote SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each receive operation from a node in the remote SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each send or receive operation between a node in the remote SAN Volume Controller cluster and a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds that each send operation issued to a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each receive operation from a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each operation issued to a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds it took to service all send operations over the FCP protocol for a port over a specified time interval. Average number of milliseconds it took to service all receive operations over the FCP protocol for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations over the FCP protocol for a port over a specified time interval. 
Average number of milliseconds it took to service all send operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all receive operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all send operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Port to Remote Node Receive Response Time Total Port to Remote Node Response Time
Port FCP Send Response Time* Port FCP Receive Response Time* Overall Port FCP Response Time* Port FICON Send Response Time* Port FICON Receive Response Time* Overall Port FICON Response Time* Port PPRC Send Response Time
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
Port PPRC Receive Response Time Overall Port PPRC Response Time Transfer sizes Port Send Transfer Size
Average number of milliseconds it took to service all receive operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Average number of KB sent per I/O by a port over a specified time interval.
Average number of KB received per I/O by a port over a specified time interval.
Average number of KB transferred per I/O by a port over a specified time interval.
Port Send Packet Size Port Receive Packet Size Overall Port Packet Size
Average number of KB sent per packet by a port over a specified time interval.
Average number of KB received per packet by a port over a specified time interval.
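The transfer-size and packet-size metrics above are, in effect, the ratio of the corresponding data-rate and I/O-rate metrics over the same interval. The following sketch illustrates that relationship only; the function name and the zero-rate handling are assumptions, not documented Tivoli Storage Productivity Center behavior.

```python
def avg_transfer_size_kb(data_rate_mbps: float, io_rate: float) -> float:
    """Approximate average KB transferred per I/O from a data rate
    (MB/s, where 1 MB = 2**20 bytes) and an I/O rate (ops/s)."""
    if io_rate == 0:
        return 0.0  # no I/O in the interval (assumed handling)
    return data_rate_mbps * 1024.0 / io_rate  # MB/s -> KB per operation

# A port moving 80 MB/s at 5000 ops/s averages 16.384 KB per I/O
print(round(avg_transfer_size_kb(80.0, 5000.0), 3))
```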
Special computed values
Port Send Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy sending data over a specified time interval.
Port Receive Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy receiving data over a specified time interval.
Overall Port Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy sending or receiving data over a specified time interval.
Port Send Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for send operations by a port, based on its current negotiated speed.
Port Receive Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for receive operations by this port, based on its current negotiated speed.
Overall Port Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for send and receive operations by this port.
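The bandwidth percentage metrics relate the measured data rate to the port's negotiated link speed. The exact formula Tivoli Storage Productivity Center uses is not stated here, so the following is only an illustrative sketch: the function name and the rule of thumb that an 8b/10b Fibre Channel link carries roughly 100 MB/s of payload per Gbit/s of line rate are assumptions.

```python
def port_bandwidth_pct(data_rate_mbps: float, negotiated_gbps: float) -> float:
    """Approximate bandwidth utilization: measured data rate (MB/s) as a
    percentage of the payload capacity implied by the negotiated speed."""
    usable_mbps = negotiated_gbps * 100.0  # ~100 MB/s per Gbps (assumed)
    return min(100.0, data_rate_mbps / usable_mbps * 100.0)

# 300 MB/s on a 4 Gbps port is ~75% utilized, which is the default
# warning stress boundary for this threshold (85,75,-1,-1)
print(round(port_bandwidth_pct(300.0, 4.0), 1))
```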
Error rates Error Frame Rate switch port, switch DS8000: port, subsystem Dumped Frame Rate switch port, switch The number of frames per second that were lost due to a lack of available host buffers for a port over a specified time interval. The number of link errors per second that were experienced by a port over a specified time interval. The number of frames per second that were received in error by a port over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of times per second that synchronization was lost for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of times per second that the signal was lost for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of frames received per second in which the CRC in the frame did not match the CRC computed by the receiver for a component over a specified time interval.
The average number of frames received per second that were shorter than 28 octets (24 header + 4 CRC) not including any SOF/EOF bytes for a component over a specified time interval. The average number of frames received per second that were longer than 2140 octets (24 header + 4 CRC + 2112 data) not including any SOF/EOF bytes for a component over a specified time interval. The average number of disparity errors received per second for a component over a specified time interval. The average number of class-3 frames per second that were discarded by a component over a specified time interval.
The average number of F-BSY frames per second that were generated by a component over a specified time interval. The average number of F-RJT frames per second that were generated by a component over a specified time interval. The average number of primitive sequence protocol errors detected for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of transmission words per second that had an 8b/10b code violation in one or more characters; had a K28.5 in its second, third, or fourth character positions; and/or was an ordered set that had an incorrect Beginning Running Disparity. The number of microseconds that the port has been unable to send frames due to lack of buffer credit since the last node reset. The average number of times per second that a port has transitioned from an active (AC) state to a Link Recovery (LR1) state over a specified time interval. The average number of times per second that a port has transitioned from an active (AC) state to a Link Recovery (LR2) state over a specified time interval. The average number of times per second that an out of order frame was detected for a port over a specified time interval. The average number of times per second that an out of order ACK frame was detected for a port over a specified time interval. The average number of times per second that a frame was received that has been detected as previously processed for a port over a specified time interval. The average number of times per second that a frame was received with an invalid relative offset in the frame header for a port over a specified time interval. The average number of times per second that the port has detected a timeout condition on receiving sequence initiative for a Fibre Channel exchange for a port over a specified time interval.
Zero Buffer-Buffer Credit Timer Link Recovery (LR) Sent Rate Link Recovery (LR) Received Rate Out of Order Data Rate Out of Order ACK Rate Duplicate Frame Rate
SVC, Storwize V7000: port, node, I/O group, subsystem switch port, switch DS8000: port, subsystem switch port, switch DS8000: port, subsystem DS8000: port, subsystem
Note: * The value N/A is displayed for this metric if you set the Summation Level to hourly or daily before generating the report.
Threshold boundaries
You can establish boundaries for the normal expected subsystem performance when defining storage subsystem alerts for performance threshold events. When a collected performance data sample falls outside the range you have set, you are notified of the threshold violation so that you are aware of the potential problem.

The upper boundaries are Critical Stress and Warning Stress; the lower boundaries are Warning Idle and Critical Idle. Usually you want the stress boundaries to be high numbers and the idle boundaries to be low numbers. The exception to this rule is the Cache Holding Time threshold, where you want the stress numbers to be low and the idle numbers to be high. If you do not want to be notified of threshold violations for a boundary, leave the boundary field blank and the performance data is not checked against any value. For example, if the Critical Idle and Warning Idle fields are left blank, no alerts are sent for any idle conditions.

The Ignore triggering condition when the sequential I/O percentage exceeds check box is active only for the Disk Utilization Percentage threshold. It is a filter condition; the default is 80%. The Ignore triggering condition when the Back-End Read I/O Rate is less than check box applies only to the Back-End Read Response Time and Back-End Read Queue Time thresholds. The Ignore triggering condition when the Back-End Write I/O Rate is less than check box applies only to the Back-End Write Response Time and Back-End Write Queue Time thresholds. The Ignore triggering condition when the Total Back-End I/O Rate is less than check box applies only to the Overall Back-End Response Time threshold. The Ignore triggering condition when the Total I/O Rate is less than check box applies only to the Non-preferred Node Usage Percentage threshold. The Ignore triggering condition when the Write-cache Delay I/O Rate is less than check box applies only to the Write-cache Delay Percentage threshold.
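The boundary and filter logic can be sketched as follows. This is a simplified illustration, not TPC's implementation: a blank boundary field (or a -1 boundary) is modeled here as None, the function name and argument names are invented, and the inverted Cache Holding Time case is not handled.

```python
def check_threshold(value, critical_stress=None, warning_stress=None,
                    warning_idle=None, critical_idle=None,
                    filter_value=None, filter_floor=None):
    """Classify one performance sample against the four boundary values.
    A boundary left as None (blank in the GUI) is not checked, and an
    active filter condition suppresses the alert entirely."""
    # Filter: e.g. ignore Back-End Read Response Time violations when
    # the Back-End Read I/O Rate is below the configured floor.
    if (filter_value is not None and filter_floor is not None
            and filter_value < filter_floor):
        return "filtered"
    if critical_stress is not None and value >= critical_stress:
        return "critical stress"
    if warning_stress is not None and value >= warning_stress:
        return "warning stress"
    if critical_idle is not None and value <= critical_idle:
        return "critical idle"
    if warning_idle is not None and value <= warning_idle:
        return "warning idle"
    return "normal"

# Disk Utilization Percentage with its default boundaries 80, 50, -1, -1
print(check_threshold(85, critical_stress=80, warning_stress=50))
print(check_threshold(60, critical_stress=80, warning_stress=50))
```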
Array thresholds
Table B-5 lists and describes the Array thresholds:
Table B-5 Array Thresholds Threshold (Metric) Array Thresholds Disk Utilization Percentage DS6000/DS8000 array Sets thresholds on the approximate utilization percentage of the arrays in a particular subsystem; for example, the average percentage of time that the disks associated with the array were busy. The Disk Utilization metric for each array is checked against the threshold boundaries for each collection interval. This threshold is enabled by default for IBM TotalStorage Enterprise Storage Server systems and disabled by default for others. The default threshold boundaries are 80%, 50%, -1, -1. For DS6000 and DS8000 subsystems, this threshold applies only to those ranks which are the only ranks in their associated extent pool. Sets thresholds on the average number of I/O operations per second for array and MDisk read and write operations. The Total I/O Rate metric for each array or MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of MB per second that were transferred for array and MDisk read and write operations. The Total Data Rate metric for each array or MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of milliseconds that it took to service each array and MDisk read operation. The Back-End Read Response Time metric for each array or MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 35,25,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Read I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Device/Component Type Description
Sets thresholds on the average number of milliseconds that it took to service each array and MDisk write operation. The Back-End Write Response Time metric for each array or MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 120,80,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Write I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Sets thresholds on the average number of milliseconds that it took to service each MDisk I/O operation, measured at the MDisk level. The Total Response Time (external) metric for each MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. A filter is available for this threshold which will ignore any boundary violations if the Total Back-End I/O Rate is less than a specified filter value. The pre-populated filter value is 10. Sets thresholds on the average number of milliseconds that each read operation spent on the queue before being issued to the back-end device. The Back-End Read Queue Time metric for each MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 5,3,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Read I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Violation of these threshold boundaries means that the SVC deems the MDisk to be overloaded. There is a queue algorithm that determines the number of concurrent I/O operations that the SVC will send to a given MDisk. If there is any queuing (other than during a backup process) then this suggests performance can be improved by resolving the queuing issue. 
Sets thresholds on the average number of milliseconds that each write operation spent on the queue before being issued to the back-end device. The Back-End Write Queue Time metric for each MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 5,3,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Write I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Violation of these threshold boundaries means that the SVC deems the MDisk to be overloaded. There is a queue algorithm that determines the number of concurrent I/O operations that the SVC will send to a given MDisk. If there is any queuing (other than during a backup process) then this suggests performance can be improved by resolving the queuing issue.
Sets thresholds on the peak (worst) response time among all MDisk write operations by a node. The Back-End Peak Write Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 30000,10000,-1,-1. Violation of these threshold boundaries means that the SVC cache is having to partition-limit for a given MDisk group. The de-staged data from the SVC cache for this MDisk group is causing the cache to fill up (writes are being received faster than they can be de-staged to disk). If delays reach 30 seconds or more, then the SVC will switch into short-term mode where writes are no longer cached for the MDisk Group. Sets thresholds on the average number of milliseconds it took to service each send operation to another node in the local SVC cluster. The Port to Local Node Send Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 3,1.5,-1,-1. Violation of these threshold boundaries means that it is taking too long to send data between nodes (on the fabric), and suggests that there is either congestion around these FC ports, or an internal SVC microcode problem. Sets thresholds on the average number of milliseconds it took to service each receive operation from another node in the local SVC cluster. The Port to Local Node Receive Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 1,0.5,-1,-1. Violation of these threshold boundaries means that it is taking too long to send data between nodes (on the fabric), and suggests that there is either congestion around these FC ports, or an internal SVC microcode problem. 
Sets thresholds on the average number of milliseconds that each send operation issued to another node in the local SVC cluster spent on the queue before being issued. The Port to Local Node Send Queue Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 2,1,-1,-1. Violation of these threshold boundaries means that the node has to wait too long to send data to other nodes (on the fabric), and suggests congestion on the fabric. Sets thresholds on the average number of milliseconds that each receive operation issued to another node in the local SVC cluster spent on the queue before being issued. The Port to Local Node Receive Queue Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 1,0.5,-1,-1. Violation of these threshold boundaries means that the node has to wait too long to receive data from other nodes (on the fabric), and suggests congestion on the fabric.
Controller thresholds
Table B-6 lists and describes the Controller thresholds:
Table B-6 Controller Thresholds Threshold (Metric) Controller Thresholds Total I/O Rate (overall) DS6000/DS8000 controller SVC, Storwize V7000 I/O group Sets thresholds on the average number of I/O operations per second for read and write operations, for the subsystem controllers (clusters) or I/O groups. The Total I/O Rate metric for each controller or I/O group is checked against the threshold boundaries for each collection interval. These thresholds are disabled by default. Sets thresholds on the average number of MB per second for read and write operations for the subsystem controllers (clusters) or I/O groups. The Total Data Rate metric for each controller or I/O group is checked against the threshold boundaries for each collection interval. These thresholds are disabled by default. Sets thresholds on the percentage of time that NVS space constraints caused I/O operations to be delayed, for the subsystem controllers (clusters). The NVS Full Percentage metric for each controller is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 10, 3, -1, -1. Sets thresholds on the average cache holding time, in seconds, for I/O data in the subsystem controllers (clusters). Shorter time periods indicate adverse performance. The Cache Holding Time metric for each controller is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 30, 60, -1, -1. Sets thresholds on the percentage of I/O operations that were delayed due to write-cache space constraints. This metric for each controller or node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 10, 3, -1, -1. In addition, a filter is available for this threshold which will ignore any boundary violations if the Write-cache Delay I/O Rate is less than a specified filter value. 
The pre-populated filter value is 10 I/Os per second. Sets thresholds on the Non-Preferred Node Usage Percentage of an I/O group. This metric of each I/O group is checked against the threshold boundaries at each collection interval. This threshold is disabled by default. In addition, a filter is available for this threshold which will ignore any boundary violations if the Total I/O Rate of the I/O group is less than a specified filter value. Device/Component Type Description
DS6000/DS8000 controller
DS6000/DS8000 controller
Port thresholds
Port thresholds are used to set limits for such things as bandwidth utilization, data rates, and I/O operations. Table B-7 lists and describes the Port thresholds:
Table B-7 Threshold (Metric) Port Thresholds Total Port I/O Rate DS6000/DS8000 port switch port XIV port Total Port Data Rate DS6000/DS8000 port switch port XIV port Overall Port Response Time DS6000/DS8000 port Sets thresholds for ports on the average number of I/O operations or packets per second for send and receive operations. The Total I/O Rate metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds for ports on the average number of MB per second for send and receive operations. The Total Data Rate metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds for ports on the average number of milliseconds that it takes to service each send and receive I/O operation. The Total Response Time metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of frames per second received in error by ports. The Error Frame Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of link errors per second for ports. The Link Failure Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the critical and warning data rates for stress and idle in MB per second. The Total Port Data Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on critical and warning data rates for stress and idle conditions in packets per second. For example, a critical stress or warning stress condition occurs when the upper boundary for the packet rate of a switch is detected. 
A critical idle or warning idle condition occurs when the lower boundary for the packet rate of a switch is detected. The Total Port Packet Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average amount of time that ports are busy sending data. The metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average amount of time that ports are busy receiving data. The metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Device/Component Type Description
DS8000 port
Switch port
DS6000/DS8000 port
DS6000/DS8000 port
362
DS8000 port SVC, Storwize V7000 port switch port XIV port
Sets thresholds on the average port bandwidth utilization percentage for send operations. The Port Send Utilization Percentage metric is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries 85,75,-1,-1. Sets thresholds on the average port bandwidth utilization percentage for receive operations. The Port Send Utilization Percentage metric is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries 85,75,-1,-1.
Port Receive DS8000 port Bandwidth Percentage SVC, Storwize V7000 port switch port XIV port CRC Error Rate DS8000 port SVC, Storwize V7000 port switch port Invalid Transmission Word Rate DS8000 port SVC, Storwize V7000 port switch port Zero Buffer - Buffer Credit Timer SVC, Storwize V7000 port
Sets thresholds on the average number of frames received in which the cyclic redundancy check (CRC) in a frame does not match the CRC computed by the receiver. The CRC Error Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of bit errors detected on a port. The Invalid Transmission Word Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the number of microseconds that a port has been unable to send frames because of a lack of buffer credit since the last node reset. The Zero Buffer-Buffer Credit Timer metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default.
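The stress and idle boundary logic that these thresholds share can be sketched as follows. This is an illustrative sketch, not Tivoli Storage Productivity Center code; the four-value boundary order (critical stress, warning stress, warning idle, critical idle) follows the 85,75,-1,-1 convention shown above, where a value of -1 disables that boundary.

```python
def evaluate_threshold(value, boundaries):
    """Classify a metric sample against a TPC-style threshold setting.

    boundaries = (critical_stress, warning_stress, warning_idle,
    critical_idle), for example (85, 75, -1, -1) for the Port Send
    Bandwidth Percentage default; -1 disables a boundary.
    """
    crit_stress, warn_stress, warn_idle, crit_idle = boundaries
    # Stress conditions: the metric exceeds an upper boundary.
    if crit_stress != -1 and value >= crit_stress:
        return "critical stress"
    if warn_stress != -1 and value >= warn_stress:
        return "warning stress"
    # Idle conditions: the metric falls below a lower boundary.
    if crit_idle != -1 and value <= crit_idle:
        return "critical idle"
    if warn_idle != -1 and value <= warn_idle:
        return "warning idle"
    return "normal"
```

With the default 85,75,-1,-1 boundaries, a port at 90% utilization is a critical stress condition and a port at 80% is a warning stress condition; the idle boundaries are disabled.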
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Appendix C. Reporting with Tivoli Storage Productivity Center

Using SQL
Because the Tivoli Storage Productivity Center database repository is a standard DB2 database, you can use external commands to access some of the information without using the Tivoli Storage Productivity Center GUI. Tivoli Storage Productivity Center provides a predefined set of table views that you should use, because these table views change only by additions in new Tivoli Storage Productivity Center releases. Furthermore, only the table views are documented. For details, see the IBM Tivoli Storage Productivity Center V4.1 Release Guide, SG24-7725: Section 10.1 provides a reporting overview and a collection of reporting information, and Chapter 11 provides details on customized reporting through Tivoli Common Reporting.
A set of User Defined Functions (UDFs) is required to get the right values and metrics out of the table views. Without the UDFs, we would need to understand each column in the table view and calculate our own metrics, which is error-prone. Choose the tab in the Excel sheet for the type of storage subsystem; we choose XIV. Afterwards, you can select the metric, as shown in Figure C-1.
Figure C-1 Choose tab XIV and filter to Total Response Time
Now we choose the category. The category can be XIV System, XIV Module, or XIV Volume. We select XIV System because we want the total (overall) response time. The Excel sheet shows two rows, as shown in Figure C-2.
The last part is to decide which table view we want to use. In Figure C-2, we see four possible views: PRF_XIV_SYSTEM, LATEST_PRF_XIV_SYSTEM, PRF_HOURLYDAILY_XIV_SYSTEM, and LATEST_PRF_HOURLYDAILY_XIV_SYSTEM. These views are described in the Tivoli Storage Productivity Center 4.2.1_TPCREPORT_schema.zip file. In this example, we use the table view called PRF_HOURLYDAILY_XIV_SYSTEM, which includes hourly aggregated performance data from the XIV storage subsystem. We set the filter in the Excel sheet and finally get the required View, Metric, Unit, and UDF & Parameters (see Figure C-3). These values are the prerequisites for creating the SQL statement.
Figure C-3 SQL Parameters
Appendix C. Reporting with Tivoli Storage Productivity Center
Important: A set of User Defined Functions (UDFs) is provided to ease the implementation of metric calculations. The UDFs automate all required transformations; you do not need to be aware of the details of the individual values in order to generate performance metrics using the UDFs. Usage of the UDFs is documented in the Excel sheet PM_Metrics.xls.

To create the SQL select statement, we look at the table view description. All table views belong to the schema TPCREPORT. From the table view PRF_HOURLYDAILY_XIV_SYSTEM (see Figure C-4), we choose DEV_ID, PRF_TIMESTAMP, and INTERVAL_LEN.
Because the DEV_ID is just a Tivoli Storage Productivity Center internal number, we use the table view STORAGESUBSYSTEM to display a meaningful name of the storage subsystem in our report (see Figure C-5).
With that information, we can now create the SQL select statement:

select s.DISPLAY_NAME, p.PRF_TIMESTAMP, p.INTERVAL_LEN,
   TPCREPORT.PM_HD_XIV_TOT_RESP_TIME(p.READ_IO, p.WRITE_IO,
      p.READ_TIME, p.WRITE_TIME) AS "TOTAL RESPONSE TIME (ms/op)"
from TPCREPORT.PRF_HOURLYDAILY_XIV_SYSTEM as p,
     TPCREPORT.STORAGESUBSYSTEM as s
where p.DEV_ID = s.SUBSYSTEM_ID
ORDER BY PRF_TIMESTAMP DESC
for fetch only with UR

The output of the command is shown in Figure C-6, which lists the Total Response Time of the XIV Storage System based on hourly performance averages.
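Under the hood, the PM_HD_XIV_TOT_RESP_TIME UDF turns the raw counters into a per-operation response time. The UDF internals are not documented here; the following is a plausible Python sketch of the calculation, assuming READ_TIME and WRITE_TIME are cumulative service times in milliseconds for the interval. The function name and column semantics are assumptions for illustration, not the actual UDF implementation.

```python
def total_response_time_ms(read_io, write_io, read_time, write_time):
    """Sketch of an overall response time in ms per operation, as an
    I/O-weighted average over reads and writes (assumed semantics:
    *_IO = operation counts, *_TIME = cumulative service time in ms)."""
    total_ops = read_io + write_io
    if total_ops == 0:
        return 0.0  # idle interval: avoid division by zero
    return (read_time + write_time) / total_ops
```

For example, 100 reads and 50 writes that together consumed 900 ms of service time yield 6.0 ms/op. This is exactly why the UDFs are valuable: without them, you would have to know which columns are counts, which are cumulative times, and how to weight them.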
Here we list the normal sequence of steps that you follow to create a report with TPCTOOL:
1. Start the TPCTOOL CLI: run the batch file that came with the client installation, <TPC_Installation_Directory>\cli\tpctool. This opens a command prompt with tpctool.
2. List the storage devices by using the lsdev command, as shown in Figure C-7.
3. Determine the component type on which you want to report by using the lstype command as shown in Figure C-8.
4. Next, decide which metrics or counters you want to include in the report. You can either use the lists provided in this book or use the lsmetrics or lscounter command. Remember that the metrics returned by the lsmetrics command are the same as the columns in the Tivoli Storage Productivity Center GUI, whereas the counters represent the raw data that Tivoli Storage Productivity Center has gathered from the CIMOMs and NAPIs.
5. Before you run the reporting command, decide on the time frame to report on and the level of the samples to include.
6. Run the report, and redirect the output to a file.

Tip: If you want to import the data into Excel later, we recommend using a semicolon as the field separator (-fs parameter). A comma can easily be mistaken for a decimal or digit-grouping symbol. The disadvantage is that Excel does not recognize the structure of such a csv file when you open it with a double-click. The book Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364, contains an Excel template that you can use with TPCTOOL for reporting.
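To illustrate why the semicolon separator matters, here is a small Python sketch (not part of TPCTOOL) that serializes report rows the way -fs ";" would: a value with a decimal comma, such as "5,2" ms in a European locale, passes through unambiguously because the comma is no longer the field separator.

```python
import csv
import io

def to_semicolon_csv(rows):
    """Serialize report rows with ';' as the field separator, mirroring
    the tpctool -fs ";" recommendation. Values containing a decimal
    comma (for example "5,2") need no quoting and import into Excel
    intact, because ';' is the delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=";", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For example, the row ["2011.06.14:05:00:00", "1250", "5,2"] becomes the line 2011.06.14:05:00:00;1250;5,2 with no ambiguity between field boundaries and decimal commas.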
7. Now you can use the data in any kind of reporting tool, for example, Excel.

The lstime command is very helpful because you can use it to verify that the performance collection is running and that data is being inserted into the database (see Example 6-1).
Example 6-1 lstime command sample output
tpctool> lstime -user administrator -pwd xxxxx -url localhost:9550 -ctype subsystem -level sample -subsys 2810.6000646+0
Start               Duration Option
===================================
2011.06.13:18:04:08 81298    server

Figure C-10 shows a performance report of an XIV storage subsystem: hourly counters, for 10 hours, starting on 2011.06.14 at 5 a.m. The reported components are:
Total I/O Rate (overall) = 809
Total Data Rate (overall) = 821
Total Response Time (overall) = 824
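The Start and Duration columns from lstime can be combined to see how far the collected sample range extends, which is a quick sanity check that the collection is still inserting data. This Python sketch assumes the Duration column is in seconds; the helper name and that assumption are ours, not a TPCTOOL feature.

```python
from datetime import datetime, timedelta

def collection_end(start_str, duration_s):
    """Compute the end of the collected sample range from an lstime row.

    start_str uses TPCTOOL's yyyy.MM.dd:HH:mm:ss timestamp format;
    duration_s is the Duration column, assumed to be seconds."""
    start = datetime.strptime(start_str, "%Y.%m.%d:%H:%M:%S")
    return start + timedelta(seconds=duration_s)
```

Applied to the row in Example 6-1, a start of 2011.06.13:18:04:08 plus 81298 seconds (about 22.6 hours) places the end of the collected range in the afternoon of the following day, consistent with an ongoing hourly collection.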
The IBM Redpaper publication Reporting with TPCTOOL, REDP-4230, which discusses TPCTOOL in more detail, is available at the following link:
http://w3.itso.ibm.com/abstracts/redp4230.html?Open
On the Selection tab, we define the following (see Figure C-12):
1. Selection: we choose the desired XIV.
2. Set Display historic performance data using relative time to 1 day (24 hours).
3. Set the Summary Level to Hourly.
4. Add the Included Columns for the report.
On the next panel, we define where to place the report file, the format of the file, and the naming convention of the file. For details, see Figure C-13.
On the next panel, we define when to run the batch report. We want to run it once a day, at 8 p.m., so every day at 8 p.m. such a report is generated. See Figure C-14 for the details.
On the last panel, you can set an alert in case the report generation fails (see Figure C-15). In this example, we send an email to the tpcadmin user in case of a failure. To be able to use alerting, you need to configure Tivoli Storage Productivity Center. You can find the alert configuration panel in the Tivoli Storage Productivity Center Navigation Tree under Administrative Services → Configuration → Alert Disposition.
At the end, save the job. We also create a batch report for the Storwize V7000, using exactly the same configuration as for the XIV, except that we select the Storwize V7000 instead of the XIV, choose another destination path for the report files, and also add the SVC/Storwize V7000 specific metric CPU Utilization to the Selected Columns. From now on, every night at 8 p.m. a report (HTML file) is generated for each storage subsystem, the Storwize V7000 and the XIV, containing 24 hourly samples. The HTML file name contains an incremental number (up to 9999), the storage subsystem device name, and the timestamp, so a file is never overwritten. After the first run of the batch job, we see the HTML files in the defined directories. See Figure C-16 for the example output of the XIV report and Figure C-17 for the output of the Storwize V7000 report.
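The collision-free naming scheme just described can be sketched as follows. The exact format string Tivoli Storage Productivity Center uses is an assumption for illustration; only the three components (sequence number up to 9999, device name, timestamp) come from the text above.

```python
def report_filename(seq, device, timestamp):
    """Hypothetical sketch of the batch-report naming scheme: a
    zero-padded incremental number (up to 9999), the storage subsystem
    device name, and the run timestamp, so no two runs collide."""
    if not 0 <= seq <= 9999:
        raise ValueError("sequence number must fit in four digits")
    return f"{seq:04d}_{device}_{timestamp}.html"
```

Because both the sequence number and the timestamp change on every run, each nightly report lands in a new file instead of overwriting the previous one.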
Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
Tivoli Storage Productivity Center and Tivoli Storage Productivity Center for Replication Version 4.2.1 Installation and Configuration Guide, SC27-2337
Online resources
These websites are also relevant as further information sources:
- IBM Storage Software support website:
  http://www.ibm.com/servers/storage/support/software/
- Tivoli Storage Productivity Center product packages:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_product_packages.html
- CIMOM compatibility matrix for fabric management (supports Tivoli Storage Productivity Center v4.2.1):
  https://www-01.ibm.com/support/docview.wss?uid=swg27019378
- http://www-01.ibm.com/support/docview.wss?rs=1134&context=SS8JFM&context=SSESLZ&dc=DB500&uid=swg21265379&loc=en_US&cs=utf-8&lang=en
Index
Symbols
78, 24, 48, 99101, 108, 182, 187, 207 cache hostile workloads 18 cached storage subsystems 60 cache-miss 222 Capacity Planning 306 Capacity reports disk capacity 308 TPC-wide Storage Space 309 storage capacity management 306 storage subsystem performance 311 Case study Basics 48 fabric performance 291 IBM XIV workload analysis 287 Server performance 267 SVC & Storwize performance constraint alerts 283 Top Volumes Response Performance 280 Topology Viewer - SVC and Fabric 296 change history 150 change overlay 122 chart reports 102 CIMOM 15, 43 CIMOM compatibility matrix 44 deployment 44 Providers 43 recommended capabilities 44 sizing 71 comma separated values (CSV) 69, 108 Common Information Model Object Manager (CIMOM) 6 Compatibility matrix 45 fabric management 45, 381 storage device management 45 configuration history 120, 124 change overlay 122 use 120 constraint 80 constraint violation reports 113, 143 constraint violation thresholds 114 constraint violations 67 controller cache performance report cache hit percentage 203 Controller cache read usage 205 Controller performance reports Data rates 198 I/O rates 201 counters 54 CPU Utilization Thresholds 129 customized performance report displaying 103 Customized predefined reports 98 Predefined Fabric manager reports 99 switch performance report 99 Switch Port Errors Report 100 Top Active Volume Cache Hit Performance 99 Top Switch Ports Data Rate 100 Top Switch Ports Packet Rate 100
A
Agents 33 CIMOM Agent 33 Data agent 33 fabric agent 33 Native Application Interface (NAPI) 33 Storage resource agent 33 alert events CPU utilization threshold 283 Overall back-end response time threshold 284 overall port response time threshold 284 alerts 80, 140 data gathering 93 default write-cache delay 204 performance-related 93 application response 60 application workloads 5 array 7 Array Performance report 96, 216 array site 8, 163 Automatic Tiering 180 EZ-Tier 180
B
back-end data rate 214, 314 back-end I/O metrics 56 back-end I/O rate 68, 212 back-end response time 67, 216 back-end response time metrics 58 back-end throughput metrics 57 backup server 23 baseline creation 68 baseline management 4 baselines 175 batch report creation 109 when to run 111 batch report formats 108 Block Server Performance 101 Block Server Performance Subprofile 16 BSP 186 By Volume report 99
C
cache 7 battery failure 237 cache friendly workloads 18 Cache hit percentage 203 cache hit rate 58
Top Volume Disk Performance 99 Top Volumes Data Rate, I/O Rate, Response Performance 99 customized reports 101 charts 102 Generate Report 104 location 102 tabular reports 106 time range 104
F
fabric agent 33, 41 Fabric manager reports Top Switch Ports Data Rate 100 Top Switch Ports Packet Rate 100 Fabric reporting E_Ports 323 FC port speed 264 front-end 57 front-end I/O metrics 57 front-end ports 7 front-end response time metrics 58
D
daily administration tasks 125 Data agent 33, 41 Data Path Explorer 148 Data Path View 300 data retention daily monitoring task 86 hourly monitoring task 86 data spike 55 Database access 34 DB2 34 SQL 34 TPCREPORT 34 database backups 36 database managed space 38 database repository capacity 76 placement 38 sizing formulas 37 database table space 38 Datapath Explorer 294 DB2 database 33, 366 TPCDB database 33 DDM 8, 63, 222, 251 Disk Manager reports performance reports 101 Disk to cache Transfer rate metric 224 Disk Utilization Percentage 208209 high 211 Disk Utilization Percentage Threshold Filtering 83 Disks 171 DMS table space 39 drive modules 8 DS4000 information 172 DS5000 information 170 DS8000 information 162 array site 163 DS8800 180 EZ-Tier 164 Ranks 163 SSD 164 Storage Pools 164 Thin Provisioning 164
G
Globally Unique Identifier 370 Graphical user interface (GUI) 32 GUI versus CLI 40
H
HBA identification 160 HBA WWPN 42 High NVS Full Percentage 58 History Aggregation 84 hot array 207 HTML chart 108 HTML report. 69
I
I/O Groups 11, 97 I/O performance 154 I/O response performance 14 IBM Storwize V7000 xv, 10 IBM System Storage DS8800 Automatic Tiering 180 idle threshold level 80 Interfaces 34 GUI 34 Java Web Start GUI 34 TPCTOOL 34 ITSO environment 48
J
Java Web Start GUI 34 Job History 139
L
large writes guidelines 62 latency 60 Logical 94 Logical reporting levels by device type 94 LUN 57, 172, 251 LUN mapping 15 LUN masking 160
E
embedded CIMOM 44 environmental norms 4 extent pool 10 extents 10
M
Managed Disk 13, 167 Managed Disk Group reports 311312 Managed Disk Group performance 247 MDisk definition 13 MDisk performance 274 messages HWNPM2123I 90 PM HWNPM2113I 92 PM HWNPM2115I 90 PM HWNPM2120I 90 metric 6, 54 metrics versus counters 93 multipath software 5
N
N/A values 101 NAPI supported storage devices IBM System Storage DS8000 43 IBM System Storage SAN Volume Controller (SVC) 43 IBM System Storage Storwize v7000 43 IBM System Storage XIV 43 Native Application Interface (NAPI) 4, 6 Native Storage System Interface (Native API) 14 Enterprise Storage Server Network Interface (ESSNI ) 14 Secure Shell (SSH) interface 14 XML CLI (XCLI) 14 Near Line (NL)-SATA 183 networking subsystem 22 new volume utilization metric 67 Node Cache performance report 228, 239 Node level reports 237 N-Series support 41
O
OLTP performance rates 62 OLTP response time 196 online monitor 93 Overall Back-End Response Time 216 oversubscription of links 14
P
performance 71 baselines 175 batch reports 108 performance analysis guidelines 62 performance collection scheduler 86 performance collection task 77 performance configuration 24 workload isolation 24 Workload resource sharing 25 workload spreading 25 performance considerations random workloads 179 sequential workloads 179
performance counters 332 performance data collection considerations 74 counters 75 retention 75 samples 75 performance data classification Cache hit rate 58 Response time 57 SAN switch 59 throughput 57 performance data collection 70 alerts 80 CIMOM intervals 72 CIMOM sizing 71 intervals 46 job duration 46 job starts 78 job status 89 new 24 hour value 46 sample interval. 72 server restart 92 Service Level Agreement 72 skipping function 78 start 87 stop 88 task considerations 46 performance management 45, 54, 307 applications 60 cache hit rate 58 cached storage subsystems 60 daily analysis 69 data collection 72 metrics 55 OLTP 59 performance data collection job 69 prerequisite tasks 70 problem determination 73 response time ranges 59 top 10 reports top 10 reports 186 performance management concepts 3 performance measurement RAID ranks 63 performance metrics 332 -1 value 332 Bit error rate for DS8000 ports 335 Counters 332 Duplicate frame rate for DS8000 ports 335 Error frame rate for DS8000 ports 334 essential 332 Important Thresholds 335 Invalid CRC rate for SVC, Storwize V7000 and DS8000 ports 334 Invalid relative offset rate for DS8000 ports 335 Invalid transmission word rate for SVC, Storwize V7000, DS8000 and Switch ports 334 Link failure rate for SVC, Storwize V7000 and DS8000 ports 334 Link Recovery (LR) received rate for DS8000 and Switch ports 335
Link Recovery (LR) sent rate for DS8000 and Switch ports 335 Loss-of-signal rate for SVC, Storwize V7000 and DS8000 ports 334 Loss-of-synchronization rate for SVC, Storwize V7000 and DS8000 ports 334 Out of order ACK rate for DS8000 ports 335 Out of order data rate for DS8000 ports 335 Primitive Sequence protocol error rate for SVC, Storwize V7000, DS8000 and Switch ports 334 quickstart 61 Sequence timeout rate for DS8000 ports 335 XIV system metrics 337 Zero buffer-buffer credit timer for SVC and Storwize V7000 ports 334 performance monitor 36, 71, 80, 93, 181 24 hours 74 data retention 86 performance monitoring Fabric environment 323 performance problems determination 73 identification 185 rank skew 177 resource sharing 176 performance reports back-end data rate 214 back-end response time 216 drill up 106 Managed Disk Group 247 N/A values 101 SVC port performance 259 Top Volume Cache 221 Top Volume Disk Performance 224 persistent memory 204 persistent memory constraint 219 Port performance 98 Port Send Receive Response Time 227 ports 7 PPRC 336 predefined performance reports 98 Array performance 96 controller cache performance report 96 controller performance report 96 I/O Group performance 97 Managed disks group 97 Module/Node cache performance 97 Node cache performance 97 port performance 98 Subsystem performance 98 problem determination basics 154 proxy agent 43
R
RAID 63, 179 RAID 5, RAID 6, and RAID 10 Considerations 179 asynchronous writes 179 random write workloads 179 sequential and random reads 179 sequential writes 179
RAID algorithms 251 RAID array utilization 208 RAID level 7 RAID5 algorithms 251 random read IO 217 random workloads 179 rank 9 count key data (CKD) 9 fixed block (FB) data 9 rank busy recommendations 207 rank I/O limit 63, 213 rank level information 163 rank skew 177 Read cache Hit Percentages 206 Read cache hit ratio 222 Read Data rate 239 Read Hit percentages guidelines 62 Redbooks publications Web site 382 Redbooks Web site Contact us xiii reports 365 constraint report 94 Customized Reports 94 Predefined Performance Reports 94 Reports for Fabric and Switches Switches reports 265 Total Port Data Rate 265 response time ranges 225 response time recommendations 196 response times back-end 226 front-end 226 reviewing alerts 140 Rules of Thumb Back-End Read and Write Queue Time 329 Cache Holding Time Threshold 329 CPU Utilization Percentage 328 CPU Utilization Percentage Threshold 328 CRC Error rate 330 Disk Utilization 328 Disk Utilization Threshold 328 Link Failure Rate and Error Frame Rate 330 Non-Preferred Node Usage 330 Overall Port response Time 329 Overall Port response Time Threshold 329 Port Data rate threshold 329 Port to local node Send/receive Queue Time 330 Port to Local Node Send/Receive Response Time 329 Read Cache Hit Percentage 328 Response Time Threshold 328 Write-Cache Delay Percentage 329 Zero Buffer Credit 330
S
SAN Planner 120, 156 SAN Volume Controller Version 6.1 xv SAN zoning 299 SAS disks 183
SATA disks 165, 180 scheduler in TPC 86 server applications 20 server to disk information 157 Service Level Agreement 19, 68, 175 SLA reporting 228 small block reads 62 small block writes guidelines 62 SMI-S Block Server Performance Subprofile 16 SMI-S profile 44 SMI-S standard 94 ports 7 response time 96 SMS table space 38 snapshot Create 120 Delete 120 SNIA 15 SNMP 47 solid state disks (SSD) 60, 180 SQL Example Query XIV Performance Table View 366 PM_Metrics.xls 366 select statement 369 Table views 366 TPCREPORT 366, 368 User Defined Functions (UDFs 368 XIV Total Response Time performance report 369 SSPC considerations 40 Host Bus Adapter (HBA) 40 SSPC appliance 40 volume management 40 storage performance management 196 Storage Pool information 164 Storage Pools 173 Storage Resource Agent (SRA) xv, 41 Common Agent Strategy (CAS) 41 deploying the SRA 42 Tivoli Storage Productivity Center topology table view with the SRA agent 42 Tivoli Storage Productivity Center topology table view without the SRA 42 Storage Server Native API 43 storage subsystem architecture 5 storage subsystem counters 54 Storage Subsystem Performance reports 65 Storage virtualization device 10 storage volume throughput 59 storage workloads 17 Storwize V7000 Case study disk performance 271 cluster 12 Control enclosure 10 Expansion enclosures 10 I/O group 11 I/O Groups in a cluster 233 Managed Disk Group (Storage Pool) 13 MDisk 13 Node 12
SPC Benchmark2 235 Storwize V7000 metric selection 275 Storwize V7000 Nodes 228 Storwize V7000 performance constraint alerts 283 Storwize V7000 performance report - volume selection 273 two node canisters 10 V6.2.0 restrictions 322 Vdisk 12 Verifying host paths to the Storwize V7000 302 Viewing host paths to the Storwize V7000 303 virtual volume 12 virtualization device 13 volume 12 Volume and Managed Disk selection 274 Storwize V7000 Best Practice Recommendations For Performance 183 Storwize V7000 considerations 182 Storwize V7000 nodes 283 Storwize V7000 version 6.2 233 Stress alerts 80 subsystem data considerations 72 subsystem metrics 94 Subsystem Performance Monitor 71 Subsystem Performance report 98 cached storage subsystems 196 data rates 193 front-end response times 197 I/O rate 189 Read I/O rate 191 recommendations 196 Response Times 195 Write I/O rate 191 SVC 43, 61, 64, 9798, 182 and Storwize V7000 performance reports 311 Back-end Read Response time 250 Best Practice Recommendations For Performance 182 HDD MDisks 180 Managed Disk information 167 performance benchmarks 235 Storage Performance Council (SPC) Benchmarks 235 V6.2.0 restrictions 322 SVC / Storwize V7000 concepts 231 SVC and Storwize V7000 Automatic Tiering 180 CPU Utilization Percentage metric 317 Element Manager 316 EZ-Tier 167 Managed Disks 167 MDisk 167 Solid State Disk (SSD) 167 Top Volume Performance reports 253 Virtual Disks 168 Volume to Back-End Volume Assignment 169 SVC and Storwize V7000 reports 311 back-end data rate 314 back-end subsystems 311
Back-end throughput and response time 314 Cache performance 254 cache utilization 239 Clusters 322 CPU Utilization 233 CPU utilization by node 233 CPU utilization percentage 243 Dirty Write percentage of Cache Hits 243 I/O Groups 316 I/O Rate 312 Managed Disk Group 247 Managed Disk Group Performance 311 MDisk performance 274 Node Cache performance 228, 239, 318 Node CPU Utilization rate 233 node CPU Utilization reports 317 node statistics 232 over utilized ports 263 overall IO rate 234 Read Cache Hit percentage 229, 240 Read Cache Hits percentage 244 Read Data rate 239 Read Hit Percentages 229, 243 Readahead percentage of Cache Hits 244 report metrics 232 response time 237 Top Volume Cache performance 253 Top Volume Data Rate performances 253 Top Volume Disk performances 253 Top Volume I/O Rate performances 253 Top Volume Response 257 Top Volume Response performances 253 Total Back-End I/O Rate 312 Total Cache Hit percentage 240 Total Data Rate 239 Write Cache Flush-through percentage 244 Write Cache Hits percentage 244 Write Cache Overflow percentage 244 Write Cache Write-through percentage 244 Write Data Rate 239 Write-cache Delay Percentage 244 SVC cache utilization 246 SVC considerations 181 SVC traffic 181 SVC health 297 SVC performance 181182 Top Volumes Data Rate 254 SVC port information 227 SVC ports 298 SVC Rule of Thumb SVC response 257 SVC version 6.2 233 switch metrics 59 Switch Port Errors report 100 switch ports 297 Switches 265 symmetric multiprocessor 21 System Storage Productivity Center 35 system-wide thresholds 128
T
table space system managed space 38 Terminal Services 22 threshold-based alerts 80 thresholds setting 128 Warning Stress 116 throughput metrics 65 throughput recommendations 224 Tier0 180 time zone 111 Tivoil Storage Productivity Center report batch reports 108 Tivoli 34, 95, 332, 336 Tivoli Storage Productivity Center CLI 40 data retention 37 database backups 36 GUI 40 hardware sizing 35 instances 40 packaging options 30 repository sizing 36 Tivoli Storage Productivity Center Components 31 Agents 33 CIMOM agent 33 Data agent 33 Fabric agent 33 Native Application Interface (NAPI) 33 Storage resource agent 33 Data Server 32 Graphical user interface (GUI) 32 Device Server 32 Interfaces 34 Java Web Start GUI 34 Tivoli Storage Productivity Center GUI 34 user interfaces (UI) 34 Tivoli Integrated Portal (TIP) 32 Tivoli Integrated Portal(TIP) Single sign-on 32 Tivoli Common Reporting (TCR) 32 Tivoli Storage Productivity Center for Replication 33 Tivoli Storage Productivity Center licensing options 30 License Summary 31 Tivoli Storage Productivity Center Basic Edition 30 Tivoli Storage Productivity Center for Data 30 Tivoli Storage Productivity Center for Disk 30 Tivoli Storage Productivity Center Mid-Range Edition 30 Tivoli Storage Productivity Center Standard Edition 30 Tivoli Storage Productivity Center performance management functions 54 performance monitoring 54 performance reports 54 performance threshold/alerts 54 Tivoli Storage Productivity Center Performance Metrics Metrics for PPRC reads 336 Metrics for PPRC writes 336
Tivoli Storage Productivity Center reports Batch reports comma separated values (CSV 108 HTML chart 108 charts 106 Constraint Violations reports 113 tabular report 106 Tivoli Storage Productivity Center SAN Planner 120 Tivoli Storage Productivity Center for Replication 33 Top 10 Disk reports 188 Array Performance reports 207 Controller Cache Performance report 202 Controller Performance reports 197 Port Performance reports 227 Subsystem Performance report 188 Top Volume Performance reports 220 Top Volume Cache performance 221 Top Volume Data Rate Performance 223 Top Volume Disk Performance 224 Top volume I/O rate performance 224 Top Volume response performance 225 Top 10 reports for SVC and Storwize V7000 I/O Group Performance reports 232 Managed Disk Group performance report 247 Node Cache Performance report 239 Top Volume Performance reports 253 SVC performance Top Volume I/O Rate 256 SVC reports Top Volume Disk 256 Top Volume Cache performance 254 Top Volume Data Rate performance 254 Top Volume Response 257 TOP 10 reports for SVC, Storwize V7000 and Disk At a Glance 187 Topology Viewer 147, 156 Data Path Explorer 294 Data Path View 300 navigation 293 SVC health 297 zone configuration 299 Total Cache Hit percentage 229, 240 Total I/O Rate 211 TPC performance metrics collection 332 TPCTOOL 34 CLI as a reporting tool 370 command line interface 370 limitations 370 ls commands 373 lsmetrics command 371 lstime command 373 Multiple components 370 Multiple metrics 370 Report generation 370 Start the TPCTOOL CLI 371 TSM backup 62
virtualization device 5 VMware ESX Server 23 Volume HBA Assignment 157 volume information 157 volume report 174 Volume to Back-End Volume Assignment 169 volumes 8
W
Warning Stress 116 Web server 22 Windows Server 2008 R2 Hyper-V 24 workload isolation 24 workload spreading 178 host connection 178 workloads backup server 23 cache 18 cache friendly 18 database server 21 file server 20 multimedia servers 22 terminal server 22 transaction based 18 web servers 22 Windows hypervisor 20 Write Cache overflow 19 Write-cache Delay Percentage 68, 204, 219 Write-cache Delay percentage 204
X
XIV 173, 366 Disks 174 information 172 Module/Node Cache Performance 231 storage device 19 Storage Pools 173 Volumes 174 XIV system metrics 337 XIV Module Cache Performance Report 228 XIV reports IBM XIV Module Cache Performance Report 228 Read Cache Hit percentage 229 Storage Pools 173 Total Cache Hit percentage 229 volume report 174 RAID level of a volume 174 XIV Disk Details 174 XIV Storage 180 Automatic Tiering 180 GRID technology 180 SATA disks 180 Solid State Disks 180 Tier0 180 XIV Storage System xv
V
Verifying host paths to the Storwize V7000 302 virtual disks 168
Z
zone configuration 299
Back cover
Customize Tivoli Storage Productivity Center environment for performance management Review standard performance reports at Disk and Fabric layers Identify essential metrics and learn Rules of Thumb
IBM Tivoli Storage Productivity Center is an ideal tool for storage management reporting, because it uses industry standards for cross-vendor compliance, and it can provide reports based on views of all application servers, all Fibre Channel fabric devices, and storage subsystems from different vendors, both physical and virtual.

This IBM Redbooks publication is intended for experienced storage managers who want to provide detailed performance reports to satisfy their business requirements. The focus of this book is on using the reports provided by Tivoli Storage Productivity Center for performance management. We also address basic storage architecture in order to set a level playing field for understanding the terminology that we use throughout this book.

Although this book was created to cover storage performance management, asset management and capacity management are just as important in the larger picture of enterprise-wide management. Tivoli Storage Productivity Center is an excellent tool to provide all of these reporting and management requirements.