ibm.com/redbooks
SAN Storage Performance Management Using Tivoli Storage Productivity Center
International Technical Support Organization
September 2011
SG24-7364-02
Note: Before using this information and the product it supports, read the information in "Notices" on page ix.
Third Edition (September 2011) This edition applies to Version 4, Release 2 Modification 3 of IBM Tivoli Storage Productivity Center (product number 5608-VC0).
Copyright International Business Machines Corporation 2009, 2011. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Notices
Trademarks
Preface
The team that wrote this book
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Summary of changes
September 2011, Third Edition

Part 1. Storage performance management concepts
Chapter 1. Performance management concepts
1.1 Performance management fundamentals
1.2 Environmental norms
1.3 Storage subsystem architecture
1.3.1 High-level component diagram of storage devices
1.3.2 Disk storage subsystem
1.3.3 Storage virtualization device
1.3.4 Comparison of a disk storage device and a virtualization device
1.3.5 Data path from your application to the storage
1.4 Native Storage System Interface (Native API)
1.5 Standards
1.5.1 SNIA
1.5.2 CIMOM and CIM agent
1.5.3 SMI-S standards
1.6 Performance issue factors
1.6.1 Types of problems
1.6.2 Workloads
1.6.3 Server types
1.6.4 Running servers in a virtualized environment
1.6.5 Understanding basic performance configuration

Part 2. Sizing and scoping your Tivoli Storage Productivity Center environment for performance management
Chapter 2. Tivoli Storage Productivity Center requirements for performance management
2.1 Determining what Tivoli Storage Productivity Center needs
2.1.1 Tivoli Storage Productivity Center licensing options
2.1.2 Tivoli Storage Productivity Center components
2.1.3 Tivoli Storage Productivity Center server recommendations
2.1.4 Tivoli Storage Productivity Center database considerations
2.1.5 Tivoli Storage Productivity Center database repository sizing formulas
2.1.6 Database placement
2.1.7 Selecting an SMS or DMS table space
2.1.8 Best practice recommendations for the TPCDB design
2.1.9 GUI versus CLI
2.1.10 Tivoli Storage Productivity Center instance guidelines
2.2 SSPC considerations
2.3 Configuration data collection methods
2.3.1 Storage Resource Agents
2.3.2 Storage Server Native API
2.3.3 CIMOMs
2.3.4 Version control for fabric, agents, subsystems, and CIMOMs
2.4 Performance data collection
2.4.1 Performance data collection tasks: Overview
2.4.2 Performance data collection tasks: Considerations
2.5 Case study: Defining the environment
2.5.1 Case study 1: Basics
2.5.2 Tivoli Storage Productivity Center basics
Part 3. Performance management with Tivoli Storage Productivity Center
Chapter 3. General performance management methodology
3.1 Overview and summary of performance evaluation
3.1.1 Main objectives of performance management
3.1.2 Metrics
3.2 Performance management approach
3.2.1 Performance data classification
3.2.2 Rules of Thumb
3.2.3 Quickstart performance metrics
3.2.4 Performance metric guidelines
3.3 Creating a baseline with Tivoli Storage Productivity Center
3.4 Performance data collection
3.4.1 Planning
3.4.2 Prerequisite tasks
3.4.3 Defining the performance data collection jobs
3.4.4 Defining the alerts
3.4.5 Defining the data retention
3.4.6 Running performance data collection
3.5 Tivoli Storage Productivity Center performance reporting capabilities
3.5.1 Reporting compared to monitoring
3.5.2 Predefined performance reports
3.5.3 Customized reports
3.5.4 Batch reports
3.5.5 Constraint Violations reports
3.6 Tivoli Storage Productivity Center configuration history
3.6.1 Viewing configuration changes in the graphical view
3.6.2 Viewing configuration changes in the table view
3.7 Tivoli Storage Productivity Center administrator tasks
3.7.1 Using Configuration Utility to verify everything is running as expected
3.7.2 Verifying that Discovery, probes, and performance monitors are running
3.7.3 Setting system-wide thresholds
3.7.4 Defining additional reports and thresholds
3.7.5 Regularly reviewing the incoming alerts
3.7.6 Using constraint violation reports
3.7.7 Using the Topology Viewer
3.7.8 Using the Data Path Explorer
3.7.9 Configuring automatic snapshots, then exploring Change History
Chapter 4. Using Tivoli Storage Productivity Center for problem determination
4.1 Problem determination lifecycle
4.2 Problem determination steps
4.2.1 Identifying acceptable base performance levels
4.2.2 Understanding your configuration
4.3 Volume information
4.3.1 Determining the subsystem configuration
4.3.2 DS8000 information
4.3.3 IBM SAN Volume Controller (SVC) or Storwize V7000
4.3.4 DS5000 information
4.3.5 XIV information
4.3.6 Determining what your baselines are
4.3.7 Determining what your SLAs are
4.3.8 General considerations about the environment
4.3.9 Problem perception considerations
4.3.10 Keeping track of the changes
4.4 Common performance problems
4.5 Deciding what can be done to prevent or solve issues
4.5.1 Dedicating plenty of resources, with storage isolation
4.5.2 Spreading work across many resources
4.5.3 Choosing the proper disk type and sizing
4.5.4 Monitoring performance
4.6 SVC considerations
4.6.1 SVC traffic
4.6.2 SVC best practice recommendations for performance
4.7 Storwize V7000 considerations
4.7.1 Storwize V7000 traffic
4.7.2 Storwize V7000 best practice recommendations for performance

Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
5.1 Data analysis: Top 10 reports
5.2 Top 10 reports for disk subsystems
5.2.1 Top 10 for Disk #1: Subsystem Performance report
5.2.2 Top 10 for Disk #2: Controller Performance reports
5.2.3 Top 10 for Disk #3: Controller Cache Performance reports
5.2.4 Top 10 for Disk #4: Array Performance reports
5.2.5 Top 10 for Disk #5-9: Top Volume Performance reports
5.2.6 Top 10 for Disk #10: Port Performance reports
5.2.7 IBM XIV Module Cache Performance report
5.3 Top 10 reports for SVC and Storwize V7000
5.3.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports
5.3.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports
5.3.3 Top 10 for SVC #3: Managed Disk Group performance reports
5.3.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports
5.3.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports
5.4 Reports for fabric and switches
5.4.1 Switches reports: Overview
5.4.2 Top Switch Port Data Rate performance
5.5 Case study: Server - performance problem with one server
5.6 Case study: Storwize V7000 - disk performance problem
5.7 Case study: Top volumes response time and I/O rate performance report
5.8 Case study: SVC and Storwize V7000 performance constraint alerts
5.9 Case study: IBM XIV Storage System workload analysis
5.10 Case study: Fabric - monitor and diagnose performance
5.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration
5.11.1 Ensuring that all SVC ports are online
5.11.2 Verifying SVC port zones
5.11.3 Verifying paths to storage
5.11.4 Verifying host paths to the Storwize V7000

Chapter 6. Using Tivoli Storage Productivity Center for capacity planning management
6.1 Capacity planning and performance management
6.1.1 Capacity planning overview
6.1.2 Performance management overview
6.1.3 Capacity planning reporting
6.2 Performance of a storage subsystem
6.2.1 SVC and Storwize V7000
6.2.2 Storage subsystems
6.2.3 Fabric

Appendix A. Rules of Thumb and suggested thresholds
Rules of Thumb summary
Response Time Threshold
CPU Utilization Percentage Threshold
Disk Utilization Threshold
FC: Total Port Data Rate Threshold
Overall Port Response Time Threshold
Cache Holding Time Threshold
Write-Cache Delay Percentage Threshold
Back-End Read and Write Queue Time Threshold
Port to Local Node Send/Receive Response Time Thresholds
Port to Local Node Send/Receive Queue Time Threshold
Non-Preferred Node Usage
CRC Error Rate Threshold
Zero Buffer Credit Threshold
Link Failure Rate and Error Frame Rate Threshold
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Performance metric collection
Counters
Essential metrics
Reports under the Disk Manager
Reports under the Fabric Manager
New FC port performance metrics and thresholds in the Tivoli Storage Productivity Center 4.2.1 release
Metrics
Thresholds
Tivoli Storage Productivity Center performance metrics
Common columns
XIV system metrics
Volume-based metrics
Back-end-based metrics
Front-end and fabric-based metrics
Tivoli Storage Productivity Center performance thresholds
Threshold boundaries
Setting the thresholds
Array thresholds
Controller thresholds
Port thresholds

Appendix C. Reporting with Tivoli Storage Productivity Center
Using SQL
SQL: Table views
SQL: Example query of the XIV performance table view
CLI: TPCTOOL as a reporting tool
Tivoli Storage Productivity Center: Batch report
Tivoli Storage Productivity Center: Batch report example

Related publications
IBM Redbooks publications
Other publications
Online resources
How to get Redbooks publications
Help from IBM

Index
Notices
This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
AIX, Cognos, DB2, DS4000, DS6000, DS8000, Enterprise Storage Server, FICON, FlashCopy, IBM, Lotus, POWER4, PowerVM, Redbooks, Redpaper, Redbooks (logo), Symphony, System p, System Storage, Tivoli Enterprise Console, Tivoli, TotalStorage, XIV
The following terms are trademarks of other companies:

Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Microsoft, Windows NT, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

Snapshot, NetApp, and the NetApp logo are trademarks or registered trademarks of NetApp, Inc. in the U.S. and other countries.

Java, and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Disk Magic, and the IntelliMagic logo are trademarks of IntelliMagic BV in the United States, other countries, or both.

Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered trademarks of Oracle Corporation and/or its affiliates.

QLogic, and the QLogic logo are registered trademarks of QLogic Corporation. SANblade is a registered trademark in the United States.

VMware, the VMware "boxes" logo and design are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.
Preface
IBM Tivoli Storage Productivity Center is an open storage infrastructure management solution designed to help reduce the effort of managing complex storage infrastructures, to help improve storage capacity utilization, and to help improve administrative efficiency. Tivoli Storage Productivity Center can manage performance and connectivity from the host file system to the physical disk, including in-depth performance monitoring and analysis on SAN fabric performance. In this IBM Redbooks publication, we show you how to use Tivoli Storage Productivity Center reporting capabilities to manage performance in your storage infrastructure.
Daniel Frueh is an Advisory IT Specialist in GTS Services Delivery Austria & Switzerland. He has nine years of experience in the Open Storage field. He holds a degree in Computer Science from the University of Rapperswil. His areas of expertise include Tivoli Storage Manager, Tivoli Storage Productivity Center, SVC, IBM DS8000, IBM DS6000, SAN, and N series.
Paolo D'Angelo is a Certified IT Architect working in Global Technology Services in Rome, Italy. He has worked at IBM for 12 years, and has 10 years of experience in the Open Storage and Storage Management areas. Paolo's areas of expertise, both in design and implementation, include Storage Area Network, Data Migration, Storage Virtualization, Tivoli Storage Manager, Tivoli Storage Productivity Center, and Open Storage design and implementation.
Lloyd Dean is an IBM Senior Certified IT Architect in IBM S&D, and a Distinguished Chief/Lead Certified Open Group IT Architect. He provides pre-sales technical support within S&D as a Storage Solution Lead Architect throughout the Eastern United States. Lloyd has over 31 years of IT experience, with over 15 years in the storage field. He has held many leadership positions within IBM, focused on storage solution design, implementation, and storage service management. Lloyd has over eight years of extensive experience with both the SAN Volume Controller and Tivoli Storage Productivity Center. He has written a number of white papers on using Tivoli Storage Productivity Center to support SVC performance management, has authored several presentations on Tivoli Storage Productivity Center best practices, and has presented sessions at many IBM storage conferences, including STGU and the IBM System Storage Storage and Networking Symposium.

Thanks to the following people for their contributions to this project:

Alex Osuna, Mary Lovelace, Bertrand Dufrasne, Sangam Racherla, Ann Lund
International Technical Support Organization (ITSO)

John Hollis, Brian Smith
Advanced Technical Support, United States

Brian De Guia, Hope Rodriquez, Jeffrey McCallum, Nitu Shinde
Tivoli Storage Software Test, United States

Gary Williams, Stefan Jaquet
IBM Software Group, Tivoli, United States

Katherine Keaney
Tivoli Storage Productivity Center, Software Development, Project Manager

Xin Wang
Tivoli Storage Productivity Center, Software Development, Product Manager

Barry Whyte
IBM Systems & Technology Group, Virtual Storage Performance Architect

David Whitworth, Sonny Williams
IBM Storage Performance
Thanks also to the authors of the previous editions of this book. The authors of the second edition, SAN Storage Performance Management Using Tivoli Storage Productivity Center, published in June 2009, were: Mary Lovelace, Mark Blunden, Lloyd Dean, Paolo D'Angelo, and Massimo Mastrorilli.
Comments welcome
Your comments are important to us! We want our books to be as helpful as possible. Send us your comments about this book or other IBM Redbooks publications in one of the following ways:
- Use the online Contact us review Redbooks publications form found at: ibm.com/redbooks
- Send your comments in an email to: redbooks@us.ibm.com
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYTD Mail Station P099, 2455 South Road, Poughkeepsie, NY 12601-5400
Summary of changes
This section describes the technical changes made in this edition of the book since the second edition, which was published in June 2009 and covered Tivoli Storage Productivity Center V3.3.2. This edition might also include minor corrections and editorial changes that are not identified. Summary of Changes for SG24-7364-02, SAN Storage Performance Management Using Tivoli Storage Productivity Center, as created or updated on September 7, 2011.
New information
The following new information is provided. This book has been updated to the Tivoli Storage Productivity Center V4.2.1 level. Documentation and case studies have been added and updated to guide you through the problem determination process using standard Tivoli Storage Productivity Center reports, including new storage subsystems and functionality added since V3.3.2. Some of the key highlights are:
- Support for new storage subsystems:
  - IBM Storwize V7000: Storwize V7000 offers IBM storage virtualization, SSD optimization, and thin provisioning technologies built in to improve storage utilization.
  - IBM XIV Storage System: Tivoli Storage Productivity Center supports performance monitoring and provisioning for XIV storage systems through the native interface.
  - IBM System Storage SAN Volume Controller Version 6.1.
- With the Block Server Performance (BSP) subprofile, Tivoli Storage Productivity Center is additionally able to identify SMI-S certified disk storage subsystems from vendors other than IBM. For a complete list of supported storage, see the IBM Support Portal website for the latest Tivoli Storage Productivity Center interoperability matrix: https://www-01.ibm.com/support/docview.wss?uid=swg21386446
- Native storage system interfaces provided for DS8000, SAN Volume Controller, IBM Storwize V7000, and XIV storage systems.
- Storage Resource agents: The Storage Resource agents now perform the functions of the Data agents and Fabric agents. Out-of-band Fabric agents are still supported and their function has not changed.
Performance Manager enhancements:
- New performance metrics, counters, and thresholds for DS8000, SAN Volume Controller, and Storwize V7000
- XIV storage system enhancements

For Tivoli Storage Productivity Center help and release details, see:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/index.jsp?topic=/com.ibm.itpc.doc_2.1/tpc_infocenter_home.htm
Part 1. Storage performance management concepts

Chapter 1. Performance management concepts
Each of the following components affects the performance, and therefore the baseline, of your environment:
- Cache
- HBA
- Bandwidth
- Firmware
The different RAID types that you use also affect your subsystem performance. RAID types affect these areas:
- LUNs
- Parity penalties
- Zoning types (soft versus hard, ISL)

The use of multipath software is critical to both performance and availability in a Fibre Channel environment, because it helps provide better throughput and redundancy for your application I/Os. Multipath software products vary in the way they perform, so it is critical that you understand your multipath software features to determine what is appropriate for your setup to get maximum performance. In 1.6.2, Workloads on page 17, we describe the different application workloads that you might have in your environment. See this section for an understanding of the implications. After a baseline has been generated, you can set your thresholds within Tivoli Storage Productivity Center so that if there is any exceptional behavior, Tivoli Storage Productivity Center can trigger an alert and notify you that something has happened. In 3.7.3, Setting system-wide thresholds on page 128, we show you how to customize your thresholds and alerts. In Known limitations on page 84, we explain some of the threshold limitations.
We do not include diagrams for a SAN switch, because Tivoli Storage Productivity Center really only monitors one component, the ports. Currently, Tivoli Storage Productivity Center cannot monitor the performance of a tape library or tape drives, therefore, we do not show any diagrams for these devices. At the present time, your only option is to monitor the SAN switch ports connected to a tape drive.
(Figure 1-1: High-level component diagram of a storage subsystem, showing the reporting levels By Subsystem, By Controller, By Array, and By Port, with the front-end ports, two controllers with read and write cache plus mirrored write cache, and the arrays.)
Subsystem
On the subsystem level, you see metrics that have been aggregated from multiple records into a single value per metric, giving you a high-level view of the performance of your storage subsystem based on the metrics of its other components. Depending on the metric, aggregation is done either by summing values or by calculating derived values.
Cache
In Figure 1-1, we point out the cache and we call this a subcomponent of the subsystem, because the cache plays a crucial role in the performance of any storage subsystem. You do not find the cache as a selection in the Navigation Tree in Tivoli Storage Productivity Center, but there are available metrics that give you information about your cache. The amount of available information, or available metrics, depends on the type of subsystem involved, as well as the information provided by the native storage system interfaces (NAPI) or by the CIM agent (SMI-S agent) if the subsystem uses that interface. See 1.4, Native Storage System Interface (NAPI) on page 12 for details on NAPI. See 1.5, Standards on page 13 for details on the standards that determine the performance data that is collected and used by Tivoli Storage Productivity Center, for SNIA and for the CIMOM and CIM agents. Cache metrics are available in the following report types and levels:
- Subsystem
- Controller
- I/O group
- Node
- Array
- Volume
Ports
The port information is for the front-end ports to which the hosts or SAN attach. Certain subsystems might aggregate multiple ports onto one port card. The SMI-S standards do not reflect this aggregation, and therefore, Tivoli Storage Productivity Center does not show any grouping of ports. This is important to know, because port cards can sometimes be a bottleneck: their bandwidth does not always scale with the number of ports and their speeds. When you look at the report, the numbers per port might not seem to indicate a problem, but if you total the numbers for all ports on one port card, you might see it differently. Details for individual ports are available for viewing under Disk Manager → Reporting → Storage Subsystem Performance → By Port in the Tivoli Storage Productivity Center Navigation Tree.

Ports: Tivoli Storage Productivity Center reports on many port metrics; be aware that the ports on the DS8000, DS6000, IBM DS4000, XIV, and ESS are the front-end part of the storage device. For the SVC and Storwize V7000, the ports are part of the virtualization engine and, therefore, are used for both front-end and back-end I/O.
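Because Tivoli Storage Productivity Center reports only per-port numbers, totals per port card must be computed outside the tool. The following Python sketch illustrates the idea; the port names, throughput figures, and port-to-card mapping are illustrative assumptions, not values or an API from Tivoli Storage Productivity Center:

```python
# Sum per-port throughput (MBps) into per-port-card totals to spot
# port cards whose aggregate load approaches the card's bandwidth.
# The figures and the port-to-card mapping below are illustrative only.

port_throughput = {               # per-port totals, as exported from a
    "P0": 180.0, "P1": 210.0,     # "By Port" performance report
    "P2": 195.0, "P3": 205.0,
    "P4": 40.0,  "P5": 55.0,
}

port_card_of = {                  # physical layout: which card owns which port
    "P0": "card1", "P1": "card1", "P2": "card1", "P3": "card1",
    "P4": "card2", "P5": "card2",
}

card_totals: dict[str, float] = {}
for port, mbps in port_throughput.items():
    card = port_card_of[port]
    card_totals[card] = card_totals.get(card, 0.0) + mbps

for card, total in sorted(card_totals.items()):
    print(f"{card}: {total:.1f} MBps")
```

Here no single port looks overloaded, but card1 carries 790 MBps in total, which is the kind of aggregate that the per-port report does not surface.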
Controller
Almost all subsystems have multiple controllers (usually two) for redundancy, whether as a dual-controller, dual-cluster, or grid design; these are the components that expose the volumes and manage the cache. To analyze performance data, you need to know that most volumes can be assigned to and used by only one controller at a time. With this in mind, you can understand why a single volume is unlikely to ever get the full performance out of a subsystem.
Array
When used in this context, the term array describes the physical group of disk drive modules that are formatted with a certain RAID level. For example, for the DS8000, this is RAID 10, RAID 6, or RAID 5. The number of disks that are included in an array depends on the subsystem type and for certain disks, the actual implementation.
Volumes
The volumes, which are also called logical unit numbers (LUNs), are not shown in Figure 1-1 on page 6; we show the logical view in Figure 1-2. The host server sees the volumes as physical disk drives and treats them as such.
Array site

An array is created from one or more array sites, depending on the subsystem. Forming an array means defining it for a specific RAID type. The supported RAID types are RAID 5, RAID 6, and RAID 10. You can select a RAID type for each array site. The process of selecting the RAID type for an array is also called defining an array. Array sites are the building blocks that are used to define arrays. An array site is a group of eight disk drive modules (DDMs). Which DDMs make up an array site is predetermined by the DS8000, but note that there is no predetermined server affinity for array sites. The DDMs selected for an array site are chosen from two disk enclosures on different loops. The DDMs in an array site are the same DDM type; therefore, they have the same capacity and the same speed or revolutions per minute (RPM). In Figure 1-2, we have included only the volumes. We did not include components such as host objects or volume groups, because most of the other logical components vary from vendor to vendor.
Rank
In the DS8000 or DS6000 virtualization hierarchy, there is another logical construct, a rank. A rank is defined by the user: the user selects an array and defines the storage format for the rank, which is either count key data (CKD) or fixed block (FB) data. One rank is assigned to one extent pool by the user. Currently on the DS8000, a rank is built using only one array; on the DS6000, a rank can be built from multiple array sites. With the introduction of the DS8800, Tivoli Storage Productivity Center 4.2.1 now has the ability to expose multiple ranks in a single extent pool on the DS8000. For details on added DS8800 functionality, see 4.3.2, DS8000 information on page 162. Figure 1-3 shows the relationship of an array site, an array, and a rank.
Extents
The available space on each rank is divided into extents. The extents are the building blocks of the logical volumes. The characteristic of an extent is its size, which depends on the device type specified when defining a rank:
- For FB format, the extent size is 1 GB.
- For CKD format, the extent size is 0.94 GB for model 1.
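As a worked example of these extent sizes (our own arithmetic, not taken from DS8000 documentation), the number of extents a volume consumes is its requested capacity divided by the extent size, rounded up:

```python
import math

def extents_needed(capacity_gb: float, storage_type: str) -> int:
    """Return the number of rank extents a volume of the given size consumes.

    Extent sizes follow the text above: 1 GB for fixed block (FB),
    0.94 GB for count key data (CKD) model 1.
    """
    extent_gb = {"FB": 1.0, "CKD": 0.94}[storage_type]
    return math.ceil(capacity_gb / extent_gb)

# A 25 GB FB volume consumes 25 extents; the same capacity in CKD
# format needs 27 extents (25 / 0.94 = 26.6, rounded up).
print(extents_needed(25, "FB"))   # 25
print(extents_needed(25, "CKD"))  # 27
```

The rounding up matters for capacity planning: a volume whose size is not a multiple of the extent size leaves part of its last extent unused.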
Extent pools

An extent pool is a logical construct used to manage a set of extents. The user defines extent pools by selecting one to n ranks managed by one storage facility image, and defines which storage facility image server (Server 0 or Server 1) manages the extent pool. All extents in an extent pool must be of the same storage type (CKD or FB). Extents in an extent pool can come from ranks defined with arrays of different RAID formats, but we recommend having the same RAID configuration within an extent pool. The minimum number of extent pools in a storage facility image is two (each storage facility image server manages a minimum of one extent pool).
Subsystem
Figure 1-4 shows an overview of an SVC as compared to Storwize V7000. Currently, an SVC is composed of up to four I/O groups, each of which has two nodes. The IBM Storwize V7000 consists of a Control enclosure that contains two node canisters and disk drives. The pair of node canisters is known as the I/O Group. Optionally, up to nine Expansion enclosures that each contain two expansion canisters and drives can be added. For more information about SVC functionality, see the Redbooks publication, Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933. For more information about Storwize V7000 functionality, see the Redbooks publication, Implementing the IBM Storwize V7000, SG24-7938.
(Figure 1-4: An SVC with four I/O groups of two nodes each, compared with a Storwize V7000 control enclosure containing two node canisters forming one I/O group, plus expansion enclosures with expansion canisters.)
This is different from the way that a typical subsystem looks, because each I/O group can be considered a subsystem with two controllers. Because of so many differences, we show a comparison between a disk storage device and a virtualization device in 1.3.4, Comparison of a disk storage device and a virtualization device on page 13.
I/O group
An input/output (I/O) group contains two SVC nodes or Storwize V7000 node canisters that have been defined by the configuration process. Each SVC node or Storwize V7000 canister node is associated with exactly one I/O group. The nodes in the I/O group provide access to the volumes in the I/O group.
Figure 1-5 shows the relationship between two nodes in a virtualization device from a physical view.
Node
For I/O purposes, SVC nodes or Storwize V7000 node canisters within the cluster are grouped into pairs, called I/O groups, and a single pair is responsible for serving I/O on a particular volume. One node within the I/O group represents the preferred path for I/O to a particular volume; the other node represents the non-preferred path. This preference alternates between the nodes as each volume is created within an I/O group, to balance the workload evenly between the two nodes. We show the relationship of the volume to back-end storage in Figure 1-6. Support: The preferred path for a node is honored by SVC or Storwize V7000 when the multipath driver supports it. Otherwise, SVC and Storwize V7000 present a volume on all available node ports in the I/O group, unless port binding is used to present the volume only on selected ports.
Virtual volume
The virtual volume that is presented to a host system is called a volume. The host system treats this virtual volume as a physical disk. Starting with the smallest unit, we explain and list the components that make up a volume.
MDisk
A managed disk (MDisk) is a LUN or volume presented by a RAID controller and managed by the SVC or Storwize V7000.
(Figure: Hosts with two HBAs each, connected through two SAN switches to the subsystem's two controllers; shared components along these paths are potential bottlenecks.)
For more information about the credential migration tool, see Chapter 5, Credentials Migration Tool, in the Redbooks publication IBM Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894. The native interfaces are supported for the following release levels:
- DS8000: Release 2.4.2 or later
- SAN Volume Controller: Version 4.2 or later
- XIV storage systems: Version 10.1 or later
- Storwize V7000: Version 6.1.0 or later

For more information about NAPI, see Chapter 7, Native API, in the IBM Redbooks publication IBM Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
1.5 Standards
In this section, we briefly review the standards that determine the performance data that is collected and used by Tivoli Storage Productivity Center.
1.5.1 SNIA
The Storage Networking Industry Association (SNIA) is an international computer system industry forum of developers, integrators, and IT professionals, who evolve and promote storage networking technology and solutions. SNIA was formed to ensure that storage networks become efficient, complete, and trusted solutions across the IT community. IBM is one of the founding members of this organization. SNIA is uniquely committed to disseminating networking solutions into a broader market. SNIA is using its Storage Management Initiative (SMI) and its Storage Management Initiative-Specification (SMI-S) to create and promote the adoption of a highly functional interoperable management interface for multivendor storage networking products. SMI-S makes multivendor storage networks simpler to implement and easier to manage. IBM has led the industry in not only supporting the SMI-S initiative, but also, in using it across its hardware and software product lines. The specification covers fundamental operations of communications between management console clients and devices, auto-discovery, access, security, the ability to provision volumes and disk resources, LUN mapping and masking, and other management operations. For more information about SNIA, see its official Web site: http://www.snia.org
The CIM is an open approach to the management of systems and networks. The CIM provides a common conceptual framework applicable to all areas of management, including systems, applications, databases, networks, and devices. The CIM specification provides the language and the methodology that are used to describe management data. A CIM agent provides a way for a device to be managed by common building blocks rather than proprietary software. If a device is CIM-compliant, software that is also CIM-compliant can manage the device. Vendor applications can benefit from adopting the Common Information Model, because the vendors can manage CIM-compliant devices in a common way, rather than using device-specific programming interfaces. Using CIM, you can perform tasks in a consistent manner across devices and vendor applications. CIMOM is one of the major functional components of the CIM agent. But, in many cases, we call the CIM agent a CIMOM.
These reports also help you determine where a problem might occur in the storage subsystem. In Chapter 5, Using Tivoli Storage Productivity Center for performance management reports on page 185, we show you how to generate performance reports for SLA generation and problem determination. When determining performance issues, consider these factors:
- Workloads
- Performance capabilities of the storage subsystem
- SAN
- Server types
- Network
- Configuration of applications

Batch windows:
- Backups not completing
- Application database updates not completing
- Data warehousing
1.6.2 Workloads
Generally, you can break storage workloads into two categories with the following characteristics:
- Transaction-based: I/O intensive, with small records (4 KB) that are either sequential or random
- Throughput-based: High throughput or large data transfers using high bandwidth

The characteristics of these workloads are quite different; arrays configured for one type of workload might perform poorly for the other. The server application determines the type of workload, and server applications can generally be placed in one of these categories. If you have only one server with slow performance, you might need to investigate all the factors that can influence that application type. In the remainder of this section, we describe different workloads and show sample Tivoli Storage Productivity Center reports that display each workload.
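As a rough illustration of this two-way split, the following sketch classifies a workload sample from its average transfer size and rates. The 16 KB cutoff and the sample figures are our own illustrative choices, not values used by Tivoli Storage Productivity Center:

```python
def classify_workload(avg_transfer_kb: float, iops: float, mbps: float) -> str:
    """Roughly classify a workload as transaction- or throughput-based.

    Small transfers at high I/O rates suggest a transaction-based
    workload; large transfers moving lots of data suggest a
    throughput-based one. The 16 KB cutoff is illustrative only.
    """
    if avg_transfer_kb <= 16 and iops > mbps:
        return "transaction-based"
    return "throughput-based"

# An OLTP-like load: 4 KB records at 5000 I/Os per second (~20 MBps)
print(classify_workload(4, 5000, 20))    # transaction-based
# A backup-like load: 256 KB transfers at 400 I/Os per second (~100 MBps)
print(classify_workload(256, 400, 100))  # throughput-based
```

A real classification would use the measured metrics for the volumes that an application touches, but the principle is the same: look at transfer size and I/O rate together, not in isolation.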
Transaction-based throughput
You can characterize a transaction-based workload as I/O intensive, usually with a small block size (4 KB). Such a workload is often described as either cache friendly or cache hostile.
Cache friendly

Cache friendly workloads consist of mostly sequential access with a high read-to-write ratio, 80% or more reads, which is often expressed as 80% read, 20% write, and 10% random (80/20/10). In a cache friendly write, the data is written to cache and later destaged from cache to disk, which gives the best I/O response times. Figure 1-8 is a sample view of a cache friendly I/O load displayed by volume. The I/O load was created using IOmeter with few writes. In Figure 1-9, we show a cache friendly I/O load from the Volume view using the Volume name filter that is shown in Figure 1-8.
Tip: The filter uses a case-sensitive search. For a case-insensitive search, remove the check mark.
Cache hostile

Cache hostile workloads consist mostly of random access with a low read-to-write ratio, 25% or fewer reads, which is often expressed as 25% read, 75% write, and 0% random (25/75/0).
Cache hostile disk activity is indicated by write cache overflow, a metric available for the DS8000, DS6000, and ESS. In a cache hostile condition, the read percentage is low (around 30%), which means that the cache must be destaged to disk frequently, leading to longer I/O response times. Figure 1-10 shows cache hostile I/O.
To compare the effect of cache friendly and cache hostile workloads, we created three volumes (all are 16 GB in size) in the XIV storage system. We then set up one cache friendly read I/O thread and one cache friendly read/write I/O thread. This produced the workload shown in Figure 1-11 with total write cache hit rate above 99%.
Figure 1-11 XIV Cache friendly I/O with cache hit percentage above 99%
We then added a cache hostile I/O thread with an overall cache hit percentage at 77% and a response time of 10 msec as seen in Figure 1-12.
Figure 1-12 Cache hostile I/O, cache hit at 77%, response time 10 msec
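The link between cache hit percentage and response time can be approximated with a simple weighted average of cache-hit and cache-miss service times. The service times in this sketch are illustrative assumptions, not values measured in the scenarios above:

```python
def avg_response_ms(hit_pct: float,
                    cache_ms: float = 0.5,
                    disk_ms: float = 10.0) -> float:
    """Approximate average I/O response time as a weighted average of the
    cache-hit service time and the cache-miss (back-end disk) service time.

    cache_ms and disk_ms are illustrative assumptions, not measurements.
    """
    hit = hit_pct / 100.0
    return hit * cache_ms + (1.0 - hit) * disk_ms

print(avg_response_ms(99))  # near the cache service time: cache friendly
print(avg_response_ms(77))  # more misses pull the average toward disk time
```

Even a modest drop in cache hit percentage shifts the average sharply toward the much slower disk service time, which is why the cache hostile thread degraded the overall response time so visibly.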
Tip: Observe the low cache hits on the volumes with writes.

We have seen that changing the characteristics of the workload affects the performance of volumes. In most instances, we recommend that you practice workload separation. In Table 1-2, we show a possible breakdown of several storage workloads. Consult with your database administrator (DBA) or application owner for the exact profile of your applications. You can use these performance requirements to arrive at a suitable SLA.
Table 1-2 Storage workload characteristics

                      File server  OLTP  Warehousing  Multimedia  Backup
I/O intensive         N            Y     Y            N           N
Throughput intensive  Y            N     N            Y           Y
Read intensive        Y            Y     Y            Y           N
Write intensive       Y            Y     N            N           Y
Sequential            Y            N     Y            Y           Y
Random                Y            Y     Y            N           N
File server
The role of the file server is to store, retrieve, and update data in response to client requests. Therefore, the critical areas that impact performance are the speed of the data transfer and the networking subsystems. The amount of memory that is available to resources, such as network buffers and disk I/O caching, also greatly influences performance. Processor speed or quantity typically has little impact on file server performance. In larger environments, you must consider where the file servers are located within the networking environment; we advise you to locate them on a high-speed backbone, as close to the core switches as possible. The following subsystems have the greatest impact on file server performance, in this order:
1. Network
2. Memory
3. Disk

The network subsystem, particularly the network interface card or the bandwidth of the LAN, can create a bottleneck due to heavy workload or latency. Insufficient memory can limit the ability to cache files and, therefore, cause more disk activity, which can result in performance degradation.
When a client requests a file, the server must initially locate it, then read it, and forward the requested data back to the client. The reverse of this sequence applies when the client is updating a file. Therefore, the number of host bus adapters (HBAs) that are installed and the way that they are configured can cause the disk subsystem to be a potential bottleneck. Generally, a file server requires higher throughput to satisfy the users, and I/O response time is not as critical.
Database server
The database server's primary function is to store, search, retrieve, and update data from disk. Examples of database engines include IBM DB2, Microsoft SQL Server, and Oracle. Due to the high number of random I/O requests that database servers are required to perform, and the computation-intensive activities that occur, the following areas can potentially impact performance:
- Memory
- Disk
- Processor
- Network

The server subsystems that have the most impact on database server performance are the memory, disk, CPU, and network subsystems, described next.
Memory subsystem
Buffer caches are one of the most important components in the server, and both memory quantity and memory configuration are critical factors. If the server has insufficient memory, paging occurs, resulting in excessive disk I/O (to the server's internal disk drives), which generates latencies.
Disk subsystem
Even with sufficient memory, most database servers perform large amounts of disk I/O to bring data records into memory and flush modified data to disk. When the data and log files are located on external storage subsystems, there are additional considerations:
- The number of HBAs
- The type of RAID
- The number of disk drives that are used

Database performance is also affected by whether the database is configured to use system-managed space (SMS) or database-managed space (DMS). The storage administrator needs to plan and implement a well-designed storage subsystem to ensure that it is not a potential bottleneck.
CPU subsystem
Processing power is another important factor for database servers, because database queries and update operations require intensive CPU time. The database replication process also requires a considerable number of CPU cycles. Database servers are multi-threaded applications, so symmetric multiprocessor (SMP) capable systems provide improved performance scaling to 16-way and beyond. L2 cache size is also important due to the high hit ratio, that is, the proportion of memory requests that fill from the much faster cache instead of from memory.
Network subsystem
The networking subsystem tends to be the least important component of an application or database server, because the amount of data that is returned to the client is a small subset of the total database. The network can be important, however, if the application and the database are on separate servers. A balanced system is especially important: for example, if you add CPUs, consider upgrading other subsystems as well, such as increasing memory, and ensure that disk resources are adequate. In database servers, the design of the application is critical (for example, database design and index design).
Terminal server
Windows Server Terminal Services enables a variety of desktops to access Windows applications through terminal emulation. In essence, the application is hosted and executed on the terminal server, and only window updates are forwarded to the client. The following subsystems are the most probable sources of bottlenecks:
- Memory
- CPU
- Network

The disk subsystem has very little effect on performance.
Multimedia server
Multimedia servers provide the tools and support to prepare and publish streaming multimedia presentations on your intranet or the Internet. They require high-bandwidth networking and high-speed disk I/O because of the large data transfers. If you are streaming audio, the most probable sources of bottlenecks are in these areas:
- Network
- Memory
- Disk

If you are streaming video, the following subsystems are most important:
- Network
- Disk I/O
- Memory

Disk is more important than memory for a video server due to the volume of data being transmitted and the large amount of data that is read. If the data is stored on disk, the disk speed is also an important factor in performance. If compression and decompression of the streaming data is required, then CPU speed and the amount of memory are important factors as well.
Web servers
Today, a Web server is responsible for hosting Web pages and running server-intensive Web applications. If the Web site content is static, the following subsystems can be sources of bottlenecks:
- Network
- Memory
- CPU
If the Web server is computation-intensive (such as with dynamically created pages), the following subsystems might be sources of bottlenecks:
- Memory
- Network
- CPU
- Disk

The performance of Web servers depends on the site content. Sites that use dynamic content connect to databases for transactions and queries, and this connection requires additional CPU cycles. It is important for this type of server that there is adequate RAM for caching and for managing the processing of dynamic pages; additional RAM is also required for the Web server service. The operating system automatically adjusts the size of the cache depending on the requirements. Because of the high hit ratio and the transfer of large amounts of dynamic data, the network can be another potential bottleneck.
Backup servers
In today's world of continuity, data recreation, and availability, a backup server is responsible for an ever-increasing amount of data movement across all networks, including LAN, WAN, and SAN. Because of the increasing traffic to and from the backup server, the following subsystems might be sources of bottlenecks:
- Network
- Memory
- CPU
- Disk I/O

The network subsystem, particularly the network interface card or the bandwidth of the LAN, can create a bottleneck due to heavy workload or latency. The performance of a backup server varies at different times, based on the functions that are being performed. Traffic loads across the networks also change, depending on whether LAN-free backup agents are deployed. LAN-free agents transfer data directly across the Fibre Channel network, straight to the tape drives. This reduces the CPU utilization of the server, but increases the load on the FC network. It is important for this type of server that there is adequate RAM for caching and managing the processing of metadata from the agents.
Workload resource sharing means multiple workloads use a common set of storage subsystem resources, such as ranks, device adapters, and I/O ports. Multiple resource-sharing workloads can have logical volumes on the same ranks and can access the same host adapters or even the same I/O ports. Resource sharing allows a workload to access more hardware resources than are dedicated to it, thereby providing greater potential performance. However, this hardware sharing can result in resource contention between applications that impacts performance at times. It is important to allow resource sharing only for workloads that do not consume all of the hardware resources available to them. Workload spreading means balancing and distributing workload evenly across all of the storage subsystem hardware resources that are available. Spreading applies to both isolated workloads and resource-sharing workloads. For detailed descriptions of configuration and solution design for performance optimization, see the IBM Redbooks publications that are written specifically for your hardware:
- For DS8000: DS8000 Performance Monitoring and Tuning, SG24-7146
- For DS6000: IBM TotalStorage DS6000 Series: Performance Monitoring and Tuning, SG24-7145
- For DS3000, DS4000, and DS5000: IBM Virtual Disk System Quickstart Guide, SG24-7794
- For Storwize V7000: Implementing the IBM Storwize V7000, SG24-7938
- For SVC: Implementing the IBM System Storage SAN Volume Controller V6.1, SG24-7933
- For XIV: IBM XIV Storage System: Architecture, Implementation, and Usage, SG24-7659
Part 2
Sizing and scoping your Tivoli Storage Productivity Center environment for performance management
In this part of the book we take you through the customization of your Tivoli Storage Productivity Center environment to support performance management.
Chapter 2.
For a complete breakdown of features, functions, and capabilities by Tivoli Storage Productivity Center product, see the Tivoli Storage Productivity Center Information Center website:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_product_packages.html
Figure 2-1 on page 31 is a subset of the detailed information available at this website when you scroll through the Licenses for Tivoli Storage Productivity Center frame on the right side.
Figure 2-1 Tivoli Storage Productivity Center Function Breakdown by License Summary
Data Server
This component is the control point for product scheduling functions, configuration, event information, reporting, and graphical user interface (GUI) support. It coordinates communication with and data collection from agents that scan file systems and databases to gather storage demographics and populate the database with results. Automated actions can be defined to perform file system extension, data deletion, and Tivoli Storage Manager backup or archiving, or event reporting when defined thresholds are encountered. The Data server is the primary contact point for GUI user interface functions. It also includes functions that schedule data collection and discovery for the Device server.
Device Server
This component discovers, gathers information from, analyzes performance of, and controls storage subsystems and SAN fabrics. It coordinates communication with and data collection from agents that scan SAN fabrics and storage devices.
Single sign-on
Enables you to access Tivoli Storage Productivity Center and then Tivoli Storage Productivity Center for Replication using a single user ID and password.
For more information, see the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
DB2 database
A single database instance serves as the repository for all Tivoli Storage Productivity Center components other than Tivoli Storage Productivity Center for Replication. The default database for Tivoli Storage Productivity Center for Replication is the open source Derby database, which is supplied on the Tivoli Storage Productivity Center installation DVD or PPA download package. You are given the option to use DB2 instead, but the default is Derby. The TPCDB database installed on DB2 is the central repository where all of your storage information and usage statistics are stored. All agent and user interface access to the central repository is done through a series of calls and requests made to the server. All database access is done using the server component to maximize performance and to eliminate the need to install database connectivity software on your agent and UI machines.
Agents
Outside of the server, several interfaces are used to gather information about the environment. The most important sources of information are the Tivoli Storage Productivity Center agents (Storage Resource agent, Data agent, and Fabric agent) for servers, and either Native Application Interface (NAPI) or SMI-S enabled storage and switch devices that use a CIMOM agent (either embedded or as a proxy agent).
Agents: In Tivoli Storage Productivity Center V4.2 and above, you can deploy Storage Resource agents only. If you want to install a Data agent, you must own a previous version of the product. For information about how to install a Data agent, see the Redbooks publication for the previous version of the product, Tivoli Storage Productivity Center 4.1, SG24-7809.
Storage Resource agents, CIM agents, and Out of Band fabric agents gather host, application, storage system, and SAN fabric information and send that information to the Data server or Device server.
Tip: Data agents and Fabric agents are supported in V4.2. However, no new functions were added to those agents for that release. For optimal results when using Tivoli Storage Productivity Center, migrate the Data agents and Fabric agents to Storage Resource agents.
Interfaces
As Tivoli Storage Productivity Center gathers information from your storage (servers, subsystems, and switches) across your enterprise, it accumulates a repository of knowledge about your storage assets and how they are used. You can use the reports provided in the user interface to view and analyze that repository of information from various perspectives to gain insight into the use of storage across your enterprise. The user interfaces (UIs) enable users to request information and then generate and display reports based on that information. Certain user interfaces can also be used for configuration of Tivoli Storage Productivity Center or storage provisioning for supported devices. The following interfaces are available for Tivoli Storage Productivity Center:
- Tivoli Storage Productivity Center GUI: This is the central point of Tivoli Storage Productivity Center administration. Here you can configure Tivoli Storage Productivity Center after installation, define jobs to gather information, initiate provisioning functions, view reports, and work with the advanced analytics functions.
- Java Web Start GUI: When you use Java Web Start, the regular Tivoli Storage Productivity Center GUI is downloaded to your workstation and started automatically, so you do not have to install the GUI separately. The main reason for using Java Web Start is that it can be integrated into other products (for example, TIP). By using Launch in Context from those products, you are guided directly to the selected panel. The Launch in Context URLs can also be assembled manually and used as bookmarks.
- TPCTOOL: TPCTOOL is a command-line (CLI) program that interacts with the Tivoli Storage Productivity Center Device server.
Most frequently it is used to extract performance data from the Tivoli Storage Productivity Center repository database in order to create graphs and charts with multiple metrics, with various unit types, and for multiple entities (for example, subsystems, volumes, controllers, and arrays) using charting software. Commands are entered as lines of text (that is, sequences of characters) and output is received as text. The tool also provides query, management, and reporting capabilities, but you cannot initiate Discoveries, Probes, or performance collections from the tool.
- Database access: Starting with Tivoli Storage Productivity Center V4, the Tivoli Storage Productivity Center database provides views for access to the data stored in the repository, which allows you to create customized reports. The views and the required functions are grouped together into a database schema called TPCREPORT. To use this interface, you need sufficient knowledge of SQL. To access the views, DB2 supports various interfaces, for example, JDBC and ODBC.
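Because TPCTOOL output is plain text, it is straightforward to post-process before charting. The following sketch parses a hypothetical extract of such output (the column names, volume IDs, and values are illustrative assumptions, not actual TPCTOOL output):

```python
import csv
import io

# Hypothetical extract of performance report output, reformatted as CSV.
# Real column names and component IDs will differ in your environment.
SAMPLE = """Timestamp,Component,Read I/O Rate (overall),Write I/O Rate (overall)
2011-09-01 10:00:00,vol001,250.0,80.0
2011-09-01 10:05:00,vol001,310.0,95.0
2011-09-01 10:10:00,vol001,290.0,88.0
"""

def load_samples(text):
    """Parse CSV-style report text into a list of dicts with numeric metrics."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {"Timestamp": row["Timestamp"], "Component": row["Component"]}
        for key, value in row.items():
            if key not in parsed:
                parsed[key] = float(value)
        rows.append(parsed)
    return rows

def peak(rows, metric):
    """Return the sample with the highest value for the given metric."""
    return max(rows, key=lambda r: r[metric])

samples = load_samples(SAMPLE)
busiest = peak(samples, "Read I/O Rate (overall)")
print(busiest["Timestamp"], busiest["Read I/O Rate (overall)"])
```

The same parsed rows can then be fed to any charting software to build the multi-metric graphs described above.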
Sample configuration
Your configuration depends on the storage and SAN configuration that Tivoli Storage Productivity Center will be managing. An example might be a currently installed Tivoli Storage Productivity Center server supporting six IBM System Storage DS8000s and two IBM System Storage SAN Volume Controller (SVC) clusters installed on an AIX LPAR with four p5 processors and 16 GB of RAM. With that configuration, the processors average 80% utilization.
Database size
The size of the Tivoli Storage Productivity Center database is directly proportional to the number of records that are retained. These records include asset data from devices, file data from servers, volume data from local drives and subsystems, and performance data from subsystems and fabrics. As the number of servers, subsystems, volumes, and switches grows, more and more records are created and stored within the repository. Over time, as storage capacity grows and new volumes are constantly added, your Tivoli Storage Productivity Center database repository grows. For performance data, you are gathering data over different time intervals and expiring that data at different times. Because of differences in the types of data collected for storage devices, fabric devices, and servers, there is no straightforward formula available to size a Tivoli Storage Productivity Center database. See Chapter 16 of the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894, to get a more precise understanding of your own data size and growth requirements. The size and duration of your performance monitors can add significant quantities of data to your Tivoli Storage Productivity Center repository. For example, a small sample interval stores more data than a large one. An example of not monitoring the database size was recently seen by a Tivoli Storage Productivity Center customer whose DB2 file system for the database was found with less than 3 GB of free space. This left Tivoli Storage Productivity Center without enough space to recover disk space through normal Resource History Retention period actions. The only option available was to use DB2 administrative commands provided directly by IBM DB2 support. This activity required loss of access to the Tivoli Storage Productivity Center server during this time, and a loss of some historical data was required to shrink the space needed to support restoration of the Tivoli Storage Productivity Center environment.
Tip: The best practice for managing Tivoli Storage Productivity Center database space is to set the Resource History Retention settings, then monitor the space growth and adjust the settings as appropriate.
Data retention
The amount of time that you store and retain your information also has a significant bearing on your repository size. We recommend the following changes to the default 14-day retention values:
- Sample: Change to 30 days.
- Hourly: Change to 180 days (history for trending).
- Daily: Change to 365 days.
These values give you the capability of producing significant and detailed trending reports.
The sum of the subsystem data and switch data gives us a total of 4,916,296,000 bytes, or 4.9 GB. This is the number of bytes that are used after a year of data collection. You must remember that, in addition to this capacity, there is also an amount used for normal Tivoli Storage Productivity Center data. This number is insignificant compared to the amount of records used for performance data collections.
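The arithmetic behind such an estimate can be sketched as follows. The record size and component counts below are illustrative assumptions, not figures from this book; derive real values for your environment from the sizing guidance referenced above:

```python
# Rough repository-size estimator for one retention tier of performance data.
# The 200-byte record size and the component counts are assumptions only.

SECONDS_PER_DAY = 24 * 60 * 60

def performance_records_per_day(components, interval_minutes):
    """Number of sample records written per day for a set of components."""
    samples = SECONDS_PER_DAY // (interval_minutes * 60)
    return components * samples

def repository_bytes(components, interval_minutes, retention_days,
                     bytes_per_record=200):
    """Estimated bytes held for one retention tier (sample, hourly, or daily)."""
    per_day = performance_records_per_day(components, interval_minutes)
    return per_day * retention_days * bytes_per_record

# Example: 2,000 volumes sampled every 15 minutes, kept for 30 days,
# plus hourly rollups for the same volumes kept for 180 days.
sample_tier = repository_bytes(2000, 15, 30)
hourly_tier = repository_bytes(2000, 60, 180)
print(f"sample tier: {sample_tier / 1e9:.2f} GB")
print(f"hourly tier: {hourly_tier / 1e9:.2f} GB")
```

Summing the tiers for all monitored subsystems and switches gives a rough lower bound on repository growth, which you can then compare against actual DB2 file system usage.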
A container is a physical storage device and is assigned to a table space. A single table space can span many containers, but each container can belong to only one table space.
Basic Edition to Tivoli Storage Productivity Center Standard Edition so that you can do performance data collections and reporting, then it might be worth installing your own HBA(s) into the box, or pre-ordering the 2805-MC5 version with the internal HBA already included. You can upgrade from Tivoli Storage Productivity Center Basic Edition to Tivoli Storage Productivity Center Standard Edition for use as your main Tivoli Storage Productivity Center performance management server. For more information about SSPC, see the Redbooks publication, IBM System Storage Productivity Center, SG24-7560.
The SRA drastically simplifies the agents needed by Tivoli Storage Productivity Center to gather server information. Whether or not you plan to gather file system data or use any of the enhanced agent features, deploying the SRA provides tremendous value for the storage administrator when using the Tivoli Storage Productivity Center Topology Viewer, Data Path Explorer, or storage performance management. The key is in the server hardware platform detail, the operating system detail, and the Fibre Channel Host Bus Adapter (HBA) detail provided, which makes the visualized data meaningful. As a comparison, consider seeing only an HBA address versus seeing a server object identified as a Windows 2008 server, running SP2, with an Emulex HBA.
Important: The SRA can be used without a Tivoli Storage Productivity Center Data license, but with limited functionality. The Scan function is not available, but the Data Path Explorer can be used. This is important to know for end users that have only Tivoli Storage Productivity Center Basic Edition installed, as delivered with the SSPC. See SSPC considerations on page 40 for more details on SSPC.
Figure 2-3 shows a Tivoli Storage Productivity Center topology table view without the SRA or any Tivoli Storage Productivity Center agent installed on the attached server. While Tivoli Storage Productivity Center can visualize this server, few details are available for identification other than an HBA WWPN.
Figure 2-3 Tivoli Storage Productivity Center Computer View without an SRA Agent deployed
Figure 2-4 shows a Tivoli Storage Productivity Center topology table view with the SRA installed. In this view, many server details are exposed, such as the operating system and the installed service pack. In addition, Figure 2-5 reveals the HBA details that are available by clicking the HBA tab.
Figure 2-4 Tivoli Storage Productivity Center Computer View with SRA Agent deployed
Figure 2-5 Tivoli Storage Productivity Center Computer View with SRA Agent deployed and HBA details shown
Full details about the new Tivoli Storage Productivity Center SRA can be found in Chapter 8 of the Redbooks publication, Tivoli Storage Productivity Center V4.2 Release Guide, SG24-7894.
2.3.3 CIMOMs
CIMOMs are pieces of code that act as a proxy agent to communicate and transfer data to and from different devices. These devices can be either storage devices or fabric devices. It is important for you to understand where to get CIMOMs, which ones to use, when to use them, how to use them, where to deploy them, and how many to use, so that you get the optimal data to match your configuration. The following sections provide some recommendations and assistance.
Providers
CIMOMs are provided by the manufacturing vendor of the device. For example, a CIMOM for an IBM DS6000 storage subsystem is provided by IBM, but a CIMOM for an IBM DS5300 is provided by Engenio, as Engenio is the manufacturer of the control units in the DS5300.
As a result, it is imperative that the correct CIMOM is obtained from the device vendor. For your level of Tivoli Storage Productivity Center, refer to the compatibility matrix to check the level of CIMOM that you need. Each vendor has multiple versions of CIMOMs. There can be different releases or versions of the CIMOM, providing different functions, or there can be different CIMOMs designed for different software products, such as Tivoli Storage Productivity Center. For Tivoli Storage Productivity Center, it is important to get the correct version of the CIMOM for the specific version of Tivoli Storage Productivity Center and for the specific device type you are monitoring. CIMOMs are very easy to use, but read any available release notes or documentation to ensure that there are no conditions or restrictions with that version. Most vendors provide installation documentation for each version of their code. See this website for the CIMOM compatibility matrix for Tivoli Storage Productivity Center:
http://www-01.ibm.com/support/docview.wss?rs=1134&context=SS8JFM&context=SSESLZ&dc=DB500&uid=swg21265379&loc=en_US&cs=utf-8&lang=en
CIMOM deployment
The deployment of CIMOMs can be important, because port conflicts can occur if similar CIMOMs are installed on the same box. We recommend that each type of CIMOM be deployed on its own box to prevent not only port conflicts, but also traffic collisions as data is transmitted back to the Tivoli Storage Productivity Center server. Some vendors require the CIMOM to be placed on a server that has an HBA installed, with Fibre Channel (FC) disk volumes allocated to that server, because the CIMOM and the managed device communicate through the FC data path. Read each vendor's documentation for instructions and requirements. Some CIMOMs are provided as part of the firmware or microcode of the device. These are called embedded CIMOMs; Cisco switches are one example. If you are monitoring these devices, there is no need to install any external CIMOMs; you only have to enable the CIMOM by the vendor-provided method.
Compatibility matrix
The Tivoli Storage Productivity Center compatibility matrix is the starting point to see which devices are supported, which operating systems are supported, and which CIMOM must be used. There is a matrix for subsystem support, and another one for fabric support, including HBAs. As is the case for any software or hardware implementation, we always recommend that you get the latest version. At the time of writing, we are using Tivoli Storage Productivity Center Version 4.2.1. Following are the websites for the two compatibility matrices:
- For fabric management (supports Tivoli Storage Productivity Center V4.2.1): https://www-01.ibm.com/support/docview.wss?uid=swg27019378
- For storage device management (supports Tivoli Storage Productivity Center V4.2.1): https://www-01.ibm.com/support/docview.wss?uid=swg27019305
Duration
Within Tivoli Storage Productivity Center, you can set the duration of the performance data collection job; that is, when the job starts collecting data from the device and when it stops. You can also set the collection to run indefinitely, in which case the job continues running unless manually stopped. When a performance data collection job starts, Tivoli Storage Productivity Center queries the device, creates a table of valid resources, such as volumes, and stores it in Tivoli Storage Productivity Center memory. If you start a collection job with the indefinite value, and subsequently a volume is added to, or removed from, the resource list, Tivoli Storage Productivity Center does not know this unless one of two things occurs: either a Probe job is run, or the CIMOM is advanced enough to recognize that a new volume was created or an existing volume deleted, in which case it tells Tivoli Storage Productivity Center to do a mini internal probe to update its list. See Performance data collection on page 70 for specific information. To overcome this issue, you can set the data collection to run daily, but the duration in past releases was 23 hours. With Version 4.2.1, the new supported value is 24 hours. When the data collection automatically starts again, a new table is built with the changed devices. The reason that we can now run a 24-hour performance monitor is that the Tivoli Storage Productivity Center for Disk code was enhanced in 4.2.1 to support shutdown and restart without the long delays seen in prior versions. This allows for true 24-hour-a-day performance data collection, and the ability to recover from configuration changes that are introduced into a storage environment outside of the Tivoli Storage Productivity Center provisioning management interface.
Collection intervals
Storage subsystems from different vendors, or even systems from the same vendor, can have different capabilities or limits as to the sample interval they can support. For your subsystems, check which collection intervals are available. Set your interval time according to the purpose of the data report:
- If you are producing reports over a long period of time, you might want to select a large interval, for example, 30 minutes or 1 hour. This gives you fewer data points on your reports and thus does not make the report look too busy.
- If you are recording the data for problem determination, you need as small an interval as possible, to give you a detailed report to help you analyze the problem.
46
If you are monitoring the environment to help you set your SLAs, you will probably set the interval at 15 minutes. When you are creating your original measurement for your baseline, you can start at 15 minutes, and then change to 5 minutes to refine your true baseline value. See Creating a baseline with Tivoli Storage Productivity Center on page 68 for specifics on baseline creation.
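The trade-off between interval length and report density is simple arithmetic; the following sketch shows how many data points each interval produces per component per day (the interval values are examples, not limits of any particular subsystem):

```python
# Data points produced per component per day for various sample intervals.
MINUTES_PER_DAY = 24 * 60

def points_per_day(interval_minutes):
    """Samples collected per component in one day at a given interval."""
    return MINUTES_PER_DAY // interval_minutes

for interval in (5, 15, 30, 60):
    print(f"{interval:>2}-minute interval: {points_per_day(interval):>4} points/day")
```

A 5-minute interval yields 288 points per day per component, versus 24 at 1 hour, which is why small intervals suit problem determination and large intervals suit long-term trend reports.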
Fabric considerations
When you create performance data collection jobs for your fabric, there is not the same level of complexity or flexibility as with subsystem performance data collection. In a fabric collection, there are no dynamic changes that need to be refreshed, so the collection job period can be set to indefinite. One point has been raised with modern fabric directors: it is quite common to hot-insert fabric blades into running directors, and when this happens, Tivoli Storage Productivity Center needs to become aware of the change. If you are running your fabric monitor indefinitely, you have to stop and restart the monitor after you run a configuration probe on the fabric involved. Alternatively, you can change the fabric monitors to behave like the storage performance monitors and have them run on a 24-hour basis as well. The selection is yours.
Attention: When SNMP has been set up on a fabric switch and the alerts are sent to Tivoli Storage Productivity Center, a Discovery process is initiated to update the switch status and record switch changes. This Discovery job refreshes the topology view with the changed information, as well as updating any out-of-band agents.
Change in environment
Because Tivoli Storage Productivity Center reports on device resources at a volume or port level, if a volume is removed or a port is unplugged, records showing zero activity are put in the repository. This is an accurate representation of the data, but for removed volumes, it can give your reports a misleadingly negative look. When a probe or a CIM indication (change in status) is actioned, the reporting then removes the old volume from reports. If you see this as a problem, you can set up scripts to troll through the Tivoli Storage Productivity Center server logs and look for zero-performance devices. You can then initiate a manual Probe to remove the volume from Tivoli Storage Productivity Center.
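Such a script can be quite small. The sketch below scans log-like text for volumes whose every sample shows zero activity; the log format, field names, and volume IDs are hypothetical assumptions, so adapt the pattern to what your Tivoli Storage Productivity Center server logs actually contain:

```python
import re

# Hypothetical log extract: the real server log format differs;
# adapt PATTERN to your own logs.
LOG = """\
2011-09-01 10:00 PERF vol0001 totalIORate=512.0
2011-09-01 10:00 PERF vol0002 totalIORate=0.0
2011-09-01 10:05 PERF vol0001 totalIORate=498.5
2011-09-01 10:05 PERF vol0002 totalIORate=0.0
"""

PATTERN = re.compile(r"PERF\s+(\S+)\s+totalIORate=([\d.]+)")

def zero_activity_volumes(log_text):
    """Return volumes whose every sample in the log shows a zero I/O rate."""
    seen, nonzero = set(), set()
    for match in PATTERN.finditer(log_text):
        volume, rate = match.group(1), float(match.group(2))
        seen.add(volume)
        if rate > 0:
            nonzero.add(volume)
    return sorted(seen - nonzero)

print(zero_activity_volumes(LOG))  # candidates for a manual Probe
```

Volumes that the script flags are the candidates for the manual Probe described above.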
[Figure: example Tivoli Storage Productivity Center lab environment topology. Servers: tpcblade3-7, tpcblade3-11 (SRA agent), brees (CIMOM agent), texas (DCFM), jumbo, rocky. Storage virtualizers: Storwize V7000-2076 Ford1_tbird-IBM and SVC-2145-svc1-IBM, with volumes and back-end volumes. Storage subsystems: DS8000-21071301901-IBM (NAPI), DS8000-21071302541-IBM (NAPI), XIV-281060000646-IBM (NAPI), and DS5300 tpc5k (LSI CIMOM).]
The sequence of steps is important for a successful Tivoli Storage Productivity Center implementation. We recommend the following process:
1. Understand your environment; it is important to understand what hardware components you have, and to set them up correctly:
   a. Plan the Tivoli Storage Productivity Center installation.
   b. Plan your implementation.
   c. Configure the server.
   d. Install the Tivoli Storage Productivity Center Standard Edition components.
2. Ensure that you have the correct firmware, microcode, or CIMOM levels. Many functions are only supported by specific levels. The Tivoli Storage Productivity Center compatibility matrix website is shown in Compatibility matrix on page 45.
3. Install NAPI-attached storage devices by using Disk Manager. After installation, run a configuration probe to gather configuration data.
   Tip: Create a separate configuration probe per storage or fabric device, because this allows Tivoli Storage Productivity Center to utilize multiple job threads for this task; a large storage or switch device can take considerable time to complete.
4. Install the CIMOMs for your devices as needed; these are in no specific order. As part of the CIMOM installation, register the devices for which you will collect performance data to the CIMOM. See CIMOM recommended capabilities on page 44 for the recommended number of CIMOMs per Tivoli Storage Productivity Center instance:
   - Engenio
   - Brocade/DCFM
   - McData
   - Cisco
   - NetApp
   - Other vendors
5. Register the CIMOMs to Tivoli Storage Productivity Center. This can be performed manually, or in some cases, Tivoli Storage Productivity Center can discover them using the Autodiscovery feature.
6. Install SRAs, or utilize the older Data and Fabric agents if you are upgrading from an old Tivoli Storage Productivity Center version. These can communicate in-band or out-of-band, depending on your Fibre Channel connections.
   Communications: In-band communication means that device communications to the network management facility travel directly across the Fibre Channel transport, most commonly by using the Small Computer System Interface (SCSI) Enclosure Services (SES), and they require no LAN connections.
7. Discover the fabric.
8. Probe your devices.
9. Set up your storage and fabric performance monitors.
10. Set up your alerts and thresholds based on your initial performance collection, according to your SLAs.
11. Look at the results of your data collection jobs, and compare them to your expectations and your SLAs.
Tivoli Storage Productivity Center instance guidelines on page 40 shows the recommended limits to use when defining the number of CIMOMs per Tivoli Storage Productivity Center instance. If you have an existing Tivoli Storage Productivity Center implementation, we recommend that you review your current repository size, and then calculate your expected growth to make sure you have enough space. Use the formulas given in Database size on page 36 to help you understand this.
Part 3
Chapter 3. General performance management methodology
3.1.2 Metrics
In this section, we illustrate in detail how Tivoli Storage Productivity Center gathers and processes performance data collected through the Native Application Interface (NAPI) or a Common Information Model Object Manager (CIMOM), and how this information becomes available for your reports.
Overview of metrics
A metric in this context is a unit of measurement. The device maintains the statistics counters so that Tivoli Storage Productivity Center can gather them using the NAPI or a CIMOM agent. These counters are then used to calculate new values, which are called metrics. Technically, the counters are in the microcode of the storage subsystem or SAN switches. The counters are usually monotonically increasing, so it is necessary to take the delta between two sets of counters (combined with the time) to convert the counters into values, such as I/O rates. These become the metrics. It is also possible to use two or more metrics to derive other metrics.
Even so, many people call the metrics counters. In most cases, this term is acceptable. Figure 3-1 shows the value calculations for metrics in general. The value is always an average over a period of time.
Example 3-1 shows a simple equation for this metric: the delta of the counter that counts the number of I/Os is divided by the interval length to calculate the I/Os per second (IOPS).
Example 3-1 I/O rate equation
I/O Rate = (Number of I/Os at T2 - Number of I/Os at T1) / Interval length
This equation already shows one potential problem: a counter cannot increase indefinitely, so eventually every counter wraps and starts at 0 again. This creates a spike in the data, because the deltas become too large for Tivoli Storage Productivity Center to manage when the counter resets from a relatively large value to zero. Other events can also lead to spikes, for example, a restart of the subsystem or the CIMOM/NAPI, bugs in the subsystem or the CIMOM/NAPI, or a failover of the controllers of a subsystem. Tivoli Storage Productivity Center tries to detect these situations and discard all the data from the affected sample. This is necessary because the false values would otherwise aggregate into the hourly and daily values, which is far worse than simply disregarding suspicious data.
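The delta logic, including the handling of counter wrap and suspicious samples, can be sketched as follows. The 32-bit counter width and the plausibility threshold are illustrative assumptions; the product's actual detection logic is internal to Tivoli Storage Productivity Center:

```python
COUNTER_MAX = 2**32  # assumed counter width; real devices vary

def io_rate(count_t1, count_t2, interval_seconds, max_plausible_rate=1_000_000):
    """Convert two monotonically increasing I/O counters into an I/O rate.

    Returns None when the sample looks suspicious (counter reset,
    subsystem or CIMOM restart), so it can be discarded rather than
    aggregated into the hourly and daily values.
    """
    delta = count_t2 - count_t1
    if delta < 0:
        # Counter wrapped (or was reset); assume a single 32-bit wrap.
        delta += COUNTER_MAX
    rate = delta / interval_seconds
    if rate > max_plausible_rate:
        return None  # implausible spike: discard the sample
    return rate

print(io_rate(1_000, 76_000, 300))        # normal 5-minute sample: 250.0 IOPS
print(io_rate(2**32 - 500, 74_500, 300))  # wrapped counter, still 250.0 IOPS
print(io_rate(4_000_000_000, 10, 60))     # mid-interval reset: discarded (None)
```

Note that a genuine wrap and a mid-interval reset look identical in the raw counters, which is why a plausibility check on the resulting rate is needed at all.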
Tivoli Storage Productivity Center can report on many different performance metrics, which indicate the particular performance characteristics of the monitored devices. See Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331 for a complete list of performance metrics that Tivoli Storage Productivity Center supports.
Throughput
Throughput is measured and reported in several different ways. There is throughput of an entire box (subsystem), or controller (ESS/DS6000/DS8000/XIV/SMI-S Block Server Performance), or module (XIV), or of each I/O group or node (SVC/IBM Storwize V7000). There are throughputs measured for each volume (LUN), throughputs measured at the Fibre Channel interfaces (ports) and on Fibre Channel switches, and throughputs measured at the RAID array after cache hits have been filtered out. These are the main front-end throughput metrics:
- Total I/O Rate (overall)
- Read I/O Rate (overall)
- Write I/O Rate (overall)
- Total Data Rate (overall)
- Read Data Rate (overall)
- Write Data Rate (overall)
These are the main back-end throughput metrics:
- Total Back-End I/O Rate (overall)
- Back-End Read I/O Rate (overall)
- Back-End Write I/O Rate (overall)
- Back-End Total Data Rate (overall)
- Back-End Read Data Rate (overall)
- Back-End Write Data Rate (overall)
Response time
Response time is closely related to throughput and cache hits. It is desirable to track any growth or change in the rates and response times. Frequently, the I/O rate grows over time, and response time increases as the I/O rates increase. This relationship is what capacity planning is all about. As I/O rates increase, and as response times increase, you can use
these trends to project when additional storage performance (as well as capacity) is required. In Chapter 6, Using Tivoli Storage Productivity Center for capacity planning management on page 305, we discuss the approach of using Tivoli Storage Productivity Center for capacity planning. These are the corresponding front-end response time metrics:
- Overall Response Time
- Read Response Time
- Write Response Time
These are the corresponding back-end response time metrics:
- Overall Back-End Response Time
- Back-End Read Response Time
- Back-End Write Response Time
Depending on the particular storage environment, the throughput or response times might change drastically from hour to hour, or day to day. There can be periods when the values fall outside the expected range. In that case, the metrics related to cache hit rate can be used to understand what is happening. Cache hit rate is the number of times that an I/O request, either read or write, was satisfied from the device cache or memory (typically shown as a percentage). In addition, you might find that the storage transfer size for the application I/O has changed, due to an application or database tuning activity. This can alter the application workload and, if corresponding tuning on the storage subsystem was not accounted for, can be the cause of reduced application response time performance.
Tip: Large transfer sizes usually indicate more of a batch workload, in which case the overall data rates are more important than the I/O rates and the response times.
There are a few metrics for which thresholds must be monitored:
- Total I/O Rate Threshold
- Total Back-End I/O Rate Threshold
- Overall Back-End Response Time Threshold
- Write-cache Delay Percentage Threshold
SAN switch
For switches, the important metrics are Total Port Packet Rate, Total Port Data Rate, Port Send Data Rate, and Port Receive Data Rate, which show the traffic pattern over a particular switch port. When frames are lost between the host and the switch port, or between the switch port and a storage device, the dumped frame rate on the port can be monitored.
Utilization metrics
To monitor the environment, some additional useful utilization metrics exist. These metrics are expressed as percentages:
- CPU Utilization (SVC and Storwize V7000 only)
- Volume Utilization
- Disk Utilization Percentage (ESS, DS6000, and DS8000 only)
- Port Send Utilization Percentage
- Port Receive Utilization Percentage
- Port Send Bandwidth Percentage (also available for SAN switches)
- Port Receive Bandwidth Percentage (also available for SAN switches)

Important: Monitor the relevant patterns over time for your environment, develop an understanding of expected behaviors, and investigate deviations from normal patterns to get early warning of abnormal behavior or to identify trends in workload changes. This is possible only if a solid baseline is available. In 3.3, Creating a baseline with Tivoli Storage Productivity Center on page 68, we describe what creating a baseline means, and how Tivoli Storage Productivity Center helps you to define it.
The appropriate value might also change between shifts or on the weekend. A response time of 5 milliseconds might be required from 8 a.m. until 5 p.m., while 50 milliseconds is perfectly acceptable near midnight. It is all customer and application dependent. The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days range from 4 to 8 msec, though in practice, many workloads do better than nominal. It is not uncommon for applications to experience from 1/3 to 1/2 of the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect a 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization has a wait time roughly equal to its service time, so a 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications. With solid state disks (SSDs), response times below 2 msec can be expected. Because SSDs are still very expensive, their use is currently reserved for the most demanding environments. Alternatively, current storage subsystems offer a mix of traditional disks and SSDs and use sophisticated software to optimize the utilization of these components, which is a complex process. In this way, fast response times and large capacity can be achieved at moderate cost.
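The service time arithmetic above can be sketched numerically. The seek, RPM, and transfer values below are illustrative assumptions, and the queueing step uses a simple M/M/1 approximation rather than anything specific to a given subsystem.

```python
# Illustrative numbers only: approximate disk response time from the
# seek + latency + transfer model in the text, plus an M/M/1 queueing estimate.
def disk_service_time_ms(seek_ms, rpm, transfer_ms):
    latency_ms = 0.5 * (60000.0 / rpm)  # half a rotation, in milliseconds
    return seek_ms + latency_ms + transfer_ms

def response_time_ms(service_ms, utilization):
    # M/M/1: wait = service * util / (1 - util); at 50% util, wait == service
    return service_ms * (1.0 + utilization / (1.0 - utilization))

svc = disk_service_time_ms(seek_ms=4.0, rpm=15000, transfer_ms=0.5)  # 6.5 ms
resp = response_time_ms(svc, 0.5)                                    # 13.0 ms
```

With an effective 4 msec seek, a 15K RPM disk lands at about 6.5 msec service time, and at 50% utilization the queueing estimate doubles that to 13 msec, consistent with the 10-14 msec guideline above.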
Application response
There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric can drive response times much higher than 30 msec. Appendix A, Rules of Thumb and suggested thresholds on page 327 summarizes some Rules of Thumb that can be used as a basis for performance problem determination.
This situation can happen with all storage subsystems or virtualization engines (IBM or non-IBM). The reason is often related to the internal cache management of the device. When I/O rates are low for a particular volume, it is usually the case that the volume is completely idle for an extended period of time, perhaps a minute or more. This can cause the device to flush the cache of that volume to disk, to free up valuable cache space for other volumes that are not idle. However, the first I/O that arrives for such a volume after an idle period requires the cache to be re-initialized and requires proper cache synchronization to be achieved across the redundant controllers or nodes of the storage subsystem. This process can be expensive in terms of performance, and can cause significant delays (sometimes multiple seconds) for that first I/O. In addition, for write I/O, the volume can operate in write-through mode until the cache has been fully synchronized, which causes a further slowdown because each write is reported as complete only after the update has been written to the back-end disk(s). Normally, each write is reported as complete as soon as the update has been written to cache, which is of course several orders of magnitude faster. As a result, depending on the caching scheme of the storage subsystem, idle or almost idle volumes can have extremely high response times. This is generally nothing to worry about unless the application performance is affected.

Important: If a high Response Time with a low I/O Rate is detected frequently by your threshold alert configuration, consider modifying the alert definition to define a Response Time threshold with an additional filtering option on I/O Rate. See 3.4.4, Defining the alerts on page 80 for details. This prevents your Response Time alert from triggering when the corresponding I/O Rate is very low, avoiding too many false-positive alert notifications.
In addition to throughput graphs, you can also produce graphs (or tabular data) for any of the metrics that you might have selected. For example, if the Write Response Time becomes high, you might want to look at the NVS Full metric for various components, such as the Volume or Disk Array. The Read, Write, and Overall Transfer Sizes are useful for understanding throughput and response times, and provide useful information for modeling tools like Disk Magic. In fact, the main inputs for Disk Magic are readily available in various performance reports: the Total I/O Rate, Read Percentage, Read Sequential and Read Hit Percentages, and Average Transfer Size. The data rate information in the performance reports (as well as Response Times, if available) can be used to calibrate Disk Magic model results. For most components, whether subsystem, controller, array, or port, there can be expected limits to many of the performance metrics. But there are few Rules of Thumb, because so much depends on the nature of the workload. Online Transaction Processing (OLTP) is so different from backup (such as IBM Tivoli Storage Manager backup) that the expectations cannot be similar. OLTP is characterized by small transfers; consequently, data rates might be lower than the capability of the array or box hosting the data. TSM backup uses large transfer sizes, so the I/O rates might seem low, yet the data rates test the limits of individual arrays (RAID ranks). Each storage subsystem also has different performance characteristics: from XIV, Storwize V7000, SVC, N series, DS4000, DS5000, and DS6000 to DS8000 models, each box can have different expectations for each component. The best Rules of Thumb are derived from looking at current (and historical) data for configurations and workloads that are not getting complaints from their users.
From this performance base, you can do trending, and in the event of performance complaints, look for the changes in workload that can cause them.
Small block reads (4-8 KB/op) must have average response times in the 2 msec to 15 msec range. The low end of the range comes from a very good Read Hit Ratio, while the high end of the range can represent either a lower hit ratio or higher I/O rates. Average response times can also vary from time interval to time interval. It is not uncommon to see some intervals with higher response times.
Small block writes must have response times near 1 msec. These must all be writes to cache and NVS and be very fast, unless the write rate exceeds the NVS and rank capabilities. Later we discuss performance metrics for these considerations.
Large reads (32 KB or greater) and large writes often signify batch workloads or highly sequential access patterns. These environments often prefer high throughput to low response times, so there is no guideline for these I/O characteristics. Batch and overnight workloads can tolerate very high response times without indicating problems.
Read Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For enterprise database servers, cache hits typically come from sequential I/O workloads, where pre-fetch loads data into the cache ahead of the reads. For very low hit ratios, you need many ranks providing good back-end response time.
SAN Storage Performance Management Using Tivoli Storage Productivity Center
It is difficult to predict whether more cache might improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open System workloads), although larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations. For random read I/O, the back-end rank (disk) read response times should seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-End Write Response Times can be higher because of RAID 5, RAID 6, or RAID 10 algorithms, but should seldom exceed 80 msec. There can be some time intervals when response times exceed these guidelines.
RAID array
RAID arrays also have IO/sec limitations that depend on the type of RAID (for example, RAID 5 versus RAID 10), the disk type, and the number of disks in the array. Because of the different RAID algorithms, it is not easy to know how many I/Os are actually going to the back-end RAID arrays. For many RAID 5 subsystems, a worst case can be approximated as the back-end read rate plus 4 times the back-end write rate (R + 4 * W), where R and W are the back-end read and write rates. Sequential writes can behave considerably better than the worst case. Use care when trying to estimate the number of back-end operations to a RAID array. The performance metrics seldom report this number precisely; you have to use the number of back-end read and write operations to deduce an approximate back-end Ops/sec number. The RAID array I/O limit depends on many factors, chief among them the number of disks in the array and the speed (RPM) of the disks. But when the number of IO/sec to an array (array size 7 or 8 disks) is near or above 1000, the array can be considered very busy. For 15K RPM disks, the limit is a bit higher. But these high I/O rates to the back-end array are not consistent with good performance. They imply that the back-end arrays are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy arrays. For a little more precision (but dubious accuracy), Table 3-1 shows the upper limit of performance for 10K and 15K RPM enterprise class devices using RAID 5 with 7 or 8 disks per array. Be aware that different people have different opinions about these limits, but rest assured that all these numbers represent very busy DDMs.
Table 3-1 Disk performance limits

  DDM speed   Max Ops/sec   6+P Ops/sec   7+P Ops/sec
  10 K RPM    150 - 175     1050 - 1225   1200 - 1400
  15 K RPM    200 - 225     1400 - 1575   1600 - 1800
While disks can achieve these throughputs, they imply a lot of queueing delay and high response times. These ranges probably represent acceptable performance only for batch oriented applications, where throughput is the paramount performance metric. For Online Transaction Processing (OLTP) applications, these throughputs might already have unacceptably high response times. Because 15K RPM DDMs are most commonly used in OLTP environments where response time is at a premium, here is a simple Rule of Thumb. Rule of Thumb: If the array using RAID 5 with 7+P is doing more than 1000 Ops/sec, it is too busy, no matter what the RPM.
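The worst-case RAID 5 estimate described in the text (R + 4 * W) can be written as a small helper; the function name and the sample rates are invented for illustration.

```python
# Sketch of the worst-case RAID 5 back-end estimate from the text:
# reads pass through 1:1, and each destaged write can cost up to 4 disk
# operations (read data, read parity, write data, write parity).
def raid5_backend_ops(read_rate, write_rate):
    """Approximate physical disk ops/sec behind a RAID 5 array, given
    back-end read and write rates in ops/sec."""
    return read_rate + 4 * write_rate

ops = raid5_backend_ops(read_rate=600, write_rate=150)  # 1200 ops/sec
```

At 1200 estimated back-end ops/sec, a 7+P array would already exceed the 1000 ops/sec Rule of Thumb above and should be considered too busy.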
For batch applications, you can notice low I/O rates and high response times. In these cases, the response time is not the appropriate performance metric. Rather, the throughput in megabytes per second is most important, and maximizing this metric can drive response times much higher than 30 msec. If available, it is the average front-end response time that really matters.

Tip: A safe and sane limit for physical Ops to the RAID arrays is closer to 100 Ops/sec per disk, which for typical 6+P and 7+P RAID 5 arrays translates to 700 or 800 Ops/sec per RAID 5 array.

In addition to these enterprise class drives, near-line drives (formerly known as SATA drives) of high capacity (currently 1 or 2 TB) and somewhat lower performance capabilities are now becoming options in mixtures with higher performing, enterprise class drives. These are definitely considered lower performance, capacity oriented drives, and have their own limits, as shown in Table 3-2.
Table 3-2 High capacity disk performance limits

  DDM speed   Max Ops/sec   6+P Ops/sec   7+P Ops/sec
  7.2 K RPM   85 - 110      595 - 770     680 - 880
These drive types must have limited exposure to enterprise class workloads, unless included in storage subsystems such as the IBM XIV Storage System, and the guidelines might be subject to substantial revision based on field experience. Another newer disk type, especially for high IO/sec workloads, is the Solid State Drive (SSD). In addition to better IO/sec performance, Solid State Drives offer a number of potential benefits over electromechanical Hard Disk Drives, including better reliability, lower power consumption, less heat generation, and lower acoustical noise. From a cost perspective, SSDs are much more expensive per GB but cheaper per I/O than electromechanical Hard Disk Drives. An important observation from test results is that the performance improvement with SSDs for large block writes is not as remarkable as seen with reads or with small block I/O in general. For example, while SSDs provide about 20 times the throughput of 15K RPM HDDs for 4 KB reads, the difference is only about 2 times for large block writes. This is a property of enterprise SSDs and not specific to the DS8000. Thus the best use cases for SSDs tend to be small block I/Os with a higher percentage of reads. Be aware that by using SSD disks, the performance bottleneck might move to other components within the storage subsystem, such as device adapters, controllers, or SAN ports. Because different types of SSDs exist with different performance characteristics, as well as different implementations and usages of SSDs, it is difficult to recommend a general IO/sec limit.

Today's enterprise class storage subsystems normally balance performance requirements over several disk arrays. For example, an SVC cluster uses several Managed Disks within a Managed Disk Group to stripe volumes over the whole back-end storage. In that way, all disk arrays are utilized equally, which avoids hot arrays.
To calculate the total I/O rate capability of such a Managed Disk Group, based on the physical disks, you can use the following formula (see the white paper EMC Symmetrix or DMX storage Controller Best practices when attached to IBM System Storage SAN Volume Controller (SVC) v4.2.x or later clusters for more details):

Formula: P = n((D * Q)/(R+(W*4)))

Where:
- P = total I/O capability
- n = number of Managed Disks
- D = IOPS capability of a disk
- Q = quantity of physical disks per Managed Disk
- R = read workload percentage
- W = write workload percentage
- 4 = RAID 5 write penalty

For example, we have a DS8000 with 48 disk arrays as back-end storage: 24 arrays have 7 disks, and 24 have 8 disks. For the disk type, we use 15K RPM DDMs. We assume a total workload of 80% read and 20% write. By using this formula, we get a physical disk performance capability (back end) for the Managed Disk Group of 45,000 IO/sec (see Example 3-2).
Example 3-2 Calculate total back-end I/O rate capability
45000 = 24*((200*8)/(0.8+(0.2*4))) + 24*((200*7)/(0.8+(0.2*4)))

This calculated Total Back-End I/O Rate does not reflect any cache hits (on the SVC or on the DS8000). Therefore, the front end (SVC) can sustain more I/O than this number. Because cache hits cannot be determined in advance, we recommend using the Total Back-End I/O Rate for capacity planning.
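The same calculation can be expressed in code. This is a sketch of the formula from the text; the function name and the grouping of arrays into tuples are our own, and the 200 IOPS per 15K RPM DDM comes from Table 3-1.

```python
# The P = n((D * Q)/(R + (W * 4))) formula from the text, applied per group
# of identical arrays: D = per-disk IOPS, Q = disks per Managed Disk,
# R/W = read/write fractions, 4 = RAID 5 write penalty.
def mdisk_group_capability(groups, read_fraction, write_fraction, disk_iops):
    """groups: list of (number_of_mdisks, disks_per_mdisk) tuples.
    Returns the total back-end I/O rate capability in ops/sec."""
    denom = read_fraction + write_fraction * 4
    return sum(n * (disk_iops * q) / denom for n, q in groups)

# DS8000 example from the text: 24 arrays of 8 disks plus 24 arrays of
# 7 disks, 15K RPM DDMs (~200 IOPS each), 80% read / 20% write workload.
total = mdisk_group_capability([(24, 8), (24, 7)], 0.8, 0.2, 200)  # 45000
```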
Figure 3-3 Performance Report Option in Tivoli Storage Productivity Center for Disk
It can then be useful to track any growth or change in the rates and response times. Frequently, the I/O rate grows over time, and response time increases as the I/O rates increase. This relationship is what capacity planning is all about. After a baseline is defined, and you know the expectations for your environment (for example, the maximum total I/O rate expected on an MDisk), as I/O rates and response times increase, you can use these trends to project when additional storage performance (as well as capacity) is required, or when alternative application designs or data layouts are needed. Typically, throughput and response time change drastically from hour to hour, day to day, and week to week. This is usually a result of different workloads between first or third shift production, or business cycles like month-end processing versus normal production. There can be periods when the values lie outside the expected range of values, and the reasons are not clear. Then you can use the other performance metrics to try to understand what is happening. The following additional metrics can be used to make sense of throughput and response time:
- Total Cache Hit Percentage: the percentage of reads and writes that are handled by the cache without needing immediate access to the back-end disk arrays.
- Read Cache Hit Percentage: focuses on reads, because writes are almost always recorded as cache hits. If Non-Volatile Storage (NVS) is full, a write can be delayed while some changed data is destaged to the disk arrays to make room for the new write data in NVS.
- Write-cache Delay Percentage: refers to Non-Volatile Storage for writes.
- Read Transfer Size (KB/op): the average number of bytes transferred per read operation.
- Write Transfer Size (KB/op): the average number of bytes transferred per write operation.
Utilization metrics
In addition, Tivoli Storage Productivity Center offers different utilization metrics that help to identify the current and historical utilization of a device. These are the most important metrics:
- CPU Utilization (%): average utilization percentage of the processors (SVC, Storwize V7000).
- Disk Utilization (%): the approximate utilization percentage of a rank over a specified time interval (the average percentage of time that the disks associated with the array were busy).
- Port Send Utilization Percentage (%): average amount of time that the port was busy sending data over a specified time interval.
- Port Receive Utilization Percentage (%): average amount of time that the port was busy receiving data over a specified time interval.
- Volume Utilization (%): the approximate utilization percentage of a volume over a specified time interval (the average percentage of time that the volume was busy). Tivoli Storage Productivity Center calculates this value in two steps:
  1. Population = Average I/O Rate * Average Response Time / 1000
  2. Utilization = 100 * Population / (1 + Population)

Tip: The new Volume Utilization metric, which is available for all storage subsystems, can provide a quick view into hot volumes as seen by servers and can be used as a starting point for performance analysis. This metric allows you to display a combination of two important metrics in a single report.

There are many more metrics available through Tivoli Storage Productivity Center, but these are the important ones for understanding throughput and response time.
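The two-step Volume Utilization calculation can be sketched directly; the function name and sample values are ours, but the arithmetic follows the text (the population term is effectively Little's law: the mean number of I/Os in flight).

```python
# The Volume Utilization calculation described in the text, as a sketch.
def volume_utilization(avg_io_rate, avg_response_ms):
    """avg_io_rate in ops/sec, avg_response_ms in milliseconds.
    Population = rate * response / 1000 (mean I/Os in flight);
    Utilization = 100 * population / (1 + population), as a percentage."""
    population = avg_io_rate * avg_response_ms / 1000.0
    return 100.0 * population / (1.0 + population)

# A volume doing 200 ops/sec at 5 ms average response time:
util = volume_utilization(avg_io_rate=200, avg_response_ms=5)  # 50.0 (%)
```

Note how the combination works: a volume can reach the same utilization through a high rate with fast responses or a low rate with slow responses, which is why this single metric is a useful hot-volume screen.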
Constraint violations
Another way to use Tivoli Storage Productivity Center to monitor performance is through the use of constraint (or threshold) violations. In 3.5.5, Constraint Violations reports on page 113, we illustrate in detail how to manage this special kind of report. There are a limited number of performance metrics for which you can set constraints. Several very useful throughput metrics can be monitored. Back-End Response Time (available for most IBM storage boxes) is the time to do staging or destaging between cache and disk arrays. Particularly useful thresholds include these:
- Total I/O Rate Threshold
- Total Back-End I/O Rate Threshold
- Write-cache Delay Percentage Threshold
Chapter 3. General performance management methodology
- Overall Back-End Response Time Threshold (a very important threshold)
- CPU Utilization Threshold (for SVC and Storwize V7000)
- Disk Utilization Percentage Threshold
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold

Remember that the back-end I/O rate is the rate of I/O between the cache and the disk RAID ranks in the back end of the storage. For storage boxes that support this metric, it typically includes disk reads from the array to cache caused by a read miss in the cache. Disk write activity from cache to disk array is normally an asynchronous operation that moves changed data from cache to disk, freeing up space in the NVS. The Back-End Response Time is averaged together with the response time for cache hits to give the Overall Response Time mentioned earlier. Always be clear whether you are looking at throughput and response time at the front end (very close to system-level response time as measured from a server) or at the back end (just between cache and disk).

There are other useful constraint thresholds, such as the Port Response Time thresholds, but they do not usually impact the throughput and response time from disk storage. When these thresholds are triggered, it is usually from a problem in the path between the servers and storage. When there are throughput or response time anomalies, you can be led rather naturally to look at performance reports for other metrics and other resources, such as Write-cache Delay Percentage, or the performance of individual RAID ranks, or particular volumes in critical applications.

Important: There is more to performance management than defining absolute threshold values for performance metrics. The key is to monitor normal operations, develop an understanding of expected behavior, and then track the behavior for either performance anomalies or simple growth in the workload.
This historical performance information is the main source for an effective baseline, and is the best source of data for any Performance Management environment.
Daily analysis is one of the major tasks for performance management. The baseline is the foundation for daily analysis. Through careful daily analysis, you can get an in-depth view of how your storage subsystem performs: changes in performance status are revealed, and anomalies are discovered.

Before carrying out a daily analysis, have a clear idea about what type of performance status you expect to see. Only with this premise can you know how to compare the current data with the baseline, how to judge whether the current status is acceptable, and how to choose the direction for further analysis if there is a significant difference between the current status report and the baseline, or if constraint threshold violations occur on a regular basis.

Every storage administrator has an expectation for the performance status of their storage subsystem before putting it into production. In order to meet this expectation, different configurations are made and different solutions are designed according to the workloads that the devices support. After the configurations are implemented and repeatedly tuned, and after the performance status becomes stable, the baseline of the performance status is set. Then, further daily analysis can be carried out to check whether the performance status is as expected, whether it continues to follow the patterns of the baseline, and whether the original expectation for the performance configuration is still valid. In all, you must have a basic understanding of how to configure a storage subsystem to meet the performance expectation.

The general method is to store regular performance reports under normal working conditions when there are no complaints raised by users. Then, when problems occur, you can compare the report generated in the timeframe when the user complained with the reports you generate now to analyze what happened.
With Tivoli Storage Productivity Center, you need to set up a performance data collection job for the device and think about the polling intervals, intervals to be skipped, and the data retention period. Note that retention cannot be set per storage subsystem. For certain situations, you might be able to get around this limitation by using the Tivoli Storage Productivity Center batch reporting function and storing the data outside of Tivoli Storage Productivity Center as a comma-separated value (CSV) text file or HTML report. When you need to bring up the baseline, you can either use the Tivoli Storage Productivity Center GUI, or, if that does not provide the required graphical reports, you can use TPCTOOL together with Excel to extract and display the baseline. See Appendix C, Reporting with Tivoli Storage Productivity Center, CLI: TPCTOOL as a reporting tool on page 370.

The baseline implementation with Tivoli Storage Productivity Center passes through the following main steps:
1. Set up the performance collection tasks. In 3.4, Performance data collection on page 70, we explain how to plan, define, and run a performance data collection.
2. Analyze the data to get familiar with the workload of the subsystems. In 3.5, Tivoli Storage Productivity Center performance reporting capabilities on page 92, we show all the Tivoli Storage Productivity Center reporting capabilities and how to use them.
3. Finally, define alerts if required. In 3.4.4, Defining the alerts on page 80, we explain how to define alerts in Tivoli Storage Productivity Center performance data collection jobs.

In order to get familiar with the workload of the subsystem, you have to gather performance data for an extended period, so that you can detect certain workload patterns. We cannot give recommendations for how long you need to collect data.
As a starting point, let the collection run for at least a week in order to account for more accurate daily data as well as changes in the workload between weekdays and weekends.
Needless to say, if you run into a problem before you establish your baseline, you still can use Tivoli Storage Productivity Center to diagnose the problem, but you cannot be sure that the problem that you think you see is really something new.
3.4.1 Planning
In this planning section, we discuss what you need to consider in order to decide how you will use Tivoli Storage Productivity Center. In the following section, we show you how to set up Tivoli Storage Productivity Center. Tivoli Storage Productivity Center uses the new NAPI for some storage devices (DS8000, XIV, SVC, and Storwize V7000) and also still uses the SMI-S standard for getting performance data from other storage devices. The SMI-S standard defines a polling mechanism to gather the data from the CIMOMs (the standard does not define how the CIMOMs get the data from the devices). With this polling approach, it is important that you consider each of the topics in this section:
- Devices to include
- Polling interval
- Scheduling
- Data retention
- Alerting
If you need to know more about Tivoli Storage Productivity Center installation and configuration, see the following resources:
- Tivoli Storage Productivity Center Version 4.2.1: Installation and Configuration Guide, SC27-2337-04:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_installguide_v421.pdf
- Tivoli Storage Productivity Center Hints and Tips:
  http://www-01.ibm.com/support/docview.wss?uid=swg27008254&aid=1
CIMOM sizing
If you use an external CIMOM, it can collect performance data from multiple storage subsystems at the same time, through separate performance data collection jobs that can even have different interval settings. Obviously, if you use one CIMOM to monitor multiple storage subsystems, the computer where this CIMOM resides must handle more workload. The real workload is driven by the number of volumes on which each CIMOM reports. This is an important issue that you must address by proper sizing. Regarding the number of devices and volumes per CIMOM, follow the general rules described in CIMOM recommended capabilities on page 44.
We have listed the supported intervals for some subsystems and SAN switches in Table 3-3.
Table 3-3 Supported CIMOM/NAPI intervals (examples)

  Subsystem/Fabric CIM/NAPI                               Minimum interval
  NAPI (SVC, Storwize V7000, DS8000, XIV)a                5 minutes
  IBM CIM Agent for DS Open API (DS6000, ESS)a            5 minutes
  LSI SMI-S Provider 1.3 10.06.GG.33 (DS4000, DS5000)a    5 minutes
  Engenio Version 10.50.G0.04 (DS3000, DS4000)a           5 minutes
  Brocade DCFM 10.4.1b                                    5 minutes
As mentioned previously, the shorter the sampling interval, the more data is generated and must be stored in the database. With default settings in Tivoli Storage Productivity Center, all the sampling data is stored in the database. To avoid database size problems while still watching for potential performance issues, Tivoli Storage Productivity Center has the ability to skip inserting samples into the database. This skipping function is useful when you need to do SLA reporting and longer term capacity planning at the same time. It is important to understand that every time a defined alerting threshold is reached, the sample is stored in the database anyway.
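As a rough illustration (not an official Tivoli Storage Productivity Center sizing formula), the interplay of the sampling interval, retention, and the skipping function can be sketched; the function name and parameters are invented for the example.

```python
# Back-of-the-envelope sketch: how the sampling interval and the
# "skip inserting samples" option drive per-device rows kept in the database.
def samples_retained(interval_minutes, retention_days, keep_every_nth=1):
    """keep_every_nth models the skipping function: only one of every N
    collected samples is inserted into the database."""
    per_day = (24 * 60) // interval_minutes
    return (per_day // keep_every_nth) * retention_days

full = samples_retained(interval_minutes=5, retention_days=30)              # 8640
thinned = samples_retained(interval_minutes=5, retention_days=30,
                           keep_every_nth=6)                                # 1440
```

With a 5-minute interval and 30-day sample retention, keeping only every sixth sample (an effective 30-minute granularity) cuts the stored rows per device by a factor of six while the collection itself still runs at 5-minute resolution for threshold checking.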
Capacity planning
When you want to use the data for capacity planning, there is no need to run the collection all the time. Instead, you can run it at certain intervals and capture data for a specific period of time that is representative of the periods that you do not monitor. What is important here is that you still set the database retention periods accordingly, because database retention is time-based only, not job-based. For more details on how to use Tivoli Storage Productivity Center for capacity planning management, see Capacity Planning and Performance Management on page 306.
For more details on how to use Tivoli Storage Productivity Center for problem determination, see Chapter 5, Using Tivoli Storage Productivity Center for performance management reports on page 185.
Figure 3-4 Performance monitor running for 24 hours and restarted every day
Retention
Tivoli Storage Productivity Center allows you to specify a retention period for each of the three types of samples, which gives you a second point of control over the amount of performance data that is stored within Tivoli Storage Productivity Center. The smallest interval that you can specify is always a day. These are the recommended values for each sample type, in order to define a consistent baseline and long-term historical data:
- Samples: 30 days
- Hourly averages: 180 days
- Daily averages: 365 days

In 3.4.5, Defining the data retention on page 84, we explain how to set up the data retention values. For an analysis of the impact of the collected data on the database sizing, see 2.1.5, Tivoli Storage Productivity Center database repository sizing formulas on page 37.
The alert function is extremely useful for maintaining SLAs. Every time a defined threshold is exceeded, Tivoli Storage Productivity Center not only notifies you through your standard alerting mechanism (an e-mail, an SNMP trap, and so on), but also records a constraint violation. You can then look at a special report (the Constraint Violation report) to get a quick overview of when something happened, and you can even drill down to the sample data that triggered the alert. This function is also useful for long-term performance capacity planning: you can set up an alert level to inform you when the workload reaches that level. The more often you get this alert, the closer the workload is to the SLA level that you are defining. See the following link for information about threshold and performance related alerts:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_perf_thresh.html
2. On the next panel (see Figure 3-6), select the subsystem from which you want to collect the data. Remember that you can create only one job for each subsystem. When you include a subsystem in a job, it is no longer available in the left column.
3. On the next panel (Figure 3-7), which is the most important one, specify when the job starts and finishes, how often data is collected, and whether the job is repeated. We explained these options in the previous sections.
4. Click the Advanced button to the right of the interval length field to open the dialog shown in Figure 3-8. This dialog is where you configure the skipping function described previously, which defines how much data is actually stored in the database.
5. On the last panel, define the alerts for a monitoring failure. In this example, an e-mail is sent to the Tivoli Storage Productivity Center administrator if the Monitor Failed condition occurs.
Stress alerts
Figure 3-10 shows a diagram that illustrates the four thresholds, which create five regions. Stress alerts define levels that, when exceeded, trigger an alert. An idle threshold triggers an alert when the data value drops below the defined idle boundary. There are two types of alerts for the stress category, two for the idle category, and one type for normal conditions:
1. Critical Stress: No Warning Stress alert is created, because both the warning and the critical levels are exceeded within the interval.
2. Warning Stress: It does not matter that the metric shows a lower value than in the last interval; an alert is triggered because the value is still above the warning stress level.
3. Normal workload and performance: No alerts are generated.
4. Warning Idle: The workload drops significantly, and this drop might indicate a problem (not necessarily performance-related).
5. Critical Idle: The same applies as for Critical Stress.
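The five regions can be expressed as a small classification function. This is an illustrative sketch, not TPC code; the function and region names are invented. A disabled boundary, modeled here as None, is simply not checked, which mirrors leaving a boundary field blank:

```python
def classify(value, critical_stress, warning_stress, warning_idle, critical_idle):
    # Map a metric value onto the five regions created by four thresholds.
    # A boundary set to None is disabled and never checked.
    if critical_stress is not None and value >= critical_stress:
        return "critical stress"
    if warning_stress is not None and value >= warning_stress:
        return "warning stress"
    if critical_idle is not None and value <= critical_idle:
        return "critical idle"
    if warning_idle is not None and value <= warning_idle:
        return "warning idle"
    return "normal"
```

For example, with boundaries of 85, 75, 10, and 5 percent, a value of 80 falls into the warning stress region, and a value of 3 falls into the critical idle region.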
If you do not want to be notified of threshold violations for any boundaries, you can leave the boundary field blank and the performance data will not be checked against any value. For example, if the Critical Idle and Warning Idle fields are left blank, no alerts will be sent for any idle conditions.
Alert suppression
If you selected a threshold as a triggering condition for alerts, you can specify the conditions under which alerts are triggered and choose whether you want to suppress repeating alerts. Alerts can be suppressed to avoid generating too many alert log entries or too many actions when the triggering condition occurs often. You can view suppressed alerts in the constraint violation reports. You can define the following options, which enable you to specify conditions that trigger and suppress alerts. If a threshold is not selected as a triggering condition, these options are not available.
Trigger alerts for both critical and warning conditions: Generates an alert upon the violation of either critical or warning threshold boundaries. This is the default.
Trigger alerts for critical conditions only: Generates alerts only upon violation of one of the critical threshold boundaries. Violation of a warning boundary creates an entry in the constraint violation report, but does not result in an entry in the alert log or an action being triggered.
Trigger no alerts: Does not generate an alert upon violation of any threshold boundaries. Creates entries only in the constraint violation report.
Do not suppress repeating alerts: Does not suppress any repeating alerts. This is the default.
Suppress alerts unless the triggering condition has been violated continuously for a specified length of time: Generates alerts only if the triggering condition has occurred continuously for the length of time specified in the Length of time field. Alerts for the first and any subsequent occurrences of the triggering condition within the specified time in minutes are suppressed. When there have been consecutive occurrences for the specified number of minutes, an alert is generated. When the specified suppression period has expired, the cycle starts again. Note that the timing for this feature is based on the IBM Tivoli Storage Productivity Center server clock rather than the various system clocks. This option is useful for cases where a single occurrence of the triggering condition might be insignificant, but repeated occurrences can signal a potential problem.
Suppress alerts if a repeat violation has occurred within a specified length of time after the initial violation of the triggering condition: Generates alerts only for the first occurrence of the triggering condition. Alerts for repeated occurrences of the triggering condition within the length of time specified in the Length of time field are suppressed. When the specified suppression period has expired, the cycle starts again. Note that the timing for this feature is based on the IBM Tivoli Storage Productivity Center server clock rather than the various system clocks. This option is useful for avoiding e-mail messages or similar disruptive alerts when the same triggering condition occurs repeatedly in successive sample passes, and it is generally useful for all threshold types.
Figure 3-11 illustrates the suppression of alerts. The CPU utilization of an SVC triggers an alert only if the node CPU stays continuously busy (Warning Stress) for at least 20 minutes.
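The first suppression option can be sketched as a small state machine. This is a hypothetical class, not TPC code; as in the product, a single (server-side) clock drives the timing:

```python
class ContinuousViolationSuppressor:
    # Emit an alert only after the triggering condition has held
    # continuously for `minutes`. Hypothetical sketch of the
    # "violated continuously for a specified length of time" option.
    def __init__(self, minutes):
        self.minutes = minutes
        self.violation_started = None

    def observe(self, timestamp_min, violated):
        # timestamp_min: minutes on a single (server) clock.
        if not violated:
            self.violation_started = None   # condition cleared; reset
            return False
        if self.violation_started is None:
            self.violation_started = timestamp_min
        return timestamp_min - self.violation_started >= self.minutes

# As in the Figure 3-11 example: CPU busy (Warning Stress) must persist
# for 20 minutes before alerting; samples arrive every 5 minutes.
s = ContinuousViolationSuppressor(20)
alerts = [s.observe(t, violated=True) for t in range(0, 30, 5)]
```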
Write Cache Delay, Back-End Queue Time, and Back-End Response Time Threshold Filtering
A filtering option is also available for the following metrics:
Write Cache Delay Percentage Threshold
Back-End Write Queue Time Threshold
Back-End Read Queue Time Threshold
Back-End Write Response Time Threshold
Back-End Read Response Time Threshold
Overall Back-End Response Time Threshold
By using this option, it is possible to ignore response time or queue time samples where the I/O rate was less than x, where x is entered by the user. This option allows you to address those circumstances where, even when no performance issue exists, low I/O rates together with high response times are measured by Tivoli Storage Productivity Center, as described in Low I/O rates and high response time considerations on page 60. Figure 3-12 shows an example of alert filtering, where the Back-End Response Time Threshold triggering condition is ignored when the Back-End Read I/O rate is less than 5 ops/s (the default value when this option is checked):
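The filtering rule amounts to a simple guard. This is an illustrative sketch; the 5 ops/s default comes from the example above, but the function itself is invented:

```python
def effective_violation(response_time_ms, threshold_ms, io_rate, min_io_rate=5):
    # Ignore a response-time threshold violation when the I/O rate is
    # below the configured minimum, because samples taken at very low
    # I/O rates produce misleading response times.
    if io_rate < min_io_rate:
        return False
    return response_time_ms > threshold_ms
```

A 40 ms response time against a 25 ms threshold is flagged only when the I/O rate is at or above the minimum.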
Known limitations
After thresholds are defined, alerts are always generated when the thresholds are reached. There is no way to specify a period of time during which alerts are or are not generated. As stated in 3.2.1, Performance data classification on page 56, depending on the particular storage environment, throughput or response times might change drastically from hour to hour or day to day (for example, because of backup sessions or batch operations). This means that there might be periods when the values of several metrics fall outside the expected range and an alert is triggered even though the behavior is normal for that period. This is a known Tivoli Storage Productivity Center limitation in threshold definition and alert generation. The only way to avoid false alarms is to be as aware as possible of the workload distribution in your environment and to understand the expected behavior in different hours or days. When an alert is triggered, consider when it happened in order to verify whether it actually indicates a real problem.
History Aggregation: Tivoli Storage Productivity Center has a configuration panel that controls how much history is kept over time.
Important: The history aggregation process is a global setting, which means that the values set for history retention are applied to all performance data from all devices. It is not possible to set history retention on an individual device basis.
Figure 3-13 shows the Tivoli Storage Productivity Center panel for setting the history retention for performance monitors as well as other types of collected statistics.
Figure 3-14 shows retention periods for the performance monitors, where you can see the recommended values as discussed in 2.1.4, Tivoli Storage Productivity Center database considerations on page 36.
The descriptions of the performance monitor values shown in Figure 3-14 are as follows:
Per performance monitoring task: The value set here defines the number of days that Tivoli Storage Productivity Center keeps individual data samples for all devices sending performance data. The example shows 30 days, as recommended in How long do you retain the data on page 75. When per-sample data reaches this age, Tivoli Storage Productivity Center permanently deletes it from the database. Increasing this value allows you to look back at device performance at the most granular level, at the expense of consuming more storage space in the Tivoli Storage Productivity Center repository database. Data held at this level is good for plotting performance over small time periods, but not for plotting data over many days or weeks, because of the number of data points. Consider keeping more data in the hourly and daily sections for longer time period reports. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for per-sample data.
Hourly: This value defines the number of days that Tivoli Storage Productivity Center holds performance data that has been grouped into hourly averages. Hourly average data consumes less space in the database. For example, if you collect performance data from an IBM SAN Volume Controller at 15-minute intervals, the hourly averages require only a quarter of the space in the database. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for hourly data.
Daily: This value defines the number of days that Tivoli Storage Productivity Center holds performance data that has been grouped into daily averages. After the defined number of days, Tivoli Storage Productivity Center permanently deletes the daily history records from the repository. Daily averaged data requires one twenty-fourth of the space needed to store hourly data. This savings comes at the expense of granularity; however, plotting performance over a longer period (perhaps weeks or months) becomes more meaningful. The check box determines whether history retention is on or off. If the check is removed, Tivoli Storage Productivity Center does not keep any history for daily data.
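The space savings at the hourly and daily levels come from simple averaging, which can be sketched as follows (hypothetical helper; TPC performs this aggregation internally):

```python
from collections import defaultdict

def hourly_averages(samples):
    # samples: iterable of (hour, value) pairs.
    # Returns one averaged record per hour.
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: sum(values) / len(values) for hour, values in buckets.items()}

# Four 15-minute samples per hour collapse into a single hourly record,
# which is why hourly history needs roughly a quarter of the space.
avg = hourly_averages([(0, 10), (0, 20), (0, 30), (0, 40),
                       (1, 8), (1, 8), (1, 8), (1, 8)])
```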
Figure 3-15 Starting the performance data collection manually using Job definition
Figure 3-16 Starting the performance data collection manually using central job management
After you start the monitor, the job is visible only in the IBM Tivoli Storage Productivity Center Job Management panel (see Figure 3-17).
The message in the log file indicates whether it was a manual stop, as shown in Example 3-3.
Example 3-3 Logfile for manual job stop
2011-06-01 10:13:50.159 HWNPM2127I The performance monitor for device DS8000-2107-1301901-IBM (2107.1301901) is stopping due to a user request.
HWNPM2022E A performance monitor for device DS8000-2107-1301901-IBM (2107.1301901) is already active. A new monitor for the same device cannot be started until the previous monitor completes or is cancelled. This is the only situation in which the job entry of the currently running job is not in the last position in the job list. Note the red circle next to the failed job in Figure 3-19, which indicates that the job is not running and was not successful.
During the last interval of Job 15, the performance data collection job had problems gathering data. For details, you need to look into the job's log file.
If you end the job in this state, it does not end with a failed condition, provided that the job collected at least one sample.
State and description:
Finished successfully (green square): The duration time has elapsed or the user has stopped the job manually. If you want to know the details of the job after it completes, select the job log and click View Log File(s).
Finished with an error (red circle): An alert is created, if alerts have been set up. In this case, the reason is that a job that was started earlier was already running.
2011-05-31 21:29:09.069 HWNPM2113I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is starting in an active state.
2011-05-31 21:29:09.069 HWNPM2115I Monitor Policy: name="XIV-0646", creator="administrator", description="XIV-0646"
2011-05-31 21:29:09.069 HWNPM2116I Monitor Policy: retention period: sample data=14 days, hourly data=30 days, daily data=90 days.
2011-05-31 21:29:09.069 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
2011-05-31 21:29:09.084 HWNPM2118I Threshold Policy: name="Default Threshold Policy for XIV", creator="System", description="Current default performance threshold policy for XIV devices. This default policy can be overridden for individual devices."
2011-05-31 21:29:09.084 HWNPM2119I Threshold Policy: retention period: exception data=14 days.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Port IO Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Total Port Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Port Send Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-05-31 21:29:09.084 HWNPM2120I Threshold Policy: threshold name=Port Receive Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-05-31 21:29:09.209 HWNPM2112I Agent 9.11.123.69/9.11.123.70 has been selected for performance data collection from device XIV-2810-6000646-IBM (2810.6000646).
2011-05-31 21:29:10.163 HWNPM2203I Successfully retrieved the configuration data for the storage subsystem. Found 15 modules, 24 ports, and 921 volumes.
2011-05-31 21:30:11.819 HWNPM2123I Performance data for timestamp 05/31/11 08:54:39 PM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 21:35:11.475 HWNPM2123I Performance data for timestamp 05/31/11 08:59:39 PM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 21:40:13.069 HWNPM2123I Performance data for timestamp 05/31/11 09:04:39 PM was collected and processed successfully. 460 performance data records were inserted into the database.
2011-05-31 21:45:11.444 HWNPM2123I Performance data for timestamp 05/31/11 09:09:40 PM was collected and processed successfully. 306 performance data records were inserted into the database.
Device configuration changes are displayed in the performance job log file. This is normal behavior. Depending on the storage device type, those changes are handled differently. For SVC, the change is recognized at the next restart of the performance monitor and by running a probe. We restart the performance monitor every day; therefore, we miss at most the first 24 hours of performance data for a new object, such as a volume (Example 3-6).
Example 3-6 Device configuration changes during performance monitoring job
2011-06-01 11:20:52.084 HWNPM2123I Performance data for timestamp 06/01/11 11:10:02 AM was collected and processed successfully. 4981 performance data records were inserted into the database.
2011-06-01 11:25:38.975 HWNPM4189W 2 of the MDisk statistics from the device agent were unrecognized and were not included in this sample interval.
2011-06-01 11:25:38.975 HWNPM4182W 6 of the volume statistics from the device agent were unrecognized and were not included in this sample interval.
Every hour, the number of performance data records is much higher than in the previous collections. This indicates the collection of the hourly data, which is done at the same time every hour (Example 3-7).
Example 3-7 Samples and hourly data records
2011-05-31 03:20:11.460 HWNPM2123I Performance data for timestamp 05/31/11 02:44:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:25:12.132 HWNPM2123I Performance data for timestamp 05/31/11 02:49:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:30:12.132 HWNPM2123I Performance data for timestamp 05/31/11 02:54:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:35:11.116 HWNPM2123I Performance data for timestamp 05/31/11 02:59:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
2011-05-31 03:40:12.382 HWNPM2123I Performance data for timestamp 05/31/11 03:04:41 AM was collected and processed successfully. 460 performance data records were inserted into the database.
2011-05-31 03:45:12.023 HWNPM2123I Performance data for timestamp 05/31/11 03:09:41 AM was collected and processed successfully. 306 performance data records were inserted into the database.
There are many messages, and we cannot give examples for all of them. See the following link for more information about messages:
http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/tpcmsg42122.html
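Because the HWNPM2123I messages have a regular shape, you can extract the inserted record counts from a job log with a short script, for example to spot the larger hourly-aggregation inserts. This is an illustrative sketch; the message format is assumed to match the excerpts shown above:

```python
import re

# Matches the HWNPM2123I success messages shown in the job log excerpts.
PATTERN = re.compile(r"HWNPM2123I .*?(\d+) performance data records were inserted")

def records_inserted(log_lines):
    # Return the record count from each HWNPM2123I line, in order.
    return [int(m.group(1)) for line in log_lines
            if (m := PATTERN.search(line))]

counts = records_inserted([
    "2011-05-31 03:30:12.132 HWNPM2123I Performance data for timestamp "
    "05/31/11 02:54:41 AM was collected and processed successfully. "
    "306 performance data records were inserted into the database.",
    "2011-05-31 03:40:12.382 HWNPM2123I Performance data for timestamp "
    "05/31/11 03:04:41 AM was collected and processed successfully. "
    "460 performance data records were inserted into the database.",
])
```

A count that jumps well above the others (here 460 versus 306) marks the interval in which the hourly records were also written.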
Tivoli Storage Productivity Center server restarts while performance data collection jobs run
If performance data collection jobs were running when the server was stopped, they are restarted when the server is up again, as long as the specified duration has not been reached. This situation is indicated by the log message HWNPM2113I in Example 3-8.
Example 3-8 Job start after Tivoli Storage Productivity Center restart: Active state
...
2011-06-01 12:30:11.076 HWNPM2123I Performance data for timestamp 06/01/11 11:54:38 AM was collected and processed successfully. 308 performance data records were inserted into the database.
2011-06-01 12:30:42.951 HWNPM2129I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is stopping because of a shutdown request.
2011-06-01 12:35:37.390 HWNPM2113I The performance monitor for device XIV-2810-6000646-IBM (2810.6000646) is starting in an active state.
2011-06-01 12:35:37.406 HWNPM2115I Monitor Policy: name="XIV-2210-6000646", creator="administrator", description="XIV-2210-6000646"
2011-06-01 12:35:37.406 HWNPM2116I Monitor Policy: retention period: sample data=30 days, hourly data=180 days, daily data=365 days.
2011-06-01 12:35:37.406 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
...
Note that a new log file has not been created.
In Table 3-5, we list the logical points at which metrics are collected for several IBM systems. Other storage subsystems likely provide information similar to the DS4000; however, this depends on their conformance to the SMI-S standard, in which certain metrics are required and other metrics are optional. Not all vendors provide identical data.
Table 3-5 Logical reporting levels by device type
ESS, DS6000, and DS8000: Subsystem, Controller, Array, Volume, Port
DS4000 and DS5000: Subsystem, Controller, Volume, Port
SVC and Storwize V7000: Subsystem, I/O Group, Node/Node Canister, Managed Disk Group (Storage Pool), Volume, Managed Disk, Port
XIV: Subsystem, Module, Volume, Port
Switch: Port
It is useful to learn what data is available for a certain subsystem; this is typically the data listed in Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331. One advantage of using Tivoli Storage Productivity Center instead of other tools is that some vendor-provided low-level tools do not report values based on standard units, such as GB and MB, but instead use more hardware-related units. If you try to compare the results of those tools with the numbers displayed in Tivoli Storage Productivity Center, you must convert the numbers of the other tools into the units used by Tivoli Storage Productivity Center. This conversion is especially important when you compare capacities. Basically, Tivoli Storage Productivity Center provides three types of reports:
Predefined performance reports: These are the standard reports that ship as part of the Tivoli Storage Productivity Center package.
Customized reports: In addition to the standard reports, Tivoli Storage Productivity Center gives you the option to create and save your own reports. These reports can be configured to run on a regular basis and to be saved to a file; in this case, they are called batch reports.
Constraint report: This is a special report, available in the Reporting navigation subtree, that lists all the threshold-based alerts.
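As an example of the unit conversions mentioned above, converting between binary and decimal capacity units can be done as follows. This is a sketch; which direction of conversion you actually need depends on the units each tool reports, so check each tool's documentation:

```python
def gib_to_gb(gib):
    # Convert binary gibibytes (2**30 bytes) to decimal gigabytes (10**9 bytes).
    return gib * 2**30 / 10**9

# A tool reporting 100 in binary units describes about 107.37 decimal GB,
# a difference of more than 7% that comes from the units alone.
converted = gib_to_gb(100)
```

Such a discrepancy, if left unconverted, is easily mistaken for missing or phantom capacity when comparing two tools.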
In the following sections, we illustrate each of these reporting capabilities in detail, describing how to display and work with the data. Moreover, in Appendix C, Reporting with Tivoli Storage Productivity Center on page 365, we discuss additional reporting capabilities and tools that can be used with Tivoli Storage Productivity Center to generate and export reports based on the Tivoli Storage Productivity Center data that you have collected.
The predefined Tivoli Storage Productivity Center performance reports are customized reports that include only specific metrics. In contrast, the reports that you can generate in Disk Manager contain, by default, all the metrics that apply to a given component of the subsystem (for example, controller, array, or volume). You must use the Selection and Filter buttons to reduce the size of the report to suit your requirements. We recommend that you create and save your own reports so that you do not have to do this every time you open a report. In the following sections, we describe the Tivoli Storage Productivity Center standard reports.
Array performance
The Array performance report (Figure 3-21) is useful when you want to see whether the workload is evenly distributed across the arrays in your environment. Do this check on a regular basis. Currently, array-level information is available only for DS6000, DS8000, and ESS subsystems.
Port performance
This report includes many metrics that are specific to the SVC/Storwize V7000, such as port-to-host or port-to-disk traffic, as shown in Figure 3-27. It is also a useful tool to confirm that SAN traffic is balanced across all the front-end ports of the storage subsystem.
Subsystem performance
In the Subsystem performance report, the metrics are aggregated into the overall performance data for the reported metrics, as shown in Figure 3-28. This report gives a high-level administrative view to gauge your subsystem's overall performance.
For Disk Manager, there are also additional capacity-related reports, which are indicated by the selection Storage Subsystems just above the Storage Subsystem Performance selection. In the IBM Tivoli Storage Productivity Center subtree, you see preconfigured reports and reports that you have saved (which we show you later). The Reporting navigation tree selections of the Disk Manager or Fabric Manager provide many more details. Because these additional reports can provide an overwhelming amount of detailed information, regard them primarily as the basis for creating your own reports. You often see not applicable (N/A) values, because the level of information provided by the storage subsystems varies greatly. The varying levels of information result from the Block Server Performance (BSP) subprofile still being at an early stage, and also from differences in storage subsystem architectures. Tivoli Storage Productivity Center groups the various metrics according to the levels at which the data was collected or aggregated, not by device type. The whole point of SMI-S is to standardize the information; otherwise, there would be too many individual reports if Tivoli Storage Productivity Center provided one report per storage subsystem and per level.
Therefore, we recommend that you use the standard reports in the Reporting navigation subtrees as a set of building blocks to create and save your own reports, so that you do not need to customize a standard Tivoli Storage Productivity Center report every time you open Tivoli Storage Productivity Center. The following items are most suited for customization:
Included columns
Filters
Unfortunately, the following options are currently not available:
Subsystem device type
Host names in the volume report
All the attributes that Tivoli Storage Productivity Center uses to drill down for more details (for example, array names and device adapter IDs), which are needed especially at the volume level
To overcome these limitations, implement a common naming schema for those items that are included in the reports and can be customized:
Storage subsystem names: Include the type so that you can filter on type.
Volume names: Include the host name.
Charts cannot be saved, only exported, by using File → Print (Printer, PDF, HTML). However, they can easily be recreated, because the charts are created from the data of the underlying tabular reports. After you save a report, you can still modify it. For example, if you saved a report containing only SVC volumes (based on a filter for the subsystem name), you can add more filter options to include just a certain volume.
Within the individual subtrees, you see the components on which Tivoli Storage Productivity Center can report, as well as the constraint statistics report. See Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331 for more information about the available metrics. If you select a report type, you see the same panel layout that is used for all of the reports, as shown on the right side of Figure 3-33. You can select the columns (metrics) that you want to include in the report. You can also click Selection to limit the information included in the report, and you can create additional filters. If you do not limit the records that are included in the report by using the Selection or Filter function, you might get an extremely long report.
Tip: If you want to reduce the number of records displayed, for some reports it is better to use the filter function instead of the selection function. Displaying the list of available components that you can select or deselect might itself take a long time, because a query has to be submitted to the database. Defining a filter, however, does not involve any database activity. This is especially true for volume reports, where the selection list might include hundreds of volume entries.
You can also specify the time range on the first panel so that less data is included in the report, which speeds up the following steps.
If you click the drill-down icon in Figure 3-34, you get a report containing all the volumes that are stored on that specific array. If you click the drill-up icon, you get a performance report at the controller level. Figure 3-35 shows the various components and levels to which you can drill up and down.
Figure 3-35 Various drill down possibilities: SVC/Storwize V7000 and DS8000/DS6000
See Figure 3-36 for DS4000, DS5000, and XIV drill up and drill down capabilities.
Reports: There is one drawback when you drill up or down to the next level from saved customized reports: after you drill up or down, the next-level reports include all columns again, so you cannot skip from one customized report to another. Drill-down also does not currently work for a DS4000, because no drill-down level reports are available from the controller; Tivoli Storage Productivity Center displays the error in Figure 3-37.
Creating a chart
From the tabular report, you can also create a chart to visualize the data, no matter whether you are looking at a predefined report or at one that you have customized and saved. There are two types of charts that Tivoli Storage Productivity Center can display: a bar chart and a line chart for displaying history information. To create a chart, select the records to display and click the pie chart icon in the upper left corner (see Figure 3-34 on page 104). Tivoli Storage Productivity Center displays the dialog box shown in Figure 3-38, where you can select the options for the chart. First, select the chart type, and then the metrics to display. You can select multiple components and multiple metrics at the same time, as long as those metrics use the same units. For example, you can display the Read I/O Rate (overall) and the Write I/O Rate (overall) at the same time (as shown in Figure 3-38), but you cannot display the Total Read I/O and the Read Response Time on one chart.
Tip: Whenever you use this function, we recommend that you set History Chart Ordering to By Components. By doing this, you get all metrics for a component at the same time, instead of all the components with the first metric and then all the components with the second metric on another page.
After you click OK, it might take a while for Tivoli Storage Productivity Center to query the data from the database and display it (see Figure 3-39).
If you right-click the chart, you see a context menu with a single entry, Customize this chart, which you can use to set additional options.
There are several considerations to keep in mind: If, for any reason, Tivoli Storage Productivity Center did not receive any samples for a certain time frame, the diagram does not indicate this gap, but simply draws a straight line between the last sample before the gap and the first sample after it. See Figure 3-40.
In Figure 3-40, given the short interval (see the left part of the diagram), it is obvious that there was an interruption in the performance collection, because the line in the illustrated gap runs straight from one point to another point that is multiple intervals later. If all of the records cannot be displayed on a single chart, you can use the Next and Prev buttons (on top of bar charts, or in the top right corner of line charts) to scroll through the pages.
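If you export the underlying tabular data, you can detect such gaps yourself by checking the spacing of the sample timestamps. This is a hypothetical helper; timestamps here are plain seconds for simplicity:

```python
def find_gaps(timestamps, interval_seconds, tolerance=1.5):
    # Return (prev, next) timestamp pairs whose spacing exceeds
    # tolerance * interval, i.e. places where a chart would silently
    # draw a straight line across missing samples.
    gaps = []
    for prev, nxt in zip(timestamps, timestamps[1:]):
        if nxt - prev > tolerance * interval_seconds:
            gaps.append((prev, nxt))
    return gaps

# 5-minute samples with one missing stretch between 600 and 1800 seconds:
gaps = find_gaps([0, 300, 600, 1800, 2100], 300)
```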
2. Continue with the batch report creation by selecting the performance report type from the Report tab as shown in Figure 3-42.
3. After selecting the performance report type, complete your customization by clicking the tabs:
a. Selection
b. Options
c. When to Run
d. Alerts
The following series of panels takes you through the steps to create a batch report.
4. From the Selection tab, specify the criteria for selecting and displaying report data (see Figure 3-43).
5. Click Selection in the upper right side to specify the resources (for example, all DS8000, SVC, and Storwize V7000 systems) to display in the report.
6. Click Filter, also in the upper side, to apply filters to the data that you want to display.
7. On the Options tab (see Figure 3-44), specify:
- The machine and file system location in which to save the report file
- The report format:
  - CSV File: this format can optionally include headers and totals
  - Formatted file
  - HTML file
  - History CSV File: this format can optionally include headers
  - PDF Chart
  - HTML Chart
- Whether to use classic column names
- Whether to run a script when the report process completes
- The format for the name of the batch report
8. On the When to Run tab (see Figure 3-45), specify when to run the batch report, how frequently to run it, and the time zone. The following options are available for the time zone:
- Local time in each time zone: Select this option to use the time zone of the location where the agent that runs the batch report is located.
- Same Global time across all time zones: Select this option to apply the same global time across all the time zones where the probe is run.
9. The Alert tab (see Figure 3-46) allows you to define an alert that is triggered if the report does not run on schedule.
10. After you have specified the options for the batch report, click File → Save As and type a name for the batch report. The batch report is saved with the user ID that you are logged on with as a prefix. In our case, we are logged on to Tivoli Storage Productivity Center as tpcadmin; therefore, the name of the batch report is administrator.Storage Subsystem Performance R1 (see Figure 3-47).
After you run this example reporting job, the output looks like Figure 3-48. Each sample (5 minute interval) represents a row. Each metric is displayed in a column.
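The batch-report CSV output described above (one row per 5-minute sample, one metric per column) lends itself to simple post-processing. The following sketch shows one way to average a metric column; the column headers and sample values here are illustrative, so substitute the headers from your own export.

```python
import csv
import io

# Illustrative excerpt of a batch-report CSV export; real exports
# contain the headers chosen when the batch report was defined.
sample_csv = """Time,Read I/O Rate (overall),Write I/O Rate (overall)
2011-06-01 00:00,120.4,80.2
2011-06-01 00:05,130.1,75.9
2011-06-01 00:10,110.8,90.3
"""

def average_metric(csv_text, metric):
    """Average one metric column across all 5-minute samples."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(r[metric]) for r in rows]
    return sum(values) / len(values)

print(round(average_metric(sample_csv, "Read I/O Rate (overall)"), 2))  # 120.43
```

The same approach scales to the History CSV format, where you would typically group the rows by hour or day before averaging.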
These are the most useful thresholds for storage subsystems for performance monitoring:
- Write-cache Delay Percentage Threshold
- Total I/O Rate Threshold
- Back-End Write Response Time Threshold
- Back-End Read Response Time Threshold
- Total Back-End I/O Rate Threshold
- Disk Utilization Percentage Threshold
- CPU Utilization Threshold
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold
These are the most useful thresholds for switches for performance monitoring:
- Port Receive Bandwidth Percentage Threshold
- Port Send Bandwidth Percentage Threshold
Tip: Tivoli Storage Productivity Center also offers several thresholds for error counters on SAN ports, which we highly recommend using if no other tool is already monitoring them. Errors on SAN ports, especially on Inter-Switch Links (ISLs), can heavily impact the performance of the SAN fabric.
For information about the exact meaning of these and the other thresholds, see Appendix B, Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports on page 331, where we explain the metrics and show which subsystems really support each metric. You can also see at which level each metric is available. For example, the Total I/O Rate is a controller and I/O Group metric; therefore, you need to specify the values for a controller or I/O Group and not for the whole subsystem, even though you need to select the name of the subsystem in the second panel. Also, not all metrics are supported by all subsystems. You might select a threshold, but later find that you cannot select your subsystem, because that subsystem does not support the selected metric (this is often the case with DS4000 systems, which have a limited number of metrics compared to the DS8000).
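Conceptually, each of these thresholds carries a pair of Critical Stress and Warning Stress boundaries. The following sketch is not TPC code; it simply illustrates how such a check behaves, borrowing the convention from the job logs later in this chapter that -1 marks a blank (disabled) boundary.

```python
# Hedged sketch of a Critical/Warning Stress threshold evaluation;
# a boundary of -1 is treated as blank (disabled), mirroring the
# "-1 indicates blank" convention in the performance job logs.
def classify(value, critical_stress, warning_stress):
    if critical_stress != -1 and value >= critical_stress:
        return "Critical Stress"
    if warning_stress != -1 and value >= warning_stress:
        return "Warning Stress"
    return "Normal"

# Disk Utilization Percentage Threshold with boundaries 80,50 (%):
print(classify(85, 80, 50))  # Critical Stress
print(classify(60, 80, 50))  # Warning Stress
print(classify(30, 80, 50))  # Normal
```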
Figure 3-50 shows an example of the constraint violation report using the Disk Utilization Percentage Threshold, limited to the time range of June 1 to June 2, 2011. This report shows that the DS8000.2107-1301901-IBM subsystem exceeded the Disk Utilization Percentage threshold 28 times:
By clicking the lens icon, you can see the details, as shown in Figure 3-51.
As you can see, the Critical Stress threshold was exceeded 28 times, always by array A14. The threshold values are those set in the performance monitor job, as you can see from the job log lines reported in Example 3-9 (see the bolded lines; -1 indicates blank):
Example 3-9 Performance collection job log
2011-06-01 19:29:05.500 HWNPM2113I The performance monitor for device DS8000-2107-1302541-IBM (2107.1302541) is starting in an active state.
2011-06-01 19:29:05.500 HWNPM2115I Monitor Policy: name="ds8k-1302541", creator="administrator", description="ds8k-1302541"
2011-06-01 19:29:05.500 HWNPM2116I Monitor Policy: retention period: sample data=30 days, hourly data=180 days, daily data=365 days.
2011-06-01 19:29:05.500 HWNPM2117I Monitor Policy: interval length=300 secs, frequency=300 secs, duration=24 hours.
2011-06-01 19:29:05.500 HWNPM2118I Threshold Policy: name="Default Threshold Policy for DS8000", creator="System", description="Current default performance threshold policy for DS8000 devices. This default policy can be overridden for individual devices."
2011-06-01 19:29:05.500 HWNPM2119I Threshold Policy: retention period: exception data=14 days.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Port IO Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Port Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Overall Port Response Time Threshold, enabled=no , boundaries=-1,-1,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Error Frame Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Link Failure Rate Threshold, enabled=no , boundaries=0.0030,-1,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=CRC Error Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Send Utilization Percentage Threshold, enabled=no , boundaries=-1,-1,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Receive Utilization Percentage Threshold, enabled=no , boundaries=-1,-1,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Send Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Port Receive Bandwidth Percentage Threshold, enabled=yes, boundaries=85,75,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Invalid Transmission Word Rate Threshold, enabled=no , boundaries=0.033,0.01,-1,-1 cnt/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Write-cache Delay Percentage Threshold, enabled=yes, boundaries=10,3,-1,-1 %.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Cache Holding Time Threshold, enabled=yes, boundaries=30,60,-1,-1 s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Back-end I/O Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 ops/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Total Back-end Data Rate Threshold, enabled=no , boundaries=-1,-1,-1,-1 MB/s.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Back-end Read Response Time Threshold, enabled=no , boundaries=35,25,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Back-end Write Response Time Threshold, enabled=no , boundaries=120,80,-1,-1 ms/op.
2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: threshold name=Disk Utilization Percentage Threshold, enabled=yes, boundaries=80,50,-1,-1 %.
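If you want to audit which thresholds are enabled across many such job logs, the HWNPM2120I lines are regular enough to scrape. The following is only an illustrative log parser based on the format shown above, not an official Tivoli Storage Productivity Center interface:

```python
import re

# One HWNPM2120I line from a performance collection job log,
# copied from the format shown in Example 3-9.
line = ('2011-06-01 19:29:05.500 HWNPM2120I Threshold Policy: '
        'threshold name=Disk Utilization Percentage Threshold, '
        'enabled=yes, boundaries=80,50,-1,-1 %.')

# "enabled=no ," lines carry a stray space before the comma,
# so the pattern tolerates optional whitespace there.
pattern = re.compile(
    r'HWNPM2120I Threshold Policy: threshold name=(?P<name>[^,]+), '
    r'enabled=(?P<enabled>\w+)\s*, boundaries=(?P<bounds>[-\d.,]+)')

m = pattern.search(line)
name = m.group('name')
enabled = m.group('enabled') == 'yes'
bounds = [float(b) for b in m.group('bounds').split(',')]  # -1 = blank
print(name, enabled, bounds)
```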
By selecting the Storage Subsystem Performance - By Array chart and selecting the DS8000 arrays, we can see the related graphs. To do so, follow these steps: 1. Access the array storage subsystems performance panel (Disk Manager → Reporting → Storage Subsystems Reporting → By Array; see Figure 3-52).
2. Click the Generate Report button, and in the next panel, select the specific DS8000 lines, as shown in Figure 3-53.
3. Click the chart icon and select the Disk Utilization Percentage metric (Figure 3-54).
4. As shown in Figure 3-55, click the chart icon. In the next panel, select the time interval of interest (June 1 00:00 until June 2 24:00, that is, the time range in which the thresholds were exceeded).
As you can see from the chart, array A14 exceeded the Critical Stress value of 80% 28 times between June 1 and June 2, 2011.
In Figure 3-56 we show the panel with the recommended values, 12 hours for the create
In the Navigation Tree pane, expand IBM Tivoli Storage Productivity Center → Analytics → Configuration History, and click Configuration History. The software loads the snapshot data for the length of time that you specified. The Configuration History page (a variation of the Topology Viewer) displays the configuration entities and a floating snapshot selection panel. The panel allows you to define the time periods against which the configuration is compared to determine whether changes have occurred (see Figure 3-57). Use the thumb sliders to establish the time interval that you want to examine.
1. To define the time periods that you want to compare, perform the following tasks: a. Using the mouse, drag the two thumbs in the left Time Range slider to establish the desired time interval. The Time Range slider covers the range of time from the oldest snapshot in the system to the current time. It indicates the date as mm/dd/yy, where mm equals the month, dd equals the day, and yy equals the year.
Chapter 3. General performance management methodology
b. Drag the two thumbs in the right Snapshots in Range slider to indicate the two snapshots to compare. The Snapshots in Range slider allows you to select any two snapshots from the time interval specified by the Time Range slider. The value in parentheses beside the Snapshots in Range slider indicates the total snapshots in the currently selected time range.
The Snapshots in Range slider has one check mark for each snapshot from the time interval that you specified in the Time Range slider. Each snapshot in the Snapshots in Range slider is represented as a time stamp mm/dd/yy hh:mm, where the first mm equals the month, dd equals the day, yy equals the year, hh equals the hour, and the second mm equals the minute. The value in parentheses beside each snapshot indicates the number of changes that have occurred between this and the previous snapshot. Snapshots with zero changes are referred to as empty snapshots. If you provided a title while creating an on demand snapshot, the title displays after the time stamp. If you want to remove empty snapshots, click the check box to display a check mark in Hide Empty Snapshots. The Displaying Now box indicates the two snapshots that are currently active.
c. Click Apply to continue.
d. Determine the changes that have occurred to the entities by examining the icons and colors associated with them in the graphical and table views. For information about viewing the changes, see 3.6.1, Viewing configuration changes in the graphical view on page 122 and 3.6.2, Viewing configuration changes in the table view on page 124.
One single snapshot selection panel applies to all Configuration History views that are open at the same time. Any change that you make in this panel is applied to all of the Configuration History views.
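The comparison that the Configuration History view performs between two snapshots can be pictured with a small sketch; the entity names and attributes here are invented for illustration, and this is not the product's internal logic.

```python
# Two invented configuration snapshots, mapping entity name to
# a dictionary of its attributes at snapshot time.
older = {"switch01": {"ports": 16}, "vol_test": {"size_gb": 10},
         "ds8000_a14": {"raid": 5}}
newer = {"switch01": {"ports": 24}, "ds8000_a14": {"raid": 5},
         "vol_new": {"size_gb": 100}}

def classify_changes(old, new):
    """Classify every entity the way the change overlay does."""
    changes = {}
    for entity in set(old) | set(new):
        if entity not in old:
            changes[entity] = "created"      # green cross
        elif entity not in new:
            changes[entity] = "deleted"      # red minus sign
        elif old[entity] != new[entity]:
            changes[entity] = "changed"      # yellow pencil
        else:
            changes[entity] = "unchanged"
    return changes

print(classify_changes(older, newer))
```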
Table 3-6 Icons and colors of the change overlay:
- Yellow pencil, blue background: Entity changed between the time that the snapshot was taken and the time that a later snapshot was taken.
- No icon, dark gray background: Entity did not change between the time that the snapshot was taken and the time that a later snapshot was taken.
- Green cross: Entity was created or added between the time that the snapshot was taken and the time that a later snapshot was taken.
- Red minus sign, red background: Entity was deleted or removed between the time that the snapshot was taken and the time that a later snapshot was taken.
- Not applicable, light gray background: Entity did not exist at the time that the snapshot was taken or at the time that a later snapshot was taken.
Figure 3-58 shows the icons and colors of the change overlay. In the graphical view, the pencil icon beside the switches and storage entities and the blue background color indicate that change occurred to these entities. The pencil icon and blue background also appears for these entities in the table view. In the snapshot selection panel, use the Time Range and Snapshots in Range sliders to determine when the change occurred.
To distinguish them from tabs in the Topology Viewer page, tabs in the Configuration History page (Overview, Computers, Fabrics, Storage, and Other) have a light gray background and are outlined in orange. The minimap in the Configuration History page uses the following colors to indicate the aggregated change status of groups:
Blue: One or more entities in the group have changed. Note that the addition or removal of an entity is considered a change. Gray: All of the entities in the group are unchanged.
Entities in the graphical view can be active (they existed at one or both snapshots) or inactive (not yet created or deleted):
Active entities act as they normally do in the Topology Viewer; when you select them, all relevant information also appears in the table view. You can adjust a grouping of active entities, but you cannot perform actions that change the database, such as pinning.
Inactive entities do not exist in the selected snapshots, but exist in other snapshots.
They are shown with a light gray background and do not have a change icon associated with them. Inactive entities are displayed to keep the topology layout stable and to make it easier to follow what has changed (instead of having entities flicker in and out of existence when you change the snapshot selection). Inactive entities are not listed in the table view. An entity that is moved from one group to another group appears only once, in the new group, in the graphical view. For example, if the health status of a computer has changed from Normal to Warning, the Configuration History page displays the computer as changed in the Warning health group (and no longer displays the computer in the Normal health group).
Attention: In the Configuration History view, the performance and alert overlays are disabled, and the minimap's shortcut to the Data Path Explorer is not available.
Access this panel daily to verify that all the services are up and running and that the data sources (Fabric and Data/Storage Resource Agents, CIM Agents and NAPI, and Out-of-Band agents) are up and reachable.
3.7.2 Verifying that Discovery, probes, and performance monitors are running
The discovery, probe, and performance monitor jobs (and scan jobs as well, even though they are outside the scope of this book) must be configured to run daily. Figure 3-61 and Figure 3-62 show two examples: a CIMOM discovery job and an SVC probe job.
Also check that all planned performance monitors are running. As a best practice, restart the jobs daily (see Figure 3-63 for an example).
Also check within the IBM Tivoli Storage Productivity Center Job Management panel for jobs with a status of Warning or Failed (see Figure 3-64).
To check whether all storage subsystems are monitored, choose Entity Type → Storage Subsystem. A button appears with the label Show Recommendation; click it to see the recommendations from Tivoli Storage Productivity Center. By selecting an entry and clicking the Take Actions button, you can solve the issues directly from this panel (see Figure 3-65).
Tivoli Storage Productivity Center for Disk threshold: CPU utilization threshold
To create a storage subsystem alert, you can choose Disk Manager → Alerting → Storage Subsystem Alerts. In Figure 3-66, we show an alert setting for the SVC and Storwize V7000 subsystems.
The CPU Utilization Threshold takes each SVC node into account. If the CPU utilization of a node is higher than 50% (Warning) or 70% (Critical), an alert is generated, which triggers an email notification to the defined recipients. We recommend that you use reporting for performance analysis and capacity planning, and performance alerts for identifying performance problems within your environment. In this case, we recommend that you use high alert thresholds; otherwise, you might be overwhelmed with alerts. A good strategy is to start with high performance alerting thresholds, solve performance bottlenecks, and lower the thresholds over time to an accurate value. Keep in mind that the storage subsystem-attached hosts and the associated applications determine what constitutes valid subsystem performance.
Tivoli Storage Productivity Center for Fabric threshold: Port send bandwidth percentage threshold
To create a switch alert, you can choose Fabric Manager → Alerting → Switch Alerts. In Figure 3-67, we show the alert setting for the SAN ports in our fabric.
If the send bandwidth utilization of a SAN port exceeds 85% (independent of the port speed of 2, 4, or 8 Gbps), an alert is triggered. Further alerts are suppressed for one hour.
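The one-hour suppression behavior can be sketched as follows. This is not the product's implementation, and the port names are invented; it simply shows the idea of dropping repeat alerts inside the suppression window.

```python
from datetime import datetime, timedelta

SUPPRESSION = timedelta(hours=1)
last_alert = {}  # port name -> time of the last alert that fired

def should_alert(port, now):
    """Fire an alert unless one fired for this port within the hour."""
    previous = last_alert.get(port)
    if previous is not None and now - previous < SUPPRESSION:
        return False  # suppressed
    last_alert[port] = now
    return True

t0 = datetime(2011, 6, 1, 12, 0)
print(should_alert("port7", t0))                          # True
print(should_alert("port7", t0 + timedelta(minutes=30)))  # False
print(should_alert("port7", t0 + timedelta(minutes=61)))  # True
```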
By clicking Generate Report, we can see the metrics in a tabular view. From here, we can select some of the subsystems and generate a chart using all or some of the allowed metrics. In this case, we choose all storage subsystems and look at the Total I/O Rate (Figure 3-69).
The result is the chart shown in Figure 3-70. With this chart, we can see what happened to each storage subsystem in the last three days.
Frequently compare this short-term data with long-term data. Take data hourly or even daily, and adjust the timeline to a range of several days to months. The following examples of storage subsystem performance consumption show daily averages over several months (Figure 3-71).
Figure 3-71 Slowly increasing storage performance consumption (Total I/O Rate)
Figure 3-72 shows a slow reduction in performance consumption over the time period.
Figure 3-72 Slowly decreasing storage performance consumption (Total I/O Rate)
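Rolling short-term samples up into the daily averages used for this kind of long-term trending can be sketched as follows; the timestamps and I/O rates are made-up sample data.

```python
from collections import defaultdict
from datetime import datetime

# Invented (timestamp, Total I/O Rate) samples for two days.
samples = [
    ("2011-06-01 00:00", 1200.0), ("2011-06-01 12:00", 1800.0),
    ("2011-06-02 00:00", 1500.0), ("2011-06-02 12:00", 2100.0),
]

def daily_averages(samples):
    """Group samples by calendar day and average each day."""
    per_day = defaultdict(list)
    for stamp, rate in samples:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M").date()
        per_day[day].append(rate)
    return {str(day): sum(v) / len(v) for day, v in sorted(per_day.items())}

print(daily_averages(samples))  # {'2011-06-01': 1500.0, '2011-06-02': 1800.0}
```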
From the Selection view, we now save the short-term report (see Figure 3-73).
The saved report is now accessible from the Tivoli Storage Productivity Center tree (IBM Tivoli Storage Productivity Center → Reporting → My Reports → administrator's Reports), as shown in Figure 3-74.
The subtree shows the reports defined by the logged-on user, in this case administrator (the same user that created the report).
In the Selection section, we choose the metrics of interest (in this case, Total I/O Rate), choose a time range of the last 7 days, and, if required, select a filtering option to limit the data to certain subsystems, as shown in Figure 3-76 and Figure 3-77.
In the Options section, we define where the batch report is to be run, its format (PDF in our case), and the metrics it contains, as shown in Figure 3-78.
Finally, in the When to Run section, we choose to run the batch report repeatedly, each Monday morning at 6 AM, as shown in Figure 3-79.
The result is a PDF file containing a report for all DS8000 storage subsystems. Each storage subsystem has its own history chart (see Figure 3-80).
This report can now be used as an entry point to spot any obvious abnormality in the storage environment. For detailed investigations, you still need to go into Tivoli Storage Productivity Center. In Figure 3-81, we show the subsystem performance reports implemented in our environment. As you can see, there is one performance collection job for each subsystem: to avoid CIM Agent overload, for a clear view, and for best alerting.
TIP: Note that all the jobs are defined with a duration of 24 hours (see Figure 3-81). This ensures that any communication error with the CIM Agent or NAPI is detected at the start of the subsequent collection.
The log files of a performance data collection job give you a lot of information about the job itself. To access the log file details, select a job instance, right-click, and choose Job History. This jumps to the Job Management panel. Figure 3-82 shows an example of a job log file, with the relevant information highlighted:
1. Serial number of the device
2. Defined thresholds for this performance job (-1 indicates that the value is disabled)
3. Assets of the device
4. Performance data collections
In Figure 3-83, we show an example of several alerts that frequently occur in our environment. These alerts are all Total I/O Rate threshold violations.
In Figure 3-84, you can see the detailed message for the alert on the Storwize V7000.
As you can see, the threshold value of 20% CPU utilization was exceeded by the node. This value is the defined Critical Stress CPU Utilization Threshold, as shown in the highlighted line in Figure 3-85.
This value is definitely too small for the Storwize V7000 (see Figure 3-86); we set it this low only to demonstrate the alerting. In Tivoli Storage Productivity Center for Disk threshold: CPU utilization threshold on page 129, we defined the Critical Stress and Warning Stress values for the SVC and Storwize V7000.
After clicking Generate Report (see Figure 3-88), we can see that the Disk Utilization Percentage Threshold was exceeded for the DS8000 subsystem several times.
The Disk Utilization Percentage is the approximate utilization percentage of a rank over a specified time interval (the average percentage of time that the disks associated with the array were busy). In our example, we set the Critical and Warning Stress thresholds to 80 and 50 percent, respectively (for details on the alert definition, see 3.4.4, Defining the alerts on page 80), and forced these thresholds to be exceeded on the DS8000 by simulating heavy traffic using Iometer. By clicking the lens icon, we get access to the constraint violation details for the metrics of interest, as shown in Figure 3-89.
The components affected by the constraint violations are on the DS8000: mostly array A10 and sometimes A9.
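The utilization definition above (the percentage of an interval during which the array's disks were busy) reduces to a simple ratio. A minimal sketch, with illustrative busy-time figures:

```python
# Utilization = busy time / interval length, as a percentage.
# The 300-second interval matches the 5-minute sample interval
# used in this chapter; the busy-time values are invented.
def disk_utilization_pct(busy_seconds, interval_seconds=300):
    return 100.0 * busy_seconds / interval_seconds

print(disk_utilization_pct(255))  # 85.0, above an 80% Critical Stress boundary
print(disk_utilization_pct(150))  # 50.0, right at a 50% Warning Stress boundary
```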
To go deeper in the analysis, you can click the lens icon of a violation entry; for example, entry number 3 from Jun 6, 2011 12:01:42 AM. Tivoli Storage Productivity Center automatically provides a new panel that includes the volumes located on that array. Click Generate Report to create a report (Figure 3-90).
The new report shows all volumes on that array, the last samples, and which hosts are affected (see Figure 3-91).
To find out which volume utilizes the array the most, we create a history graph with the I/O rate metrics over all volumes (Figure 3-92).
Figure 3-92 Constraint Violation: Total I/O Rate of the affected volumes
Figure 3-92 shows three volumes producing traffic to this array. One solution can be to move one of these volumes to another, less utilized array.
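Picking the migration candidate amounts to ranking the volumes by their average Total I/O Rate over the samples. A sketch with invented volume names and rates:

```python
# Invented per-volume Total I/O Rate samples (ops/s) for the
# three volumes on the hot array.
volume_io_rate = {
    "vol_db01": [950.0, 1020.0, 980.0],
    "vol_app02": [210.0, 190.0, 205.0],
    "vol_log03": [480.0, 510.0, 495.0],
}

# Average each volume's samples and pick the busiest one.
averages = {vol: sum(r) / len(r) for vol, r in volume_io_rate.items()}
busiest = max(averages, key=averages.get)
print(busiest)  # candidate to move to a less utilized array
```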
At this stage, you can further analyze the relationships to the Storwize V7000. There you see that mdisk0 and mdisk1 are in the Managed Disk Group (mdiskgroup) IBM Cognos. Furthermore, you can see which volumes are placed in this Managed Disk Group (see Figure 3-94). Now you can start analyzing which of those volumes produce the most load on this Managed Disk Group, called Cognos, on the Storwize V7000.
Important: To see the data path from a server to the end device, a Data agent or Storage Resource Agent must be installed on the server itself.
Viewing the data paths in a single view allows you to monitor performance status and pinpoint weaknesses without navigating the many entities typically seen in the Topology Viewer. You can do the following tasks, for example:
- Identify critical paths and potential performance bottlenecks
- Identify unexpectedly convoluted paths through the SAN
- Verify logical and efficient paths
Reminder: All topology views with performance values present only the performance data from the last performance monitor collection.
During the time interval, a configuration change occurred on the Storwize V7000 subsystem, as you can see both from the surrounding icon in the graphical view and the highlighted line in the tabular view in Figure 3-96. For detailed descriptions of the icons and colors of the overlay, see Table 3-6, icons and colors of the change overlay on page 122. In Figure 3-97, we go into further detail, also accessing the level 2 view for the Storwize V7000 subsystem (double-click the subsystem icon).
As you can see from both the graphical and tabular views, one change occurred in the Managed Disk Group Cognos: the volume test was created.
Chapter 4.
Performance analysis is usually triggered by one of two events:
- Response time problems: Users are complaining, or you want to tune the performance.
- System indicators: Ongoing monitoring shows signs of a problem.
With that said, in the following sections we describe some steps and considerations that we suggest as an approach to performance problem resolution.
[Figure: High-level storage subsystem architecture, showing the SAN and SVC layer, the subsystem host adapters, two N-way SMP processor complexes with cache and NVS, and the RAID adapters in front of the disk arrays]
Many performance problems are related to the performance of an array, because that is often the most restricting factor in a subsystem. Based on this fact, the arrays are a good point to start your investigation if you cannot determine specific volumes causing a problem or if the problem that you face is not related to more than a few volumes.
Generally, the techniques for understanding and remediating a performance problem apply to all types of enterprise storage at any level (see Figure 4-2). Hardware resources include these:
- Host Adapter (HA) ports
- Interconnect (PCI-e busses, RIO loops)
- Cache and NVS
- DA (RAID) adapters
- RAID ranks (the disk count)
SVC or other storage virtualization layers introduce additional layers in the data path:
- Front-end I/O is timed and counted at the fiber interfaces to the SVC, representing I/O to and from the servers.
- Back-end I/O is timed and counted again at the fiber adapters, representing I/O between the SVC and the backing storage.
Tivoli Storage Productivity Center can provide monitoring and configuration support for a variety of storage subsystems. To analyze storage subsystem performance, it is important to understand the storage environment, including the storage subsystem configuration and the storage assignments to the servers. This can be achieved using the Tivoli Storage Productivity Center Topology Viewer. SAN Planner assists the user in end-to-end planning involving fabrics, hosts, storage controllers, storage pools, volumes, paths, ports, zones, zone sets, storage resource groups (SRGs), and replication. Moreover, Tivoli Storage Productivity Center provides various reports to summarize and understand the current storage subsystem configuration and allocation.
Tivoli Storage Productivity Center can provide the information to understand your SAN and storage environment, both assets and relationships, through the following features:
- Topology Viewer: This is designed to provide an extended graphical topology view and the relationships among resources:
  - Synchronized graphical and tabular views that allow users to manipulate views by enlarging, reducing, or closing one of the views
  - A locate function to search and find entities, synchronized with the tabular view
  - Overlays that allow you to turn on or off aggregated status (for example, health and performance) and membership (for example, zone and zone set) information
  See 3.7.7, Using the Topology Viewer on page 147 for an example of the use of the Topology Viewer.
- Predefined and user-defined configuration data: This is useful for viewing information about your system in addition to the graphical depiction. You can use Tivoli Storage Productivity Center for Disk and Tivoli Storage Productivity Center for Data to get information such as the array where a specific volume is located, the other volumes on that array, the number of disk drives on that array, and the RAID level. You need at least the server name to get this information:
  - Disk asset and configuration data, accessible by expanding Disk Manager → Storage Subsystems, as shown in Figure 4-3
  - Fabric configuration (and zone configuration), accessible by expanding Fabric Manager → Fabrics, as shown in Figure 4-4
The second report, Volume HBA Assignment: Not Visible to Monitored Computer (Figure 4-6), provides the details for volumes allocated to a Tivoli Storage Productivity Center monitored server where the server has not yet allocated the volume to a file system. This is a great report for identifying orphaned storage, independent of the Tivoli Storage Productivity Center edition that you are licensed for, from Tivoli Storage Productivity Center Basic through Tivoli Storage Productivity Center Standard Edition.
Figure 4-6 Volume HBA Assignment: Not Visible to Monitored Computer Report
The third report, Volume HBA Assignment: By Storage Subsystem (Figure 4-7), reviews the volumes allocated to Tivoli Storage Productivity Center monitored servers with a Tivoli Storage Productivity Center agent installed.
The report in Figure 4-7 includes the following columns:
a. Storage Subsystem
b. Volume Name
c. Volume World Wide Name (WWN)
d. HBA Port WWN
e. SMIS Host Alias
f. Volume Space
g. Computer
h. Network Address
i. OS Type
j. Disk Path
k. Disk Space
l. Available Disk Space
m. WWN Match: This column indicates Yes if a Data agent is installed on the host machine and was able to match the HBA Port WWN that was returned by a storage subsystem probe job. This column displays a value for Windows, Solaris, and HP-UX machines if the HBA API client is installed on the host agent.
n. Volume Format: This column indicates the format of the storage volume. The valid values are: Unknown, Fixed Block, Block 512, Block 520 Protected, Block 520 Unprotected, 3380, 3390, 3390 Extended, and Count Key Data. The mainframe volumes are identified as 3380, 3390, 3390 Extended, or Count Key Data.1
o. Manufacturer
p. Model
q. Probe Last Run date
r. Volume Real Space: This column reflects the physical allocated space of the volume. For normal volumes, this is equal to the volume space. For space-efficient (thin-provisioned) volumes, this is equal to the real space allocated when data is written to the volume.1
This column explanation detail was captured from the Tivoli Storage Productivity Center Volume to HBA report F1 help screen.
The fourth report, Volume HBA Assignment: By Volume Space (Figure 4-8), is nearly identical to the third report. The only variation is the column on which the data is sorted: in this case, the volume space instead of the Storage Subsystem column.
Important: Depending on the implementation of the LUN masking of your storage subsystem and the way that the information is passed through the CIM or NAPI interface to Tivoli Storage Productivity Center, the SMI-S Host Alias might have a suffix that identifies the individual HBAs. Because the report can be very long, use the filter function and specify the server name (SMI-S Host Alias) so that the server is easy to find.
Tip: If you do not know the full name to search for, type the beginning of the name followed by an asterisk, together with the LIKE operator. This is especially useful if the host name includes a suffix for HBA identification.
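The prefix-plus-asterisk filtering that the tip describes behaves like shell-style pattern matching. As a sketch (with invented alias values), Python's fnmatch reproduces the effect of a LIKE filter with a trailing asterisk:

```python
from fnmatch import fnmatch

# Invented SMI-S host aliases; the suffixes identify individual HBAs.
aliases = ["tpcblade3-11_hba0", "tpcblade3-11_hba1", "tpcblade4-02_hba0"]

# The server-name prefix plus a trailing asterisk, as in the tip.
pattern = "tpcblade3-11*"

matches = [a for a in aliases if fnmatch(a, pattern)]
print(matches)  # both HBA aliases for the server, the other host excluded
```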
Figure 4-9 shows an example of the report that you get using the filter criteria available as an option when building a custom report. We chose the volumes for the tpcblade3-11.storage.tuscon.ibm.com server.
Figure 4-9 Volume HBA Assignment: By Storage Subsystem Custom Filter Report
With this report, you can easily identify all the subsystems and volumes that are assigned to the server you are investigating, regardless of whether the server has storage assigned from multiple subsystems. If the server has either the Tivoli Storage Productivity Center Storage Resource Agent (SRA) or Data agent installed, the disk path is also available if you scroll to the right (Figure 4-10). This might help you to discover the disk that is causing the problem, as you can now correlate the details provided by the system administrator with the block storage details seen at the storage subsystem.
Figure 4-10 Volume HBA Assignment: By Storage Subsystem with SRA Agent Disk Path Shown Report
Now that you have identified your subsystem and the volumes, you can use the asset reports of the Data Manager to learn more about the configuration of your subsystem.
Figure 4-11 Tivoli Storage Productivity Center Data Manager Navigation Tree
Depending on the type of the subsystem, you see the components that are available. Most of the time, the tree view is helpful for understanding the organization of the storage device quickly, but sometimes, there is also additional information available that you cannot find anywhere else in the product, such as the RAID level. The asset reports are useful, because you can look at the data from different angles and drill up and drill down where needed.
Array sites: Array sites are usually groups of eight single disk drive modules (DDMs). When you create an array in a DS8000, you assign a RAID level to that group of disks. Within Tivoli Storage Productivity Center, the terms array and array site are used interchangeably, even though there is a difference (see Figure 4-12).
Figure 4-12 Tivoli Storage Productivity Center Asset Report: DS8000 Array Site Detail
Important information here includes the number of disks, the RAID level, and the Device Adapter (DA) to which the array or array site is connected.

Ranks: Tivoli Storage Productivity Center actually gathers statistics at a rank level from the DS8000, DS6000, and ESS subsystems (see Figure 4-13). These values are directly available in TPCTOOL and are converted for the array reports in the Tivoli Storage Productivity Center GUI. For details on TPCTOOL, see Appendix C, Reporting with Tivoli Storage Productivity Center on page 365.
Figure 4-13 Tivoli Storage Productivity Center Asset Report: DS8000 Rank Detail
You can also see in this panel whether a rank is formatted for count key data (CKD) or fixed block (FB) data.
Storage pools: On the DS8000, these are called extent pools (see Figure 4-14).
Figure 4-14 Tivoli Storage Productivity Center Asset Report: DS8000 Storage Pool (Extent Pool) Details
As shown in Figure 4-14, Tivoli Storage Productivity Center 4.2.1 now exposes additional details in the report shown for the DS8000 storage pools. Here we can see, as simple examples, whether EZ-Tier is in use and whether Solid State Disk (SSD) drives are in use. Because thin-provisioned (Space Efficient) volumes are now an available feature in the DS8000 and several other storage servers, Tivoli Storage Productivity Center was enhanced to expose this feature through a simple check of the Is Space Efficient field. If it is Yes, then reviewing the Configured Real Space against the Available Space can show you what is really available in this Space Efficient storage pool.

Tip: With the introduction of the DS8800, and with Tivoli Storage Productivity Center 4.2.1 having the ability to expose multiple ranks in a single extent pool in the DS8000, the best practice for DS8000 volumes is now to use volume striping across multirank extent pools, both for servers directly attached to the DS8000 and for the SVC. While multirank extent pool volume striping is now a best practice, it does add complexity to the storage solution from a performance troubleshooting perspective. When you have multiple volume striping methods in the storage data path for a server, you must be able to review the per-rank or per-array details. Tivoli Storage Productivity Center is able to provide that detail, so the use of multirank extent pools is less of a challenge.

WARNING: Use caution when implementing a storage solution with multiple disk striping methods without a tool such as Tivoli Storage Productivity Center in the solution. Otherwise, the solution will have black boxes: areas of the solution without any details on performance. This can prolong, or even prevent, performance problem determination.
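The Space Efficient pool check described above amounts to simple arithmetic. Here is a minimal sketch with hypothetical field names mirroring the Is Space Efficient, Configured Real Space, and Available Space columns (the figures are invented, not from any real pool):

```python
# A minimal sketch of the space-efficient pool check described above.
# Field names and values are hypothetical, chosen to mirror the GUI
# columns "Is Space Efficient", "Configured Real Space", and
# "Available Space" in the storage pool report.
def pool_headroom(pool):
    """Return how much real capacity is still available in the pool."""
    if not pool["is_space_efficient"]:
        return pool["available_space_gb"]
    # For a space-efficient pool, what is really left is the physical
    # capacity minus the real space already allocated to written data.
    return pool["capacity_gb"] - pool["configured_real_space_gb"]

pool = {
    "is_space_efficient": True,
    "capacity_gb": 10_000,
    "configured_real_space_gb": 7_500,
    "available_space_gb": 4_000,   # virtual (requested) space still free
}
print(pool_headroom(pool))  # -> 2500
```

Note how the virtual Available Space (4,000 GB here) can exceed the real headroom (2,500 GB), which is exactly why the report check matters.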
For more information about DS8000 performance, see 4.9 Plan Extent Pools in IBM TotalStorage DS8000 Series: Performance Monitoring and Tuning, SG24-7146, at this website: http://www.redbooks.ibm.com/abstracts/sg247146.html?Open
SAN Storage Performance Management Using Tivoli Storage Productivity Center
Disks: There is little information that you can get from the disk panel (see Figure 4-15). With additional information from the disk vendor, you can determine if the disk is a Fibre Channel (FC) or Serial Advanced Technology Attachment (SATA) disk, and perhaps the revolutions per minute (RPMs).
Figure 4-15 Tivoli Storage Productivity Center Asset Report: DS8000 Disk Drive Detail
Tip: As seen in Figure 4-15, with Tivoli Storage Productivity Center 4.2.1, additional disk drive data fields were added. This is intended to support identifying if a Disk Drive Module (DDM) in the DS8000 is a Solid State Disk (SSD), is an encryptable candidate drive, or is encrypted.
Volumes: If you select a volume, you can see the number of disks across which the volume is spread, as well as the RAID level of the volume. One drawback is that you cannot see the array in which the volume was created. The array tree does show the volumes, so you would have to go through all the arrays to look for a particular volume (for a CKD volume example, see Figure 4-17; for an FB volume example, see Figure 4-16). However, during performance problem determination, you do not need to go through all the arrays, because you can drill down in the performance reports in the GUI.
Figure 4-16 Tivoli Storage Productivity Center Asset Report: DS8000 Fixed Block Volume
Figure 4-17 Tivoli Storage Productivity Center Asset Report: DS8000 Mainframe CKD Volume
Figure 4-18 Tivoli Storage Productivity Center Asset Report: Manage Disk Group (SVC Storage Pool) Detail
Managed Disks: Figure 4-19 shows the Managed Disks for the selected SVC. Little additional information that you need for performance problem determination is provided here. However, the report was enhanced in 4.2.1 to reflect whether an MDisk is a Solid State Disk (SSD). This is key because you must manually mark an MDisk in the SVC as an SSD candidate for EZ-Tier.
Figure 4-19 Tivoli Storage Productivity Center Asset Report: Managed Disk Detail
Virtual Disks (also called volumes): Figure 4-20 shows virtual disks for the selected SVC, or in this case, a virtual disk or volume from a Storwize V7000.

Tip: Virtual disks for either the Storwize V7000 or the SVC are presented identically within Tivoli Storage Productivity Center in this report, so only Storwize V7000 screens were selected; they also reflect the SVC version 6.1 impact with Tivoli Storage Productivity Center 4.2.1.
Figure 4-20 Tivoli Storage Productivity Center Asset Report: Virtual Disk Detail
The virtual disks are referred to as volumes in other performance reports. For the volumes, you see the managed disk (MDisk) on which the virtual disks are allocated, but you do not see the correct RAID level. From an SVC perspective, you often stripe the data across the MDisks, so Tivoli Storage Productivity Center displays RAID 0 as the RAID level. As with many other reports, this report was also enhanced to report on EZ-Tier and Space Efficient usage; these key value-add features were added to the SVC after the first release of this book. In this example screen capture, you see that EZ-Tier is enabled for this volume, yet it is inactive. In addition, this report was enhanced to show the quantity of storage for this volume in the EZ-Tier (SSD) tier compared to the HDD tier.

Tip: IBM EZ-Tier is a function that automatically removes hot spots for volumes through the migration of sub-volume extents from volumes built on HDD to SSD. This migration removes hot spots and can drastically increase application performance. While this automatic function is enabled, only sub-volume extents actually migrated from HDD to SSD disks show activity.

There is another report that can help you see the actual configuration of the volume. This report includes the MDG or Storage Pool, Back-End Controller, MDisks, and much more detail; however, this information is not available in the asset reports on the MDisks.
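The extent migration the Tip describes can be pictured as a simple heat-based promotion. The sketch below is only an illustration of the idea, not IBM's actual EZ-Tier algorithm; the extent records, I/O rates, and capacity figures are invented:

```python
def plan_migrations(extents, ssd_free_extents):
    """Pick the hottest HDD extents to promote to SSD, highest I/O first."""
    hdd = [e for e in extents if e["tier"] == "HDD"]
    hottest = sorted(hdd, key=lambda e: e["io_rate"], reverse=True)
    return [e["id"] for e in hottest[:ssd_free_extents]]

# Hypothetical per-extent I/O rates gathered over a monitoring window.
extents = [
    {"id": 0, "tier": "HDD", "io_rate": 850},
    {"id": 1, "tier": "HDD", "io_rate": 12},
    {"id": 2, "tier": "SSD", "io_rate": 900},   # already promoted
    {"id": 3, "tier": "HDD", "io_rate": 430},
]
print(plan_migrations(extents, ssd_free_extents=2))  # -> [0, 3]
```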
Volume to Back-End Volume Assignment: Figure 4-21 shows the location of the Volume to Back-End Volume Assignment report within the Navigation Tree.
Figure 4-22 shows the report, in which the virtual disks are referred to as volumes.
Figure 4-22 Tivoli Storage Productivity Center Asset Report: Volume to Back-End Volume Assignment
This report provides many details about the volume. While specifics of the RAID configuration of the actual MDisks are not presented, the report is quite useful in that all aspects from the host perspective to the back-end are in one report. The following details are available and quite useful:
- Storage Subsystem containing the disk in view; for this report, this is the SVC.
- Storage Subsystem type; for this report, this is SVC.
- User-Defined Volume Name.
- Volume Name.
- Volume Space: the total usable capacity of the volume.
  Tip: For space-efficient volumes, this value is the amount of storage space requested for these volumes, not the actual allocated amount. This can result in discrepancies in the overall storage space reported for a storage subsystem using space-efficient volumes. This also applies to other space calculations, such as the calculations for the Storage Subsystem's Consumable Volume Space and FlashCopy Target Volume Space.
- Storage Pool associated with this volume.
- Disk: the MDisk the volume is placed upon.
  Tip: For SVC or Storwize V7000 volumes spanning multiple MDisks, this report has multiple entries for that volume to reflect the actual MDisks the volume is using.
- Disk Space: the total disk space available on the MDisk.
- Available Disk Space: the remaining space available on the MDisk.
- Back-End Storage Subsystem: the name of the storage subsystem this MDisk is from.
- Back-End Storage Subsystem type: the type of storage subsystem this is.
- Back-End Volume Name: the volume name for this MDisk as known by the back-end storage subsystem (a big time saver).
- Back-End Volume Space.
- Copy ID.
- Copy Type: the type of copy this volume is being used for, such as Primary or Copy, for SVC versions 4.3 and newer. Primary is the source volume, and Copy is the target volume.
- Back-End Volume Real Space: for fully provisioned back-end volumes, this is the actual space; for Space Efficient back-end volumes, this is the real capacity allocated.
- Easy Tier: indicates whether EZ-Tier is enabled on the volume.
- Easy Tier status: active or inactive.
- Tiers.
- Tier Capacity.
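Because a volume that spans several MDisks appears once per MDisk in this report, a small script can collapse the rows to show each volume's MDisk footprint. A sketch with invented row data (the column names are hypothetical stand-ins for the report columns):

```python
from collections import defaultdict

# Hypothetical rows from the Volume to Back-End Volume Assignment report:
# a volume spanning several MDisks appears once per MDisk, as noted above.
rows = [
    {"volume": "vdisk7", "mdisk": "mdisk0", "backend_volume_space_gb": 128},
    {"volume": "vdisk7", "mdisk": "mdisk1", "backend_volume_space_gb": 128},
    {"volume": "vdisk9", "mdisk": "mdisk2", "backend_volume_space_gb": 256},
]

# Collapse the per-MDisk rows to see each volume's MDisk footprint.
footprint = defaultdict(list)
for row in rows:
    footprint[row["volume"]].append(row["mdisk"])

print(dict(footprint))
# -> {'vdisk7': ['mdisk0', 'mdisk1'], 'vdisk9': ['mdisk2']}
```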
The most important information in Figure 4-23 is the RAID level as provided under the Type field.
There are no reports about the storage pools in the performance reports. A DS5000 storage pool can be compared with the extent pools in a DS8000 or a DS6000. For the Storage Pools reports, you can drill down to the disks and LUNs in the storage pools, similarly to the reports on the other storage devices.

Disks: The Disks panel gives you information about the DDMs. Figure 4-24 shows the information that you can get for each DDM. However, you do not see the position of the disks in the enclosures and loops of the DS5000.
The best way to get information about the disks is to look at the LUNs report and then drill down (see lower right part of Figure 4-25).
LUNs: On the LUN report, you can also find the RAID level of a LUN, which is always the same as the RAID level of its array. The nice thing about this report is that you can actually see in which enclosure (tray) and slot the disks are located. You do not see the DDM's worldwide name (WWN), so it is hard to correlate this information with the information in the Disks report in Figure 4-24, but for performance problem determination, this is not that important.
You do not have to remember all the information that is in the reports. We just wanted to step you through the necessary information about your environment, and to show you the volume and server that currently face a performance problem.
Figure 4-26 Tivoli Storage Productivity Center Asset Report: XIV Detail for Storage Subsystem
Storage Pools: The XIV storage pools shown in Figure 4-27 are logical pools of combined capacity; a storage pool has nothing to do with the physical cabling or the location of disks within the enclosures.
Figure 4-27 Tivoli Storage Productivity Center Asset Report: XIV Storage Pools
Because EZ-Tier is not available on XIV, the fields reflect this through the N/A state.
Chapter 4. Using Tivoli Storage Productivity Center for problem determination
Disks: Each of the disks in the XIV storage system is shown, as displayed in Figure 4-28.
Figure 4-28 Tivoli Storage Productivity Center Asset Report: XIV Disk Details
Volumes: On the volume report (see Figure 4-29), you can also find the RAID level of a volume. While Tivoli Storage Productivity Center reports this as RAID 10, in reality XIV volumes do not use traditional RAID techniques at all. RAID 10 reflects the striping across numerous disks and the fact that the volume extents are mirrored.
Figure 4-29 Tivoli Storage Productivity Center Asset Report: XIV Volume Details
If a problem can be seen by an end user, ask yourself whether the following conditions exist:
- The problem is due to a perception that things are slower than yesterday.
- The performance result is still within the business-defined boundaries for the solution.
If so, the solution is working as designed, and resolving the complaint might require a solution redesign.
can be perceived as a problem that you cannot resolve unless a solution redesign is performed.
There are often scheduling issues that result in higher I/O and more problems at night than during the day, because batch, backup, and other applications all run at the same time.

Overloaded RAID ranks: There is too much I/O on too few disks. Rank skew occurs when there is too much I/O directed at too few arrays. The problem grows as physical disks get bigger and bigger, because users buy fewer spindles. That is usually the first place where we start seeing performance problems.
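Rank skew can be quantified with a simple ratio of the busiest rank's utilization to the average across ranks. This metric is a common convention for illustration here, not a Tivoli Storage Productivity Center report column:

```python
def rank_skew(utilizations):
    """Ratio of the busiest rank to the average rank: 1.0 means evenly
    spread I/O; larger values mean the load is skewed onto few ranks."""
    mean = sum(utilizations) / len(utilizations)
    return max(utilizations) / mean

# Invented percent-busy figures per rank, from a hypothetical interval.
balanced = [40, 42, 38, 40]
skewed = [85, 10, 12, 13]
print(round(rank_skew(balanced), 2))  # -> 1.05
print(round(rank_skew(skewed), 2))    # -> 2.83
```

A skew near 1.0 says the ranks share the load; a skew near the rank count says one rank is doing nearly all the work.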
Overloaded ports: This occurs rarely, but it can happen. The ports are driven too hard during a heavy workload.

Poor read hit ratio: Poor read hit ratios are common in an online transaction environment. There is not much you can do about it; read hit ratios are a characteristic of the applications, not the storage.
A hot array means a single array that experiences a disproportionate amount of I/O, resulting in high disk utilizations and long response times. This is bad because it results in unnecessarily poor performance; other arrays could share the load.
XIV storage
A unique storage server available today is the IBM XIV Storage System. This device, utilizing either 1 TB or 2 TB SATA disks in a configuration of 72 to 180 disk drives, uses grid technology and storage software to provide highly available, performance-rich storage to application servers. This class of storage server is self-tuning and provides high availability without typical RAID hardware or software techniques. Using a patented redundancy technique at a 1 MB level, the XIV storage system is able to withstand single or dual drive failures in a single module, or even the loss of a full drive module, without suffering a volume outage or the performance penalty normally incurred during RAID rebuilds for drive replacements. The XIV storage system can provide remarkable performance and availability for applications and enterprises where deep performance skills are not available, or where reduced staffing requires vendor technology to deliver performance and availability without the normal overhead of tuning or performance management.
Automatic tiering
Automatic tiering is the ability to move volumes, or subvolumes, dynamically between tiers of storage without causing an outage or application impact. In IBM technologies, this feature is known as EZ-Tier. This technology has currently been introduced in the IBM System Storage DS8800 and the SVC. Although both storage devices support EZ-Tier, the specifics are slightly different. In the SVC implementation, SSD disks can be utilized from within the SVC nodes, or from managed disks (MDisks) or LUNs from back-end storage servers that support SSD disks, such as the DS8800. With the SVC, after the SSD disks are identified and LUNs or MDisks are created, these MDisks can be added to storage pools managed by the SVC. Currently, the SVC supports a two-tier automatic tiering implementation. This means that through the SVC's EZ-Tier feature, after virtual volumes are enabled, the SVC manages, at an extent level, the migration of hot sub-volume extents from the HDD MDisks to the SSD MDisks within the same storage pool. This feature can dramatically enhance the performance of the virtual volumes supported by applications using the SVC. For further details on either the SVC and Storwize V7000 or DS8800 implementations of EZ-Tier, see one of the following topics. If storage from another vendor is being managed by Tivoli Storage Productivity Center, see that vendor's documentation for specifics. While several vendors have announced facilities similar to EZ-Tier, they each have their own unique characteristics.
4. If the I/O is a read I/O:
a. The SVC needs to check the cache to see if the read I/O is already there.
b. If the I/O is not in the cache, the SVC needs to read the data from the physical LUNs.
5. At some point, write I/Os are sent to the storage controller.
6. The SVC might also do some read-ahead I/Os to load the cache in case the next read I/O can be served from cache.

SVC striping for performance
c. If the I/O is part of a Metro or Global Mirror relationship, a copy needs to go to the target volume of the relationship.
d. If the I/O is part of a FlashCopy and the FlashCopy block has not been copied to the target volume, this action needs to be scheduled.
4. If the I/O is a read I/O:
a. The Storwize V7000 needs to check the cache to see if the read I/O is already there.
b. If the I/O is not in the cache, the Storwize V7000 needs to read the data from the physical MDisks.
5. At some point, write I/Os are destaged to Storwize V7000 managed MDisks or sent to the back-end SAN-attached storage controllers.
6. The Storwize V7000 might also do data-optimized, sequential-detect prefetch cache I/Os to pre-load the cache when its cache algorithms determine that the next read I/O will benefit from this approach over the more common Least Recently Used (LRU) method used for non-sequential I/O.
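The read path in steps 4 through 6 can be sketched as a toy cache model. This is only an illustration of the cache-check, back-end-read, and sequential-prefetch ideas, not the actual Storwize V7000 cache implementation:

```python
# A simplified model of the read path in steps 4-6 above: check the
# cache first, go to the back-end MDisks on a miss, and prefetch the
# next block when sequential access is detected.
class ReadCache:
    def __init__(self):
        self.cache = set()
        self.last_block = None
        self.backend_reads = 0

    def read(self, block):
        if block not in self.cache:          # step 4b: cache miss
            self.backend_reads += 1          # read from physical MDisks
            self.cache.add(block)
        if self.last_block is not None and block == self.last_block + 1:
            self.cache.add(block + 1)        # step 6: sequential prefetch
        self.last_block = block
        return block

c = ReadCache()
for b in [10, 11, 12, 13]:                   # a sequential workload
    c.read(b)
print(c.backend_reads)  # -> 2
```

Only the first two reads go to the back end; once the sequential pattern is detected, the prefetched blocks are already in the cache, which is the benefit step 6 describes.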
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
Figure 5-1 illustrates the Top 10 reports for SVC, Storwize V7000, and Disk subsystems.
Figure 5-1 Top 10 reports for SVC, Storwize V7000, and Disk

[The figure lists, per device type, the Top 10 report categories: Controller performance reports (*) covering Data Rates and balanced I/O rates; Managed Disk Group and Managed Disk reports covering Response Time and Backend Data Rates; Top Volumes reports covering Cache, Data rate, Disk, I/O rate, and Response performance; Ports reports (**); and Array performance reports (*) covering Disk Utilization, Total I/O rate, Backend I/O rate, Backend Data rate, Backend Response Time, and Write cache delay. XIV availability legend: (*) not available on XIV; (**) available only for XIV 10.2.4 or later; (***) also available on XIV, at the Module level.]
To interpret and evaluate these reports, we recommend that you first refer to your created baseline, as documented in Creating a baseline with Tivoli Storage Productivity Center on page 68. During each report review, we consider some Rule of Thumb (ROT) impacts. Considerations about ROT values are detailed in Chapter 3, General performance management methodology on page 53, and a summary of them is included in Appendix A, Rules of Thumb and suggested thresholds on page 327.
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
Figure 5-2 provides a view of the Top 10 reports that we are reviewing and a numbered prioritization approach showing how to walk through the reports.
Important: For the IBM XIV Storage System (not for DSx000 storage subsystems), the additional Module Cache Performance Report is available. See 5.2.7, IBM XIV Module Cache Performance Report on page 228 for details.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, as shown in Figure 5-3.
When the report is presented, you see a tabular report on the screen. This includes many columns that are commonly of interest, including these:
- Time Interval
- Read, Write, and Total I/O Rate
- Read, Write, and Total Data Rate
- Read, Write, and Overall Response Times
The reports included in this section have a larger purpose than just being a group of reports for disk. These reports are a set of foundation reports that you can use to create custom saved reports. Many additional columns are available through the selection tab on each of these reports. After you have selected the columns and placed them in the order you want, you can save this new tabular report for future reuse, for example, as an SLA report.
Figure 5-4 shows the pop-up window and the three metrics selected for all I/O rates to be seen on a graph.
After you have selected your metrics, click Ok to generate the chart. We recommend that you create three separate reports to provide a total I/O view of your storage environment: one report based upon Read I/O, the next on Write I/O, and the third on Total I/O. This provides a quick review of your subsystems by type of I/O and a quick reference regarding how your I/O workloads are distributed within your storage environment. Finally, these reports can be used to identify your most I/O-bound subsystems. Figure 5-5 shows a Total I/O rate report as an example. This is the report from which we expect you to start your analysis. (As you can see, there is a straight line for the Storwize that joins the samples from May 23, 00:15 to May 23, 08:40. This means that a connection problem with the device occurred during that time frame, and Tivoli Storage Productivity Center did not get any performance values.)
In Figure 5-5, which shows the Total I/O Rate report in our lab test case, you can see that DS8000-1302541 is receiving the highest workload.
The foregoing three reports provide a foundational view from an I/O perspective, which addresses one of the aspects of problem determination discussed in Chapter 4, Using Tivoli Storage Productivity Center for problem determination on page 153. We introduced this idea in 4.2.2, Understanding your configuration on page 155. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Data rates
In the same way that you reviewed I/O rates previously, you can look at Data Rates for your subsystems, starting with Total Data Rates and from there going deeper to analyze read and write workload. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, click the icon, and select Total Data Rates. Then click Ok, as shown in Figure 5-8.
Click Ok to generate the chart. This report provides an overview of a Subsystems Total Data Rate throughput, as shown in Figure 5-9.
In Figure 5-9, confirming what was seen in the previous section (I/O rates (overall) on page 189), we see that the highest throughput again belongs to DS8000-1302541. As our example shows, the data rate reports are able to identify the most heavily utilized storage devices. Through this report, you can understand the total data bandwidth usage. Although data bandwidth is not a typical metric used with SLA reporting, being aware of the bandwidth used by a subsystem can help you review whether you have a bottleneck or a lack of bandwidth in your SAN environment, both of which are elements of capacity planning. We review capacity planning in Chapter 6, Using Tivoli Storage Productivity Center for capacity planning management on page 305.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Response Time
To further understand the performance of your subsystem, you can produce a report about Response Times as provided for the entire subsystem. This metric is very high level, but is typically included in SLA reports. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, click the icon, and select the Overall Response Time. Then click Ok, as shown in Figure 5-10.
Figure 5-11 on page 195 shows the following response times:
- Good response time at the subsystem level for the XIV and DS8000-1901 devices (less than or equal to 10 msec) during the entire time interval analyzed
- Bad response time for the DS8000-2451 device (more than 20 msec) between May 23 at 10:30 and May 23 at 12:40
- Very bad response time for SVC-svc1 and V7000-2076 (peaks of 50 msec) between May 23 at 10:30 and May 23 at 17:30
This chart reflects an average of all activities on each storage subsystem. You have to perform a deep dive into your subsystem to identify bottlenecks. These bottlenecks can be introduced through improper planning, configuration, or overuse of an array. Reviewing your response times from the top down allows you to identify these bottlenecks. This deeper dive into reports is reviewed later in this chapter (see Top 10 for Disk #2: Controller Performance reports on page 197).
Recommendations
In storage performance management, we often assume that 10 msec is fairly high for Tier 1 class storage, and most disk modeling tools assume this. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with a 20 msec response time or higher. The appropriate value might also change between shifts or on the weekend: a response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent.

The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current magnetic disk products. In crude terms, the service time of a magnetic disk is composed of seek time, latency, and data transfer time. Nominal seek times these days range from 4 to 8 msec, though in practice, many workloads do better than nominal; it is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect a 5 to 7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time, so a 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.

For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up.
A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of the disks, we might push the disks in the ranks to 70% utilization; this produces high rank response times, which are averaged with the cache hits to yield acceptable overall response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
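The averaging described above, together with the earlier queueing rule of thumb, can be verified with a few lines of arithmetic (the hit ratios and service times below are assumed figures, not measurements):

```python
# The subsystem's average response time is the cache-hit response time
# and the back-end rank response time weighted by the read hit ratio.
def avg_response(hit_ratio, cache_ms, rank_ms):
    return hit_ratio * cache_ms + (1 - hit_ratio) * rank_ms

# High hit ratio: busy ranks (25 msec) still average out acceptably.
print(round(avg_response(0.80, 1.0, 25.0), 1))  # -> 5.8
# Poor hit ratio: the ranks must respond quickly to keep the average low.
print(round(avg_response(0.30, 1.0, 25.0), 1))  # -> 17.8

# Under simple M/M/1 queueing assumptions, wait time = service * u/(1-u),
# so a disk at 50% utilization waits about one service time, giving the
# 10-14 msec response times mentioned above for a 5-7 msec service time.
def disk_response(service_ms, utilization):
    return service_ms + service_ms * utilization / (1 - utilization)

print(disk_response(6.0, 0.50))  # -> 12.0
```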
To simplify, we can assume that front-end response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec. There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric will drive response times much higher than 30 msec.

Important: All of the above considerations are not valid for SSD disks, where seek time and latency are not applicable. We can expect much better performance from these disks, and therefore very short response times (less than 4 msec), especially when the I/O workloads are random in nature and use small-block I/O transfers. These are the types of workloads for which SSD, and specifically EZ-Tier, can provide large value opportunities. See page 64 for further details on SSD performance.

For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Data rates
You can use this chart to understand how the throughput is divided between the controllers of your subsystems, to determine whether a subsystem is well balanced. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Controller Performance. Separate rows are generated for Server 1 and Server 2 of DS8000 subsystems, as shown in Figure 5-12.
In our example, we focus on the DS8000-1302541, examining the Server 1 and Server 2 performance separately. To do this, we can highlight the related Server line in the Controller tab, or choose the desired Server by clicking the Selection... button in the Selection tab of the Controller Performance window, as shown in Figure 5-13.
To review the throughput of the DS8000 controllers, click the icon and select Read Data Rate, Write Data Rate, and Total Data Rate while holding down the Shift key. A line graph is generated, as shown in Figure 5-14.

Tip: To get a more readable picture, we recommend that you create a chart for one controller at a time.
You can read the maximum throughput reached by Server 1 and compare it to the value of Server 2, as shown in Figure 5-15. To create the following chart, repeat this step: click the icon and select Read Data Rate, Write Data Rate, and Total Data Rate while holding down the Shift key.
The throughput of Server 2 is higher than that of Server 1 for most of the time, although there are some time frames where Server 1 is much higher than Server 2. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53. In a real environment, the typical usage of this report is to determine whether your controllers are obtaining balanced usage based upon your current configuration. Best practice design expects that LUNs or volumes can obtain maximum performance only when the storage device uses all available resources in a balanced way. This report provides a direct method to measure this critical aspect of your storage environment.
I/O rates
As was seen with the Data Rates, the I/O Rates also require a review for balance. As shown before, we are providing a set of example reports typical for this type of request. Remember that the key here is to identify when a less than balanced distribution of I/O exists within the different storage subsystems. From the same starting point as with the Data Rates, we have the performance metrics pop-up displayed in Figure 5-16.
Click the Ok button to generate the performance chart report by controller with the selected metric, as shown in Figure 5-17. As for the Data Rates reports, we focus on DS8000-1302541.
The chart shown in Figure 5-17 confirms a poorly balanced I/O distribution within the DS8000. There are also some spikes representing large differences in I/O rates. Further digging into the specific read/write I/O patterns will provide additional input on where the imbalanced load is coming from. In a typical user environment, this is indicative of an error to investigate further.
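One way to quantify the imbalance described here is to compare the average I/O rate of the two controllers over the reporting window. This is a hypothetical sketch: the sample values are invented, and in practice the numbers would come from a Tivoli Storage Productivity Center report export:

```python
# Hypothetical sketch: quantify controller imbalance from exported samples.
# The data below is illustrative, not a real measurement.

def imbalance_ratio(server1_iops, server2_iops):
    """Ratio of the busier controller's average I/O rate to the quieter one's."""
    avg1 = sum(server1_iops) / len(server1_iops)
    avg2 = sum(server2_iops) / len(server2_iops)
    hi, lo = max(avg1, avg2), min(avg1, avg2)
    return hi / lo if lo > 0 else float("inf")

s1 = [1200, 1350, 900, 1100]   # Server 1 samples (IO/s)
s2 = [400, 380, 2100, 450]     # Server 2 samples (IO/s)
print(f"imbalance ratio: {imbalance_ratio(s1, s2):.2f}")
```

A ratio near 1.0 indicates a balanced subsystem; the larger the ratio, the more one controller is carrying the load.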
We now focus on DS8000-1301901. Click the DS8000-1301901 Server 1 entry to highlight it. Click the icon and select Write Cache Hit Percentage (normal) and Write Cache Hit Percentage (sequential). A line graph is generated, as shown in Figure 5-19.
Recommendations
Write Cache Delay percentage is the percentage of write requests coming from a server (or another subsystem when doing remote mirroring) that had to be delayed while existing cache pages were destaged to free up cache pages for the new writes. The optimum value is zero, or anything close to that. Tivoli Storage Productivity Center has built-in default alerts defined at 3% for warning stress and 10% for critical stress. To overcome a problem related to high Write Cache Delay percentage values, you have two options:
- Add more cache. Although this solution sounds logical, often the relief is only minimal, especially if the duration of the problem is not short but is seen in consecutive sample intervals.
- Distribute the load. This is the better solution, because in situations where you see high Write Cache Delay percentage values, the reason is that more write requests are incoming than can be written to the back-end in the same time. The real problem is that the destage process is too slow. To solve it, determine how you can distribute the load so that data can be destaged faster. How to achieve this depends on the subsystem, but generally you must try to spread the load onto more resources, which can be multiple disks, arrays, disk adapters, or loops.
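The default alert levels described above (3% warning, 10% critical) can be expressed as a small classifier. This is only a sketch of the thresholds, not product code; the function name is ours:

```python
# Sketch of the default Write Cache Delay alert levels (3% warning,
# 10% critical) described in the text. Illustrative only.

WARNING_PCT = 3.0
CRITICAL_PCT = 10.0

def cache_delay_status(delay_pct):
    """Classify a Write Cache Delay percentage sample."""
    if delay_pct >= CRITICAL_PCT:
        return "critical stress"
    if delay_pct >= WARNING_PCT:
        return "warning stress"
    return "normal"

for pct in (0.0, 2.5, 4.1, 12.7):
    print(pct, cache_delay_status(pct))
```

Consecutive samples in "warning stress" or worse are the situation where adding cache alone rarely helps and load distribution is needed.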
204
To analyze Controller cache read usage, expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Controller Cache Performance. Select the DS8000 Server entries to highlight them. Click the icon and select Read Cache Hit Percentage (overall), as shown in Figure 5-21.
Figure 5-21 DS8000 Controller Read Cache Hit selection
Recommendations
Read Cache Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. Typical cache usage for enterprise database servers involves sequential I/O workloads with pre-fetch cache loads. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open Systems workloads), but larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
When the number of IOPS to a rank is near or above 1000, the rank can be considered very busy. For 15K RPM disks, the limit is a bit higher. But these high I/O rates to the back-end ranks are not consistent with good performance; they imply that the back-end ranks are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy ranks. A hot array is a single array that experiences a disproportionate amount of I/O, resulting in high disk utilizations and long response times. This is unnecessary poor performance, because other arrays can share the load.
The Disk Utilization percentage metric available in the Array Performance report is normally used to understand the degree of utilization of the back-end arrays serviced by a subsystem. This is critical to understand, because when an array reaches 50% utilization, there is an impact to write and read response times. Typically an array with 70% or higher sustained utilization is a target for capacity management. It is very easy to reach nearly 100% utilization on an array for short periods, such as during a sequential batch import or export, or during a backup job; however, an array sustained above 70% will inject storage queuing into the volumes based on that array. With multiple arrays selected, the generated chart can be used to check whether one array is busier than the others and whether the workload is balanced. Select all the arrays you want to investigate and click the icon. Select the Disk Utilization percentage metric and click Ok. Tip: When creating a chart, you can always modify the Limit days From: and To: fields to zoom into the time period that you want to focus on. Then click Generate Chart to regenerate the graph.
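The 50% and 70% guidance above can be sketched as a simple flagging rule over exported utilization samples. The function and sample numbers are our own illustration, not a product feature:

```python
# Illustrative check of the array utilization guidance: response-time
# impact begins around 50%, and sustained 70%+ marks a capacity-
# management target. Sample data is invented.

def utilization_flag(samples_pct, sustained_limit=70.0):
    """Flag an array based on its average utilization over the window."""
    avg = sum(samples_pct) / len(samples_pct)
    if avg >= sustained_limit:
        return "capacity-management target"
    if avg >= 50.0:
        return "response-time impact likely"
    return "ok"

print(utilization_flag([30, 40, 35]))   # ok
print(utilization_flag([55, 60, 52]))   # response-time impact likely
print(utilization_flag([75, 82, 71]))   # capacity-management target
```

Averaging over the window matters: a single 100% spike during a backup job does not, by itself, make the array a capacity-management target.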
Figure 5-24 shows a report for the first 10 arrays. Array 14 on DS8000-1301901 had two peaks of disk workload that can be investigated further. This is a capacity planning alarm that must be raised, because any event such as the loss of a cache module, the rebuild of a RAID array from a hot spare, and so on, can cause large application impacts.
Recommendations
If there are a lot of cache hits, the subsystem response time might be well below 5 msec even with a high Disk Utilization percentage, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times that are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
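The averaging effect described here can be made concrete: with hit ratio h, a hit response R_hit, and a back-end (miss) response R_miss, the observed average is roughly h·R_hit + (1−h)·R_miss. The numbers below are assumed examples, not measured values:

```python
# Sketch of how cache hits average with back-end responses.
# All timing values here are assumed examples.

def avg_response(hit_ratio, hit_ms, miss_ms):
    """Blended front-end response time (msec) from hits and misses."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# High hit ratio: even a 25 msec back end yields a good front-end average.
print(avg_response(0.90, 1.0, 25.0))  # roughly 3.4 msec
# Poor hit ratio: the back end must be fast to keep the average acceptable.
print(avg_response(0.30, 1.0, 25.0))  # roughly 17.8 msec
```

This is why a 70% busy rank can coexist with acceptable front-end response times when the hit ratio is high, and why a poor hit ratio exposes every back-end millisecond to the application.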
You can run historical charts on the controller by clicking the icon and selecting any available metric. See 5.2.2, Top 10 for Disk #2: Controller Performance reports on page 197 for further details.
See Total I/O Rate (overall) on page 211 to investigate further how to determine which logical volume is the primary I/O-generating volume causing the excessive Disk Utilization percentage shown in the foregoing report.
Recommendations
The throughput for storage volumes can range from fairly small numbers (1 to 10 IOPS) to very large values (more than 1000 IOPS). This depends a lot on the nature of the application. When the I/O rates (throughput) approach 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume. I/O rates for disks and RAID ranks are discussed in the next section. For traditional volumes on a single array, the volume performance is mostly limited by the disk array performance.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to generate the chart shown in Figure 5-29. The red traffic light is displayed here because the report reflects a rate of more than 1,000 IOPS for one of the arrays included in this report. Because this is not a recommended workload for a DS8000 array, we draw your attention to this situation.
Recommendations
The rank I/O limit depends on many factors, chief among them the number of disks in the rank and the speed (RPM) of the disks. But when the number of I/Os per second to a rank is near or above 1000, the rank must be considered very busy. For 15K RPM disks, the limit is a bit higher. These high I/O rates to the back-end ranks are not consistent with good performance; they imply that the back-end ranks are operating at very high utilizations, indicative of considerable queueing delays. Good capacity planning demands a solution that reduces the load on such busy ranks.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to generate the chart shown in Figure 5-31. Only array A14 reached a throughput of 143 MB/sec as the highest value, which is a high but still acceptable value.
Recommendations
The maximum bandwidth at the rank level for sequential read activity (64 KB I/O size) is 240 MB/sec, and 150 MB/sec for write. The Redbooks publication IBM TotalStorage DS8000 Series: Performance Monitoring and Tuning, SG24-7146, contains information about the maximum bandwidth. Limiting write workload to one rank can increase the persistent memory destaging execution time and so impact all write activities on the same DS8000 subsystem. To avoid this situation, spread write I/O across multiple ranks. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Click Ok to produce the report shown in Figure 5-33. Notice that the response time shown at the back-end level reached values that are too high. This needs to be investigated at the volume level to determine the current impact to volumes using these arrays.
Recommendations
For random read I/O, the back-end rank (disk) read response time must seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-End Write Response Time will be higher because of RAID 5, RAID 6, or RAID 10 striping algorithms, but must seldom exceed 80 msec.
To investigate which volumes can be affected by poor Back-End Response Time, click the icon next to array 2, as shown in Figure 5-34.
You get the list of all volumes belonging to that array. Select them all, click the icon, and select Overall Response Time. Click Ok to generate the next chart. You get multiple screens, each one containing 10 volumes (or any other customized value). Look for the volumes with the highest response time. Figure 5-35 shows that the highest value is 36 msec, which sounds reasonable. A good cache hit percentage can justify this configuration, and it ought not to be a factor for applications. In a best practice environment, this report justifies a review of the storage configuration to obtain better balance; the expectation is a large reduction in these higher than normal response times.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
The DS8000 stores data in persistent memory before sending an acknowledgement to the host. If the persistent memory is full (no space available), the host receives a retry for its write request. In parallel, the server has to destage data stored in its persistent memory to the back-end disks before accepting new write operations from any host. If one of your volumes is facing delayed write operations due to a persistent memory constraint, you need to move the volume to a less-used rank, or spread the volume over multiple ranks (increasing the number of DDMs used) to avoid this situation. If this solution does not fix the persistent memory constraint problem, you can consider adding cache capacity to your DS8000.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Metrics: In Tivoli Storage Productivity Center 4.2, a new Volume Utilization metric is available for all storage devices. It can provide a quick view into hot volumes as seen by servers and can be used as a starting point for performance analysis. This metric allows you to display a combination of two important metrics in a single report. For details on this metric, see Table B-2 on page 337. Even though the Top Volume reports do not include the Volume Utilization metric, it can be quite helpful when reviewing this report to check highly utilized volumes and determine whether a volume needs attention. In effect, this metric gauges the potential for the volume to be out of gas.
This report provides a valuable tool for reviewing the impact of caching within your storage environment, because caching for both read and write I/Os has a direct relationship to the I/O response times seen by applications.
Click the icon and select Read cache Hits percentage (overall). Click Ok to generate the chart shown in Figure 5-38.
Recommendations
The read cache hit ratio shows how efficiently your cache works on the disk subsystems. For example, a value of 100% indicates that all read requests are satisfied from the cache. If the disk subsystem cannot complete an I/O request from the cache, it transfers the data from the DDMs, suspending the I/O request until it has read the data. This situation is called a cache miss. On a cache miss, the response time includes not only the data transfer time between host and cache, but also the time that it takes to read the data from the DDMs into cache before sending it to the host. An application can be cache-unfriendly by nature. An example is a large amount of sequential data written to a highly fragmented file system in an open systems environment: if an application reads this file, the cache hit ratio will be very low, because the application never reads the same data, due to the nature of sequential access. In this case, defragmentation of the file system will improve the performance. You cannot determine whether increasing the size of the cache improves I/O performance without knowing the characteristics of the data. We recommend that you monitor the read hit ratio over an extended period of time:
- If the cache hit ratio has been low historically, it is most likely due to the nature of your data, and you do not have much control over this.
- If you have a high cache hit ratio initially and it decreases as you load more data with the same characteristics, then adding cache, or moving some data to another cluster that uses the other cluster's cache, can improve the situation.
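The cache-miss cost described above can be sketched directly: a miss adds the DDM read time to the host transfer time. The timings below are assumed example values, not measurements from any subsystem:

```python
# Sketch of the cache-miss effect: a miss adds the DDM read time to
# the host transfer time. Timings are assumed examples.

def read_response(hit, transfer_ms=0.5, ddm_read_ms=8.0):
    """Host-observed read response (msec) for a cache hit vs a cache miss."""
    return transfer_ms if hit else transfer_ms + ddm_read_ms

print(read_response(hit=True))   # 0.5 msec: data served from cache
print(read_response(hit=False))  # 8.5 msec: miss, data read from DDMs first
```

The gap between the two cases is exactly why the read hit ratio dominates the average response time seen by the host.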
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
SAN Storage Performance Management Using Tivoli Storage Productivity Center
Click Generate Report on the Selection panel to regenerate the report, shown in Figure 5-40. The top five volumes with the highest total data rate at the last collection time are listed on the report.
Recommendations
The throughput for storage volumes can range from fairly small numbers (1 to 10 IOPS) to very large values (more than 1000 IOPS). This depends a lot on the nature of the application. When the I/O rates (throughput) approach 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, General performance management methodology on page 53.
Recommendations
Typical response time ranges are only slightly more predictable. In the absence of additional information, we often assume that 10 milliseconds is pretty high. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with 20 msec response time. The appropriate value might also change between shifts or on the weekend. A response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent. The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days can range from 4 to 8 msec, though in practice, many workloads do better than nominal. It is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time for the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time. So 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.
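The queueing assumption in this paragraph — wait time roughly equal to service time at 50% utilization — follows from the simple M/M/1 approximation R = S / (1 − ρ). A sketch using the section's own example numbers (5-7 msec service time):

```python
# M/M/1 approximation of disk response time: R = S / (1 - utilization).
# At 50% utilization the wait equals the service time, matching the
# 10-14 msec figure quoted in the text for a 5-7 msec service time.

def disk_response(service_ms, utilization):
    """Approximate disk response time (msec) under M/M/1 assumptions."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms / (1.0 - utilization)

for s in (5.0, 7.0):
    print(f"service {s} msec at 50% busy -> {disk_response(s, 0.5):.1f} msec")
```

The same formula shows why pushing an array toward 70% utilization is costly: at ρ = 0.7 the response time is more than triple the bare service time.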
For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times that are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time. To make a long story short, (front-end) response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec. Important: All the above considerations are not valid for SSD disks, where seek time and latency are not applicable. Expect much better performance from these disks, and therefore very short response times (less than 4 ms), for workloads that can benefit from SSD disks. Today those are small block I/O workloads with random read patterns. See 3.2.4, Performance metric guidelines on page 62 for further details on SSD performance. There are applications (typically batch applications) for which response time is not the appropriate performance metric. In these cases, it is often the throughput in megabytes per second that is most important, and maximizing this metric will drive response times much higher than 30 msec. For further details, refer to Chapter 3, General performance management methodology on page 53.
See 5.7, Case study: Top volumes response time and I/O rate performance report on page 280 to create a tailored report for your environment.
For more details regarding these Rules of Thumb and how to interpret these values, see Appendix A, Rules of Thumb and suggested thresholds on page 327.
This report is valuable because you can quickly determine whether you have a potential imbalance in your data rates for a storage subsystem, SVC I/O Group, or specific volumes of your mission critical applications. The first report presented is a general review of all ports, for all subsystems known by Tivoli Storage Productivity Center, sorted by subsystem name. This chart is valuable because you can see everything available, but in regard to SLA, problem determination, or change management, it is a starting or root report from which you can build custom reports. For SLA reporting, you have a set of questions that require reports to answer. One such question might be: Is your storage environment supporting the data rate required by your mission critical application? With this report you can use the selection button to specify the storage subsystem and FC ports that are being used by this specific application server. Then, by reviewing the Total I/O and Total Data Rate set of reports, you can present either a tabular view, a graphic view, or a combination of the two to answer this question. In our environment, the Storwize V7000 showed the highest Total Port I/O Rate during the time frame used in this report. See 5.3, Top 10 reports for SVC and Storwize V7000 for some example reports.
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk to view system reports that are relevant to SVC and Storwize V7000. I/O Group Performance and Managed Disk Group Performance are reports specific to SVC and Storwize V7000, while Module/Node Cache Performance is also available for IBM XIV. In Figure 5-48 those reports are highlighted.
Figure 5-49 shows a sample structure to review basic SVC / Storwize V7000 concepts about product structure and then proceed with performance analysis at the different component levels.
5.3.1 Top 10 for SVC and Storwize V7000 #1: I/O Group Performance reports
Tip: For SVCs with multiple I/O groups, a separate row is generated for every I/O group within each SVC. In our lab environment, data was collected for an SVC that had a single I/O group. The scroll bar at the bottom of the table indicates that additional metrics can be viewed, as shown in Figure 5-50.
Important: The data displayed in a performance report is the last collected value at the time the report is generated. It is not an average of the last hours or days; it simply shows the last data collected. Click the icon next to the SVC io_grp0 entry to drill down and view the statistics by node within the selected I/O group. Notice that a new tab, Drill down from io_grp0, is created containing the report for the nodes within the SVC. See Figure 5-51.
To view a historical chart of one or more specific metrics for the resources, click the icon. A list of metrics is displayed, as shown in Figure 5-52. You can select one or more metrics that use the same measurement unit. If you select metrics that use different measurement units, you will receive an error message.
Restriction: To visualize multiple metrics with different measurement units (that is, MB/s, IO/s, percentages) in a single graphic, you need to generate the chart with an external tool such as Microsoft Excel. To get the data into Excel, use the export facility of Tivoli Storage Productivity Center, available within the GUI or through the CLI using TPCTOOL. See Appendix C, Reporting with Tivoli Storage Productivity Center on page 365, CLI: TPCTOOL as a reporting tool.
You can change the reporting time range and click the Generate Chart button to regenerate the graph, as shown in Figure 5-53. A continually high Node CPU Utilization rate indicates a busy I/O group; in our environment CPU utilization does not rise above 24%, which is a more than acceptable value.
Recommendations
If the CPU utilization for an SVC or Storwize V7000 version 6.2 node remains constantly above 70%, it might be time to increase the number of I/O Groups in the cluster. You can also redistribute workload to other I/O Groups in the cluster if available, or move volumes from one I/O Group to another if one is available. Remember that through SVC or Storwize V7000 version 6.2, moving a volume from one I/O Group to another still requires a volume outage. You can add I/O Groups to the cluster (up to the maximum of four I/O Groups per SVC cluster, or two I/O Groups per Storwize V7000 version 6.2). If there are already four I/O Groups in a cluster (with the latest firmware installed) and you are still seeing high SVC or Storwize V7000 node CPU utilization in the reports, it is time to build a new cluster and consider either migrating some storage to the new cluster or, if the existing SVC nodes are not 2145-CG8 nodes, upgrading them to CG8 nodes.
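The decision flow above can be summarized in a small sketch. The 70% trigger and the per-platform I/O Group maximums come from the text; the function, its wording, and the sample calls are our own illustration, and real decisions should of course weigh the factors discussed in this section:

```python
# Illustrative decision sketch for the node CPU guidance in the text.
# Limits come from the section (70% trigger; 4 I/O Groups per SVC
# cluster, 2 per Storwize V7000 6.2); the helper itself is ours.

MAX_IO_GROUPS = {"SVC": 4, "Storwize V7000": 2}

def cpu_advice(platform, avg_cpu_pct, io_groups):
    """Suggest a next step for sustained node CPU utilization."""
    if avg_cpu_pct <= 70.0:
        return "no action"
    if io_groups < MAX_IO_GROUPS[platform]:
        return "add an I/O group or redistribute volumes"
    return "build a new cluster or upgrade nodes"

print(cpu_advice("SVC", 24.0, 1))   # no action
print(cpu_advice("SVC", 85.0, 4))   # build a new cluster or upgrade nodes
```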
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Notice that the I/Os are only present on Node 2. So, in Figure 5-56, you can see a configuration problem, where workload is not well balanced, at least during this time frame (this is the reason for the red traffic light shown in that figure).
Recommendations
To interpret your performance results, the first recommendation is to always go back to your baseline (see 3.3, Creating a baseline with Tivoli Storage Productivity Center on page 68). Moreover, some industry benchmarks for the SVC and Storwize V7000 are available. SVC 4.2 and the 8G4 node brought a dramatic increase in performance, as demonstrated by the results in the Storage Performance Council (SPC) benchmarks, SPC-1 and SPC-2. The benchmark number, 272,505.19 SPC-1 IOPS, is the industry-leading OLTP result, and the PDF is available at the following website: http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf An SPC Benchmark 2 was also performed for Storwize V7000; the Executive Summary PDF is available at the following website: http://www.storageperformance.org/benchmark_results_files/SPC-2/IBM_SPC-2/B00052_IBM_Storwize-V7000/b00052_IBM_Storwize-V7000_SPC2_executive-summary.pdf Figure 5-55 shows numbers for maximum I/Os and MB/s per I/O group. The SVC performance you obtain will be based upon multiple factors such as these:
- The specific SVC nodes in your configuration
- The type of Managed Disks (volumes) in the Managed Disk Group (MDG)
- The application I/O workloads using the MDG
- The paths to the back-end storage
These are all factors that ultimately lead to the final performance realized. In reviewing the SPC benchmark (see Figure 5-55), the I/O and data rate results obtained differ considerably depending on the transfer block size used. Looking at the two-node I/O group used, you might see 122,000 I/Os if all of the transfer blocks were 4K. In typical environments, they rarely are. So if you jump down to 64K or bigger, with anything over about 32K you might realize a result more typical of the 29,000 seen in the SPC benchmark.
Max I/Os and MB/s per I/O group (70/30 R/W miss):

Node model   4K transfer size      64K transfer size
2145-8G4     122K I/Os, 500 MB/s   29K I/Os, 1.8 GB/s
2145-8F4     72K I/Os, 300 MB/s    23K I/Os, 1.4 GB/s
2145-4F2     38K I/Os, 156 MB/s    11K I/Os, 700 MB/s
2145-8F2     72K I/Os, 300 MB/s    15K I/Os, 1 GB/s
Figure 5-55 SPC SVC benchmark Max I/Os and MB/s per I/O group
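The transfer-size effect in Figure 5-55 is simply that data rate ≈ I/O rate × transfer size. A quick check of the 2145-8G4 rows (the arithmetic helper is our own sketch; only the row values come from the figure):

```python
# Verify the Figure 5-55 relationship: data rate ~= IOPS * transfer size.
# Row values are from the figure; the helper is an illustrative sketch.

def data_rate_mb_s(iops, transfer_kb):
    """Approximate data rate in MB/s for a given I/O rate and block size."""
    return iops * transfer_kb / 1024.0

print(data_rate_mb_s(122_000, 4))   # ~477 MB/s, close to the quoted 500 MB/s
print(data_rate_mb_s(29_000, 64))   # ~1813 MB/s, close to the quoted 1.8 GB/s
```

This is why larger transfer sizes yield far fewer I/Os per second but much higher throughput from the same I/O group.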
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53. As mentioned before, in the I/O rate graph shown in Figure 5-56, you can see a configuration problem indicated by the red traffic light in the lower right corner.
Response time
To view the read and write response time at Node level, click the Drill down from io_grp0 tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Backend Read Response Time and Backend Write Response Time metrics, as shown in Figure 5-57.
Click Ok to generate the report, as shown in Figure 5-58. We see acceptable back-end response time values for both read and write operations, and these are consistent across both of our I/O Groups.
Recommendations
For random read I/O, the back-end rank (disk) read response times must seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-end write response times will be higher because of RAID 5 (or RAID 10) algorithms, but must seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines. In case of poor response time, you have to investigate using all available information from the SVC and the back-end storage controller. Possible causes for a large change in response times from the back-end storage, which might be visible using the storage controller management tool, include these:
- Physical array drive failure leading to an array rebuild. This drives additional back-end storage subsystem internal read/write workload while the rebuild is in progress. If this is causing poor latency, it might be desirable to adjust the array rebuild priority to lessen the load. However, this must be balanced with the increased risk of a second drive failure during the rebuild, which will cause data loss in a RAID 5 array.
- Cache battery failure leading to cache being disabled by the controller. This can usually be resolved simply by replacing the failed battery.
Chapter 5. Using Tivoli Storage Productivity Center for performance management reports
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
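These rules of thumb can also be checked programmatically against performance data exported from Tivoli Storage Productivity Center. The following Python sketch is illustrative only: the record layout and field names (backend_read_ms, backend_write_ms) are hypothetical, not an actual TPC export format.

```python
# Hedged sketch: flag intervals whose back-end response times exceed the
# rule-of-thumb limits (25 ms random read, 80 ms write). The sample
# records and field names are hypothetical.

READ_LIMIT_MS = 25.0   # back-end read response time rule of thumb
WRITE_LIMIT_MS = 80.0  # back-end write response time rule of thumb

def flag_intervals(samples, read_hit_ratio=0.0):
    """Return the samples that violate the rules of thumb.

    The read limit is waived when the read hit ratio is near 99%,
    as noted in the text.
    """
    violations = []
    for s in samples:
        bad_read = s["backend_read_ms"] > READ_LIMIT_MS and read_hit_ratio < 0.99
        bad_write = s["backend_write_ms"] > WRITE_LIMIT_MS
        if bad_read or bad_write:
            violations.append(s)
    return violations

samples = [
    {"time": "10:00", "backend_read_ms": 12.0, "backend_write_ms": 40.0},
    {"time": "10:05", "backend_read_ms": 31.5, "backend_write_ms": 95.0},
]
print([v["time"] for v in flag_intervals(samples)])  # → ['10:05']
```

A check like this can run against each new performance data collection and feed an alerting script, complementing the built-in alerts discussed in Chapter 3.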
Data Rate
To look at the data rates, click the "Drill down from io_grp0" tab to return to the performance statistics for the nodes within the SVC. Click the icon and select the Read Data Rate metric. Hold down the Shift key and also select Write Data Rate and Total Data Rate. Then click Ok to generate the chart, shown in Figure 5-59.
To interpret your performance results, the first recommendation is always to go back to your baseline (see 3.3, "Creating a baseline with Tivoli Storage Productivity Center" on page 68). In addition, public benchmark results are available for comparison. The SVC throughput result of 7,084.44 SPC-2 MBPS was the industry-leading throughput benchmark, and the executive summary PDF is available here:
http://www.storageperformance.org/results/b00024_IBM-SVC4.2_SPC2_executive-summary.pdf
5.3.2 Top 10 for SVC and Storwize V7000 #2: Node Cache Performance reports
Efficient use of cache can help enhance virtual disk I/O response time. The Node Cache Performance report displays cache-related metrics such as Read and Write Cache Hits percentage and Readahead percentage of cache hits. The cache memory resource reports provide an understanding of the utilization of the SVC or Storwize V7000 cache, and an indication of whether the cache is able to service and buffer the current workload. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Select the Module/Node Cache performance report. Notice that this report is generated at the SVC and Storwize V7000 node level (there is also an entry that refers to the IBM XIV Storage System; see 5.2.7, "IBM XIV Module Cache Performance Report" on page 228), as shown in Figure 5-60.
Figure 5-60 SVC and Storwize V7000 Node cache performance report
Important: The flat line for node1 does not mean that read requests for that node cannot be handled by the cache; it means that there is no traffic at all on that node, as illustrated in Figure 5-62 and Figure 5-63, where Read Cache Hit Percentage and Read I/O Rates are compared over the same time interval.
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios depend more on the application design and the amount of data than on the size of the cache (especially for Open Systems workloads), but larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
If you need to analyze cache performance further and determine whether the cache is sufficient for your workload, you can run multiple-metric charts. Select only metrics measured as percentages, because a single chart can only combine metrics with the same unit type. In the Selection panel, move the percentage metrics you want to include from Available Columns to Included Columns; then, using the Selection... button, check only the Storwize V7000 entries. Figure 5-66 on page 245 shows an example where several percentage metrics are chosen for Storwize V7000. The complete list of metrics is as follows:
CPU Utilization percentage: The average utilization of the node controllers in this I/O group during the sample interval.
Dirty Write percentage of Cache Hits: The percentage of write cache hits that modified only data already marked dirty in the cache; that is, re-written data. This is an obscure measurement of how effectively writes are coalesced before destaging.
Read/Write/Total Cache Hits percentage (overall): The percentage of reads/writes/total I/Os during the sample interval that were found in cache. This is an important metric; the write cache hit percentage must be very nearly 100%.
Readahead percentage of Cache Hits: An obscure measurement of cache hits involving data that has been prestaged for one reason or another.
Write Cache Flush-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Flush-through write mode during the sample interval.
Write Cache Overflow percentage: For SVC and Storwize V7000, the percentage of write operations that were delayed due to lack of write-cache space during the sample interval.
Write Cache Write-through percentage: For SVC and Storwize V7000, the percentage of write operations that were processed in Write-through write mode during the sample interval.
Write Cache Delay percentage: The percentage of all I/O operations that were delayed due to write-cache space constraints or other conditions during the sample interval. Only writes can be delayed, but the percentage is of all I/O.
Small Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are <= 8 KB.
Small Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are <= 8 KB.
Medium Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 8 KB and <= 64 KB.
Medium Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 8 KB and <= 64 KB.
Large Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 64 KB and <= 512 KB.
Large Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 64 KB and <= 512 KB.
Very Large Transfers I/O percentage: Percentage of I/O operations over a specified interval, for data transfer sizes that are > 512 KB.
Very Large Transfers Data percentage: Percentage of data transferred over a specified interval, for I/O operations with data transfer sizes that are > 512 KB.
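The transfer-size buckets above partition I/O operations by data transfer size. As a quick illustration of the boundaries (the boundaries come from the metric definitions; the function name is our own):

```python
def transfer_size_bucket(kb):
    """Classify a transfer by size per the metric definitions above:
    small <= 8 KB, medium <= 64 KB, large <= 512 KB, very large > 512 KB."""
    if kb <= 8:
        return "small"
    if kb <= 64:
        return "medium"
    if kb <= 512:
        return "large"
    return "very large"

print([transfer_size_bucket(k) for k in (4, 32, 256, 1024)])
# → ['small', 'medium', 'large', 'very large']
```

Note that the boundaries are inclusive on the upper side, so an exactly 8 KB transfer counts as small and an exactly 64 KB transfer counts as medium.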
Overall Host Attributed Response Time Percentage: The percentage of the average response time, both read response time and write response time, that can be attributed to delays from host systems. This metric is provided to help diagnose slow hosts and poorly performing fabrics. The value is based on the time taken for hosts to respond to transfer-ready notifications from the SVC nodes (for read) and the time taken for hosts to send the write data after the node has responded to a transfer-ready notification (for write).
The following metric is only applicable in a Global Mirror session:
Global Mirror Overlapping Write Percentage: Average percentage of write operations issued by the Global Mirror primary site that were serialized overlapping writes for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes were serialized.
After selecting Storwize V7000 node1 and node2, select all the metrics in the Select charting option pop-up window and click Ok to generate the chart. In our test, as shown in Figure 5-67, we notice a drop in the Cache Hits percentage. Even though the drop is not dramatic, it can be taken as an example of a symptom that warrants further investigation. Changes in these performance metrics, together with an increase in back-end response time (see Figure 5-68), show that the storage controller is heavily burdened with I/O, and the Storwize V7000 cache can become full of outstanding write I/Os. Host I/O activity will be impacted by the backlog of data in the Storwize V7000 cache, as will any other Storwize V7000 workload going to the same MDisks.
I/O Groups: If cache utilization is a problem, in SVC and Storwize V7000 version 6.2 you can add cache to the cluster by adding an I/O Group and moving volumes to the new I/O Group. However, adding an I/O Group and moving a volume from one I/O Group to another are still disruptive actions, so proper planning to manage this disruption is required.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.3 Top 10 for SVC #3: Managed Disk Group performance reports
The Managed Disk Group performance report provides disk performance information at the managed disk group level. It summarizes read and write transfer sizes and back-end read, write, and total I/O rates. From this report you can easily drill up to see the statistics of the virtual disks supported by a managed disk group, or drill down to view the data for the individual MDisks that make up the managed disk group. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk and select Managed Disk Group performance. A table is displayed listing all the known Managed Disk Groups and their last collected statistics, based on the latest performance data collection. See Figure 5-69.
One of the Managed Disk Groups is CET_DS8K1901mdg. Click the drill down icon on the entry CET_DS8K1901mdg to drill down. A new tab is created, containing the Managed Disks in the Managed Disk Group. See Figure 5-70.
Figure 5-70 Drill down from Managed Disk Group Performance report
Click the drill down icon on the entry mdisk61 to drill down. A new tab is created, containing the Volumes in the Managed Disk. See Figure 5-71.
I/O rate
We recommend that you analyze how the I/O workload is split between Managed Disk Groups, to determine whether it is well balanced. Click the Managed Disk Groups tab, select all Managed Disk Groups, click the icon, and select Total Backend I/O Rate, as shown in Figure 5-72.
Figure 5-72 Top 10 SVC - Managed Disk Group I/O rate selection
Click Ok to generate the next chart, as shown in Figure 5-73. When reviewing this general chart, understand that it reflects all I/O to the back-end storage from the Managed Disks included within this Managed Disk Group. The key for this report is a general understanding of back-end I/O rate usage, not whether there is outright balance.
Although the SVC and Storwize V7000 by default stripe write and read I/Os across all Managed Disks, the striping is not a RAID 0 type of stripe. Rather, because the Virtual Disk is a concatenated volume, the striping injected by the SVC and Storwize V7000 lies only in how extents are identified for use when the Virtual Disk is created. Until host write actions fill up the first extent, the remaining extents in the block Virtual Disk provided by the SVC will not be used. It is therefore very likely, when you look at the Managed Disk Group Back-End I/O report, that you will not see balanced write activity even within a single Managed Disk Group. In the report shown in Figure 5-73, for the time frame specified, we see at one point a maximum of nearly 8200 IOPS.
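To illustrate the allocation behavior described above, here is a toy sketch of round-robin extent assignment at volume creation time. This is our own simplification for illustration, not the actual SVC allocation algorithm.

```python
def allocate_extents(mdisks, num_extents):
    """Assign volume extents round-robin across the MDisks in the group,
    a simplified sketch of the creation-time striping described above."""
    return [mdisks[i % len(mdisks)] for i in range(num_extents)]

# The extent map is striped, but a host filling the volume sequentially
# still touches extent 0 first, then extent 1, and so on, which is why
# write activity does not look balanced the way a RAID 0 stripe would.
layout = allocate_extents(["mdisk0", "mdisk1", "mdisk2"], 6)
print(layout)  # → ['mdisk0', 'mdisk1', 'mdisk2', 'mdisk0', 'mdisk1', 'mdisk2']
```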
Figure 5-73 Top 10 SVC - Managed Disk Group I/O rate report
For further details about Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
Response time
Now return to the list of Managed Disks by moving to the "Drill down from CET_DS8K1901mdg" tab (see Figure 5-70 on page 247). Select all the Managed Disk entries, click the icon, and select the Backend Read Response Time metric, as shown in Figure 5-74.
Recommendations
For random read I/O, the back-end rank (disk) read response time should seldom exceed 25 msec, unless the read hit ratio is near 99%. Back-end write response times will be higher because of RAID 5, RAID 6, or RAID 10 algorithms, but should seldom exceed 80 msec. There will be some time intervals when response times exceed these guidelines.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Select all the Managed Disks from the "Drill down from CET_DS8K1901mdg" tab, click the icon, and select the Backend Data Rates, as shown in Figure 5-76.
Click Ok to generate the report shown in Figure 5-77. Here the workload is not balanced across the Managed Disks.
For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.4 Top 10 for SVC and Storwize V7000 #5-9: Top Volume Performance reports
Tivoli Storage Productivity Center provides five reports on Top Volume performance:
Top Volume Cache performance: Prioritized by the Total Cache Hits percentage (overall) metric.
Top Volume Data Rate performance: Prioritized by the Total Data Rate metric.
Top Volume Disk performance: Prioritized by the Disk to Cache Transfer Rate metric.
Top Volume I/O Rate performance: Prioritized by the Total I/O Rate (overall) metric.
Top Volume Response performance: Prioritized by the Overall Response Time metric.
Volumes referred to in these reports correspond to the Virtual Disks in SVC.
Important: The last collected performance data on volumes is used for these reports. Each report creates a ranked list of volumes based on the metric used to prioritize the performance data.
You can customize these reports according to the needs of your environment. To limit these system reports to just SVC subsystems, you have to specify a filter, as shown in Figure 5-78. Click the Selection tab, then click Filter. Click Add to specify another condition to be met. This has to be done for all five reports.
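Conceptually, each Top Volume report is just a ranked list on one metric. The following Python sketch shows the idea; the record layout and metric key names are hypothetical, not TPC's internal representation.

```python
def top_volumes(records, metric, n=10):
    """Rank volume performance records by one metric, highest first,
    as the Top Volume reports do."""
    return sorted(records, key=lambda r: r[metric], reverse=True)[:n]

# Hypothetical last-collected samples for three volumes.
records = [
    {"volume": "vdisk1", "total_io_rate": 950.0},
    {"volume": "vdisk2", "total_io_rate": 1200.0},
    {"volume": "vdisk3", "total_io_rate": 300.0},
]
top = top_volumes(records, "total_io_rate", n=2)
print([r["volume"] for r in top])  # → ['vdisk2', 'vdisk1']
```

Changing the metric argument (cache hit percentage, data rate, response time, and so on) reproduces the different report flavors from the same data.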
Recommendations
Read Hit percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Click Generate Report on the Selection panel to regenerate the report, shown next in Figure 5-81. If this report is generated during the workload run period, the volumes with the highest total data rate are listed on the report.
Recommendations
The throughput of storage volumes can range from fairly small numbers (1 to 10 I/Os per second) to very large values (more than 1000 I/Os per second), depending greatly on the nature of the application. When the I/O rate (throughput) approaches 1000 IOPS per volume, it is because the volume is getting very good performance, usually from very good cache behavior; otherwise, it is not possible to drive so many IOPS to a volume.
For further details about the SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Recommendations
Typical response time ranges are only slightly more predictable. In the absence of additional information, we often assume (and our performance models assume) that 10 milliseconds is pretty high. But for a particular application, 10 msec might be too low or too high. Many OLTP (On-Line Transaction Processing) environments require response times closer to 5 msec, while batch applications with large sequential transfers might be fine with 20 msec response time. The appropriate value might also change between shifts or on the weekend. A response time of 5 msec might be required from 8 until 5, while 50 msec is perfectly acceptable near midnight. It is all customer and application dependent.
The value of 10 msec is somewhat arbitrary, but related to the nominal service time of current generation disk products. In crude terms, the service time of a disk is composed of a seek, a latency, and a data transfer. Nominal seek times these days can range from 4 to 8 msec, though in practice, many workloads do better than nominal; it is not uncommon for applications to experience from 1/3 to 1/2 the nominal seek time. Latency is assumed to be 1/2 the rotation time of the disk, and transfer time for typical applications is less than a msec. So it is not unreasonable to expect 5-7 msec service time for a simple disk access. Under ordinary queueing assumptions, a disk operating at 50% utilization will have a wait time roughly equal to the service time. So 10-14 msec response time for a disk is not unusual, and represents a reasonable goal for many applications.
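The arithmetic above can be written out as a worked example. We use the simple M/M/1 queueing formula, wait = service × U / (1 − U), which reduces to "wait roughly equals service time at 50% utilization" as stated in the text; the drive figures are illustrative assumptions.

```python
def disk_response_time_ms(seek_ms, rpm, transfer_ms, utilization):
    """Estimate disk response time: service = seek + latency + transfer,
    with latency taken as half a rotation, plus M/M/1 queueing wait."""
    latency_ms = 0.5 * 60000.0 / rpm          # half a rotation, in ms
    service_ms = seek_ms + latency_ms + transfer_ms
    wait_ms = service_ms * utilization / (1.0 - utilization)
    return service_ms + wait_ms

# Assumed 15K RPM drive: 4 ms effective seek, ~2 ms latency, 0.5 ms transfer.
# At 50% utilization the wait equals the service time, doubling it,
# which lands in the 10-14 msec range quoted in the text.
rt = disk_response_time_ms(seek_ms=4.0, rpm=15000, transfer_ms=0.5, utilization=0.5)
print(round(rt, 1))  # → 13.0
```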
For cached storage subsystems, we certainly expect to do as well as or better than uncached disks, though that might be harder than you think. If there are a lot of cache hits, the subsystem response time might be well below 5 msec, but poor read hit ratios and busy disk arrays behind the cache will drive the average response time up. A high cache hit ratio allows us to run the back-end storage ranks at higher utilizations than we might otherwise be satisfied with. Rather than 50% utilization of disks, we might push the disks in the ranks to 70% utilization, which produces high rank response times; these are averaged with the cache hits to produce acceptable average response times. Conversely, poor cache hit ratios require quite good response times from the back-end disk ranks in order to produce an acceptable overall average response time.
To simplify, we can assume that (front-end) response times probably need to be in the 5-15 msec range. The rank (back-end) response times can usually operate in the 20-25 msec range unless the hit ratio is really poor. Back-end write response times can be even higher, generally up to 80 msec.
Important: These considerations are not valid for SSDs, where seek time and rotational latency are not applicable. Expect much better performance from these disks, and therefore very short response times (less than 4 ms), for workloads that can benefit from SSDs; today these are small-block I/O workloads with random read patterns. See 3.2.4, "Performance metric guidelines" on page 62 for further details on SSD performance.
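The averaging effect described above is easy to quantify: the front-end response time is the hit ratio times the cache-hit time plus the miss ratio times the back-end time. A small sketch with assumed (not measured) timings:

```python
def avg_response_ms(hit_ratio, hit_ms, miss_ms):
    """Blend cache-hit and cache-miss (back-end) response times by hit ratio."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms

# A high hit ratio tolerates a slow back end: 90% hits at 1 ms with
# 25 ms misses still averages only ~3.4 ms at the front end.
print(round(avg_response_ms(0.90, 1.0, 25.0), 2))  # → 3.4
# A poor hit ratio needs a fast back end to stay in the 5-15 ms range:
print(round(avg_response_ms(0.30, 1.0, 15.0), 2))  # → 10.8
```

This is why the text allows back-end ranks in the 20-25 msec range only when the hit ratio is healthy.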
See 5.7, "Case study: Top volumes response time and I/O rate performance report" on page 280 to create a tailored report for your environment. For further details about Rules of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
5.3.5 Top 10 for SVC and Storwize V7000 #10: Port Performance reports
The SVC and Storwize V7000 port performance reports help you understand the SVC and Storwize V7000 impact on the fabric, and give you an indication of the traffic between the following systems:
SVC (or Storwize V7000) and hosts that receive storage
SVC (or Storwize V7000) and back-end storage
Nodes in the SVC (or Storwize V7000) cluster
These reports can help you understand whether the fabric might be a performance bottleneck and whether upgrading the fabric can lead to performance improvement. The Port Performance report summarizes the various send, receive, and total port I/O rates and data rates. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk and click Port Performance. In order to display only SVC and Storwize V7000 ports, click Filter to produce a report for all the ports belonging to SVC or Storwize V7000 subsystems, as shown in Figure 5-85.
A separate row is generated for each subsystem's ports. The information displayed in each row reflects the data last collected for the port. Notice that the Time column displays the last collection time, which might differ between subsystem ports. Not all the metrics in the Port Performance report are applicable to all ports. For example, the Port Send Utilization percentage, Port Receive Utilization percentage, and Overall Port Utilization percentage data are not available on SVC or Storwize V7000 ports. N/A is displayed when data is not available, as shown in Figure 5-86. By clicking Total Port I/O Rate you get a list prioritized by I/O rate.
Figure 5-87 SVC and Storwize V7000 Port I/O rate report
Recommendations
Based on the nominal speed of each FC port, which can be 4 Gbit, 8 Gbit, or more, we recommend not exceeding 50-60% of that value as the data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you need to generate an alert when it exceeds 400 MB/sec. See 3.4.4, "Defining the alerts" on page 80 for information about how to set up alerts.
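The rule of thumb above is simple arithmetic: an N Gbit FC port moves roughly N × 100 MB/sec, and the alert threshold is 50% of that. A small sketch (the function name is our own):

```python
def port_alert_threshold_mb(speed_gbit, fraction=0.5):
    """Alert threshold in MB/sec: an N Gbit FC port moves roughly
    N * 100 MB/sec, and we alert at 50% of that by default."""
    return speed_gbit * 100.0 * fraction

print(port_alert_threshold_mb(8))  # → 400.0
print(port_alert_threshold_mb(4))  # → 200.0
```

The same formula reproduces the recommended thresholds in Table 5-1 on page 266 for every port speed listed there.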
To investigate further using the Port Performance report, go back to the I/O Group Performance report. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click I/O Group Performance and drill down to the node level. In the example in Figure 5-88, we choose Node 1 of the SVC subsystem.
Then click the icon and select Port to Local Node Send Queue Time, Port to Local Node Receive Queue Time, Port to Local Node Receive Response Time and Port to Local Node Send Response Time, as shown in Figure 5-89.
Look at port rates between SVC nodes, hosts, and disk storage controllers. Figure 5-90 shows low queue and response times, indicating that the nodes do not have a problem communicating with each other.
If this report shows high queue and response times, write activity is affected, because each node communicates with each other node over the fabric. Unusually high numbers in this report indicate:
SVC (or Storwize V7000) node or port problem (unlikely)
Fabric switch congestion (more likely)
Faulty fabric ports or cables (most likely)
For further details about this SVC and Storwize V7000 Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
After you have the I/O rate review chart, you also need to generate a data rate chart for the same time frame. This will support a review of your HA ports for this application. Generate another historical chart with the Total Port Data Rate metric, as shown in Figure 5-92, which confirms the unbalanced workload for one port shown in the foregoing report.
Recommendations
According to the nominal speed of each FC port, which can be 4 Gbit, 8 Gbit, or more, we recommend not exceeding 50-60% of that value as the data rate. For example, an 8 Gbit port can reach a maximum theoretical data rate of around 800 MB/sec, so you need to generate an alert when it exceeds 400 MB/sec. See 3.4.4, "Defining the alerts" on page 80 for information about how to set up alerts.
For further details about this SVC Rule of Thumb and how to interpret these values, see Chapter 3, "General performance management methodology" on page 53.
Tip: Rather than using a specific report to monitor Switch Port Errors, we recommend that you use the Constraint Violation report. By setting an Alert for the number of errors at the switch port level, the Constraint Violation report becomes a direct tool to monitor the errors in your fabric. For details, see 3.5.5, Constraint Violations reports on page 113.
Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Fabric and select Top Switch Ports Data Rates performance. Click the icon and select Total Port Data Rate, as shown in Figure 5-94.
Click Ok to generate the chart shown next in Figure 5-95. In this case, the port data rates do not reach a warning level, given that the FC port speed is 8 Gbits/sec.
Recommendations
Use this report to monitor whether any switch ports are overloaded. According to the FC port nominal speed (2 Gbit, 4 Gbit, or more), as shown in Table 5-1, establish the maximum workload a switch port can reach. We recommend not exceeding 50-70% of the nominal speed.
Table 5-1 Switch Port data rates

FC Port speed (Gbits/sec)   FC Port speed (MBytes/sec)   Recommended Port Data Rate threshold
1                           100 MB/sec                   50 MB/sec
2                           200 MB/sec                   100 MB/sec
4                           400 MB/sec                   200 MB/sec
8                           800 MB/sec                   400 MB/sec
10                          1000 MB/sec                  500 MB/sec
Click Generate Report to get the output shown in Figure 5-97. Scrolling to the right of the table, more information is available, such as the volume names, volume capacity, and allocated and unallocated volume space.
Data on the report can be exported by selecting File → Export Data, to a comma-delimited file, a comma-delimited file with headers, a formatted report file, or an HTML file. You can start from this volume list to analyze performance data and workload I/O rate. Tivoli Storage Productivity Center provides a report that shows volume to back-end volume assignments. To display the report, expand Disk Manager → Reporting → Storage Subsystem → Volume to Backend Volume Assignment → By Volume. Click Filter to limit the list of volumes to the ones belonging to server tpcblade3-7, as shown in Figure 5-98.
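After exporting a report as a comma-delimited file with headers, you can post-process it outside the GUI. Here is a hedged sketch using only the Python standard library; the column names and sample rows are hypothetical, so match them to your actual export.

```python
import csv
import io

# Hypothetical excerpt of an exported report; real column names will differ.
exported = """Volume,Capacity GB,Total I/O Rate
tpcblade3-7-ko2,100,950.0
tpcblade3-7-ko3,100,1200.0
vdisk9,50,12.5
"""

def busy_volumes(csv_text, min_io_rate):
    """Return volume names whose Total I/O Rate exceeds a threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["Volume"] for row in reader
            if float(row["Total I/O Rate"]) > min_io_rate]

print(busy_volumes(exported, 900))  # → ['tpcblade3-7-ko2', 'tpcblade3-7-ko3']
```

In practice you would read the exported file with open() instead of an inline string; the filtering logic stays the same.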
Scroll to the right to see the SVC Managed Disks and back-end volumes on the DS8000, as shown in Figure 5-100.
Back-end storage: The highlighted lines with N/A values relate to a back-end storage subsystem that is not defined in our Tivoli Storage Productivity Center environment. To obtain information about a back-end storage subsystem, it has to be added to the Tivoli Storage Productivity Center environment, together with the corresponding probe job (see the first line in the report in Figure 5-100, where the back-end storage subsystem is part of our Tivoli Storage Productivity Center environment and therefore the volume is correctly shown in all its details).
With this information and the list of volumes mapped to this computer, you can start to run a performance report to understand where the problem for this server might be.
Recommendations
When looking at disk performance problems, you need to check the overall response time as well as the overall I/O rate. If both are high, there might be a problem. If the overall response time is high but the I/O rate is trivial, the impact of the high overall response time might be inconsequential. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Then click Filter to produce a report for all the volumes belonging to Storwize V7000 subsystems, as shown in Figure 5-101.
Click the volume you need to investigate, click the icon and select Total I/O Rate (overall). Then click Ok to produce the graph, as shown in Figure 5-102.
The chart in Figure 5-103 shows that the I/O rate had been around 900 operations per second, suddenly declined to around 400 operations per second, and then returned to 900 operations per second. In this case study, we limited the days to the time frame in which the customer reported noticing the problem.
Select the Volumes tab again, click the volume you need to investigate, click the icon, and scroll down to select Overall Response Time. Then click Ok to produce the chart, as shown in Figure 5-104.
The chart in Figure 5-105 indicates an increase in response time from a few milliseconds to around 30 milliseconds. This information, combined with the high I/O rate, indicates a significant problem, and further investigation is appropriate.
The next step is to look at the performance of MDisks in the MDisk group. To identify to which Managed Disk the Virtual Disk tpcblade3-7-ko2 belongs, go back to Volumes tab and click the drill up icon, as shown in Figure 5-106.
Figure 5-107 shows the Managed Disks where tpcblade3-7-ko2 extents reside:
Select all the MDisks. Click the icon and select Overall Backend Response Time. Click Ok as shown in Figure 5-108.
Limit the charts generated to the time range relevant to this scenario, using the charting time range. You can see from the chart in Figure 5-109 that something happened around May 26 at 6:00 pm that probably caused the back-end response time for all MDisks to increase dramatically.
If you take a look at the chart of the Total Back-End I/O Rate for these two MDisks during the same time period, you will see that their I/O rates remained in a similar overlapping pattern, even after the introduction of the problem. This is as expected, because tpcblade3-7-ko2 is evenly striped across the two MDisks. The I/O rate for these MDisks is only as high as the slowest MDisk, as shown in Figure 5-110.
At this point, we have identified that the response time for all Managed Disks increased dramatically. The next step is to generate a report showing the volumes that have an overall I/O rate equal to or greater than 1000 ops/sec, and then generate a chart to show which of those volumes' I/O rates changed around 6:00 pm on May 26.
Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume. Click Display historic performance data using absolute time and limit the time period to 1 hour before and 1 hour after the event reported in Figure 5-109. Click Filter to limit the report to the Storwize V7000 subsystem, and Add a second filter to select Total I/O Rate (overall) greater than 1000 (that is, a high I/O rate). Click Ok, as shown in Figure 5-111.
The report in Figure 5-112 shows all the performance records of the volumes filtered above. In the Volume column there are only three volumes that meet these criteria: tpcblade3-7-ko2, tpcblade3-7-ko3, and tpcblade3-7-ko4. There are multiple rows for each, because there is a row for each performance data record. Look for the volumes whose I/O rate changed around 6:00 pm on May 26. You can click the Time column to sort.
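Instead of sorting by the Time column and scanning by eye, the same comparison can be automated: compute each volume's average I/O rate before and after the event time. The record layout below is hypothetical, not an actual TPC export format.

```python
def rate_change(records, event_time):
    """Per volume, average I/O rate before vs. after an event time.
    Times are compared lexically, e.g. '17:30' < '18:00'."""
    buckets = {}
    for r in records:
        side = "after" if r["time"] >= event_time else "before"
        buckets.setdefault(r["volume"], {"before": [], "after": []})[side].append(r["io_rate"])
    return {vol: (sum(b["before"]) / len(b["before"]),
                  sum(b["after"]) / len(b["after"]))
            for vol, b in buckets.items() if b["before"] and b["after"]}

# Hypothetical records around the 6:00 pm event in this case study.
records = [
    {"volume": "tpcblade3-7-ko2", "time": "17:30", "io_rate": 1000.0},
    {"volume": "tpcblade3-7-ko2", "time": "18:30", "io_rate": 450.0},
]
print(rate_change(records, "18:00"))  # → {'tpcblade3-7-ko2': (1000.0, 450.0)}
```

A large before/after gap flags exactly the volumes the next step of the case study compares by hand.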
Now we have to compare the Total I/O Rate (overall) metric for the above volumes with that of the volume that is the subject of the case study, tpcblade3-7-ko2. To do so, remove the filtering condition on the Total I/O Rate defined in Figure 5-111 and generate the report again. Then select one row for each of these volumes and select Total I/O Rate (overall). Then click Ok to generate the chart, as shown in Figure 5-113.
For Limit days From, insert the time frame we are investigating. Results: Figure 5-114 shows the root cause. Volume tpcblade3-7-ko2 (the blue line in the screen capture) started around 5:00 pm with a Total I/O Rate of around 1000 IOPS. When the new workloads (generated by tpcblade3-7-ko3 and tpcblade3-7-ko4) started together, the Total I/O Rate for volume tpcblade3-7-ko2 fell from around 1000 IOPS to less than 500, and then climbed back to about 1000 when one of the two loads decreased. The hardware has physical limitations on the number of IOPS that it can handle, and this limit was reached at 6:00 pm.
To confirm this behavior, you can generate a chart by selecting Response Time. The chart shown in Figure 5-115 confirms that as soon as the new workload started, the response time for tpcblade3-7-ko2 worsened.
The easy solution is to split this workload by moving one Virtual Disk to another Managed Disk Group.
5.7 Case study: Top volumes response time and I/O rate performance report
The default Top Volumes Response Performance Report can be useful for identifying performance problem areas. A long response time is not necessarily indicative of a problem: it is possible to have volumes with long response times but very low (trivial) I/O rates. Volumes that combine long response times with high I/O rates, however, might pose a performance problem to be investigated further. In this section we tailor the Top Volumes Response Performance Report to identify volumes with both long response times and high I/O rates. The report can be tailored for your environment; it is also possible to update your filters to exclude volumes or subsystems you no longer want in this report. Expand Disk Manager → Reporting → Storage Subsystem Performance → By Volume as shown in Figure 5-116 and keep only the desired metrics as Included Columns, moving all the others to Available Columns. You can save this report for future reference under IBM Tivoli Storage Productivity Center → My Reports → (your user)'s Reports.
You have to specify filters to limit the report, as shown in Figure 5-117. Click Filter and then Add to define the conditions. In our example, we limit the report to subsystems SVC* and DS8* and to volumes that have an I/O rate greater than 100 ops/sec and a response time greater than 5 msec.
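The filter logic amounts to a simple AND across the two metrics. As a minimal illustration in Python (hypothetical record values, not the Tivoli Storage Productivity Center API):

```python
# Hypothetical volume performance records; real values come from the
# Tivoli Storage Productivity Center report, not from this structure.
records = [
    {"volume": "vol01", "io_rate": 250.0, "resp_ms": 12.3},  # busy AND slow
    {"volume": "vol02", "io_rate": 40.0, "resp_ms": 22.0},   # slow, but trivial I/O
    {"volume": "vol03", "io_rate": 900.0, "resp_ms": 2.1},   # busy, but fast
]

# Keep only volumes whose I/O rate exceeds 100 ops/sec AND whose
# response time exceeds 5 msec; both conditions must hold.
suspects = [r["volume"] for r in records if r["io_rate"] > 100 and r["resp_ms"] > 5]
```

Only vol01 survives: vol02 is slow but nearly idle, and vol03 is busy but responsive, which is exactly why the report combines both filters.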
Prior to generating the report, you need to specify the date and time of the period for which you want to make the inquiry. Important: Specifying large intervals might require intensive processing and a long time to complete. As shown in Figure 5-118, click Generate Report.
Figure 5-119 shows the resulting Volume list. Sorting by response time or by I/O Rate columns (by clicking the column header), you can easily identify which entries have both interesting total I/O Rates and Overall Response Times.
Recommendations
In a production environment, we suggest that you initially specify a Total I/O Rate (overall) somewhere between 1 and 100 ops/sec and an Overall Response Time (msec) greater than or equal to 15 msec, and then adjust those numbers to suit your needs as you gain more experience.
5.8 Case study: SVC and Storwize V7000 performance constraint alerts
Along with reporting on SVC and Storwize V7000 performance, Tivoli Storage Productivity Center can generate alerts when performance falls below, or exceeds, a defined threshold. Like most Tivoli Storage Productivity Center tasks, the alerting can report through these mechanisms:
SNMP: Enables you to send an SNMP trap to an upstream systems management application, where it can be correlated with other events occurring within the environment to help determine the root cause. For example, if the SVC or Storwize V7000 reported to Tivoli Storage Productivity Center that a fibre port went offline, it might in fact be because a switch has failed. This port-failed trap, together with the switch-offline trap, can be analyzed by a systems management tool and diagnosed as a switch problem, not an SVC (or Storwize V7000) problem, so that the switch technicians are called.
Tivoli Omnibus Event: Select to send a Tivoli Omnibus event.
Login Notification: Select to send the alert to a Tivoli Storage Productivity Center user. The user receives the alert upon logging in to Tivoli Storage Productivity Center. In the Login ID field, type the user ID.
UNIX or Windows NT system event logger.
Script: The script option enables you to run a predefined set of commands that can help address the event, for example, opening a trouble ticket in your help desk ticket system.
Email: Tivoli Storage Productivity Center sends an e-mail to each person listed.
Tip: Remember that for Tivoli Storage Productivity Center to be able to send email, an email relay must be identified in Administrative Services → Configuration → Alert Disposition, under the Email settings.
These are some useful alert events to set:
CPU utilization threshold: The CPU utilization alert notifies you when your SVC or Storwize V7000 nodes become too busy.
If this alert is generated too often, it might be time to upgrade your cluster with additional resources. Development recommends setting this threshold to 75% for warning and 90% for critical; these are the defaults that come with Tivoli Storage Productivity Center 4.2.1. To enable this function, create an alert selecting CPU Utilization, then define the alert actions to be performed. Next, on the Storage Subsystem tab, select the SVC or Storwize V7000 cluster for which to set this alert.
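The warning and critical levels act as a simple two-level classifier. A sketch of that logic (hypothetical helper, using the 4.2.1 default levels named above):

```python
# Hypothetical helper illustrating the two-level CPU utilization alert logic;
# 75% and 90% are the Tivoli Storage Productivity Center 4.2.1 defaults.
def cpu_alert(util_pct, warning=75.0, critical=90.0):
    if util_pct >= critical:
        return "critical"
    if util_pct >= warning:
        return "warning"
    return None  # below both thresholds: no alert raised

print(cpu_alert(82.0))  # warning
```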
Overall port response time threshold: The port response time alert can let you know when the SAN fabric is becoming a bottleneck. If the response times are consistently bad, you must perform additional analysis of your SAN fabric.
Overall back-end response time threshold: An increase in back-end response time might indicate that you are overloading your back-end storage. Because back-end response times can vary depending on the I/O workloads in place, capture 1 to 4 weeks of data to baseline your environment before setting this value, and then set the response time values. Because you can select the storage subsystem for this alert, you can set different alerts based upon the baselines you have captured. Our recommendation is to start with your mission-critical Tier 1 storage subsystems.
To create an alert, as shown in Figure 5-120, expand Disk Manager → Alerting → Storage Subsystem Alerts and right-click to create a Storage Subsystems Alert. On the right you get a pull-down menu where you can choose which alert you want to set.
Tip: The best place to verify which thresholds are currently enabled, and at what values, is at the beginning of a performance collection job. Expand Tivoli Storage Productivity Center → Job Management and, in the Schedule table, select the latest performance collection job that is running or has run for your subsystem. In the Job for Selected Schedule part of the panel (lower part), expand the corresponding job and select the instance, as shown in Figure 5-121.
Figure 5-121 Job management panel - SVC performance job log selection
By clicking the View Log File(s) button, you can access the corresponding log file, where you can see the thresholds defined, as shown in Figure 5-122. Tip: To go to the beginning of the log file, click the Top button.
Expand IBM Tivoli Storage Productivity Center → Alerting → Alert Log → Storage Subsystem to list all the alerts that have occurred. Look for your SVC subsystem, as shown in Figure 5-123.
By clicking the icon next to the alert you want to inquire about, you get detailed information, as shown in Figure 5-124.
In this case study, we compare the overall I/O rates of some IBM XIV volumes. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Subsystem Performance, select the IBM XIV subsystems, click the icon, and select Read I/O Rate (overall) and Write I/O Rate (overall), as shown in Figure 5-125.
Click Ok to generate the graph shown in Figure 5-126. From the chart, we see that this subsystem has a high read I/O rate but a very low write I/O rate, which means that it has a more read-intensive workload.
This type of information can be used, for example, to do performance tuning on the application, operating system, or storage subsystem side, and it can be a starting point for further analysis. Expand IBM Tivoli Storage Productivity Center → Reporting → System Reports → Disk. Click Top Active Volume Cache Hit Performance, then click the Selection tab to specify additional filter options. Click Filter in the upper right corner and add a filter at the Subsystem level, as shown in Figure 5-127.
Then click Generate Report to get the volume list. Click the icon and select Read Cache Hits Percentage (overall) and Write Cache Hits Percentage (overall). Click Ok to generate the chart shown in Figure 5-128. In this case study, we notice that the IBM XIV volume tpcblade3-7_cet_1 makes good use of cache during read activity, while the others have low Read Cache Hits percentages. This can depend on the type of workload or application.
After generating this report, you use the Topology Viewer to identify what device is being impacted and to identify a possible solution. Figure 5-130 shows the result we obtained in our lab.
Figure 5-130 Ports exceeding filters set for switch performance report
Click the icon and, holding down the Ctrl key, select Port Send Data Rate, Port Receive Data Rate, and Total Port Data Rate. Click Ok to generate the chart shown in Figure 5-131. Tip: This chart gives you an indication of how persistent the high utilization of this port is, which is an important consideration in establishing the importance and the impact of this bottleneck. Important: To get all the values in the selected interval, you have to remove the filters defined in Figure 5-129. The chart shows a consistent throughput higher than 300 MB/sec in the selected time period. You can change the dates by extending the Limit days.
To identify what device is connected to port 7 on this switch, expand IBM Tivoli Storage Productivity Center → Topology → Switches. Right-click, select Expand all Groups, and look for your switch, as shown in Figure 5-132.
Tip: To navigate in the Topology Viewer, press and hold the Alt key and press and hold the left mouse button to anchor your cursor. With these keys all held down, you can use the mouse to drag the screen to show what you need.
Find and click port 7. The line shows that it is connected to computer tpcblade3-7, as shown in Figure 5-133. Note that in the tabular view on the bottom, you can see Port details. If you scroll right, you can check Port speed, too.
Double-click this computer to highlight it. Click Datapath Explorer (see the DataPath Explorer shortcut highlighted in the minimap in Figure 5-133) to get a view of the paths between servers and storage subsystems, or between storage subsystems (for example, SVC to back-end storage, or server to storage subsystem). The view consists of three panels (host information, fabric information, and subsystem information) that show the path through a fabric or set of fabrics for the endpoint devices, as shown in Figure 5-134. Tip: A possible scenario utilizing Data Path Explorer is an application on a host that is running slowly. The system administrator wants to determine the health status of all associated I/O path components for this application: Are all components along that path healthy? Are there any component-level performance problems that might be causing the slow application response? Looking at the data paths for computer tpcblade3-7, we see that it has a single-port HBA connection to the SAN. A possible solution to improve the SAN performance for computer tpcblade3-7 is to upgrade it to a dual-port HBA.
5.11 Case study: Using Topology Viewer to verify SVC and Fabric configuration
After Tivoli Storage Productivity Center has probed the SAN environment, it takes the information from all the SAN components (switches, storage controllers, and hosts) and automatically builds a graphical display of the SAN environment. This graphical display is available through the Topology Viewer option in the Tivoli Storage Productivity Center navigation tree. The information in the Topology Viewer panel is current as of the last successful probe. By default, Tivoli Storage Productivity Center probes the environment daily; however, you can execute an unplanned or immediate probe at any time. Tip: If you are analyzing the environment for problem determination, we recommend that you execute an ad hoc probe to ensure that you have the latest information about the SAN environment. Make sure that the probe completes successfully.
Figure 5-135 shows the SVC ports connected and the switch ports.
Important: Figure 5-135 shows an incorrect configuration for the SVC connections; it was implemented for lab purposes only. In real environments it is important that each SVC (or Storwize V7000) node port is connected to two separate fabrics. If any SVC (or Storwize V7000) node port is not connected, each node in the cluster displays an error on its LCD display. Tivoli Storage Productivity Center also shows the health of the cluster as a warning in the Topology Viewer, as shown in Figure 5-135. It is also important that you have at least one port from each node in each fabric, and that you have an equal number of ports in each fabric from each node; that is, do not have three ports in Fabric 1 and only one port in Fabric 2 for an SVC (or Storwize V7000) node.
Ports: In our example, the connected SVC ports are both online. When an SVC port is not healthy, a black line is drawn between the switch and the SVC node. Because Tivoli Storage Productivity Center knew from a previous probe where the unhealthy ports were connected (and, thus, they were previously shown with a green line), a later probe that discovered these ports were no longer connected resulted in the green line becoming a black line. If these ports had never been connected to the switch, no lines would be shown for them.
The Data Path Viewer in Tivoli Storage Productivity Center can also be used to check and confirm path connectivity between a disk that an operating system sees and the volume that the Storwize V7000 provides.
Figure 5-139 shows the path information relating to host tpcblade3-11 and its volumes. What Figure 5-139 cannot show is that you can also hover over each component to get health and performance information, which might be useful when you perform problem determination and analysis.
Chapter 6.
From the foregoing list, you can see that you can view reports by Disk Space, Filesystem Space, Consumed Filesystem Space, and Available Filesystem Space. Each of these categories can then be broken down further. If you click Available Filesystem Space → By Computer and then click the Generate Report button in the right-hand window, you are presented with a window as shown in Figure 6-2.
You can then highlight the computer you want to monitor, or select all. You select all by selecting the top computer and then, while holding the Shift key down, left-clicking the bottom computer. After that is done, when you click the graph symbol, you are presented with another window that allows you to select which graphical report you want to view. This can be seen in Figure 6-3.
After selecting History Chart Used Space for the selected computers, the graph shown in Figure 6-4 is provided.
As you can see from the graph, ours is not a very dynamic environment; this is not typical of a full production configuration. The dashed (---) lines on the right show the expected trend for the near future. At first glance, the lines appear to be constant, but clicking each of the data collection points shows the values in that collection, and our data is changing slightly. You can use this procedure daily to see your trends for usage growth or free space.
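The dashed trend lines are in essence a linear projection of the collected samples. A sketch of that idea (illustrative only; this is not Tivoli Storage Productivity Center's actual forecasting code, and the sample values are hypothetical):

```python
# Fit used-space samples with a least-squares line and project it forward.
def linear_trend(samples):
    """samples: list of (day, used_gb); returns (slope_per_day, intercept)."""
    n = len(samples)
    sx = sum(d for d, _ in samples)
    sy = sum(u for _, u in samples)
    sxx = sum(d * d for d, _ in samples)
    sxy = sum(d * u for d, u in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Hypothetical daily collections: used space growing ~2 GB/day.
history = [(0, 100.0), (1, 102.0), (2, 104.0), (3, 106.0)]
slope, intercept = linear_trend(history)
forecast_day_30 = intercept + slope * 30  # projected used space on day 30
```

A steadily positive slope is the signal to watch: it tells you roughly when free space will run out at the current growth rate.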
SAN Storage Performance Management Using Tivoli Storage Productivity Center
One of the managed disk groups in Storwize V7000 is mdiskgrp1. Click the drill down icon on the row mdiskgrp1. A new tab is created, containing the Managed Disks within this Managed Disk Group. See Figure 6-6.
Figure 6-6 Drill down from managed disk group performance report
From here, click the drill down icon again to get to the Virtual Disks that reside on the Managed Disk. See Figure 6-7.
I/O Rate
We recommend that you start by analyzing how your workload is split between Managed Disk Groups, to understand whether it is well balanced. Click the Managed Disk Groups tab, select all Managed Disk Groups for the Storwize V7000, click the icon, and select Total Backend I/O Rate, as shown in Figure 6-8.
Figure 6-8 Storwize V7000 Managed Disk Group I/O rate selection
Click Ok to generate the chart shown in Figure 6-9. The back-end workload is not equally distributed: mdiskgrp2 is much less used than the other Managed Disk Groups, confirming an unbalanced workload distribution. This does not necessarily mean that a problem occurred, because there can be different back-end storage subsystems with different technologies and sizes, and therefore different workloads for the Managed Disk Groups. If this is a problem, you can look at moving some of the Virtual Disks into other Managed Disk Groups to balance the workload.
Figure 6-9 Storwize V7000 Managed Disk Group I/O rate report
For further details about Rules of Thumb and how to interpret these values, see 3.2.2, Rules of Thumb on page 59.
As you can see here, the workload is balanced across the Managed Disks. This generally happens when the Managed Disks in a Managed Disk Group are of the same size, so they sustain the same data rate. Figure 6-12 confirms this: the Storwize Element Manager shows that both MDisks mdisk0 and mdisk1 in Managed Disk Group cognos are of the same size.
Figure 6-13 represents an example of a poorly balanced data rate, in this case between Managed Disks mdisk61 and mdisk91 in Managed Disk Group CET_DS8K1901 on the SVC subsystem:
Looking at the SVC Element Manager, we can see that the two MDisks are not of the same size, which is most probably the reason for the poorly balanced configuration. See Figure 6-14.
For further details about SVC and Storwize V7000 Rules of Thumb and how to interpret these values, see Chapter 3., General performance management methodology on page 53.
I/O Groups
For SVCs with multiple I/O Groups, a separate row is generated for every I/O Group within each SVC cluster. For capacity planning at the I/O Group level, monitor each node, the CPU utilization of those nodes, and the cache hit rates pertaining to those nodes, to determine whether the current configuration is sufficiently sized for the workload you currently have or are growing into. In our lab environment, data was collected from one SVC, which has only a single I/O Group (and from a Storwize V7000, which cannot have more than one I/O Group). The scroll bar at the bottom of the table indicates that additional metrics can be viewed, as shown in Figure 6-15.
Important: The data displayed in this performance report is the last collected value at the time the report is generated; it is not an average over the last hours or days.
Click the Drill Down button next to SVC io_grp0 entry to drill down and view the statistics by nodes within the selected I/O Group. Notice that a new tab, Drill down from io_grp0, is created containing the report for nodes within the SVC I/O Group (Figure 6-16).
To view a historical chart of one or more specific metrics for the resources, you can click the icon and select the metrics of interest. You can select one or more metrics that use the same measurement unit. If you select metrics that use different measurement units, you will receive an error message. Note: If you want to create graphs including metrics with different measurement units, you have to use TPCTOOL. See Appendix C., Reporting with Tivoli Storage Productivity Center on page 365.
A consistently high CPU utilization rate indicates a busy node in the cluster. If the CPU utilization remains high, it might be time to grow the cluster by adding more resources, or to migrate Virtual Disks to another I/O Group or SVC cluster. You can add cluster resources by adding another I/O Group (two nodes) to the cluster, up to the maximum of four I/O Groups per cluster (SVC only); alternatively, you might replace old nodes with new ones. If the cluster is already composed of four I/O Groups and CPU utilization is still high, it is time to build a new cluster and consider either migrating some storage to the new cluster or servicing new storage requests from it. Tip: We recommend that you plan additional resources for the cluster if your CPU utilization indicates a workload continually above 70%.
Total Cache Hit percentage is the percentage of reads and writes that are handled by the cache without needing immediate access to the back-end disk arrays. Read Cache Hit percentage focuses on Reads, because Writes are almost always recorded as cache hits. The Read and Write Transfer Sizes are the average number of bytes transferred per I/O operation.
To look at the read cache hits percentage by node for Storwize V7000 nodes, select the Storwize V7000 nodes, click the icon, and select Read Cache Hits Percentage (overall). Then click Ok to generate the chart, as shown in Figure 6-19.
Figure 6-19 Storwize V7000 Read Cache hits percentage - per node
Read Hit Percentages can vary from near 0% to near 100%. Anything below 50% is considered low, but many database applications show hit ratios below 30%. For very low hit ratios, you need many ranks providing good back-end response time. It is difficult to predict whether more cache will improve the hit ratio for a particular application. Hit ratios are more dependent on the application design and amount of data, than on the size of cache (especially for Open System workloads). But larger caches are always better than smaller ones. For high hit ratios, the back-end ranks can be driven a little harder, to higher utilizations. It is not possible to increase the size of the cache in a particular SVC (or Storwize V7000) node. Therefore if you have a cache problem, it is important that you understand how the cache works and the implications of the structure at the back-end.
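As a small worked example of these ratios (hypothetical counter values, not data from our lab):

```python
# Overall cache hit percentage combines read and write hits over a sample interval.
def overall_hit_pct(read_hits, reads, write_hits, writes):
    total = reads + writes
    return 100.0 * (read_hits + write_hits) / total if total else 0.0

# A database-like workload: only 300 of 1000 reads hit cache (a 30% read hit
# ratio, low but common for databases), while 990 of 1000 writes are absorbed
# by cache, as is typical for writes.
pct = overall_hit_pct(300, 1000, 990, 1000)  # 64.5 overall
```

Note how the healthy-looking overall figure masks the poor read hit ratio, which is why Read Cache Hits Percentage is usually the more telling metric.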
If you need to analyze cache performance metrics further, to understand whether cache is sufficient for your workload, you can run multiple-metric charts. Select all the metrics whose names include "percentage"; metrics with the same unit type can be combined in one chart, as shown in Figure 6-20, where two percentage metrics are selected for a report on SVC1 node1.
In our example we compare the reports in the same time frame for SVC1 node1 and node2, selecting one node for each report. See Figure 6-21 and Figure 6-22.
We notice in Figure 6-21 on page 320 that node1 shows a high Write Cache Delay Percentage (almost 80%), a Write Cache Hits Percentage of almost 0, and a drop in Read Cache Hits Percentage. These values, together with an increase in back-end response time, show that node1 is heavily burdened with I/O; in this time interval, the SVC cache is probably full of outstanding write I/Os. Host I/O activity is now impacted by the backlog of data in the SVC cache, as is any other SVC workload going to the same Managed Disk Group. Figure 6-22 shows a completely different situation for node2, because there is no traffic stressing that node. Therefore, the two figures show a very poorly balanced configuration for SVC1.
6.2.3 Fabric
Monitoring your fabric environment is important so that you know how much data is transferring across your SAN. Tivoli Storage Productivity Center provides performance information by port across all monitored switches. When you have Inter-Switch Link (ISL) traffic between switches in the same fabric, it is critical to monitor the ports that carry it (named E_Ports), so that you have sufficient bandwidth to satisfy your application response time. For capacity planning, it is especially important that sufficient bandwidth is available in a Copy Services environment when you are mirroring between subsystems. You can identify the E_Ports by looking at the Fabric Topology view. In the tabular view of a switch, select the Switch Port tab, so you can see the port type in one of the displayed columns. As shown in Figure 6-23, you can see the port types for the jumbo switch. As you can see, there is an E_Port (port 12 in slot 7, index 76) connected to switch l3bumper.
Known limitation: At this stage, the monitoring of E_Ports is a two-step process. You need to identify the relevant E_Port number(s) and then use the Selection option in a report by port.
Click Ok, then Generate Report. You can then use this report to keep track of your E_Port performance. Figure 6-25 shows an example of a report for the Send and Received Bandwidth Percentage metrics, in which peaks are present in Send Bandwidth Percentage (in a production environment, the 80% peak should trigger an alert).
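Bandwidth Percentage simply relates the measured port data rate to the link's nominal capacity. A rough sketch, assuming the common rule of thumb of roughly 100 MB/s of payload per Gbps of Fibre Channel link speed (an approximation, not Tivoli Storage Productivity Center's exact formula):

```python
# Approximate percent utilization of a Fibre Channel port.
def bandwidth_pct(data_rate_mb_s, link_speed_gbps):
    capacity_mb_s = link_speed_gbps * 100.0  # ~100 MB/s of payload per Gbps
    return 100.0 * data_rate_mb_s / capacity_mb_s

pct = bandwidth_pct(320.0, 4)  # 320 MB/s on a 4 Gbps ISL -> 80 percent
```

At sustained utilization in this range, the ISL has little headroom left for bursts, which is why such a peak warrants an alert.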
Port throughput
You also need to monitor individual port throughput to ensure that your application has sufficient bandwidth available. If the switch or HBA ports are a bottleneck, additional ports or HBAs must be installed. You also need to install a multipath driver to be able to use the extra paths.
Appendix A.
Appendix B.
Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
This appendix contains a list of performance metrics and thresholds for IBM Tivoli Storage Productivity Center performance reports, with explanations of their meanings.
Counters
The counters in the firmware are usually unsigned 32-bit or 64-bit counters. Eventually, these counters wrap, meaning that the difference between the counters at T2 and T1 might be difficult to interpret. The Tivoli Storage Productivity Center Performance Manager adjusts for these wraps during its delta computations and stores the deltas in the database. Certain counters are also stored in the Tivoli Storage Productivity Center database, but the performance data is mostly comprised of rates and other calculated metrics that depend on the counter deltas and the sample interval, that is, the time between T1 and T2.
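The wrap adjustment and the subsequent rate calculation can be sketched as follows (illustrative only, not the actual Performance Manager code):

```python
# Modular subtraction yields the correct delta even when an unsigned
# 32-bit counter wrapped between samples T1 and T2.
def counter_delta(v1, v2, width_bits=32):
    return (v2 - v1) % (1 << width_bits)

def io_rate(v1, v2, interval_s, width_bits=32):
    """Operations per second over the sample interval (T2 - T1)."""
    return counter_delta(v1, v2, width_bits) / interval_s

# The counter was 10 below the 32-bit limit at T1 and reads 90 at T2:
delta = counter_delta(2**32 - 10, 90)  # 100 operations, not a huge negative number
ios = io_rate(2**32 - 10, 90, 300)     # rate over a 5-minute sample interval
```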
Essential metrics
The primary and essential performance metrics are few and simple: for example, Read I/O Rate, Write I/O Rate, Read Response Time, and Write Response Time. Also important are data rates and transfer sizes. Then come the cache behaviors, in the form of Read Hit Ratio and Write Cache Delays (percentages and rates). There are a myriad of additional metrics in the Tivoli Storage Productivity Center performance reports, but they need to be used as adjuncts to the primary metrics, sometimes helping you to understand why the primary metrics have the values they have. A very few metrics measure other kinds of values. For example, SVC and Storwize V7000 storage subsystems also report the maximum read and write response times that occurred between times T1 and T2; each time a sample of the counters is pulled, this type of counter is set back to zero. But the vast majority of counters are monotonically increasing, reset to zero only in very particular circumstances, such as hardware, software, or firmware resets. The design of the Tivoli Storage Productivity Center Performance Manager allows several storage subsystems to be included in a report (or individual subsystems by selection or filtering). But not all the metrics apply to every subsystem or component; in these cases, a -1 appears, indicating that no data is expected for the metric in this particular case.
In the remainder of this section, we look at the metrics that can be selected for each report. We examine the reports in the order in which they appear in the Tivoli Storage Productivity Center Navigation Tree.
New FC port performance metrics and thresholds in Tivoli Storage Productivity Center 4.2.1 release
The Performance Manager component in IBM Tivoli Storage Productivity Center collects, reports, and alerts users on various performance metrics for a variety of SAN devices. One request by customers is to provide more information regarding FC link problems in their SAN environment, particularly related to their DS8000 and SVC (or Storwize V7000) ports.
Metrics
Numerous metrics are already collected for DS8000, SVC, and Storwize V7000 ports; however, those pertaining to error counts were previously not tracked or reported by Tivoli Storage Productivity Center. For consistency, switch port counters that match counters for DS8000, SVC, or Storwize V7000 ports but were not previously exposed as metrics must be displayed in reports as well. The following error counters are provided by this work item:
Error frame rate for DS8000 ports: The number of frames per second that violated Fibre Channel protocol for a particular port over a particular time interval.
Link failure rate for SVC, Storwize V7000, and DS8000 ports: The average number of miscellaneous Fibre Channel link errors per second, such as an unexpected NOS received or a link state machine failure detected, experienced by a particular port over a particular time interval.
Loss-of-synchronization rate for SVC, Storwize V7000, and DS8000 ports: The average number of loss-of-synchronization errors per second, where there is a confirmed and persistent synchronization loss on the Fibre Channel link, for a particular port over a particular time interval.
Loss-of-signal rate for SVC, Storwize V7000, and DS8000 ports: The average number of times per second that a loss of signal was detected on the Fibre Channel link when a signal was previously detected, for a particular port over a particular time interval.
Invalid CRC rate for SVC, Storwize V7000, and DS8000 ports: The average number of frames received per second in which the CRC in the frame did not match the CRC computed by the receiver, for a particular component over a particular period of time.
Primitive Sequence protocol error rate for SVC, Storwize V7000, DS8000, and Switch ports: The average number of primitive sequence protocol errors per second, where an unexpected primitive sequence was received on a particular port over a particular time interval.
Invalid transmission word rate for SVC, Storwize V7000, DS8000, and Switch ports: The average number of times per second that bit errors were detected on a particular port over a particular time interval.
Zero buffer-buffer credit timer for SVC and Storwize V7000 ports: The number of microseconds for which the port has been unable to send frames due to lack of buffer credit since the last node reset.
334
Link Recovery (LR) sent rate for DS8000 and Switch ports
  The average number of times per second that a port transitioned from an active (AC) state to a Link Recovery (LR1) state over a particular time interval. Note: This is believed to be the same as Link Reset transmitted.

Link Recovery (LR) received rate for DS8000 and Switch ports
  The average number of times per second that a port transitioned from an active (AC) state to a Link Recovery (LR2) state over a particular time interval. Note: This is believed to be the same as Link Reset received.

Out of order data rate for DS8000 ports
  The average number of times per second that an out-of-order frame was detected for a particular port over a particular time interval.

Out of order ACK rate for DS8000 ports
  The average number of times per second that an out-of-order ACK frame was detected for a particular port over a particular time interval.

Duplicate frame rate for DS8000 ports
  The average number of times per second that a frame was received that had already been detected as processed, for a particular port over a particular time interval.

Invalid relative offset rate for DS8000 ports
  The average number of times per second that a frame was received with a bad relative offset in the frame header, for a particular port over a particular time interval.

Sequence timeout rate for DS8000 ports
  The average number of times per second that the port detected a timeout condition on receiving sequence initiative for a Fibre Channel exchange, for a particular port over a particular time interval.

Note: Bit error rate for DS8000 ports will not be supported. The metric is very similar to the invalid transmission word rate, which will be supported, and its 5-minute counting window makes the counter unreliable for collection intervals longer than 5 minutes.
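All of the rate metrics above share one derivation: the device exposes a cumulative error counter, and the reported metric is the counter delta between two collections divided by the interval length. A hedged sketch of that computation (the function name and wrap handling are illustrative):

```python
def error_rate_per_second(prev_count, curr_count, interval_seconds):
    """Average errors per second over one sample interval, derived from
    two readings of a cumulative error counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = curr_count - prev_count
    if delta < 0:  # counter wrapped or the device was reset
        delta = curr_count
    return delta / interval_seconds
```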
Thresholds
While it is preferable to be able to define thresholds for each of the new metrics being introduced, the following thresholds are currently deemed the most important to include at this time:
- Error (illegal) frame rate for DS8000 ports
- Link failure rate for SVC, Storwize V7000, and DS8000 ports
- Invalid CRC rate for SVC, Storwize V7000, DS8000, and Switch ports
- Invalid transmission word rate for SVC, Storwize V7000, DS8000, and Switch ports
- Zero buffer-buffer credit timer for SVC and Storwize V7000 ports
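Threshold support for these metrics amounts to comparing each collected sample against configured boundary values and raising an alert on violation. A simplified sketch (the two-level warning/critical scheme here is illustrative, not the product's exact alerting model):

```python
def classify_sample(value, warning_level, critical_level):
    """Classify one metric sample against warning/critical boundaries."""
    if value >= critical_level:
        return "critical"
    if value >= warning_level:
        return "warning"
    return "normal"
```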
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Common columns
Table B-1 contains information about the columns that are common among performance reports.
Table B-1 Common columns

Time
  Description: Date and time that the data was collected.

Interval
  Description: Size of the sample interval, in seconds. You can specify a minimum interval length of five minutes and a maximum interval length of sixty minutes for the following models:
  - Enterprise Storage Server
  - DS6000
  - DS8000
  - XIV storage system

  For SAN Volume Controller models earlier than V4.1, you can specify a minimum interval length of 15 minutes and a maximum interval length of 60 minutes. For SAN Volume Controller models V4.1 and later and for Storwize V7000, you can specify a minimum interval length of 5 minutes and a maximum interval length of 60 minutes.
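The interval rules above can be summarized as a small lookup. A sketch under the assumption that firmware versions are compared as tuples (the helper function is illustrative, not a product API):

```python
def interval_limits_minutes(device, svc_version=None):
    """Return the (minimum, maximum) sample-interval length in minutes
    for a device family, per the rules described in the text."""
    if device in ("Enterprise Storage Server", "DS6000", "DS8000", "XIV"):
        return (5, 60)
    if device == "SVC":
        # SAN Volume Controller earlier than V4.1 requires a 15-minute
        # minimum interval; V4.1 and later allow 5 minutes.
        if svc_version is not None and svc_version < (4, 1):
            return (15, 60)
        return (5, 60)
    if device == "Storwize V7000":
        return (5, 60)
    raise ValueError(f"unknown device family: {device}")
```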
Note: When you view metrics for the ESS and DS series of storage systems, you must take into account the following differences between Tivoli Storage Productivity Center reports and the native reports of those systems:
- Tivoli Storage Productivity Center reports display port performance metrics as send and receive metrics (for example, Send Data Rate and Receive Data Rate). Storage system native reports (for example, reports based on data collected by the DS CLI) display port performance metrics as read and write metrics (for example, Byteread and Bytewrite).
- When a host performs a read operation, the DS port sends data to the host. Therefore, "read" metrics in DS reports correspond to "send" metrics in Tivoli Storage Productivity Center reports.
- When a host performs a write operation, DS ports receive data from the host. Therefore, "write" metrics in DS reports correspond to "receive" metrics in Tivoli Storage Productivity Center reports.

When you view port Peer-to-Peer Remote Copy (PPRC) performance metrics, you must take into account the following additional differences between Tivoli Storage Productivity Center reports and native reports for storage systems:
- Metrics for PPRC reads in storage system native reports are represented as PPRC receives in Tivoli Storage Productivity Center (reads = receives).
- Metrics for PPRC writes in storage system native reports are represented as PPRC sends in Tivoli Storage Productivity Center (writes = sends).
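The correspondence in the note reduces to a fixed lookup from host-relative DS counter names to port-relative Tivoli Storage Productivity Center metric names. A sketch (the PPRC counter names here are illustrative placeholders, not exact DS CLI names):

```python
# Host-relative DS counters -> port-relative TPC metrics:
# a host read makes the port send data; a host write makes it receive.
DS_TO_TPC_METRIC = {
    "Byteread": "Send Data Rate",
    "Bytewrite": "Receive Data Rate",
    "PPRC read": "PPRC Receive",   # illustrative counter name
    "PPRC write": "PPRC Send",     # illustrative counter name
}
```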
The following notation is used in the metric tables:
- XIV1 is displayed next to metrics that are available in XIV storage system version 10.2.2 or later.
- XIV2 is displayed next to metrics that are available in XIV storage system version 10.2.4 or later.
For example: The Read I/O Rate (overall) metric is available for XIV storage systems version 10.2.2 and later. In the Devices: components column of the list of metrics, the entry for Read I/O Rate (overall) is displayed like this: XIV1 : volume, module, subsystem The Small Transfers Response Time metric is available for XIV storage systems version 10.2.4 and later. In the Devices: components column of the list of metrics, the entry for Small Transfers Response Time is displayed like this: XIV2 : volume, module, subsystem
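The XIV1/XIV2 notation is effectively a minimum-firmware gate on each metric. A sketch of the check, assuming version strings are parsed into integer tuples (the helper is illustrative):

```python
def xiv_metric_available(marker, xiv_version):
    """True when the XIV firmware level supports a metric tagged with
    the XIV1 (>= 10.2.2) or XIV2 (>= 10.2.4) marker."""
    minimum = {"XIV1": (10, 2, 2), "XIV2": (10, 2, 4)}[marker]
    return tuple(xiv_version) >= minimum
```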
Volume-based metrics
Table B-2 contains information about volume-based metrics. Note: Tivoli Storage Productivity Center does not calculate volume-based metrics if there are space efficient volumes allocated in an extent pool consisting of multiple ranks. In this case, the columns for volume-based metrics display the value N/A in the Storage Subsystem Performance By Array report for the arrays associated with that extent pool. However, if there are no space efficient volumes allocated in a multi-rank extent pool, or if the space efficient volumes are allocated in an extent pool consisting of a single rank, then this limitation does not apply and all volume-based metrics are displayed in the By Array report.
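The rule in the note can be stated as a single predicate: volume-based metrics are suppressed for an array only when its extent pool both spans multiple ranks and contains space-efficient volumes. A sketch (the function name is illustrative):

```python
def volume_metrics_reported(ranks_in_extent_pool, has_space_efficient_volumes):
    """False means the Storage Subsystem Performance By Array report
    shows N/A for the arrays associated with this extent pool."""
    return not (ranks_in_extent_pool > 1 and has_space_efficient_volumes)
```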
Table B-2 Volume-based metrics (Column / Devices: components / Description)

I/O Rates

Read I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential read operations for a component over a specified time interval.

Read I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential read operations for a component over a specified time interval.

Read I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential read operations for a component over a specified time interval.
Write I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential write operations for a component over a specified time interval.

Write I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential write operations for a component over a specified time interval.

Write I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential write operations for a component over a specified time interval.

Total I/O Rate (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for nonsequential read and write operations for a component over a specified time interval.

Total I/O Rate (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of I/O operations per second for sequential read and write operations for a component over a specified time interval.

Total I/O Rate (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of I/O operations per second for both sequential and nonsequential read and write operations for a component over a specified time interval.
Global Mirror Write I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of write operations per second issued to the Global Mirror secondary site for a component over a specified time interval.

Global Mirror Overlapping Write Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average percentage of write operations issued by the Global Mirror primary site which were serialized overlapping writes for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes were serialized.

Global Mirror Overlapping Write I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of serialized overlapping write operations per second encountered by the Global Mirror primary site for a component over a specified time interval. For SVC 4.3.1 and later, some overlapping writes are processed in parallel (are not serialized) and are excluded. For earlier SVC versions, all overlapping writes are serialized.

HPF Read I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of read operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

HPF Write I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of write operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

Total HPF I/O Rate
  Devices: DS8000: volume, array, controller, subsystem
  Description: Average number of read and write operations per second that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

HPF I/O Percentage
  Devices: DS8000: volume, array, controller, subsystem
  Description: The percentage of all I/O operations that were issued by the High Performance FICON (HPF) feature of the storage subsystem for a component over a specified time interval.

PPRC Transfer Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Average number of track transfer operations per second for Peer-to-Peer Remote Copy (PPRC) usage for a component over a specified time interval. This metric shows the activity for the source of the PPRC relationship, but shows no activity for the target.

Small Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are <= 8 KB.

Medium Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers I/O Percentage
  Description: Percentage of I/O operations over a specified interval. Applies to data transfer sizes that are > 512 KB.

Cache hit percentages

Read Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential read operations for a component over a specified time interval.

Read Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential read operations for a component over a specified time interval.

Read Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential read operations for a component over a specified time interval.
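The transfer-size percentage metrics above partition I/Os into four buckets by data transfer size; the boundaries can be captured directly (a sketch, not product code):

```python
def transfer_size_bucket(transfer_kb):
    """Bucket an I/O by its data transfer size, using the boundaries
    from the Small/Medium/Large/Very Large Transfers metrics."""
    if transfer_kb <= 8:
        return "small"
    if transfer_kb <= 64:
        return "medium"
    if transfer_kb <= 512:
        return "large"
    return "very large"
```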
Write Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential write operations for a component over a specified time interval.

Write Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential write operations for a component over a specified time interval.
Write Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential write operations for a component over a specified time interval.
Total Cache Hits Percentage (normal)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for nonsequential read and write operations for a component over a specified time interval.

Total Cache Hits Percentage (sequential)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem
  Description: Percentage of cache hits for sequential read and write operations for a component over a specified time interval.

Total Cache Hits Percentage (overall)
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Percentage of cache hits for both sequential and nonsequential read and write operations for a component over a specified time interval.
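Each cache-hit percentage is the hit count divided by the corresponding operation count for the interval; for the overall variants, the read and write counts are summed first. A sketch of the overall calculation (the function is illustrative):

```python
def overall_cache_hit_percentage(read_hits, write_hits, read_ios, write_ios):
    """Overall cache-hit percentage: combined hits over combined I/Os."""
    total_ios = read_ios + write_ios
    if total_ios == 0:
        return 0.0
    return 100.0 * (read_hits + write_hits) / total_ios
```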
Readahead Percentage of Cache Hits
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of all read cache hits which occurred on prestaged data.

Dirty Write Percentage of Cache Hits
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of all write cache hits which occurred on already dirty data in the cache.

Read Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of read data that was read from the cache over a specified time interval.

Write Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of write data that was written to the cache over a specified time interval.

Total Data Cache Hit Percentage
  Devices: XIV2: volume, module, subsystem
  Description: Percentage of all data that was read from or written to the cache for a component over a specified time interval.

Data rates

Read Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for read operations for a component over a specified time interval.
Write Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for write operations for a component over a specified time interval.

Total Data Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of megabytes (2^20 bytes) per second that were transferred for read and write operations for a component over a specified time interval.
Small Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are <= 8 KB.

Medium Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers Data Percentage
  Description: Percentage of data that was transferred over a specified interval. Applies to I/O operations with data transfer sizes that are > 512 KB.

Response times

Read Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each read operation for a component over a specified time interval.
Write Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each write operation for a component over a specified time interval.

Overall Response Time
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of milliseconds that it took to service each I/O operation (read and write) for a component over a specified time interval.
Peak Read Response Time
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The peak (worst) response time among all read operations.

Peak Write Response Time
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The peak (worst) response time among all write operations.

Global Mirror Write Secondary Lag
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: The average number of additional milliseconds it takes to service each secondary write operation for Global Mirror, over and above the time that is required to service primary writes.

Overall Host Attributed Response Time Percentage
  Description: The percentage of the average response time, both read response time and write response time, that can be attributed to delays from host systems. This metric is provided to help diagnose slow hosts and poorly performing fabrics. The value is based on the time taken for hosts to respond to transfer-ready notifications from the SVC nodes (for read) and the time taken for hosts to send the write data after the node has responded to a transfer-ready notification (for write).

Read Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each read cache hit operation over a specified time interval.

Write Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each write cache hit operation over a specified time interval.

Overall Cache Hit Response Time
  Description: Average number of milliseconds that it takes to service each read cache hit operation and each write cache hit operation over a specified time interval.

Read Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each read cache miss operation over a specified time interval.
Write Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each write cache miss operation over a specified time interval.

Overall Cache Miss Response Time
  Description: Average number of milliseconds that it takes to service each read cache miss operation and each write cache miss operation over a specified time interval.

Small Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are <= 8 KB.

Medium Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 8 KB and <= 64 KB.

Large Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 64 KB and <= 512 KB.

Very Large Transfers Response Time
  Description: Average number of milliseconds that it takes to service each I/O operation. Applies to data transfer sizes that are > 512 KB.

Transfer sizes

Read Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for read operations for a component over a specified time interval.

Write Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for write operations for a component over a specified time interval.

Overall Transfer Size
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, MDisk group, subsystem; SMI-S BSP: volume, controller, subsystem; XIV1: volume, module, subsystem
  Description: Average number of KB per I/O for read and write operations for a component over a specified time interval.
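A transfer-size column is simply the ratio of the data-rate and I/O-rate columns for the same component and interval (with 1 MB = 2^20 bytes = 1024 KB). A sketch of the derivation (the function is illustrative):

```python
def average_transfer_size_kb(data_rate_mb_per_sec, io_rate_per_sec):
    """Average KB per I/O, derived from a data rate in MB/s (1 MB =
    1024 KB) and an I/O rate in operations per second."""
    if io_rate_per_sec == 0:
        return 0.0
    return data_rate_mb_per_sec * 1024 / io_rate_per_sec
```

For example, 100 MB/s at 12,800 I/Os per second corresponds to an average transfer size of 8 KB.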
Write-cache constraints

Write-cache Delay Percentage
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of I/O operations that were delayed due to write-cache space constraints or other conditions for a component over a specified time interval. (The ratio of delayed operations to total I/Os.)

Write-cache Delayed I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller, subsystem; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations per second that were delayed due to write-cache space constraints or other conditions for a component over a specified time interval.

Write-cache Overflow Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were delayed due to lack of write-cache space for a component over a specified time interval.

Write-cache Overflow I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were delayed due to lack of write-cache space for a component over a specified time interval.

Write-cache Flush-through Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were processed in Flush-through write mode for a component over a specified time interval.

Write-cache Flush-through I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were processed in Flush-through write mode for a component over a specified time interval.

Write-cache Write-through Percentage
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Percentage of write operations that were processed in Write-through write mode for a component over a specified time interval.

Write-cache Write-through I/O Rate
  Devices: SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of tracks per second that were processed in Write-through write mode for a component over a specified time interval.

Record mode reads

Record Mode Read I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller
  Description: Average number of I/O operations per second for record mode read operations for a component over a specified time interval.

Record Mode Read Cache %
  Devices: ESS/DS6000/DS8000: volume, array, controller
  Description: Percentage of cache hits for record mode read operations for a component over a specified time interval.

Cache transfers

Disk to Cache I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations (track transfers) per second for disk to cache transfers for a component over a specified time interval.

Cache to Disk I/O Rate
  Devices: ESS/DS6000/DS8000: volume, array, controller; SVC, Storwize V7000: volume, node, I/O group, subsystem
  Description: Average number of I/O operations (track transfers) per second for cache to disk transfers for a component over a specified time interval.

Miscellaneous computed values
Cache Holding Time
  Devices: ESS/DS6000/DS8000: controller, subsystem
  Description: Average cache holding time, in seconds, for I/O data in this subsystem controller (cluster). Shorter time periods indicate adverse performance.

CPU Utilization
  Devices: SVC, Storwize V7000: node, I/O group, subsystem
  Description: Average utilization percentage of the processors.

Non-Preferred Node Usage Percentage
  Devices: SVC, Storwize V7000: volume, I/O group
  Description: The overall percentage of I/O performed or data transferred by the non-preferred nodes of the volumes, for a component over a specified time interval.

Volume Utilization
  Devices: ESS/DS6000/DS8000: volume; SVC, Storwize V7000: volume; XIV1: volume
  Description: The approximate utilization percentage of a volume over a specified time interval (the average percent of time that the volume was busy).
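Utilization columns such as Volume Utilization and the back-end Disk Utilization Percentage express busy time as a fraction of the sample interval. A sketch of that approximation (illustrative, not the product's exact formula):

```python
def utilization_percentage(busy_ms, interval_ms):
    """Approximate utilization: the percent of the interval during
    which the component was busy, capped at 100."""
    return 100.0 * min(busy_ms, interval_ms) / interval_ms
```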
Back-end-based metrics
Table B-3 contains information about back-end-based metrics.
Table B-3 Back-end-based metrics (Column / Devices: components / Description)

I/O rates

Back-End Read I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for read operations.

Back-End Write I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for write operations.

Total Back-End I/O Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of I/O operations per second for read and write operations.

Data rates

Back-End Read Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for read operations.
Back-End Write Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for write operations.

Total Back-End Data Rate
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of megabytes (2^20 bytes) that were transferred for read and write operations.
Response times

Back-End Read Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each read operation. For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Back-End Write Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each write operation. For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Overall Back-End Response Time
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that it took to respond to each I/O operation (read and write). For SAN Volume Controller models, this is the external response time of the managed disks (MDisks).

Back-End Read Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that each read operation spent on the queue before being issued to the back-end device.

Back-End Write Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that each write operation spent on the queue before being issued to the back-end device.

Overall Back-End Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of milliseconds that read and write operations spent on the queue before being issued to the back-end device.

Peak Back-End Read Response Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The peak (worst) response time among all read operations for a component over a specified time interval. For SAN Volume Controller, it represents the external response time of the MDisks.

Peak Back-End Write Response Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The peak (worst) response time among all write operations for a component over a specified time interval. For SAN Volume Controller, it represents the external response time of the MDisks.
Peak Back-End Read Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The lower bound on the peak (worst) queue time for read operations for a component over a specified time interval. The queue time is the amount of time that the read operation spent on the queue before being issued to the back-end device.

Peak Back-End Write Queue Time
  Devices: SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: The lower bound on the peak (worst) queue time for write operations for a component over a specified time interval. The queue time is the amount of time that the write operation spent on the queue before being issued to the back-end device.

Transfer sizes

Back-End Read Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for read operations for a component over a specified time interval.

Back-End Write Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for write operations for a component over a specified time interval.

Overall Back-End Transfer Size
  Devices: ESS/DS6000/DS8000: rank, array, controller, subsystem; SVC, Storwize V7000: node, I/O group, MDisk, MDisk group, subsystem
  Description: Average number of KB per I/O for read and write operations for a component over a specified time interval.

Disk utilization

Disk Utilization Percentage
  Devices: ESS/DS6000/DS8000: array
  Description: The approximate utilization percentage of a rank over a specified time interval (the average percent of time that the disks associated with the array were busy). Note: Tivoli Storage Productivity Center does not calculate a value for this column if there are multiple ranks in the extent pool where the space-efficient volumes are allocated. This column displays a value of N/A for the reports in which it appears. However, if there is only a single rank in the extent pool, Tivoli Storage Productivity Center does calculate the value for this column regardless of the space-efficient volumes.

Sequential I/O Percentage
  Devices: ESS/DS6000/DS8000: array
  Description: Percentage of all I/O operations performed for an array over a specified time interval that were sequential operations.
Port to Disk Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to storage subsystems by a component over a specified time interval.

Port to Disk Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from storage subsystems by a component over a specified time interval.

Total Port to Disk I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between storage subsystems and a component over a specified time interval.

Port to Local Node Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to other nodes in the local SAN Volume Controller cluster by a component over a specified time interval.

Port to Local Node Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from other nodes in the local SAN Volume Controller cluster by a component over a specified time interval.

Total Port to Local Node I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between other nodes in the local SAN Volume Controller cluster and a component over a specified time interval.

Port to Remote Node Send I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second sent to nodes in the remote SAN Volume Controller cluster by a component over a specified time interval.

Port to Remote Node Receive I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second received from nodes in the remote SAN Volume Controller cluster.

Total Port to Remote Node I/O Rate
  Devices: SVC, Storwize V7000: port, node, I/O group, subsystem
  Description: Average number of exchanges (I/Os) per second transmitted between nodes in the remote SAN Volume Controller cluster and a component over a specified time interval.

Port FCP Send I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second using the FCP protocol, for a port over a specified time interval.

Port FCP Receive I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of receive operations per second using the FCP protocol for a port over a specified time interval.

Total Port FCP I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send and receive operations per second using the FCP protocol for a port over a specified time interval.

Port FICON Send I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second using the FICON protocol for a port over a specified time interval.

Port FICON Receive I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of receive operations per second using the FICON protocol for a port over a specified time interval.

Total Port FICON I/O Rate*
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send and receive operations per second using the FICON protocol for a port over a specified time interval.

Port PPRC Send I/O Rate
  Devices: ESS/DS6000/DS8000: port
  Description: Average number of send operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Port PPRC Receive I/O Rate Total Port PPRC I/O Rate Data rates Port Send Data Rate
Average number of receive operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of send and receive operations per second for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for send (read) operations for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for receive (write) operations for a port over a specified time interval.
ESS/DS6000/DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SMI-S BSP: port switch port, switch XIV2: port
Average number of megabytes (2^20 bytes) per second that were transferred for send and receive operations for a port over a specified time interval.
Port Peak Send Data Rate Port Peak Receive Data Rate Port to Host Send Data Rate Port to Host Receive Data Rate
switch port switch port SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Peak number of megabytes (2^20 bytes) per second that were sent by a port over a specified time interval. Peak number of megabytes (2^20 bytes) per second that were received by a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to host computers by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from host computers by a component over a specified time interval.
Total Port to Host Data Rate Port to Disk Send Data Rate Port to Disk Receive Data Rate Total Port to Disk Data Rate Port to Local Node Send Data Rate Port to Local Node Receive Data Rate
SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Average number of megabytes (2^20 bytes) per second transmitted between host computers and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to storage subsystems by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from storage subsystems by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between storage subsystems and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to other nodes in the local SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from other nodes in the local SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between other nodes in the local SAN Volume Controller cluster and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent to nodes in the remote SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second received from nodes in the remote SAN Volume Controller cluster by a component over a specified time interval. Average number of megabytes (2^20 bytes) per second transmitted between nodes in the remote SAN Volume Controller cluster and a component over a specified time interval. Average number of megabytes (2^20 bytes) per second sent over the FCP protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received over the FCP protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent or received over the FCP protocol for a port over a specified time interval. 
Average number of megabytes (2^20 bytes) per second sent over the FICON protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received over the FICON protocol for a port over a specified time interval.
Port to Remote Node Send Data Rate Port to Remote Node Receive Data Rate Total Port to Remote Node Data Rate
SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
Port FCP Send Data Rate* Port FCP Receive Data Rate* Total Port FCP Data Rate* Port FICON Send Data Rate* Port FICON Receive Data Rate*
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
Total Port FICON Data Rate* Port PPRC Send Data Rate Port PPRC Receive Data Rate Total Port PPRC Data Rate Response times Port Send Response Time
ESS/DS6000/DS8000: port
Average number of megabytes (2^20 bytes) per second sent or received over the FICON protocol for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second sent for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second received for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of megabytes (2^20 bytes) per second transferred for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Average number of milliseconds that it took to service each send (read) operation for a port over a specified time interval. Average number of milliseconds that it took to service each receive (write) operation for a port over a specified time interval. Average number of milliseconds that it took to service each operation (send and receive) for a port over a specified time interval. Average number of milliseconds it took to service each send operation to another node in the local SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each receive operation from another node in the local SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each send or receive operation between another node in the local SAN Volume Controller cluster and a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds that each send operation issued to another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each receive operation from another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval.
Average number of milliseconds that each operation issued to another node in the local SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds it took to service each send operation to a node in the remote SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each receive operation from a node in the remote SAN Volume Controller cluster for a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds it took to service each send or receive operation between a node in the remote SAN Volume Controller cluster and a component over a specified time interval. This is the external response time of the transfers. Average number of milliseconds that each send operation issued to a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each receive operation from a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds that each operation issued to a node in the remote SAN Volume Controller cluster spent on the queue before being issued for a component over a specified time interval. Average number of milliseconds it took to service all send operations over the FCP protocol for a port over a specified time interval. Average number of milliseconds it took to service all receive operations over the FCP protocol for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations over the FCP protocol for a port over a specified time interval. 
Average number of milliseconds it took to service all send operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all receive operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations over the FICON protocol for a port over a specified time interval. Average number of milliseconds it took to service all send operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Port to Remote Node Receive Response Time Total Port to Remote Node Response Time
Port FCP Send Response Time* Port FCP Receive Response Time* Overall Port FCP Response Time* Port FICON Send Response Time* Port FICON Receive Response Time* Overall Port FICON Response Time* Port PPRC Send Response Time
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
ESS/DS6000/DS8000: port
Port PPRC Receive Response Time Overall Port PPRC Response Time Transfer sizes Port Send Transfer Size
Average number of milliseconds it took to service all receive operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval. Average number of milliseconds it took to service all I/O operations for Peer-to-Peer Remote Copy usage for a port over a specified time interval.
Average number of KB sent per I/O by a port over a specified time interval.
Average number of KB received per I/O by a port over a specified time interval.
Average number of KB transferred per I/O by a port over a specified time interval.
Port Send Packet Size Port Receive Packet Size Overall Port Packet Size
Average number of KB sent per packet by a port over a specified time interval.
Average number of KB received per packet by a port over a specified time interval.
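The transfer-size and packet-size metrics above are, in effect, the ratio of the corresponding data-rate and I/O-rate metrics over the same interval. The following sketch illustrates that relationship only; the function name and the zero-rate handling are assumptions, not documented Tivoli Storage Productivity Center behavior.

```python
def avg_transfer_size_kb(data_rate_mbps: float, io_rate: float) -> float:
    """Approximate average KB transferred per I/O from a data rate
    (MB/s, where 1 MB = 2**20 bytes) and an I/O rate (ops/s)."""
    if io_rate == 0:
        return 0.0  # no I/O in the interval (assumed handling)
    return data_rate_mbps * 1024.0 / io_rate  # MB/s -> KB per operation

# A port moving 80 MB/s at 5000 ops/s averages 16.384 KB per I/O
print(round(avg_transfer_size_kb(80.0, 5000.0), 3))
```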
Special computed values
Port Send Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy sending data over a specified time interval.
Port Receive Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy receiving data over a specified time interval.
Overall Port Utilization Percentage (ESS/DS6000/DS8000: port): Average amount of time that the port was busy sending or receiving data over a specified time interval.
Port Send Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for send operations by a port, based on its current negotiated speed.
Port Receive Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for receive operations by this port, based on its current negotiated speed.
Overall Port Bandwidth Percentage (ESS/DS8000: port; SVC, Storwize V7000: port; switch port; XIV2: port): The approximate bandwidth utilization percentage for send and receive operations by this port.
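The bandwidth percentage metrics relate the measured data rate to the port's negotiated link speed. The exact formula Tivoli Storage Productivity Center uses is not stated here, so the following is only an illustrative sketch: the function name and the rule of thumb that an 8b/10b Fibre Channel link carries roughly 100 MB/s of payload per Gbit/s of line rate are assumptions.

```python
def port_bandwidth_pct(data_rate_mbps: float, negotiated_gbps: float) -> float:
    """Approximate bandwidth utilization: measured data rate (MB/s) as a
    percentage of the payload capacity implied by the negotiated speed."""
    usable_mbps = negotiated_gbps * 100.0  # ~100 MB/s per Gbps (assumed)
    return min(100.0, data_rate_mbps / usable_mbps * 100.0)

# 300 MB/s on a 4 Gbps port is ~75% utilized, which is the default
# warning stress boundary for this threshold (85,75,-1,-1)
print(round(port_bandwidth_pct(300.0, 4.0), 1))
```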
Error rates Error Frame Rate switch port, switch DS8000: port, subsystem Dumped Frame Rate switch port, switch The number of frames per second that were lost due to a lack of available host buffers for a port over a specified time interval. The number of link errors per second that were experienced by a port over a specified time interval. The number of frames per second that were received in error by a port over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of times per second that synchronization was lost for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of times per second that the signal was lost for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of frames received per second in which the CRC in the frame did not match the CRC computed by the receiver for a component over a specified time interval.
The average number of frames received per second that were shorter than 28 octets (24 header + 4 CRC) not including any SOF/EOF bytes for a component over a specified time interval. The average number of frames received per second that were longer than 2140 octets (24 header + 4 CRC + 2112 data) not including any SOF/EOF bytes for a component over a specified time interval. The average number of disparity errors received per second for a component over a specified time interval. The average number of class-3 frames per second that were discarded by a component over a specified time interval.
The average number of F-BSY frames per second that were generated by a component over a specified time interval. The average number of F-RJT frames per second that were generated by a component over a specified time interval. The average number of primitive sequence protocol errors detected for a component over a specified time interval.
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
switch port, switch DS8000: port, subsystem SVC, Storwize V7000: port, node, I/O group, subsystem
The average number of transmission words per second that had an 8b/10b code violation in one or more characters; had a K28.5 in its second, third, or fourth character positions; and/or was an ordered set that had an incorrect Beginning Running Disparity. The number of microseconds that the port has been unable to send frames due to lack of buffer credit since the last node reset. The average number of times per second that a port has transitioned from an active (AC) state to a Link Recovery (LR1) state over a specified time interval. The average number of times per second that a port has transitioned from an active (AC) state to a Link Recovery (LR2) state over a specified time interval. The average number of times per second that an out of order frame was detected for a port over a specified time interval. The average number of times per second that an out of order ACK frame was detected for a port over a specified time interval. The average number of times per second that a frame was received that has been detected as previously processed for a port over a specified time interval. The average number of times per second that a frame was received with an invalid relative offset in the frame header for a port over a specified time interval. The average number of times per second that the port has detected a timeout condition on receiving sequence initiative for a Fibre Channel exchange for a port over a specified time interval.
Zero Buffer-Buffer Credit Timer Link Recovery (LR) Sent Rate Link Recovery (LR) Received Rate Out of Order Data Rate Out of Order ACK Rate Duplicate Frame Rate
SVC, Storwize V7000: port, node, I/O group, subsystem switch port, switch DS8000: port, subsystem switch port, switch DS8000: port, subsystem DS8000: port, subsystem
Note: * The value N/A is displayed for this metric if you set the Summation Level to hourly or daily before generating the report.
Threshold boundaries
You can establish boundaries for the normal expected subsystem performance when defining storage subsystem alerts for performance threshold events. When a collected performance data sample falls outside the range you have set, you are notified of the threshold violation so that you are aware of the potential problem.

The upper boundaries are Critical Stress and Warning Stress; the lower boundaries are Warning Idle and Critical Idle. Usually you want the stress boundaries to be high numbers and the idle boundaries to be low numbers. The exception to this rule is the Cache Holding Time threshold, where you want the stress numbers to be low and the idle numbers to be high. If you do not want to be notified of threshold violations for a boundary, leave the boundary field blank and the performance data is not checked against any value. For example, if the Critical Idle and Warning Idle fields are left blank, no alerts are sent for any idle conditions.

The Ignore triggering condition when the sequential I/O percentage exceeds check box is active only for the Disk Utilization Percentage threshold. It is a filter condition; the default is 80%. The Ignore triggering condition when the Back-End Read I/O Rate is less than check box applies only to the Back-End Read Response Time and Back-End Read Queue Time thresholds. The Ignore triggering condition when the Back-End Write I/O Rate is less than check box applies only to the Back-End Write Response Time and Back-End Write Queue Time thresholds. The Ignore triggering condition when the Total Back-End I/O Rate is less than check box applies only to the Overall Back-End Response Time threshold. The Ignore triggering condition when the Total I/O Rate is less than check box applies only to the Non-preferred Node Usage Percentage threshold. The Ignore triggering condition when the Write-cache Delay I/O Rate is less than check box applies only to the Write-cache Delay Percentage threshold.
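The boundary and filter logic can be sketched as follows. This is a simplified illustration, not TPC's implementation: a blank boundary field (or a -1 boundary) is modeled here as None, the function name and argument names are invented, and the inverted Cache Holding Time case is not handled.

```python
def check_threshold(value, critical_stress=None, warning_stress=None,
                    warning_idle=None, critical_idle=None,
                    filter_value=None, filter_floor=None):
    """Classify one performance sample against the four boundary values.
    A boundary left as None (blank in the GUI) is not checked, and an
    active filter condition suppresses the alert entirely."""
    # Filter: e.g. ignore Back-End Read Response Time violations when
    # the Back-End Read I/O Rate is below the configured floor.
    if (filter_value is not None and filter_floor is not None
            and filter_value < filter_floor):
        return "filtered"
    if critical_stress is not None and value >= critical_stress:
        return "critical stress"
    if warning_stress is not None and value >= warning_stress:
        return "warning stress"
    if critical_idle is not None and value <= critical_idle:
        return "critical idle"
    if warning_idle is not None and value <= warning_idle:
        return "warning idle"
    return "normal"

# Disk Utilization Percentage with its default boundaries 80, 50, -1, -1
print(check_threshold(85, critical_stress=80, warning_stress=50))
print(check_threshold(60, critical_stress=80, warning_stress=50))
```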
Array thresholds
Table B-5 lists and describes the Array thresholds:
Table B-5 Array Thresholds Threshold (Metric) Array Thresholds Disk Utilization Percentage DS6000/DS8000 array Sets thresholds on the approximate utilization percentage of the arrays in a particular subsystem; for example, the average percentage of time that the disks associated with the array were busy. The Disk Utilization metric for each array is checked against the threshold boundaries for each collection interval. This threshold is enabled by default for IBM TotalStorage Enterprise Storage Server systems and disabled by default for others. The default threshold boundaries are 80%, 50%, -1, -1. For DS6000 and DS8000 subsystems, this threshold applies only to those ranks which are the only ranks in their associated extent pool. Sets thresholds on the average number of I/O operations per second for array and MDisk read and write operations. The Total I/O Rate metric for each array or MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of MB per second that were transferred for array and MDisk read and write operations. The Total Data Rate metric for each array or MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of milliseconds that it took to service each array and MDisk read operation. The Back-End Read Response Time metric for each array or MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 35,25,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Read I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Device/Component Type Description
Sets thresholds on the average number of milliseconds that it took to service each array and MDisk write operation. The Back-End Write Response Time metric for each array or MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 120,80,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Write I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Sets thresholds on the average number of milliseconds that it took to service each MDisk I/O operation, measured at the MDisk level. The Total Response Time (external) metric for each MDisk is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. A filter is available for this threshold which will ignore any boundary violations if the Total Back-End I/O Rate is less than a specified filter value. The pre-populated filter value is 10. Sets thresholds on the average number of milliseconds that each read operation spent on the queue before being issued to the back-end device. The Back-End Read Queue Time metric for each MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 5,3,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Read I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Violation of these threshold boundaries means that the SVC deems the MDisk to be overloaded. There is a queue algorithm that determines the number of concurrent I/O operations that the SVC will send to a given MDisk. If there is any queuing (other than during a backup process) then this suggests performance can be improved by resolving the queuing issue. 
Sets thresholds on the average number of milliseconds that each write operation spent on the queue before being issued to the back-end device. The Back-End Write Queue Time metric for each MDisk is checked against the threshold boundaries for each collection interval. Though this threshold is disabled by default, suggested boundary values of 5,3,-1,-1 are pre-populated. A filter is available for this threshold which will ignore any boundary violations if the Back-End Write I/O Rate is less than a specified filter value. The pre-populated filter value is 5. Violation of these threshold boundaries means that the SVC deems the MDisk to be overloaded. There is a queue algorithm that determines the number of concurrent I/O operations that the SVC will send to a given MDisk. If there is any queuing (other than during a backup process) then this suggests performance can be improved by resolving the queuing issue.
Sets thresholds on the peak (worst) response time among all MDisk write operations by a node. The Back-End Peak Write Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 30000,10000,-1,-1. Violation of these threshold boundaries means that the SVC cache is having to partition-limit for a given MDisk group. The de-staged data from the SVC cache for this MDisk group is causing the cache to fill up (writes are being received faster than they can be de-staged to disk). If delays reach 30 seconds or more, then the SVC will switch into short-term mode where writes are no longer cached for the MDisk Group. Sets thresholds on the average number of milliseconds it took to service each send operation to another node in the local SVC cluster. The Port to Local Node Send Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 3,1.5,-1,-1. Violation of these threshold boundaries means that it is taking too long to send data between nodes (on the fabric), and suggests that there is either congestion around these FC ports, or an internal SVC microcode problem. Sets thresholds on the average number of milliseconds it took to service each receive operation from another node in the local SVC cluster. The Port to Local Node Receive Response Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 1,0.5,-1,-1. Violation of these threshold boundaries means that it is taking too long to send data between nodes (on the fabric), and suggests that there is either congestion around these FC ports, or an internal SVC microcode problem. 
Sets thresholds on the average number of milliseconds that each send operation issued to another node in the local SVC cluster spent on the queue before being issued. The Port to Local Node Send Queue Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 2,1,-1,-1. Violation of these threshold boundaries means that the node has to wait too long to send data to other nodes (on the fabric), and suggests congestion on the fabric. Sets thresholds on the average number of milliseconds that each receive operation issued to another node in the local SVC cluster spent on the queue before being issued. The Port to Local Node Receive Queue Time metric for each node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundary values of 1,0.5,-1,-1. Violation of these threshold boundaries means that the node has to wait too long to receive data from other nodes (on the fabric), and suggests congestion on the fabric.
Controller thresholds
Table B-6 lists and describes the Controller thresholds:
Table B-6 Controller Thresholds Threshold (Metric) Controller Thresholds Total I/O Rate (overall) DS6000/DS8000 controller SVC, Storwize V7000 I/O group Sets thresholds on the average number of I/O operations per second for read and write operations, for the subsystem controllers (clusters) or I/O groups. The Total I/O Rate metric for each controller or I/O group is checked against the threshold boundaries for each collection interval. These thresholds are disabled by default. Sets thresholds on the average number of MB per second for read and write operations for the subsystem controllers (clusters) or I/O groups. The Total Data Rate metric for each controller or I/O group is checked against the threshold boundaries for each collection interval. These thresholds are disabled by default. Sets thresholds on the percentage of time that NVS space constraints caused I/O operations to be delayed, for the subsystem controllers (clusters). The NVS Full Percentage metric for each controller is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 10, 3, -1, -1. Sets thresholds on the average cache holding time, in seconds, for I/O data in the subsystem controllers (clusters). Shorter time periods indicate adverse performance. The Cache Holding Time metric for each controller is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 30, 60, -1, -1. Sets thresholds on the percentage of I/O operations that were delayed due to write-cache space constraints. This metric for each controller or node is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries of 10, 3, -1, -1. In addition, a filter is available for this threshold which will ignore any boundary violations if the Write-cache Delay I/O Rate is less than a specified filter value. 
The pre-populated filter value is 10 I/Os per second. Sets thresholds on the Non-Preferred Node Usage Percentage of an I/O group. This metric of each I/O group is checked against the threshold boundaries at each collection interval. This threshold is disabled by default. In addition, a filter is available for this threshold which will ignore any boundary violations if the Total I/O Rate of the I/O group is less than a specified filter value. Device/Component Type Description
DS6000/DS8000 controller
DS6000/DS8000 controller
Port thresholds
Port thresholds are used to set limits for such things as bandwidth utilization, data rates, and I/O operations. Table B-7 lists and describes the Port thresholds:
Table B-7 Threshold (Metric) Port Thresholds Total Port I/O Rate DS6000/DS8000 port switch port XIV port Total Port Data Rate DS6000/DS8000 port switch port XIV port Overall Port Response Time DS6000/DS8000 port Sets thresholds for ports on the average number of I/O operations or packets per second for send and receive operations. The Total I/O Rate metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds for ports on the average number of MB per second for send and receive operations. The Total Data Rate metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds for ports on the average number of milliseconds that it takes to service each send and receive I/O operation. The Total Response Time metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of frames per second received in error by ports. The Error Frame Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of link errors per second for ports. The Link Failure Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the critical and warning data rates for stress and idle in MB per second. The Total Port Data Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on critical and warning data rates for stress and idle conditions in packets per second. For example, a critical stress or warning stress condition occurs when the upper boundary for the packet rate of a switch is detected. 
A critical idle or warning idle condition occurs when the lower boundary for the packet rate of a switch is detected. The Total Port Packet Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average amount of time that ports are busy sending data. The metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Sets thresholds on the average amount of time that ports are busy receiving data. The metric for each port is checked against the threshold boundaries for each collection interval. This threshold is disabled by default. Device/Component Type Description
DS8000 port
Switch port
DS6000/DS8000 port
DS6000/DS8000 port
362
DS8000 port SVC, Storwize V7000 port switch port XIV port
Sets thresholds on the average port bandwidth utilization percentage for send operations. The Port Send Utilization Percentage metric is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries 85,75,-1,-1. Sets thresholds on the average port bandwidth utilization percentage for receive operations. The Port Send Utilization Percentage metric is checked against the threshold boundaries for each collection interval. This threshold is enabled by default, with default boundaries 85,75,-1,-1.
Port Receive DS8000 port Bandwidth Percentage SVC, Storwize V7000 port switch port XIV port CRC Error Rate DS8000 port SVC, Storwize V7000 port switch port Invalid Transmission Word Rate DS8000 port SVC, Storwize V7000 port switch port Zero Buffer - Buffer Credit Timer SVC, Storwize V7000 port
Sets thresholds on the average number of frames received in which the cyclic redundancy check (CRC) in a frame does not match the CRC computed by the receiver. The CRC Error Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the average number of bit errors detected on a port. The Invalid Transmission Word Rate metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default. Sets thresholds on the number of microseconds that a port has been unable to send frames because of a lack of buffer credit since the last node reset. The Zero Buffer-Buffer Credit Timer metric for each port is checked against the threshold boundary for each collection interval. This threshold is disabled by default.
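The stress and idle boundary logic that these thresholds share can be sketched as follows. This is an illustrative sketch, not Tivoli Storage Productivity Center code; the four-value boundary order (critical stress, warning stress, warning idle, critical idle) follows the 85,75,-1,-1 convention shown above, where a value of -1 disables that boundary.

```python
def evaluate_threshold(value, boundaries):
    """Classify a metric sample against a TPC-style threshold setting.

    boundaries = (critical_stress, warning_stress, warning_idle,
    critical_idle), for example (85, 75, -1, -1) for the Port Send
    Bandwidth Percentage default; -1 disables a boundary.
    """
    crit_stress, warn_stress, warn_idle, crit_idle = boundaries
    # Stress conditions: the metric exceeds an upper boundary.
    if crit_stress != -1 and value >= crit_stress:
        return "critical stress"
    if warn_stress != -1 and value >= warn_stress:
        return "warning stress"
    # Idle conditions: the metric falls below a lower boundary.
    if crit_idle != -1 and value <= crit_idle:
        return "critical idle"
    if warn_idle != -1 and value <= warn_idle:
        return "warning idle"
    return "normal"
```

With the default 85,75,-1,-1 boundaries, a port at 90% utilization is a critical stress condition and a port at 80% is a warning stress condition; the idle boundaries are disabled.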
Appendix B. Performance metrics and thresholds in Tivoli Storage Productivity Center performance reports
Appendix C. Reporting with Tivoli Storage Productivity Center

Using SQL
Because the Tivoli Storage Productivity Center database repository is a standard DB2 database, you can use external commands to access some of the information without using the Tivoli Storage Productivity Center GUI. Tivoli Storage Productivity Center provides a predefined set of table views that you should use, because these table views change only by additions in new Tivoli Storage Productivity Center releases. Furthermore, only the table views are documented. For details, see the IBM Tivoli Storage Productivity Center V4.1 Release Guide, SG24-7725: Section 10.1 provides a reporting overview and a collection of reporting information, and Chapter 11 provides details on customized reporting through Tivoli Common Reporting.
A set of User Defined Functions (UDFs) is required to get the right values and metrics out of the table views. Without the UDFs, we would need to understand each column in the table view and calculate our own metrics, which is error-prone. Choose the tab in the Excel sheet for the type of storage subsystem; we choose XIV. Afterwards, you can select the metric, as shown in Figure C-1.
Figure C-1 Choose tab XIV and filter to Total Response Time
Now we choose the category. The category can be XIV System, XIV Module, or XIV Volume. We select XIV System because we want the total (overall) response time. The Excel sheet shows two rows, as shown in Figure C-2.
The last part is to decide which table view we want to use. In Figure C-2, we see four possible views: PRF_XIV_SYSTEM, LATEST_PRF_XIV_SYSTEM, PRF_HOURLYDAILY_XIV_SYSTEM, and LATEST_PRF_HOURLYDAILY_XIV_SYSTEM. These views are described in the Tivoli Storage Productivity Center 4.2.1_TPCREPORT_schema.zip file. In this example, we use the table view called PRF_HOURLYDAILY_XIV_SYSTEM, which includes hourly aggregated performance data from the XIV storage subsystem. We set the filter in the Excel sheet and finally get the required View, Metric, Unit, and UDF & Parameters (see Figure C-3). These values are the prerequisites for creating the SQL statement.
Figure C-3 SQL Parameters
Appendix C. Reporting with Tivoli Storage Productivity Center
Important: A set of User Defined Functions (UDFs) is provided to ease the implementation of metric calculations. The UDFs automate all required transformations; you do not need to be aware of the details of the individual values in order to generate performance metrics using the UDFs. Usage of the UDFs is documented in the Excel sheet PM_Metrics.xls.

To create the SQL select statement, we look at the table view description. All table views belong to the schema TPCREPORT. From the table view PRF_HOURLYDAILY_XIV_SYSTEM (see Figure C-4), we choose DEV_ID, PRF_TIMESTAMP, and INTERVAL_LEN.
Because the DEV_ID is just a Tivoli Storage Productivity Center internal number, we use the table view STORAGESUBSYSTEM to display a meaningful name of the storage subsystem in our report (see Figure C-5).
With that information, we can now create the SQL select statement:

select s.DISPLAY_NAME, p.PRF_TIMESTAMP, p.INTERVAL_LEN,
   TPCREPORT.PM_HD_XIV_TOT_RESP_TIME(p.READ_IO, p.WRITE_IO,
      p.READ_TIME, p.WRITE_TIME) AS "TOTAL RESPONSE TIME (ms/op)"
from TPCREPORT.PRF_HOURLYDAILY_XIV_SYSTEM as p,
     TPCREPORT.STORAGESUBSYSTEM as s
where p.DEV_ID = s.SUBSYSTEM_ID
ORDER BY PRF_TIMESTAMP DESC
for fetch only with UR

The output of the command is shown in Figure C-6, which lists the Total Response Time of the XIV Storage System based on hourly performance averages.
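Under the hood, the PM_HD_XIV_TOT_RESP_TIME UDF turns the raw counters into a per-operation response time. The UDF internals are not documented here; the following is a plausible Python sketch of the calculation, assuming READ_TIME and WRITE_TIME are cumulative service times in milliseconds for the interval. The function name and column semantics are assumptions for illustration, not the actual UDF implementation.

```python
def total_response_time_ms(read_io, write_io, read_time, write_time):
    """Sketch of an overall response time in ms per operation, as an
    I/O-weighted average over reads and writes (assumed semantics:
    *_IO = operation counts, *_TIME = cumulative service time in ms)."""
    total_ops = read_io + write_io
    if total_ops == 0:
        return 0.0  # idle interval: avoid division by zero
    return (read_time + write_time) / total_ops
```

For example, 100 reads and 50 writes that together consumed 900 ms of service time yield 6.0 ms/op. This is exactly why the UDFs are valuable: without them, you would have to know which columns are counts, which are cumulative times, and how to weight them.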
Here we list the normal sequence of steps that you follow to create a report with TPCTOOL:
1. Start the TPCTOOL CLI: run the batch file that came with the client installation, <TPC_Installation_Directory>\cli\tpctool. This opens a command prompt with tpctool.
2. List the storage devices by using the lsdev command, as shown in Figure C-7.
3. Determine the component type on which you want to report by using the lstype command as shown in Figure C-8.
4. Next, decide which metrics or counters you want to include in the report. You can either use the lists provided in this book or use the lsmetrics or lscounter command. Remember that the metrics returned by the lsmetrics command are the same as the columns in the Tivoli Storage Productivity Center GUI, whereas the counters represent the raw data that Tivoli Storage Productivity Center has gathered from the CIMOMs and NAPIs.
5. Before you run the reporting command, decide on the time frame to report on and the level of the samples to include.
6. Run the report, and redirect the output to a file.

Tip: If you want to import the data into Excel later, we recommend using a semicolon as the field separator (-fs parameter). A comma can easily be mistaken for a decimal or digit-grouping symbol. The disadvantage is that Excel does not recognize the structure of such a csv file when you open it with a double-click. The book Monitoring Your Storage Subsystems with TotalStorage Productivity Center, SG24-7364, contains an Excel template that you can use with TPCTOOL for reporting.
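To illustrate why the semicolon separator matters, here is a small Python sketch (not part of TPCTOOL) that serializes report rows the way -fs ";" would: a value with a decimal comma, such as "5,2" ms in a European locale, passes through unambiguously because the comma is no longer the field separator.

```python
import csv
import io

def to_semicolon_csv(rows):
    """Serialize report rows with ';' as the field separator, mirroring
    the tpctool -fs ";" recommendation. Values containing a decimal
    comma (for example "5,2") need no quoting and import into Excel
    intact, because ';' is the delimiter."""
    buf = io.StringIO()
    csv.writer(buf, delimiter=";", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

For example, the row ["2011.06.14:05:00:00", "1250", "5,2"] becomes the line 2011.06.14:05:00:00;1250;5,2 with no ambiguity between field boundaries and decimal commas.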
7. Now you can use the data in any kind of reporting tool, for example, Excel.

The lstime command is very helpful because you can use it to verify that the performance collection is running and that data is being inserted into the database (see Example 6-1).
Example 6-1 lstime command sample output
tpctool> lstime -user administrator -pwd xxxxx -url localhost:9550 -ctype subsystem -level sample -subsys 2810.6000646+0
Start               Duration Option
===================================
2011.06.13:18:04:08 81298    server

Figure C-10 shows a performance report of an XIV storage subsystem: hourly counters, for 10 hours, starting on 2011.06.14 at 5 a.m. The reported components are:
Total I/O Rate (overall) = 809
Total Data Rate (overall) = 821
Total Response Time (overall) = 824
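The Start and Duration columns from lstime can be combined to see how far the collected sample range extends, which is a quick sanity check that the collection is still inserting data. This Python sketch assumes the Duration column is in seconds; the helper name and that assumption are ours, not a TPCTOOL feature.

```python
from datetime import datetime, timedelta

def collection_end(start_str, duration_s):
    """Compute the end of the collected sample range from an lstime row.

    start_str uses TPCTOOL's yyyy.MM.dd:HH:mm:ss timestamp format;
    duration_s is the Duration column, assumed to be seconds."""
    start = datetime.strptime(start_str, "%Y.%m.%d:%H:%M:%S")
    return start + timedelta(seconds=duration_s)
```

Applied to the row in Example 6-1, a start of 2011.06.13:18:04:08 plus 81298 seconds (about 22.6 hours) places the end of the collected range in the afternoon of the following day, consistent with an ongoing hourly collection.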
The IBM Redpaper publication Reporting with TPCTOOL, REDP-4230, which discusses TPCTOOL in more detail, is available at the following link:
http://w3.itso.ibm.com/abstracts/redp4230.html?Open
On the Selection tab, we define the following (see Figure C-12):
1. Selection: we choose the desired XIV.
2. Set Display historic performance data using relative time to 1 day (24 hours).
3. Set the Summary Level to Hourly.
4. Add the Included Columns for the report.
On the next panel, we define where to place the report file, the format of the file, and the naming convention of the file. For details, see Figure C-13.
On the next panel, we define when to run the batch report. We want to run it once a day, at 8 p.m., so every day at 8 p.m. such a report is generated. See Figure C-14 for the details.
On the last panel, you can set an alert in case the report generation fails (see Figure C-15). In this example, we send an email to the tpcadmin user in case of a failure. To be able to use alerting, you need to configure Tivoli Storage Productivity Center. You can find the alert configuration panel in the Tivoli Storage Productivity Center Navigation Tree under Administrative Services → Configuration → Alert Disposition.
At the end, save the job. We also create a batch report for the Storwize V7000, using exactly the same configuration as for the XIV, except that we select the Storwize V7000 instead of the XIV, choose another destination path for the report files, and also add the SVC/Storwize V7000 specific metric CPU Utilization to the Selected Columns. From now on, every night at 8 p.m. a report (HTML file) is generated for each storage subsystem, the Storwize V7000 and the XIV, containing 24 hourly samples. The HTML file name contains an incremental number (up to 9999), the storage subsystem device name, and the timestamp, so a file is never overwritten. After the first run of the batch job, we see the HTML files in the defined directories. See Figure C-16 for the example output of the XIV report and Figure C-17 for the output of the Storwize V7000 report.
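The collision-free naming scheme just described can be sketched as follows. The exact format string Tivoli Storage Productivity Center uses is an assumption for illustration; only the three components (sequence number up to 9999, device name, timestamp) come from the text above.

```python
def report_filename(seq, device, timestamp):
    """Hypothetical sketch of the batch-report naming scheme: a
    zero-padded incremental number (up to 9999), the storage subsystem
    device name, and the run timestamp, so no two runs collide."""
    if not 0 <= seq <= 9999:
        raise ValueError("sequence number must fit in four digits")
    return f"{seq:04d}_{device}_{timestamp}.html"
```

Because both the sequence number and the timestamp change on every run, each nightly report lands in a new file instead of overwriting the previous one.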
Related publications
The publications listed in this section are considered particularly suitable for a more detailed discussion of the topics covered in this book.
Other publications
These publications are also relevant as further information sources:
Tivoli Storage Productivity Center and Tivoli Storage Productivity Center for Replication Version 4.2.1 Installation and Configuration Guide, SC27-2337
Online resources
These websites are also relevant as further information sources:
- IBM Storage Software support website:
  http://www.ibm.com/servers/storage/support/software/
- Tivoli Storage Productivity Center product packages:
  http://publib.boulder.ibm.com/infocenter/tivihelp/v4r1/topic/com.ibm.tpc_V421.doc/fqz0_r_product_packages.html
- CIMOM compatibility matrix for fabric management (supports Tivoli Storage Productivity Center v4.2.1):
  https://www-01.ibm.com/support/docview.wss?uid=swg27019378
- http://www-01.ibm.com/support/docview.wss?rs=1134&context=SS8JFM&context=SSESLZ&dc=DB500&uid=swg21265379&loc=en_US&cs=utf-8&lang=en
Index
Symbols
78, 24, 48, 99101, 108, 182, 187, 207 cache hostile workloads 18 cached storage subsystems 60 cache-miss 222 Capacity Planning 306 Capacity reports disk capacity 308 TPC-wide Storage Space 309 storage capacity management 306 storage subsystem performance 311 Case study Basics 48 fabric performance 291 IBM XIV workload analysis 287 Server performance 267 SVC & Storwize performance constraint alerts 283 Top Volumes Response Performance 280 Topology Viewer - SVC and Fabric 296 change history 150 change overlay 122 chart reports 102 CIMOM 15, 43 CIMOM compatibility matrix 44 deployment 44 Providers 43 recommended capabilities 44 sizing 71 comma separated values (CSV) 69, 108 Common Information Model Object Manager (CIMOM) 6 Compatibility matrix 45 fabric management 45, 381 storage device management 45 configuration history 120, 124 change overlay 122 use 120 constraint 80 constraint violation reports 113, 143 constraint violation thresholds 114 constraint violations 67 controller cache performance report cache hit percentage 203 Controller cache read usage 205 Controller performance reports Data rates 198 I/O rates 201 counters 54 CPU Utilization Thresholds 129 customized performance report displaying 103 Customized predefined reports 98 Predefined Fabric manager reports 99 switch performance report 99 Switch Port Errors Report 100 Top Active Volume Cache Hit Performance 99 Top Switch Ports Data Rate 100 Top Switch Ports Packet Rate 100
A
Agents 33 CIMOM Agent 33 Data agent 33 fabric agent 33 Native Application Interface (NAPI) 33 Storage resource agent 33 alert events CPU utilization threshold 283 Overall back-end response time threshold 284 overall port response time threshold 284 alerts 80, 140 data gathering 93 default write-cache delay 204 performance-related 93 application response 60 application workloads 5 array 7 Array Performance report 96, 216 array site 8, 163 Automatic Tiering 180 EZ-Tier 180
B
back-end data rate 214, 314 back-end I/O metrics 56 back-end I/O rate 68, 212 back-end response time 67, 216 back-end response time metrics 58 back-end throughput metrics 57 backup server 23 baseline creation 68 baseline management 4 baselines 175 batch report creation 109 when to run 111 batch report formats 108 Block Server Performance 101 Block Server Performance Subprofile 16 BSP 186 By Volume report 99
C
cache 7 battery failure 237 cache friendly workloads 18 Cache hit percentage 203 cache hit rate 58
Top Volume Disk Performance 99 Top Volumes Data Rate, I/O Rate, Response Performance 99 customized reports 101 charts 102 Generate Report 104 location 102 tabular reports 106 time range 104
F
fabric agent 33, 41 Fabric manager reports Top Switch Ports Data Rate 100 Top Switch Ports Packet Rate 100 Fabric reporting E_Ports 323 FC port speed 264 front-end 57 front-end I/O metrics 57 front-end ports 7 front-end response time metrics 58
D
daily administration tasks 125 Data agent 33, 41 Data Path Explorer 148 Data Path View 300 data retention daily monitoring task 86 hourly monitoring task 86 data spike 55 Database access 34 DB2 34 SQL 34 TPCREPORT 34 database backups 36 database managed space 38 database repository capacity 76 placement 38 sizing formulas 37 database table space 38 Datapath Explorer 294 DB2 database 33, 366 TPCDB database 33 DDM 8, 63, 222, 251 Disk Manager reports performance reports 101 Disk to cache Transfer rate metric 224 Disk Utilization Percentage 208209 high 211 Disk Utilization Percentage Threshold Filtering 83 Disks 171 DMS table space 39 drive modules 8 DS4000 information 172 DS5000 information 170 DS8000 information 162 array site 163 DS8800 180 EZ-Tier 164 Ranks 163 SSD 164 Storage Pools 164 Thin Provisioning 164
G
Globally Unique Identifier 370 Graphical user interface (GUI) 32 GUI versus CLI 40
H
HBA identification 160 HBA WWPN 42 High NVS Full Percentage 58 History Aggregation 84 hot array 207 HTML chart 108 HTML report. 69
I
I/O Groups 11, 97 I/O performance 154 I/O response performance 14 IBM Storwize V7000 xv, 10 IBM System Storage DS8800 Automatic Tiering 180 idle threshold level 80 Interfaces 34 GUI 34 Java Web Start GUI 34 TPCTOOL 34 ITSO environment 48
J
Java Web Start GUI 34 Job History 139
L
large writes guidelines 62 latency 60 Logical 94 Logical reporting levels by device type 94 LUN 57, 172, 251 LUN mapping 15 LUN masking 160
E
embedded CIMOM 44 environmental norms 4 extent pool 10 extents 10
M
Managed Disk 13, 167 Managed Disk Group reports 311312 Managed Disk Group performance 247 MDisk definition 13 MDisk performance 274 messages HWNPM2123I 90 PM HWNPM2113I 92 PM HWNPM2115I 90 PM HWNPM2120I 90 metric 6, 54 metrics versus counters 93 multipath software 5
N
N/A values 101 NAPI supported storage devices IBM System Storage DS8000 43 IBM System Storage SAN Volume Controller (SVC) 43 IBM System Storage Storwize v7000 43 IBM System Storage XIV 43 Native Application Interface (NAPI) 4, 6 Native Storage System Interface (Native API) 14 Enterprise Storage Server Network Interface (ESSNI ) 14 Secure Shell (SSH) interface 14 XML CLI (XCLI) 14 Near Line (NL)-SATA 183 networking subsystem 22 new volume utilization metric 67 Node Cache performance report 228, 239 Node level reports 237 N-Series support 41
O
OLTP performance rates 62 OLTP response time 196 online monitor 93 Overall Back-End Response Time 216 oversubscription of links 14
P
performance 71 baselines 175 batch reports 108 performance analysis guidelines 62 performance collection scheduler 86 performance collection task 77 performance configuration 24 workload isolation 24 Workload resource sharing 25 workload spreading 25 performance considerations random workloads 179 sequential workloads 179
performance counters 332 performance data collection considerations 74 counters 75 retention 75 samples 75 performance data classification Cache hit rate 58 Response time 57 SAN switch 59 throughput 57 performance data collection 70 alerts 80 CIMOM intervals 72 CIMOM sizing 71 intervals 46 job duration 46 job starts 78 job status 89 new 24 hour value 46 sample interval. 72 server restart 92 Service Level Agreement 72 skipping function 78 start 87 stop 88 task considerations 46 performance management 45, 54, 307 applications 60 cache hit rate 58 cached storage subsystems 60 daily analysis 69 data collection 72 metrics 55 OLTP 59 performance data collection job 69 prerequisite tasks 70 problem determination 73 response time ranges 59 top 10 reports top 10 reports 186 performance management concepts 3 performance measurement RAID ranks 63 performance metrics 332 -1 value 332 Bit error rate for DS8000 ports 335 Counters 332 Duplicate frame rate for DS8000 ports 335 Error frame rate for DS8000 ports 334 essential 332 Important Thresholds 335 Invalid CRC rate for SVC, Storwize V7000 and DS8000 ports 334 Invalid relative offset rate for DS8000 ports 335 Invalid transmission word rate for SVC, Storwize V7000, DS8000 and Switch ports 334 Link failure rate for SVC, Storwize V7000 and DS8000 ports 334 Link Recovery (LR) received rate for DS8000 and Switch ports 335
Link Recovery (LR) sent rate for DS8000 and Switch ports 335 Loss-of-signal rate for SVC, Storwize V7000 and DS8000 ports 334 Loss-of-synchronization rate for SVC, Storwize V7000 and DS8000 ports 334 Out of order ACK rate for DS8000 ports 335 Out of order data rate for DS8000 ports 335 Primitive Sequence protocol error rate for SVC, Storwize V7000, DS8000 and Switch ports 334 quickstart 61 Sequence timeout rate for DS8000 ports 335 XIV system metrics 337 Zero buffer-buffer credit timer for SVC and Storwize V7000 ports 334 performance monitor 36, 71, 80, 93, 181 24 hours 74 data retention 86 performance monitoring Fabric environment 323 performance problems determination 73 identification 185 rank skew 177 resource sharing 176 performance reports back-end data rate 214 back-end response time 216 drill up 106 Managed Disk Group 247 N/A values 101 SVC port performance 259 Top Volume Cache 221 Top Volume Disk Performance 224 persistent memory 204 persistent memory constraint 219 Port performance 98 Port Send Receive Response Time 227 ports 7 PPRC 336 predefined performance reports 98 Array performance 96 controller cache performance report 96 controller performance report 96 I/O Group performance 97 Managed disks group 97 Module/Node cache performance 97 Node cache performance 97 port performance 98 Subsystem performance 98 problem determination basics 154 proxy agent 43
R
RAID 63, 179 RAID 5, RAID 6, and RAID 10 Considerations 179 asynchronous writes 179 random write workloads 179 sequential and random reads 179 sequential writes 179
RAID algorithms 251 RAID array utilization 208 RAID level 7 RAID5 algorithms 251 random read IO 217 random workloads 179 rank 9 count key data (CKD) 9 fixed block (FB) data 9 rank busy recommendations 207 rank I/O limit 63, 213 rank level information 163 rank skew 177 Read cache Hit Percentages 206 Read cache hit ratio 222 Read Data rate 239 Read Hit percentages guidelines 62 Redbooks publications Web site 382 Redbooks Web site Contact us xiii reports 365 constraint report 94 Customized Reports 94 Predefined Performance Reports 94 Reports for Fabric and Switches Switches reports 265 Total Port Data Rate 265 response time ranges 225 response time recommendations 196 response times back-end 226 front-end 226 reviewing alerts 140 Rules of Thumb Back-End Read and Write Queue Time 329 Cache Holding Time Threshold 329 CPU Utilization Percentage 328 CPU Utilization Percentage Threshold 328 CRC Error rate 330 Disk Utilization 328 Disk Utilization Threshold 328 Link Failure Rate and Error Frame Rate 330 Non-Preferred Node Usage 330 Overall Port response Time 329 Overall Port response Time Threshold 329 Port Data rate threshold 329 Port to local node Send/receive Queue Time 330 Port to Local Node Send/Receive Response Time 329 Read Cache Hit Percentage 328 Response Time Threshold 328 Write-Cache Delay Percentage 329 Zero Buffer Credit 330
S
SAN Planner 120, 156 SAN Volume Controller Version 6.1 xv SAN zoning 299 SAS disks 183
SATA disks 165, 180 scheduler in TPC 86 server applications 20 server to disk information 157 Service Level Agreement 19, 68, 175 SLA reporting 228 small block reads 62 small block writes guidelines 62 SMI-S Block Server Performance Subprofile 16 SMI-S profile 44 SMI-S standard 94 ports 7 response time 96 SMS table space 38 snapshot Create 120 Delete 120 SNIA 15 SNMP 47 solid state disks (SSD) 60, 180 SQL Example Query XIV Performance Table View 366 PM_Metrics.xls 366 select statement 369 Table views 366 TPCREPORT 366, 368 User Defined Functions (UDFs 368 XIV Total Response Time performance report 369 SSPC considerations 40 Host Bus Adapter (HBA) 40 SSPC appliance 40 volume management 40 storage performance management 196 Storage Pool information 164 Storage Pools 173 Storage Resource Agent (SRA) xv, 41 Common Agent Strategy (CAS) 41 deploying the SRA 42 Tivoli Storage Productivity Center topology table view with the SRA agent 42 Tivoli Storage Productivity Center topology table view without the SRA 42 Storage Server Native API 43 storage subsystem architecture 5 storage subsystem counters 54 Storage Subsystem Performance reports 65 Storage virtualization device 10 storage volume throughput 59 storage workloads 17 Storwize V7000 Case study disk performance 271 cluster 12 Control enclosure 10 Expansion enclosures 10 I/O group 11 I/O Groups in a cluster 233 Managed Disk Group (Storage Pool) 13 MDisk 13 Node 12
SPC Benchmark2 235 Storwize V7000 metric selection 275 Storwize V7000 Nodes 228 Storwize V7000 performance constraint alerts 283 Storwize V7000 performance report - volume selection 273 two node canisters 10 V6.2.0 restrictions 322 Vdisk 12 Verifying host paths to the Storwize V7000 302 Viewing host paths to the Storwize V7000 303 virtual volume 12 virtualization device 13 volume 12 Volume and Managed Disk selection 274 Storwize V7000 Best Practice Recommendations For Performance 183 Storwize V7000 considerations 182 Storwize V7000 nodes 283 Storwize V7000 version 6.2 233 Stress alerts 80 subsystem data considerations 72 subsystem metrics 94 Subsystem Performance Monitor 71 Subsystem Performance report 98 cached storage subsystems 196 data rates 193 front-end response times 197 I/O rate 189 Read I/O rate 191 recommendations 196 Response Times 195 Write I/O rate 191 SVC 43, 61, 64, 9798, 182 and Storwize V7000 performance reports 311 Back-end Read Response time 250 Best Practice Recommendations For Performance 182 HDD MDisks 180 Managed Disk information 167 performance benchmarks 235 Storage Performance Council (SPC) Benchmarks 235 V6.2.0 restrictions 322 SVC / Storwize V7000 concepts 231 SVC and Storwize V7000 Automatic Tiering 180 CPU Utilization Percentage metric 317 Element Manager 316 EZ-Tier 167 Managed Disks 167 MDisk 167 Solid State Disk (SSD) 167 Top Volume Performance reports 253 Virtual Disks 168 Volume to Back-End Volume Assignment 169 SVC and Storwize V7000 reports 311 back-end data rate 314 back-end subsystems 311
Back-end throughput and response time 314 Cache performance 254 cache utilization 239 Clusters 322 CPU Utilization 233 CPU utilization by node 233 CPU utilization percentage 243 Dirty Write percentage of Cache Hits 243 I/O Groups 316 I/O Rate 312 Managed Disk Group 247 Managed Disk Group Performance 311 MDisk performance 274 Node Cache performance 228, 239, 318 Node CPU Utilization rate 233 node CPU Utilization reports 317 node statistics 232 over utilized ports 263 overall IO rate 234 Read Cache Hit percentage 229, 240 Read Cache Hits percentage 244 Read Data rate 239 Read Hit Percentages 229, 243 Readahead percentage of Cache Hits 244 report metrics 232 response time 237 Top Volume Cache performance 253 Top Volume Data Rate performances 253 Top Volume Disk performances 253 Top Volume I/O Rate performances 253 Top Volume Response 257 Top Volume Response performances 253 Total Back-End I/O Rate 312 Total Cache Hit percentage 240 Total Data Rate 239 Write Cache Flush-through percentage 244 Write Cache Hits percentage 244 Write Cache Overflow percentage 244 Write Cache Write-through percentage 244 Write Data Rate 239 Write-cache Delay Percentage 244 SVC cache utilization 246 SVC considerations 181 SVC traffic 181 SVC health 297 SVC performance 181182 Top Volumes Data Rate 254 SVC port information 227 SVC ports 298 SVC Rule of Thumb SVC response 257 SVC version 6.2 233 switch metrics 59 Switch Port Errors report 100 switch ports 297 Switches 265 symmetric multiprocessor 21 System Storage Productivity Center 35 system-wide thresholds 128
T
table space system managed space 38 Terminal Services 22 threshold-based alerts 80 thresholds setting 128 Warning Stress 116 throughput metrics 65 throughput recommendations 224 Tier0 180 time zone 111 Tivoil Storage Productivity Center report batch reports 108 Tivoli 34, 95, 332, 336 Tivoli Storage Productivity Center CLI 40 data retention 37 database backups 36 GUI 40 hardware sizing 35 instances 40 packaging options 30 repository sizing 36 Tivoli Storage Productivity Center Components 31 Agents 33 CIMOM agent 33 Data agent 33 Fabric agent 33 Native Application Interface (NAPI) 33 Storage resource agent 33 Data Server 32 Graphical user interface (GUI) 32 Device Server 32 Interfaces 34 Java Web Start GUI 34 Tivoli Storage Productivity Center GUI 34 user interfaces (UI) 34 Tivoli Integrated Portal (TIP) 32 Tivoli Integrated Portal(TIP) Single sign-on 32 Tivoli Common Reporting (TCR) 32 Tivoli Storage Productivity Center for Replication 33 Tivoli Storage Productivity Center licensing options 30 License Summary 31 Tivoli Storage Productivity Center Basic Edition 30 Tivoli Storage Productivity Center for Data 30 Tivoli Storage Productivity Center for Disk 30 Tivoli Storage Productivity Center Mid-Range Edition 30 Tivoli Storage Productivity Center Standard Edition 30 Tivoli Storage Productivity Center performance management functions 54 performance monitoring 54 performance reports 54 performance threshold/alerts 54 Tivoli Storage Productivity Center Performance Metrics Metrics for PPRC reads 336 Metrics for PPRC writes 336
Tivoli Storage Productivity Center reports Batch reports comma separated values (CSV 108 HTML chart 108 charts 106 Constraint Violations reports 113 tabular report 106 Tivoli Storage Productivity Center SAN Planner 120 Tivoli Storage Productivity Center for Replication 33 Top 10 Disk reports 188 Array Performance reports 207 Controller Cache Performance report 202 Controller Performance reports 197 Port Performance reports 227 Subsystem Performance report 188 Top Volume Performance reports 220 Top Volume Cache performance 221 Top Volume Data Rate Performance 223 Top Volume Disk Performance 224 Top volume I/O rate performance 224 Top Volume response performance 225 Top 10 reports for SVC and Storwize V7000 I/O Group Performance reports 232 Managed Disk Group performance report 247 Node Cache Performance report 239 Top Volume Performance reports 253 SVC performance Top Volume I/O Rate 256 SVC reports Top Volume Disk 256 Top Volume Cache performance 254 Top Volume Data Rate performance 254 Top Volume Response 257 TOP 10 reports for SVC, Storwize V7000 and Disk At a Glance 187 Topology Viewer 147, 156 Data Path Explorer 294 Data Path View 300 navigation 293 SVC health 297 zone configuration 299 Total Cache Hit percentage 229, 240 Total I/O Rate 211 TPC performance metrics collection 332 TPCTOOL 34 CLI as a reporting tool 370 command line interface 370 limitations 370 ls commands 373 lsmetrics command 371 lstime command 373 Multiple components 370 Multiple metrics 370 Report generation 370 Start the TPCTOOL CLI 371 TSM backup 62
virtualization device 5 VMware ESX Server 23 Volume HBA Assignment 157 volume information 157 volume report 174 Volume to Back-End Volume Assignment 169 volumes 8
W
Warning Stress 116 Web server 22 Windows Server 2008 R2 Hyper-V 24 workload isolation 24 workload spreading 178 host connection 178 workloads backup server 23 cache 18 cache friendly 18 database server 21 file server 20 multimedia servers 22 terminal server 22 transaction based 18 web servers 22 Windows hypervisor 20 Write Cache overflow 19 Write-cache Delay Percentage 68, 204, 219 Write-cache Delay percentage 204
X
XIV 173, 366 Disks 174 information 172 Module/Node Cache Performance 231 storage device 19 Storage Pools 173 Volumes 174 XIV system metrics 337 XIV Module Cache Performance Report 228 XIV reports IBM XIV Module Cache Performance Report 228 Read Cache Hit percentage 229 Storage Pools 173 Total Cache Hit percentage 229 volume report 174 RAID level of a volume 174 XIV Disk Details 174 XIV Storage 180 Automatic Tiering 180 GRID technology 180 SATA disks 180 Solid State Disks 180 Tier0 180 XIV Storage System xv
V
Verifying host paths to the Storwize V7000 302 virtual disks 168
Z
zone configuration 299
Back cover
Customize Tivoli Storage Productivity Center environment for performance management Review standard performance reports at Disk and Fabric layers Identify essential metrics and learn Rules of Thumb
IBM Tivoli Storage Productivity Center is an ideal tool for storage management reporting, because it uses industry standards for cross-vendor compliance, and it can provide reports based on views of all application servers, all Fibre Channel fabric devices, and storage subsystems from different vendors, both physical and virtual.

This IBM Redbooks publication is intended for experienced storage managers who want to provide detailed performance reports to satisfy their business requirements. The focus of this book is on using the reports provided by Tivoli Storage Productivity Center for performance management. We also address basic storage architecture in order to set a level playing field for understanding the terminology that we use throughout this book.

Although this book was created to cover storage performance management, asset management and capacity management are just as important in the larger picture of enterprise-wide management. Tivoli Storage Productivity Center is an excellent tool to provide all of these reporting and management requirements.