Kirk McGowan
Technical Director, RAC Pack
Oracle Server Technologies, Cluster and Parallel Storage Development

Agenda
- Operational Best Practices (IT MGMT 101)
- Background
- Requirements
- Why RAC Implementations Fail
- Case Study
- Criticality of IT Service Management (ITIL) processes
- Best Practices: People, Process, AND Technology
Why do people buy RAC?
- Low-cost scalability: cost reduction, consolidation, and infrastructure that can grow with the business
- High availability: growing expectations for uninterrupted service

Why do RAC implementations fail?
- RAC scale-out clustering is new technology, and insufficient budget and effort are put toward filling the knowledge gap
- HA is difficult to do, and cannot be done with technology alone
- Operational processes and discipline are critical success factors, but are not addressed sufficiently

Case Study
- Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.

Case Study: Background
- 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
- Oracle expertise (Development) engaged to help flatten the technology learning curve
- Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
- Many technology issues encountered across the stack and resolved over the 8-12 month implementation: HW, OS, storage, network, RDBMS, cluster, and application

Case Study: Situation
- New mission-critical deployment using the same technology stack
- Distinct architecture, application development teams, and operations teams
- Large staff turnover
- Major escalation, post-production. CIO: "Oracle products do not meet our business requirements. RAC is unstable. DG doesn't handle the workload. JDBC connections don't fail over."
Case Study: Operational Issues
- Requirements, aka SLOs, were not defined. E.g., a claimed 20s failover time, while the application logic included an 80s failover wait and cluster failure detection time alone was set to 120s (see the sketch after this list).
- Inadequate test environments: problems were encountered first in production, including the fact that SLOs could not be met.
- Inadequate change control:
  - Lessons learned in previous deployments were not applied to the new deployment: rediscovery of the same problems
  - Some changes implemented in test were never rolled into production: recurring problems (outages) in production
  - No process for confirming that a change actually fixes the problem prior to implementing it in production
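A quick arithmetic check makes the mismatch concrete. This is an illustrative sketch only, using a simple additive model of the failover path and the values from the case above:

```python
# Hypothetical failover-time budget for the case above. A client cannot
# see failover complete faster than the sum of the delays along the path.
slo_failover_s = 20        # claimed SLO: 20 second failover
cluster_detection_s = 120  # cluster failure detection time, as configured
app_failover_wait_s = 80   # failover wait built into the application logic

worst_case_s = cluster_detection_s + app_failover_wait_s
print(f"worst case: {worst_case_s}s against an SLO of {slo_failover_s}s")
# worst case: 200s against an SLO of 20s -- the SLO was never achievable
```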
Case Study: More Operational Issues
- Poor knowledge transfer between internal teams: configuration recommendations, patches, and fixes identified in previous deployments were not communicated. Evictions are a symptom, not the problem.
- Inadequate system monitoring: OS-level statistics (CPU, I/O, memory) were not being captured. Root cause analysis was impossible on many problems without the ability to correlate cluster/database symptoms with system-level activity (see the sketch below).
- Inadequate support procedures:
  - Inconsistent data capture
  - No on-site vendor support consistent with the criticality of the system
  - No operations manual for managing and responding to outages, and for restoring service after outages
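For flavor, a minimal sketch of the kind of periodic OS-level capture that was missing, assuming the third-party psutil package (OS-native collectors such as sar, vmstat, or OSWatcher serve the same purpose):

```python
import time
import psutil  # third-party package; OS-native tools like sar/vmstat also work

def capture_os_stats(interval_s: int = 10) -> None:
    """Periodically record CPU, memory, and disk I/O so that cluster and
    database symptoms can later be correlated with system-level activity."""
    while True:
        cpu = psutil.cpu_percent(interval=None)
        mem = psutil.virtual_memory().percent
        io = psutil.disk_io_counters()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} cpu={cpu}% mem={mem}% "
              f"read={io.read_bytes} written={io.write_bytes}")
        time.sleep(interval_s)

if __name__ == "__main__":
    capture_os_stats()
```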
Overview of Operational Process Requirements

What are ITIL Guidelines?
ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.

IT Service Management
- IT Service Management = Service Delivery + Service Support
- Service Delivery is partly concerned with setting up agreements and monitoring the targets within those agreements.
- Service Support processes can be viewed as delivering the services laid down in those agreements.

Provisioning of IT Service Management
- In all organizations, IT service management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively.
- To meet this objective, three areas need to be considered:
  - People with the right skills, appropriate training, and the right service culture
  - Effective and efficient Service Management processes
  - Good IT infrastructure in terms of tools and technology
- Unless people, processes, and technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.

Service Delivery
- Financial Management
- Service Level Management: severity/priority definitions (e.g. Sev1, Sev2, Sev3, Sev4), response time guidelines, SLAs
- Capacity Management
- IT Service Continuity Management
- Availability Management
Service Support
- Incident Management: incident documentation & reporting, incident handling, escalation procedures
- Problem Management: RCAs, QA & process improvement
- Configuration Management: standard configs, gold images, CEMLIs
- Change Management: risk assessment, backout, software maintenance, decommission
- Release Management: new deployments, upgrades, emergency releases, component releases

BP: Set & Manage Expectations
- Why is this important? Expectations with RAC are different at the outset.
- HA is as much (if not more) about the processes and procedures as it is about the technology. No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs.
- Must communicate what the technology can AND can't do.
- Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met. HA isn't cheap!

BP: Clearly Define SLOs
- Sufficiently granular: you cannot architect, design, OR manage a system without clearly understanding the SLOs. "24x7" is NOT an SLO.
- Define HA/recovery time objectives, throughput, response time, data loss, etc. These need to be established with an understanding of the cost of downtime for the system. RTO and RPO are key availability metrics; response time and throughput are key performance metrics.
- Must address different failure conditions: planned vs. unplanned, localized vs. site-wide.
- Must be linked to the business requirements: response time and resolution time.
- Must be realistic.

Manage to the SLOs
- Definitions of problem severity levels
- Documented targets for both incident response time and resolution time, based on severity
- Classification of applications with respect to business criticality
- Establish an SLA with the business:
  - Negotiated response and resolution times
  - Definition of metrics. E.g.: "Application Availability shall be measured using the following formula: Total Minutes in a Calendar Month, minus Unscheduled Outage Minutes, minus Scheduled Outage Minutes in such month, divided by Total Minutes in a Calendar Month" (computed in the sketch below).
  - Negotiated SLOs effectively document expectations between IT and the business
  - Incident log: date, time, description, duration, resolution
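As a worked example, a minimal sketch computing the availability metric exactly as defined above (the sample outage minutes are invented for illustration):

```python
def availability_pct(total_min: int, unscheduled_min: int, scheduled_min: int) -> float:
    """Application Availability = (Total - Unscheduled - Scheduled) / Total."""
    return 100.0 * (total_min - unscheduled_min - scheduled_min) / total_min

# A 30-day month has 30 * 24 * 60 = 43,200 minutes. With 40 minutes of
# unscheduled and 120 minutes of scheduled outage:
print(f"{availability_pct(43_200, 40, 120):.3f}%")  # 99.630%
```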
Example Resolution Time Matrix

  Severity 1, Priority 1 and 2 SRs    < 1 hour
  Severity 1, Priority 3 SRs          < 13 hours
  Severity 2, Priority 1 SRs          < 14 hours
  Severity 2 SRs                      < 132 hours
Example Response Time Matrix

  Status      Sev1/P1   Sev1/P2   Sev2/P1   Sev2      Sev3/Sev4
  New, XFR    15        30        15        30        60
  ASG         15        60        15        30        60
  IRR, 2CB    15        30        15        60        120
  RVW, 1CB    15        60        15        60        120
  PCR, RDV    60        N/A       60        120       3 hrs
  WIP         60        60        60        18 hrs    4 days
  INT         60        2         60        120 min   3 hrs
  LMS, CUS    4         4         4         2 days    4 days
  DEV         4         4         4         3 days    10 days

BP: TEST, TEST, TEST
- Testing is a shared responsibility: functional, destructive, and stress testing.
- Test environments must be representative of production, both in terms of configuration and capacity, and separate from production.
- Building a test harness to mimic the production workload is a necessary, but non-trivial, effort.
- Ideally, problems would never be encountered first in production. If they are, the first question should be: why didn't we catch the problem in test? Did we exceed some threshold? Was it a unique timing or race condition? What can we do to catch this type of problem in the future? Build a test case that can be reused as part of pre-production testing.

BP: Define, Document, and Adhere to Change Control Processes
- This amounts to self-discipline.
- Applies to all changes at all levels of the tech stack: HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload.
- If no changes are introduced, a system will reach a steady state and function indefinitely. A well-designed system will be able to tolerate some fluctuations and faults. A well-managed system will meet service levels.
- If a problem that was fixed is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e., rediscovery should not happen.
- Ensure fixes are applied across all nodes in a cluster, and across all environments to which the fix applies.

BP: Plan for, and Execute, Knowledge Transfer
- New technology has a learning curve. 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups: architecture, development, and operations; network admin, sysadmin, storage admin, DBA.
- Learn how to identify and diagnose problems. E.g., evictions are not a problem, they are a symptom.
- Learn how to use the various tools and interpret their output: hanganalyze, system state dumps, truss, etc. Understand behaviour: the distinction between cause and symptom.
- This needs to occur pre-production: operational readiness.

BP: Monitor Your System
- Define key metrics and monitor them actively. Establish a (performance) baseline.
- Learn how to use the Oracle-provided tools: RDA (+ RACDDT), AWR/ADDM, Active Session History, OSWatcher.
- Coordinate monitoring and collection of OS-level stats as well as DB-level stats. Problems observed at one layer are often just symptoms of problems that exist at a different layer. Don't jump to conclusions.

BP: Define, Document, and Communicate Support Procedures
- Define corrective procedures for outages, and routinely test them.
- HA process: prevent, detect, capture, resume, analyze, fix. Classify high-priority systems, and the steps that need to be taken in each phase.
- Keep an active log of every outage (a sketch follows this list).
- If we don't provide sufficient tools to get to root cause, then shame on us. If you don't implement the diagnostic capabilities that are provided to help get to root cause, then shame on you.
- Serious outages should never happen more than once.
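As one concrete shape for that outage log, a minimal sketch mirroring the incident-log fields named earlier (date/time, description, duration, resolution). The record layout and the sample entry are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class OutageRecord:
    # Fields mirror the incident log: date/time, description, duration, resolution
    started_at: datetime
    description: str
    duration_min: int
    resolution: Optional[str] = None  # filled in once the RCA is complete

outage_log: List[OutageRecord] = []
outage_log.append(OutageRecord(
    started_at=datetime(2006, 3, 14, 2, 30),  # invented sample entry
    description="Node 2 evicted after interconnect NIC failure",
    duration_min=45,
    resolution="Replaced faulty switch port; verified the fix in test first",
))
```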
Summary
- Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations.
- Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain.
- Additional areas of challenge:
  - Configuration Management: initial install and config, standardized gold-image deployment
  - Incident Management: diagnosing cluster-related problems