Kirk McGowan
Technical Director, RAC Pack
Oracle Server Technologies, Cluster and Parallel Storage Development

Agenda
- Operational Best Practices (IT MGMT 101)
- Background
- Requirements
- Why RAC Implementations Fail
- Case Study
- Criticality of IT Service Management (ITIL) processes
- Best Practices: People, Process, AND Technology
Why do people buy RAC?
- Low-cost scalability: cost reduction, consolidation, and infrastructure that can grow with the business
- High availability: growing expectations for uninterrupted service

Why do RAC implementations fail?
- RAC scale-out clustering is new technology, and insufficient budget and effort are put toward filling the knowledge gap
- HA is difficult to do, and cannot be done with technology alone
- Operational processes and discipline are critical success factors, but are not addressed sufficiently

Case Study
- Based on true stories. Any resemblance, in full or in part, to your own experiences is intentional and expected. Names have been changed to protect the innocent.

Case Study: Background
- 8-12 months spent implementing 2 systems: somewhat different architectures, very different workloads, identical tech stacks
- Oracle expertise (Development) engaged to help flatten the technology learning curve
- Non-mission-critical systems, but important elements of a larger enterprise re-architecture effort
- Many technology issues encountered across the stack and resolved over the 8-12 month implementation: HW, OS, storage, network, RDBMS, cluster, and application

Case Study: Situation
- New mission-critical deployment using the same technology stack
- Distinct architecture, application development teams, and operations teams
- Large staff turnover
- Major escalation, post-production. CIO: "Oracle products do not meet our business requirements. RAC is unstable. DG doesn't handle the workload. JDBC connections don't fail over."
Case Study: Operational Issues
- Requirements, aka SLOs, were not defined. E.g., a claimed 20s failover time, while the application logic included an 80s failover wait and cluster failure detection time alone was set to 120s (see the sketch after this list).
- Inadequate test environments: problems were encountered first in production, including the fact that SLOs could not be met.
- Inadequate change control:
  - Lessons learned in previous deployments were not applied to the new deployment: rediscovery of the same problems
  - Some changes implemented in test were never rolled into production: recurring problems (outages) in production
  - No process for confirming that a change actually fixes the problem prior to implementing it in production
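A quick arithmetic check makes the mismatch concrete. This is an illustrative sketch only, using a simple additive model of the failover path and the values from the case above:

```python
# Hypothetical failover-time budget for the case above. A client cannot
# see failover complete faster than the sum of the delays along the path.
slo_failover_s = 20        # claimed SLO: 20 second failover
cluster_detection_s = 120  # cluster failure detection time, as configured
app_failover_wait_s = 80   # failover wait built into the application logic

worst_case_s = cluster_detection_s + app_failover_wait_s
print(f"worst case: {worst_case_s}s against an SLO of {slo_failover_s}s")
# worst case: 200s against an SLO of 20s -- the SLO was never achievable
```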
Case Study: More Operational Issues
- Poor knowledge transfer between internal teams: configuration recommendations, patches, and fixes identified in previous deployments were not communicated. Evictions are a symptom, not the problem.
- Inadequate system monitoring: OS-level statistics (CPU, I/O, memory) were not being captured. Root cause analysis was impossible on many problems without the ability to correlate cluster/database symptoms with system-level activity (see the sketch below).
- Inadequate support procedures:
  - Inconsistent data capture
  - No on-site vendor support consistent with the criticality of the system
  - No operations manual for managing and responding to outages, and for restoring service after outages
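For flavor, a minimal sketch of the kind of periodic OS-level capture that was missing, assuming the third-party psutil package (OS-native collectors such as sar, vmstat, or OSWatcher serve the same purpose):

```python
import time
import psutil  # third-party package; OS-native tools like sar/vmstat also work

def capture_os_stats(interval_s: int = 10) -> None:
    """Periodically record CPU, memory, and disk I/O so that cluster and
    database symptoms can later be correlated with system-level activity."""
    while True:
        cpu = psutil.cpu_percent(interval=None)
        mem = psutil.virtual_memory().percent
        io = psutil.disk_io_counters()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} cpu={cpu}% mem={mem}% "
              f"read={io.read_bytes} written={io.write_bytes}")
        time.sleep(interval_s)

if __name__ == "__main__":
    capture_os_stats()
```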
Overview of Operational Process Requirements

What are ITIL Guidelines?
ITIL (the IT Infrastructure Library) is the most widely accepted approach to IT service management in the world. ITIL provides a comprehensive and consistent set of best practices for IT service management, promoting a quality approach to achieving business effectiveness and efficiency in the use of information systems.

IT Service Management
- IT Service Management = Service Delivery + Service Support
- Service Delivery is partly concerned with setting up agreements and monitoring the targets within those agreements.
- Service Support processes can be viewed as delivering the services laid down in those agreements.

Provisioning of IT Service Management
- In all organizations, IT service management must be matched to current and rapidly changing business demands. The objective is to continually improve the quality of service, aligned to the business requirements, cost-effectively.
- To meet this objective, three areas need to be considered:
  - People with the right skills, appropriate training, and the right service culture
  - Effective and efficient Service Management processes
  - Good IT infrastructure in terms of tools and technology
- Unless people, processes, and technology are considered and implemented appropriately within a steering framework, the objectives of Service Management will not be realized.

Service Delivery
- Financial Management
- Service Level Management: severity/priority definitions (e.g. Sev1, Sev2, Sev3, Sev4), response time guidelines, SLAs
- Capacity Management
- IT Service Continuity Management
- Availability Management
Service Support
- Incident Management: incident documentation & reporting, incident handling, escalation procedures
- Problem Management: RCAs, QA & process improvement
- Configuration Management: standard configs, gold images, CEMLIs
- Change Management: risk assessment, backout, software maintenance, decommission
- Release Management: new deployments, upgrades, emergency releases, component releases

BP: Set & Manage Expectations
- Why is this important? Expectations with RAC are different at the outset.
- HA is as much (if not more) about the processes and procedures as it is about the technology. No matter what technology stack you implement, on its own it is incapable of meeting stringent SLAs.
- Must communicate what the technology can AND can't do.
- Must be clear on what else needs to be in place to supplement the technology if HA business requirements are going to be met. HA isn't cheap!

BP: Clearly Define SLOs
- Sufficiently granular: you cannot architect, design, OR manage a system without clearly understanding the SLOs. "24x7" is NOT an SLO.
- Define HA/recovery time objectives, throughput, response time, data loss, etc. These need to be established with an understanding of the cost of downtime for the system. RTO and RPO are key availability metrics; response time and throughput are key performance metrics.
- Must address different failure conditions: planned vs. unplanned, localized vs. site-wide.
- Must be linked to the business requirements: response time and resolution time.
- Must be realistic.

Manage to the SLOs
- Definitions of problem severity levels
- Documented targets for both incident response time and resolution time, based on severity
- Classification of applications with respect to business criticality
- Establish an SLA with the business:
  - Negotiated response and resolution times
  - Definition of metrics. E.g.: "Application Availability shall be measured using the following formula: Total Minutes in a Calendar Month, minus Unscheduled Outage Minutes, minus Scheduled Outage Minutes in such month, divided by Total Minutes in a Calendar Month" (computed in the sketch below).
  - Negotiated SLOs effectively document expectations between IT and the business
  - Incident log: date, time, description, duration, resolution
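As a worked example, a minimal sketch computing the availability metric exactly as defined above (the sample outage minutes are invented for illustration):

```python
def availability_pct(total_min: int, unscheduled_min: int, scheduled_min: int) -> float:
    """Application Availability = (Total - Unscheduled - Scheduled) / Total."""
    return 100.0 * (total_min - unscheduled_min - scheduled_min) / total_min

# A 30-day month has 30 * 24 * 60 = 43,200 minutes. With 40 minutes of
# unscheduled and 120 minutes of scheduled outage:
print(f"{availability_pct(43_200, 40, 120):.3f}%")  # 99.630%
```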
Example Resolution Time Matrix

  Severity 1, Priority 1 and 2 SRs    < 1 hour
  Severity 1, Priority 3 SRs          < 13 hours
  Severity 2, Priority 1 SRs          < 14 hours
  Severity 2 SRs                      < 132 hours
Example Response Time Matrix

  Status      Sev1/P1   Sev1/P2   Sev2/P1   Sev2      Sev3/Sev4
  New, XFR    15        30        15        30        60
  ASG         15        60        15        30        60
  IRR, 2CB    15        30        15        60        120
  RVW, 1CB    15        60        15        60        120
  PCR, RDV    60        N/A       60        120       3 hrs
  WIP         60        60        60        18 hrs    4 days
  INT         60        2         60        120 min   3 hrs
  LMS, CUS    4         4         4         2 days    4 days
  DEV         4         4         4         3 days    10 days

BP: TEST, TEST, TEST
- Testing is a shared responsibility: functional, destructive, and stress testing.
- Test environments must be representative of production, both in terms of configuration and capacity, and separate from production.
- Building a test harness to mimic the production workload is a necessary, but non-trivial, effort.
- Ideally, problems would never be encountered first in production. If they are, the first question should be: why didn't we catch the problem in test? Did we exceed some threshold? Was it a unique timing or race condition? What can we do to catch this type of problem in the future? Build a test case that can be reused as part of pre-production testing.

BP: Define, Document, and Adhere to Change Control Processes
- This amounts to self-discipline.
- Applies to all changes at all levels of the tech stack: HW changes, configuration changes, patches and patchsets, upgrades, and even significant changes in workload.
- If no changes are introduced, a system will reach a steady state and function indefinitely. A well-designed system will be able to tolerate some fluctuations and faults. A well-managed system will meet service levels.
- If a problem that was fixed is encountered again elsewhere, it is a change management process problem, not a technology problem. I.e., rediscovery should not happen.
- Ensure fixes are applied across all nodes in a cluster, and across all environments to which the fix applies.

BP: Plan for, and Execute, Knowledge Transfer
- New technology has a learning curve. 10g, RAC, and ASM cross traditional job boundaries, so knowledge transfer must be executed across all affected groups: architecture, development, and operations; network admin, sysadmin, storage admin, DBA.
- Learn how to identify and diagnose problems. E.g., evictions are not a problem, they are a symptom.
- Learn how to use the various tools and interpret their output: hanganalyze, system state dumps, truss, etc. Understand behaviour: the distinction between cause and symptom.
- This needs to occur pre-production: operational readiness.

BP: Monitor Your System
- Define key metrics and monitor them actively. Establish a (performance) baseline.
- Learn how to use the Oracle-provided tools: RDA (+ RACDDT), AWR/ADDM, Active Session History, OSWatcher.
- Coordinate monitoring and collection of OS-level stats as well as DB-level stats. Problems observed at one layer are often just symptoms of problems that exist at a different layer. Don't jump to conclusions.

BP: Define, Document, and Communicate Support Procedures
- Define corrective procedures for outages, and routinely test them.
- HA process: prevent, detect, capture, resume, analyze, fix. Classify high-priority systems, and the steps that need to be taken in each phase.
- Keep an active log of every outage (a sketch follows this list).
- If we don't provide sufficient tools to get to root cause, then shame on us. If you don't implement the diagnostic capabilities that are provided to help get to root cause, then shame on you.
- Serious outages should never happen more than once.
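As one concrete shape for that outage log, a minimal sketch mirroring the incident-log fields named earlier (date/time, description, duration, resolution). The record layout and the sample entry are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class OutageRecord:
    # Fields mirror the incident log: date/time, description, duration, resolution
    started_at: datetime
    description: str
    duration_min: int
    resolution: Optional[str] = None  # filled in once the RCA is complete

outage_log: List[OutageRecord] = []
outage_log.append(OutageRecord(
    started_at=datetime(2006, 3, 14, 2, 30),  # invented sample entry
    description="Node 2 evicted after interconnect NIC failure",
    duration_min=45,
    resolution="Replaced faulty switch port; verified the fix in test first",
))
```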
Summary
- Deficiencies in operational processes and procedures are the root cause of the vast majority of escalations.
- Address these, and you dramatically increase your chances of a successful RAC deployment, and will save yourself a lot of future pain.
- Additional areas of challenge:
  - Configuration Management: initial install and config, standardized gold-image deployment
  - Incident Management: diagnosing cluster-related problems