
Government of Ontario IT Standard (GO-ITS)
GO-ITS Number 37
Enterprise Incident Management Process
Version 2.0 Status: Approved
Prepared for the Information Technology Standards Council (ITSC) under the delegated authority of the Management Board of Cabinet
Queen's Printer for Ontario, 2010
Last Review Date: 2010-04-01
Sensitivity: Unclassified
Approved
Version #: 2.0
Copyright & Disclaimer

Government of Ontario reserves the right to make changes in the information contained in this publication without prior notice. The reader should in all cases consult the Document History to determine whether any such changes have been made. © 2010 Government of Ontario. All rights reserved. Other product or brand names are trademarks or registered trademarks of their respective holders. This document contains proprietary information of Government of Ontario; disclosure or reproduction is prohibited without the prior express written permission of Government of Ontario.
Template Info
Template Name: GO-ITS Template
Template #: 09.03.25
Template Version No.: 1.0
Template Author: Design: PMCoE; Boilerplate: TAB/OCCTO
Template Completion Date: 2009-03-26
Document History (including ITSC and ARB approval dates)

Date | Summary
2009-06-17 | Version 1.7: presented to ITSC
2009-07-16 | Version 1.8: reflects feedback from Stakeholders, received up to and including 2009-07-16
2009-08-14 | Version 1.9: reflects additional roles and new principle regarding security related incidents
2009-09-09 | Version 1.94: reflects feedback since August 19 and injection of Urgency / Impact definitions (Section 6.4)
2010-02-02 | Version 1.95: accepts all changes in version 1.94 and incorporates results of discussions held in Dec 2009 and Jan 2010 with ITSM Leads and ITS / OCCTO OEIP
2010-02-08 | Version 1.95: updated subsequent to meeting with Head, Corporate Architecture Branch, OCCTO, post ITSML discussion of 2010-02-04. Suggestions received at ITSML embedded.
2010-02-10 | Version 1.95: updated to modify references to Post-Mortem terminology (changed to Major Incident Review) per discussion / feedback from ITSML
2010-03-03 | Version 1.97: inserted effective date for this revised version as July 1, 2010. Hyperlink inserted in Appendix for MIP Normative reference.
2010-03-17 | Endorsed: IT Standards Council endorsement
2010-03-19 | Version 2.0 Final Draft post-ITSC endorsement of 2010-03-17
2010-04-01 | Section 4.2.10 and Principle 9: removed specific reference to Service Management Branches and replaced with generic wording "appropriate branches". Section 4.3.1 Process Flow: added box in diagram to reflect User Reporting Incident. Section 6.2.1: added clarification statement to describe illustrative characteristic of diagram.
Approved: Architecture Review Board approval
Table of Contents
1. FOREWORD
2. INTRODUCTION
  2.1. Background
  2.2. Purpose
  2.3. Value to the Business
  2.4. Basic Concepts
  2.5. Scope
    2.5.1. In Scope
  2.6. Applicability Statements
    2.6.1. Organization
    2.6.2. Requirements Levels
    2.6.3. Compliance Requirements
3. STANDARDS LIFECYCLE MANAGEMENT
  3.1. Contact Information
    3.1.1. Roles and Responsibilities
  3.2. Recommended Versioning and/or Change Management
  3.3. Publication Details
4. TECHNICAL SPECIFICATION
  4.1. Process Principles
  4.2. Process Roles and Responsibilities
    4.2.1. Enterprise Incident Management Process Owner
    4.2.2. Incident Manager (IM)
    4.2.3. Situation Manager (SM)
    4.2.4. Queue Manager (QM)
    4.2.5. Service Desk Manager (SDM)
    4.2.6. Service Desk Team Lead
    4.2.7. Service Desk Agent (SDA)
    4.2.8. Incident Analyst (IA)
    4.2.9. Service Owner
    4.2.10. Major Incident Manager (MIM)
    4.2.11. Partner Incident Management Liaison
  4.3. Process Flows
    4.3.1. Incident Management Process Overview
    4.3.2. Incident Management Process Tasks
  4.4. Linkages to other processes
  4.5. Incident Management Process Quality Control
  4.6. Metrics
  4.7. Standard Process Parameters
5. RELATED STANDARDS
  5.1. Impacts to Existing Standards
  5.2. Impacts to Existing Environment
6. APPENDICES
  6.1. Normative References
    6.1.1. Major Incident Protocol
  6.2. Informative References
    6.2.1. Enterprise Differentiation: Process, Procedure, Work Instruction
    6.2.2. Definitions: Urgency and Impact
7. GLOSSARY
1. Foreword
Government of Ontario Information Technology Standards (GO-ITS) are the official publications on the guidelines, preferred practices, standards and technical reports adopted by the Information Technology Standards Council (ITSC) under delegated authority of the Management Board of Cabinet (MBC). These publications support the responsibilities of the Ministry of Government Services (MGS) for coordinating standardization of Information & Information Technology (I&IT) in the Government of Ontario. Publications that set new or revised standards provide enterprise architecture guidance, policy guidance and administrative information for their implementation. In particular, GO-ITS describe where the application of a standard is mandatory and specify any qualifications governing the implementation of standards.
2. Introduction
2.1. Background
The requirement for an all-encompassing OPS Incident Management standard arose from the positioning of all infrastructure service and support within Infrastructure Technology Services (ITS), a new organization within the OPS mandated in 2005 to deliver these types of services to the OPS. The ITS organization was created in 2006 to achieve this goal. Achieving this goal required an update of the requirements for the GO-ITS Standard for Incident Management based on the situation described above. The result was an updated version of GO-ITS # 37 created and approved in July of 2007.
During February 2009, a series of outages to Ontario.ca infrastructure prompted I & IT Executive Management to conduct a review of both Incident and Change Management processes and procedures. The review identified deficiencies in a number of areas, including procedures, operational process management and behaviour. The review made specific recommendations to address the deficiencies, and these recommendations have subsequently been sanctioned by ITELC. Accordingly, the OPS Enterprise IT Service Management Program (OEIP) has updated the Enterprise Incident Management Process Standard to incorporate the recommendations.
This document redefines certain aspects of the enterprise Incident Management Principles, Roles and the associated process model. Updates to GO-ITS # 37 include:
- Principles, Roles, Responsibilities and the high-level process flow required to ensure an enterprise perspective of Incident Management for the OPS
- Definition of a Major Incident Protocol at the process standard level
- Incorporation of ITIL [1] V3 (2007) concepts, introduction of a service-based focus for enterprise incident management disciplines and the natural evolution of IT Service Management within the OPS
These standard elements continue to provide a single unified process for enterprise Incident Management within the OPS. Use of a single process and supporting information will enable OPS-wide management and reporting for the enterprise Incident Management process through establishment of associated metrics.
GO-ITS 44 ITSM Terminology Reference Model Portable Guide provides a common information model for key process parameters that require standardization across the OPS to ensure consistency, reliable business intelligence and to support end-to-end cross-jurisdictional service management. GO-ITS 44 will be updated with additional values defined as part of GO-ITS 37. Please refer to: http://www.gov.on.ca/MGS/en/IAndIT/STEL02_047295.html
2.2. Purpose
The goals of the enterprise Incident Management process are to restore normal service operation as quickly as possible, minimize the adverse impact on business operations and ensure that the best possible levels of service quality and availability are maintained. This process standard describes best practices to be utilized for Incident Management. The process design is organizationally agnostic and is not constrained by the status quo. Implementation of the process may require organizational or behavioural transformation.
[1] ITIL and IT Infrastructure Library are registered trademarks of the Office of Government Commerce (OGC), U.K.
2.3. Value to the Business
The value of Incident Management includes:
- The ability to detect and resolve incidents, which results in lower downtime to the business, which in turn means higher availability of the service. This means that the business is able to exploit the functionality of the service as designed.
- The ability to align IT activity to real-time business priorities. This is because Incident Management includes the capability to identify business priorities and dynamically allocate resources as necessary.
- The ability to identify potential improvements to services. This happens as a result of understanding what constitutes an incident and also from being in contact with the activities of business operational staff.
- The Service Desk can, during its handling of incidents, identify additional service or training requirements found in IT or the business.
2.4. Basic Concepts
ITIL defines an incident as: An unplanned interruption to an IT service or reduction in the quality of an IT service. Failure of a service component or element item that has not yet impacted service is also considered an incident (e.g. failure of one disk from a mirrored set).
Incident Management is the process for dealing with all incidents. This can include:
- failures, questions or queries reported by the users (usually via a telephone call to the Service Desk)
- anomalies detected by technical staff
- automatically detected errors or conditions reported by event monitoring tools
The Service Desk Agent (SDA) captures the pertinent information and logs, classifies and prioritizes the incident. The priority of an Incident is primarily determined by the impact on the business and the urgency with which a resolution or work-around is needed (as defined in Appendix 6.4). Objective targets for resolving Incidents are defined in Service Level Agreements (SLAs). Major Incidents, which typically have the highest impact and demand quicker resolution, follow the same process as any other Incident, but are managed by a separate procedure.
The Service Desk takes advantage of diagnostic scripts to capture and verify information that is required to quickly resolve the event. In the case where the Service Desk cannot achieve resolution, this information helps in ensuring the Incident is assigned to the appropriate Tier 2 group for action. The Service Desk Agent often references Incident Patterns, the Known Error database and any available Knowledge Management records to obtain any information that will assist in attempting to resolve the Incident at first point of contact (FPOC). If the Incident cannot be resolved at first point of contact, the Service Desk Agent assigns the incident to a group with more specialized skills (this is known as Functional Escalation).
Tier 1-N Thresholds
Each support tier is allocated a certain amount of time to resolve the incident, following which the Incident must be functionally escalated to a more specialized group. The amount of time allocated to each tier is set so that service restoration occurs within the agreed targets, as defined in the SLA/SLO. These allocations may be adjusted from time to time based upon staffing models, experience in supporting the various services and ongoing changes to service specifications and components.
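For illustration only, the tier-threshold rule described above can be sketched as follows. The 240-minute target, the three tiers and the minutes allocated to each are hypothetical values chosen for the example; this Standard does not prescribe them, and the names TierAllocation, current_tier and fits_target are assumptions rather than part of any OPS tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierAllocation:
    """Time budget for one support tier (hypothetical values, not from this Standard)."""
    tier: int
    minutes: int


def current_tier(allocations: list[TierAllocation], elapsed_minutes: int) -> int:
    """Return the tier that should own the incident after elapsed_minutes.

    Each tier holds the incident for its allocated time; once that time is
    used up, the incident is functionally escalated to the next tier.
    The last tier keeps the incident after all earlier budgets are spent.
    """
    remaining = elapsed_minutes
    for allocation in allocations[:-1]:
        if remaining < allocation.minutes:
            return allocation.tier
        remaining -= allocation.minutes
    return allocations[-1].tier


def fits_target(allocations: list[TierAllocation], target_minutes: int) -> bool:
    """Check that the tier budgets together stay within the SLA/SLO target."""
    return sum(a.minutes for a in allocations) <= target_minutes


# Assumed example: a 240-minute restoration target split across three tiers.
ALLOCATIONS = [TierAllocation(1, 30), TierAllocation(2, 90), TierAllocation(3, 120)]
```

Keeping the allocations as data rather than fixed logic reflects the statement that they may be adjusted from time to time; fits_target gives a simple check that an adjusted set of allocations still adds up to no more than the agreed target.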
Queues, Support Groups and Functional Escalation
The Incident Management system supports the practice of Queues and Queue Management: each queue represents a view of all Incidents assigned to an organization at all levels of priority. This provides a Queue Manager with an overall perspective of how the Incident Management process is being executed across all support groups within an organization at any given time. Should a certain part of the organization be experiencing a backlog of incidents in its queues, the Queue Manager may ask the Service Desk Manager to perform Hierarchical Escalation, that is, to notify more senior management of the situation in an effort to relieve the pressure on any specific queue. This basic concept applies to the design of the Incident Management process within the Ontario Public Service; however, organizational maturity currently prevents the industry best practice from being strictly followed. It is important to note this concept as it describes the desired organizational behaviour or future-state model.
Various support groups have also been established in each OPS organization based upon areas of functional expertise. An Incident can be assigned to any one of these support groups, where it is then assigned to an individual member of that group to undertake incident diagnosis and resolution. All of these support groups must roll up into an organizational queue view, so that the overall perspective is available to the Queue Manager.
A Service Desk Agent who cannot resolve an Incident at FPOC assigns it to the appropriate Tier 2 Support Group, based upon the initial diagnosis (this is called Functional Escalation). Once the Service Desk Agent has assigned the incident to a Tier 2 Incident Analyst, one of three things typically occurs:
- Resolution: the Incident Analyst restores service and informs the Service Desk.
- Re-assignment: the Incident Analyst concludes that the cause of the incident does not lie in their area of expertise and assigns the incident back to the Service Desk for re-assignment to a more appropriate group.
- Functional escalation: the Incident Analyst cannot resolve the incident within the defined threshold and requests that the Incident be assigned to a Tier 3 support group with more specialized skills.
A Queue Manager role may also be established for an individual Support Group to monitor their respective queues at regular intervals to identify any incidents that have not been assigned to individuals or have not been resolved within defined thresholds and to take proactive action before being prompted by the overall Queue Manager.
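As a minimal sketch of the queue monitoring described above, the following shows how a Queue Manager's check for unassigned or overdue incidents, and the roll-up of support groups into an organizational view, might look. The Incident fields, the function names and the sample data are illustrative assumptions and do not describe the OPS enabling technology.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    """Minimal incident record; field names are illustrative assumptions."""
    incident_id: str
    support_group: str
    assignee: Optional[str]   # None until assigned to an individual
    minutes_open: int
    threshold_minutes: int    # time allowed before escalation is due
    resolved: bool = False


def needs_attention(queue: list[Incident]) -> list[Incident]:
    """Return open incidents that are unassigned or past their threshold."""
    return [
        incident
        for incident in queue
        if not incident.resolved
        and (incident.assignee is None
             or incident.minutes_open > incident.threshold_minutes)
    ]


def organizational_view(queue: list[Incident]) -> dict[str, int]:
    """Roll open incidents up by support group for the overall Queue Manager."""
    counts: dict[str, int] = {}
    for incident in queue:
        if not incident.resolved:
            counts[incident.support_group] = counts.get(incident.support_group, 0) + 1
    return counts


# Hypothetical sample data
QUEUE = [
    Incident("INC-1", "Network", None, 10, 60),
    Incident("INC-2", "Network", "analyst.a", 90, 60),
    Incident("INC-3", "Database", "analyst.b", 20, 60),
    Incident("INC-4", "Database", "analyst.b", 200, 60, resolved=True),
]
```

With this sample data, needs_attention flags INC-1 (unassigned) and INC-2 (past its threshold), which are the two conditions a support-group Queue Manager is expected to act on before being prompted.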
Accountability
Regardless of the support staff and organization to which an incident may be assigned, the Incident Manager (part of the OPS ITSD organization) remains accountable for ensuring that enterprise Incident Management process and procedures are followed and that prompt incident resolution activities are undertaken with Service Level Objectives in mind. Irrespective of who restores service (Service Desk Agent or Tier 2-N Support group), the OPS ITSD remains accountable for confirming with the customer and / or end-user that service has been restored and for verifying the accuracy of the condition/reason code, prior to closing the Incident.
Inputs to the Incident Management process include [2]:
- Incident records from calls to the Service Desk
- Service Level Objectives (from SLAs)
- Capacity Management thresholds
- Incident resolution details from the Knowledge base
- Incident patterns (and workarounds) from Incident Knowledge Management database
- Known Errors from Problem Management
- CI data from Configuration Management
Outputs from this process include [3]:
- Closed Incidents
- Services restored
- Requests for Change (RFCs)
- Incident resolution
- Inconsistencies found while interrogating the CMDB
- Consistent, meaningful (and maintained) Incident records
- Meaningful management information
[2] Source: Copyright 2003-2007 Ahead Technology Inc.
[3] Source: Copyright 2003-2007 Ahead Technology Inc.
2.5. Scope
2.5.1. In Scope
Incident Management includes any event which disrupts, or which could disrupt, a service. This includes events which are communicated directly by users through the Service Desk or events detected through an automated interface from event management to Incident Management tools. For purposes of clarity, any use of the terms Incident Manager, Incident Management or Incidents within this document includes the enterprise perspective described in Section 2.1.
Service Requests do not represent a disruption to agreed service, but are a way of meeting the customer's needs and may be addressing a specific aspect or feature of the service being provided (Service Fulfillment). This will be documented in the Service Level Agreement with each customer and the Service Level Objective will be outlined therein. Service Requests are dealt with by a separate Request Fulfilment process. Service Requests in the OPS are currently tracked under the same incident management enabling technology used by the Service Desk for incident logging.
Incident Management Scope:
IS:
- How To and technical questions
IS NOT:
- Service Requests (Request fulfillment). This is handled in the OPS through Service Order Desk Online (SODO)
- Root Cause Analysis (part of Problem Management)
- Steps and procedures to manage Major Incidents
- Establishment of communication thresholds for customers (these are defined through Service Level Management)
2.6. Applicability Statements
2.6.1. Organization
Government of Ontario IT Standards and Enterprise Solutions and Services apply (are mandatory) for use by all ministries/clusters and to all former Schedule I and IV provincial government agencies under their present classification (Advisory, Regulatory, Adjudicative, Operational Service, Operational Enterprise, Trust or Crown Foundation) according to the current agency classification system.
Additionally, this applies to any other new or existing agencies designated by Management Board of Cabinet as being subject to such publications (i.e. the GO-ITS publications and enterprise solutions and services), and particularly applies to Advisory, Regulatory and Adjudicative Agencies (see also procurement link, OPS paragraph). Further included is any agency which, under the terms of its Memorandum of Understanding with its responsible Minister, is required to satisfy the mandatory requirements set out in any of the Management Board of Cabinet Directives (cf. Operational Service, Operational Enterprise, Trust, or Crown Foundation Agencies).
As new GO-ITS standards are approved, they are deemed mandatory on a go-forward basis. Specifically, in the case of this revised version of GO-ITS 37 (placeholder anticipating approved version number set to 2.0), the effective date has been established as July 1, 2010.
When implementing or adopting any Government of Ontario IT standards or IT standards updates, ministries and I&IT Clusters must follow their organization's pre-approved policies and practices for ensuring that adequate change control, change management and risk mitigation mechanisms are in place and employed.
For the purposes of this document, any reference to ministries or the Government includes applicable agencies.
2.6.2. Requirements Levels
Within this document, certain wording conventions are followed. There are precise requirements and obligations associated with the following terms:
Must: This word, or the terms "REQUIRED" or "SHALL", means that the statement is an absolute mandatory requirement.
Should: This word, or the adjective "RECOMMENDED", means that there may exist valid reasons in particular circumstances to ignore the recommendation, but the full implications (e.g., business functionality, security, cost) must be understood and carefully considered before deciding to ignore the recommendation.
2.6.3. Compliance Requirements
Execution of this process at the operational level requires use of procedures, work instructions and enabling technology to automate certain workflow aspects. These elements will be produced by the organization selected by OEIP as the Operational Process Manager. Pending formalization of an ITSM Process Lifecycle Management protocol, the following statements are presented to ensure that these elements are fully compliant with this Standard:
- Procedures must be developed by decomposing each process step from Section 4.3 into procedural sub-tasks. These procedures must be submitted to the Enterprise Process Owner for certification that they comply with the spirit and intent of the Process Standard.
- Work Instructions must be developed by decomposing all procedural sub-tasks into further sub-tasks. These must then be submitted to the Enterprise Process Owner for certification that they comply with the certified process and procedures.
- Functional Requirements must be developed for enabling technology that will be used to automate aspects of the Work Instructions and procedures. Functional Requirements must also be submitted to the Enterprise Process Owner for certification that they align with the certified procedures.
- Any subsequent modifications to the Procedures, Work Instructions or enabling technology must be managed via Enterprise Change Management and will require authorization by OEIP.
3. Standards Lifecycle Management
3.1. Contact Information
3.1.1. Roles and Responsibilities
Provide the following information:
Accountable Role Definition
The individual ultimately accountable for the process of developing this standard. There must be exactly one accountable role identified. The accountable person also signs off as the initial approver of the proposed standard before it is submitted for formal approval to ITSC and ARB. (Note: in the OPS this role is at a CIO/Chief or other senior executive level)
Accountable Role:
Title: Head, Corporate Architecture Branch (OCCTO)
Ministry: MGS
Division: OCCTO
Responsible Role Definition
The organization responsible for the development of this standard. There may be more than one responsible organization identified if it is a partnership/joint effort. (Note: the responsible organization(s) provides the resource(s) to develop the standard)
Responsible Organization:
Ministry: MGS
Division: OCCTO
Branch: Corporate Architecture
Support Role Definition
The support role is the resource(s) to which the responsibility for actually completing the work and developing the standard has been assigned. There may be more than one support role identified. If there is more than one support role identified, the following contact information must be provided for each of them. If there is more than one support role, the first role identified should be that of the editor, the resource responsible for coordinating the overall effort.
Support Role (Editor):
Ministry: MGS
Division: OCCTO
Branch: Corporate Architecture
Section: ITSM
Job Title: Lead, OPS Enterprise ITSM Program
Name: Norm Watt
Phone: 416-327-3542
Email: norm.watt@ontario.ca
The above individual will be contacted by the Standards Section once a year, or as required, to discuss and determine potential changes and/or updates to the standard (including version upgrades and/or whether the standard is still relevant and current).
Consulted
Please indicate who was consulted as part of the development of this standard. Include individuals (by role and organization) and committees, councils and/or working groups. (Note: consulted means those whose opinions are sought, generally characterized by two-way communications such as workshops):
Organization Consulted (Ministry/Cluster) | Division | Branch | Date
Committee/Working Group Consulted | Date
ITSM Leads | Dec 2009 and Feb 2010
Informed
Please indicate who was informed during the development of this standard. Include individuals (by role and organization) and committees, councils and/or working groups. (Note: informed means those who are kept up-to-date on progress, generally characterized by one-way communication such as presentations):
Organization Informed (Ministry/Cluster) | Division | Branch | Date
Committee/Working Group Informed | Date
3.2. Recommended Versioning and/or Change Management
Changes (i.e. all revisions, updates, versioning) to the standard require authorization from the responsible organization. Once a determination has been made by the responsible organization to proceed with changes, the Standards Section, Technology Adoption Branch, OCCTO, will coordinate and provide assistance with respect to the approvals process.
The approval process for changes to standards will be determined based on the degree and impact of the change. The degree and impact of changes fall into one of two categories:
- Minor changes: requiring communication to stakeholders. No presentations required. No ITSC or ARB approvals required. Changes are noted in the Document History section of the standard.
- Major changes: requiring a presentation to ITSC for approval and ARB for approval (Note: ARB reserves the right to delegate their approval to ITSC).
Below are guidelines for differentiating between minor and major changes:
Major:
- represents a change to one or more of Scope, Principles, Roles or high-level Process Flow
- responds to legislative changes
Minor:
- does not impact other standards (e.g. updated Glossary information or updated Informative or Normative reference documentation)
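The guidelines above can be read as a simple decision rule. The sketch below is one possible reading, not a definitive one: it treats any change that is not major as minor, which simplifies the "does not impact other standards" guideline, and the lowercase element labels and function names are hypothetical.

```python
# Elements whose modification makes a change "major" under the guidelines above.
MAJOR_ELEMENTS = {"scope", "principles", "roles", "process flow"}


def classify_change(elements_changed: set[str], legislative: bool = False) -> str:
    """Return "major" or "minor" for a proposed change to the standard.

    elements_changed uses hypothetical lowercase labels such as "glossary"
    or "roles"; the labels themselves are not defined by this Standard.
    Simplification: anything that is not major is treated as minor.
    """
    if legislative or elements_changed & MAJOR_ELEMENTS:
        return "major"
    return "minor"


def approvals_required(category: str) -> list[str]:
    """List the steps for a change category (ARB may delegate its approval to ITSC)."""
    if category == "major":
        return ["ITSC presentation and approval", "ARB approval"]
    return ["communication to stakeholders", "note in Document History"]
```

For example, classify_change({"glossary"}) yields "minor", while a change touching Roles, or one made in response to legislation, yields "major" and therefore requires the ITSC and ARB steps.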
3.3. Publication Details
All approved Government of Ontario IT Standards (GO-ITS) are published on the ITSC Intranet web site. Please indicate with a checkmark below if this standard is also to be published on the public, GO-ITS Internet Site.
Standard to be published on both the OPS Intranet and the GO-ITS Internet web site (available to the public, vendors etc.)
4.
Technical Specification
4.1.
Process Principles
Principles are established to ensure that the process identifies the desired outcomes or behaviours related to adoption at an enterprise level. They also serve to provide direction for the development of procedures and (as necessary) work instructions that will ensure consistent execution of the process. The absence of well-defined and well understood principles may result in process execution that is not aligned with the process standard. Process Principles for OPS enterprise Incident Management are listed below.
Principle 1: A single enterprise Incident Management process shall be used across the OPS in support of I & IT services.
Rationale:
    Single support model eliminates costs and inefficiencies of multiple models for different services
    Establishment of a Single Point of Contact (SPOC) OPS IT Service Desk (ITSD) in FY 2006/2007 implied a single incident management process for OPS I & IT Incident Management
Implications:
    Legacy Incident Management related procedures and work instructions must be integrated and aligned to OPS enterprise Incident Management process
    Application Support groups must adapt existing procedures and work instructions to comply with the OPS enterprise Incident Management process
Principle 2: Incident classification must identify the Service(s) that is / are impacted (from the Customer's perspective).
Rationale:
    OPS Service Directive
    OEIP business architecture principle to establish a Service Focus for ITSM processes
    Enable implementation of ISAM (Integrated Service Agreement Model)
Implications:
    Staff must adopt an end-to-end service perspective for all incidents
    Service classification requirements must be defined and included in enabling technology
    Cluster Service owners must identify the services/hierarchy
    A service Configuration hierarchy must exist in order to identify impacted services
    Staff must be trained in new classification techniques
    Incident messaging with user/customer must communicate the service that the user feels is impacted
    Internal assignment routing, currently component-based, may have to be modified
Principle 3: The OPS ITSD shall be the single entry point into the enterprise Incident Management process and will manage Incidents through their complete lifecycle, including: assignment, functional and hierarchical escalation, tracking, communication and closure.
Rationale:
    Consistent management and coordination of Incident resolution
    Single accountability for execution of enterprise Incident Management process
    Ability to share topical information within a single group and provide enterprise perspective
    Ability to cross-reference other incidents and establish incident priority from an enterprise perspective
Implications:
    Effective diagnostic scripts and Support Models are required to assist in triage of incidents and ensure accurate assignment to the appropriate Tier-N resources
    ITSD Senior Management must support the objective assessment of reported Incidents and ensure criteria for Impact and Urgency (used to determine Priority) are established and communicated to Customers through the Service Level Management process
    Incident assignments / re-assignments to Tier-N support must occur via Service Desk only
Principle 4: The OPS ITSD shall act as the single point of contact for all communication regarding reported Incidents.
Rationale:
    Consistent support interface for customers
    Consistent delivery and coordination of communications to internal staff
    Reduces duplicative messaging and ensures common perspective is provided to customers and to I & IT senior management
    IT Tier 2-N support staff are more productive since they are protected from interruptions and the need to manage communications
Implications:
    Assistance and incident status information must be available (7*24) from the OPS ITSD throughout the entire lifecycle of the incident
    OPS ITSD and technical support staff will have to adjust their messaging to describe impacts / status in terminology that is service-focussed and customer-based rather than technical in nature
    OPS ITSD will distribute all Major Incident communications (sanctioned by the Major Incident Manager)
    Tier 2-N resources may request OPS ITSD staff to coordinate dialogue with end-user or customers (used to gather additional detail or information to effect incident resolution) if they are unable to contact the end-user directly.
    Customers or I & IT Clusters must have in place a mechanism to broadly disseminate information provided to them by the OPS ITSD
Principle 5: An Incident must be logged through the OPS ITSD as a pre-requisite for engagement of any Tier 2-N Support Staff, including external Service Providers.
Rationale: The Incident record is the source of record with the OPS ITSD for all incident resolution activities undertaken by any support staff. Failure to document these activities increases the risk of delayed resolution.
Implications: OPS ITSD procedures must identify the minimum level of information required to initiate an Incident record and to enable effective investigation and diagnosis.
Principle 6: Closure of incidents shall be dependent upon validating with either the end-user or the customer that service has been restored.
Rationale:
    Obtaining positive confirmation of incident resolution ensures that the customer is satisfied with the service delivered
    Validation step enhances the image of the IT organization
Implications:
    Customers will identify an appropriate level of resource to accept the validation request.
    A suitable mechanism must be defined to deal with circumstances when end-user(s) cannot be reached for validation within a pre-defined time period.
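The closure rule in Principle 6 can be pictured as a small decision function. The sketch below is illustrative only: the three-day `validation_window`, the outcome names, and the idea of returning to resolution when the user says service is not restored are assumptions, not values or procedures defined by this standard.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ClosureDecision(Enum):
    CLOSE = "close"                          # user or customer confirmed restoration
    RESUME_RESOLUTION = "resume_resolution"  # user reports service is not restored (assumed outcome)
    KEEP_WAITING = "keep_waiting"            # still inside the validation window
    APPLY_UNREACHABLE_POLICY = "policy"      # window expired without any contact


def closure_decision(
    resolved_at: datetime,
    now: datetime,
    user_confirmed: Optional[bool],
    validation_window: timedelta = timedelta(days=3),  # assumed value
) -> ClosureDecision:
    """Decide what to do with a resolved incident awaiting user validation.

    user_confirmed is True/False once the user has answered, None if the
    user has not been reached yet.
    """
    if user_confirmed is True:
        return ClosureDecision.CLOSE
    if user_confirmed is False:
        return ClosureDecision.RESUME_RESOLUTION
    if now - resolved_at < validation_window:
        return ClosureDecision.KEEP_WAITING
    return ClosureDecision.APPLY_UNREACHABLE_POLICY
```

The last branch is where the "suitable mechanism" for unreachable end-users would be invoked; what that mechanism does is left to local procedure.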
Principle 7: There shall be notification & escalation procedures that ensure consistent timely incident resolution and communication of progress relative to Service Level Agreements.
Rationale:
    Setting customer expectation for timing of periodic status reports will prevent interruptions caused by requests for status
    More effective delivery of end to end service as IT staff will have a clear understanding of Incident SLOs which will guide appropriate functional and hierarchical escalation
    Incidents resolved within customer expectations will increase customer satisfaction
Implications:
    Clear triggers and thresholds must be defined for functional & hierarchical escalation, as well as any periodic status notifications (this implies some form of automation);
    Service Level Objectives (documented in Service Level Agreements) must be clearly and explicitly defined and linked to these thresholds
    A single Escalation procedure must exist for functional and hierarchical escalation and must be adhered to by all participants in the Incident Management process
    A single Notification procedure must exist for notification. Any unique requirements for service specific notification thresholds must be documented and managed through the Service Level Management process and outputs from these situations must be configured within the ITSD enabling technology to support the requirements
    Templates and scripts are required to ensure consistency of messaging
    Customer Messaging must be tailored to deliver a customer perspective
    Messaging for internal Service Provider community may carry different level of detail, and this will be managed through local work instructions at the OPS ITSD
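Because Principle 7 implies some form of automation, a minimal sketch of threshold-driven triggers may help. Every timer value, the `EscalationThresholds` structure and the action names below are hypothetical; in practice the thresholds would be derived from the Service Level Objectives documented in each SLA.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class EscalationThresholds:
    """Per-priority timers; all values used below are illustrative assumptions."""
    status_update_every: timedelta  # periodic status notification interval
    functional_after: timedelta     # reassign to the next support tier
    hierarchical_after: timedelta   # engage the next level of management


# Hypothetical thresholds keyed by incident priority.
THRESHOLDS = {
    1: EscalationThresholds(timedelta(minutes=30), timedelta(hours=1), timedelta(hours=2)),
    2: EscalationThresholds(timedelta(hours=2), timedelta(hours=4), timedelta(hours=8)),
    3: EscalationThresholds(timedelta(hours=8), timedelta(hours=24), timedelta(hours=48)),
}


def due_actions(priority: int, open_for: timedelta, since_last_update: timedelta) -> List[str]:
    """Return the triggers that have fired for an unresolved incident."""
    t = THRESHOLDS[priority]
    actions = []
    if since_last_update >= t.status_update_every:
        actions.append("send_status_notification")
    if open_for >= t.functional_after:
        actions.append("functional_escalation")
    if open_for >= t.hierarchical_after:
        actions.append("hierarchical_escalation")
    return actions
```

Keeping the thresholds in one table, rather than scattered through procedures, mirrors the requirement for a single Escalation procedure and a single Notification procedure.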
Principle 8: All Incident information, including resolution details, shall be logged in an accessible Incident Management repository.
Rationale:
    Single source of data for all enterprise incidents, ensures consistent view and authoritative source for management of incidents
    Tracking of progress enables ability to escalate
    Provides knowledge base to enable:
        o Reduction in Mean Time to Resolve (MTTR) for similar incidents by applying previous workaround
        o Analysis and identification of Problems (by Problem Management Process)
    Audit trail informs reporting (Service Level Management)
Implications:
    Incident Management must be supported by an integrated IT support system with a common database for logging all incident & resolution information
    Incident Management and Problem Management must have access to the same database
    Validation of accuracy of resolution details must occur before any auto-closure of tickets
Principle 9: A separate procedure shall be established to manage resolution of Major Incidents that will include nomination of a single Manager for the incident. This resource will be assigned from a pool of management staff within the OPS ITSD, the appropriate Branch of the I & IT Cluster or Corporate Security.
Rationale:
    Special leadership may be required to secure and manage resources to ensure prompt resolution of major incidents
    Establishment of an accountable Lead will ensure ownership of the Major Incident and provide an objective point of escalation and contact throughout the life of the incident from declaration to Major Incident Review
Implications:
    Criteria for Major Incident declaration must be defined, documented and communicated to Stakeholders and then linked to Incident prioritization activities at the OPS ITSD
        o Criteria may vary by Service - It is neither reasonable nor efficient to define one-size-fits-all criteria that apply to all Incidents.
        o It is an expensive undertaking to invoke Major Incident Procedures and secure and coordinate the resources required to deal with a Major Incident. Therefore, care must be taken to prevent subjective or reactive declaration by specifying objective, quantifiable attributes for an Incident to be declared Major.
    Ability to engage and receive confirmation of acceptance from the accountable Major Incident Manager must be 7*24
    Incident Analyst staff in any organization must be contactable on a 7*24 basis to support Major Incidents
    Some Major Incidents may not require special leadership if resolution activities are outside the span of control of the OPS I & IT community (e.g. major power outage or major weather situation across the province)
    Staff involved in Incident Management and Service Level Management functions must be trained in the Major Incident Procedure
    Logistics, facilities and technical requirements for a Situation or WAR Room must be identified and provisioned to support prolonged or multiple incident events. This information must be made widely available to all Stakeholders in the enterprise incident management process.
Principle 10: Any proposed service restoration activity, which has the potential to impact other services or other customers of the same service, must be approved by the Service Owner(s) before being undertaken.
Rationale:
    Ensures that incident resolution activities do not impact other services or other users of the same service
    Ensures a business perspective is considered before possible disruptive actions are taken for incident resolution
Implications:
    Service Owner(s) must be contactable 7*24
    As an alternative to 7*24 availability, a defined policy must be developed by the Service Owner that will outline the proposed approach for each of the Services in the catalogue of the Service Provider. This policy must be shared with Stakeholders and embedded in all Service Level Agreements. The Incident Manager or Major Incident Manager (see Principle 9 above) would be contacted to provide requisite approval (after due consideration of the policy).
    An ability to relate components and enabling services is required to understand potential impact to other users. This information is typically obtained from the infrastructure Configuration Management Data Base (CMDB).
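The ability to relate components to services, as described for Principle 10, amounts to walking a dependency map. The following is a sketch under stated assumptions: the `DEPENDENTS` map, the component names and the service names are invented for illustration and do not describe any real CMDB schema. It covers impact on other services only, not on other customers of the same service.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical CMDB extract: each key is a configuration item (CI) and each
# value lists the CIs or services that depend directly on it.
DEPENDENTS: Dict[str, List[str]] = {
    "db-server-01": ["hr-database", "finance-database"],
    "hr-database": ["HR Portal"],
    "finance-database": ["Expense Claims"],
    "HR Portal": [],
    "Expense Claims": [],
}
SERVICES: Set[str] = {"HR Portal", "Expense Claims"}


def impacted_services(ci: str, dependents: Dict[str, List[str]], services: Set[str]) -> Set[str]:
    """Breadth-first walk from a CI to every service that depends on it."""
    seen = {ci}
    queue = deque([ci])
    found: Set[str] = set()
    while queue:
        current = queue.popleft()
        if current in services:
            found.add(current)
        for nxt in dependents.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return found


def services_needing_approval(ci: str, incident_service: str) -> Set[str]:
    """Services, other than the one being restored, that an action on `ci` could touch."""
    return impacted_services(ci, DEPENDENTS, SERVICES) - {incident_service}
```

If the returned set is non-empty, the restoration activity would need approval from the owners of those services before it proceeds.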
Principle 11: Incident resolution activities must commence as soon as possible for all Incidents regardless of Priority.
Rationale:
    Industry best practice supports determining as soon as possible the extent and effort required to resolve incidents
    Delaying resolution activities for a seemingly minor or misdiagnosed incident could increase the impact to customer (activities to resolve incidents reported during non-prime shifts, if deferred to next business day, can result in service-affecting impact to the customer)
Implications:
    Unresolved incidents must be monitored on a periodic basis and their impact re-assessed based on Service Level Objectives
    Local work instructions at the OPS ITSD must prescribe that a sweeping of the incident queues be performed on a periodic basis to ensure outstanding incidents have been actioned in support of Service Level Objectives
    Ability to engage active support of Tier 2-N resources off normal hours
    Priority 2 and Priority 3 incidents that are assigned to Tier 2-N support groups outside of regular business hours may not be actioned until next business day. Current practice is to place these Incidents in a Pending state within the enabling technology. This can result in misleading Availability and Performance metrics. Service Level Managers must be prepared to address these concerns if / when they are raised by Customers.
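The periodic queue sweep called for under Principle 11 could look roughly like the sketch below. The `OpenIncident` fields, the four-hour idle limit, the one-hour warning margin and the reason labels are all assumptions made for illustration; actual values would come from local work instructions and the applicable Service Level Objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Tuple


@dataclass
class OpenIncident:
    ticket_id: str
    priority: int              # 1 is the most urgent
    last_action_at: datetime   # last recorded resolution activity
    slo_deadline: datetime     # target restoration time


def sweep(
    queue: List[OpenIncident],
    now: datetime,
    idle_limit: timedelta = timedelta(hours=4),   # assumed value
    warn_before: timedelta = timedelta(hours=1),  # assumed value
) -> List[Tuple[str, str]]:
    """Return (ticket_id, reason) pairs needing attention, most urgent first."""
    flagged = []
    for inc in queue:
        if now >= inc.slo_deadline:
            flagged.append((inc, "slo_breached"))
        elif inc.slo_deadline - now <= warn_before:
            flagged.append((inc, "slo_at_risk"))
        elif now - inc.last_action_at >= idle_limit:
            flagged.append((inc, "not_actioned"))
    # Order by priority first, then by the nearest deadline.
    flagged.sort(key=lambda pair: (pair[0].priority, pair[0].slo_deadline))
    return [(inc.ticket_id, reason) for inc, reason in flagged]
```

A sweep like this also surfaces incidents parked in a Pending state overnight, which is the situation the last implication above warns about.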
Principle 12: All Service Owners and OPS Service Providers shall fulfill their roles in compliance with the OPS enterprise Incident Management process.
Rationale: Consistent participation from all Stakeholders is required to ensure success of the enterprise Incident Management process
Implications: Underpinning Contracts (UCs) with external service providers must reflect the enterprise Incident Management process requirements
Operating Level Agreements (OLAs) between internal service providers must be in place and reflect enterprise Incident Management process requirements
Principle 13: A mechanism must be in place to identify security-related incidents and engage appropriate support staff to resolve the issue.
Rationale: Security related incidents may require specialized skills that are not resident in the ITSD organization.
Implications:
    A security support group must be established and staffed on a 7*24 basis.
    Special procedures must be defined and agreed to by the OPS ITSD and CSB to address security related incidents.
    ITSD staff must be provided with initial and ongoing training to ensure they are equipped to identify potential security related incidents
    This mechanism must be bi-directional in nature as Corporate Security Branch (CSB) must have the ability to pro-actively inform the OPS ITSD of a security-related Incident
4.2.
Process Roles and Responsibilities
Each process requires specific roles to undertake defined responsibilities for process design, development, execution and management. An organization may choose to assign more than one role to an individual. Additionally, the responsibilities of one role could be mapped to multiple individuals. One role is accountable for each process activity. With appropriate consideration of the required skills and managerial capability, this person may delegate certain responsibilities to other individuals. However, it is ultimately the job of the person who is accountable to ensure that the job gets done. Regardless of the mapping of responsibilities within an organization, specific roles are necessary for the proper operation & management of the process. This section lists the mandatory roles and responsibilities that must be established to execute the Incident Management process.
Roles (columns): Incident Manager (All Incidents); Major Incident Manager (P1); Situation Manager (P2); Service Desk Agent; Incident Analyst (Tier 2-N); Service Owner; Partner IM Liaison
Process Tasks (rows): Log & Classify Incident; Prioritize Incident; Declare Major Incident; Perform Tier 1 Diagnosis; Functional Escalation; Perform Tier-N Diagnosis; Resolve Incident; Monitor Incident; Close Incident
A A A,R A A A A A A
I R A* R R A* R
R R C R R R,I R R R
I C I I
Legend: R = Responsible, A = Accountable, C = Consult before, I = Informed
A* Major Incident Manager is Accountable to resolve Major Incidents per Major Incident protocol
Situation Manager may be called upon to resolve other Incidents as deemed necessary by the Incident Manager
4.2.1.
Enterprise Incident Management Process Owner
The Process Owner owns the process and the supporting documentation for the process. The Process Owner provides process leadership to the IT organization by overseeing the process and ensuring that the process is followed by the organization. When the process isn't being followed or isn't working well, the Process Owner is responsible for identifying why and ensuring that required actions are taken to correct the situation. In addition, the Process Owner is responsible for the approval of all proposed changes to the process, and development of process improvement plans.
Responsibilities
    Ensures that the process is defined, documented, maintained and communicated at an Enterprise level through appropriate vehicles (IT Standards Council / Corporate ARB).
    Undertakes periodic review of all ITSM processes from an Enterprise perspective and ensures that a methodology of Continuous Service Improvement (including applicable Process-level supporting metrics) is in place to address shortcomings and evolving requirements.
    Ensures that all Enterprise ITSM processes are considered and managed in an integrated manner, taking into consideration OPS Policies and Directives and factoring in evolving trends in technology and practice.
    Solicits OPS Stakeholders and communities of interest to identify Enterprise ITSM process requirements for consideration by the Enterprise ITSM Program.
    Coordinates, presents and recommends options for the prioritization, development and delivery of these to the appropriate governing body.
    Ensures Enterprise ITSM procedures and work instructions and functional requirements for enabling technology are aligned with the enterprise process.
Segregation of Duties
The role of Enterprise Process Owner is separate and distinct from that of the Incident Manager and the roles shall be separately staffed. The Enterprise IM Process Owner shall reside in OCCTO, while the Enterprise Incident Manager shall reside in the organization of the OPS infrastructure Service Provider.
4.2.2. Incident Manager (IM)
The Incident Manager is accountable for managing execution of the Incident Management process and directing the activities of all OPS I&IT organizations required to respond to incidents in compliance with SLAs and SLOs. The Incident Manager is accountable for the lifecycle of all incidents and acts as the incident management point of escalation for incident notification and for hierarchical escalation.
Responsibilities
    Develops and maintains an appropriate level of incident management procedures and / or work instructions to support the needs of the business.
    Ensures that Incident Management staff are trained and familiar with IM procedures
    Monitors IT support staff performance of the Incident Management process; creates and executes action plans when necessary to ensure effective operation and continuous improvement
    Manages Incident resource allocation and workload distribution
    Invokes the Major Incident Procedure, as appropriate
    Engages upper levels of management as appropriate
    Ensures that a Major Incident Review is conducted for all major incidents and that recommended action items are completed.
    Provides information for management related to OPS ITSD performance
    Highlights trends resulting from recurring incidents for review by Problem Management.
    Monitors performance of the Incident Management process and identifies process improvements to the Enterprise IM Process Owner
4.2.3.
Situation Manager (SM)
The Situation Manager is called upon by the Incident Manager to manage escalations of Incidents meeting pre-specified criteria (typically Priority 2 (P2) Incidents). The SM is accountable for taking actions necessary to resolve P2 Incidents and restore service.
Responsibilities
    Resolve the escalated Incident leveraging resources provided by the Incident Coordinator
    Identify and lead the required members of the resolution team to develop the plan to restore service or create a workaround
    Ensure that status messages are provided by the ITSD for periodic progress reports based on the defined Notification schedule
    Perform escalation evaluations
    Coordinate the establishment of resolution teams
    Provide point-of-contact for resolution teams
    Manage further hierarchical and functional escalations
    Recommend activating Disaster Recovery Process (as necessary)
4.2.4.
Queue Manager (QM)
The Queue Manager monitors the queue to ensure that all incident tickets assigned to various support groups in their organization are promptly actioned and / or escalated within defined thresholds in support of Service Level Agreements / Objectives (SLAs/SLOs). This Role is predominantly concerned with the overall performance of resources involved in the Incident Management process, and is defined to establish an objective perspective on how Incident Management is being undertaken within a specific organization. As such, there are no specific Accountabilities.
Responsibilities
    Address process execution issues encountered by support personnel and ensure that all tickets assigned to a queue are promptly actioned.
    Monitor the incident queues
    Ensure that all incidents placed in a queue are assigned to the appropriate resource within the queue
    Monitor all incidents and advise support group members of upcoming and actual Service Level Breaches (Note: Engaging the support group will only occur if a Service Desk Analyst has not already performed this action.)
    Respond to the escalated incidents in a timely and appropriate fashion to minimize the effect of incidents on agreed service levels
    Follow defined escalation path, as defined in the escalation policy
    Facilitate support resource commitment and allocation
    Attend incident review meetings as required
    Participate in process improvement sessions
4.2.5.
Service Desk Manager (SDM)
The SDM is accountable for all aspects of the OPS ITSD and for effective management of the Incident Queues across the OPS I & IT organizations.
Responsibilities
    Manages overall Service Desk activities
    Acts as escalation point for Team Leads
    Monitors incident volumes and trends to ensure appropriate staffing levels
    Recommends procedural improvements to the Incident Manager
4.2.6.
Service Desk Team Lead
    Ensures currency and effectiveness of diagnostic scripts used to perform incident triage
    Manages shift schedules to ensure appropriate staffing and skill levels are maintained
    Acts as escalation point for Service Desk Agents in difficult or controversial situations
    Arranges staff training and awareness sessions
    Produces statistics and management reports
    Undertakes HR activities as required
    Assists Service Desk Agents when workloads are high or more experience is required
4.2.7. Service Desk Agent (SDA)
The Service Desk Agent provides the single point of contact for customers during the incident lifecycle.
Responsibilities
    Authenticates the caller (User or Customer) and captures minimum level of defined contact information
    Authenticates the level of support to which the individual reporting the incident is entitled
    Creates an Incident record for the new incident or updates the record for existing incidents
    Classifies the incident
    Ensures that the description of all incident resolution activities is accurately captured in incident records
    Continually updates incident records with progress / status information
        o to reflect their own activities
        o to support Tier 2-N resources as / if requested
    Attempts Incident resolution at first point of contact (Tier 1) using diagnostic scripts and knowledge records such as Known Errors
    If unable to restore service within predefined threshold, performs Functional Escalation and assigns incident to the appropriate Tier 2 support group
    Facilitates functional escalation between Tier-2 and Tier-N support groups and records circumstances in the incident record
    Informs the Queue and / or Incident Manager of any non-minor Incidents
    Keeps the customer or user updated on incident progress based on notification protocol
    Obtains user (or customer) concurrence that the support actions provided addressed their needs prior to closing the Incident
4.2.8. Incident Analyst (IA)
Incident Analysts are Tier 2-N support group staff in each organization who provide progressively greater technical expertise to resolve Incidents that have not been resolved at the previous tier.
Responsibilities
    Responds to assigned incidents within agreed timeframes
    Diagnoses, develops workarounds and / or attempts to resolve assigned incidents
    Requests assistance from other Tier 2 support areas via the Incident or Queue Manager
    If unable to resolve, requests functional escalation via the OPS ITSD
    Keeps the OPS ITSD informed of progress on assigned incidents via incident enabling technology
    Notifies the OPS ITSD as soon as it is known that the expected resolution will not occur within service thresholds
    When requested by the Queue and / or Incident Manager, provides technical assistance for other Tier-N resources
    When requested by the Queue and / or Incident Manager, provides technical communication / explanation to customers and / or end-users.
    Ensures creation of an incident record for all / any activities undertaken related to remedial action for technology or service supported
    When designated by the Major Incident Manager as the technical lead for a Major Incident, the Incident Analyst has additional responsibilities:
        o Undertake the technical leadership of the analysis, diagnosis and develop the subsequent action plan to remediate the Major Incident
        o Provide periodic updates and status reports to the Major Incident Manager to ensure communication and notification requirements of the Incident Management process are satisfied
4.2.9.
Service Owner
In addition to the general Service Owner responsibilities identified in enterprise Problem Management process (e*PM), the Service Owner has additional responsibilities specific to the enterprise Incident Management process. (Note: These fall under the broad category of the Service Support Model that is the responsibility of the Service Owner to define and maintain.) In order to provide seamless, end-to-end support for Incident Management for OPS I & IT services, it is necessary to document all aspects of the Support Model. As the I & IT Clusters are accountable for the Application component of many of the OPS Services, the enterprise Incident Management process must be informed with key aspects of the support structure for applications. The Service Owner is responsible for the identification, documentation and maintenance of internal partner solution / service knowledge required to inform the Support Model used by the OPS ITSD.
Responsibilities
    Define and establish the support model (including required skills for Tier 2-N support staff) up to and including the application
    Provide information, via the Partner Liaison, to the ITSD. This would include items such as service / solution descriptions, diagnostic content, mandatory information capture at Tier 1 and First Point of Contact (FPOC) resolution steps for use by Service Desk Agents
    Maintain the above information and inform the appropriate parties of updates:
        o Partner Liaison for support model updates
        o Service Level Manager for revisions to Service Level Objectives
    Develop local procedure information in support of incident management for Cluster services / solutions and obtain endorsement from enterprise Incident Manager that these align with OPS ITSD procedures
4.2.10.
Major Incident Manager (MIM)
In certain cases of incidents, a Major Incident Manager may be required to manage resolution activities. The Incident Manager or delegate will make this determination, and as required, will assign a single individual to undertake the MIM role for the service recovery activities related to that Incident. The MIM is accountable for taking actions necessary to resolve a Major Incident and restore service. In all cases a Major Incident will be classified using Urgency / Impact definitions documented in Section 6.4 of this Standard. By definition Major Incidents will be classified as Priority 1 (P1). Activities managed by this individual may cross organizational boundaries. The MIM will be selected from a pool of managers within the appropriate Branches in the OPS I&IT organization (ITS, Clusters or CSB).
The administrative aspects of the major incident will continue to be managed through the OPS ITSD and the Incident Manager or delegate will continue to perform responsibilities related to incident notification, escalation and communication. The Incident Manager maintains ownership and accountability for the lifecycle of the Incident. This allows the MIM to fully focus effort and attention upon managing the technical resolution of the incident.
Responsibilities
    Identifies the required members of the resolution team, and requests their participation via the ITSD
    Ensures that a systematic approach is used to evaluate the reported symptoms, impacts and contributing factors of the incident
    Ensures assignment of key Incident Analyst to develop the optimum plan to restore service or create a workaround
    Provides timely updates to the OPS ITSD to ensure the incident record is maintained.
    Ensures that status messages are provided to the OPS ITSD for periodic progress reports based on the Major Incident Notification Schedule
    Undertakes functional escalation based upon pre-defined thresholds for the service being supported. (Note: Problem Management resources may also be requested should a workaround not be found and a real-time Root Cause Analysis (RCA) be required.)
    Provides documentation for Major Incident Review report.
4.2.11.
Partner Incident Management Liaison
The Partner Liaison provides a point of contact between the Incident Manager and partner organizations (e.g. Clusters, CSB, 3rd Party Service Providers) to enable effective and efficient execution of the IM process.
Responsibilities
    Coordinates with Service Owners in their organization to provide and maintain the Support Model information required by the OPS ITSD SDAs (e.g. service / solution descriptions, diagnostic approach, mandatory information capture and First Point of Contact (FPOC) resolution steps).
    Provides the ITSD with accurate partner organization information relative to the IM process (including VIP lists, location details, organizational and / or staff changes, etc.)
    Coordinates incident resolution activities within an organization
    Acts as the escalation point for any organizational issues regarding execution of the Incident Management process
4.3.
Process Flows
4.3.1. Incident Management Process Overview
[Figure: Incident Management process flow diagram, Version 1.62]
4.3.2. Incident Management Process Tasks

1.0 Report Incident
Roles: User, Ops Staff
Input, Trigger: User-perceived service outage or degradation; Monitoring Event
Description: Users may call or email the service desk to report an incident. Event Monitoring may also proactively indicate an incident before users are impacted.

2.0 Log & Classify Incident
Roles: SDA
Input, Trigger: Service Desk informed of Incident
Description: The SDA creates the incident record and captures user contact information, classification data and details about symptoms.

3.0 Prioritize Incident
Roles: SDA
Input, Trigger: Incident classified
Description: The SDA prioritizes the incident, based upon Impact and Urgency (usually via a predetermined formula).

4.0 Declare Major Incident
Roles: SDA, IM
Input, Trigger: Major Incident criteria are met
Description: The SDA determines that the Incident meets the agreed criteria for a Major Incident and informs the Incident Manager, who determines whether or not to declare a Major Incident and which parts of the Major Incident Protocol will be invoked.
Output, Completion Criteria: IM informed, Major Incident Procedure invoked

5.0 Perform Tier 1 Diagnosis
Roles: SDA
Input, Trigger: Incident Prioritized
Description: The Service Desk agent conducts initial diagnosis to discover the full symptoms of the incident and to determine exactly what has gone wrong and how to correct it. The agent uses diagnostic scripts and known error information to assist in this task.

6.0 Functional Escalation
Roles: SDA, QM
Input, Trigger: SD cannot restore service within agreed threshold
Description: If the SDA cannot restore service at first point of contact within a predetermined timeframe, the incident is assigned to an Incident Analyst (Tier 2 support group) to attempt to restore service within Service Level targets. This functional escalation is repeated to Tier 3 and so on if the Tier 2 Incident Analyst cannot resolve the incident within a defined threshold.

7.0 Perform Tier-N Diagnosis
Roles: IA
Input, Trigger: Functional escalation
Description: Incident Analysts conduct further diagnosis to determine how to restore service.

8.0 Resolve Incident
Roles: SDA, IA
Input, Trigger: Diagnosis has indicated probable resolution
Description: The Incident Analyst or SDA takes (or coordinates) the necessary action to restore service and conducts tests to ensure that service is restored. (Note: this could include asking the user to take actions, e.g. rebooting the computer.)
Output, Completion Criteria: Service has been restored from SDA/IA perspective

9.0 Monitor Incident
Roles: IM, QM
Input, Trigger: Incident logged
Description: Incidents are monitored throughout their lifecycle. The Queue Manager ensures that incidents assigned to Tier N support groups are resolved or functionally escalated within defined thresholds. The Incident Manager monitors thresholds and may escalate or manage notifications if Service Level Targets are in jeopardy.

10.0 Close Incident
Roles: SDA
Input, Trigger: Analyst indicates service restoration
Description: The SDA requests the User to confirm that service has been restored from their perspective and then closes the incident. If the user cannot be reached within an agreed threshold, the SDA follows the predefined policy for such situations.
Output, Completion Criteria: User confirms service restoration
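The functional escalation described in task 6.0 can be sketched in code. This is a minimal illustration and not part of the standard: the `Incident` class, the `escalate_if_needed` function and the threshold values in `TIER_THRESHOLDS` are hypothetical, since the standard leaves the actual thresholds to agreed Service Level targets.

```python
from dataclasses import dataclass, field
from datetime import timedelta

# Hypothetical per-tier thresholds; real values come from agreed
# Service Level targets, which this standard does not specify.
TIER_THRESHOLDS = {
    1: timedelta(minutes=15),  # Tier 1: Service Desk (first point of contact)
    2: timedelta(hours=2),     # Tier 2: Incident Analyst
    3: timedelta(hours=4),     # Tier 3: last tier in this sketch
}


@dataclass
class Incident:
    summary: str
    tier: int = 1
    history: list = field(default_factory=list)


def escalate_if_needed(incident, time_at_tier, max_tier=3):
    """Move the incident up one tier if the current tier's threshold is exceeded.

    Returns True when a functional escalation took place.
    """
    threshold = TIER_THRESHOLDS[incident.tier]
    if time_at_tier > threshold and incident.tier < max_tier:
        incident.history.append(f"Tier {incident.tier} -> Tier {incident.tier + 1}")
        incident.tier += 1
        return True
    return False
```

In this sketch the Queue Manager's check from task 9.0 would amount to calling `escalate_if_needed` periodically for each open incident in a queue.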
4.4. Linkages to other processes

Process: Problem Management (PM)
Linkage:
- PM requires that Incident Management capture sufficient and accurate information to enable problem identification:
  o Proper closure codes
  o Proper classification
  o Link new Incidents to existing Problems
  o Known defective components (based upon event monitoring and component alarms)
- PM makes information available that can support Incident resolution activities (e.g. Known Errors, workarounds, patterns).
- Enabling technology must be able to define the relationship between Incident, Problem and Known Error records.
- Incident Management may identify potential Problems to Problem Management.

Process: Enterprise Change Management (ECM)
Linkage:
- Should restoration of a service require modification to a component under the control of Configuration Management, then ECM must be engaged.
- Enabling technology must be able to define the relationship between Incident and Change records.

Process: Configuration Management
Linkage: A portable guide was developed as an OPS Standard in 2004. This portable guide will be updated to reflect Enterprise requirements in the near future. At that time it will be linked to Incident Management so that a faulty CI can be referenced in the Incident record.

Process: Service Level Management
Linkage: Although this process has not yet been formalized at the enterprise level, there is an expectation that incident escalation thresholds are defined to support SLAs and OLAs.
Service and Component Classification schemas must be used consistently across ITSM processes such as Incident, Change and Problem Management to enable industry best practice process integration. Failure to adopt a common approach to implementing these three processes will result in needless re-work and additional administrative overhead for operational staff.
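The record relationships that this section requires of the enabling technology (Incident to Problem, Known Error and Change) could be modelled along the following lines. This is only a sketch under assumptions: every class, field and function name here is hypothetical and does not describe any actual OPS tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KnownError:
    ke_id: str
    workaround: str


@dataclass
class Problem:
    problem_id: str
    known_error: Optional[KnownError] = None
    incident_ids: List[str] = field(default_factory=list)


@dataclass
class Incident:
    incident_id: str
    classification: str                 # shared classification schema value
    closure_code: Optional[str] = None  # set when the incident is closed
    problem_id: Optional[str] = None    # link to an existing Problem
    change_ids: List[str] = field(default_factory=list)  # linked Change records


def link_incident_to_problem(incident, problem):
    """Link an incident to a problem and return any known workaround."""
    incident.problem_id = problem.problem_id
    if incident.incident_id not in problem.incident_ids:
        problem.incident_ids.append(incident.incident_id)
    return problem.known_error.workaround if problem.known_error else None
```

Keeping the link in both directions lets Problem Management count related Incidents while the Service Desk retrieves a workaround during diagnosis.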
4.5. Incident Management Process Quality Control
Certain aspects of the execution of the Incident Management process are monitored, as a quality control measure, to identify opportunities to improve process effectiveness and efficiency.
Monitoring: The Incident Manager is responsible for monitoring certain aspects of the activities performed by the Incident Management team on a regular basis. This serves a twofold purpose:
1. The Incident Manager can identify any bottlenecks at the operational level and take appropriate corrective action.
2. Both the Incident Manager and the enterprise Process Owner can identify opportunities for improvement at the process and procedural level.
Reporting: Reporting involves measuring the process via metrics and recording how well it performs in relation to the objectives or targets specified in the metrics. Metrics provide Incident Management personnel with feedback on the process. They also provide the Incident Management Process Owner with the necessary information to review overall process health and to undertake continual service improvement techniques.
Evaluating: Evaluating the process involves regular reviews of the execution of the process and identification of possible improvements or actions to address performance gaps.
Every process is only as good as its last improvement; hence, the feedback loop of continuous improvement is inherent in every process.
4.6. Metrics
Metrics are intended to provide a useful measurement of a process's effectiveness and efficiency. Metrics are also required for strategic decision support. The following need careful consideration:
- Reporting metrics will be readily measurable (preferably with automated collection and presentation of data).
- Metrics will be chosen to reflect process activity (how much work is done?), process quality (how well was it done?) and process execution (to review and plan the job on hand).
- The Enterprise Incident Management Process Owner is accountable for the definition of an appropriate suite of metrics to determine the overall health of the Enterprise Incident Management process.
- The Incident Manager will develop and run the reports and may develop other metrics to monitor other operational aspects of process execution, such as workload and resource balancing.

The following represents the initial suite of metrics that will be used to analyze process performance, to identify opportunities for improvement and for strategic decision support. Any count of Incidents must exclude Service Requests.

Workload:
- Total number of Incidents per period (as a control measure), excluding Service Requests
- Number and percentage of major incidents
- Size of current Incident backlog

Process Effectiveness:
- Number and percentage of incidents re-assigned
- Number and percentage of incidents incorrectly classified
- Average Call Time with no escalation (ITSD metric)
- Percentage of incidents resolved within agreed response time
- Average time for Tier 2-N support to respond to a functionally escalated incident

Process Efficiency:
- Percentage of Incidents closed by the Service Desk without reference to other levels of support (often referred to as first point of contact)
- Mean time to resolve incidents (MTTR)
- Percentage of Incidents resolved on first attempt
- Percentage of assigned Incidents resolved within Service Level Objectives (total and broken down by queue)
- Aging Report showing the number and percentage of assigned Incidents per organization that have been outstanding for longer than periods designated from time to time by the IM Process Owner
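As an illustration of how a few of these metrics might be derived from ticket data, the sketch below computes the incident total, backlog, major incident percentage, first point of contact percentage and MTTR. The record layout (`type`, `major`, `opened_at`, `resolved_at`, `resolved_tier`) and the choice of denominators are assumptions made for this example, not definitions from the standard.

```python
from datetime import datetime, timedelta


def summarize_incidents(records):
    """Compute a handful of the listed metrics from ticket records (dicts)."""
    # Any count of Incidents must exclude Service Requests.
    incidents = [r for r in records if r["type"] == "incident"]
    resolved = [r for r in incidents if r["resolved_at"] is not None]

    def pct(part, whole):
        # Percentage helper that avoids division by zero on empty periods.
        return 100.0 * part / whole if whole else 0.0

    durations = [r["resolved_at"] - r["opened_at"] for r in resolved]
    mttr = sum(durations, timedelta()) / len(durations) if durations else None

    return {
        "total_incidents": len(incidents),
        "backlog": len(incidents) - len(resolved),
        # Assumed denominator: all incidents in the period.
        "major_pct": pct(sum(1 for r in incidents if r["major"]), len(incidents)),
        # Assumed denominator: resolved incidents; tier 1 means Service Desk.
        "fpoc_pct": pct(sum(1 for r in resolved if r["resolved_tier"] == 1), len(resolved)),
        "mttr": mttr,
    }
```

A per-queue breakdown or the aging report could follow the same pattern by grouping the records before summarizing them.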
4.7. Standard Process Parameters
For an enterprise process to be effective, the parameters used for the classification, categorization, prioritization and closure of incidents must be used consistently across the OPS. Special attention must be given to parameters required for consistency of reporting. This is particularly important for the provision of reliable business intelligence.
Please refer to the Classification Model section of the GO-ITS 44 ITSM Terminology Reference Model Portable Guide for standard process parameters and allowable values for Incident Management.
Please refer to the State Model section of the GO-ITS 44 ITSM Terminology Reference Model Portable Guide for standard status/state parameters and their definitions for Incident Management.
5. Related Standards
5.1. Impacts to Existing Standards

GO-IT Standard: GO-ITS 44 Terminology Reference Model
Impact: GO-ITS 37 re-defines the Urgency and Impact classification elements.
Recommended Action: Future repatriation of all TRM elements into the appropriate ITSM process standards.

GO-IT Standard: GO-ITS 55 Incident Management Contextual Model and Service Desk Interaction Model
Impact: GO-ITS 55 contains Role definitions that are redundant.
Recommended Action: Eliminate all Role definitions from GO-ITS 55. Update pending.

GO-IT Standard: GO-ITS 38 Enterprise Problem Management
Impact: Nil
Recommended Action: N/A

5.2. Impacts to Existing Environment

Impacted Infrastructure: EIT
Impact: New Role of Major Incident Manager must be implemented
Recommended Action: Future EIT update
6. Appendices
6.1. Normative References
6.1.1. Major Incident Protocol
Located on the GO-ITS web site - Title: Normative Reference to GOITS 37 - Major Incident Protocol
6.2. Informative References
6.2.1. Enterprise Differentiation: Process, Procedure, Work Instruction
Note: The following diagram depicts three levels of task descriptions that are often confused with one another. This example is from Enterprise Change Management and illustrates the level of information required in task descriptions.
Level 1 tasks are defined in a Process. They specify what action must be taken and who is involved.
Level 2 tasks are defined in Procedures that decompose each Level 1 task into more granular operational tasks and, additionally, prescribe how the activity should be performed.
Level 3 tasks represent Work Instructions: they are a further decomposition of procedure-level tasks, typically defined to address any unique local requirements when performing a procedural task.
6.2.2. Definitions: Urgency and Impact
The following table provides the framework for classifying the Urgency and Impact of Incidents, which are then used to establish Incident Priority. Urgency and Impact were originally defined in GO-ITS 44, Terminology Reference Model, to ensure that local process implementations used common terminology. ITSM has matured across the OPS and enterprise processes are now in place for Incident, Problem and Change Management. The definitions have been updated to reflect best practices. This is the first step in the relocation of Classification elements from the TRM into the corresponding ITSM process standard.

Classification: Impact
Definition: Measure of scope and criticality to business. Often equal to the extent to which an Incident leads to distortion of agreed or expected service levels.

Impact: High (at least one criterion must be met)
- A failure of an IT Business Service affecting multiple organizations (see Note 4)
- A failure affecting public safety
- A security-related incident affecting a large number of users across multiple organizations where total loss or compromise of critical business data may result
- A core network outage or a network outage affecting a mission-critical government location
- A failure affecting > 1000 users
- A failure that affects a money-back-guarantee public service offering
- Mission-critical applications fully unavailable
- Citizen-facing government websites

Impact: Medium (at least one criterion must be met)
Failure of an IT Business Service affecting a single organization, which may include:
- A network outage affecting business-critical government offices
- A security-related incident affecting a large number of users where work may be seriously impeded / interrupted within large groups or some business information may be at risk
- A failure or serious degradation affecting > 500 users
- A failure that affects a public-facing non-guaranteed service offering
- Failure of business-critical applications
- A failure affecting all users in a single organization

Impact: Low (at least one criterion must be met)
All remaining failures of IT Business Services, which may include:
- Single user(s)
- A small isolated group of users with a common failure (single application, location, a failure on one of several IT Business Services utilized)
- A security-related incident affecting a single or small number of users where some business data may be subject to limited compromise

Note 4: As it relates to an IT Business Service, an organization is deemed to be a Ministry.

Classification: Urgency
Definition: Measures how quickly an incident needs to be responded to based on the business needs of the customer.

Urgency: High
- A formal SLA is in place that specifies an IT restoration of service time of 4.5 hours or less
- A security threat exists or there is potential for severe or substantial impact
- A failure where a formal SLA has been breached or it is known that an SLA will be breached
- Response required includes an immediate and sustained effort using any / all available resources until the Incident is resolved
- Executive or VIP service interruptions

Urgency: Medium
- SLA/SLO specifies a restoration of IT service within the same business day
- A security threat exists with potential for moderate impact. Work may be impeded in small groups. There might be some compromise of data and/or lack of availability for a small number of systems.

Urgency: Low
- Single user(s)
- A small isolated group of users with a common failure (single application, location, a failure on one of several IT Business Services utilized)
- A security-related failure with potential for minimal impact

Priority Matrix (rows: Impact; columns: Urgency)

Impact \ Urgency    High    Medium    Low
High                P1      P2        NA
Medium              P2      P2        P3
Low                 P3      P3        P4
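The priority matrix lends itself to a simple lookup, corresponding to the "predetermined formula" mentioned in task 3.0. The sketch below is illustrative only: the names `PRIORITY_MATRIX` and `priority_for` are hypothetical, and representing the "NA" cell (High Impact, Low Urgency) as `None` is an assumption of this example.

```python
# Keys are (impact, urgency); values are taken from the priority matrix.
PRIORITY_MATRIX = {
    ("High", "High"): "P1",
    ("High", "Medium"): "P2",
    ("High", "Low"): None,      # "NA" in the matrix: combination not applicable
    ("Medium", "High"): "P2",
    ("Medium", "Medium"): "P2",
    ("Medium", "Low"): "P3",
    ("Low", "High"): "P3",
    ("Low", "Medium"): "P3",
    ("Low", "Low"): "P4",
}


def priority_for(impact, urgency):
    """Return the priority for an impact/urgency pair, or None for the NA cell."""
    try:
        return PRIORITY_MATRIX[(impact, urgency)]
    except KeyError:
        raise ValueError(f"Unknown impact/urgency: {impact!r}, {urgency!r}") from None
```

For example, `priority_for("Medium", "Low")` returns `"P3"`, matching the Medium row and Low column of the matrix.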
7. Glossary

Assignment: Assignment occurs when an incident is assigned by the ITSD to a Tier 2-N support group within the OPS to attempt incident resolution. The assigned support group must respond in accordance with the OPS Incident Management Process/Procedures and their actions may be directed by the OPS Incident Manager. (see Dispatch)

Customer: Someone who buys goods or Services. The Customer of an IT Service Provider is the person or group that defines and agrees the Service Level Targets. The term Customers is also sometimes informally used to mean Users, for example "this is a Customer-focused Organization".

Diagnostic Scripts: Documents used by the Service Desk to help classify and resolve incidents. These documents, based upon input from specialist support groups and suppliers, identify key questions to be asked to obtain details about what has gone wrong, with suggestions for resolution activities to be performed.

Dispatch: Dispatch occurs when the ITSD assigns an Incident to a Service Provider outside the OPS to attempt resolution. Provider behaviour is specified by an Underpinning Contract and the OPS Incident Manager does not have authority to direct the provider's activities other than coordination of activities between the provider and other OPS Support groups.

ECM: The Enterprise Change Management Process. OPS GO-IT Standard 38

Error: (Service Operation) A design flaw or malfunction that causes a Failure of one or more Configuration Items or IT Services. A mistake made by a person or a faulty Process that affects a CI or IT Service is also an Error.

Escalation: An Activity that obtains additional Resources when these are needed to meet Service Level Targets or Customer expectations. Escalation may be needed within any IT Service Management Process, but is most commonly associated with Incident Management, Problem Management and the management of Customer complaints. There are two types of Escalation: Functional Escalation and Hierarchical Escalation.

External Service Provider: An IT Service Provider that is part of a different Organization from its Customer. An IT Service Provider may have both Internal Customers and External Customers.

Functional Escalation: Transferring an Incident, Problem or Change to a technical team with a higher level of expertise to assist in an Escalation.

Hierarchical Escalation: Informing or involving more senior levels of management to assist in an Escalation.

Impact: A measure of the effect of an Incident, Problem or Change on Business Processes. Impact is often based on how Service Levels will be affected. Impact and Urgency are used to assign Priority.

Incident: An unplanned interruption to an IT Service or reduction in the Quality of an IT Service. Failure of a Configuration Item that has not yet affected Service is also an Incident, for example Failure of one disk from a mirror set.

Incident Management: The Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Customers as quickly as possible.
Incident Pattern: A pattern exists for each high level business service, to define how the ITSD interacts with OPS service chain partners such as Clusters, Ministries and corporate providers to resolve reported incidents.

Incident Record: A Record containing the details of an Incident. Each Incident Record documents the Lifecycle of a single Incident.

Internal Service Provider: An IT Service Provider that is part of the same Organization as its Customer. An IT Service Provider may have both Internal Customers and External Customers.

Ishikawa Diagram: A technique that helps a team to identify all the possible causes of a Problem. Originally devised by Kaoru Ishikawa, the output of this technique is a diagram that looks like a fishbone.

IT Service: A Service provided to one or more Customers by an IT Service Provider. An IT Service is based on the use of Information Technology and supports the Customer's Business Processes. An IT Service is made up from a combination of people, Processes and technology and should be defined in a Service Level Agreement.

Kepner & Tregoe Analysis: A structured approach to Problem solving. The Problem is analysed in terms of what, where, when and extent. Possible causes are identified. The most probable cause is tested. The true cause is verified.

Known Error (KE): A Problem that has a documented Root Cause and a Workaround. Known Errors are created and managed throughout their Lifecycle by Problem Management. Known Errors may also be identified by Development or Suppliers.

Known Error database: A database containing all Known Error Records. This database is created by Problem Management and used by Incident and Problem Management.

KE Record: A Record containing the details of a Known Error. Each Known Error Record documents the Lifecycle of a Known Error, including the Status, Root Cause and Workaround. In some implementations a Known Error is documented using additional fields in a Problem Record.

Operational Level Agreement (OLA): An Agreement between an IT Service Owner and another IT Service Owner within the same Organization. The other Service Owner provides services that support delivery of IT services to Service Owner A's customers. The OLA defines targets and responsibilities that are required to meet agreed Service Level Targets in an SLA. The OLA defines the goods or Services to be provided and the responsibilities of both parties. For example there could be an OLA:
- Between the IT Service Provider and a procurement department to obtain hardware in agreed times
- Between the Service Desk and a Support Group to provide Incident Resolution in agreed times

Process Manager: A Role responsible for Operational management of a Process. The Process Manager's responsibilities include Planning and coordination of all Activities required to carry out, monitor and report on the Process. There may be several Process Managers for one Process, for example regional Change Managers or IT Service Continuity Managers for each data centre.

Process Owner: A Role responsible for ensuring that a Process is Fit for Purpose. The Process Owner's responsibilities include sponsorship, Design, Change Management and continual improvement of the Process and its Metrics.
Process Service Level Objective (PSLO): A service level objective for a specific process task or metric, e.g.:
- Problem resolution will complete within x weeks, based upon problem classification.
- 70% of incidents will be linked to Problems.

Proactive Problem Management: Part of the Problem Management Process. The Objective of Proactive Problem Management is to identify Problems that might otherwise be missed. Proactive Problem Management analyses Incident Records, and uses data collected by other IT Service Management Processes to identify trends or significant problems.

Problem: A cause of one or more Incidents. The cause is not usually known at the time a Problem Record is created, and the Problem Management Process is responsible for further investigation.

Problem Management: The Process responsible for managing the Lifecycle of all Problems. The primary objectives of Problem Management are to prevent Incidents from happening, and to minimize the Impact of Incidents that cannot be prevented.

Problem Record: A Record containing the details of a Problem. Each Problem Record documents the Lifecycle of a single Problem.

Recovery Time Objective (RTO): Specifies the maximum tolerable service outage that can be sustained before consideration must be made to invoke Business Continuity or Disaster Recovery plans.

Release: A collection of hardware, software, documentation, Processes or other Components required to implement one or more approved Changes to IT Services. The contents of each Release are managed, tested, and deployed as a single entity.

RFC: Request For Change

Root Cause: The underlying or original cause of an Incident or Problem.

Root Cause Analysis: An Activity that identifies the Root Cause of an Incident or Problem.

Service: ITIL defines Service as a means of delivering value to customers by facilitating specific outcomes customers want to achieve without the ownership of specific costs and risks. GO-ITS 56.1 defines services within the OPS as functionality that can be directly consumed by an end-user. Relationships and obligations between Service Owners and Customers are documented in SLAs. (see Support Service)

Service Desk: The Single Point of Contact between the Service Provider and the Users. A typical Service Desk manages Incidents and Service Requests, and also handles communication with the Users.

Service Failure Analysis (SFA): An Activity that identifies underlying causes of one or more IT Service interruptions. SFA identifies opportunities to improve the IT Service Provider's Processes and tools, and not just the IT Infrastructure. SFA is a time-constrained, project-like activity, rather than an ongoing process of analysis. See also Root Cause Analysis.
Service Level Agreement (SLA): An Agreement between an IT Service Provider and a Customer. The SLA describes the IT Service, documents Service Level Targets, and specifies the responsibilities of the IT Service Provider and the Customer. A single SLA may cover multiple IT Services or multiple customers. (See also Operational Level Agreement and Underpinning Contract)

Service Level Objective (SLO): In the absence of a formally negotiated SLA, a Service Provider must define performance objectives for delivery and support of the service.

Service Owner: Member of a Service Provider organization, responsible for delivery of a specific service.

Service Manager: A manager who is responsible for managing the end-to-end Lifecycle of one or more IT Services.

Service Provider: An organization supplying Services to one or more Internal Customers or External Customers. Service Provider is often used as an abbreviation for IT Service Provider. Where there are several Service Providers that enable an overarching service, they are sometimes called Supply Chain (or Service Chain) Partners.

Support Model: Contains information required to support a specific service, including identification of support resources, classification elements, escalation contacts and service restoration targets. (This document contains some elements of what ITIL calls the Service Operations Plan.)

Support Service: Internal services that support a consumable Service. Support Services are typically not visible to end-users. Relationships and obligations between Service Support Owners and their Customers (Service Owners) are documented in OLAs and UCs. (see Service)

Trend Analysis: Analysis of data to identify time-related patterns. Trend Analysis is used in Problem Management to identify common Failures or fragile Configuration Items, and in Capacity Management as a Modelling tool to predict future behaviour. It is also used as a management tool for identifying deficiencies in IT Service Management Processes.

Underpinning Contract (UC): Contract between an OPS IT Service Provider and an external Third Party IT Service Provider. The Third Party provides goods or Services that support delivery of an IT Service to a Customer. The UC defines targets and responsibilities that are required to meet agreed Service Level Targets in an SLA.

Urgency: A measure of how long it will be until an Incident, Problem or Change has a significant Impact on the Business. For example a high Impact Incident may have low Urgency, if the Impact will not affect the Business until the end of the financial year. Impact and Urgency are used to assign Priority.

User: A person who consumes the IT Service on a day-to-day basis. Users are distinct from Customers, as some Customers do not use the IT Service directly.

Workaround: Reducing or eliminating the Impact of an Incident or Problem for which a full Resolution is not yet available, for example by restarting a failed Configuration Item. Workarounds for Problems are documented in Known Error Records. Workarounds for Incidents that do not have associated Problem Records are documented in the Incident Record.