SOP Unplanned Outages/Major Incidents

Table of Contents
Definitions
    Incident
Declaring
Incident Levels
    Priority 1
    Priority 2
    Priority 3
    Priority 4
Mission Critical Services
    Authentication
    Computer Labs
    Email
    Network
    Wireless Network
    Power Disruptions to Campus
    Storage Area Network (SAN) Disruption
Roles and Responsibilities of Team for Priority 1 Outages/Major Incidents
    CIO or Backup
    Incident Commander
    Technical Lead
    User Support Lead
    Communications Lead
    Field Coordinator
    Service Owners
Subject Matter Expert (SME)
Communications
    Internal Communications Options
    External Communications Options
Reporting
Meeting Location
Logistics
Execution of Work Steps for Priority 1 Major Incident
Incident commander duties
    General duties
    Incident commander timed checklist
Check List

Review the Emergency Prep roles and procedures!

Username: it
Password: Ic3

Definitions
Incident
An incident occurs whenever a user is not receiving the expected level of service from an IT service.
Expected levels of service are based on Service Level Agreements (SLA).
Major Incident/Outage
A major incident is defined as a significant event, which demands a response beyond the
routine, resulting from uncontrolled developments in the course of the operation of any establishment or
transient work activity.

Declaring
A major incident is declared when mission critical (university or internal) IT service(s) are not
performing at the expected level for a period of 30 minutes, unless defined differently in the SLA
or designated otherwise by this plan.
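
As a rough illustration, the declaration rule above can be written as a small check. This is a sketch only: the function and key names are hypothetical, the 10-minute override is taken from the data center power threshold listed under Mission Critical Services, and both the precedence between a plan override and an SLA value and the "at least" boundary are assumptions, since this plan does not state them.

```python
from datetime import timedelta
from typing import Optional

# Default trigger from this plan: 30 minutes of degraded service.
DEFAULT_TRIGGER = timedelta(minutes=30)

# Plan-specific overrides (hypothetical key name). The 10-minute value is the
# data center power threshold given under Mission Critical Services.
PLAN_OVERRIDES = {"datacenter_power": timedelta(minutes=10)}


def trigger_for(service: str, sla_trigger: Optional[timedelta] = None) -> timedelta:
    """Return the degradation time after which a major incident is declared."""
    if service in PLAN_OVERRIDES:   # designated otherwise by this plan (assumed to win)
        return PLAN_OVERRIDES[service]
    if sla_trigger is not None:     # defined differently in the SLA
        return sla_trigger
    return DEFAULT_TRIGGER


def should_declare(service: str, degraded_for: timedelta,
                   sla_trigger: Optional[timedelta] = None) -> bool:
    """True once the service has been degraded for at least the trigger period."""
    return degraded_for >= trigger_for(service, sla_trigger)
```

For example, `should_declare("email", timedelta(minutes=29))` is false, while the same call with 30 minutes is true.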

Incident Levels
Priority 1
Mission critical services are not performing for the University. All appropriate resources will be
dedicated to restore service(s).

Priority 2
Mission critical services are not performing for departments or computer labs. Service(s) is not
performing at a campus or enterprise level. Appropriate service owners will be dedicated to
restore service(s).

Priority 3
Address the problem and escalate as necessary. These incidents do not require the dedicated
resources of a Priority 1 or 2 incident.

Priority 4
There is a known workaround for the issue. Does not require dedicated resources to resolve.

Mission Critical Services


Authentication
Accepting authentication requests and responses for the following systems:
Blackboard
Campus desktop computers
Central IT maintained computer lab machines
CUSIS portal
MyUCCS Portal
Wireless System
Computer Labs
Disruptions to IT-maintained computer lab machines that prevent customers from using the
systems.

Email
Email messages not flowing in or out of the following systems:
Exchange on premise
Office 365 Cloud Solution
Note: If the service outage is determined to be exclusively a Microsoft issue
and UCCS IT personnel have no ability to participate in a resolution, then it
will not follow the full Major Critical procedures. Conceivably only the
communication plan will be followed.

Network
Disruptions to campus network systems to include:
Campus Firewalls
Campus Routing
Campus Switches
Connections into, out of, or within the El Pomar Data Center
Connections into, out of, or within the Columbine Data Center
Connections into, out of, or within Main Hall and Cragmor Hall
External internet connectivity
Wireless Network
Disruptions to the wireless system that prevent customers from using the network.

Power Disruptions to Campus


Any power disruption to the El Pomar or Columbine Data Centers lasting longer than 10
minutes.

Storage Area Network (SAN) Disruption


Any disruption to data flowing in or out of the campus SAN solution.

Roles and Responsibilities of Team for Priority 1 Outages/Major Incidents
CIO or Backup
Authorize resources for the major incident; communicate directly with the Chancellor and
UCCS Leadership team; and, if needed, communicate with the President's office.

Incident Commander
Coordinate plan; oversee response; lead meetings; organize meals; and provide funding.
See below for a detailed description.

Technical Lead
Examine situation; confirm major incident; attempt to identify root cause; work to find
technical options; present technical options to team; and participate in the plan where
needed.

User Support Lead


Provide information from the user’s perspective; provide user support options; contact
specialized users; and participate where needed.

Communications Lead
Create plan for messaging, including frequency; provide messaging to campus; update
UCCS.info; act as point person for internal communication; and participate where needed.

Field Coordinator
Provide information from the field; deliver support from the field; and participate where
needed. Note: depending on the major incident, this role may not be needed.

Service Owners
Provide information on services affected; work with the technical lead to create options for
the plan of action.

Subject Matter Expert (SME)


An individual with a high level of overall knowledge of the service impacted, both in
terms of general architecture and business service provided.

Communications
Internal Communications Options
Communications should be sent out from the helpdesk@uccs.edu email address if possible.
outage@uccs.edu - hosted on lists.uccs.edu (Communigate server; local infrastructure must
be working) (Texting and Email)
outage@uccs.info - hosted through Bluehost.com (Texting and Email)
uccshelpdesk@gmail.com - help desk communications sent when Exchange is not available
CenturyLink Conferencing Audio Conferencing
USA: 1-720-279-0026
USA /Canada (toll free): 1-877-820-7831
1. This will be the Major Incident main line:
Web / sharing desktops
GotoMeeting.com

External Communications Options


Students - student-1@uccs.edu (lists.uccs.edu is required)
Faculty - faculty-l@uccs.edu (lists.uccs.edu is required)
Staff – staff-l@uccs.edu (lists.uccs.edu is required)
UCCS.info {needing login information} – Automatically posts to IT Twitter
UIS – itccop@cu.edu
Housing Email Lists:
Summit-l@uccs.edu
Alpine_l@uccs.edu
Timberline-l@uccs.edu

UCCS leadership - Only CIO or backup communicates with leadership team
University Relations
Hutton, Tom. . . .719-255-3439
Executive Director
University Advancement - University Communications and Media Relations
MAIN 301A
thutton@uccs.edu

UCCS Twitter
Denman, Philip. . . .719-255-3732
Assistant Director
University Advancement - University Communications and Media Relations
MAIN 301
pdenman@uccs.edu

Website Alerts {Craig needing information for posting in Ingeniux or how this should be
handled}

Rave (Must first check with Tim Stoecklein before posting a message with the system)
Stoecklein, Tim. . . .719-255-3106
Program Director of Emergency Management
Public Safety Department - Emergency Management
DPS 208
tstoeckl@uccs.edu

Phones
Help Desk ACD message
Sidecars if necessary

Media
University Relations will be the only organization allowed to speak to the media.

Reporting
When to report
Who to report to
UIS
Other CU campuses
Chancellor's office
President's office

Meeting Location
EPC 139, IT Conference Room
Location needs:
Phone
Laptop/Projector
White board
Table
Room and chairs for 10 people
Extra Ports
Power

Logistics
Review Mission Critical Services
Communications expectations plan
Communication templates
Define essential personnel and backups
Personnel expectations during major incident and after
Essential personnel are expected to participate in the major incident/outage response. If the
incident is after hours, essential personnel are expected to participate if available.
Working Time:
16 hours working maximum, or until 2 a.m.
At the 14-hour mark, or at midnight, as appropriate, the technical lead must start to
create a plan for providing rest to employees.
Discuss breaks/meals every 4 hours. Food/drink coordination.

After a major incident/outage is resolved, if work was conducted after normal business
hours, employees will be given hour-for-hour flex time. The employee is expected to take
the time, and it must be used within one month of when the work was performed.
The Incident Commander will work with the employee’s supervisor to coordinate flex time.
Equipment needs
Equipment needs shall be coordinated by the Incident Commander.
Funding
Will be coordinated by the Incident Commander.
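
The working-time limits listed under Working Time can be sketched as a small calculation. The sketch assumes that "or" means whichever limit comes first, that times are local wall-clock times, and that breaks are counted from the start of the shift. None of this is stated in the plan, and the function names are illustrative only.

```python
from datetime import datetime, time, timedelta


def next_clock_time(after: datetime, at: time) -> datetime:
    """Return the first occurrence of wall-clock time `at` strictly after `after`."""
    candidate = datetime.combine(after.date(), at)
    if candidate <= after:
        candidate += timedelta(days=1)
    return candidate


def work_limits(shift_start: datetime) -> dict:
    """Compute rest-planning, stop, and break times for one person's shift."""
    # Rest planning starts at hour 14 or midnight (assumed: whichever is first).
    rest_plan_at = min(shift_start + timedelta(hours=14),
                       next_clock_time(shift_start, time(0, 0)))
    # Work stops at 16 hours or 2 a.m. (assumed: whichever is first).
    stop_at = min(shift_start + timedelta(hours=16),
                  next_clock_time(shift_start, time(2, 0)))
    # Break/meal discussion every 4 hours until the stop time.
    breaks = []
    t = shift_start + timedelta(hours=4)
    while t < stop_at:
        breaks.append(t)
        t += timedelta(hours=4)
    return {"rest_plan_at": rest_plan_at, "stop_at": stop_at, "breaks": breaks}
```

Under these assumptions, a shift that starts at 8:00 a.m. gives a rest-planning time of 10:00 p.m. and a stop time of midnight.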

Execution of Work Steps for Priority 1 Major Incident

Service restoration target is two hours for a Priority 1 Major Incident.

Task Description Time


1. Notification and Confirmation
Time: Incident has passed the trigger points
Priority 1 Major Incident has been identified.

2. Notify IT Outage Group
Time: Within 10 minutes

Send email or text message to outage@uccs.info or
outage@uccs.edu
Text Template:
UCCS IT internal alert
(ServiceName) is experiencing a service interruption. See
email for more details.
Sent by: (Name) (PhoneNumber)

Email Template:
Subject:
(ServiceName) is experiencing a service interruption.

Body:
Current symptoms include:

- SYMPTOM1

Known workarounds include:

- WORKAROUND1

IT is working to restore service and will provide more
information as it becomes available. The next communication
will be sent by XX:XX a.m./p.m.

3. Contact Computing Services Directors
Time: Within 10 minutes of initial contact

Kirk Moore
Cell: (719)238-9451
House: (719)282-1887
Email: kmoore@uccs.edu or uccsit@moorei.com

Greg Williams
Cell: (719)237-6491
House:(719)481-1290
Email: gwillia5@uccs.edu or

If directors have not been reached, contact associate directors:
Rob Garvie
Cell: (719)439-1724
House: (719)266-8525
Email: rgarvie@uccs.edu

Mike Belding
Cell: (719)338-9776
House: (719)260-6794
Email: mbelding@uccs.edu

If directors or associate directors have not been reached, contact CIO:
Jerry Wilson
Cell: (719)440-2215
House: (719)599-4752
Email: jwilson@uccs.edu

4. Open Technical Bridge
Time: Within 15 minutes of initial contact

1-720-279-0026
(toll free): 1-877-820-7831

Host and Guest passcodes stored in LastPass under Incident Response

5. Initial Triage
Time: Within 20 minutes of initial contact

1. Start assessment
2. Replicate issue
3. Review monitoring and logs
4. Try to identify workarounds
If no quick solution or workaround is discovered, then declare a
Priority 1 Incident.

6. Declare Priority 1 Incident
Time: Within 30 minutes of initial contact

1. Select Incident Commander
2. Start calling in personnel to meeting location or
conference bridge
3. Define roles of team
4. Start external communication
Set Message on UCCS.info using Email Template

Text Template:
UCCS IT Alert: SERVICENAME is experiencing a
service interruption. IT is working to restore service and
will provide more information as it becomes available.

Email Template:
Subject:
UCCS IT Alert: SERVICENAME is experiencing a service
interruption.

Body:
Current symptoms include:

- SYMPTOM1

Known workarounds include:

- WORKAROUND1

IT is working to restore service and will provide more
information as it becomes available. The next communication
will be sent by XX:XX a.m./p.m.

5. Communicate to CIO or person acting as backup, and
they will contact the Chancellor’s office.
6. Determine if service should remain active or be brought
down.
7. Determine whether individuals aiding in incident
restoration should convene in person to aid in restoration
efforts.
8. Determine whether vendor involvement or escalation is
required.
9. If incident resolution is not expected within 15 minutes,
establish time frame for next status update.
10. Continue to facilitate conversation as appropriate to
ensure focus is on restoring service.

7. While Priority 1 Incident is occurring
Time: Every 60 minutes

1. Request current status of restoration efforts.
2. Instruct communication lead to send a notice using the following
templates:
Email Template:
Subject:
UCCS IT Alert: SERVICENAME is experiencing a service
interruption.

Body:
Current symptoms include:

- SYMPTOM1

Known workarounds include:

- WORKAROUND1

Update: UCCS IT is working to restore service and will
provide more information as it becomes available. The next
communication will be sent by XX:XX a.m./p.m.

8. Upon Service Restoration
Time: Upon service restoration

1. Request the technical lead to verify that service has been
restored and report on the current state of operation.
2. Confirm the communications lead will:
• Update UCCS.info
• Remove ACD message on Help Desk phone line.
• Send an incident restoration message with the following
content:

Email Template:
Subject:
UCCS IT Alert: SERVICENAME Service Now Available

Body:
UCCS IT has identified and addressed the root cause. As of
XX:XX a.m./p.m. service has been restored. Thank you for
your cooperation while we worked to resolve this issue.

3. The technical manager or director of the failing service
or component that caused the incident will:
• Hold a debrief meeting
• Prepare and deliver incident report to CIO,
Directors, and campus IT partners within
three business days.
4. Formally state to participants on the technical bridge that
the incident status is downgraded, everyone is standing
down and the technical bridge is being closed.
5. Document resolution and close incident management
(IM) ticket.
6. Open problem management ticket to track ongoing root
cause analysis efforts and document any known
workarounds.
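
The time limits attached to steps 2 through 6 above can be tracked with a simple table of offsets. The sketch below assumes every limit is measured from the time of initial contact (step 2 does not say so explicitly) and uses hypothetical function names; it is an illustration, not part of the procedure.

```python
from datetime import datetime, timedelta

# Step names and limits as listed in the work steps, in minutes.
STEP_LIMITS_MINUTES = [
    ("Notify IT Outage Group", 10),
    ("Contact Computing Services Directors", 10),
    ("Open Technical Bridge", 15),
    ("Initial Triage", 20),
    ("Declare Priority 1 Incident", 30),
]


def step_deadlines(initial_contact: datetime) -> list:
    """Return (step name, due time) pairs measured from initial contact."""
    return [(name, initial_contact + timedelta(minutes=minutes))
            for name, minutes in STEP_LIMITS_MINUTES]


def overdue_steps(initial_contact: datetime, now: datetime, completed: set) -> list:
    """Return names of steps that are past due and not yet completed."""
    return [name for name, due in step_deadlines(initial_contact)
            if name not in completed and now > due]
```

With initial contact at 9:00 a.m. and only the outage group notified, a check at 9:16 a.m. would report the director contact and the technical bridge as overdue.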

Close Major Incident
1. Hold debrief meeting within three days
2. Prepare Major Incident Response report within five days
with the help from those participating
3. {Rachel - needing report template}
4. Distribute report
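
The close-out deadlines can be computed from the resolution date, as in the sketch below. It assumes the debrief and Major Incident Response report limits are calendar days (the plan does not say), treats the incident report from step 8 as three business days with only weekends skipped (holidays are ignored), and uses hypothetical function and key names.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping Sat/Sun."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def closeout_deadlines(resolved_on: date) -> dict:
    """Return due dates for the post-incident deliverables."""
    return {
        # Step 8: incident report within three business days.
        "incident_report": add_business_days(resolved_on, 3),
        # Close Major Incident: assumed calendar days.
        "debrief_meeting": resolved_on + timedelta(days=3),
        "major_incident_response_report": resolved_on + timedelta(days=5),
    }
```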

Incident commander duties


An incident commander serves to keep an incident project on track for process, maintain focus on the
problems, facilitate analysis and interactions, and verify that the incident response team’s needs are being
met (resources, information, etc.). To that end, the individual has several duties outlined below.
General duties
• Open the incident phone bridge line:
1-720-279-0026
(toll free): 1-877-820-7831
Host Passcode: 9694542
Guest Passcode: 321592
• Monitor incident phone bridge line or assign duty.
• Assign a participant to set up any required A/V resources (projecting monitoring data, etc.).
• Briefly recap incident process at the beginning of incident room level events.
• Coordinate efforts within the room to minimize confusion and reduce the risk of inadvertent or
simultaneous changes.
• Draw focus back together when conversations become unproductively fragmented.
• Document notable events and steps taken in a visible incident log (to be recorded electronically
by designated in-room scribe).
• Solicit approval and/or consensus on decisions to bring services up/down and to make changes to
production services.
• Ensure that the appropriate UIS employees and vendors are engaged and working the issue,
tasking people to escalate as needed.
• Initiate brainstorming during troubleshooting and ensure that identified paths of investigation
(hypotheses) or actions are assigned to individuals and given an order/priority.
• Facilitate communications efforts (both to broad groups of customers and executives) by ensuring
that the appropriate communicators have timely and accurate information.
• Record employees’ hours worked and ensure they take their flex time.

Incident commander timed checklist
• Every 30 minutes:
    o Request status from teams working issues.
        - What current hypotheses are being investigated, and have any been eliminated or
          verified?
        - What actions have been completed or are in progress?
    o Provide a verbal update within the incident room and update the incident log.
• Every hour:
    o Check in with communications staff regarding next status update steps.
    o If the list of hypotheses has been exhausted, initiate a new cycle of brainstorming,
      documenting, assigning tasks, etc.
• At 11:30 a.m. and 5:30 p.m.:
    o Request that business office (if available) order some food for those working the issue in
      the incident room. Be sure to cover dietary needs (vegetarian, etc.).
    o Encourage participants to use mealtime as an opportunity to leave the room for a little
      while, allowing for coverage if needed.
• At 9 p.m.:
    o Request that directors/managers begin their plans for staff rotation during the night if
      ongoing work is required.
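
The timed checklist above can be turned into a list of reminders for a given time window. This is a minimal sketch: it assumes the 30-minute and hourly intervals are counted from the start of the incident, and the function name and prompt wording are illustrative rather than part of the procedure.

```python
from datetime import datetime, time, timedelta

STATUS = "Request status from teams and update the incident log"
COMMS = "Check in with communications staff on the next status update"
FOOD = "Ask the business office to order food"
ROTATION = "Ask directors/managers to plan overnight staff rotation"

# Fixed clock-time prompts from the checklist.
FIXED_PROMPTS = [(time(11, 30), FOOD), (time(17, 30), FOOD), (time(21, 0), ROTATION)]


def commander_prompts(start: datetime, end: datetime) -> list:
    """Return sorted (time, prompt) pairs that fall between start and end."""
    prompts = []
    # Recurring prompts, assumed to be counted from the incident start.
    tick = 1
    at = start + timedelta(minutes=30)
    while at <= end:
        prompts.append((at, STATUS))
        if tick % 2 == 0:  # every second half-hour tick is the hourly check
            prompts.append((at, COMMS))
        tick += 1
        at += timedelta(minutes=30)
    # Fixed clock-time prompts on each calendar day the window spans.
    day = start.date()
    while day <= end.date():
        for clock, label in FIXED_PROMPTS:
            when = datetime.combine(day, clock)
            if start <= when <= end:
                prompts.append((when, label))
        day += timedelta(days=1)
    return sorted(prompts)
```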

Check List
1. Notification of Priority 1 incident
2. Confirmation of major incident/outage
3. Priority 1 incident has been determined
4. If Priority 1 incident
    a. Has incident crossed trigger points?
        i. No – continue to monitor situation
        ii. Yes
            1. Create problem in Cherwell
            2. Determine which individuals are needed to evaluate the situation
            3. Define roles for individuals participating
            4. Tools
                a. LastPass cloud service for password management
                   www.lastpass.com
                b. Monitoring
                c. Testing environment
            5. Build action plan:
                a. Define scope / timeframe
                b. Develop technical plan
                c. Define personnel needed
                d. Determine return on investment
                e. Assign tasks
            6. Communication plan
                a. How do we communicate with each other?
                b. How and with whom do we communicate externally?
                c. Recording communications
                d. Confirming communications postings
                e. How often do we need to communicate?
                f. Communicating to UCCS Leadership (Role of CIO)
            7. Document ongoing progress and issues, record in Cherwell
            8. Resolved
                a. Documenting issue, response and fix
            9. Closing response
                a. Hold debrief meeting within three days
                b. Prepare Major Incident Response report within five days with
                   the help from those participating
                c. {Rachel - needing report template}
                d. Distribute report
