April 6, 2005
V1.12
Revision History
Version | Date | Author | Notes
1.08 | 23 Feb 2005 | Nan McKenna | (Initial tracked version)
1.09 | 15 Mar 2005 | Erik Cummings | Extract “return to work” as Appendix C, add proposed 15/30 minute response times. Add “Revision History” page.
1.10 | 22 March 2005 | Erik Cummings | Differentiate between “Initial PCG Incident Classification” and “Final Incident Classification”. Added PCG Process Flowchart. Updated Revision History Table.
1.11 | 23 March 2005 | Bruce Campbell | From header, removed “Draft”. In header, body of document, moved “Operations Excellence” to top left margin, placed “Incident Management Process” top, right margin. Re-applied styles, numbering, and organization. Added ‘On-Call’ to Appendix E. Added ‘Management On-Call’ to Appendix F. Turned off numbering in Appendixes E & F. Re-organized appendixes so that process flow diagrams were “one-after-the-other”. Updated references to the various appendixes throughout document.
1.12 | 04 April 2005 | Erik Cummings | Reworded Section 2.3 a ‘Note’. Removed Appendix C (PCG Process). Renumbered Appendix D to C and any references to it. Changed Appendix D (now C: Communications Matrix). Removed Contact Action, 1st and 2nd Level Notification columns. Added Client Comm Interval and SME Work Started columns. Added new Appendix D: Priority and Internal Response Time Commitments. Added new definitions: Priority, Impact, Urgency.
Table of Contents
Revision History
Table of Figures
List of Tables
1.0 Executive Summary
1.1 Document Contents
1.2 Intended Audience
2.0 Background
2.1 Primary Responsibilities of the Production Control Group
2.2 Incident reporting and escalation techniques will:
2.3 Additional Responsibilities of the Production Control Group
3.0 Roles and Definitions
4.0 Process Review
4.1 Process Outline
4.2 Incident Detection and Reporting
4.3 Incident Level Classification: See Appendices C and D
4.4 Incident Notification
4.5 Incident Escalation
4.6 Incident Resolution
4.7 Post-Incident Activities
5.0 Detailed Incident Control Process
5.1 Detailed Process Flow Explanation Table. Reference Appendix A
6.0 High-Level Incident Process Explanation
6.1 Detailed Process Explanation: See Appendix B
7.0 Outstanding Issues
7.1 A common paging system is required
7.2 Definition of Service Hours
7.3 Definition of “availability,” “outage,” and “service degradation”
7.4 Service-level procedures for client notification
Appendix A Incident Management Process Flowchart
Appendix B High-Level Incident Management Process Flow
Appendix C Incident Level Communications Matrix
Appendix D Priorities and Internal Response Times
Appendix E On-Call Guidelines
Guideline Purpose
Duties
Responsibilities
Communications
Communications Elements
Notification Protocol
Initial Communications Tracking
Response Protocol
Scheduling
Appendix F Management On-Call Guidelines
Return-To-Work Guidelines
Table of Figures
Figure 1 Incident Detection and Reporting
List of Tables
Table 1 Revision History
2.0 Background
Most services supported by ITSS are expected to be available 24x7. Given this
expectation, it is in the best interest of the ITSS Shared Services workgroups, and of
ITSS as a whole, to establish a combined staff, the Production Control Group (PCG),
dedicated to proactively managing and responding to events as they occur.
Eventually, the role of the PCG will include evaluating incidents and, depending on the
severity of the event, escalating them to upper management. In some situations, more
experienced technical personnel will take action to effect repairs and/or restore
services.
As the PCG acquires experience, and as ITSS adds monitoring and troubleshooting
capability, the PCG will assume additional incident response responsibilities.
2.1 Primary Responsibilities of the Production Control Group
a. Managing and controlling a widespread service outage, including incident
reporting and escalation.
2.2 Incident reporting and escalation techniques will:
a. Specify a point-of-contact (owner) for all issues and ensure that services
are restored through the prudent use of departmental resources, including
documentation of the incident from its beginning to its resolution.
b. Effectively manage the communication of information within ITSS when
there are issues that actually or potentially impact ITSS-supported
services or facilities.
c. Proactively respond to issues that impact ITSS-supported services and
facilities; evaluate, classify, escalate, and manage service restoration
efforts as efficiently and expeditiously as possible, up through incident
resolution.
2.3 Additional Responsibilities of the Production Control Group
Note: It is anticipated that any single shift of the PCG will NOT be
consumed by continuously resolving issues. Because of this, the
supplemental duties and tasks detailed below will be assigned.
1. Assist offsite Subject Matter Experts by performing requested tasks,
such as visual inspections of hardware and recycling the power on
equipment as instructed.
2. Manage and prepare magnetic media for rotation, offsite shipment,
and storage, including organizing and filing transmittal logs.
3. Control building and facility access, and escort vendors to restricted
areas for the purposes of inspection, maintenance, and repair of equipment.
4. Monitor building, facility, and data center environmentals, such as air
conditioning, fire suppression system, and lighting, and log the times
and results of the monitoring activity.
5. After normal working hours, perform 1st tier triage of reported issues,
then classify and escalate as necessary.
6. Receive and log calls from end users, generate Remedy tickets, and
escalate as necessary.
7. Set up Video/Telephone conferences.
Message information will include: the date and time, a brief description of
the problem, and if available, the estimated time of resolution/restoration.
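The message contents listed above can be sketched as a small data structure. This is only an illustration, not the format ITSS actually uses: the class name `IncidentMessage`, its field names, and the rendered layout are all hypothetical. The only part taken from the text is the set of fields: date and time, a brief description, and an optional estimated time of resolution/restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class IncidentMessage:
    """Hypothetical container for the message fields named in the text."""
    occurred_at: datetime                              # the date and time
    description: str                                   # brief description of the problem
    estimated_restoration: Optional[datetime] = None   # included only if available

    def render(self) -> str:
        """Return a plain-text message; the layout is an assumption."""
        lines = [
            f"Date/Time: {self.occurred_at:%Y-%m-%d %H:%M}",
            f"Problem: {self.description}",
        ]
        if self.estimated_restoration is not None:
            lines.append(
                f"Estimated restoration: {self.estimated_restoration:%Y-%m-%d %H:%M}"
            )
        return "\n".join(lines)
```

A message built without an estimated restoration time simply omits that line, which matches the "if available" wording above.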
[Figure 1: Incident Detection and Reporting (Appendix A, Incident Management Process Flowchart). Box labels: 1 Report Problem: HelpSU/5HELP; 2 Report Problem: Directly To PCG; 3 Calls 5-HELP After Hours; 4 and 5 (labels not recoverable); 6 Resolve?; 7 Urgent?; 8 Forward To SME (Help Desk Tier 2) For Additional Analysis; 9 Resolve Quickly?; 10 Enter Solution In Remedy; 11 Forward Directly To PCG; 12 Determine Severity Level; 13 Enter Incident Ticket In Remedy.]
[Figure: High-Level Incident Management Process Flow (Appendix B). Stage labels: Detection & Reporting, Notification, Resolution, Post Incident Activities. Other labels: Communicate, Subject Matter Expert, Monitoring, Production Control Group, System Status, Client Self-Service, itss-service-alert@lists, down.stanford.edu, 7-DOWN, End User, Remedy Database, Update With Solution Information, SME, Duty Manager, Line Manager, Account Manager, PCG Manager.]
* Note: This column indicates the maximum amount of time that will elapse before a technician begins
working on an Incident. Times will generally be much faster for all severities.
Note: The following table refers to Priority, not to Urgency or Impact. Priority is derived from the
combination of Urgency, Impact, and the existing Service Level Commitments for the service in question.
This distinction is important: Urgency is offered by the customer, whereas Priority is assigned by the
Helpdesk, PCG, and/or SME involved, from a system-wide perspective.
Usage: These Priority levels (and the associated Urgency and Impact values) are used to track
incidents as they are reported and worked on. Priority, Urgency, and Impact each correspond directly
to Remedy ticket fields.
Priority | Description | Committed Service Hours | PCG Call Initiate | SME Call Response | Escalation Interval | SME Work Started
Urgent | A major service outage with significant and immediate business impact and no workaround. Large number of users; outage of significant length; no available workaround; mission/business critical. | 24x7 | Immediate | 15 minutes | 10 minutes | 30 minutes
High | A major service outage or degradation with significant business impact and an unsustainable workaround. Multiple users; work performance reduced; mission/business critical. | 24x7 | Immediate | 15 minutes | 10 minutes | 1 hour
Medium | A service outage or degradation with an acceptable workaround. Service-affecting; minimal performance degradation; affects non-critical business function. | 8-5, M-F | Ticket Assignment/eMail | As appropriate (work begins, work update, work completed) | Standard SME Group Remedy settings | 4 business hours
Low | Non service-affecting. Cosmetic problem; system information enhancement required. | 8-5, M-F | Ticket Assignment/eMail | As appropriate (work begins, work completed) | Standard SME Group Remedy settings | 1 business day
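The commitments in the table above could be encoded as a simple lookup, for example to compute the latest time by which SME work should have started on a ticket. The sketch below is illustrative only: the names `Commitment`, `COMMITMENTS`, and `sme_work_deadline` are hypothetical and are not Remedy fields or ITSS tooling. It computes a deadline only for the 24x7 priorities (Urgent and High); the Medium and Low values are kept as text because a business-hours calendar is not modeled here.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Union


class Commitment(NamedTuple):
    """One row of the priority table (hypothetical structure)."""
    service_hours: str
    sme_call_response: Optional[timedelta]     # None where the table says "As appropriate"
    escalation_interval: Optional[timedelta]   # None where Remedy group settings apply
    sme_work_started: Union[timedelta, str]    # text for business-time commitments


# Values copied from the table; the dictionary itself is an assumption.
COMMITMENTS = {
    "Urgent": Commitment("24x7", timedelta(minutes=15), timedelta(minutes=10), timedelta(minutes=30)),
    "High": Commitment("24x7", timedelta(minutes=15), timedelta(minutes=10), timedelta(hours=1)),
    "Medium": Commitment("8-5, M-F", None, None, "4 business hours"),
    "Low": Commitment("8-5, M-F", None, None, "1 business day"),
}


def sme_work_deadline(priority: str, reported_at: datetime) -> Optional[datetime]:
    """Latest SME work start for 24x7 priorities; None when business hours apply."""
    limit = COMMITMENTS[priority].sme_work_started
    if isinstance(limit, timedelta):
        return reported_at + limit
    return None  # Medium/Low need a business-hours calendar, which is not modeled
```

For an Urgent incident reported at 09:00, the function returns 09:30 the same day; for Medium or Low it returns `None`, leaving the business-hours arithmetic to whatever calendar the service uses.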