
Incident Management Process:

24x7 Response and Control

April 6, 2005

V1.12
Revision History

Version  Date           Author          Notes
1.08     23 Feb 2005    Nan McKenna     (Initial tracked version)
1.09     15 Mar 2005    Erik Cummings   Extract “return to work” as Appendix C, add proposed
                                        15/30 minute response times. Add “Revision History”
                                        page.
1.10     22 March 2005  Erik Cummings   Differentiate between “Initial PCG Incident
                                        Classification” and “Final Incident Classification”.
                                        Added PCG Process Flowchart
1.11     23 March 2005  Bruce Campbell  Updated Revision History Table
                                        From header, removed “Draft”
                                        In header, body of document, moved “Operations
                                        Excellence” to top left margin, placed “Incident
                                        Management Process” top, right margin
                                        Re-applied styles, numbering, and organization
                                        Added ‘On-Call’ to Appendix E
                                        Added ‘Management On-Call’ to Appendix F
                                        Turned-off numbering in Appendixes E & F
                                        Re-organized appendixes so that process flow
                                        diagrams were “one-after-the-other”
                                        Updated references to the various appendixes
                                        throughout document
                                        Reworded Section 2.3 as a ‘Note’
1.12     04 April 2005  Erik Cummings   Removed Appendix C (PCG Process)
                                        Renumbered Appendix D as C and any references to
                                        it.
                                        Changed Appendix D (now C! - Communications
                                        Matrix). Removed Contact Action, 1st and 2nd Level
                                        Notification columns. Added Client Comm Interval
                                        and SME Work Started columns.
                                        Added new Appendix D – Priority and Internal
                                        Response Time Commitments
                                        Added new definitions – Priority, Impact, Urgency

Table 1 Revision History

8/8/2008 v1.12 Page ii


Table of Contents

Revision History
Table of Figures
List of Tables
1.0 Executive Summary
    1.1 Document Contents
    1.2 Intended Audience
2.0 Background
    2.1 Primary Responsibilities of the Production Control Group
    2.2 Incident reporting and escalation techniques will:
    2.3 Additional Responsibilities of the Production Control Group
3.0 Roles and Definitions
4.0 Process Review
    4.1 Process Outline
    4.2 Incident Detection and Reporting
    4.3 Incident Level Classification: See Appendices C and D
    4.4 Incident Notification
    4.5 Incident Escalation
    4.6 Incident Resolution
    4.7 Post-Incident Activities
5.0 Detailed Incident Control Process
    5.1 Detailed Process Flow Explanation Table. Reference Appendix A
6.0 High-Level Incident Process Explanation
    6.1 Detailed Process Explanation: See Appendix B
7.0 Outstanding Issues
    7.1 A common paging system is required
    7.2 Definition of Service Hours
    7.3 Definition of “availability,” “outage,” and “service degradation”
    7.4 Service-level procedures for client notification
Appendix A Incident Management Process Flowchart
Appendix B High-Level Incident Management Process Flow
Appendix C Incident Level Communications Matrix
Appendix D Priorities and Internal Response Times
Appendix E On-Call Guidelines
    Guideline Purpose
    Duties
    Responsibilities
    Communications
    Communications Elements
    Notification Protocol
    Initial Communications Tracking
    Response Protocol
    Scheduling
Appendix F Management On-Call Guidelines
    Return-To-Work Guidelines



Table of Figures

Figure 1 Incident Detection and Reporting
Figure 2 High-Level Incident Management Process Flow



List of Tables

Table 1 Revision History
Table 2 Detailed Incident Control Process
Table 3 Explanation of High-Level Incident Management Process Flow
Table 4 Incident Level Classification Matrix
Table 5 Return-To-Work Guidelines



Operations Excellence Incident Management Process

1.0 Executive Summary


1.1 Document Contents
1.a. This document contains the processes through which the new
Production Control Group will quickly and efficiently respond to,
manage, and resolve incidents. It includes on-call definitions and
guidelines, escalation processes, process flow diagrams, and data
tables. It also sets general expectations, defines roles and
responsibilities, and provides general guidelines.
1.2 Intended Audience
1.a. This document is intended for executive-level and management
personnel and for ITSS personnel involved in this process, such as
Subject Matter Experts (SMEs), Technical Leads, Line Managers,
Systems Administrators, DBAs, project leaders, and facilities personnel.


2.0 Background
It is expected that most services supported by ITSS are available 24x7. Given this
expectation, it is in the best interest of the ITSS Shared Services workgroups and of ITSS
as a whole to establish a combined staff, the Production Control Group
(PCG), dedicated to proactively managing and responding to events as they occur.
Eventually, the role of the PCG will include evaluating incidents and, depending on the
severity of the event, escalating them to upper management. In some situations, the more
experienced technical personnel will take action to effect repairs and/or restore
services.
As the PCG acquires experience, and as ITSS adds monitoring and troubleshooting
capability, the PCG will assume additional incident response responsibilities.
2.1 Primary Responsibilities of the Production Control Group
1.a. Managing and controlling a widespread service outage, including incident
reporting and escalation.
2.2 Incident reporting and escalation techniques will:
1.a. Specify a point of contact (owner) for all issues and ensure that services
are restored through the prudent use of departmental resources, including
documentation of the incident from its beginning to its resolution.
1.b. Effectively manage the communication of information within ITSS when
there are issues that actually or potentially impact ITSS-supported
services or facilities.
1.c. Proactively respond to issues that impact ITSS-supported services and
facilities; evaluate, classify, escalate, and manage service restoration
efforts as efficiently and expeditiously as possible, up through incident
resolution.
2.3 Additional Responsibilities of the Production Control Group
1.a. Note: It is anticipated that a single shift of the PCG will NOT be
fully occupied by resolving issues. Because of this, the
supplemental duties and tasks detailed below will be assigned.
1 Assist offsite Subject Matter Experts by performing requested tasks,
such as visual inspections of hardware and cycling the power on
equipment as instructed.
2 Manage and prepare magnetic media for rotation, offsite shipment,
and storage, including organizing and filing transmittal logs.
3 Control building and facility access, and escort vendors to restricted areas
for the purposes of inspection, maintenance, and repair of equipment.
4 Monitor building/facility/data center environmentals, such as air
conditioning, fire suppression system, and lighting; log the times
and results of the monitoring activity.
5 After normal working hours, perform first-tier triage of reported issues;
classify and escalate as necessary.
6 Receive and log calls from end users, generate Remedy tickets, and
escalate as necessary.
7 Set up Video/Telephone conferences.
8 Accept and sign for emergency delivery of replacement parts from
vendors.
9 Perform other tasks deemed necessary by department supervision.
3.0 Roles and Definitions
• Account Manager – A member of the ITSS Account Management team in Client
Support who is responsible for the relationship with one or several key clients (e.g.
GSB, H&S, Libraries)
• Client – A primary paying customer of ITSS services and support
• End User – Person who directly uses a service. An end user can be internal or
external to ITSS. End users are directly impacted during an outage, and generally
have an established relationship with the Client or Service Owner
• Impact – Level of effect or impact on the Stanford Campus. This is relative to the
Campus as a whole, not specifically to the client. (Values= Campus-Wide, Major –
School or Dept wide, Minor – Group or Single User, and Non-Service Affecting)
• Incident Manager – The Shared Services Line Manager who is designated as
responsible for a specific incident
• Incident/Event/Problem/Issue – For the purposes of this document, these terms
are intended to mean a failure of any component of any system or service, and are
used interchangeably throughout this document
• ITSS Client Support – Group which does client relations, account management,
functional analysis, sales & marketing, documentation, software licensing, end user
training, and Help Desk and CRC support
• ITSS Engineering and Projects – Group which does technology R&D, service
enhancements, new product and service projects
• ITSS Shared Services – Group which does operations
• ITSS Strategic Planning – Includes technology strategy & architecture and finance
groups
• Line Manager – Workgroup managers in ITSS Shared Services
• On-Call Subject Matter Expert (SME) – SME (see below) who is designated to be
available to respond to reported outages, triage the incident, perform the needed
tasks to restore services, assist other workgroups in the restoration process, or
determine which other members within their own workgroup are needed to assist in
service restoration
• Operations Owner – The ITSS staff person who has the ultimate authority for a
service including its functionality and approval for any changes to the service
• Priority – Level of response and effort directed towards resolving an incident. It is
determined by the inherent service level commitment of the service, as well as a
combination of Urgency and Impact. Priority is sometimes referred to as “severity”.
(Values = Urgent, High, Medium, Low)
• Product Manager – Owns product quality and client satisfaction for a service
• Production Control Group (PCG) – Group which will perform monitoring and basic
problem determination and evaluation, escalation, communication and in some
cases, incident resolution
• Subject Matter Expert (SME) – Any technical ITSS staff person whose job requires
extensive technical knowledge of network and service components and their related
requirements. SMEs are considered experts and possess a detailed knowledge of
service functionality, restoration, and component/service repair.
• Satellite Operations Center (SOC) – The SOC is a partner with the University
Emergency Operations Center (EOC) during Level 2 (major building fire, extended
power outage) or Level 3 (major earthquake or extensive flooding) emergencies.
The ITSS SOC team provides real-time field information to the EOC as well as
coordinating and directing emergency responses.
• Urgency – End user or client’s assessment of the importance and/or urgency of the
issue as it affects their ability to perform their work. This value is provided by the
customer. (Values = Urgent, High, Medium, Low)


4.0 Process Review


4.1 Process Outline
1.a. Note: There are six major steps in this process, from the time of incident
detection through root cause analysis and the implementation of preventative
measures.
4.2 Incident Detection and Reporting
1.a. An incident can be detected and reported by:
1 An end user
2 A client
3 An SME
4 Automated monitoring
1.b. It is important that information be shared between and among
groups.
1.c. The process for reporting problems differs between “normal”
working hours (8:00 A.M. to 5:00 P.M., M-F) and after hours.
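As an illustration of the working-hours distinction above, the following is a minimal sketch, not part of the ITSS tooling. It assumes local time, treats 5:00 P.M. itself as after hours, and ignores holidays, none of which the document specifies. The function names and the path labels `help_desk` and `after_hours_pcg` are hypothetical.

```python
from datetime import datetime, time

# Assumption: "normal" working hours are 8:00 A.M. to 5:00 P.M., Monday through
# Friday, local time. The 5:00 P.M. boundary is treated as after hours, and
# holidays are not considered.
NORMAL_START = time(8, 0)
NORMAL_END = time(17, 0)


def is_normal_working_hours(moment: datetime) -> bool:
    """Return True if `moment` falls inside normal working hours."""
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and NORMAL_START <= moment.time() < NORMAL_END


def reporting_path(moment: datetime) -> str:
    """Pick a reporting path (hypothetical labels) for the time of a report."""
    return "help_desk" if is_normal_working_hours(moment) else "after_hours_pcg"
```

A report arriving on a Saturday morning, for example, would follow the after-hours path under these assumptions.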
4.3 Incident Level Classification: See Appendices C and D
1.a. This includes assigning a severity level to the incident, and its subsequent
entry into the Remedy incident tracking system.
4.4 Incident Notification
1.a. This includes notification to an ITSS Incident Manager and clients, and
includes outage information posted on the SU Web site, Cable TV,
informational messages left on the designated voice mail box, and email
sent to designated personnel and other client notification as deemed
appropriate.
4.5 Incident Escalation
1.a. This includes escalation to the ITSS Incident Manager, and any
subsequent escalation calls deemed necessary. Note that the severity
level will dictate who in the management chain of command to contact,
and when to provide them status reports. Additionally, the PCG will
determine whether or not the incident needs to be escalated to the SOC.
4.6 Incident Resolution
1.a. This covers work performed during the incident itself, with responsibilities
as follows:
1.b. The Incident Manager is responsible and accountable for the overall
recovery effort, performing the following functions:
1 Establishing recovery priorities
2 Coordinating and delegating responsibilities as they relate to the
recovery effort.
3 Issuing requests for additional resources
4 Ensuring the participation of critical internal and external support
groups and vendors, such as the recall of media from the off-site
storage vendor, or the purchase of replacement parts and equipment
5 Reviewing and approving tactical plans
6 Communicating incident status to ITSS management/executives as
needed
7 Working with Client Support to approve and authorize the release of
information to other schools and departments
1.c. SMEs and Line Managers are responsible for analyzing technical
problems and making technical decisions, implementing tactical plans,
and communicating to other SMEs as well as the Incident Manager.
1.d. The PCG is responsible for coordination of the incident resolution effort
and for communication as deemed necessary.
4.7 Post-Incident Activities
1.a. This covers the activities after the incident is resolved.
1 Ensure that any post-incident cleanup is completed.
2 Perform root cause analysis of the incident.
3 To avoid similar future incidents, determine what process
improvements and preventative measures can be put into place.
4 Implement changes in process or technical support as appropriate.
5 Ensure that the PCG receives feedback and input from the user
community.
6 Perform client follow-up and ensure that an incident response quality
survey form is available for end-user and client feedback.


5.0 Detailed Incident Control Process


5.1 Detailed Process Flow Explanation Table. Reference Appendix A
Each entry gives the Process # and Process Name, followed by the Detailed Description and Action By.

Incident Detection and Reporting

1 Problem Reporting: End Users
   Detailed Description: End users will call 5-HELP or use the web at
   http://helpsu.stanford.edu/. Telephone calls are directed to the ITSS Help Desk,
   where the problem is evaluated.
   If the Help Desk (any tier) determines that this is an urgent incident, the
   call/ticket should be directly escalated to the PCG.
   Action By: End-User

2 Problem Reporting: Clients
   Detailed Description: In most cases, clients should call 5-HELP or use the web at
   http://helpsu.stanford.edu/. In some special cases, clients may have direct access
   to the PCG for reporting problems and receiving updates. In this case, skip to
   step 12.
   Action By: Client

3 Problem Reporting: End Users After Hours
   Detailed Description: If an end user calls 5-HELP after hours, the user will get
   the recorded phone tree. Users can choose to get through to the PCG directly, or
   leave a recorded message. For after-hours calls, the PCG will determine whether
   the call is urgent. If the issue is not urgent, the PCG will enter a ticket in
   Remedy for review the following business day.
   Action By: End User, PCG

4 Problem Reporting: Monitoring to SMEs
   Detailed Description: In some cases, monitoring may notify an SME of a problem
   before a user, client, or the PCG. If the issue is urgent, escalate directly to
   the PCG for coordination and entry into Remedy.
   Action By: SME

5 Problem Reporting: Monitoring to PCG
   Detailed Description: Monitoring reports information directly to the PCG.
   Action By: PCG

6 Resolve?
   Detailed Description: Help Desk assesses whether the ticket can be resolved at
   this point. If so, the Help Desk will resolve and close.
   Action By: Help Desk

7 Urgent?
   Detailed Description: If the ticket cannot be resolved, Help Desk determines
   whether the ticket should be forwarded to SME/Help Desk Tier 2 or to the PCG.
   Action By: Help Desk

8 Forward To SME
   Detailed Description: If the case does not appear to be severity Urgent/High,
   forward to SME.
   Action By: Help Desk

9 Resolve Quickly?
   Detailed Description: Can the case be resolved by the SME and is it Severity
   Level Medium/Low?
   Action By: SME

10 Enter Solution In Remedy
   Detailed Description: If the SME can quickly resolve the case, enter solution in
   Remedy and close ticket.
   Action By: SME

11 Forward To PCG
   Detailed Description: If the SME determines that there is impact beyond a simple
   fix and the Severity Level is Urgent/High, notify the PCG.
   Action By: SME/PCG

Classification

12 Assign Severity Level
   Detailed Description: Assign a severity level to the incident, using the standard
   ITSS categories (see Appendix C and D). The severity levels govern:
   • Level of action to be taken by the Production Control Group
   • Notification and escalation guidelines
   • Time intervals in which to provide status reports
   • Time intervals in which to initiate escalation and management decision
     processes
   Action By: PCG

13 Enter In Remedy
   Detailed Description: Enter a ticket for the incident into the Remedy Help Desk
   application.
   Action By: PCG

Table 2 Detailed Incident Control Process
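
The Help Desk portion of Table 2 (steps 6 through 13) can be read as a small decision procedure. The sketch below is one possible reading, offered for illustration only. The `Ticket` fields and the `route` function are hypothetical names, and the code assumes that a ticket the SME cannot resolve quickly is forwarded to the PCG, as the Appendix A flowchart suggests.

```python
from dataclasses import dataclass

# Severity levels that the Help Desk sends straight to the PCG (step 7).
URGENT_LEVELS = {"Urgent", "High"}


@dataclass
class Ticket:
    severity: str                  # "Urgent", "High", "Medium", or "Low"
    help_desk_can_resolve: bool    # outcome of step 6
    sme_can_resolve_quickly: bool  # outcome of step 9


def route(ticket: Ticket) -> list[str]:
    """Return the Table 2 steps a Help Desk ticket passes through, in order."""
    steps = ["6 Resolve?"]
    if ticket.help_desk_can_resolve:
        return steps + ["Help Desk resolves and closes"]
    steps.append("7 Urgent?")
    if ticket.severity in URGENT_LEVELS:
        # Urgent/High tickets go directly to the PCG for classification.
        return steps + ["12 Assign Severity Level", "13 Enter In Remedy"]
    steps += ["8 Forward To SME", "9 Resolve Quickly?"]
    if ticket.sme_can_resolve_quickly:
        return steps + ["10 Enter Solution In Remedy"]
    # Assumption: anything the SME cannot resolve quickly goes to the PCG.
    return steps + ["11 Forward To PCG", "12 Assign Severity Level", "13 Enter In Remedy"]
```

For example, `route(Ticket("Medium", False, True))` ends at step 10, while an Urgent ticket that the Help Desk cannot resolve skips the SME steps entirely.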


6.0 High-Level Incident Process Explanation


6.1 Detailed Process Explanation: See Appendix B
Each entry gives the Process Name, followed by the Detailed Description and Action By.

Notification

SME
   Detailed Description: Notify appropriate SME(s) if necessary, using the AMCOM
   on-call system.
   Action By: PCG

Update itss-service-alerts@lists
   Detailed Description: Send a message to itss-service-alerts@lists.stanford.edu
   Action By: PCG

Post Messages To Web, Phone, TV
   Detailed Description: Message information will include: the date and time, a
   brief description of the problem, and if available, the estimated time of
   resolution/restoration.
   • Web: Update status on down.stanford.edu
   • Telephone: In the event of a major network failure, update the designated
     voicemail box: 7-DOWN
   • SU Cable TV – ITSS can have pre-worded messages set for broadcast, where the
     group can just fill in the blanks.
   Action By: PCG

Escalation

Notify Line Manager
   Detailed Description: Contact the Shared Services Line Manager of the affected
   system. If a Line Manager is unavailable, use the AMCOM system to determine the
   backup.
   Action By: PCG

Determine Incident Manager
   Detailed Description: If the incident falls into the area of a single Line
   Manager, that Line Manager will contact the Incident Manager. If multiple Line
   Managers are involved, they must determine a single Incident Manager.
   Action By: Shared Services Line Managers

Send Email
   Detailed Description: Send first email to appropriate lists/clients, based on
   Service Level Agreements. Use the campus-it-alerts@lists.stanford.edu list for
   campus-wide outages; the Incident Manager should approve any messages which go
   to this list.
   Action By: PCG, Incident Manager

Escalate To Senior Management
   Detailed Description: The Severity Level (see Appendix C and D) will determine
   the escalation to management.
   Action By: PCG

Resolution

Incident Management
   Detailed Description: The Incident Manager will take ownership of the problem
   and manage the incident. Responsibilities:
   • Establish priorities
   • Coordinate and delegate responsibilities in regards to the recovery effort
   • Request additional internal or external resources
   • Ensure and manage the participation of critical internal and external support
     groups and vendors
   • Review and approve tactical plans
   • Communicate incident status to ITSS management/executives as needed
   • Work with Client Support to release information as needed to clients/users
     across campus

Resolve Incident
   Detailed Description: SMEs are responsible for analyzing technical problems,
   implementing tactical plans, and communicating to other SMEs and with the PCG.
   Action By: SMEs

Post Resolution Information To Web, Phone, TV
   Detailed Description: Message information will include: the date and time, a
   brief description of the problem, and if available, the estimated time of
   resolution/restoration.
   • Web: Update status on down.stanford.edu
   • Telephone: In the event of a major network failure, update the designated
     voicemail box: 7-DOWN
   • SU Cable TV – ITSS can have pre-worded messages set for broadcast, where the
     group can just fill in the blanks
   Action By: PCG

Post Incident Analysis

Complete Cleanup Tasks
   Detailed Description: Determine whether cleanup is required, and identify who
   will own and perform the additional clean-up tasks.
   Action By: SME, PCG

Root Cause Analysis
   Detailed Description: It is the responsibility of the manager of the PCG to
   initiate root cause analysis, collecting as much information as possible, and to
   ensure that any information which will help in resolving future incidents is
   entered into the related Remedy ticket for future use.
   Action By: PCG Manager

Incident Prevention
   Detailed Description: Determine processes which can be implemented to prevent a
   repeat of the incident.
   Action By: Shared Services Managers, SMEs

Client/User Follow-up
   Detailed Description: Ensure selected members of the recovery team make follow
   up calls to the affected users, to solicit their constructive comments. Share
   results of the analysis with workgroups and clients where appropriate.
   Action By: PCG

Quality Survey
   Detailed Description: ITSS will make an on-line survey available for user/client
   feedback, and for ITSS staff. The PCG is responsible for tallying survey results
   and making them available to the appropriate ITSS staff and managers.
   Action By: PCG

Table 3 Explanation of High-Level Incident Management Process Flow
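
Table 3 lists the information each posted message carries: the date and time, a brief description of the problem, and, if available, the estimated time of resolution/restoration. A minimal sketch of composing such a message follows. The function name, field labels, and timestamp format are assumptions made for illustration, not a format prescribed by ITSS.

```python
from datetime import datetime
from typing import Optional


def build_alert_message(
    occurred_at: datetime,
    description: str,
    estimated_restoration: Optional[datetime] = None,
) -> str:
    """Compose alert text from the three elements listed in Table 3."""
    lines = [
        f"Date and time: {occurred_at:%Y-%m-%d %H:%M}",
        f"Problem: {description}",
    ]
    # The estimate is optional: include it only when one is available.
    if estimated_restoration is not None:
        lines.append(f"Estimated restoration: {estimated_restoration:%Y-%m-%d %H:%M}")
    return "\n".join(lines)
```

The same text could then be reused for each channel (list email, web status page, voicemail script), so that all of them carry consistent information.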


7.0 Outstanding Issues


7.1 A common paging system is required
1.a. AMCOM for manual paging
1.b. What to use for automated paging from monitoring systems?
7.2 Definition of Service Hours
7.3 Definition of “availability,” “outage,” and “service degradation”
7.4 Service-level procedures for client notification


Appendix A Incident Management Process Flowchart


Reference Table 2 Detailed Incident Control Process
1.a. Note that the circled numbers in the flowchart correspond to the process
numbers in Table 2.
[Flowchart: Incident Detection & Reporting. Swimlanes: End User, Client, Help Desk,
SME, Automated Monitoring, PCG. Numbered steps: 1 Report Problem: HelpSU/5HELP;
2 Report Problem: Directly To PCG; 3 Calls 5-HELP After Hours; 4 Report Problem;
5 Report Problem; 6 Resolve?; 7 Urgent?; 8 Forward To SME (Help Desk Tier 2) For
Additional Analysis; 9 Resolve Quickly?; 10 Enter Solution In Remedy; 11 Forward
Directly To PCG; 12 Determine Severity Level; 13 Enter Incident Ticket In Remedy.]

Figure 1 Incident Detection and Reporting


Appendix B High-Level Incident Management Process Flow

[Diagram: process stages from top to bottom: Detection & Reporting, Classification,
Escalation, Notification, Resolution, Post Incident Activities. Labeled elements:
HelpSU/5-HELP, Client, End User, Help Desk Tier 1, Subject Matter Expert, Monitoring,
Production Control Group, Remedy Database, Line Manager, Duty Manager, SOC/EOC,
System Status, Self-Service, itss-service-alert@lists, down.stanford.edu, 7-DOWN,
SME, Account Manager, PCG Manager. Arrow labels: Emergency, Communicate, Notify,
Liaison, Update, Classify Incident Level & Enter in Remedy, Update With Solution
Information.]

Figure 2 High-Level Incident Management Process Flow


Appendix C Incident Level Communications Matrix


Columns: Level, Description, Incident Examples, Client Update Interval, SME Work Started w/in:*

Level: Urgent
   Description: A major service outage with significant and immediate business
   impact and no workaround.
   • Large number of users
   • Outage of significant length
   • No available workaround
   • Mission/business critical
   Incident Examples:
   • Fire suppression system activation in data center
   • Loss of electrical power
   • Entire network switch, closet and/or building outages
   • Failure of 1 or more high priority services – e.g. Exchange, Oracle
     Financials, HRMS, PeopleSoft
   • Large denial of service attacks; successful hacking; loss or altering of
     data; theft of data; simultaneous virus infections
   • SU telephony systems
   Client Update Interval: Initial notification: immediate. On-going: ½ hour
   SME Work Started w/in:* 30 minutes

Level: High
   Description: A major service outage or degradation with significant business
   impact and an unsustainable workaround.
   • Multiple users
   • Work performance reduced
   • Mission/business critical
   Incident Examples:
   • Failure of storage system (storage area network, SAN)
   • Failure of a server of a sensitive client or user
   • Severely degraded performance
   • Smaller denial of service attacks
   Client Update Interval: Initial notification: immediate. On-going: 1 hour
   SME Work Started w/in:* 1 hour

Level: Medium
   Description: A service outage or degradation with an acceptable workaround.
   • Service-affecting
   • Minimal performance degradation
   • Affects non-critical business function
   Incident Examples:
   • Cannot connect to the internet, send or receive email
   • Hardware failure, cannot access data, cannot print
   • Degraded performance
   Client Update Interval: As applicable. By SME working issue.
   SME Work Started w/in:* 4 business hours

Level: Low
   Description: Non service-affecting.
   • Cosmetic problem
   • System enhancement
   Incident Examples:
   • Previously requested enhancements to a system
   Client Update Interval: Upon issue resolution or as applicable. By SME working
   issue.
   SME Work Started w/in:* 1 business day

Table 4 Incident Level Classification Matrix

* Note: This column indicates the maximum amount of time that will elapse before a technician begins
working on an Incident. Actual times will generally be much shorter for all severities.
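
As a sketch of how the client update intervals in the matrix above might be applied, the lookup below computes when the next client update is due. It assumes the on-going intervals read as ½ hour for Urgent and 1 hour for High, and it represents the "as applicable" entries for Medium and Low as `None`. The names `CLIENT_UPDATE_INTERVAL` and `next_client_update` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

# On-going client update interval per level, as read from the matrix above.
# None stands for "as applicable": no fixed interval is given for Medium or Low.
CLIENT_UPDATE_INTERVAL = {
    "Urgent": timedelta(minutes=30),
    "High": timedelta(hours=1),
    "Medium": None,
    "Low": None,
}


def next_client_update(level: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next client update is due, or None if no fixed interval applies."""
    interval = CLIENT_UPDATE_INTERVAL[level]
    return None if interval is None else last_update + interval
```

An Urgent incident last updated at 9:00 would therefore be due for another client update at 9:30 under this reading.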


Appendix D Priorities and Internal Response Times

Note: The following table refers to Priority, not to Urgency or Impact. Priority is determined by
a combination of Urgency, Impact, and the existing Service Level Commitments for the service in question.
This is an important distinction: Urgency is provided by the customer, while Priority is assigned by the
Helpdesk, PCG, and/or SME involved, from a system-wide perspective.

Usage: These Priority levels (and the associated Urgency and Impact values) are used to track
incidents as they are reported and worked on. Priority, Urgency, and Impact each correspond directly
to a Remedy ticket field.

Columns: Priority, Description, Committed Service Hours, PCG Call Initiate, SME Call Response,
Escalation Interval, SME Work Started

Priority: Urgent
   Description: A major service outage with significant and immediate business
   impact and no workaround.
   • Large number of users
   • Outage of significant length
   • No available workaround
   • Mission/business critical
   Committed Service Hours: 24x7
   PCG Call Initiate: Immediate
   SME Call Response: 15 minutes
   Escalation Interval: 10 minutes
   SME Work Started: 30 minutes

Priority: High
   Description: A major service outage or degradation with significant business
   impact and an unsustainable workaround.
   • Multiple users
   • Work performance reduced
   • Mission/business critical
   Committed Service Hours: 24x7
   PCG Call Initiate: Immediate
   SME Call Response: 15 minutes
   Escalation Interval: 10 minutes
   SME Work Started: 1 hour

Priority: Medium
   Description: A service outage or degradation with an acceptable workaround.
   • Service-affecting
   • Minimal performance degradation
   • Affects non-critical business function
   Committed Service Hours: 8-5, M-F
   PCG Call Initiate: Ticket Assignment/eMail
   SME Call Response: As appropriate (work begins, work update, work completed)
   Escalation Interval: Standard SME Group Remedy settings
   SME Work Started: 4 business hours

Priority: Low
   Description: Non service-affecting.
   • Cosmetic problem
   • System enhancement
   Committed Service Hours: 8-5, M-F
   PCG Call Initiate: Ticket Assignment/eMail
   SME Call Response: As appropriate (work begins, information required, work
   completed)
   Escalation Interval: Standard SME Group Remedy settings
   SME Work Started: 1 business day
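
The Urgent and High commitments above lend themselves to a simple overdue check. The sketch below is illustrative only. It assumes every clock starts when the PCG pages the SME, which the table does not state. It stores the escalation interval without interpreting it, and it leaves out Medium and Low because their commitments are expressed in business hours and days. `Commitment`, `COMMITMENTS`, and `overdue_items` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Commitment:
    sme_call_response: timedelta
    escalation_interval: timedelta  # recorded from the table; not interpreted here
    sme_work_started: timedelta


# Values for the two 24x7 priorities, taken from the table above.
COMMITMENTS = {
    "Urgent": Commitment(timedelta(minutes=15), timedelta(minutes=10), timedelta(minutes=30)),
    "High": Commitment(timedelta(minutes=15), timedelta(minutes=10), timedelta(hours=1)),
}


def overdue_items(
    priority: str,
    paged_at: datetime,
    now: datetime,
    sme_responded: bool,
    work_started: bool,
) -> list[str]:
    """List the commitments that have lapsed, measured from the time of the page."""
    commitment = COMMITMENTS[priority]
    elapsed = now - paged_at
    overdue = []
    if not sme_responded and elapsed > commitment.sme_call_response:
        overdue.append("SME call response")
    if not work_started and elapsed > commitment.sme_work_started:
        overdue.append("SME work started")
    return overdue
```

A check like this could run periodically for each open Urgent or High ticket, with any non-empty result prompting the PCG to escalate.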


Appendix E On-Call Guidelines


Guideline Purpose
To generally define and standardize:
• On-call duties and responsibilities
• A methodology for communications and engagement of problem determination and resolution
• On-call scheduling
• Response expectations/guidelines and general escalation processes in the event the 24x7 on-site
  group is engaged in an ongoing event or incident
System-generated notifications will continue to be handled within the required time frames by the
individual SME groups.
Duties
Requirements for on-call responsibility must be identified in the appropriate job descriptions,
including carrying a pager or cell phone and making the employee's home phone number and email
address available.
Responsibilities
• Share on-call responsibilities with other members of the work group.
• Begin working on the event as soon as notified. This may require working from home or traveling
  to work. The decision to appear in person at work depends on the circumstances of the event, such
  as swapping hardware components or an on-site appearance by a vendor.
Communications
• Teleconference Phone Bridge: Telecom will have a teleconference number available to technical
  personnel and the PCG. It will be used when the expertise of multiple SMEs is required to resolve
  an incident, and it allows the technical staff to communicate as a group. It also lets the PCG
  determine the status of the incident first-hand and keep management informed without management
  having to join the conference call.
• The AMCOM system will be the primary contact information/procedures lookup and paging tool for
  the 24x7 on-site groups.
• Staff will provide and track individual work group on-call schedules.


• The work group establishes the rotation.
• Members of the work groups are responsible for keeping the contact and coverage information in
  the on-call database current.
Communications Elements
• Required communications devices: pager or cell phone, personal phone.
• Additional communications devices as recommended by the SME groups: DSL, Treo, wireless laptop,
  email.
Notification Protocol
• Send the initial outgoing page.
• Re-page after 10 minutes.
• If a call-back is NOT received from the designated on-call SME within 15 minutes, begin escalation
  to the next on-call person. All subsequent pages also go to the primary on-call person and the
  on-call Shared Services manager.
• The recipient is to confirm garbled pages and follow the call-back protocol.
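
The timing in the notification protocol above can be expressed as a short schedule. This Python sketch is only an illustration under stated assumptions: the function name, the recipient labels, and the idea of measuring everything in minutes from the initial page are hypothetical and do not describe how AMCOM itself works.

```python
from typing import List, Optional

REPAGE_AFTER_MIN = 10    # re-page after 10 minutes
ESCALATE_AFTER_MIN = 15  # escalate if no call-back within 15 minutes


def pages_sent(now_min: float, callback_at_min: Optional[float] = None) -> List[str]:
    """List the pages that should have gone out by now_min, in order.

    now_min and callback_at_min are minutes since the initial page;
    callback_at_min is None while no call-back has been received.
    """
    steps = [
        (0, "initial page: primary on-call SME"),
        (REPAGE_AFTER_MIN, "re-page: primary on-call SME"),
        (ESCALATE_AFTER_MIN,
         "escalate: next on-call person, primary on-call SME, "
         "on-call Shared Services manager"),
    ]
    sent = []
    for at_min, label in steps:
        step_is_due = at_min <= now_min
        # A later step is skipped once a call-back has arrived at or before its due time.
        already_answered = (
            at_min > 0 and callback_at_min is not None and callback_at_min <= at_min
        )
        if step_is_due and not already_answered:
            sent.append(label)
    return sent
```

With no call-back, pages_sent(20) lists all three pages; with a call-back at minute 12, pages_sent(20, callback_at_min=12) stops after the re-page.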
Initial Communications Tracking
Use the AMCOM system for initial communications tracking.
Response Protocol
• Call back within 15 minutes.
• Within 30 minutes, be actively engaged in problem determination and resolution. Active engagement
  may be via:
   • Home system
   • Wireless laptop
   • On-site
• SME groups may establish accelerated response profiles based upon their response criticality.
Scheduling
• Scheduling is by SME group design.
• The SME schedule is to be established and published in the AMCOM system.
• SME contact instructions are to be included.


Appendix F Management On-Call Guidelines


Return-To-Work Guidelines
• These guidelines are for Management to consider when an on-call representative has worked extended
  hours due to an outage or issue.
• They should be used to ensure there is always an effective on-call representative, while
  protecting the on-call SME from excessive work time.
• If the primary on-call SME has already worked consecutive extended hours, or multiple shifts, and
  a new event occurs, either the manager will provide a backup and notify the backup of their
  modified on-call status, or the entire group of SMEs will decide on an alternate SME to be used
  in this situation.
• To allow staff members who are involved in an after-hours call-out on Sunday through Thursday to
  obtain adequate rest, the following is provided as a sample set of guidelines for a
  return-to-work policy:
On-Call SME works until    Report to work no later than
0200                       1100
0300                       1200
0400                       1300
0500                       Take rest of day off
Table 5 Return-To-Work Guidelines
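
As a sketch of how the sample guidelines in Table 5 might be applied, the following Python lookup maps the time an on-call SME stopped working to the suggested report time. Two points are assumptions and not stated in the table: an end time that falls between two rows is treated as the next later row, and end times outside 0200 to 0500 return None because the sample table does not cover them. The function and constant names are hypothetical.

```python
from datetime import time
from typing import Optional

REST_OF_DAY_OFF = "Take rest of day off"

# (on-call SME works until, report to work no later than), as listed in Table 5.
RETURN_TO_WORK = [
    (time(2, 0), "1100"),
    (time(3, 0), "1200"),
    (time(4, 0), "1300"),
    (time(5, 0), REST_OF_DAY_OFF),
]


def report_no_later_than(worked_until: time) -> Optional[str]:
    """Return the Table 5 guidance, or None if the end time is outside the table."""
    earliest = RETURN_TO_WORK[0][0]
    latest = RETURN_TO_WORK[-1][0]
    if worked_until < earliest or worked_until > latest:
        return None  # not covered by the sample guidelines; left to Management
    for limit, guidance in RETURN_TO_WORK:
        # Assumption: a time between two rows uses the next later row.
        if worked_until <= limit:
            return guidance
    return None
```

For instance, an SME who works until 0300 would be expected no later than 1200, and one who works until 0500 takes the rest of the day off.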
