Auditing the Incident and Problem Management Process
Regular audits of the organization’s procedures for resolving IT problems can help prevent these issues from becoming even bigger trouble for the business.
January 01, 2012
The day-to-day running of IT operations generates many user queries and problems that, if not addressed, could impair the efficient operation of IT systems and applications. Incident and problem management is often overlooked from an IT audit perspective because it lacks the appeal of development or the specificity of disaster recovery. But without a good understanding of the topic and an audit focus on the process, certain business operations can go off the rails quickly.
It is important for internal auditors to understand the difference between an incident and a problem:
- Incident: The Information Technology Infrastructure Library (ITIL) defines an incident as "any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to or a reduction in the quality of that service." The stated ITIL objective is to restore normal operations as quickly as possible with the least possible impact on either the business or the user, at a cost-effective price.
- Problem: ITIL defines a problem as "the unknown cause of one or more incidents, often identified as a result of multiple similar incidents."
The overall objective of both the incident and problem management processes is to ensure that IT systems are running smoothly and supporting business operations. ISACA's Control Objectives for Information and Related Technologies (COBIT) provides a good framework for auditing these processes.
COBIT identifies process steps covering the service desk, registration of customer queries, incident escalation and closure, and reporting and trend analysis.
Service Desk
The service desk is the IT department's face to the business. Yet it is amazing how few resources, and how little training and funding, are allocated to the team that supports this activity. In assessing this component, the auditor must understand:
- Whether a service desk has been established and whether the function of the service desk is known.
- The skill sets and experience of service desk personnel. Junior team members should be supported by more senior personnel.
- The application or system used to support the service desk.
Registration of Customer Queries
Customer queries typically arrive at a service desk by telephone or email. The auditor should ascertain how these queries are logged and tracked. Usually, when a query is received, service desk staff assign it a priority or severity level based on an agreed-upon definition. The assignment of priority is important because it determines how quickly a query will be resolved. A priority definition should consider:
- How critical is the loss of the component? For example, a user who is unable to log on to a minor system due to a forgotten password will be a lower priority than a key piece of infrastructure that crashes before the start of the business day.
- What is the business impact of the incident? For example, customer service operations are impaired when hundreds of users cannot access a mission-critical system.
- How quickly does the service need to be restored?
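The three questions above can be sketched as a simple scoring rule. This is only an illustration: the `Level` scale, the `assign_priority` function, the score thresholds, and the "P1" to "P4" labels are assumptions made for the example, not part of COBIT, ITIL, or any particular service desk tool. An organization's agreed-upon definition would replace them.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-point rating used for each factor."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def assign_priority(criticality: Level, impact: Level, urgency: Level) -> str:
    """Map the three factors to a priority label; thresholds are illustrative."""
    score = criticality + impact + urgency  # ranges from 3 to 9
    if score >= 8:
        return "P1"
    if score >= 6:
        return "P2"
    if score >= 4:
        return "P3"
    return "P4"


# A forgotten password on a minor system versus a core infrastructure crash.
print(assign_priority(Level.LOW, Level.LOW, Level.LOW))     # P4
print(assign_priority(Level.HIGH, Level.HIGH, Level.HIGH))  # P1
```

An auditor could compare a rule like this against the priorities actually recorded by service desk staff to see whether the agreed-upon definition is applied consistently.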
Incident Escalation
The auditor needs to understand the procedures in place to escalate incidents that cannot be resolved immediately. These procedures may involve a level two and level three support structure; the escalation of the queries between these levels will be determined by the resolution time limits (usually defined by a service level agreement) and the complexity of the problem. The auditor should choose a sample of escalated incidents to ensure the events have received the appropriate attention within the agreed-upon time frames.
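The sampling test described above might look like the following sketch when incident data can be exported from the service desk system. The `Incident` fields, the `SLA_LIMITS` values, and the function names are hypothetical; actual limits would come from the organization's service level agreement.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical resolution limits per priority; real values come from the SLA.
SLA_LIMITS = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=3),
}


@dataclass
class Incident:
    incident_id: str
    priority: str
    opened: datetime
    resolved: datetime
    escalated: bool


def sample_escalated(incidents, sample_size, seed=0):
    """Draw a reproducible random sample from the escalated incidents."""
    escalated = [i for i in incidents if i.escalated]
    rng = random.Random(seed)  # fixed seed so the sample can be re-performed
    return rng.sample(escalated, min(sample_size, len(escalated)))


def sla_breaches(sample):
    """Return sampled incidents whose resolution time exceeded the SLA limit."""
    return [i for i in sample if i.resolved - i.opened > SLA_LIMITS[i.priority]]
```

Fixing the random seed is a design choice that lets a reviewer reproduce the same sample later; the breaches returned are starting points for follow-up, not conclusions in themselves.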
Incident Closure
Incident closure is an important, but mostly overlooked, aspect of the process. Once an incident is resolved, the assigned team member often moves on to the next incident without updating the status, system documentation, and resolution. This lack of information poses a significant issue for the problem management process because trends cannot be analyzed. During their testing, auditors should ensure that appropriate information is included in the incident record. Furthermore, auditors should ascertain whether documented criteria are in place that specify what information must be collected before an incident is classified as "closed." Some of these criteria may include:
- All relevant incident information collated within the incident record.
- Confirmation from the end user that they are satisfied with the result.
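Closure criteria like these lend themselves to a simple completeness check over closed incident records. The sketch below assumes records are available as dictionaries; the field names in `REQUIRED_FOR_CLOSURE` are hypothetical stand-ins for whatever the organization's documented criteria require.

```python
# Hypothetical closure criteria; a real list would come from the documented procedure.
REQUIRED_FOR_CLOSURE = (
    "status_updated",
    "resolution_notes",
    "documentation_updated",
    "user_confirmed",
)


def closure_gaps(record: dict) -> list:
    """Return the closure criteria that are missing or empty in an incident record."""
    return [field for field in REQUIRED_FOR_CLOSURE if not record.get(field)]


def closure_exceptions(closed_records: list) -> dict:
    """Map incident IDs to their gaps, keeping only records with at least one gap."""
    found = {}
    for record in closed_records:
        gaps = closure_gaps(record)
        if gaps:
            found[record["id"]] = gaps
    return found
```

A field that is absent, blank, or `False` is treated as a gap, so a record closed without user confirmation is flagged in the same way as one with no resolution notes.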
Reporting and Trend Analysis
Senior management should produce and review periodic reports detailing resolved incidents, service performance, and response times. The auditor should ensure the reported information is accurate and note action taken by management to address key issues.
The key process steps identified by COBIT for problem management include identification and classification of problems, problem tracking and resolution, and problem closure.
Identification and Classification of Problems
Problems typically are identified through trend analysis of multiple incident reports and error logs. Alternatively, a high-severity incident also may be classified as a problem to enable detailed root cause analysis. Key aspects required at this stage include:
- Availability of all information that can assist in identifying the problem.
- Collaboration among different teams, which must pool their knowledge and expertise to diagnose and resolve the problem.
- Predetermined priority levels to ensure there is efficient allocation of resources.
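As a rough illustration of how problems can be identified from incident data, the sketch below flags a component and symptom pair as a problem candidate when it recurs across several incidents or when any single incident was high severity. The dictionary keys, the `min_occurrences` threshold, and the function name are assumptions made for the example.

```python
from collections import Counter


def problem_candidates(incidents, min_occurrences=3):
    """Return sorted (component, symptom) pairs that recur or had a high-severity incident."""
    keys = [(i["component"], i["symptom"]) for i in incidents]
    counts = Counter(keys)
    # Trend rule: the same pair appears at least min_occurrences times.
    recurring = {key for key, n in counts.items() if n >= min_occurrences}
    # Severity rule: a single high-severity incident is enough.
    high_severity = {
        (i["component"], i["symptom"]) for i in incidents if i["severity"] == "high"
    }
    return sorted(recurring | high_severity)
```

The output is only as good as the incident records behind it, which is why incomplete closure information undermines this stage.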
Problem Tracking and Resolution
A single tracking system, ideally interfaced with the incident management system, will assist in providing the audit trail and status required to monitor problems. In addition, communication with the impacted parties is critical at all stages to ensure there is an appropriate solution and timely resolution. Moreover, personnel must be trained to identify and track trends. In most instances, the root cause of the problem is identified only after a significant amount of analysis is undertaken. Tools such as Pareto charts (based on the Pareto principle) and Ishikawa diagrams (also called fishbone diagrams) are useful in identifying trends and the cause-and-effect relationships behind problems.
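To show the kind of analysis a Pareto chart summarizes, here is a minimal sketch that ranks incident causes by frequency and reports the cumulative share of incidents they explain. The cause labels and the 80 percent cutoff are illustrative assumptions; the cutoff follows the common "80/20" reading of the Pareto principle rather than any prescribed audit threshold.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative percent) rows, most frequent cause first."""
    counts = Counter(causes).most_common()
    total = sum(n for _, n in counts)
    table = []
    running = 0
    for cause, n in counts:
        running += n
        table.append((cause, n, round(100 * running / total, 1)))
    return table


def vital_few(table, cutoff=80.0):
    """Return the leading causes that together reach the cumulative cutoff."""
    selected = []
    for cause, _count, cumulative in table:
        selected.append(cause)
        if cumulative >= cutoff:
            break
    return selected


# Hypothetical root causes recorded against ten closed incidents.
causes = ["patch"] * 6 + ["config"] * 2 + ["hardware", "user error"]
print(vital_few(pareto_table(causes)))  # ['patch', 'config']
```

In this made-up data set, two of the four causes account for 80 percent of incidents, which is where root cause analysis would likely start.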
Problem Closure
Problem records should be closed when there is a successful resolution of the known error or if the business agrees to implement an alternate solution or workaround.
A Stronger Process
The underlying principle behind a successful incident and problem management process is communication among the various teams. The hand-offs between teams are where most problems occur. Clearly defined procedures together with specific accountabilities can help strengthen the process. Moreover, continuing awareness and education, with internal audit's support, can make the process more robust.