An ITIL project in the real world

Wednesday, August 09, 2006

How to measure improvements brought by our Incident Mgt project (3).

MEASURE 3: Market Share

Definition:
Number of user calls recorded per user per month.
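
To make the definition concrete, here is a minimal sketch of the computation in Python - the call-record layout and field names are hypothetical, not those of our actual tool:

    # Minimal sketch (hypothetical data layout): recorded calls per user
    # per month, from one 'YYYY-MM' entry per recorded call.
    from collections import Counter

    def records_per_user_per_month(call_months, user_count):
        per_month = Counter(call_months)
        return {month: n / user_count for month, n in per_month.items()}

    calls = ["2006-07", "2006-07", "2006-08"]  # one entry per recorded call
    print(records_per_user_per_month(calls, user_count=2))
    # {'2006-07': 1.0, '2006-08': 0.5} -> compare against your industry benchmark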

What's the use?
I am not sure that "Market Share" is a very good name for this. Does anybody know a better name for it? Anyway, here's what it is good at: it gives you an idea of whether 1st level support is doing one of their key tasks, which is... recording calls! Although it may sound obvious that 1st level support should record all calls, it is very easy for an analyst to slip away from it if this is not made very clear by management (and supervisors). Ex: "The user called and I solved her problem in 30 secs. Why would I record that? What's the use? It will take me more time to record it than it took me to resolve it!" It is not uncommon for an analyst to prefer being the "hero" who solves 10 things like Superman to solving those 10 and recording what was done. What extra reward do you get from recording it?

That's when this measure comes in handy. I've been told that the industry average for recording calls is roughly 1.1 records / user / month. Of course, it varies a lot from one industry to another - and depending on whether you have a majority of plant workers or a huge marketing team ;-) But it can give you some idea of where you stand with this activity.

Limitations with this measure
The objective is not to record more calls than there really are, just to have nice numbers. Actually, if we had very good problem management, change management, availability and capacity management, etc., we should not have so many Incidents, no? Still, if your company is big enough, you may be able to use this measure to highlight which L1 support team is not doing that part of their job properly. And believe me, at my place, it is quite obvious.

Another limitation is that it does not take into account the quality of the calls recorded... that would require a softer measure.

Sunday, August 06, 2006

How to measure improvements brought by our Incident Mgt project (2).

MEASURE 2: Percentage of Incidents resolved in time

Definition: Number of Incidents resolved within the agreed OLAs and SLAs, divided by the total number of Incidents in the period.
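
In code, this is just a ratio. Here's a minimal sketch, assuming each Incident record already carries a hypothetical resolved_in_time flag computed against its OLA/SLA target:

    # Minimal sketch (hypothetical fields): share of Incidents resolved
    # within their agreed OLA/SLA target during the period.
    def percent_resolved_in_time(incidents):
        if not incidents:
            return 0.0
        in_time = sum(1 for i in incidents if i["resolved_in_time"])
        return 100.0 * in_time / len(incidents)

    period = [{"resolved_in_time": True}, {"resolved_in_time": False},
              {"resolved_in_time": True}]
    print(percent_resolved_in_time(period))  # 66.66...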

Will it help us? I think this can be an excellent measure. We can use it to identify services that lack proper support, or teams that do not react in a timely manner... It should be a great tool to fine-tune our process - but we won't be able to compare with the situation before the project, because we did not have any kind of OLA...

Limitations with this measure
I have to say I like this measure. If Incidents are not recorded, it shows badly. If you stay within targets, it has a positive impact. If you overdo it, it does not make the measure any better.

However, that measure alone will not be sufficient. What if analysts do not record anything? Or record only a few Incidents that are resolved extremely fast? The numbers will not make any sense... We'll need some more measures. But I will keep this one.

Saturday, August 05, 2006

How to measure improvements brought by our Incident Mgt project (1).

It's now been almost a month since we rolled out changes to improve the way we handle interruptions of service. What improvements have we achieved so far?


  • MEASURE 1: Average time to resolve Incidents

    Definition: time between the initial recording of the Incident and its resolution by the specialist (not including the extra time until the user acknowledges that the service is back up and running properly).
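
    For reference, here is a minimal sketch of the computation, assuming hypothetical recorded_at / resolved_at timestamps on each Incident record:

        # Minimal sketch (hypothetical fields): average hours between the
        # recording of an Incident and its resolution by the specialist.
        from datetime import datetime

        def average_resolution_hours(incidents):
            durations = [(i["resolved_at"] - i["recorded_at"]).total_seconds() / 3600
                         for i in incidents if i["resolved_at"] is not None]
            return sum(durations) / len(durations) if durations else 0.0

        inc = [{"recorded_at": datetime(2006, 8, 1, 9, 0),
                "resolved_at": datetime(2006, 8, 1, 13, 0)}]
        print(average_resolution_hours(inc))  # 4.0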

    Altogether, our average time to resolve Incidents hasn't changed much yet... It means that analysts haven't really started to change anything about Incident handling, i.e. they are not giving this task any more priority or effort today than before.

    How will we make improvements happen then?

    1. In a few weeks, we'll start automatic Service Desk escalation to management for each and every Incident that is not resolved as fast as expected (see the sketch after this list). This should be an incentive to improve behavior...
    2. We will work closely with the teams and analysts that do not resolve Incidents in time. Instead of bothering everyone, we will concentrate on the teams that are not providing a proper response, and on the Incidents that were not handled in time. This should help teams prioritize Incidents properly - and will avoid unnecessary escalation to management.
    3. We will publicize team performance. That will show management, and each team, that other teams can do better.
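
    As a rough illustration of what the automatic escalation in point 1 could look like - all field names and thresholds here are hypothetical, not our actual tool's configuration:

        # Minimal sketch (hypothetical fields/thresholds): flag open Incidents
        # that have exceeded their expected resolution time, for escalation.
        from datetime import datetime, timedelta

        def incidents_to_escalate(open_incidents, now):
            return [i for i in open_incidents
                    if now - i["recorded_at"] > i["target"]]

        backlog = [{"id": 42, "recorded_at": datetime(2006, 8, 5, 9, 0),
                    "target": timedelta(hours=4)}]
        overdue = incidents_to_escalate(backlog, now=datetime(2006, 8, 5, 14, 0))
        print([i["id"] for i in overdue])  # [42] -> notify management
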
    Limitations with this measure

    Although that plain measure is very good from a high-level, customer perspective, we will need narrower measures to decide on specific actions to further improve performance. The problem with this global "average time to resolve" measure is that it is influenced by factors unrelated to the speed at which analysts try to resolve Incidents. Ex:

    1. To what extent does level 1 support record Incidents?

      At our company, Service Desk specialists tend to overlook Incident recording, especially when an immediate resolution is provided! When management increases or decreases SD focus on Incident recording, there is a dramatic impact on this measure - much bigger than that of any actual increase or decrease in service support quality!

      Of course we need to make sure that we do as much as we can to facilitate Incident recording - at least to get a more accurate picture of the situation. This is actually part of our project. It will make us look more efficient, but in reality it will not provide any immediate improvement to the user community.
    2. Some Incidents can only be resolved with changes.

      If we really want to improve the situation when installations, code changes, etc. are required, we will need to review our procurement policies, contracts with suppliers, organization of support, and so on. This has not been tackled by our project, so there is little chance things will improve in this area right now.

    Also, it is an "average", which means that a few exceptionally bad or difficult Incidents can drag the results in such a way that the average is not representative of the quality of service provided. Next, it does not take into account that some Incidents need to be resolved extremely fast because they have a high impact, while other Incidents may be resolved later because they have less impact on our customers.

    For these reasons, we should probably pick another measure... why not "Percentage of Incidents resolved in time"?

    I'll discuss this in my next post.