Getting good reports from users

Earlier this week, Rob Soden asked an excellent question on Twitter: how do you train your non-technical co-workers to file useful/actionable bug reports? Is it possible?

It is, but this question doesn’t just apply to developers. Operations people need useful incident reports from users as well. Users generally don’t provide unhelpful reports maliciously. In my experience, reports are often least useful when users are trying hardest to be helpful. A healthy relationship between a customer and a service provider means that everyone is working together. Users often don’t know what information is helpful; they just want to get on with their work.

A useful report contains the following:

  • What happened and what the desired result is
  • What the user was doing when it happened
  • Any changes that may have happened
  • If the product/service ever worked and if so, when it last worked
  • If the problem is reproducible and if so, what steps are required to reproduce it

A useful report does not contain speculation about what the solution is. But what, Rob asked, if the lesson doesn’t stick? It gets annoying having to ask over and over again. As with any repetitive task, the answer here is to automate it. One option is to have a canned response at the ready. This way, you polish the text once and you never have to worry about irritation creeping in. A better solution is to have a website that explicitly asks for the information you need. This saves a round of back-and-forth and hopefully shortens the time to resolution.
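A form that asks for each item explicitly can also refuse to submit until the questions are answered. Here is a minimal sketch of that intake check in Python; the field names are hypothetical stand-ins for whatever your own form collects:

```python
# Hypothetical required questions, mirroring the checklist above.
REQUIRED_FIELDS = [
    "what_happened",
    "desired_result",
    "user_activity",
    "recent_changes",
    "last_worked",
    "reproduction_steps",
]

def missing_fields(report: dict) -> list:
    """Return the required questions the user left blank or omitted."""
    return [f for f in REQUIRED_FIELDS if not report.get(f, "").strip()]

# Re-prompt the user for anything missing before the report reaches an agent.
report = {"what_happened": "Export fails", "desired_result": "A CSV file"}
print(missing_fields(report))
```

The point isn’t the code, it’s the workflow: the system, not a human, does the repetitive asking.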

The most important part is to modify the process as you learn. Tweak the questions to make it easier on the users. Provide multiple-choice or other defined-response choices when you can to reduce ambiguity. No good process is ever in a final state.

Bug trackers and service desks

I have recently been evaluating options for our customer support work. For years, my company has used a bug tracker to handle both bugs and support requests. It has worked, mostly well enough, but there are some definite shortcomings. I can’t say that I’m an expert on all the offerings in the two spaces, but I’ve used quite a few over the years. My only conclusion is that there is no single product that does both well.

Much of the basic functionality is the same, but it’s the differences that are key. A truly excellent service desk system is aware of customer SLAs. Support tickets shouldn’t languish untouched for months or even years. But it’s perfectly normal for minor bugs to live indefinitely, especially if the “bug” is actually a planned enhancement. Service desks should present customers with a self-service portal, if only to see the current status of their tickets. Unfortunately, most bug trackers present too much information for a non-technical user (can you imagine having your CEO using Bugzilla to manage tickets?). While this interface is great for managing bugs, it’s pretty lousy otherwise.
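SLA awareness is mostly bookkeeping: each ticket carries a tier, and the system flags anything that has waited too long. A rough sketch, with made-up tier names and response times:

```python
from datetime import datetime, timedelta

# Hypothetical SLA table: maximum hours a ticket may wait for a response.
SLA_RESPONSE_HOURS = {"gold": 4, "silver": 24, "bronze": 72}

def is_sla_breached(tier: str, opened_at: datetime, now: datetime) -> bool:
    """True if the ticket has waited longer than its tier's SLA allows."""
    limit = timedelta(hours=SLA_RESPONSE_HOURS[tier])
    return now - opened_at > limit

opened = datetime(2024, 1, 1, 9, 0)
print(is_sla_breached("gold", opened, datetime(2024, 1, 1, 14, 0)))  # 5h > 4h
```

Bug trackers rarely have this notion at all, which is exactly why support tickets languish when they live in one.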

Of course, because they’re similar in many respects, the ideal solution has your service desk and your bug tracker interacting smoothly. Sometimes support requests are the result of a bug. Having a way to tie them together is very beneficial. How will your service desk agents know to follow up with the customer unless the bug tracker updates affected cases when a bug is resolved? How will your developers get the information they need if the service desk can’t update the bug tracker?
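The bug-to-case link is the crux. A sketch of the idea, using an in-memory link table with hypothetical IDs in place of a real integration:

```python
# Hypothetical link table between bugs and the support cases they caused.
bug_to_cases = {"BUG-101": ["CASE-7", "CASE-12"]}
case_status = {"CASE-7": "waiting-on-bug", "CASE-12": "waiting-on-bug"}

def on_bug_resolved(bug_id: str) -> list:
    """When the bug tracker marks a bug resolved, flag every linked
    support case for follow-up so agents know to contact the customer."""
    affected = bug_to_cases.get(bug_id, [])
    for case_id in affected:
        case_status[case_id] = "follow-up"
    return affected

print(on_bug_resolved("BUG-101"))
```

In a real deployment this would be a webhook or API call between the two products, but the data model is the same: one bug, many cases, and a status change that fans out.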

Many organizations, especially small businesses and non-profits, will probably use one or the other. Development-oriented organizations will lean toward bug trackers and others will favor service desk tools. In either case, they’ll make do with the limitations for the use they didn’t favor. Still, it behooves IT leadership to consider separate-but-interconnected solutions in order to achieve the maximum benefit.

Service credibility: the most important metric

I recently overheard a conversation among three instructors about their university’s Blackboard learning management system. They were swapping stories of times when the system failed. One of them mentioned that one time during a particularly rocky period in the service’s history, he entered a large number of grades into the system only to find that they weren’t there the next day. As a result, he started keeping grades in a spreadsheet as a backup of sorts. The other two recalled times when the system would repeatedly fail mid-quiz for students. Even if the failures were due to their own errors, the point is that they lost trust in the system.

This got me thinking about “shadow systems.” Shadow systems are hardly new; people have been working around sanctioned IT systems since the first IT system was sanctioned. If a customer doesn’t like your system for whatever reason, they will find their own ways of doing things. This could be the person who brings their own printer in because the managed printer is too far away or the department that runs their own database server because the central database service costs too much. Even the TA who keeps grades in a spreadsheet in case Blackboard fails is running a shadow system, and even these trivial systems can have a large aggregate cost.

Because my IT service management class recently discussed service metrics, I considered how trust in a system might be measured. My ultimate conclusion: all your metrics are crap. Anything that’s worth measuring can’t be measured. At best, we have proxies.

Think about it. Does a student really care if the learning management system has five nines of uptime if that 0.001% of downtime falls while she’s taking a quiz? Does the instructor care that 999,999 transactions complete successfully when his grade entry is the one that doesn’t?
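The arithmetic behind the nines makes the point concrete. Even five nines leaves a budget of roughly five minutes of downtime a year, and nothing in the metric says when those minutes land:

```python
# Minutes of permitted downtime per year at each availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes(availability: float) -> float:
    """Minutes of allowed downtime per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

for nines in (0.999, 0.9999, 0.99999):
    print(f"{nines:.5f} -> {downtime_minutes(nines):.1f} min/yr")
```

Three nines allows about 526 minutes a year; five nines, about 5. A quiz takes longer than 5 minutes, so one unlucky student can absorb the entire annual budget.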

We talk about “operational credibility” using service metrics, but do they really tell us what we want to know? What ultimately matters in preventing shadow systems is if the user trusts the service. How someone feels about a service is hard to quantify. Quantifying how a whole group feels about a service is even harder. Traditional service metrics are a proxy at their best. At their worst, they completely obscure what we really want to know: does the customer trust the system enough to use it?

There are a whole host of factors that can affect a service’s credibility. Broadly speaking, I place them into four categories:

  • Technical – Yes, the technical performance of a system does matter. It matters because it’s what you measure, because it’s what you can prove, and because it affects the other categories. The trick is to avoid thinking you’re done because you’ve taken care of technical credibility.
  • Psychological – Perception is reality, and how people perceive things is driven by the inner workings of the human mind. To a large degree, service providers have little control over the psychology of their customers. Perhaps the most important area of control is the proper management of expectations. Incident and problem response, as well as general communication, are also critical factors.
  • Sociological – One disgruntled person is probably not going to build a very costly shadow system. A whole group of disgruntled people will rack up cost quickly. Some people don’t even know they hate something until the pitchfork brigade rolls along.
  • Political – You can’t avoid politics. I debated including this in psychological or sociological, but I think it belongs by itself. If someone can keep some of their clout within the organization by liking or disliking a service, you can bet they will. I suspect political factors almost always work against credibility, and are often driven by short-sightedness or fear.

If I had the time and resources, I’d be interested in studying how various factors relate to customer trust in a service. It would be interesting to know, especially for services that don’t have a direct financial impact, what sort of requirements can be relaxed and still meet the level of credibility the customer requires. If you’re a graduate student studying service management, I present this challenge to you: find a derived value that can be tightly correlated to the perceived credibility of a service. I believe it can be done.

Book review: The Visible Ops Handbook

I first heard of The Visible Ops Handbook during Ben Rockwood’s LISA ’11 keynote. Since Ben seemed so excited about it, I added it to the list of books I should (but probably would never) read. Then Matt Simmons mentioned it in a brief blog post and I decided that if I was ever going to get around to reading it, I needed to stop putting it off. I bought it that afternoon, and a month later I’ve finally had a chance to read it and write a review. Given the short length and high quality of this book, it’s hard to justify such a delay.

Information Technology Infrastructure Library (ITIL) training has been a major push in my organization the past few years. ITIL is a formalized framework for IT service management, but seems to be unfavored in the sysadmin community. After sitting through the foundational training, my opinion was of the “it sounds good, but…” variety. The problem with ITIL training and the official documentation is that you’re told what to do without ever being told how to do it. Kevin Behr, Gene Kim, and George Spafford solve that problem in less than 100 pages.

Based on observations and research of high-performing IT teams, The Visible Ops Handbook assumes that no ITIL practices are being followed. Implementation of the ITIL basics is broken down into four phases. Each phase includes real-world accounts, the benefits, and likely resistance points. This arms the reader with the tools necessary to sell the idea to management and sysadmins alike.

The introduction addresses a very important truism: “Something must need improvement, otherwise why read this?” The authors present a general recap of their findings, including these compelling statistics: 80% of outages are self-inflicted, and 80% of mean time to repair (MTTR) is wasted on non-productive activities (e.g. trying to figure out what changed).

Phase 1 focuses on “stabilizing the patient.” The goal is to reduce unplanned work from 80% of outage time to 25% or less. To do this, triage the most critical systems that generate the most unplanned work. Control when and how changes are made and fence off the systems to prevent unauthorized changes. While exceptions might be tempting, they should be avoided. The authors state that “all high performing IT organizations have only one acceptable number of unauthorized changes: zero.”
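“Fencing off” systems implies being able to detect changes nobody authorized. One common approach, sketched here with a hypothetical baseline of one config file, is to record a fingerprint of each file when a change is approved and compare against it later:

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable fingerprint of a config file's contents."""
    return hashlib.sha256(text.encode()).hexdigest()

# Hypothetical approved baseline, recorded when a change was authorized.
baseline = {"/etc/app.conf": fingerprint("port = 8080\n")}

def unauthorized_changes(current: dict) -> list:
    """Paths whose current contents no longer match the approved baseline."""
    return [path for path, text in current.items()
            if fingerprint(text) != baseline.get(path)]

# The file drifted from its approved state -> it gets flagged.
print(unauthorized_changes({"/etc/app.conf": "port = 9090\n"}))
```

Tools like Tripwire and configuration management systems do this at scale; the principle is the same either way: zero unauthorized changes means every drift from baseline is an incident.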

After reading Phase 1, I already had an idea to suggest. My group handles change management fairly well, but we don’t track requests for change (RFCs) well. Realizing how important that is, I convinced our group’s manager and our best developer that it was a key feature to add to our configuration management database (CMDB) system.

In Phase 2, the reader performs a catch-and-release program and finds “fragile artifacts”: those systems or services with a low change success rate and a high MTTR. After all systems have been “bagged and tagged”, it’s time to make a CMDB and a service catalog. This phase is the next place that my group needs to do work. We have a pretty nice CMDB that’s integrated with our monitoring systems and our job schedulers, but we lack a service catalog. Users can look at the website and see what we offer, but that’s only a subset of the services we run.
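If you track changes per system, flagging fragile artifacts is a simple filter. A sketch with invented thresholds and system names (the book doesn’t prescribe exact numbers, so these are illustrative):

```python
# Hypothetical per-system history: (successful_changes, failed_changes, mttr_hours)
history = {
    "billing-db": (4, 6, 12.0),    # low success rate, slow to repair
    "web-frontend": (48, 2, 0.5),  # high success rate, quick to repair
}

def is_fragile(ok: int, failed: int, mttr_hours: float,
               min_success_rate: float = 0.8, max_mttr: float = 4.0) -> bool:
    """Flag systems with a low change success rate AND a high MTTR."""
    rate = ok / (ok + failed)
    return rate < min_success_rate and mttr_hours > max_mttr

fragile = [name for name, (ok, bad, mttr) in history.items()
           if is_fragile(ok, bad, mttr)]
print(fragile)
```

The thresholds matter less than having the data at all; you can’t rank fragility without recording change outcomes in the first place.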

Phase 3 focuses on creating a repeatable build library. The best IT organizations make infrastructure easier to build than repair. A definitive software library, containing master images for all software necessary to rebuild systems, is critical. For larger groups, forming a separate release management team to engineer repeatable builds for the different services is helpful. The release management team should be separate from the operational group and consist of generally senior staff.

The final phase discusses continual improvement. If everyone stopped at “best practices”, no one would have a competitive advantage. Suggested metrics for each key process area are listed and explained. After all, you can’t manage what you can’t measure. Finding out what areas are the worst makes it easier to decide what to improve upon.

The last third of the book consists of appendices that serve as useful references for the four phases. One of the appendices includes a suggested table layout for a CMDB system. The whole book is focused on the practical nature of ITIL implementation and guiding organizational learning. At times, it assumes a large staff (especially when discussing separation of duties), so some of the ideas will have to be adapted to meet the needs of smaller groups. Nonetheless, this book is an invaluable resource to anyone involved in IT operations.