Service credibility: the most important metric

I recently overheard a conversation among three instructors about their university’s Blackboard learning management system. They were swapping stories of times when the system failed. One of them mentioned that, during a particularly rocky period in the service’s history, he entered a large number of grades into the system only to find that they weren’t there the next day. As a result, he started keeping grades in a spreadsheet as a backup of sorts. The other two recalled times when the system would repeatedly fail mid-quiz for students. Even if the failures were due to their own errors, the point is that they lost trust in the system.

This got me thinking about “shadow systems.” Shadow systems are hardly new; people have been working around sanctioned IT systems since the first IT system was sanctioned. If a customer doesn’t like your system, for whatever reason, they will find their own ways of doing things. This could be the person who brings in their own printer because the managed printer is too far away, or the department that runs its own database server because the central database service costs too much. Even the TA who keeps grades in a spreadsheet in case Blackboard fails is running a shadow system, and even these trivial systems can have a large aggregate cost.

Because my IT service management class recently discussed service metrics, I considered how trust in a system might be measured. My ultimate conclusion: all your metrics are crap. Anything that’s worth measuring can’t be measured. At best, we have proxies.

Think about it. Does a student really care that the learning management system has five nines of uptime if the 0.001% of downtime lands while she’s taking a quiz? Does the instructor care that 999,999 transactions complete successfully when his grade entry is the one that doesn’t?
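For scale, a quick bit of arithmetic (sketched in Python below) shows just how little time five nines actually leaves for failure, which is exactly why a single badly timed outage can swamp the statistic:

```python
# Back-of-the-envelope: how much downtime each availability target allows per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for target, label in [(0.999, "three nines"), (0.9999, "four nines"), (0.99999, "five nines")]:
    downtime = MINUTES_PER_YEAR * (1 - target)
    print(f"{label}: about {downtime:.1f} minutes of downtime per year")
```

Five nines works out to roughly five minutes a year, and those five minutes feel very different depending on whether they land at 3 a.m. on a Sunday or in the middle of a timed quiz.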

We talk about “operational credibility” using service metrics, but do they really tell us what we want to know? What ultimately matters in preventing shadow systems is if the user trusts the service. How someone feels about a service is hard to quantify. Quantifying how a whole group feels about a service is even harder. Traditional service metrics are a proxy at their best. At their worst, they completely obscure what we really want to know: does the customer trust the system enough to use it?

A whole host of factors can affect a service’s credibility. Broadly speaking, I place them into four categories:

  • Technical – Yes, the technical performance of a system does matter. It matters because it’s what you measure, because it’s what you can prove, and because it affects the other categories. The trick is to avoid thinking you’re done because you’ve taken care of technical credibility.
  • Psychological – Perception is reality, and how people perceive things is driven by the inner workings of the human mind. To a large degree, service providers have little control over the psychology of their customers. Perhaps the most important area of control is the proper management of expectations. Incident and problem response, as well as general communication, are also critical factors.
  • Sociological – One disgruntled person is probably not going to build a very costly shadow system. A whole group of disgruntled people will rack up cost quickly. Some people don’t even know they hate something until the pitchfork brigade rolls along.
  • Political – You can’t avoid politics. I debated including this in psychological or sociological, but I think it belongs by itself. If someone can keep some of their clout within the organization by liking or disliking a service, you can bet they will. I suspect political factors almost always work against credibility, and are often driven by short-sightedness or fear.

If I had the time and resources, I’d be interested in studying how various factors relate to customer trust in a service. It would be interesting to know, especially for services that don’t have a direct financial impact, what sort of requirements can be relaxed and still meet the level of credibility the customer requires. If you’re a graduate student studying service management, I present this challenge to you: find a derived value that can be tightly correlated to the perceived credibility of a service. I believe it can be done.
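To make the challenge slightly more concrete, here is one hypothetical shape such a derived value could take: a weighted blend of technical and human proxies. Every input, weight, and function name below is an assumption for illustration, not a validated model.

```python
def credibility_score(uptime_fraction, incident_count, survey_trust, shadow_system_reports,
                      weights=(0.3, 0.2, 0.4, 0.1)):
    """Hypothetical composite credibility value in [0, 1]; all weights are illustrative."""
    w_up, w_inc, w_survey, w_shadow = weights
    incident_penalty = 1.0 / (1.0 + incident_count)        # more visible incidents, less credibility
    shadow_penalty = 1.0 / (1.0 + shadow_system_reports)   # every workaround is a vote of no confidence
    return (w_up * uptime_fraction
            + w_inc * incident_penalty
            + w_survey * survey_trust
            + w_shadow * shadow_penalty)

# Great uptime, but users don't trust it and keep their own spreadsheets anyway.
print(round(credibility_score(0.99999, 2, 0.45, 12), 3))
```

The interesting research question is whether any formula like this can be tuned until it tracks what users actually report about their trust in the service.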

Monitoring sucks, don’t make it worse

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In my former group’s setup, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of the machines is to do computation. Sometimes that means chewing up memory.

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.
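Here is a minimal sketch of what that rule might look like in practice. The metric names, thresholds, and the idea of attaching a “runbook” to every alert are my own illustrative assumptions, not any particular tool’s API: everything gets recorded, but nothing alerts unless someone has written down the expected reaction.

```python
import time

metrics_log = []  # stand-in for a real time-series store

# Hypothetical alert definitions: if there's no documented reaction, there's no alert.
alert_rules = {
    "web01.disk_used_pct": {"threshold": 90, "runbook": "Expand the volume or prune old logs."},
    "node042.mem_used_pct": {"threshold": 95, "runbook": None},  # compute node: monitor it, don't page on it
}

def send_alert(metric, value, runbook):
    print(f"ALERT {metric}={value}: {runbook}")

def record(metric, value):
    """Always monitor: keep the data point whether or not anyone is alerted."""
    metrics_log.append((time.time(), metric, value))
    rule = alert_rules.get(metric)
    if rule and rule["runbook"] and value >= rule["threshold"]:
        send_alert(metric, value, rule["runbook"])

record("node042.mem_used_pct", 97)  # logged silently -- chewing up memory is the node's job
record("web01.disk_used_pct", 93)   # logged and alerted, with the expected reaction attached
```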

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].
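A rough sketch of the multi-level idea, with hypothetical severities and delivery channels; in real life this logic would live in Nagios contact and escalation definitions or in your ticketing system, not in hand-rolled Python:

```python
from enum import Enum

class Severity(Enum):
    INFO = 1    # record it; nobody needs to be told
    TICKET = 2  # needs action, but business hours are fine
    PAGE = 3    # Really Important Things[tm]

def notify(severity, message):
    # Hypothetical delivery channels; swap in your actual ticket queue and pager service.
    if severity is Severity.PAGE:
        print(f"PAGER: {message}")
    elif severity is Severity.TICKET:
        print(f"TICKET: {message}")
    else:
        print(f"LOG ONLY: {message}")

notify(Severity.TICKET, "Nightly backup ran 40% longer than usual")
notify(Severity.PAGE, "Batch scheduler is down")
```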

The great thing about having a little bit of pain associated with alerts is that it also acts as an incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system: if a check is worth running, it belongs in the monitoring system where everyone can see it, act on it, and maintain it.

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
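Most monitoring systems, Nagios included, have scheduled-downtime support for exactly this. If yours somehow doesn’t, even a crude guard like the hypothetical one below beats 180,000 emails:

```python
from datetime import datetime, timezone

# Hypothetical maintenance windows: (start, end) pairs in UTC.
MAINTENANCE_WINDOWS = [
    (datetime(2012, 11, 3, 6, 0, tzinfo=timezone.utc),
     datetime(2012, 11, 3, 18, 0, tzinfo=timezone.utc)),
]

def in_maintenance(now=None):
    now = now or datetime.now(timezone.utc)
    return any(start <= now <= end for start, end in MAINTENANCE_WINDOWS)

def maybe_alert(message):
    if in_maintenance():
        print(f"suppressed (planned outage): {message}")
    else:
        print(f"ALERT: {message}")

maybe_alert("cluster node unreachable")
```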

Coming up: LISA ’12

It may seem like I’ve not been writing much lately, but nothing could be further from the truth. It’s just that my writing has been for grad school instead of Blog Fiasco. But don’t worry, soon I’ll be blogging like a madman. That’s right: it’s time for LISA ’12. Once again, I have the privilege of being on the conference blog team and learning from some of the TopPeople[tm] in the field. Here’s a quick look at my schedule (subject to change based on level of alertness, addition of BoFs, etc.):

Sunday through Friday.

Now I just need to pack my bags and get started on the take-home final that’s due mid-week. Look for posts from me and my team members Matt Simmons and Greg Riedesel on the USENIX Blog.

Scattered thoughts on sysadmin ethics

Last week, a Redditor posted a rant titled “why I’m an idiot, but refuse to change my ways.” I have to give him (or her, but let’s stick with “him” for the sake of simplicity and statistical likelihood) credit for recognizing the idiocy of the situation, but his actions in this case do a disservice to the profession of systems administration. My initial reaction was moderated by my assumption that this person is early-career and my ability to see some of myself in that post. But as I considered it further, I realized that even in my greenest days, I did not consider unplanned outages to be a license for experimentation.

Not being in a sysadmin role anymore, I’ve had the opportunity to consider systems administration from the perspective of a learned outsider. I was pleasantly surprised to see that the responses to the poster were fairly aghast. There are a great many ethical considerations for sysadmins, partly due to the responsibility of keeping business-critical services running and partly due to the broad access to business and personal data. So much of the job is knowing the appropriate behavior, not just the appropriate technical skills.

This may be the biggest benefit of a sysadmin degree program: training future systems administrators in the appropriate professional ethics. I am by no means trying to imply that most sysadmins are lacking. On the contrary, almost all of the admins I’ve encountered take their ethical requirements very seriously. Nonetheless, a strain of BOFHism still runs through the community. As the world becomes increasingly reliant on computer systems, a more rigorous adherence to a professional ethic will be required.

System Administrator Appreciation Day, or: a crisis of identity

Today is System Administrator Appreciation Day, a day for everyone to express their gratitude for the sysadmins who maintain the technical infrastructure we all rely on. For the first time since I’ve heard of this holiday, I’m not a practicing sysadmin. Wait, what? I haven’t said much about it for a variety of reasons, but I transferred to a new job (I had to move all the way across the hall!) in June. We’re still defining the exact scope of my job, but the basic foci are training, documentation, and engaging new and existing scientific communities. It should be interesting work, but it’s causing a bit of an identity crisis.

My whole professional career has been systems administration. Trying to separate myself from that has been challenging. It’s small consolation (though sufficient justification for entering Think Geek’s giveaway) that I still administer my desktop at work and a minimal home network. In much the same way that I consider myself a meteorologist because I have credentials and practice as a hobbyist, I can still consider myself a sysadmin. But it’s not the same.

Since starting my new job, I’ve caught myself thinking of myself as a[n active professional] sysadmin. When I realize that I’m not, it leads to a search for identity. The fact that my new job doesn’t seem to have a broadly accepted title (officially, I’m a “Research Programmer”, but that’s more of a bureaucratic shortcut than an actual reflection of reality) doesn’t help. There’s no simple explanation of what it is…I do here.

It’s quite likely I’ll return to the sysadmin ranks at some point, either professionally or by contributing to the Fedora Infrastructure group. Until then, I’ll stay tuned in through my LOPSA membership and by going to LISA. Maybe I’m just a sysadmin-in-exile?

Book review: The Visible Ops Handbook

I first heard of The Visible Ops Handbook during Ben Rockwood’s LISA ’11 keynote. Since Ben seemed so excited about it, I added it to the list of books I should (but probably would never) read. Then Matt Simmons mentioned it in a brief blog post and I decided that if I was ever going to get around to reading it, I needed to stop putting it off. I bought it that afternoon, and a month later I’ve finally had a chance to read it and write a review. Given the short length and high quality of this book, it’s hard to justify such a delay.

Information Technology Infrastructure Library (ITIL) training has been a major push in my organization the past few years. ITIL is a formalized framework for IT service management, but it seems to be viewed unfavorably in the sysadmin community. After sitting through the foundational training, my opinion was of the “it sounds good, but…” variety. The problem with ITIL training and the official documentation is that you’re told what to do without ever being told how to do it. Kevin Behr, Gene Kim, and George Spafford solve that problem in less than 100 pages.

Based on observations and research of high-performing IT teams, The Visible Ops Handbook assumes that no ITIL practices are being followed. Implementation of the ITIL basics is broken down into four phases. Each phase includes real-world accounts, the benefits, and likely resistance points. This arms the reader with the tools necessary to sell the idea to management and sysadmins alike.

The introduction addresses a very important truism: “Something must need improvement, otherwise why read this?” The authors present a general recap of their findings, including these compelling statistics: 80% of outages are self-inflicted, and 80% of mean time to repair (MTTR) is wasted on non-productive activities (e.g. trying to figure out what changed).

Phase 1 focuses on “stabilizing the patient.” The goal is to reduce unplanned work from 80% of outage time to 25% or less. To do this, triage the most critical systems that generate the most unplanned work. Control when and how changes are made and fence off the systems to prevent unauthorized changes. While exceptions might be tempting, they should be avoided. The authors state that “all high performing IT organizations have only one acceptable number of unauthorized changes: zero.”

After reading Phase 1, I already had an idea to suggest. My group handles change management fairly well, but we don’t track requests for change (RFCs) well. Realizing how important that is, I convinced our group’s manager and our best developer that it was a key feature to add to our configuration management database (CMDB) system.
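As a hypothetical illustration of why tracking RFCs is worth the effort: once every approved change is recorded somewhere queryable, spotting the unauthorized changes that Phase 1 says should number zero reduces to a set difference. The record format below is invented for the example.

```python
# Changes detected on hosts (e.g., by configuration monitoring) vs. changes approved as RFCs.
detected_changes = {
    ("web01", "httpd.conf", "2012-11-01"),
    ("db02", "my.cnf", "2012-11-02"),
    ("web01", "sshd_config", "2012-11-02"),
}

approved_rfcs = {
    ("web01", "httpd.conf", "2012-11-01"),
    ("db02", "my.cnf", "2012-11-02"),
}

unauthorized = detected_changes - approved_rfcs
print(f"unauthorized changes: {len(unauthorized)}")  # the only acceptable number is zero
for host, item, date in sorted(unauthorized):
    print(f"  {host}: {item} changed on {date} with no matching RFC")
```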

In Phase 2, the reader performs a “catch and release” inventory and finds the “fragile artifacts.” Fragile artifacts are those systems or services with a low change success rate and a high MTTR. After all systems have been “bagged and tagged”, it’s time to make a CMDB and a service catalog. This phase is the next place that my group needs to do work. We have a pretty nice CMDB that’s integrated with our monitoring systems and our job schedulers, but we lack a service catalog. Users can look at the website and see what we offer, but that’s only a subset of the services we run.

Phase 3 focuses on creating a repeatable build library. The best IT organizations make infrastructure easier to build than repair. A definitive software library, containing master images for all software necessary to rebuild systems, is critical. For larger groups, forming a separate release management team to engineer repeatable builds for the different services is helpful. The release management team should be separate from the operational group and consist of generally senior staff.

The final phase discusses continual improvement. If everyone stopped at “best practices”, no one would have a competitive advantage. Suggested metrics for each key process area are listed and explained. After all, you can’t manage what you can’t measure. Finding out what areas are the worst makes it easier to decide what to improve upon.

The last third of the book consists of appendices that serve as useful references for the four phases. One of the appendices includes a suggested table layout for a CMDB system. The whole book is focused on the practical nature of ITIL implementation and guiding organizational learning. At times, it assumes a large staff (especially when discussing separation of duties), so some of the ideas will have to be adapted to meet the needs of smaller groups. Nonetheless, this book is an invaluable resource for anyone involved in IT operations.

Fedora 16 released

It’s that time again — a new release of Fedora is here! I’m about to eat my own dog food and upgrade, so while I do that, why don’t you check out the Release Announcement? This release announcement holds a special place in my heart because I mostly wrote it (along with contributions from others, of course!). That’s right, I’ve actually made a contribution. It sets a dangerous precedent, but I found writing the RA quite enjoyable. I’m particularly proud of my Jules Verne references in each of the “What’s New” subsections. Fortunately, we’ve got a little while to come up with “Beefy Miracle”-themed one liners.

So even though I haven’t yet installed it, I’m confident that Fedora 16 is just as great as the last 15 versions. I’ll have some notes about the upgrade process once it finishes.

What a sysadmin can learn from hurricane corner cases

One thing I’ve been focusing on lately is avoiding “It Works Well Enough” Syndrome. Maybe it’s because of the systems design classes I’m taking, or maybe it’s due to my frustration at having to fix something that was done months or years ago because it no longer works well enough. Sysadmins are particularly vulnerable to this trap because we’re often not trying to develop software, we’re just trying to solve an immediate problem. Unfortunately, things change over time and the underlying assumptions are no longer valid.

A relevant example from the world of tropical weather came up earlier this month. The National Hurricane Center’s 45th discussion for Hurricane Katia contained some very interesting text:

NO 96-HOUR POINT IS BEING GIVEN BECAUSE FORECAST POINTS IN THE
EASTERN HEMISPHERE BREAK A LOT OF SOFTWARE.

It makes sense that software focused on the Atlantic basin would only be concerned with western longitudes, right? It’s exceedingly rare for Atlantic tropical systems to exist east of the Prime Meridian, but apparently it’s not impossible. Whether it’s NHC or commercial software that the forecasters are concerned about is irrelevant. Clearly positive longitudes break things. It makes me wonder what broke when Tropical Storm Zeta continued into January 2006.
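Here’s a hypothetical example of the kind of hidden assumption that bites in this situation. The function, the basin bounds, and the sign convention are all invented for illustration, but the failure mode is the same: a routine written for the Atlantic quietly assumes every point lies west of the Prime Meridian.

```python
def plot_forecast_point(lat, lon):
    """Hypothetical Atlantic-basin routine; lon is degrees east, so western longitudes are negative."""
    # The hidden assumption: no Atlantic storm ever crosses the Prime Meridian.
    if not -110.0 <= lon <= 0.0:
        raise ValueError(f"longitude {lon} is outside the expected Atlantic range")
    print(f"plotting point at {lat:.1f}N, {abs(lon):.1f}W")

plot_forecast_point(38.0, -45.0)      # a typical Atlantic position: fine

try:
    plot_forecast_point(42.0, 2.5)    # a forecast point east of the Prime Meridian
except ValueError as err:
    print(f"forecast software falls over: {err}")
```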

Sidebar — It’s not our fault/everyone else does it, too

I don’t mean to demonize sysadmins or lionize developers in the first paragraph. There are plenty of sysadmins out there who want to take the time to develop robust tools to solve their problems. Often, they just don’t have the time because too many other demands have been placed upon them. By the same token, developers who methodically design and implement software still end up with a lot of bugs.

Dropping Dropbox

When Dropbox first came to my attention, I was in love. What a great way to keep various config files synchronized across computers. Then it came out that Dropbox’s encryption wasn’t quite as awesome as they let on. It turns out there’s no technical restriction on (at least certain) employees accessing your files. The data is encrypted, but server-side. Now, I’m not all that concerned that someone will target me to find out what my .ssh/config file contains (heck, I’d put it on dotfiles if someone asked nicely), but it does make me reconsider what is appropriate for Dropbox.

Recently, Dropbox announced some changes to the Terms of Service. While the license part is what caused the most uproar on the Internet, the de-duplication part is what stood out the most to me. I know it’s not in Dropbox’s best interests to pay to store a thousand copies of Rebecca_Black-Friday.mp3, but that’s not my concern. The wording suggests that the de-duplication is block-level as opposed to file-level, which is less worrisome, but given their previous lack of transparency about the encryption, I wonder how they’re actually implementing it. If it’s file-level and if it spans multiple accounts, then that seems like a really terrible idea.
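The privacy concern with cross-account, file-level de-duplication is easy to sketch. This is a simplified, hypothetical model, not a claim about how Dropbox actually implements it: if the service recognizes content by its hash, the upload result itself leaks whether anybody else already stores that exact file.

```python
import hashlib

stored_blobs = {}  # hypothetical server-side store: content hash -> blob

def upload(data: bytes) -> str:
    """Naive file-level de-duplication keyed on the content hash."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in stored_blobs:
        return "deduplicated"  # the server never needed the bytes: someone already has this file
    stored_blobs[digest] = data
    return "stored"

print(upload(b"Rebecca_Black-Friday.mp3 contents"))  # first account: "stored"
print(upload(b"Rebecca_Black-Friday.mp3 contents"))  # second account: "deduplicated" -- and that's the leak
```

Whether the de-duplication is per-account or global, and file-level or block-level, determines how much an outside party could infer, which is why the vague wording matters.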

I’ve recently switched everything I had in Dropbox over to SpiderOak. The synchronization seems a bit slower and the configuration is less simple (but it’s much easier to back up multiple directories, instead of having to barf symlinks everywhere), but the encryption is client-side so that it’s impossible for SpiderOak to divulge user data (unless they’re lying, too). If you’re interested in trying SpiderOak for yourself, sign up through this link and we’ll both get an extra 1 GB of storage for free.

A Cfengine learning experience

Note: This post refers to Cfengine 2. The difficulties I had may quite likely be a result of peculiarities in our environment or the limits of my own knowledge.

A few weeks ago, my friends at the University of Nebraska politely asked us to install host certificates on our Condor collectors and submitters so that flocking traffic between our two sites would be encrypted. It seemed like a reasonable request, so after getting certificates for 17-ish hosts from our CA, I set about trying to put them in place. I could have plopped them all in place easily enough using a for loop, but I decided it would make more sense to let Cfengine take care of it. This has the added advantage of making sure the certificate gets put in place automatically when a host gets reinstalled or upgraded.

I thought it would be nice if I tested my Cfengine changes locally first. I know just enough Cfengine to be dangerous, and I don’t want to spam the rest of the group with mail as I check in modifications over and over again. So after editing the input file on one of the servers, I ran cfagent -qvk. It didn’t work. The syntax looked correct, but nothing happened. After a bit, I asked my soon-to-be-boss for help.

It turned out that I didn’t quite get the meaning of the -k option. I always used it to run against the local cache of the input files, not realizing that it killed all copy actions. Had I looked at the documentation, I would have figured that out. Like I said, I know just enough to be dangerous.

I didn’t want to create a bunch of error email, since some hosts wouldn’t be getting host certificates, so I went with an IfFileExists statement that I could use to define a group to use in the copy: stanza. So I committed what I thought to be the correct changes and tried running cfagent again. The certificates still weren’t being copied into place. Looking at the output, I saw that it couldn’t find the file. Nonsense. It’s right there on the Cfengine server.

As it turns out, that’s not where IfFileExists looks, it looks on the server running cfagent. The file, of course, doesn’t exist locally because Cfengine hasn’t yet copied it. Eventually I surrendered and defined a separate group in cf.groups to reference in the appropriate input file. This makes the process more manual than I would have liked, but it actually works.

Oh, except for one thing. In testing, I had been using $(hostname) in a shellcommand: to make sure that the input file was actually getting read. When I finally got the copy: stanza sorted out, the certificates still weren’t being copied out. The cfagent output said it couldn’t find ‘/masterfiles/tmpl/security/host-certs/$(hostname).pem’. As it turns out, I thought $(hostname) was a valid Cfengine variable. Instead, it was being passed through to the shell and executed as a command. The end result was indistinguishable from what I intended in that case, but it didn’t translate to the copy: stanza. The variable I wanted was $(fqhost).