Monitoring sucks, don’t make it worse

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In my former group, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of those machines is to do computation. Sometimes that means chewing up memory.
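
To make the monitor-versus-alert split concrete, here is a minimal sketch in Python. The role names, the metric name, and the 95% threshold are assumptions made up for illustration, not the configuration of any real setup. The idea is simply that a metric can be collected everywhere while only some roles have an alert threshold attached.

```python
# Hypothetical roles and thresholds, for illustration only.
# Every role is monitored; only roles with an entry here can alert.
ALERT_THRESHOLDS = {
    "server": {"memory_pct": 95.0},
    "compute_node": {},  # memory is still collected, but never alerts
}


def should_alert(role, metric, value):
    """Return True only if this role has a threshold for the metric
    and the observed value meets or exceeds it."""
    threshold = ALERT_THRESHOLDS.get(role, {}).get(metric)
    return threshold is not None and value >= threshold
```

With this table, a server at 96% memory would alert, while a compute node at 99% would only show up in the collected metrics.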

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].
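
The tiers might be sketched like this in Python. The severity names and channel names are hypothetical. The point is only that each level maps to exactly one reaction, and that the lowest level is recorded without notifying anyone.

```python
from enum import Enum


class Severity(Enum):
    INFO = 1        # worth recording, nobody needs to act
    ACTIONABLE = 2  # needs action, but can wait for business hours
    URGENT = 3      # a Really Important Thing


# Hypothetical channel names; None means "monitor only, do not notify".
ROUTES = {
    Severity.INFO: None,
    Severity.ACTIONABLE: "ticket",
    Severity.URGENT: "pager",
}


def route_alert(severity):
    """Return the notification channel for a severity, or None."""
    return ROUTES[severity]
```

Keeping the mapping in one small table also makes it easy to spot an alert level that has no reaction attached to it.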

The great thing about having a little bit of pain associated with alerts is that it also acts as incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
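
Nagios has a scheduled downtime feature for exactly this. The underlying logic is simple enough to sketch in Python. The host group name and the window below are invented for the example, and a real system would load its windows from wherever maintenance is scheduled.

```python
from datetime import datetime

# Hypothetical maintenance calendar: host group -> list of (start, end).
MAINTENANCE_WINDOWS = {
    "cluster-a": [(datetime(2024, 1, 10, 8, 0), datetime(2024, 1, 10, 17, 0))],
}


def in_maintenance(host_group, when, windows=MAINTENANCE_WINDOWS):
    """Return True if `when` falls inside any window for the host group."""
    return any(start <= when < end for start, end in windows.get(host_group, []))


def should_notify(host_group, when, windows=MAINTENANCE_WINDOWS):
    """Suppress notifications for hosts that are down on purpose."""
    return not in_maintenance(host_group, when, windows)
```

The checks can keep running during the window. Only the notification step is skipped, so the history of what happened is still there afterward.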

LISA ’11: the first half of the week

If you’ve been following me on Twitter, you know I’ve been in Boston for the USENIX Large Installation System Administration (LISA) Conference. Once again, I have the honor of serving on the conference blog team, which means I spend all day sitting in sessions and all evening writing about them. We’re halfway through now, so here’s what I’ve written so far:

You can follow along with the rest of the blog team at

The joys of doing it right

A while back, I wrote a post about why it’s not always possible to DoItRight™, and that sometimes you just have to accept it.  Today I’m here to talk about a time that I did something right and how good it felt.  Now, that’s not to say that I’m eternally screwing up (although a good quarter of my Subversion commits are fixes of a commit I previously made), but there’s a difference between making something work and making it work well.

I decided that since we have a Nagios server, I might as well have it check on the health of our Condor services.  From what I could tell, no such checks currently exist, so I decided to write my own.  Nagios checks can be very simple: run a command or two, and then return a number that means something to Nagios.  Many checks are written in bash or another shell script because they are so simple.  For my checks, I wanted to do some parsing of the command outputs to determine the state of job queues, etc.  Since that kind of work is a little heavy for a shell script, I opted to write it in Perl.  Yay Perl!
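
The "number that means something to Nagios" is the plugin's exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The checks described here were written in Perl; what follows is only a Python sketch of the general shape, not that code. It assumes the idle job count has already been parsed out of the scheduler's output, and the thresholds of 100 and 500 are arbitrary.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(idle_jobs, warn, crit):
    """Map an idle job count to a (status, message) pair."""
    if idle_jobs >= crit:
        return CRITICAL, f"CRITICAL - {idle_jobs} idle jobs"
    if idle_jobs >= warn:
        return WARNING, f"WARNING - {idle_jobs} idle jobs"
    return OK, f"OK - {idle_jobs} idle jobs"


def main(argv):
    """Take the idle job count as the first argument (an assumption made
    for this sketch) and return the exit code Nagios should see."""
    try:
        idle_jobs = int(argv[1])
    except (IndexError, ValueError):
        print("UNKNOWN - expected an idle job count")
        return UNKNOWN
    status, message = classify(idle_jobs, warn=100, crit=500)
    print(message)
    return status

# As a script, the last step would be: import sys; sys.exit(main(sys.argv))
```

Returning UNKNOWN on bad input matters: a check that cannot tell what state things are in should say so rather than report OK.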

Since there aren’t any checks available, I thought my work might be useful to others in the community.  As a result, I wanted to make sure my code was respectable.  This meant I spent some time designing, coding, and testing options that we don’t want but others might find useful.  It meant putting extra documentation into the code (and eventually writing some pod before I share the code publicly).  It meant mostly following the coding style of the Linux kernel (I chose that because “why not?”).

Some readers will (correctly) note that the Linux kernel coding style does not guarantee good code.  I don’t mean to suggest that it does, but I’ve found that it forced me to think about my code more deeply than I otherwise would.  Not being a programmer, most of the code I write is to fit a small need of mine and the quality is defined as “does it do what I want it to?”  Writing something with the intent of sharing it publicly and forcing yourself to not cut corners can make the work more difficult, but the end result is a beauty to behold.

Why disk utilization matters

Here’s a rare weekend post to help make up for my lack of blogging this week.  Once again it is work related.  My life is boring and uneventful otherwise. 🙂

Unless you plan on sitting around babysitting your servers every minute of every day, it is probably a good idea to have a monitoring system like Nagios set up.  My department, eternal mooches that we are, opted to not set one up and instead use the service provided by the college-level IT staff.  It worked great, until one day when it didn’t any more.  Some config change hosed the system and the Nagios service no longer ran.  I didn’t consider it much of a big deal until about 7 days ago.

This time last week, I was enjoying a vacation with my beautiful wife in celebration of our 2nd wedding anniversary.  When I got home Sunday evening, I noticed that several people had sent in e-mails complaining that they couldn’t log in to their Linux machines.  Like a fool, I spent the last few hours of my freedom trying to resolve the issue.  We figured out it was a problem with the LDAP server.  Requests went out, but no answers were ever received.  So after a too-long e-mail exchange, we got a workaround set up and I called it good enough.  I went to bed at one o’clock, thoroughly exhausted.

The next day we started working on figuring out what was the problem.  At first it seemed like the issue was entirely with the LDAP server, which is run by the central computing group on campus.  I was pleased that it was not one of my systems.  Then they noticed that there were a lot of open connections from two of my servers: one was our weather data website, and the other was our weather data ingest server.  Both machines work pretty hard, and at first I thought maybe one of the image generation processes just choked and that tripped up everything else.

Further investigation showed that the root cause of the issue was probably that the data partition on the ingest server was full.  This caused the LDM processes to freak out, which resulted in a lot more error messages in the log, which then filled up /var.  Now the system was running so slowly that nothing was behaving right, and since the web server is tightly married to the data server, they both ended up going crazy and murdering the LDAP server.

Now there are scripts that are supposed to run to scour the data server to keep the disks from filling.  I thought perhaps something had kept them from running.  I looked through logs, through cron e-mails, and then ran some find commands by hand.  Everything suggested that the scouring was working as it should.  The more I looked, the more I realized it’s just that the radar data is ever-growing.  I just need to add more disk.

Had I been keeping an eye on the disk usage these past few months, I would have known this sooner, and been able to take care of it before critical services got beaten up.  I think on Monday, I’ll lend a hand getting the Nagios server up and running again.  Learn from my mistakes, readers!
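
For anyone without a monitoring server handy, even a tiny check would have caught this. The following is a rough Python sketch using the standard library's shutil.disk_usage. The 80% and 90% thresholds are arbitrary assumptions, and the path to check is whatever partition matters on your system.

```python
import shutil


def disk_used_pct(path):
    """Return the percentage of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_status(used_pct, warn=80.0, crit=90.0):
    """Classify a usage percentage against warning and critical thresholds."""
    if used_pct >= crit:
        return "CRITICAL"
    if used_pct >= warn:
        return "WARNING"
    return "OK"
```

Calling disk_status(disk_used_pct(path)) from cron against the data partition and /var is the bare minimum. Recording the percentage over time is better, since steady growth like the radar data shows up as a trend long before it hits a threshold.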