Monitored self

I first heard about Quantified Self when my friend Marius Ducea did a “HealthOps” BoF session at LISA ’11. I’ve since come across several friends and colleagues who participate to varying degrees. It always struck me as a sensible thing for a sysadmin to do: monitoring is already a key part of professional life, so why not extend it to personal life? After all, the sysadmin is the most important system to maintain.

I never got into it myself, despite being intrigued by the idea. For one, I would want to buy all the cool little gadgets, and it’s not an expense that I can justify to myself. Second, my website would probably crumble under all the graphs I’d make. Third, I don’t have enough impulse in me to overcome that inertia; it’s all being spent elsewhere.

Cool blog, bro, you’re writing about something you think is cool but have no experience with.

Relax, I’m getting to a point. Earlier today, I was chatting with Matt Simmons, who recently bought a FitBit for himself. The discussion turned to genetic testing. Matt expressed concern that knowing the results of DNA testing would bring on undesirable levels of paranoia, and I’m inclined to agree. It brought to mind one of the lessons from my monitoring manifesto: every alert should have a reaction.

In other words: if there’s nothing I can do about my DNA, I don’t particularly care to worry about it. Gene therapy is a rapidly developing field, and once it becomes affordable and effective, then maybe I’ll spin the GATC wheel. In the meantime, I’ll focus my efforts on things I can control. I really need to start doing that.

Monitoring sucks, don’t make it worse

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In the group I mentioned, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of those machines is to do computation. Sometimes that means chewing up memory.
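To make that concrete, here is a rough sketch (in Python, and definitely not the check we actually ran) of a memory plugin that applies different thresholds depending on a host’s role, so a busy compute node never wakes anyone up. The role names and numbers are invented for illustration; the exit codes follow the standard Nagios plugin convention of 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN.

```python
#!/usr/bin/env python3
"""Sketch of a role-aware memory check. Thresholds and role names are
made up; exit codes follow the Nagios plugin convention."""
import sys

# Hypothetical thresholds: servers get flagged early, while compute nodes
# are *expected* to chew up memory, so only complain when they are nearly full.
THRESHOLDS = {
    "server": {"warn": 80.0, "crit": 90.0},
    "compute": {"warn": 98.0, "crit": 99.5},
}


def check_memory(role: str, used_pct: float) -> int:
    limits = THRESHOLDS.get(role)
    if limits is None:
        print(f"UNKNOWN - no thresholds defined for role '{role}'")
        return 3
    if used_pct >= limits["crit"]:
        print(f"CRITICAL - memory at {used_pct:.1f}% on a {role}")
        return 2
    if used_pct >= limits["warn"]:
        print(f"WARNING - memory at {used_pct:.1f}% on a {role}")
        return 1
    print(f"OK - memory at {used_pct:.1f}%")
    return 0


if __name__ == "__main__":
    # Usage: check_memory.py <role> <used_percent>
    sys.exit(check_memory(sys.argv[1], float(sys.argv[2])))
```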

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].
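Even something as dumb as a severity-to-channel lookup gets you most of the way there. The sketch below is purely illustrative (the severity names, messages, and helper functions are all invented), but it shows the shape of it:

```python
"""Toy sketch of routing alerts by severity. The severities, messages,
and helper functions are invented for illustration."""


def send_email(msg: str) -> None:
    # Stand-in for opening a ticket or mailing the team list.
    print(f"[email/ticket] {msg}")


def send_page(msg: str) -> None:
    # Stand-in for whatever actually wakes a human up at 3 AM.
    print(f"[PAGER] {msg}")


ROUTES = {
    "low": send_email,      # can wait until business hours
    "critical": send_page,  # the Really Important Things[tm]
}


def alert(severity: str, msg: str) -> None:
    handler = ROUTES.get(severity)
    if handler is None:
        # Rule two again: an alert with no defined reaction is a bug.
        raise ValueError(f"no route defined for severity '{severity}'")
    handler(msg)


if __name__ == "__main__":
    alert("low", "scratch space on node c042 is 85% full")
    alert("critical", "LDAP is not answering queries")
```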

The great thing about having a little bit of pain associated with alerts is that it also acts as incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.
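If a check genuinely needs to stay but is prone to timing out, another option is to make the timeout surface as UNKNOWN instead of CRITICAL, so an overloaded Nagios box doesn’t page anyone. Here is a rough sketch; the condor_status invocation and the ten-second limit are just examples:

```python
#!/usr/bin/env python3
"""Sketch: report a slow check as UNKNOWN rather than CRITICAL.

A check that can't finish because the *monitoring* host is overloaded
should not look the same as a service that is actually down."""
import subprocess
import sys


def run_check(cmd: list[str], timeout_sec: float = 10.0) -> int:
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        print(f"UNKNOWN - check did not finish within {timeout_sec}s")
        return 3  # UNKNOWN, which you can choose not to alert on
    except FileNotFoundError:
        print(f"UNKNOWN - {cmd[0]} not found on this host")
        return 3
    print(result.stdout.strip() or "OK - check completed")
    return result.returncode


if __name__ == "__main__":
    # Hypothetical example: ask condor_status whether the pool answers.
    sys.exit(run_check(["condor_status", "-total"], timeout_sec=10.0))
```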

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.
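The better move is to feed those ad-hoc checks back into the monitoring system as passive results, so whoever is on call sees them in the same place as everything else. Nagios accepts PROCESS_SERVICE_CHECK_RESULT through its external command file; in the sketch below the command-file path, host name, and service description are assumptions you would adjust for your own install:

```python
#!/usr/bin/env python3
"""Sketch: submit a cron-driven check to Nagios as a passive result
instead of emailing the group. PROCESS_SERVICE_CHECK_RESULT is a real
Nagios external command; the paths and names here are placeholders."""
import time

# Assumption: adjust to wherever your install keeps nagios.cmd.
CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def submit_passive_result(host: str, service: str, code: int, output: str) -> None:
    line = (f"[{int(time.time())}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}\n")
    with open(CMD_FILE, "w") as cmd:
        cmd.write(line)


if __name__ == "__main__":
    # Hypothetical: the old cron job's "is the scheduler answering?" check.
    submit_passive_result("scheduler01", "batch_scheduler", 2,
                          "CRITICAL - scheduler not responding")
```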

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
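Nagios handles this with scheduled downtime, which can be driven from the same external command file. A sketch along these lines (SCHEDULE_HOST_DOWNTIME is a real command; the command-file path and host names are placeholders) could run at the start of a maintenance window:

```python
#!/usr/bin/env python3
"""Sketch: schedule Nagios downtime before planned maintenance so the
cluster doesn't send 180,000 emails. The command-file path and host
names are placeholders."""
import time

# Assumption: adjust to wherever your install keeps nagios.cmd.
CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def schedule_downtime(host: str, hours: float, author: str, comment: str) -> None:
    start = int(time.time())
    end = start + int(hours * 3600)
    # fixed=1 means the downtime runs exactly from start to end;
    # trigger_id and duration are 0 because this is not flexible downtime.
    line = (f"[{start}] SCHEDULE_HOST_DOWNTIME;{host};{start};{end};"
            f"1;0;0;{author};{comment}\n")
    with open(CMD_FILE, "w") as cmd:
        cmd.write(line)


if __name__ == "__main__":
    for host in ("compute-001", "compute-002"):  # hypothetical host list
        schedule_downtime(host, hours=8,
                          author="admin", comment="quarterly maintenance")
```

You would want SCHEDULE_HOST_SVC_DOWNTIME alongside it to quiet the service checks too, but you get the idea.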

Why disk utilization matters

Here’s a rare weekend post to help make up for my lack of blogging this week.  Once again it is work related.  My life is boring and uneventful otherwise. 🙂

Unless you plan on sitting around babysitting your servers every minute of every day, it is probably a good idea to have a monitoring system like Nagios set up.  My department, eternal mooches that we are, opted to not set one up and instead use the service provided by the college-level IT staff.  It worked great, until one day when it didn’t any more.  Some config change hosed the system and the Nagios service no longer ran.  I didn’t consider it much of a big deal until about 7 days ago.

This time last week, I was enjoying a vacation with my beautiful wife in celebration of our 2nd wedding anniversary.  When I got home Sunday evening, I noticed that several people had sent in e-mails complaining that they couldn’t log in to their Linux machines.  Like a fool, I spent the last few hours of my freedom trying to resolve the issue.  We figured out it was a problem with the LDAP server.  Requests went out, but no answers were ever received.  So after a too-long e-mail exchange, we got a workaround set up and I called it good enough.  I went to bed at one o’clock, thoroughly exhausted.

The next day we started working on figuring out what the problem was.  At first it seemed like the issue was entirely with the LDAP server, which is run by the central computing group on campus.  I was pleased that it was not one of my systems.  Then they noticed that there were a lot of open connections from two of my servers: one was our weather data website, and the other was our weather data ingest server.  Both machines work pretty hard, and at first I thought maybe one of the image generation processes just choked and that tripped up everything else.

Further investigation showed that the root cause of the issue was probably that the data partition on the ingest server was full.  This caused the LDM processes to freak out, which resulted in a lot more error messages in the log, which then filled up /var.  Now the system was running so slowly that nothing was behaving right, and since the web server is tightly married to the data server, they both ended up going crazy and murdering the LDAP server.

Now there are scripts that are supposed to run to scour the data server to keep the disks from filling.  I thought perhaps something had kept them from running.  I looked through logs, through cron e-mails, and then ran some find commands by hand.  Everything suggested that the scouring was working as it should.  The more I looked, the more I realized it’s just that the radar data is ever-growing.  I just need to add more disk.
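For what it’s worth, even a dirt-simple disk check run from cron would have caught this weeks earlier. Here is a sketch; the mount points and thresholds are made up, and the real answer is still a working Nagios instance running something like the stock check_disk plugin:

```python
#!/usr/bin/env python3
"""Sketch: warn when watched partitions cross a usage threshold.
Mount points and thresholds are examples only."""
import shutil
import sys

# Hypothetical mounts and warning thresholds for the ingest server.
WATCHED = {"/data": 85.0, "/var": 80.0}


def main() -> int:
    worst = 0
    for mount, warn_pct in WATCHED.items():
        usage = shutil.disk_usage(mount)
        used_pct = 100.0 * usage.used / usage.total
        if used_pct >= warn_pct:
            print(f"WARNING - {mount} is {used_pct:.1f}% full")
            worst = 1
        else:
            print(f"OK - {mount} is {used_pct:.1f}% full")
    return worst


if __name__ == "__main__":
    sys.exit(main())
```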

Had I been keeping an eye on the disk usage these past few months, I would have known this sooner, and been able to take care of it before critical services got beaten up.  I think on Monday, I’ll lend a hand getting the Nagios server up and running again.  Learn from my mistakes, readers!