Why disk utilization matters

Here’s a rare weekend post to help make up for my lack of blogging this week.  Once again it is work related.  My life is boring and uneventful otherwise. 🙂

Unless you plan on sitting around babysitting your servers every minute of every day, it is probably a good idea to have a monitoring system like Nagios set up.  My department, eternal mooches that we are, opted not to set one up and instead use the service provided by the college-level IT staff.  It worked great, until one day when it didn’t any more.  Some config change hosed the system and the Nagios service no longer ran.  I didn’t consider it much of a big deal until about 7 days ago.

This time last week, I was enjoying a vacation with my beautiful wife in celebration of our 2nd wedding anniversary.  When I got home Sunday evening, I noticed that several people had sent in e-mails complaining that they couldn’t log in to their Linux machines.  Like a fool, I spent the last few hours of my freedom trying to resolve the issue.  We figured out it was a problem with the LDAP server.  Requests went out, but no answers were ever received.  So after a too-long e-mail exchange, we got a workaround set up and I called it good enough.  I went to bed at one o’clock, thoroughly exhausted.

The next day we started working on figuring out what the problem was.  At first it seemed like the issue was entirely with the LDAP server, which is run by the central computing group on campus.  I was pleased that it was not one of my systems.  Then they noticed that there were a lot of open connections from two of my servers: one was our weather data website, and the other was our weather data ingest server.  Both machines work pretty hard, and at first I thought maybe one of the image generation processes just choked and that tripped up everything else.

Further investigation showed that the root cause of the issue was probably that the data partition on the ingest server was full.  This caused the LDM processes to freak out, which resulted in a lot more error messages in the log, which then filled up /var.  Now the system was running so slowly that nothing was behaving right, and since the web server is tightly married to the data server, they both ended up going crazy and murdering the LDAP server.

Now, there are scripts that are supposed to scour the data server to keep the disks from filling.  I thought perhaps something had kept them from running.  I looked through logs, through cron e-mails, and then ran some find commands by hand.  Everything suggested that the scouring was working as it should.  The more I looked, the more I realized it’s just that the radar data is ever-growing.  I just need to add more disk.

Had I been keeping an eye on the disk usage these past few months, I would have known this sooner, and been able to take care of it before critical services got beaten up.  I think on Monday, I’ll lend a hand getting the Nagios server up and running again.  Learn from my mistakes, readers!
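
For anyone who wants a stopgap while the real monitoring is down, here is a minimal sketch of a disk usage check in Python.  It is not what the college's Nagios setup runs; it just follows the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).  The 80% and 90% thresholds and the function names are assumptions picked for illustration, and the percentage may differ slightly from what df reports.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # some pseudo-filesystems report zero size
        return 0.0
    return 100.0 * usage.used / usage.total


def classify(pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status (thresholds are illustrative)."""
    if pct >= crit:
        return CRITICAL
    if pct >= warn:
        return WARNING
    return OK


def check(paths, warn=80.0, crit=90.0):
    """Check every path; return the worst status and one report line per path."""
    worst = OK
    lines = []
    for path in paths:
        pct = percent_used(path)
        status = classify(pct, warn, crit)
        worst = max(worst, status)
        lines.append(f"DISK {LABELS[status]} - {path} is {pct:.1f}% full")
    return worst, lines


def main(paths):
    """Print the report and return the exit code a Nagios plugin would use."""
    status, lines = check(paths or ["/"])
    for line in lines:
        print(line)
    return status
```

To use it from cron or as a plugin, call something like `sys.exit(main(sys.argv[1:]))` at the bottom of the script and pass the mount points you care about, for example /var and wherever your data partition lives.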