I like to consider myself an enlightened sysadmin. I know I’m supposed to think outside the box from 30,000 feet. Still, every so often, my blinders come back on and I tend to be a bit myopic. This is most common when looking at log files in search of clues to an unknown problem. A recent example was when I was trying to figure out why the Condor startd wasn’t running on a CentOS VM I had set up.
Since I didn’t want hundreds of ‘localhost.localdomain’s in the pool, I needed a way to give each VM a unique-ish and relevant name. The easiest way seemed to be to query a web server and use the output of a CGI script to set the host name. Sounds simple enough, but after I put that in place, the startd would immediately segfault.
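The idea was roughly the following. This is a minimal sketch, not the script I actually ran; the URL, the naming scheme, and the plain-text response format are all assumptions for illustration:

    #!/usr/bin/env python3
    # Sketch: fetch a unique-ish name from a CGI script on a web
    # server and set it as this VM's host name at boot.
    import subprocess
    import urllib.request

    # Hypothetical endpoint; assumed to return a bare host name
    # as plain text.
    NAME_URL = "http://example.com/cgi-bin/vmname"

    with urllib.request.urlopen(NAME_URL, timeout=10) as resp:
        name = resp.read().decode("ascii").strip()

    # Set the kernel host name (requires root). Persisting it
    # across reboots on CentOS of that era would also mean
    # updating HOSTNAME in /etc/sysconfig/network.
    subprocess.run(["hostname", name], check=True)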
I had no idea why, so I started peeking through the startd’s log file. Lots of information there, but nothing particularly helpful. After several hours of cross-eyed log reading and fruitless Googling, I thought I’d give strace a try. I don’t know much about system-level programming, but I thought the trace output might provide a clue. Alas, it was not to be.
Eventually, I remembered that there’s a master log for Condor as well, and I decided to look in there. Well, actually, I had looked in there earlier in the day and hadn’t seen anything that I thought was helpful. This time I took a closer look and realized that the daemon couldn’t resolve its own host name, and that’s why the startd was failing.
A few minutes later, I had changed the network setup to add the hostname to /etc/hosts so that Condor could resolve its host name. A whole day’s worth of effort because I got too focused on the wrong log file.
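For reference, the fix amounted to an /etc/hosts entry along these lines (the IP and names here are stand-ins, not the actual values from my VM):

    127.0.0.1    localhost localhost.localdomain
    10.0.0.15    vm-worker-01.example.com vm-worker-01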