Log file myopia

I like to consider myself an enlightened sysadmin. I know I’m supposed to think outside the box from 30,000 feet. Still, every so often, my blinders come back on and I tend to be a bit myopic. This is most common when looking at log files in search of clues to an unknown problem.  A recent example was when I was trying to figure out why the Condor startd wasn’t running on a CentOS VM I had set up.

Since I didn’t want to have hundreds of ‘localhost.localdomain’s in the pool, I needed a way to give each VM a unique-ish and relevant name.  The easiest way seemed to be to check against a web server and use the output of a CGI script to set the host name.  Sounds simple enough, but after I put that in place the startd would immediately segfault.

I had no idea why, so I started peeking through the log file for the startd.  Lots of information there, but nothing particularly helpful.  After several hours of cross-eyed log reading and fruitless Googling, I thought I’d give strace a try.  I don’t know much about system-level programming, but I thought something might provide a clue.  Alas, it was not to be.

Eventually, I remembered that there’s a master log for Condor as well, and I decided to look in there.  Well, actually, I had looked in there earlier in the day and hadn’t seen anything that I thought was helpful.  This time I took a closer look and realized that it couldn’t resolve its host name and that’s why it was failing.

A few minutes later and I had changed the network setup to add the hostname to /etc/hosts so that Condor could resolve it’s host name.  A whole day’s worth of effort because I got too focused in on the wrong log file.

Ugly shell commands

Log files can be incredibly helpful, but they can also be really ugly.  Pulling information out programmatically can be a real hassle.  When a program exists to extract useful information (see: logwatch), it’s cause for celebration.  The following is what can happen when a program doesn’t exist (and yes, this code actually worked).

The scenario here is that a user complained that Condor jobs were failing at a higher-than-normal rate.  Our suspicion, based on a quick look at his log files, is that a few nodes are eating most of his jobs.  But how to tell?  I’ll want to create a spreadsheet that has the job ID, the date, the time, and the last execute host for all of the failed jobs.  I could either task a student to manually pull this information out of the log files, or I can pull it out with some shell magic.

The first step was to get the job ID, the date, and the time from the user’s log files:

grep -B 1 Abnormal ~user/condor/t?/log100 | grep "Job terminated" | awk '{print $2 "," $3 "," $4 }' | sed "s/[\(|\)]//g" | sort -n > failedjobs.csv

What this does is to search the multiple log files for the word “Abnormal”, with one line printed before each match because that’s where the information we want is.  To pull that line out, we search for “Job terminated” and then pull out the second, third, and fourth fields, stripping the parentheses off of the job ID, sorting, and then writing to the file failedjobs.csv.

The next step is to get the last execute node of the failed jobs from the system logs:

for x in `cat failedjobs.csv | awk -F, '{print $1}'`; do
host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`
echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv;

Wow.  This loop pulls the first field out of the CSV we made in the first step.  The IP address for each failed job is pulled from the Event Logs by searching for the “Job executing” string.  Since a job may execute on several different hosts in its lifetime, we want to only look at the last one (hence the tail command), and we pull out the contents of the angle brackets left of the colon.  This is the IP address of the execute host.

With that information, we use the host command to look up the hostname that corresponds to that IP address and write it to a file.  Now all that remains is to combine the two files and try to find something useful in the data.  And maybe to write a script to do this, so that it will be a little easier the next time around.