Log files can be incredibly helpful, but they can also be really ugly. Pulling information out programmatically can be a real hassle. When a program exists to extract useful information (see: logwatch), it’s cause for celebration. The following is what can happen when a program doesn’t exist (and yes, this code actually worked).
The scenario here is that a user complained that his Condor jobs were failing at a higher-than-normal rate. Our suspicion, based on a quick look at his log files, was that a few nodes were eating most of his jobs. But how to tell? I wanted to create a spreadsheet with the job ID, the date, the time, and the last execute host for all of the failed jobs. I could either task a student with manually pulling this information out of the log files, or I could pull it out with some shell magic.
The first step was to get the job ID, the date, and the time from the user’s log files:
grep -B 1 Abnormal ~user/condor/t?/log100 | grep "Job terminated" | awk '{print $2 "," $3 "," $4}' | sed 's/[()]//g' | sort -n > failedjobs.csv
What this does is search the user’s log files for the word “Abnormal”, printing one line of leading context before each match, because that preceding line is where the information we want lives. From that output we keep only the “Job terminated” lines, pull out the second, third, and fourth fields (the job ID, the date, and the time), strip the parentheses off of the job ID, sort numerically, and write the result to failedjobs.csv.
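Here is the same pipeline spread across lines, one stage per line with comments, in case the one-liner is hard to follow (the paths are the same ones used above and will differ on another system):

grep -B 1 Abnormal ~user/condor/t?/log100 |   # "Abnormal" events plus the line before each
    grep "Job terminated" |                   # keep just the event line we care about
    awk '{print $2 "," $3 "," $4}' |          # job ID, date, and time as CSV fields
    sed 's/[()]//g' |                         # strip the parentheses around the job ID
    sort -n > failedjobs.csv                  # sort numerically and save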
The next step is to get the last execute node of the failed jobs from the system logs:
for x in `awk -F, '{print $1}' failedjobs.csv`; do host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`; echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv; done
Wow. This loop pulls the first field, the job ID, out of the CSV we made in the first step. For each failed job, it searches the Event Logs for that job’s “Job executing” entries. Since a job may execute on several different hosts in its lifetime, we only want the last such entry (hence the tail command), and from it we pull out the contents of the angle brackets to the left of the colon. That is the IP address of the execute host.
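Spread out with comments, the loop looks like this (same commands, just easier to read; the EventLog path, again, is specific to this system):

for x in `awk -F, '{print $1}' failedjobs.csv`; do    # x is a failed job ID
    # last "Job executing" event for this job; keep the IP inside the angle brackets
    host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`
    # reverse-resolve the IP and append the hostname to a second file
    echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv
done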
With that information, we use the host command to look up the hostname that corresponds to each IP address and append it to a second file, failedjobs-2.csv. Now all that remains is to combine the two files and try to find something useful in the data. And maybe to write a script that does all of this, so that it will be a little easier the next time around.
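One way to do the combining would be paste, since the two files line up row for row (the second was built by walking through the first). Something like this should work; the output filename here is just made up:

# Glue the two files together column-wise: job ID, date, and time from the
# first file, hostname from the second. This assumes both files are in the
# same order, which they are, since the second was generated from the first.
paste -d, failedjobs.csv failedjobs-2.csv > failedjobs-full.csv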