The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.
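If you want a quick way to spot an affected node, something along these lines works (a sketch based on the directory list and bad mode described above):

# Print any of the suspect directories whose mode is 700; no output means the node looks healthy.
stat -c '%a %n' /bin /lib /dev /etc /proc /sys | awk '$1 == 700'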

What we found was that / had a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be

TMPDIR=""
    for t in /tmp /var/tmp /root ${PWD}; do
        if [ ! -d $t ]; then continue; fi
        if ! access -w $t ; then continue; fi

        fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
        if [ "$fs" != "tmpfs" ]; then
            TMPDIR=$t
            break
        fi
    done

which creates a working directory in /tmp under normal conditions. On the affected machines, however, something caused / to be used instead of /tmp. Later in the script, several directories are created in $TMPDIR, and those correspond to the wrongly-permissioned directories. There’s no clear indication of why this happens, and cleaning up and reinstalling the updated kernel package doesn’t necessarily reproduce it. After some soul-searching, we decided that it was more important to return the nodes to service than to track down an easily-correctable-but-difficult-to-solve problem. We’ll see if it happens again with the next kernel upgrade.
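In the meantime, putting an affected node back into service was just a matter of restoring the permissions. The modes below are my assumption of the usual defaults; check them against a healthy machine before trusting this sketch:

chmod 755 /bin /lib /dev /etc    # normal world-readable system directories
chmod 555 /proc /sys             # kernel pseudo-filesystems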

How not to get hired

With over 3000 machines in five different buildings on campus, we rely heavily on student labor to keep everything up and running. Unfortunately, undergrads tend to do things like graduate, which means we’re hiring almost every semester. Recently, we decided to hire six new students, since much of our staff graduates soon.

We received about 12 resumes, and since none of them looked particularly terrible, we brought them all in for half-hour interviews. I present here some lessons on how not to get hired.

  • Show up 20 minutes late — Trust me, we don’t have work we need to be doing. I mean, it’s not like you knew you had a final right before the interview. Being late is so much better than asking in advance for a different interview time.
  • Don’t show up at all — This is even better. If you’ve got a five-hour drive that includes passing through Chicago, there’s no chance that anything will happen to delay you. An e-mail later that night totally makes things okay. Once again, don’t even think about asking for an interview time that you can actually make.
  • Have no knowledge of computer hardware — It’s not like the job description says anything about working with hardware. Don’t be able to reconnect desktop components. Don’t be able to work your way through a troubleshooting exercise. That stuff is pointless.
  • Bullshit me — Watching your brother put together a computer is the same as knowing hardware. Having used a Linux computer in your programming class is the same as knowing Linux. I won’t be able to tell.

In all seriousness, it does strike me how seemingly rare hardware experience is among college students these days. Have computers become so cheap and plentiful that hardware skills aren’t necessary to become a computer nerd? Fortunately, we’ve always been able to find enough quality students. Some of them even go on to get job offers for way more than I make.

How I scheduled a meeting

Part of my responsibilities at work includes wrangling our platoon of students. With most of them graduating at the end of this semester, I’ve preemptively hired many more to begin absorbing the knowledge necessary to keep a high performance computing shop running. The problem with students, though, is that they have classes to attend, which can make scheduling a bit of a bear. It gets worse as the number of students goes up. Right now, I’ve got 14 separate schedules to balance.

I initially had them all register their availability using the free site whenisgood.net, but there were no times that worked for the whole group. Manually finding a pair of times that would get everyone to at least one meeting was challenging, but then I realized I could script it pretty easily. The hard part was turning each block on the calendar into either a 1 (available) or a 0 (not available). Then it was simply a matter of trying every combination and rejecting the ones that didn’t get everyone to at least one meeting.

The code below was saved as student_meeting.pl and invoked with a set of nested for loops like so:

for x in `seq 0 79`; do for y in `seq 0 79`; do perl student_meeting.pl $x $y 2>/dev/null; done; done

You may notice that each workable pair of meeting times gets printed twice. For example, 27 and 71 work, so 71 and 27 work as well and both get printed. The numbers 0-79 represent the 80 half-hour time blocks from 9 AM to 5 PM, Monday through Friday. The availability should be encoded similarly for each person inside the script; I include just mine in the example code so that you can see what it looks like. As it currently stands, the code is horrendous and not very robust. If there’s interest, I can clean it up some and put it on GitHub. I’m not sure anyone else would care, but it might be a useful little project for someone.

#!/usr/bin/perl
use strict;
use warnings;

# Each person's availability is a list of 80 half-hour blocks
# (9 AM-5 PM, Monday through Friday): 1 = available, 0 = not available.
my %availabilities = (
    'bcotton' => [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                  1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,0,
                  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
                  1,1,1,1,1,1,0,0,0,1,0,0,0,0,1,0,
                  1,1,1,1,1,1,1,1,1,0,0,1,1,1,1,1],
    # More people would also be here
);

my @accounts = keys(%availabilities);

# The two candidate meeting blocks come in on the command line.
my $meeting1 = $ARGV[0];
my $meeting2 = $ARGV[1];
my $meeting1attendees = '';
my $meeting2attendees = '';
my %goodmeetings;

foreach my $person ( @accounts ) {
    my $goodtimes = 0;
    if ( $availabilities{$person}[$meeting1] == 1 ) {
        $goodtimes++;
        $meeting1attendees .= " $person";
    }
    if ( $availabilities{$person}[$meeting2] == 1 ) {
        $goodtimes++;
        $meeting2attendees .= " $person";
    }
    # If anyone can't make either time, this pair of meetings is no good: bail out.
    unless ( $goodtimes > 0 ) { die; }
    # Track how many of the two meetings this person can attend.
    $goodmeetings{$person} = $goodtimes;
}

print "###\nMeeting times $meeting1 $meeting2\n";
print "Meeting 1 $meeting1attendees\nMeeting 2 $meeting2attendees\n";

LISA ’10 Interview: Tom Limoncelli

This post was originally posted to the Usenix blog.

Anyone who has attended LISA in the past few years is undoubtedly familiar with Tom Limoncelli.  Tom’s not just a LISA fixture; he’s also a widely-respected author of two books (Time Management for System Administrators and The Practice of System and Network Administration) and a contributor to the Everything Sysadmin blog.  Over the weekend, he sat down with me for a few minutes to share his thoughts about LISA ’10.

Ben Cotton: You are, quite truly, an expert on everything sysadmin.  How did you reach that status?

Tom Limoncelli:  I’m honored by the question but the name “EverythingSysadmin.com” comes from my co-author (Christine Hogan) and I trying to come up with a domain name that was related to our book, but wasn’t really long.  Since the book tried to touch on a little of everything, we came up with EverythingSysadmin.com.

BC:  So would you consider yourself a generalist or do you have a few fields that you feel you’re truly an expert in?

TL:  I do consider myself a generalist.  I think that’s because when I got started in system administration you had to be.  Now things are different.  Now people tend to specialize in storage, backups, networking, particular operating systems, and so on.  Remember that The Practice of System and Network Administration has three authors; we only know “everything” when all three of us put our brains together.  I guess you’d have to say that my specialty is in always knowing someone that can find an answer for me.

BC:  That’s an excellent lesson.  You’re scheduled to conduct several training sessions on time management during LISA ’10.  What would you say is the biggest lesson to be learned from them?

TL: The biggest lesson is that humans are bad at time management, and that’s OK.  The great thing about being human is that we can build tools that let us overcome our problems.  The class that I teach has very little theory. It’s mostly a list of techniques people can use to solve specific problems. Use the ones you like, ignore the rest.  The one that most people end up using is finding a good way to manage their to-do list.

BC: If someone’s taken your time management training before, what do you have new for them this year?

TL: I have an entirely new class this year.  It’s a “part 2” kind of thing, though you don’t have to have taken part 1 to take it.  In the morning I’ll be teaching “Time Management for System Administrators” which is basically the same half-day class I usually teach.  The afternoon, however, is all new.  It is “Time Management: Team Efficiency”.

The thing about teams is that there are certain things you do that waste time for everyone else.  You might not even realize it.  In this class, I’m going to cover a number of techniques for eliminating those things.  You save time for others, they save time for you.  It’s like “time management karma”.  What goes around comes around.  For example, meetings are often a terrible waste of time.  I’ll talk about some red flags to help you figure out which meetings to skip, and if you run meetings you can figure out if you are creating these red flags.  If you can’t fix a badly run meeting, I have some tips on how to negotiate so that you don’t have to attend. For example, why send your entire team to someone else’s boring meeting?  Send one person to take notes and report back to your team.  If you can’t get out of a meeting, I have techniques for minimizing the time you spend there. For example, when you enter the room tell the facilitator, “I have a conflict for the second half of the meeting.  Can my agenda items be first on the list?” After your item is covered, stand up and leave.  It isn’t unethical or dishonest: the “conflict” you had was your urgent need to escape badly run meetings.

BC: You’ve been a regular fixture at  LISA.  What keeps you coming back?

TL: LISA is like a telescope that lets me see into the future.  Every year there are presentations that describe things that the majority of system administrators won’t be exposed to for 2-3 years.  When I come back to work I have more of a “big picture” than my coworkers who didn’t attend.  For example, it was at LISA that I first heard of CFEngine, Puppet, and other “Configuration Management” (CM) tools.  Lately people talk about CM as if it were new.  It’s certainly much more popular now, but people who have been attending LISA conferences have been benefitting from CM tools for more than a decade.

90% of what is interesting in system administration relates to scaling: more machines, more RAM, more storage, more speed, more web hits.  Many years ago there was a presentation by a web site that was managing 1 million web hits per day.  At the time this was a huge achievement.  People who saw that presentation were in a great position a few years later, when all the big sites scaled to be that big.

BC: What are the big scaling challenges?

TL: Everything we used to know is about to change because of SSD.  Everything I know about designing and scaling systems is based on the fact that CPU caches are about 10x faster than RAM, which is 10x faster than disk, which is about 10x faster than networks.  Over the years this has been basically true: Even as RAM got faster, so did disk.  SSD is about to change that.  The price curve of SSD makes it pretty easy to predict that we’re not going to be using spinning magnetic disks to store data soon.  All the old assumptions are going away.  At the same time, CPUs with 16+ and soon 100+ cores make other assumptions change.  Things get worse in some ways.  These are the hot topics that you hear about at a conference like LISA.

Just the other day a very smart coworker said something to me that implied that with the new generation of 100+ core machines we could “just run more processes” and not have to change the way we design things.  I was floored.  That’s like saying, “Basketball players seem to be able to jump higher every year.  Why can’t we jump to the moon?”

BC: As an avid basketball fan, I find that idea intriguing.   It’s obvious attending LISA can be very beneficial. As an experienced attendee, what advice do you have for people who may be going to their first LISA conference?

TL: First: Talk to random people.  When you are on line, introduce yourself to the people next to you.  A big chunk of the learning opportunity is from talking with fellow attendees.  Sysadmins are often introverts, so it is a bit difficult.  Someone once told me that it’s always ok to start a conversation with a stranger by sticking out your hand and saying, “Hi!  My name is Joe.” (if your name is Joe).  Unlike some conferences where the speakers are corralled into a “green room” and never talk with attendees, at Usenix conferences you can talk to anyone.  At my first Usenix experiences I met Dennis Ritchie, one of the inventors of Unix.

Second: plan your days.  There are activities from 9am until midnight every day.  Read the schedule beforehand and make a grid of what you want to attend.  Saturday night there is a session for “first timers,” which is a great way to get an overview of the conference.  During the day there are usually 3-4 things going on at any time.  At night there is an entire schedule of community-driven events.  You don’t want to be picking what to do next at the end of each session.  Also, plan some down-time.  Take breaks. Get plenty of fluids.  It is a full week.

BC: Any other thoughts you’d like to share?

TL: There’s also a lot of great security talks, and an entire track of Q&A sessions with experts answering questions about everything from storage to disaster recovery to consulting.  The last thing I’d like to say is, “see you there!”

Registration for LISA ’10 is still open at http://www.usenix.org/events/lisa10/.  You can find Tom’s training courses on the training page.  He’ll also be presenting two technical sessions.

Log file myopia

I like to consider myself an enlightened sysadmin. I know I’m supposed to think outside the box from 30,000 feet. Still, every so often, my blinders come back on and I tend to be a bit myopic. This is most common when looking at log files in search of clues to an unknown problem.  A recent example was when I was trying to figure out why the Condor startd wasn’t running on a CentOS VM I had set up.

Since I didn’t want to have hundreds of ‘localhost.localdomain’s in the pool, I needed a way to give each VM a unique-ish and relevant name.  The easiest way seemed to be to check against a web server and use the output of a CGI script to set the host name.  Sounds simple enough, but after I put that in place the startd would immediately segfault.
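The setup looked something like the sketch below; the URL and script name are placeholders, not what we actually used:

# Ask a CGI script on the web server for a unique-ish name, then use it as the host name.
NEWNAME=$(curl -s http://webserver.example.com/cgi-bin/vmname.cgi)
hostname "$NEWNAME"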

I had no idea why, so I started peeking through the log file for the startd.  Lots of information there, but nothing particularly helpful.  After several hours of cross-eyed log reading and fruitless Googling, I thought I’d give strace a try.  I don’t know much about system-level programming, but I thought something might provide a clue.  Alas, it was not to be.

Eventually, I remembered that there’s a master log for Condor as well, and I decided to look in there.  Well, actually, I had looked in there earlier in the day and hadn’t seen anything that I thought was helpful.  This time I took a closer look and realized that it couldn’t resolve its host name and that’s why it was failing.

A few minutes later, I had changed the network setup to add the hostname to /etc/hosts so that Condor could resolve its host name.  A whole day’s worth of effort, all because I got too focused on the wrong log file.
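The fix itself was roughly the following; the address and names are placeholders for the VM's real ones:

# Make the host name resolvable locally so the Condor daemons can look themselves up.
echo "192.168.122.15  vm-node01.example.com vm-node01" >> /etc/hosts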

It’s beginning to look a lot like LISA

We’re just over two months from the Large Installation System Administration (LISA) conference, and the website has recently been updated with details. I’ve never been to this conference before, but as a member of the official blog team, I’ll get to spend the week doing nothing but participating in, and writing about, LISA ’10. Can I write two blog posts and countless tweets every day? It will be a challenge, and I’m sure I’ll be tired of writing by the end, but there should be plenty to write about.

With three days of workshops, 48 training courses, and three days of technical sessions,  there’s plenty to choose from.  I’m especially interested in the talk “Measuring the Value of System Administration” scheduled for Thursday morning.  Of course, each evening there will be Birds of a Feather (BoF) sessions, which I’m told are the most valuable part of the whole LISA experience.  BoFs are an informal meeting of the minds, where admins who do similar work compare notes and pick up new ideas to bring home.  And drink beer.  I’m okay with that.  The BoF schedule is still pretty thin, but no doubt it will fill out as November approaches.

If you’re interested in attending LISA, you can register online at http://www.usenix.org/events/lisa10/registration/.  Registration is available in half-day increments, so you can pay for exactly the amount of conference you want, and if you register by October 18, you get the “early bird discount.”  I hope to see you all in San Jose!

There are two kinds of sysadmins in the world

I mentioned recently that in my experience there are two breeds of sysadmins: the long-hair and the short-hair.  I think we all can picture the long-hair breed.  They’re the stereotypical representation of sysadmins in the media: long hair (duh!), often bearded, generally overweight, sloppily dressed, anti-social, and addicted to caffeine.  Think Comic Book Guy from “The Simpsons”.  The lesser-known breed is the short-hair sysadmin.  The short-hair has short hair (duh again!) and usually no facial hair, dresses professionally, and often has military experience.

Although it might seem like these two breeds are polar opposites, they do have some traits in common.  Because they are still sysadmins, both breeds tend to see themselves as the rulers of their domains (interestingly, the short-hairs tend to be more flexible and accommodating to end-users).   Security incidents are seen as an unforgivable personal insult, so paranoia is a desirable trait. Though short-hairs are more likely to have a social life, both breeds are quite geeky and prone to obsess over technical details.

Now, I don’t claim to be an expert on the subject, and these are my own personal observations.  Nonetheless, I can’t think of any sysadmins I’ve come across who don’t fit generally into one of the two breeds.  Not everyone fits precisely, but close enough that there’s no question which breed he or she is.  What’s interesting is that these two breeds don’t seem to clash professionally, perhaps because the easiest way to earn a sysadmin’s respect is to have unquestionable technical skill.

Ugly shell commands

Log files can be incredibly helpful, but they can also be really ugly.  Pulling information out programmatically can be a real hassle.  When a program exists to extract useful information (see: logwatch), it’s cause for celebration.  The following is what can happen when a program doesn’t exist (and yes, this code actually worked).

The scenario here is that a user complained that his Condor jobs were failing at a higher-than-normal rate.  Our suspicion, based on a quick look at his log files, was that a few nodes were eating most of his jobs.  But how to tell?  I wanted to create a spreadsheet with the job ID, the date, the time, and the last execute host for all of the failed jobs.  I could either task a student with manually pulling this information out of the log files, or I could pull it out with some shell magic.

The first step was to get the job ID, the date, and the time from the user’s log files:

grep -B 1 Abnormal ~user/condor/t?/log100 | grep "Job terminated" | awk '{print $2 "," $3 "," $4 }' | sed "s/[\(|\)]//g" | sort -n > failedjobs.csv

What this does is search the log files for the word “Abnormal”, printing the line before each match because that’s where the information we want is.  To pull out what we need, we grep for “Job terminated” and then take the second, third, and fourth fields, stripping the parentheses off of the job ID, sorting numerically, and writing the result to failedjobs.csv.

The next step is to get the last execute node of the failed jobs from the system logs:

for x in `cat failedjobs.csv | awk -F, '{print $1}'`; do
    host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`
    echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv;
done

Wow.  This loop pulls the first field out of the CSV we made in the first step.  The IP address for each failed job is pulled from the Event Logs by searching for the “Job executing” string.  Since a job may execute on several different hosts in its lifetime, we want to only look at the last one (hence the tail command), and we pull out the contents of the angle brackets left of the colon.  This is the IP address of the execute host.

With that information, we use the host command to look up the hostname that corresponds to that IP address and write it to a file.  Now all that remains is to combine the two files and try to find something useful in the data.  And maybe to write a script to do this, so that it will be a little easier the next time around.
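For what it's worth, the combining step can be as simple as a paste, assuming the two files still line up row for row (they should, since the second was generated from the first in order):

# Glue the job ID/date/time columns to the hostname column.
paste -d, failedjobs.csv failedjobs-2.csv > failedjobs-combined.csv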

Why it’s not always done the right way: difficulties with preempting Condor jobs when the disk is nearly full

In the IT field, there’s a concept called “best practice,” which is the recommended policy, method, etc., for a particular setting or action.  In a perfect world, every system would conform to the accepted best practices in every respect.  Reality isn’t always perfect, though, and there are often times when a sysadmin has to fall somewhere short of that goal.  Some Internet Tough Guys will insist that their systems are rock-solid and superbly secured.  That’s crap; we all have to cut corners.  Sometimes it’s acceptable, sometimes it’s a BadThing™.  This is the story of one of the (hopefully) acceptable times.


Solving the CUPS “hpcups failed” error

I thought when I took my new job that my days of dealing with printer headaches were over.  Alas, it was not to be.  A few weeks ago, I needed to print out a form for work.  I tried to print to the shared laser printer down the hall.  Nothing.  So I tried the color printer. Nothing again.  I was perplexed because both printers had worked previously, so being a moderately competent sysadmin, I looked in the CUPS logs.  I saw a line in error_log that read printer-state-message="/usr/lib/cups/filter/hpcups failed". That seemed like it was the problem, so I tried to find a solution and couldn’t come up with anything immediately.

Since a quick fix didn’t seem to be on the horizon, I decided that I had better things to do with my time and just used my laptop to print.  That worked, so I forgot about the printing issue.  Shortly thereafter, the group that maintains the printers added the ones on our floor to their CUPS server.  I stopped CUPS on my desktop, switched to their server, and printing worked again, so I had even less incentive to track down the problem.

Fast forward to yesterday afternoon, when my wife tried to print a handbill for an event she is organizing in a few weeks.  Since my desktop at home is also an x86_64 Fedora 12 system, it didn’t surprise me too much when she told me she couldn’t print.  Sure enough, when I checked the logs, I saw the same error.  I tried all of the regular stalling tactics: restarting CUPS, power cycling the printer, removing the job and trying again.  Nothing worked.

The first site I found was an Ubuntu bug report which seemed to suggest maybe I should update the printer’s firmware.  That seemed like a really unappealing prospect to me, but as I scrolled down I saw comment #8.  This suggested that maybe I was looking in the wrong place for my answer.  A few lines above the hpcups line, there was an error message that read prnt/hpcups/HPCupsFilter.cpp 361: DEBUG: Bad PPD - hpPrinterLanguage not found.

A search for this brought me to a page about the latest version of hplip. Apparently, the new version required updated PPD files, which are the files that describe the printer to the print server.  In this case, updating the PPD file was simple, and didn’t involve having to find it on HP’s or a third-party website.  All I had to do was use the CUPS web interface and modify the printer, keeping everything the same except selecting the hpcups 3.10.2 driver instead of the 3.9.x that it had been using.  As soon as I made that change, printing worked exactly as expected.
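If you want to check which driver version a queue's PPD claims without clicking through the web interface, something like this works; the queue name is a placeholder, and I'm assuming the PPDs live in the usual /etc/cups/ppd location:

# The NickName line in an hpcups PPD usually includes the hpcups version.
grep -i nickname /etc/cups/ppd/YourPrinter.ppd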

The lesson here, besides the ever-present “printing is evil,” is that the error message you think is the clue might not always be.  When you get stuck trying to figure a problem out, look around for other clues.  Tunnel vision only works if you’re on the right track to begin with.