The tricky problem dilemma

A good sysadmin believes in treating the cause, not the symptom. Unfortunately, pragmatism sometimes gets in the way of that. A recent example: we just rolled out a kernel update to a few of our compute clusters. About 3% of the machines ended up in a troubled state. By troubled, I mean that the permissions on a few directories (/bin, /lib, /dev, /etc, /proc, and /sys) were set to 700, making the machine effectively unusable. For the most part, we didn’t notice this on the affected machines until after they did their post-upgrade reboot, but fortunately we were able to catch a few that hadn’t yet rebooted.

What we found was that / had a sysroot directory and an init file. These are created by the mkinitrd script, which is called by the new-kernel-pkg script, which is in turn called in the postinstall script of the kernel RPM. The relevant part of the mkinitrd script seems to be:

    for t in /tmp /var/tmp /root ${PWD}; do
        if [ ! -d $t ]; then continue; fi
        if ! access -w $t ; then continue; fi

        fs=$(df -T $t 2>/dev/null | awk '{line=$1;} END {printf $2;}')
        if [ "$fs" != "tmpfs" ]; then

which creates a working directory in /tmp under normal conditions. However, something caused / to be used instead of /tmp. The candidate list does end with ${PWD}, so a run from / with the first three candidates rejected would presumably land there, but there’s no clear indication of why that would happen. Later in the script, several directories are created in $TMPDIR, and these correspond to the wrongly-permissioned directories. If we clean up and reinstall the updated kernel package, the problem doesn’t necessarily repeat itself. After some soul-searching, we decided that it was more important to return the nodes to service than to track down an easy-to-correct-but-hard-to-diagnose problem. We’ll see if it happens again with the next kernel upgrade.
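For what it’s worth, the symptoms are easy to test for, which helps with catching nodes that haven’t rebooted yet. The following is only a sketch of such a check, not something taken from the incident: the function names and the root-prefix argument are made up for illustration, and it assumes GNU stat. The presence of sysroot and init is treated as a warning sign only because that’s what we saw on these particular machines.

```shell
#!/bin/sh
# Sketch only: function names and the root-prefix argument are hypothetical.
# Pass "/" on a real node, or a scratch directory when experimenting.

# Print each of the directories named in the post whose mode is exactly 700.
find_bad_dirs() {
    root=${1%/}
    for d in bin lib dev etc proc sys; do
        path="$root/$d"
        [ -d "$path" ] || continue
        # GNU stat: %a prints the permission bits in octal
        mode=$(stat -c '%a' "$path" 2>/dev/null) || continue
        if [ "$mode" = "700" ]; then
            echo "$path"
        fi
    done
}

# Print the stray initrd build artifacts (sysroot directory, init file)
# if they exist at the top of the given root.
find_initrd_debris() {
    root=${1%/}
    [ -d "$root/sysroot" ] && echo "$root/sysroot"
    [ -f "$root/init" ] && echo "$root/init"
    return 0
}
```

Running `find_bad_dirs /` and `find_initrd_debris /` on a node prints nothing when the node looks healthy, so any output is a reason to hold off on the reboot.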

Sometimes, Windows wins

It should be clear by now that I am an advocate of free software.  I’m not reflexively against closed software, though; sometimes it’s the right tool for the job.  Use of Windows is not a reason for mockery.  In fact, I’ve found one situation where I like the way Windows works better.

As part of our efforts to use Condor for power saving, I thought it would be a great idea if we could calculate the power savings based on the actual power usage of the machines.  The plan was to have Cycle Server aggregate the time in hibernate state for each model and then multiply that by the power draw for the model.  Since Condor doesn’t note the hardware model, I needed to write a STARTD_CRON module to determine this.  The only limitations I had were that I couldn’t depend on root/administrator privileges or on particular software packages being installed. (The execute nodes are in departments across campus and mostly not under my control.)
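The arithmetic behind that plan is simple enough to sketch. This is not the Cycle Server aggregation itself, just an illustration of the calculation with awk: the function name and the two tab-separated input files (hours hibernated per model, watts per model) are assumptions made up for the example.

```shell
#!/bin/sh
# Sketch only: power_saved_kwh and both file layouts are hypothetical.
#   HOURS_FILE: model<TAB>hours spent hibernating
#   WATTS_FILE: model<TAB>power draw in watts
# Usage: power_saved_kwh HOURS_FILE WATTS_FILE
power_saved_kwh() {
    awk -F '\t' '
        # First file on the command line (WATTS_FILE): remember watts per model
        NR == FNR { watts[$1] = $2; next }
        # Second file (HOURS_FILE): hours * watts / 1000 gives kWh;
        # models with no known power draw are skipped
        ($1 in watts) {
            kwh = $2 * watts[$1] / 1000
            total += kwh
            printf "%s\t%.1f kWh\n", $1, kwh
        }
        END { printf "total\t%.1f kWh\n", total }
    ' "$2" "$1"
}
```

Tab-separated input is used because model names tend to contain spaces.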

Despite the lack of useful tools like grep, sed, and awk (there are equivalents for some of the taken-for-granted GNU tools, but they frankly aren’t very good), the plugin for Windows was very easy to write.  The systeminfo command gives all kinds of useful, parseable information about the system’s hardware and OS.  The only difficult part was chopping the trailing blank spaces off the output.  I wanted to do this in Perl, but that’s not guaranteed to be installed on Windows machines, and I had some difficulty getting a standalone-compiled version working consistently.

On Linux, parsing the output is easy.  The hard part was getting the information at all.  dmidecode seems to be ubiquitous, but it requires root privileges to get any information.  I tried lshw, lshal, and the entire /proc tree.  /proc didn’t have the information I needed, and the two commands were not necessarily part of the “base” install.  The only solution seemed to be requiring an additional package (or bundling an lshw binary in our Condor distribution).
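One avenue not covered above: newer kernels expose some DMI fields under /sys/class/dmi/id, and product_name there is readable without root. Whether the kernels on our execute nodes provide it is an open assumption, so treat the following as a sketch rather than a solution. The function name and the HardwareModel attribute name are invented for the example; the output is a ClassAd-style `Attribute = "value"` line of the kind a STARTD_CRON module prints.

```shell
#!/bin/sh
# Sketch only: assumes the kernel exposes /sys/class/dmi/id/product_name.
# print_model_attr and the attribute name HardwareModel are hypothetical.
# An alternate file can be passed as $1 for experimenting.
print_model_attr() {
    dmi_file=${1:-/sys/class/dmi/id/product_name}
    # Fail quietly if the file is missing or unreadable
    [ -r "$dmi_file" ] || return 1
    # Take the first line, drop any double quotes, trim surrounding whitespace
    model=$(head -n 1 "$dmi_file" | tr -d '"' |
        sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
    [ -n "$model" ] || return 1
    printf 'HardwareModel = "%s"\n' "$model"
}
```

The function returns non-zero and prints nothing when the file is absent, so a machine without the sysfs entry simply advertises no model.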

Eventually, we decided that it was more effort than it was worth to come up with a reliable module.  While both platforms had problems, Linux was definitely the more difficult.  It’s a somewhat rare condition, but there are times when Windows wins.

Flavor of Love

One of the nice things about Linux is that there are so many different flavors to choose from.  Although you can customize it to meet your exact needs, there’s a good chance that someone has already made a flavor to suit your tastes.  Which flavor you choose is largely a matter of what you’re trying to do, and your favorite way to do it.  At my workplace, we’re a Red Hat shop.  I happen to be fond of the Red Hat products, so that works well for me.  However, I find myself facing a bit of a decision.

In 2003 or 2004, when my predecessor set up our Linux environment, he put Fedora Core 1 on the workstations and Red Hat Enterprise Linux 3 and 4 on the servers and the larger desktops (the Dell Precision line can be rather finicky).  I took my job in September 2006, with things largely unchanged.  Since I work at a university, making major changes during the school year is considered bad form, so I had to wait until summer 2007 to begin doing upgrades.  The downside is that FC1 went out of support in the late winter of 2007, but the good news is that I got nearly a full year to rebuild software packages and test configurations.  My fellow sysadmin and I, at the encouragement of my boss, decided to put RHEL4 on all of the machines to simplify support.

In the past year, RHEL has proven itself to be a very stable OS, and Red Hat has been quick to release security fixes.  However, there have been several occasions where an updated application has been needed, but it had dependencies that could not be met via up2date.  For example, the Java web plugin for the x64 architecture only works on Firefox 2+.  As of this writing, RHEL4 still uses an older Firefox (with security patches worked in by Red Hat).  That, at least, was a simple matter of grabbing the RPM.  Of course, now we’re responsible for making sure the subsequent updates get installed by hand.  Even worse is when a package needs a newer glibc than what is provided.  Here’s a hint, friends:  if it requires a newer glibc than your distribution provides, don’t bother!

Next summer, I plan to upgrade again.  But what do I put on the workstations?  RHEL is a solid platform, and works exceptionally well in a server environment.  If all you want to do at your desk is check e-mail, surf the web, and type up TPS reports, RHEL provides a good experience for that.  If you’re trying to run the latest version of your research applications, I’m not sold that it’s the best solution.  There are advantages and disadvantages to choosing RHEL vs Fedora for the desktop.

I run Fedora on my desktop/server at home, and it performs like a champ.  It’s not that Fedora crashes with any regularity, but it isn’t necessarily designed for stability.  RHEL is thoroughly tested, so you can pretty much be guaranteed that when a package gets upgraded, it won’t break things.  Fedora gets you newer packages much quicker, but there are no promises that foo-3.7 won’t break bar-4.2.  Fedora also has new releases more frequently than RHEL and a much shorter support life (roughly 13 months versus 5 years), which forces you to update more often.  Of course, if your software’s dependencies necessitate an upgrade regularly, that’s a moot point.

There’s also the issue of package security.  With RHEL, you’re getting your packages from Red Hat’s servers.  With Fedora, you’re generally getting your packages from mirrors.  Most of the time, you can consider that safe.  However, a story featured on Slashdot today shows that it’s not a guarantee.  Is that a reason to forsake Fedora?  Unless your machines contain hyper-sensitive information, the answer is no.

Actually, the second sentence in the previous paragraph isn’t necessarily true (apart from the fact that you can set up your own proxy for the RHN servers).  Beginning in RHEL 5, the up2date package manager is gone, in favor of yum.  Personally, I think yum is better than up2date (although Debian’s apt may be the best), but that wasn’t the reason Red Hat made the switch.  What yum gives you, though, is the ability to add custom repositories, which means you can get outside packages easily and keep them up to date without having to install the updates by hand every time.  It also means that you can set up your own repository for your local custom software.  You have no idea how excited I am about the idea of using RPMs in a yum repository to install software on our machines instead of using rdist.
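To give a sense of how little is involved, here is a rough sketch of defining a local repository for yum.  Nothing here is our actual setup: the function name, the repository id, and the URL in the usage note are placeholders.  The keys written to the file (name, baseurl, enabled, gpgcheck) are standard yum repository options.

```shell
#!/bin/sh
# Sketch only: write_repo_file and its arguments are hypothetical.
# Usage: write_repo_file DIR REPO_ID BASEURL
# On a real machine DIR would be /etc/yum.repos.d.
write_repo_file() {
    cat > "$1/$2.repo" <<EOF
[$2]
name=Local packages ($2)
baseurl=$3
enabled=1
gpgcheck=1
EOF
}
```

For example, `write_repo_file /etc/yum.repos.d local-custom http://repo.example.edu/el5/` would point yum at a placeholder server.  The directory behind that URL needs repository metadata, which the createrepo tool generates, and with gpgcheck=1 the packages must be signed with a key that has been imported on each client (for instance with rpm --import).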

The differences in configuration between Fedora and RHEL are minor, but generally enough that you’ll need separate configuration trees.  Does adding another OS to your environment cause you to reach for your Rolaids, or can you comfortably absorb it?  For my own workplace, the latter is the case.  So what have I decided?  I have nine more months until next summer, so I’ll punt for now. 🙂