Understanding Condor security settings

Recently our colleagues at the University of Nebraska asked us to add host certificates to our Condor servers to enable encryption between sites. After fighting my way through Cfengine, I got the certificates in place. The next step was to enable GSI authentication on the servers. Having wised up over the years, I chose to commit the change early in the day. My commit made the following two edits:

Modified: prod/tmpl/condor/condor_config.negotiator
 ===================================================================
 --- prod/tmpl/condor/condor_config.negotiator   2011-05-18 17:04:21 UTC (rev 12379)
 +++ prod/tmpl/condor/condor_config.negotiator   2011-05-18 17:12:46 UTC (rev 12380)
 @@ -45,6 +45,12 @@
 QUILL_DB_NAME           = quill
 QUILL_DB_IP_ADDR        = quill-00.rcac.purdue.edu:5432

 +# Use host certs to provide some inter-site security
 +SEC_DAEMON_AUTHENTICATION = preferred
 +SEC_DAEMON_AUTHENTICATION_METHODS = GSI, PASSWORD
 +SEC_NEGOTIATOR_AUTHENTICATION = preferred
 +SEC_NEGOTIATOR_AUTHENTICATION_METHODS = GSI, PASSWORD
 +GSI_DAEMON_DIRECTORY = /etc/grid-security

 # define sub collectors
 COLLECTOR2 = $(COLLECTOR)

 Modified: prod/tmpl/condor/condor_config.submit
 ===================================================================
 --- prod/tmpl/condor/condor_config.submit       2011-05-18 17:04:21 UTC (rev 12379)
 +++ prod/tmpl/condor/condor_config.submit       2011-05-18 17:12:46 UTC (rev 12380)
 @@ -40,6 +40,11 @@
 condor1.ipfw.edu
 #      condorcm.pnc.edu, \

 +# Use host certs to provide some inter-site security
 +SEC_ADVERTISE_SCHEDD_AUTHENTICATION = preferred
 +SEC_ADVERTISE_SCHEDD_AUTHENTICATION_METHODS = GSI, PASSWORD
 +GSI_DAEMON_DIRECTORY = /etc/grid-security
 +
 # Use global event log (like the userlog)
 # Set MAX_EVENT_LOG to a huge number, since there's a bug keeping the
 # "set it to zero to let it grow unlimited" bit from working

I had tested the changes a bit and hadn’t noticed anything too out of whack. Until the following morning, when users began complaining about slow queueing and execution of PBS jobs. At first, I thought it was a license problem on one of the PBS servers, since that had happened a few times in the recent past. As the others in the group began investigating, though, they noticed that the PBS prologue script wasn’t completing when a job tried to land. It turned out that the condor_config_val call that sets the PBSRunning attribute was failing because the node couldn’t talk to the collector.

The root cause was my misunderstanding of the Condor security documentation. With clients set to “optional” and daemons set to “preferred”, both sides try to use the relevant security features. But since the methods didn’t match, the two sides refused to talk to each other at all instead of failing gracefully. Changing “preferred” to “optional” restored performance and job throughput. Having gone through this, it now makes sense, but it’s more than a little embarrassing to bring the entire infrastructure down.
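
To make the fix concrete, here is roughly what the settings from the diff above became after the change (the _METHODS lines stayed as they were):

# Ask for authentication, but fall back gracefully when negotiation fails
SEC_DAEMON_AUTHENTICATION = optional
SEC_NEGOTIATOR_AUTHENTICATION = optional
SEC_ADVERTISE_SCHEDD_AUTHENTICATION = optional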

A Cfengine learning experience

Note: This post refers to Cfengine 2. The difficulties I had may quite likely be a result of peculiarities in our environment or the limits of my own knowledge.

A few weeks ago, my friends at the University of Nebraska politely asked us to install host certificates on our Condor collectors and submitters so that flocking traffic between our two sites would be encrypted. It seemed like a reasonable request, so after getting certificates for 17-ish hosts from our CA, I set about putting them in place. I could have plopped them all in place easily enough with a for loop, but I decided it would make more sense to let Cfengine take care of it. This has the added advantage of making sure the certificate gets put back in place automatically when a host gets reinstalled or upgraded.

I thought it would be nice if I tested my Cfengine changes locally first. I know just enough Cfengine to be dangerous, and I didn’t want to spam the rest of the group with mail as I checked in modifications over and over again. So after editing the input file on one of the servers, I ran cfagent -qvk. It didn’t work. The syntax looked correct, but nothing happened. After a bit, I asked my soon-to-be-boss for help.

It turned out that I didn’t quite get the meaning of the -k option. I always used it to run against the local cache of the input files, not realizing that it killed all copy actions. Had I looked at the documentation, I would have figured that out. Like I said, I know just enough to be dangerous.

I didn’t want to create a bunch of error email, since some hosts wouldn’t be getting host certificates, so I went with an IfFileExists statement that I could use to define a group for the copy: stanza. I committed what I thought to be the correct changes and tried running cfagent again. The certificates still weren’t being copied into place. Looking at the output, I saw that it couldn’t find the file. Nonsense. It’s right there on the Cfengine server.

As it turns out, that’s not where IfFileExists looks; it looks on the host running cfagent. The file, of course, doesn’t exist locally, because Cfengine hasn’t yet copied it there. Eventually I surrendered and defined a separate group in cf.groups to reference in the appropriate input file. This makes the process more manual than I would have liked, but it actually works.

Oh, except for one thing. In testing, I had been using $(hostname) in a shellcommand: to make sure that the input file was actually getting read. When I finally got the copy: stanza sorted out, the certificates still weren’t being copied out. The cfagent output said it couldn’t find ‘/masterfiles/tmpl/security/host-certs/$(hostname).pem’. I had assumed $(hostname) was a valid Cfengine variable; instead, it was being passed to the shell command and expanded by the shell. In the shellcommand: the end result was indistinguishable from what I intended, but that didn’t translate to the copy: stanza. The variable I wanted was $(fqhost).
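
For the curious, the working arrangement was shaped roughly like this. This is a from-memory sketch of Cfengine 2 syntax, not our actual input files; the host names are invented, and $(policyhost) stands in for however your site names its Cfengine server:

# In cf.groups: list the hosts that should receive certificates
groups:
   hostcert_hosts = ( condor-cm-00 condor-submit-00 )

# In the input file: copy the certificate named after the host's FQDN
copy:
   hostcert_hosts::
      /masterfiles/tmpl/security/host-certs/$(fqhost).pem
         dest=/etc/grid-security/hostcert.pem
         mode=644
         owner=root
         server=$(policyhost)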

CCA11: Cloud Computing and Its Applications

Earlier this week, I attended CCA11 at Argonne National Laboratory. I was there to present an extended abstract and take in what I could. I’ve never presented at a conference before (unless you count a short talk to kick off the Condor BoF at LISA ’10, which I don’t) and the subject of my abstract was work that we’ve only partially done, so I was a bit nervous. Fortunately, our work was well-received. It was encouraging enough that I might be talked into writing another paper at some point.

One thing I learned from the poster session and the invited talks is that the definition of “cloud” is just as ambiguous as ever. I continue to hate the term, although the field (however you define it) is doing interesting things. There’s a volunteer effort underway at NASA to use MapReduce to generate on-demand product visualization for disasters. An early prototype for Namibian flooding is at http://matsu.opencloudconsortium.org/.

Perhaps one of the largest concerns is the sheer volume of data. For example, the National Institutes of Health have over two petabytes of genomics data available, but how can you transfer that? Obviously, in most cases a user would only request a subset of data, but if there’s a use case that requires the whole data set, then what? One abstract presenter championed the use of sneakernet and argued that network bandwidth is the greatest challenge going forward.

One application that wasn’t mentioned is the cloud girlfriend. Maybe next year?

Thanks to Andy Howard and Preston Smith for their previous work and for helping me write the abstract.

Log file myopia

I like to consider myself an enlightened sysadmin. I know I’m supposed to think outside the box from 30,000 feet. Still, every so often, my blinders come back on and I tend to be a bit myopic. This is most common when looking at log files in search of clues to an unknown problem.  A recent example was when I was trying to figure out why the Condor startd wasn’t running on a CentOS VM I had set up.

Since I didn’t want to have hundreds of ‘localhost.localdomain’s in the pool, I needed a way to give each VM a unique-ish and relevant name.  The easiest way seemed to be to check against a web server and use the output of a CGI script to set the host name.  Sounds simple enough, but after I put that in place the startd would immediately segfault.
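
The mechanism itself was nothing fancy. Hypothetically, a couple of lines in the VM’s boot scripts do the job (the URL and script name here are invented):

# Ask a CGI script for a unique-ish, relevant name, then set it
name=`curl -s http://config.example.edu/cgi-bin/vmname`
hostname "$name"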

I had no idea why, so I started peeking through the log file for the startd.  Lots of information there, but nothing particularly helpful.  After several hours of cross-eyed log reading and fruitless Googling, I thought I’d give strace a try.  I don’t know much about system-level programming, but I thought something might provide a clue.  Alas, it was not to be.

Eventually, I remembered that there’s a master log for Condor as well, and I decided to look in there.  Well, actually, I had looked in there earlier in the day and hadn’t seen anything that I thought was helpful.  This time I took a closer look and realized that it couldn’t resolve its host name and that’s why it was failing.

A few minutes later, I had changed the network setup to add the hostname to /etc/hosts so that Condor could resolve its host name.  A whole day’s worth of effort because I got too focused on the wrong log file.
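
The fix amounted to a single line.  Something like this in /etc/hosts (the address and name are invented) lets Condor resolve the local name even when DNS doesn’t know about the VM:

192.168.122.42   vm-worker-42.example.edu vm-worker-42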

Sometimes, Windows wins

It should be clear by now that I am an advocate of free software.  I’m not reflexively against closed software, though; sometimes it’s the right tool for the job.  Use of Windows is not a reason for mockery.  In fact, I’ve found one situation where I like the way Windows works better.

As part of our efforts to use Condor for power saving, I thought it would be a great idea if we could calculate the power savings based on the actual power usage of the machines.  The plan was to have Cycle Server aggregate the time in hibernate state for each model and then multiply that by the power draw for the model.  Since Condor doesn’t note the hardware model, I needed to write a STARTD_CRON module to determine this.  The only limitations I had were that I couldn’t depend on root/administrator privileges or on particular software packages being installed. (The execute nodes are in departments across campus and mostly not under my control.)
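
For anyone unfamiliar with the mechanism, the wiring looks something like the sketch below; the job name and script path are invented, and the details vary by Condor version.  The executable just prints ClassAd attributes on stdout on a schedule:

# Hook an external script into the startd's periodic cron jobs
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HWMODEL
STARTD_CRON_HWMODEL_EXECUTABLE = $(LIBEXEC)/hwmodel
STARTD_CRON_HWMODEL_MODE = Periodic
STARTD_CRON_HWMODEL_PERIOD = 24h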

Despite the lack of useful tools like grep, sed, and awk (there are equivalents for some of the taken-for-granted GNU tools, but they frankly aren’t very good), the plugin for Windows was very easy.  The systeminfo command gives all kinds of useful, parseable information about the system’s hardware and OS.  The only difficult part was chopping the blank spaces off the end of the output. I wanted to do this in Perl, but that’s not guaranteed to be installed on Windows machines, and I had some difficulty getting a standalone-compiled version working consistently.

On Linux, parsing the output was easy.  The hard part was getting the information at all.  dmidecode seems to be ubiquitous, but it requires root privileges to get any information.  I tried lshw, lshal, and the entire /proc tree.  /proc didn’t have the information I needed, and the two commands were not necessarily part of the “base” install.  The solution seemed to be to require the addition of a package (or to bundle a binary for lshw in our Condor distribution).
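
For illustration, the Linux attempt boiled down to something like this (the attribute name is my own).  It works nicely, right up until you remember it can’t count on root:

#!/bin/sh
# Print the hardware model as a ClassAd attribute.
# dmidecode needs root, which a STARTD_CRON module can't assume.
model=`dmidecode -s system-product-name 2>/dev/null`
[ -n "$model" ] && echo "HardwareModel = \"$model\""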

Eventually, we decided that it was more effort than it was worth to come up with a reliable module.  While both platforms had problems, Linux was definitely the more difficult.  It’s a somewhat rare condition, but there are times when Windows wins.

The joys of doing it right

A while back, I wrote a post about why it’s not always possible to DoItRight™, and that sometimes you just have to accept it.  Today I’m here to talk about a time that I did something right and how good it felt.  Now, that’s not to say that I’m eternally screwing up (although a good quarter of my Subversion commits are fixes of a commit I previously made), but there’s a difference between making something work and making it work well.

I decided that since we have a Nagios server, I might as well have it check on the health of our Condor services.  From what I could tell, no such checks existed, so I decided to write my own.  Nagios checks can be very simple: run a command or two, then return a number that means something to Nagios.  Many checks are written in bash or another shell language because they are so simple.  For my checks, I wanted to do some parsing of the command outputs to determine the state of job queues and the like.  Since that kind of work is a little heavy for a shell script, I opted to write it in Perl.  Yay Perl!
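
To show just how simple the contract is, here’s a minimal sketch in shell; my real checks do their parsing in Perl, but the skeleton is the same.  A Nagios plugin prints one status line and exits 0 for OK, 1 for WARNING, 2 for CRITICAL, or 3 for UNKNOWN:

#!/bin/sh
# Minimal Nagios-style check: one status line, one meaningful exit code.
# The condor_q summary line looks like "5 jobs; 3 idle, 2 running, 0 held".
summary=`condor_q 2>/dev/null | grep ' jobs;'`
if [ -z "$summary" ]; then
    echo "UNKNOWN - could not get a summary from condor_q"
    exit 3
fi
echo "OK - $summary"
exit 0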

Since there weren’t any checks available, I thought my work might be useful to others in the community.  As a result, I wanted to make sure my code was respectable.  This meant I spent some time designing, coding, and testing options that we don’t need but others might find useful.  It meant putting extra documentation into the code (and eventually writing some pod before I share the code publicly).  It meant mostly following the coding style of the Linux kernel (which I chose because “why not?”).

Some readers will (correctly) note that the Linux kernel coding style does not guarantee good code.  I don’t mean to suggest that it does, but I’ve found that it forced me to think about my code more deeply than I otherwise would.  Since I’m not a programmer, most of the code I write fills a small need of mine, and the quality bar is “does it do what I want it to?”  Writing something with the intent of sharing it publicly, and forcing yourself not to cut corners, makes the work more difficult, but the end result is a beauty to behold.

Ugly shell commands

Log files can be incredibly helpful, but they can also be really ugly.  Pulling information out programmatically can be a real hassle.  When a program exists to extract useful information (see: logwatch), it’s cause for celebration.  The following is what can happen when a program doesn’t exist (and yes, this code actually worked).

The scenario here is that a user complained that Condor jobs were failing at a higher-than-normal rate.  Our suspicion, based on a quick look at his log files, was that a few nodes were eating most of his jobs.  But how to tell?  I wanted to create a spreadsheet with the job ID, the date, the time, and the last execute host for all of the failed jobs.  I could either task a student with manually pulling this information out of the log files, or I could pull it out with some shell magic.

The first step was to get the job ID, the date, and the time from the user’s log files:

grep -B 1 Abnormal ~user/condor/t?/log100 | grep "Job terminated" | awk '{print $2 "," $3 "," $4 }' | sed "s/[\(|\)]//g" | sort -n > failedjobs.csv

What this does is search the multiple log files for the word “Abnormal”, printing the one line before each match because that’s where the information we want lives.  From those lines, we keep the ones matching “Job terminated”, pull out the second, third, and fourth fields, strip the parentheses off of the job ID, sort numerically, and write the result to the file failedjobs.csv.

The next step is to get the last execute node of the failed jobs from the system logs:

for x in `cat failedjobs.csv | awk -F, '{print $1}'`; do
    host=`grep "$x.*Job executing" /var/condor/log/EventLog* | tail -n 1 | sed -r "s/.*<(.*):.*/\1/g"`
    echo "`host $host | awk '{print $5}'`" >> failedjobs-2.csv
done

Wow.  This loop pulls the first field out of the CSV we made in the first step.  The IP address for each failed job is pulled from the Event Logs by searching for the “Job executing” string.  Since a job may execute on several different hosts in its lifetime, we want to only look at the last one (hence the tail command), and we pull out the contents of the angle brackets left of the colon.  This is the IP address of the execute host.

With that information, we use the host command to look up the hostname that corresponds to that IP address and write it to a file.  Now all that remains is to combine the two files and try to find something useful in the data.  And maybe to write a script to do this, so that it will be a little easier the next time around.
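
Combining the two files, at least, is a one-liner, since the loop wrote its output in the same order as its input:

# Stitch the job info and the hostnames into a single CSV
paste -d, failedjobs.csv failedjobs-2.csv > failedjobs-combined.csv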

A quick summary of green-er computing

Last week a Twitter buddy posted a blog entry called “E-Waste not, Want not”.  In it, she raises some very good points about how the technology we consider “green” isn’t always.  She’s right, but fortunately things may not be as dire as it seems.  As computers and other electronic devices become more and more important to our economy, communication, and recreation, efforts are being made to reduce the impact of these devices.  For the devices themselves, the familiar rules apply: reduce, reuse, recycle.

Reduce

The first way that reduction is being accomplished is through the improved efficiency of the components.  As processors become more powerful, they’re also becoming more efficient.  In some cases, the total electrical consumption still rises, but much more slowly than it would otherwise.  In addition, research and improvements in manufacturing technology are getting more out of the same space.  Whereas each compute core used to sit on a separate chip, nowadays it’s not unusual to have several cores on a single processor the same size as the old single-core models.  Memory and hard drives have increased their density dramatically, too.  In the space of about 10 years, we’ve gone from “I’ll never be able to fill a 20 GB hard drive” to 20 GB drives being so small that few companies even sell them anymore.

As the demand for computing increases, it might seem unreasonable to expect any reduction in the number of computers.  However, some organizations are doing just that.  Earlier this year, I replaced two eight-year-old computers I had been using with a single new computer that had more power than the two old ones combined.  That might not be very impressive, but consider the case of Solvay Pharmaceuticals: by using VMware’s virtualization software, they were able to consolidate their servers at a 10:1 ratio, resulting in a $67,000 annual savings in power and cooling costs.  Virtualization involves running one or more independent virtual computers on the same physical hardware.  This means, for example, that I can test software builds on several Linux variants and two versions of Windows without having to use separate physical hardware for each variation.

Thin clients are a related reduction.  In the old days of computing, most of the work was done on large central machines, and users connected via dumb terminals: basically a keyboard and a monitor.  In the late ’80s and ’90s, the paradigm shifted toward more powerful, independent desktops.  Now the shift is reversing itself in some cases.  Many organizations are beginning to use thin clients backed by a powerful central server.  The thin client contains just enough power to boot up and connect to the server.  While this isn’t useful in all cases, for general office work it is often quite suitable.  For example, my doctor has a thin client in each exam room instead of a full desktop computer.  Thin clients provide reduction by extending the replacement cycle.  While a desktop might need to be replaced every 3-4 years to keep an acceptable level of performance, thin clients can go 5-10 years or more because they don’t require local compute power.

Another way that the impact of computing is being reduced is by the use of software to increase the utilization of existing resources.  This particular subject is near and dear to me, since I spend so much of my work life on this very issue.  One under-utilized resource that can be scavenged is disk space.  Apache’s Hadoop software includes the ability to pool disk space on a collection of machines into a high-throughput file system.  For some applications, this can remove the need to purchase a dedicated file server.

In addition to disk space, compute power can be scavenged as well.  Perhaps the most widely known is BOINC, which was created to drive the SETI@Home project that was a very popular screen saver around the turn of the millennium.  BOINC allows members of the general public to contribute their “extra” cycles to actual scientific research.  Internally, both academic and financial institutions make heavy use of software products like Condor to scavenge cycles.  At Purdue University, over 22 million hours of compute time were harvested from the unused time on the research clusters in 2009 alone.  By making use of these otherwise wasted compute hours, people are getting more work done without having to purchase extra equipment.

Reuse

There’s such a wide range of what computers can be used for, and that’s a great thing when it comes to reusing.  Computers that have become too low-powered to use as desktops can find new life as file or web servers, networking gear, or teaching computers.  Cell phones, of course, seem to be replaced all the time (my younger cousins burn out their keyboards really quickly).  Fortunately, there’s a good market for used cell phones, and there are always domestic violence shelters and the like that will take donations of old ones.

Recycle

Of course, at some point all electronics reach the end of their useful lives.  At that point, it’s time to recycle them.  Fortunately, recycling in general is a common service from sanitation providers these days, and some of them handle electronics, as do many electronics stores.  Recycling electronics (including batteries!) is especially important because the materials are often toxic and often in short supply.  The U.S. Environmental Protection Agency has a website devoted to the recycling of electronic waste.

It’s not just the devices themselves that are a problem.  As I mentioned above, consolidating servers results in a large savings in power and cooling costs.  Keeping servers cool enough to continue operating is very energy-intensive.  In cooler climates, outside air is sometimes brought in to reduce the need for large air conditioners.  ComputerWorld recently had an article about using methane from cow manure to power a datacenter.  This is old hat to the people of central Vermont.

It’s clear that the electronic world is not zero-impact.  However, it has some positive social impacts, and there’s a lot of work being done to reduce the environmental impact.  So while it may not be the height of nobility to include a note about not printing in your e-mail signature, it’s still better than having a stack of papers on your desk.

Why it’s not always done the right way: difficulties with preempting Condor jobs when the disk is nearly full

In the IT field, there’s a concept called “best practice”: the recommended policy, method, etc. for a particular setting or action.  In a perfect world, every system would conform to the accepted best practices in every respect.  Reality isn’t always perfect, though, and there are often times when a sysadmin has to fall somewhere short of this goal.  Some Internet Tough Guys will insist that their systems are rock-solid and superbly secured.  That’s crap; we all have to cut corners.  Sometimes it’s acceptable, sometimes it’s a BadThing™.  This is the story of one of the (hopefully) acceptable times.


A warning about Condor access control lists

Like most sane services, Condor has a notion of access control.  In fact, Condor’s access control lists (ACLs) provide a very granular level of control, allowing you to set a variety of roles based on hostname/IP.  One thing we’re working on at my day job is making it easier for departments across campus to join our Condor pool.  In the face of budget concerns, a recommendation has been drafted which includes having departments choose between running Condor and powering machines off when not in use.  Given the preference for performing backups and system updates overnight, we’re guessing the majority will choose to donate cycles to Condor, so we’re trying to prepare for a large increase in the pool.

Included in that preparation is the switch from default-deny to default-allow-campus-hosts.  Previously, we only allowed specific subdomains on campus, which meant that every time a new department (which effectively means a new subdomain) joined the pool, we had to modify the ACLs.  While that isn’t a big deal, it seems simpler to just allow all of campus except the “scary” subnets (traditionally wireless clients, VPN clients, and the dorms.  Especially the dorms.)  Effectively, we’ll end up doing that anyway, and keeping the file more static should make it easier to maintain.

So on Wednesday, after the security group blessed our idea, I began the process of making the changes.  Let me point out here that you don’t really appreciate how much IP space an institution like Purdue has until you need to start blocking segments of it.  All of 128.10.0.0/16, 128.46.0.0/16, 128.210.0.0/16 and 128.211.0.0/16.  That’s a lot of public space, and it doesn’t include the private IP addresses in use.  So after combing through the IP space assignments, I finally got the ACLs written, and on Thursday I committed them.  And that’s when all hell broke loose.

Condor uses commas to separate as many hosts as you want, and asterisks can be used to wildcard hosts (Condor does not currently support CIDR notation, but that would be awesome).  The danger here is that if you accidentally put a comma in place of a period, you can end up denying write access to *.  Obviously, this causes things to break down.  Once people started complaining about Condor not working, I quickly found my mistake and pushed out a correction.  However, Condor does not give up so quickly: once a rule is in DENY_WRITE, the daemon will not stop denying that host until the Condor master has been stopped and re-started.  A simple config update won’t change it.
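
To make the failure mode concrete, here’s an illustration using one of the ranges above:

# Intended: deny a single wildcarded subnet
DENY_WRITE = 128.211.*
# The typo: a comma where the second period belongs splits this into
# two patterns, "128.211" and a bare "*" that matches every host
DENY_WRITE = 128.211,*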

We had to learn that by experimentation, so I spent most of Friday helping my colleagues re-start the Condor process everywhere and testing the hell out of my changes.  Fortunately, once everything had been cleaned up, it worked as expected, and this gave me a chance to learn more about Condor.  And I also learned a very important lesson: test your changes first, dummy.