Understanding Condor security settings

Recently our colleagues at the University of Nebraska asked us to add host certificates to our Condor servers to enable encryption between sites. After fighting my way through Cfengine, I got the certificates in place. The next step was to enable GSI authentication on the servers. Having wisened up over the years, I chose to commit the change early in the day. My commit made the following two edits:

Modified: prod/tmpl/condor/condor_config.negotiator
 --- prod/tmpl/condor/condor_config.negotiator   2011-05-18 17:04:21 UTC (rev 12379)
 +++ prod/tmpl/condor/condor_config.negotiator   2011-05-18 17:12:46 UTC (rev 12380)
 @@ -45,6 +45,12 @@
 QUILL_DB_NAME           = quill
 QUILL_DB_IP_ADDR        = quill-00.rcac.purdue.edu:5432

 +# Use host certs to provide some inter-site security
 +SEC_NEGOTIATOR = preferred
 +GSI_DAEMON_DIRECTORY = /etc/grid-security

 # define sub collectors

 Modified: prod/tmpl/condor/condor_config.submit
 --- prod/tmpl/condor/condor_config.submit       2011-05-18 17:04:21 UTC (rev 12379)
 +++ prod/tmpl/condor/condor_config.submit       2011-05-18 17:12:46 UTC (rev 12380)
 @@ -40,6 +40,11 @@
 #      condorcm.pnc.edu, \

 +# Use host certs to provide some inter-site security
 +GSI_DAEMON_DIRECTORY = /etc/grid-security
 # Use global event log (like the userlog)
 # Set MAX_EVENT_LOG to a huge number, since there's a bug keeping the
 # set it to zero to let it grow unlimited bit work

I had tested the changes a bit and hadn’t noticed anything too out of whack. Until the follow morning. Users were complaining about slow queueing and execution of PBS jobs. At first, I thought it was a license problem on one of the PBS servers, since that had been happening a few times in the recent past. As the others in the group began investigating, though, they noticed that the PBS prologue script wasn’t completing as a job tried to land. It turns out the condor_config_val call that changes the PBSRunning attribute was failing because the node couldn’t talk to the collector.

The root cause was my misunderstanding of the Condor security documentation. With clients set to “optional” and daemons set to “preferred”, they try to use the relevant security features. But since the methods didn’t match, they refused to talk to each other instead of failing gracefully. Changing the “preferred” to “optional” restored performance and job throughput. Having gone through this, it now makes sense, but it’s more than a little embarrassing to bring the entire infrastructure down.

A Cfengine learning experience

Note: This post refers to Cfengine 2. The difficulties I had may quite likely be a result of peculiarities in our environment or the limits of my own knowledge.

A few weeks ago, my friends at the University of Nebraska politely asked us to install host certificates on our Condor collectors and submitters so that flocking traffic between our two sites would be encrypted. It seemed like a reasonable request, so after getting certificates for 17-ish hosts from our CA, I set about trying to put them in place. I could have plopped them all in place easily enough using a for loop, but I decided it would make more sense to left Cfengine take care of it. This has the added advantage of making sure the certificate gets put in place automatically when a host gets reinstalled or upgraded.

I thought it would be nice if I tested my Cfengine changes locally first. I know just enough Cfengine to be dangerous, and I don’t want to spam the rest of the group with mail as I check in modifications over and over again. So after editing the input file on one of the servers, I ran cfagent -qvk. It didn’t work. The syntax looked correct, but nothing happened. After a bit, I asked my soon-to-be-boss for help.

It turned out that I didn’t quite get the meaning of the -k option. I always used it to run against the local cache of the input files, not realizing that it killed all copy actions. Had I looked at the documentation, I would have figured that out. Like I said, I know just enough to be dangerous.

I didn’t want to create a bunch of error email since some hosts wouldn’t be getting host certificates, so I went with a IfFileExists statement that I could use to define a group to use in the copy: stanza. So I committed what I thought to be the correct changes and tried running cfagent again. The certificates still weren’t being copied into place. Looking at the output, I saw that it couldn’t find the file. Nonsense. It’s right there on the Cfengine server.

As it turns out, that’s not where IfFileExists looks, it looks on the server running cfagent. The file, of course, doesn’t exist locally because Cfengine hasn’t yet copied it. Eventually I surrendered and defined a separate group in cf.groups to reference in the appropriate input file. This makes the process more manual than I would have liked, but it actually works.

Oh, except for one thing. In testing, I had been using $(hostname) in a shellcommand: to make sure that the input file was actually getting read. When I finally got the copy: stanza sorted out, the certificates still weren’t being copied out. The cfagent output said it couldn’t find ‘/masterfiles/tmpl/security/host-certs/$(hostname).pem’. As it turns out, I thought $(hostname) was a valid Cfengine variable. Instead, it was actually being passed to the shell command and being executed by the shell. The end result was indiscernible from what I intended in that case, but didn’t translate to the copy: stanza. The variable I wanted was $(fqhost).