Recently our colleagues at the University of Nebraska asked us to add host certificates to our Condor servers to enable encryption between sites. After fighting my way through Cfengine, I got the certificates in place. The next step was to enable GSI authentication on the servers. Having wised up over the years, I chose to commit the change early in the day. My commit made the following two edits:
Modified: prod/tmpl/condor/condor_config.negotiator
===================================================================
--- prod/tmpl/condor/condor_config.negotiator 2011-05-18 17:04:21 UTC (rev 12379)
+++ prod/tmpl/condor/condor_config.negotiator 2011-05-18 17:12:46 UTC (rev 12380)
@@ -45,6 +45,12 @@
 QUILL_DB_NAME = quill
 QUILL_DB_IP_ADDR = quill-00.rcac.purdue.edu:5432
+# Use host certs to provide some inter-site security
+SEC_DAEMON_AUTHENTICATION = preferred
+SEC_DAEMON_AUTHENTICATION_METHODS = GSI, PASSWORD
+SEC_NEGOTIATOR = preferred
+SEC_NEGOTIATOR_AUTHENTICATION_METHODS = GSI, PASSWORD
+GSI_DAEMON_DIRECTORY = /etc/grid-security
 # define sub collectors
 COLLECTOR2 = $(COLLECTOR)
Modified: prod/tmpl/condor/condor_config.submit
===================================================================
--- prod/tmpl/condor/condor_config.submit 2011-05-18 17:04:21 UTC (rev 12379)
+++ prod/tmpl/condor/condor_config.submit 2011-05-18 17:12:46 UTC (rev 12380)
@@ -40,6 +40,11 @@
 condor1.ipfw.edu
 # condorcm.pnc.edu, \
+# Use host certs to provide some inter-site security
+SEC_ADVERTISE_SCHEDD_AUTHENTICATION = preferred
+SEC_ADVERTISE_SCHEDD_AUTHENTICATION_METHODS = GSI, PASSWORD
+GSI_DAEMON_DIRECTORY = /etc/grid-security
+
 # Use global event log (like the userlog)
 # Set MAX_EVENT_LOG to a huge number, since there's a bug keeping the
 # set it to zero to let it grow unlimited bit work
I had tested the changes a bit and hadn’t noticed anything too out of whack. Until the following morning. Users were complaining about slow queueing and execution of PBS jobs. At first, I thought it was a license problem on one of the PBS servers, since that had happened a few times in the recent past. As the others in the group began investigating, though, they noticed that the PBS prologue script wasn’t completing as a job tried to land. It turned out that the condor_config_val call that changes the PBSRunning attribute was failing because the node couldn’t talk to the collector.
The root cause was my misunderstanding of the Condor security documentation. With clients set to “optional” and daemons set to “preferred”, both sides try to use the relevant security features. But since their authentication methods didn’t match, they refused to talk to each other instead of falling back gracefully. Changing “preferred” to “optional” in the new settings restored performance and job throughput. Having gone through this, the behavior now makes sense, but it’s more than a little embarrassing to have brought the entire infrastructure down.
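The negotiation behind this failure can be sketched roughly in Python. This is an illustration, not Condor code. The outcome table reflects my reading of the security negotiation section of the Condor manual. The function names, the `can_talk` rule (an attempted authentication with no shared method means no connection, as described above), and the client's `FS` method are hypothetical and used only for the example.

```python
# Illustrative model only; not part of Condor.
LEVELS = ("NEVER", "OPTIONAL", "PREFERRED", "REQUIRED")


def negotiate(client: str, daemon: str) -> str:
    """Return 'YES', 'NO', or 'FAIL' for whether a security feature is used.

    Assumed table: NEVER vs REQUIRED fails outright, NEVER otherwise turns
    the feature off, OPTIONAL vs OPTIONAL leaves it off, and any other
    combination turns it on.
    """
    client, daemon = client.upper(), daemon.upper()
    if client not in LEVELS or daemon not in LEVELS:
        raise ValueError(f"unknown security level: {client!r}, {daemon!r}")
    pair = {client, daemon}
    if pair == {"NEVER", "REQUIRED"}:
        return "FAIL"
    if "NEVER" in pair:
        return "NO"
    if "REQUIRED" in pair or "PREFERRED" in pair:
        return "YES"
    return "NO"  # both sides OPTIONAL


def can_talk(client, daemon, client_methods, daemon_methods) -> bool:
    """Hypothetical rule: once authentication is attempted, the two sides
    need at least one method in common, or the connection is refused."""
    outcome = negotiate(client, daemon)
    if outcome == "FAIL":
        return False
    if outcome == "NO":
        return True  # authentication is skipped entirely
    return bool(set(client_methods) & set(daemon_methods))


# Daemon methods come from the commit; the client's FS is a made-up example.
daemon_methods = ["GSI", "PASSWORD"]
client_methods = ["FS"]

broken = can_talk("optional", "preferred", client_methods, daemon_methods)
fixed = can_talk("optional", "optional", client_methods, daemon_methods)
print(broken, fixed)
```

Under these assumptions, the “optional”/“preferred” pairing forces an authentication attempt that cannot succeed, while “optional”/“optional” skips authentication and lets the node reach the collector again.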