In the IT field, there’s a concept called “best practice”: the recommended policy, method, etc., for a particular setting or action. In a perfect world, every system would conform to the accepted best practices in every respect. Reality isn’t always perfect, though, and there are often times when a sysadmin has to fall somewhere short of this goal. Some Internet Tough Guys will insist that their systems are rock-solid and superbly secured. That’s crap; we all have to cut corners. Sometimes it’s acceptable, sometimes it’s a BadThing™. This is the story of one of the (hopefully) acceptable times.
Last month, we noticed a few of our compute nodes had stopped running jobs. Investigation showed they had filled their hard disks, apparently as the result of Condor jobs that produced several gigabytes of output. No big problem: the job files were removed and the nodes were returned to service. But since this situation caused users’ jobs to fail, I wanted to find a way to prevent it, even if it was a pretty rare occurrence.
Fortunately for me, Condor has a lot of knobs (a.k.a. “ClassAds”) that can be turned. Two of the ClassAds that machines have are “Disk” and “TotalDisk”. Disk contains the amount of disk space (in KB) being used on the system. TotalDisk is, deceptively, the amount of free disk space. The Condor wiki contained a recipe for limiting the disk usage of Condor jobs, but it wasn’t quite what I wanted. The given configuration focused on the amount of disk being used, but I wanted to focus on the disk remaining. Why? We have over 2500 compute nodes, and the hard disk sizes are not uniform. My feeling is that if the resource is available, the users should be allowed to use it.
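For reference, a usage-based policy in the wiki's spirit can be sketched as a schedd setting like the one below. This is my reconstruction, not the wiki's actual recipe; DiskUsage is the job ClassAd attribute tracking a job's scratch-space consumption in KiB, and the 10 GB figure is only an example.

```
# Hold any running job (JobStatus == 2) whose own disk usage exceeds
# 10 GB (DiskUsage is in KiB). Sketch only -- not the wiki's exact recipe.
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (DiskUsage > 10 * 1024 * 1024)
```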
I used our “miner” cluster, located at one of the regional campuses, to test my new policy settings. It seemed like the perfect test system: it’s the system that had the problems, it has the smallest disks, and it doesn’t have as many users on the primary scheduling system. It took a few days of experimentation, but I finally got my configuration right. The submit hosts would hold jobs if the free disk space on an execute host dropped below 10 GB, and just in case a job didn’t get caught fast enough, the execute hosts would kick the job off when free disk reached 5 GB.
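As a sketch, the policy described above might look like the condor_config fragments below. The thresholds come from this post, but the expressions are my reconstruction rather than the actual configuration; in particular, using SYSTEM_JOB_MACHINE_ATTRS to expose the matched machine’s TotalDisk to the schedd is an assumption about the mechanism. TotalDisk is reported in KiB.

```
# Execute-side (startd): evict a job once the node's free disk drops
# below 5 GB, and let it vacate gracefully.
PREEMPT = ($(PREEMPT)) || (TotalDisk < 5 * 1024 * 1024)
WANT_VACATE = True

# Submit-side (schedd): copy the matched machine's TotalDisk into the
# job ad as MachineAttrTotalDisk0, then hold running jobs whose execute
# host reported less than 10 GB free. Note this value is a snapshot
# from match time, which is one reason submit-side checks lag reality.
SYSTEM_JOB_MACHINE_ATTRS = $(SYSTEM_JOB_MACHINE_ATTRS) TotalDisk
SYSTEM_PERIODIC_HOLD = (JobStatus == 2) && (MachineAttrTotalDisk0 < 10 * 1024 * 1024)
```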
Satisfied that I had it correct, I went to our change management board with a proposal to implement this policy on all of our systems. I committed the changes after one last test. A few hours later, it was time to go home. Almost immediately after I got on the bus, I got an e-mail from my office mate. Apparently, my changes resulted in everyone’s jobs being held. As soon as I got home, the laptop came out and it was time to try to fix things.
It didn’t take too long to figure out that a lot of the held jobs were landing on miner nodes. Since miner has nearly 900 cores and few regular users, a lot of jobs tend to land there. The only problem is that miner nodes have 40 GB hard disks. Our node configuration includes a 16 GB swap partition, which doesn’t leave much room for jobs. By the time the operating system and applications are installed, many of the miner nodes have only about 10 GB of free space. So every time a job started writing to disk, it would get held, and a new job would land on the node. Before long, thousands of jobs were held.
The options were either to forget the whole thing or to remove all of the miner nodes (and the many thousands of compute hours they provide). I briefly experimented with setting the submit-side limits conditionally, but it got messy quickly. After discussing the problem with my manager, we agreed that it just wasn’t worth the effort for such an edge case. We left the execute-side protections in place (with a smaller cutoff) and got rid of the submit-side holds (since the submit-side checks run less frequently, lowering that threshold by much would have been functionally the same as not having the holds at all). And that’s one reason why it’s not always done the right way.
Sidebar: Why didn’t my testing catch this?
I mentioned that I did the testing on the miner cluster, and that it gets a lot of jobs. Well, the submit nodes for that cluster are otherwise unused (really — my submissions were the only ones ever made from those hosts), which made them a good candidate for testing that could possibly cause outages. But by doing that, I didn’t get a good enough feel for how many jobs were landing on miner, or how many would be held because of the disk limitations.