Last month, HPCWire ran an article about the decommissioning of Los Alamos National Lab’s “Roadrunner” cluster. The article quoted LANL’s Paul Henning: “Rather than think of these machines as physical entities, we think of them as projects.” This struck me as a very sensible position to take. One of the defining characteristics of a project is that it is of limited duration. Compute clusters have a definite useful life, limited by the vendor’s hardware warranty and by the system’s performance (in terms of both computational ability and power consumption) relative to new systems.
Furthermore, the five PMBOK process groups all come into play. Initiating and Planning happen before the cluster is installed. Executing could largely be considered the building and acceptance testing portion of a cluster’s life. The operational time is arguably Monitoring and Controlling. Closing, of course, is the decommissioning of the resource. Smaller projects such as software updates and planned maintenance also occur within the larger project. Perhaps it is better to think of each cluster as a program?
The natural extension of considering a cluster (or any resource, for that matter) to be a project is assigning a project manager to each one. This was a key component of a staffing plan that a former coworker and I came up with for my previous group. With five large compute resources, plus a few smaller resources and the supporting infrastructure, it was very difficult for any one person to know what was going on across all of them. Our proposal included having one engineer assigned as the point of contact for a particular resource. This person didn’t have to fix everything on that cluster, but they would know about every issue and all of its unique configuration. This way, time wouldn’t be wasted repeating the same diagnostic steps three months apart when an issue recurs.