Why HTCondor is a pretty awesome scheduler

In early March, The Next Platform published an article I wrote about cHPC, a container project aimed at HPC applications. But as I wrote it, I thought about how HTCondor has been addressing a lot of the concerns for a long time. Since I’m in Madison for HTCondor Week right now, I thought this was a good time to explain some of the ways this project is awesome.

No fixed walltime. This is a benefit or a detriment, depending on the circumstances, but most schedulers require the user to define a requested walltime at submission. If the job isn’t done at the end of that time, the scheduler kills it. Sorry about your results, get back in line and ask for more walltime. HTCondor’s flexible configuration allows administrators to enable such a feature if desired, but by default, users are not forced to make a guess that they’re probably going to get wrong.

Flexible requirements and resource monitoring. HTCondor supports user-requestable CPU, memory, and GPU natively. With partitionable slots, resources can be carved up on the fly. And HTCondor has “concurrency limits”, which allow for customizable resource constraints (e.g. software licenses or database connections).
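To make the resource requests concrete, here is a minimal Python sketch that renders a submit description asking for CPUs, memory, a GPU, and a concurrency limit. The request_cpus, request_memory, request_gpus, and concurrency_limits submit commands are HTCondor’s; the render_submit helper, the simulate.sh executable, and the matlab_license limit name are made up for illustration, and a concurrency limit only does anything if the pool administrator has defined it.

```python
def render_submit(executable, cpus, memory_mb, gpus=0, limits=None):
    """Return the text of a submit description for a single job.

    Hypothetical helper for illustration; not part of HTCondor.
    """
    lines = [
        f"executable = {executable}",
        f"request_cpus = {cpus}",
        # A bare number for request_memory is read as MiB.
        f"request_memory = {memory_mb}",
    ]
    if gpus:
        lines.append(f"request_gpus = {gpus}")
    if limits:
        # concurrency_limits is a comma-separated list of name:count pairs.
        # The names must match limits configured by the pool administrator.
        pairs = ", ".join(f"{name}:{count}" for name, count in sorted(limits.items()))
        lines.append(f"concurrency_limits = {pairs}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "simulate.sh" and "matlab_license" are placeholder names.
    print(render_submit("simulate.sh", cpus=4, memory_mb=8192, gpus=1,
                        limits={"matlab_license": 1}))
```

Saving that text to a file and handing it to condor_submit would queue one job with those requests.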

So many platforms. Despite the snobbery of HPC sysadmins, people do real work on Windows. HTCondor has almost-full feature parity on Windows. It also has “universes” for Docker and virtual machines.

Federation. Want to overflow to your friend’s resource? You can do that! You can even submit jobs from HTCondor to other schedulers.

Support for disappearing resources. In the cloud, this is the best feature. HTCondor was designed for resource scavenging on desktops, and it still supports that as a first-class use case. That means machines can come and go without much hassle. Contrast this with other schedulers, where some explicit external action has to happen in order to add or remove a node.

Free as in freedom and free as in beer. Free beer is also the best way to get something from the HTCondor team. But HTCondor is licensed under the Apache 2.0 license, so anyone can use it for any purpose.

HTCondor isn’t perfect, and there are some use cases where it doesn’t make sense (e.g. low-latency), but it’s a pretty awesome project. And it’s been around for over three decades.

Other writing in February 2017

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we managed our 5th consecutive million-page-view month, despite the short month. I wrote the articles below.

Also, the 2016 Open Source Yearbook is now available. You can get a free PDF download now or buy the print version at cost. Or you can do both!

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • HyperXite case study – The HyperXite team used CycleCloud software to run simulations for their hyperloop pod.
  • ALS research case study – A professor at the University of Arizona quickly simulated a million compounds as part of a search for pharmacological treatment for Lou Gehrig’s disease.
  • Transforming enterprise workloads – A brief look at how some of our customers transform their businesses by using cloud computing.
  • LAMMPS scaling on Microsoft Azure – My coworkers did some benchmarking of the InfiniBand interconnect on Microsoft Azure. I just wrote about it.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in January 2017

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we had our fourth consecutive month with a million-plus page views and set a record with 1,122,064. I wrote the articles below.

Also, the 2016 Open Source Yearbook is now available. You can get a free PDF download now or wait for the print version to become available. Or you can do both!

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • Use AWS EBS Snapshots to speed instance setup — Staging reference data can be a time-expensive operation. This post describes one way we cut tens of minutes off of time for a cancer research workload.
  • Various ghost-written pieces. I’ll never tell which ones!

Maybe your tech conference needs less tech

My friend Ed runs a project called “Open Sourcing Mental Illness”, which seeks to change how the tech industry talks about mental health (to the extent we talk about it at all). Part of the work involves the publication of handbooks developed by mental health professionals, but a big part of it is Ed giving talks at conferences. Last month he shared some feedback on Twitter:

So I got feedback from a conf a while back where I did a keynote. A few people said they felt like it wasn’t right for a tech conf. It was the only keynote. Some felt it wasn’t appropriate for a programming conf. Time could’ve been spent on stuff that’d help career. Tonight a guy from a company that sponsored the conf said one of team members is going to seek help for anxiety about work bc of my talk. That’s why I do it. Maybe it didn’t mean much to you, but there are lots of hurting, scared people who need help. Ones you don’t see.

Cate Huston had similar feedback from a talk she gave in 2016:

the speaker kept talking about useless things like feelings

The tech industry as a whole, and some areas more than others, likes to imagine that it is as cool and rational as the computers it works with. Conferences should be full of pure technology. And yet we bemoan the fact that so many of our community are real jerks to work with.

I have a solution: maybe your tech conference needs less technology. After all, the only reason anyone pays us to do this stuff is because it (theoretically) solves problems for human beings. I’m biased, but I think the USENIX LISA conference does a great job of this. LISA has three core areas: architecture, engineering, and culture. You could look at it this way: designing, implementing, and making it so people will help you the next time around.

Culture is more than just sitting around asking “how does this make you feeeeeeeel?” It includes things like how to avoid burnout and how to train the next generation of practitioners. It also, of course, includes how to not be an insensitive jerk who inflicts harm on others with no regard for the impact they cause.

I enjoy good technical content, but I find that over the course of a multi-day conference I don’t retain very much of it. For a few brief hours in 2011, I understood SELinux and I was all set to get it going at home and work. Then I attended a dozen other sessions and by the time I got home, I forgot all of the details. My notes helped, but it wasn’t the same. On the other hand, the cultural talks tend to be the ones that stick with me. I might not remember the details, but the general principles are lasting and actionable.

Every conference is different, but I like having one-third of content be not-tech as a general starting point. We’re all humans participating in these communities, and it serves no one to pretend we aren’t.

Other writing in December 2016

Happy new year! Where have I been writing when I haven’t been writing here?

SysAdvent

Once again, SysAdvent was a great success. The large community that has built up around this project means I do less than in years past. I want to give others the opportunity to get involved, too. This year I edited one article:

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Here are the articles I wrote last month:

Opensource.com

Over on Opensource.com, we hit the million page view mark for the third consecutive month. I wrote the articles below.

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • LISA 16 Cloud HPC BoF — I summarized a BoF session at the LISA Conference in Boston.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in November 2016

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Much like my role with Opensource.com as a Community Moderator, I look at the other names on the list and I just say “wow! How did I end up in such good company?” The articles I wrote last month:

  • Advances in in situ processing tie to exascale targets — The growth in FLOPS is outpacing the growth in IOPS. Analyzing simulations as they run is becoming increasingly important for scientists and engineers.
  • Microsoft Research pens Quill for data intensive analysis — Collecting data is only useful to the extent that the data is analyzed. We have more data these days, but no platform that can handle both real-time streaming and post hoc analysis. The Quill project aims to change that.
  • JVM Boost shows warm Java is better than cold — The Java Virtual Machine allows “write once, run anywhere” but it imposes a performance penalty. For short-running jobs, the hit can be significant. The HotTub project speeds up these jobs (up to 30x in some cases!) by reusing JVM processes.

Opensource.com

Over on Opensource.com, I agreed to coordinate the Doc Dish column. I also wrote the articles below. It was a great month for the site. Three times during November, we set a single-day page view record. We also crossed the million page view mark for the second consecutive month and the third time in site history.

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • Scale in a Cloudy World — I contributed an article to HPC Source about how to scale cloud HPC environments.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in October 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had our second-ever month with a million page views! While I didn’t have any articles published, I did agree to coordinate the Doc Dish column, so there’s that.

Meanwhile, I wrote or edited a few things for work, too:

I also spoke at the All Things Open conference in Raleigh, NC. It went okay.

Other writings in September 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had another 900k+ page views in the month: the fourth time in site history and the second consecutive month. I contributed two articles:

Meanwhile, I wrote a few things for work, too:

  • Cycle Computing: The cloud startup that just keeps kicking — The Next Platform wrote a very nice article about us, so I wrote a blog post talking about how nice it was. (Hey, I’m in marketing now. It’s what we do).
  • Cloud-Agnostic Glossary — Supporting multiple cloud-service providers means having to translate terms between them. I put together a Rosetta Stone to help translate relevant terms between AWS, Azure, and Google Cloud.
  • The question isn’t cost, it’s value — When people talk about the cost of cloud computing, they’re usually looking at the raw dollar value. Since it takes money to make money, that’s not always the right way to look at it. It’s better to consider the value generated.

Come see me at these conferences in the next few months

I thought I should share some upcoming conferences where I will be speaking or in attendance.

  • 9/16 — Indy DevOps Meetup (Indianapolis, IN) — It’s an informal meetup, but I’m speaking about how Cycle Computing does DevOps in cloud HPC
  • 10/1 — HackLafayette Thunder Talks (Lafayette, IN) — I organize this event, so I’ll be there. There are some great talks lined up.
  • 10/26-27 — All Things Open (Raleigh, NC) — I’m presenting the results of my M.S. thesis. This is a really great conference for open source, so if you can make it, you really should.
  • 11/14-18 — Supercomputing (Salt Lake City, UT) — I’ll be working the Cycle Computing booth most of the week.
  • 12/4-9 — LISA (Boston, MA) — The 30th version of the premier sysadmin conference looks to be a good one. I’m co-chairing the Invited Talks track, and we have a pretty awesome schedule put together if I do say so myself.

Changing how HTCondor is packaged in Fedora

The HTCondor grid scheduler and resource manager follows the old Linux kernel versioning scheme: for release x.y.z, an even-numbered y marks a “stable” series that gets bugfixes, while behavior changes and major features go into the odd-numbered development series. For a long time, the HTCondor packages in Fedora used the development series. However, this forces a choice between introducing behavior changes when a new development HTCondor release comes out and pinning a Fedora release to a particular HTCondor release, which means no bugfixes.

This ignores the Fedora Packaging Guidelines, too:

As a result, we should avoid major updates of packages within a stable release. Updates should aim to fix bugs, and not introduce features, particularly when those features would materially affect the user or developer experience. The update rate for any given release should drop off over time, approaching zero near release end-of-life; since updates are primarily bugfixes, fewer and fewer should be needed over time.

Although the HTCondor developers do an excellent job of preserving backward compatibility, behavior changes can happen between x.y.1 and x.y.2. HTCondor is not a major part of Fedora, but we should still attempt to be good citizens.
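The even/odd convention is simple enough to sketch in a few lines of Python. The series_kind function name is my own invention, and it encodes only the scheme as this post describes it, not anything HTCondor itself ships.

```python
def series_kind(version):
    """Classify an x.y.z version string as 'stable' or 'development'.

    Illustrative only: an even y means the stable series,
    an odd y means the development series.
    """
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected x.y.z, got {version!r}")
    _x, y, _z = (int(part) for part in parts)
    return "stable" if y % 2 == 0 else "development"


if __name__ == "__main__":
    for candidate in ("8.5.7", "8.6.0"):
        print(candidate, series_kind(candidate))
```

Under this rule, an 8.6.z release is on the stable series, and an 8.5.z release is on the development series.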

After discussing the matter with upstream and the other co-maintainers, I’ve submitted a self-contained change for Fedora 25 that will:

  1. Upgrade the HTCondor version to 8.6
  2. Keep HTCondor in Fedora on the stable release series going forward

Most of the bug reports against the condor-* packages have been packaging issues and not HTCondor bugs, so upstream isn’t losing a massive testing resource here. I think this will be a net benefit to Fedora since it prevents unexpected behavior changes and makes it more likely that I’ll package upstream releases as soon as they come out.