Other writings in September 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had another 900k+ page views in the month: the fourth time in site history and the second consecutive month. I contributed two articles:

Meanwhile, I wrote a few things for work, too:

  • Cycle Computing: The cloud startup that just keeps kicking — The Next Platform wrote a very nice article about us, so I wrote a blog post talking about how nice it was. (Hey, I’m in marketing now. It’s what we do).
  • Cloud-Agnostic Glossary — Supporting multiple cloud-service providers means having to translate terms between them. I put together a Rosetta Stone to help translate relevant terms between AWS, Azure, and Google Cloud.
  • The question isn’t cost, it’s value — When people talk about the cost of cloud computing, they’re usually looking at the raw dollar value. Since it takes money to make money, that’s not always the right way to look at it. It’s better to consider the value generated.

Come see me at these conferences in the next few months

I thought I should share some upcoming conferences where I will be speaking or in attendance.

  • 9/16 — Indy DevOps Meetup (Indianapolis, IN) — It’s an informal meetup, but I’m speaking about how Cycle Computing does DevOps in cloud HPC
  • 10/1 — HackLafayette Thunder Talks (Lafayette, IN) — I organize this event, so I’ll be there. There are some great talks lined up.
  • 10/26-27 — All Things Open (Raleigh, NC) — I’m presenting the results of my M.S. thesis. This is a really great conference for open source, so if you can make it, you really should.
  • 11/14-18 — Supercomputing (Salt Lake City, UT) — I’ll be working the Cycle Computing booth most of the week.
  • 12/4-9 — LISA (Boston, MA) — The 30th version of the premier sysadmin conference looks to be a good one. I’m co-chairing the Invited Talks track, and we have a pretty awesome schedule put together if I do say so myself.

Changing how HTCondor is packaged in Fedora

The HTCondor grid scheduler and resource manager follows the old Linux kernel versioning scheme: for release x.y.z, if y is an even number it’s a “stable” series that gets only bugfixes, while behavior changes and major features go into the odd-numbered y series. For a long time, the HTCondor packages in Fedora used the development series. However, this leads to a choice between introducing behavior changes when a new development HTCondor release comes out or pinning a Fedora release to a particular HTCondor release, which means no bugfixes.

This ignores the Fedora Packaging Guidelines, too:

As a result, we should avoid major updates of packages within a stable release. Updates should aim to fix bugs, and not introduce features, particularly when those features would materially affect the user or developer experience. The update rate for any given release should drop off over time, approaching zero near release end-of-life; since updates are primarily bugfixes, fewer and fewer should be needed over time.

Although the HTCondor developers do an excellent job of preserving backward compatibility, behavior changes can happen between x.y.1 and x.y.2. HTCondor is not a major part of Fedora, but we should still attempt to be good citizens.

After discussing the matter with upstream and the other co-maintainers, I’ve submitted a self-contained change for Fedora 25 that will

  1. Upgrade the HTCondor version to 8.6
  2. Keep HTCondor in Fedora on the stable release series going forward

Most of the bug reports against the condor-* packages have been packaging issues and not HTCondor bugs, so upstream isn’t losing a massive testing resource here. I think this will be a net benefit to Fedora since it prevents unexpected behavior changes and makes it more likely that I’ll package upstream releases as soon as they come out.
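The even/odd versioning rule behind this change is simple enough to sketch in a few lines of Python. This is only an illustration of the scheme as described above; the function name `series_kind` is my own invention and is not part of HTCondor or the Fedora tooling.

```python
def series_kind(version: str) -> str:
    """Classify an x.y or x.y.z version string by the parity of y.

    Hypothetical helper: an even y means a "stable" series,
    an odd y means a "development" series.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected x.y or x.y.z, got {version!r}")
    minor = int(parts[1])
    return "stable" if minor % 2 == 0 else "development"


if __name__ == "__main__":
    # Versions mentioned elsewhere on this page.
    for v in ("8.3.8", "8.4.0", "8.5.0", "8.6"):
        print(v, series_kind(v))
```

Under this rule, 8.6 lands in a stable series, which is the kind of release the change keeps Fedora on.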

Looking for my replacement

It’s been nearly three years since I joined Cycle Computing as a Senior Support Engineer. Initially, I led a team of me, but since then we’ve grown the organization. I’d like to think I did a good job of growing not only the team, but the tooling and processes to enable my company to provide excellent support to enterprise customers across a variety of fields.

But now, it is time to hire my replacement. I’m taking my talents across the (proverbial) hall to begin working as a Technical Evangelist. I’ll be working on technical marketing materials, conferences, blog posts, and all kinds of neat stuff like that. I think it’s a good overlap of my skills and interests, and it will certainly be a new set of challenges.

So while this move is good for me, and good for Cycle Computing’s marketing efforts, it also means we need a new person to manage our support team. The job has been posted to our job board. If you’re interested, I encourage you to apply. It’s a great team at a great company. If you have any questions, I’d be happy to talk to you about it.

Hints for using HTCondor’s credd and condor_store_cred

HTCondor has the ability to run jobs as either an unprivileged “nobody” user or the submitting user. On Linux, enabling this is fairly easy: the administrator just sets the UID_DOMAIN configuration to the same value on the submit and execute machines and away you go. On Windows, you need to run the credential daemon (condor_credd) and the user must store credentials using condor_store_cred.

The manual does a pretty good job of describing the basic setup of the credd, though there are some important pieces missing. With help from HTCondor technical lead Todd Tannenbaum, I’ve submitted some improvements to the docs, but in the meantime…

The main thing to consider when configuring your pool to use the credd is that it wants things to be secure. That makes sense, considering its entire job is to securely store and transfer user credentials. The credd will not hand out the password unless the client is authenticated and using a secure connection. The method of authentication is not important (if you really, really trust your network, you can use the CLAIMTOBE method), so long as authentication occurs somehow.

So where do the condor_store_cred hints come in? Often, the credd runs on the same machine as the schedd, and users log in to there to submit jobs. In that case, everything’s probably fine. But if you’re submitting jobs from a machine outside the pool (for example, a user’s workstation), it can get a little hairier.

Before running condor_store_cred, HTCondor needs to be told where to look for the credd, and the client settings mentioned above need to meet the credd’s requirements. (I’m using CLAIMTOBE here for simplicity). If the machine the user submits from is not in the pool, condor_store_cred will need to know where to find the collector, too.

CREDD_HOST = scheduler.example.com
COLLECTOR_HOST = centralmanager.example.com

As of this writing, condor_store_cred gives an unhelpful error message if something goes wrong. It will always say “Make sure your ALLOW_WRITE setting includes this host.”, so if your ALLOW_WRITE setting already includes the host in question, you might get stuck. Use the -debug option to get better output. For example:

02/16/16 12:23:51 STORE_CRED: In mode 'query'
02/16/16 12:23:51 Warning: Collector information was not found in the configuration file. ClassAds will not be sent to the collector and this daemon will not join a larger Condor pool.
02/16/16 12:23:51 STORE_CRED: Failed to start command.
02/16/16 12:23:51 STORE_CRED: Unable to contact the REMOTE schedd.

This tells you that you forgot to set the COLLECTOR_HOST in your configuration.

Another hint is that if your scheduler name is different than the machine name (e.g. if you run multiple condor_schedd processes on a single machine and have Q1@hostname, Q2@hostname, etc.), you might need to include “-name Q1@hostname” in the arguments. Unlike most other HTCondor client commands, you cannot specify a “sinful string” as a target using the “-addr” option.

Hopefully this helps you save a little bit of time getting run_as_owner working on your Windows pool, until such time as I sit down to write that “Administering HTCondor” book that I’ve been meaning to work on for the last 5 years.

Supercomputing ’15

Last week, I spent a few days in Austin, Texas for the Supercomputing conference. Despite having worked in HPC for years, I’d never been to SC. It’s a big conference. Since everyone heard I was going, they set a record this year with over 12,000 attendees. That’s roughly 10x the size of LISA, where I had been a few days earlier.

I missed Alan Alda’s keynote, so my trip was basically ruined. That’s not true, actually. I spent most of the time in my company’s booth giving demos and talking to people. I had a lot of fun doing that. I’m sure the technical sessions were swell, but that’s okay. I look forward to going again next year, hopefully for the whole week and not immediately following another week-long conference.


Ben with a minion

HTCondor 8.3.8 in Fedora repos

It’s only been a month-plus since HTCondor 8.3.8 was released, but I finally have the Fedora packages updated. Along the way, I fixed a couple of outstanding bugs in the Fedora package. The builds are in the updates-testing repo, so test away!

As of right now, upstream plans to release HTCondor 8.5.0 early next week, so I got caught up just in time.

HTCondor Week 2015

There are many reasons I enjoy the annual gathering of HTCondor users, administrators, and developers. Some of those reasons involve food and alcohol, but mostly it’s about the networking and the knowledge sharing.

Unlike many other conferences, HTCondor Week is nearly devoid of vendors. I gave a presentation on behalf of my company, and AWS was present this year, but it wasn’t a sales pitch in either case. The focus is on how HTCondor enabled research. I credit the project’s academic roots.

Every year, themes seem to develop. This year, the themes were cloud and caching. Cloud offerings seem to really be ready to take off in this community, even though Miron would say that the cloud is just a different form of grid computing that’s been done for decades. The ability to scale well beyond internal resources quickly and cheaply has obvious appeal. The limiting factor currently seems to be that university funding rules make it slightly more difficult for academic researchers than just pulling out a credit card.

In the course of one session, three different caching mechanisms were discussed. This was interesting because it is not something that’s been discussed much in the past. It makes sense, though, that caching files common across multiple jobs on a node would be a big improvement in performance. I’m most partial to Zach Miller’s fledgling HTCache work, though the squid cache and CacheD presentations had their own appeal.

Todd Tannenbaum’s “Talk of Lies” spent a lot of time talking about performance improvements that have been made in the past year, but they really need to congratulate themselves more. I’ve seen big improvements from 8.0 to 8.2, and it looks like even more will land in 8.4. There’s some excellent work planned for the coming releases, and I hope it pans out.

After days of presentations and conversations, my brain is full of ideas for improving my company’s products. I’m really motivated to make contributions to HTCondor, too. I’m even considering carving out some time to work on that book I’ve been wanting to write for a few years. Now that would truly be a miracle.

Cores or machines?

Back in February, Pete Cheslock quipped “100,000 cores – cause it sounds more impressive than 2000 servers.” Patrick Cable pointed out that HPC people in particular talk in cores. I told them both that the “user perspective is all about cores. The number of machines it takes to provide them means squat.” Andrew Clay Shafer disagreed, with a link to some performance benchmarks.

He’s technically correct (the best kind of correct), but misses the point. Yes, there are performance impacts when the number of machines changes (interestingly, fewer machines is better for parallel jobs, while more machines is better for serial jobs), but that’s not necessarily of concern to the user. Data movement and other constraints can wash out any performance differences the machine count introduces.

But really, the concern with core count is misplaced, too. What should really be of concern to the user is the time-to-results. It’s up to the IT service provider to translate that need to the technical requirements (this is more true for operational computing than research, and it depends on the workload having a fair degree of predictability). The user says “I need to do X amount of computation and get the results in Y amount of time.” Whether this is done on one huge machine or ten thousand small machines doesn’t really matter. This plays well into cloud environments where you can use a mixture of instance types to get to the size you need.
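To make the “X amount of computation in Y amount of time” framing concrete, here is a rough back-of-the-envelope sketch. It assumes the work is perfectly parallel and ignores data movement, scheduling overhead, and differences between instance types, which is exactly the kind of detail the service provider has to account for in practice. The function names and the sample numbers are illustrative assumptions, not taken from any real workload.

```python
import math


def cores_needed(core_hours: float, deadline_hours: float) -> int:
    """Cores required to finish `core_hours` of work within `deadline_hours`.

    Assumes perfectly parallel work with no overhead (a simplification).
    """
    if core_hours < 0 or deadline_hours <= 0:
        raise ValueError("core_hours must be >= 0 and deadline_hours > 0")
    return math.ceil(core_hours / deadline_hours)


def machines_needed(cores: int, cores_per_machine: int) -> int:
    """Machines required to supply `cores`, given a uniform machine size."""
    if cores_per_machine <= 0:
        raise ValueError("cores_per_machine must be > 0")
    return math.ceil(cores / cores_per_machine)


if __name__ == "__main__":
    # Hypothetical request: 100,000 core-hours of work, results in 10 hours.
    cores = cores_needed(100_000, 10)
    print("cores:", cores)
    # The same core count can come from very different machine counts.
    for size in (1, 16, 50):
        print(f"{size}-core machines:", machines_needed(cores, size))
```

The user’s request fixes only the first number. How many machines provide those cores is the provider’s problem, which is the point of the argument above.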

Cloud detente

Evident.io founder and CEO Tim Prendergast wondered on Twitter why other cloud service providers aren’t taking marketing advantage of the Xen vulnerability that led Amazon and Rackspace to reboot a large number of cloud instances over a few-day period. Digital Ocean, Azure, and Google Compute Engine all use other hypervisors, so isn’t this an opportunity for them to brag about their security? Amazon is the clear market leader, so pointing out this vulnerability is a great differentiator.

Except that it isn’t. It’s a matter of chance that Xen is the hypervisor facing an apparently serious and soon-to-be-public exploit. Next week it could be Microsoft’s Hyper-V. Imagine the PR nightmare if Microsoft bragged about how much more secure Azure is only to see a major exploit strike Hyper-V next week. It would be even worse if the exploit was active in the wild before patches could be applied.

“Choose us because of this Xen issue” is the cloud service provider equivalent of an airline running a “don’t fly those guys, they just had a plane crash” ad campaign. Just because your competition was unlucky this time, there’s no guarantee that you won’t be the loser next time.

I’m all for companies touting legitimate security features. Amazon’s handling of this incident seems pretty good, and I think they generally do a good job of giving users the ability to secure their environment. That doesn’t mean someone can’t come along and do it better. If there’s anything 2014 has taught us, it’s that we have a long road ahead of us when it comes to the security of computing.

It’s to the credit of Amazon’s competition that they’ve remained silent. It shows a great degree of professionalism. Digital Ocean’s Chief Technology Evangelist John Edgar had the best explanation for the silence: “because we’re not assholes mostly.”