Other writing in November 2016

Where have I been writing when I haven’t been writing here?

The Next Platform

I’m freelancing for The Next Platform as a contributing author. Much like my role with Opensource.com as a Community Moderator, I look at the other names on the list and I just say “wow! How did I end up in such good company?” The articles I wrote last month:

  • Advances in in situ processing tie to exascale targets — The growth in FLOPS is outpacing the growth in IOPS. Analyzing simulations as they run is becoming increasingly important for scientists and engineers.
  • Microsoft Research pens Quill for data intensive analysis — Collecting data is only useful to the extent that the data is analyzed. We have more data these days, but no platform that can handle both real-time streaming and post hoc analysis. The Quill project aims to change that.
  • JVM Boost shows warm Java is better than cold — The Java Virtual Machine allows “write once, run anywhere” but it imposes a performance penalty. For short-running jobs, the hit can be significant. The HotTub project speeds up these jobs (up to 30x in some cases!) by reusing JVM processes.

Opensource.com

Over on Opensource.com, I agreed to coordinate the Doc Dish column. I also wrote the articles below. It was a great month for the site. Three times during November, we set a single-day page view record. We also crossed the million page view mark for the second consecutive month and the third time in site history.

Cycle Computing

Meanwhile, I wrote or edited a few things for work, too:

  • Scale in a Cloudy World — I contributed an article to HPC Source about how to scale cloud HPC environments.
  • Various ghost-written pieces. I’ll never tell which ones!

Other writing in October 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had our second-ever month with a million page views! While I didn’t have any articles published, I did agree to coordinate the Doc Dish column, so there’s that.

Meanwhile, I wrote or edited a few things for work, too:

I also spoke at the All Things Open conference in Raleigh, NC. It went okay.

Other writings in September 2016

Where have I been writing when I haven’t been writing here?

Over on Opensource.com, we had another 900k+ page views in the month: the fourth time in site history and the second consecutive month. I contributed two articles:

Meanwhile, I wrote a few things for work, too:

  • Cycle Computing: The cloud startup that just keeps kicking — The Next Platform wrote a very nice article about us, so I wrote a blog post talking about how nice it was. (Hey, I’m in marketing now. It’s what we do).
  • Cloud-Agnostic Glossary — Supporting multiple cloud-service providers means having to translate terms between them. I put together a Rosetta Stone to help translate relevant terms between AWS, Azure, and Google Cloud.
  • The question isn’t cost, it’s value — When people talk about the cost of cloud computing, they’re usually looking at the raw dollar value. Since it takes money to make money, that’s not always the right way to look at it. It’s better to consider the value generated.

Come see me at these conferences in the next few months

I thought I should share some upcoming conference where I will be speaking or in attendance.

  • 9/16 — Indy DevOps Meetup (Indianapolis, IN) — It’s an informal meetup, but I’m speaking about how Cycle Computing does DevOps in cloud HPC
  • 10/1 — HackLafayette Thunder Talks (Lafayette, IN) — I organize this event, so I’ll be there. There are some great talks lined up.
  • 10/26-27 — All Things Open (Raleigh, NC) — I’m presenting the results of my M.S. thesis. This is a really great conference for open source, so if you can make it, you really should.
  • 11/14-18 — Supercomputing (Salt Lake City, UT) — I’ll be working the Cycle Computing booth most of the week.
  • 12/4-9 — LISA (Boston, MA) — The 30th version of the premier sysadmin conference looks to be a good one. I’m co-chairing the Invited Talks track, and we have a pretty awesome schedule put together if I do say so myself.

Changing how HTCondor is packaged in Fedora

The HTCondor grid scheduler and resource manager follows the old Linux kernel versioning scheme: for release x.y.z, if y is an even number it’s a “stable” series that get bugfixes, behavior changes and major features go on odd-numbered y. For a long time, the HTCondor packages in Fedora used the development series. However, this leads to a choice between introducing behavior changes when a new development HTCondor release comes out or pinning a Fedora release to a particular HTCondor release which means no bugfixes.

This ignores the Fedora Packaging Guidelines, too:

As a result, we should avoid major updates of packages within a stable release. Updates should aim to fix bugs, and not introduce features, particularly when those features would materially affect the user or developer experience. The update rate for any given release should drop off over time, approaching zero near release end-of-life; since updates are primarily bugfixes, fewer and fewer should be needed over time.

Although the HTCondor developers do an excellent job of preserving backward compatibility, behavior changes can happen between x.y.1 and x.y.2. HTCondor is not a major part of Fedora, but we should still attempt to be good citizens.

After discussing the matter with upstream and the other co-maintainers, I’ve submitted a self-contained change for Fedora 25 that will

  1. Upgrade the HTCondor version to 8.6
  2. Keep HTCondor in Fedora on the stable release series going forward

Most of the bug reports against the condor-* packages have been packaging issues and not HTCondor bugs, so upstream isn’t losing a massive testing resource here. I think this will be a net benefit to Fedora since it prevents unexpected behavior changes and makes it more likely that I’ll package upstream releases as soon as they come out.

Looking for my replacement

It’s been nearly three years since I joined Cycle Computing as a Senior Support Engineer. Initially, I led a team of me, but since then we’ve grown the organization. I’d like to think I did a good job of growing not only the team, but the tooling and processes to enable my company to provide excellent support to enterprise customers across a variety of fields.

But now, it is time to hire my replacement. I’m taking my talents across the (proverbial) hall to being working as a Technical Evangelist. I’ll be working on technical marketing materials, conferences, blog posts, and all kinds of neat stuff like that. I think it’s a good overlap of my skills and interests, and it will certainly be a new set of challenges.

So while this move is good for me, and good for Cycle Computing’s marketing efforts, it also means we need a new person to manage our support team. The job has been posted to our job board. If you’re interested, I encourage you to apply. It’s a great team at a great company. If you have any questions, I’d be happy to talk to you about it.

Hints for using HTCondor’s credd and condor_store_cred

HTCondor has the ability to run jobs as either an unprivileged “nobody” user or as the submitting user. On Linux, enabling this is fairly easy: the administrator just sets the UID_DOMAIN configuration to the same value and away you go. On Windows, you need to run the credential daemon (condor_credd) and the user must send store credentials using condor_store_cred.

The manual does a pretty good job of describing the basic setup of the credd, though there are some important pieces missing. With help from HTCondor technical lead Todd Tannenbaum, I’ve submitted some improvements to the docs, but in the meantime…

The main thing to consider when configuring your pool to use the credd is that it wants things to be secure. That makes sense, considering its entire job is to securely store and transfer user credentials. The credd will not hand out the password unless the client is authenticated and using a secure connection. The method of authentication is not important (if you really, really trust your network, you can use the CLAIMTOBE method), so long as authentication occurs somehow.

So where do the condor_store_cred hints come in? Often, the credd runs on the same machine as the schedd, and users log in to there to submit jobs. In that case, everything’s probably fine. But if you’re submitting jobs from a machine outside the pool (for example, a user’s workstation), it can get a little hairier.

Before running condor_store_cred, HTCondor needs to be told where to look for the credd, and the client settings mentioned above need to meet the credd’s requirements. (I’m using CLAIMTOBE here for simplicity). If the machine the user submits from is not in the pool, condor_store_cred will need to know where to find the collector, too.

CREDD_HOST = scheduler.example.com
COLLECTOR_HOST = centralmanager.example.com
SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE
SEC_CLIENT_AUTHENTICATION = PREFERRED
SEC_CLIENT_ENCRYPTION = PREFERRED

As of this writing, condor_store_cred gives an unhelpful error message if something goes wrong. It will always say “Make sure your ALLOW_WRITE setting includes this host.”, so if your ALLOW_WRITE setting already includes the host in question, you might get stuck. Use the -debug option to get better output. For example:

02/16/16 12:23:51 STORE_CRED: In mode 'query'
02/16/16 12:23:51 Warning: Collector information was not found in the configuration file. ClassAds will not be sent to the collector and this daemon will not join a larger Condor pool.
02/16/16 12:23:51 STORE_CRED: Failed to start command.
02/16/16 12:23:51 STORE_CRED: Unable to contact the REMOTE schedd.

This tells you that you forgot to set the COLLECTOR_HOST in your configuration.

Another hint is that if your scheduler name is different than the machine name (e.g. if you run multiple condor_schedd processes on a single machine and have Q1@hostname, Q2@hostname, etc), you might need to include “-name Q1@hostname” in the arguments. Unlike most other HTCondor client commands, you cannot specify a “sinful string” as a target using the “-addr” option.

Hopefully this helps you save a little bit of time getting run_as_owner working on your Windows pool, until such time as I sit down to write that “Administering HTCondor” book that I’ve been meaning to work on for the last 5 years.

Supercomputing ’15

Last week, I spent a few days in Austin, Texas for the Supercomputing conference. Despite having worked in HPC for years, I’ve never been to SC. It’s a big conference. Since everyone heard I was going, they set a record this year with over 12,000 attendees. That’s roughly 10x the size of LISA, where I had been a few days ago.

I missed Alan Alda’s keynote, so my trip was basically ruined. That’s not true, actually. I spent most of the time in my company’s booth giving demos and talking to people. I had a lot of fun doing that. I’m sure the technical sessions were swell, but that’s okay. I look forward to going again next year, hopefully for the whole week and not immediately following another week-long conference.

20151117_103152

Ben with a minion

HTCondor 8.3.8 in Fedora repos

It’s only been a month-plus since HTCondor 8.3.8 was released, but I finally have the Fedora packages updated. Along the way, I fixed a couple of outstanding bugs in the Fedora package. The builds are in the updates-testing repo, so test away!

As of right now, upstream plans to release HTCondor 8.5.0 early next week, so I got caught up just in time.

HTCondor Week 2015

There are many reasons I enjoy the annual gathering of HTCondor users, administrators, and developers. Some of those reasons involve food and alcohol, but mostly it’s about the networking and the knowledge sharing.

Unlike many other conferences, HTCondor Week is nearly devoid of vendors. I gave a presentation on behalf of my company, and AWS was present this year, but it wasn’t a sales pitch in either case. The focus is on how HTCondor enabled research. I credit the project’s academic roots.

Every year, themes seem to develop. This year, the themes were cloud and caching. Cloud offerings seem to really be ready to take off in this community, even though Miron would say that the cloud is just a different form of grid computing that’s been done for decades. The ability to scale well beyond internal resources quickly and cheaply has obvious appeal. The limiting factor currently seems to be that university funding rules make it slightly more difficult for academic researchers than just pulling out a credit card.

In the course of one session,  three different caching mechanisms were discussed. This was interesting because it is not something that’s been discussed much in the past. It makes sense, though, that caching files common across multiple jobs on a node would be a big improvement in performance. I’m most partial to Zach Miller’s fledgling HTCache work, though the squid cache and CacheD presentations had their own appeal.

Todd Tannenbaum’s “Talk of Lies” spent a lot of time talking about performance improvements that have been made in the past year, but they really need to congratulate themselves more. I’ve seen big improvements from 8.0 to 8.2, and it looks like even more will land in 8.4. There’s some excellent work planned for the coming releases, and I hope it pans out.

After days of presentations and conversations, my brain is full of ideas for improving my company’s products. I’m really motivated to make contributions to HTCondor, too. I’m even considering carving out some time to work on that book I’ve been wanting to write for a few years. Now that would truly be a miracle.