GitHub is my Copilot

It isn’t, but I thought that made for a good title. You have probably heard about GitHub Copilot, the new AI-driven pair programming buddy. Copilot is trained on a wealth of publicly-available code, including code under copyleft licenses like the GPL. This has led many people to question the legality of using Copilot. Does code developed with it require a copyleft license?

The legal parts

Reminder: I am not a lawyer.

No. While I’d love to see the argument play out in court, I don’t think it’s an issue. For as much as I criticize how we apply AI in society, I don’t think there’s anything illegal happening here. In the same way that the book I’m writing on program management isn’t a derivative work of all of the books and articles I’ve read over the years, Copilot-produced code isn’t a derivative work either.

“But, Ben,” you say. “What about the cases where machine learning models have produced verbatim snippets from code?” In those cases, I doubt the snippets rise to the level of copyrightability on their own. It’d be one thing to reproduce a dozen-line function. But reproducing two or three lines…eh.

The part where verbatim reproduction gets interesting is in leaking secrets. I’ve seen anecdotal tales of Copilot helpfully suggesting private keys. This is either Copilot producing strings of gibberish because it expects gibberish, or Copilot reproducing a string that someone accidentally checked into a repo. The latter seems more likely. And it’s not a licensing concern at that point. I’m not sure it’s any legal concern at all. But it’s a concern to the owner of the secret if that information gets out into the wild.

The community parts

But being legally permissible doesn’t mean Copilot is acceptable to the community. It certainly feels like it’s a two-trillion dollar company (Microsoft, the parent of GitHub) taking advantage of individual and small-team developers—people who are generally under-resourced. I can’t argue with that. I understand why people would find it gross, even if it’s legal. Of course, open source licenses by nature often permit behavior we don’t like.

Pair programming works well, or so I’m told. If a service like Copilot can be that second pair of eyes sometimes, then it will have a net benefit for open and proprietary code alike. In the right context, I think it’s a good idea. The execution needs some refinement. It would be good to see GitHub proactively address the concerns of the community in services like this. I don’t think Copilot is necessarily the best solution, but it’s a starting point.

[Full disclosure: I own a limited number of Microsoft shares.]

A survey of the kanban tools I use

In a previous post about distinguishing between tasks and projects, I made a brief mention of the different kanban tools I use with a promise to go into further depth. Here is the depth. This post is more about the tools I use and the context in which I use them as opposed to the specific design of the boards themselves.

Trello

Trello is the gold standard when it comes to kanban. It’s free to use (although I’ve paid for a business subscription in the past), has a useful mobile app, and focuses on being good at being a kanban board. Trello was the first kanban board I used and it became my default. I use it to track article ideas for this blog, plan meals, and track online orders.

I wouldn’t mind switching to another tool, just to have one less place to track work, but there are a few things that keep me around. The mobile app is the biggest. Trello’s mobile app works really well. It doesn’t have the full feature set of the web app, but it’s enough to be productive. I’m also a big fan of the calendar view. This helps me plan time-based work like blog posts and dinners. Combined with labels, the calendar view makes it really easy to see if I’m being too repetitive. I also like that Trello makes a visual distinction between complete and incomplete when displaying dates.

Relatedly, the package tracking plugin is really nice, too. Plug in a tracking number and the estimated delivery date displays on the card as if it were a due date. The last feature that I love is the automation. I only use it on my writing board, but it comes in handy. I have a rule that when I set a due date on a card, it gets moved from “Ideas” to “Planned”. There’s another one that will mark a card as done when I move it to the “Scheduled” column.

Trello also has a good blog if you’re into that sort of thing.

Todoist

I switched to Todoist when I couldn’t put off migrating away from Wunderlist any longer. As the name implies, it’s primarily a todo list manager (and a darn good one at that), but it recently added support for kanban boards. To try it out, I converted my “laundry” category to a board. It works okay for that purpose.

The mobile app is great for the todo side, but the boards are a little buggy (although the same is true for the web app). Cards don’t always want to move to the column you’re trying to drag them to. But my biggest gripe is that there’s currently no way to cleanly handle recurring tasks. For example, my laundry board is broken out by load (e.g. whites, pants, towels), with tasks scheduled to recur every two weeks. When I mark a task done, I’d like the new incarnation to start in the “dirty” column. Ideally, moving a task to the “folded” column would automatically mark it done.

I’ve found myself using the board feature less and less with my laundry. The issues above combined with the fact that the state is readily apparent means it doesn’t have a lot of value. I like Todoist overall, so I wish I had more reason to use the board feature.

Taiga

I actually use two different Taiga instances: Fedora’s and the public instance. In the Fedora instance, I have a board where I track all of my work as Fedora Program Manager; I also use that instance collaboratively with other teams. On the public instance, I currently only track progress on my book.

Taiga is designed to be a full-featured project management tool, which gives it a leg up in some ways. User stories on the kanban board can have child tasks with their own state, which is helpful when I need to decompose work. It also has a module for epics, which is useful for aggregating larger work. As an example, I have a card for each chapter on my book board. When my editor gives me things to fix, I add those as tasks. Each of the milestones that the publisher has in their process is an epic.

There are two missing features that keep me from moving to Taiga as my main kanban tool. The first is the lack of a calendar view. This would be particularly useful for the Fedora Magazine editorial board. The second is a sub-optimal (read: basically unusable) mobile experience. I don’t manipulate my boards on my phone a ton, but I do it enough that it would be hellish. There are some third-party apps, but they can’t connect via Fedora’s authentication system, so they don’t help me.

I’d also like the ability to mark specific user stories as private, although I concede that doesn’t make a ton of sense in the context of what it’s intended for.

GitHub, GitLab, and Pagure

I combined these three because they’re basically the same to me: nice additions to issue tracking tools. I wouldn’t use any of them primarily as a kanban tool, but the boards come in handy as a layer on top of the primary purpose. GitHub allows you to add both issues and non-issue cards to a board, which has resulted in a very confused me on several occasions. Pagure does not appear to have this problem. I’ve used GitLab’s board feature a little bit, but I don’t feel I’m familiar enough to comment on it.

Projects shouldn’t write their own tools

Over the weekend, the PHP project learned that its git server had been compromised. Attackers inserted malicious code into the repo. This is very bad. As a result, the project moved development to GitHub.

It’s easy to say that open source projects should run their own infrastructure. It’s harder to do that successfully. The challenges compound when you add in writing the infrastructure applications.

I understand the appeal. It’s zero-price (to write; you still need the hardware to run it). Bespoke software meets your needs exactly. And it can be a fun diversion from the main thing you’re working on: who doesn’t like going to chase a shiny for a little bit?

Of course, there’s always the matter of “the thing you wanted didn’t exist when you started the project.” PHP’s first release predates the launch of GitHub by 13 years. It’s 10 years older than git, even.

This means that at some point PHP moved from some other version control system to git. It also means they could have moved from their homegrown platform to GitHub. I understand why they’d want to avoid the pain of making that switch, but sometimes it’s worthwhile.

Writing secure and reliable infrastructure is hard. For most projects, the effort and risk of writing their own tooling isn’t worth the benefit. If the core mission of your project isn’t the production of infrastructure applications, don’t write it.

Sidebar: Self-hosting

The question of whether or not to write your infrastructure applications is different from the question of whether or not to self-host. While the former has a pretty easy answer of “no”, the latter is mixed. Self-hosting still costs time and resources, but it allows for customization and integration that might be difficult with software-as-a-service. It also avoids being at the whims of a third party who may or may not share your project’s values. But in general, projects should do the minimum that they can reasonably justify. Sometimes that means running your own instances of an infrastructure application. Very rarely does it mean writing a bespoke infrastructure application.

GitHub should stand up to the RIAA over youtube-dl

Earlier this week, GitHub took down the repository for the youtube-dl project. This came in response to a request from the RIAA—the recording industry’s lobbying and harassment body. youtube-dl is a tool for downloading videos. The RIAA argued that this violates the anticircumvention protections of the Digital Millennium Copyright Act (DMCA). While GitHub taking down the repository and its forks is true to the principle of minimizing corporate risk, it’s the wrong choice.

Microsoft—currently the world’s second-most valuable company with a market capitalization of $1.64 trillion—owns GitHub. If anyone is in a position to fight back on this, it’s Microsoft. Microsoft’s lawyers should have a one-word answer to the RIAA’s request: “no”. (Full disclosure: I own a small number of shares of Microsoft.)

The procedural argument

The first reason to tell the RIAA where to stick it is procedural. The RIAA isn’t arguing that youtube-dl is infringing its copyrights or circumventing its protections. It is arguing that youtube-dl circumvents YouTube’s protections. So even if youtube-dl does that, it’s YouTube’s problem, not the RIAA’s.

The factual argument

I have some sympathy for the anticircumvention argument. I’m not familiar with the specifics of how youtube-dl works, but it’s at least possible that youtube-dl circumvents YouTube’s copy protection. This would be a reasonable basis for YouTube to take action. Again, YouTube, not the RIAA.

I have less sympathy for the infringement argument. youtube-dl doesn’t induce infringement more than a web browser or screen recorder does. There are a variety of uses for youtube-dl that are not infringing. Foremost is the fact that some YouTube videos are under a license that explicitly allows sharing and remixing. Archivers use it to archive content. Some people who have time-variable Internet billing use it to download videos overnight.

So, yes, youtube-dl can be used to infringe the RIAA’s copyrights. It can also be used for non-infringing purposes. The code itself does not infringe. There’s nothing about it that gives the RIAA a justification to take it down.

youtube-dl isn’t the whole story

youtube-dl provides a focal point, but there’s more to it. Copyright law is now used to suppress instead of promote creative works. The DMCA, in particular, favors the large rightsholders over smaller developers and creators. It essentially forces sites to act on a “guilty until proven innocent” model. Companies in a position to push back have an obligation to do so. Microsoft has become a supporter of open source; now it’s time to show they mean it.

We should also consider the risks of consolidation. git is a decentralized system. GitHub has essentially centralized it. Sure, many competitors exist, but GitHub has become the default place to host open source code projects. The fact that GitHub’s code is proprietary is immaterial to this point. A FOSS service would pose the same risk if it became the centralized service.

I saw a quote on this discussion (which I can’t find now) that said “code is free, infrastructure is not.” And while having projects self-host their code repository, issue tracker, etc. may be philosophically appealing, that’s not realistic. Software-as-a-Service has lowered the barrier for starting projects, which is a good thing. But it doesn’t come without risk, as we are now seeing.

I don’t know what the right answer is here. I know the answer won’t be easy. But both this specific case and the general issues it highlights are important for us to think about.

Decentralization is more appealing in theory than in reality

One of the appeals of open standards is that they allow market forces to work. Consumers can choose the tool or service that best meets their specific needs and if someone is a bad actor, the consumer can flee. Centralized proprietary services, in contrast, tie you to a specific provider. Consumers have no choice if they want/need to use the service.

It’s not as great as it sounds

This is true in theory, but reality is more complicated. Centralization allows for lower friction. Counterintuitively, centralization can allow for greater advancement. As Moxie Marlinspike wrote in an Open Whisper Systems blog post, open standards got the Internet “to the late 90s.” Decentralization is (in part) why email isn’t end-to-end encrypted by default. Centralization is (again, in part) why Slack has largely supplanted IRC and XMPP.

Particularly for services with a directory component (social networks for sure, but also tools like GitHub), centralization makes a lot of sense. It lowers the friction of finding those you care about. It also makes moderation easier.

Of course, those benefits can also be disadvantages. Easier moderation also means easier censorship. But not everyone is capable of or willing to run their own infrastructure. Or to find the “right” service among twenty nearly-identical offerings. The free market requires an informed consumer, and most consumers lack the knowledge necessary to make an informed choice.

Decentralization in open source

Centralized services versus federated (or isolated) services is a common discussion topic in open source. Jason Baker recently wrote a comment on a blog post that read in part:

I use Slack and GitHub and Google * and many other services because they’re simply easier – both for me, and for (most) of the people I’m collaborating with. The cost of being easier for most people I collaborate with is that I’m also probably excluding someone. Is that okay? I’m not sure. I go back and forth on that question a lot. In general, though, I try to be flexible to accommodate the people I’m actually working with, as opposed to solving the hypothetical/academic/moral question.

Centralization to some degree is inevitable. Whether built on open standards or not, most projects would rather work on their project than run their infrastructure. And GitHub (like SourceForge before it) has enabled many small projects to flourish because they don’t need to spend time on infrastructure. Imagine if every project needed to run its own issue tracker, code repository, etc. The barrier to entry would be too high.

Striking a balance

GitHub provides an instructive example. It uses an open, decentralized technology (git) and layers centralized services on top. Users can get the best of both worlds in this sense. Open purists may not find this acceptable, but I think a pragmatic view is more appropriate. If allowing some proprietary services enables a larger and more robust open software ecosystem, isn’t that worthwhile?

Pay maintainers! No, not like that!

A lot of people who work on open source software get paid to do so. Many others do not. And as we learned during the Heartbleed aftermath, sometimes the unpaid (or under-paid) projects are very important. Projects have changed their licenses (e.g. MongoDB, which is now not an open source project by the Open Source Initiative’s definition) in order to cut off large corporations that don’t pay for the free software.

There’s clearly a broad recognition that maintainers need to be paid in order to sustain the software ecosystem. So if you expected people to be happy with GitHub’s recent announcement of GitHub Sponsors, you have clearly spent no time in open source software communities. The reaction has had a lot of “pay the maintainers! No, not like that!”, which strikes me as obnoxious and unhelpful.

GitHub Sponsors is not a perfect model. Bradley Kuhn and Karen Sandler of the Software Freedom Conservancy called it a “quick fix to sustainability”. That’s the most valid criticism. It turns out that money doesn’t solve everything. Throwing money at a project can sometimes add to the burden, not lessen it. Money brings a lot of messiness and management overhead, especially if there’s no legal entity behind the project. That’s where the services provided by fiscal sponsor organizations like Conservancy come in.

But throwing money at a problem can sometimes help. Projects can opt in to accepting money, which means they can avoid the problems if they want. On the other hand, if they want to take in money, GitHub just made it pretty easy. The patronage model has worked well for artists; it could also work for coders.

The other big criticism that I’ll accept is that it puts the onus on individual sponsorships (indeed, that’s the only kind available at the moment) rather than corporate ones:

https://twitter.com/laserllama/status/1131552436534087680

Like with climate change or reducing plastic waste, the individual’s actions are insignificant compared to the effects of corporate action. But that doesn’t mean individual action is bad. If iterative development is good for software, then why not iterate on how we support the software? GitHub just reduced the friction of supporting open source developers significantly. Let’s start there and fix the system as we go.

Apache Software Foundation moves to GitHub

Last week, GitHub and the Apache Software Foundation (ASF) announced that ASF migrated their git repositories to GitHub. This caused a bit of a stir. It’s not every day that “the world’s largest open source foundation” moves to a proprietary hosting platform.

Free software purists expressed dismay. One person described it as “a really strange move[, i]n part because Apache’s key value add [was] that they provided freely available infrastructure.” GitHub, while it may be “free as in beer”, is definitely not “free as in freedom”. git itself is open source software, but GitHub’s “special sauce” is not.

For me, it’s not entirely surprising that ASF would make this move. I’ve always seen ASF as a more pragmatically-minded organization than, for example, the Free Software Foundation (FSF). I’d argue that the ecosystem benefits from having both ASF- and FSF-type organizations.

It’s not clear what savings ASF gets from this. Their blog post says they maintain their own mirrors, so there’s still some infrastructure involved. Of course, it’s probably smaller than running the full service, but by how much?

More than a reduced infrastructure footprint, I suspect the main benefit to the ASF is that it lowers the barrier to contribution. Like it or not, GitHub is the go-to place to find open source code. Mirroring to GitHub makes the code available, but you don’t get the benefits of integrating issues and pull requests (at least not trivially). Major contributors will do what it takes to adopt the tool, but drive-by contributions should be as easy as possible.

There’s also another angle, which probably didn’t drive the decision but brings a benefit nonetheless. Events like Hacktoberfest and 24 Pull Requests help motivate new contributors, but they’re based on GitHub repositories. Using GitHub as your primary forge means you’re accessible to the thousands of developers who participate in these events.

In a more ideal world, ASF would use a more open platform. In the present reality, this decision makes sense.

GitHub’s new status feature

Two weeks ago, GitHub added a new feature for all users: the ability to set a status. I’m in favor of this. First, it appeals to my AOL Instant Messenger nostalgia. Second, I think it provides a valuable context for open source projects. It allows maintainers to say “hey, I’m not going to be very responsive for a bit”. In theory, this should let people filing issues and pull requests not get so angry if they don’t get a quick response.

Jessie Frazelle described it as the “cure for open source guilt”.

https://twitter.com/jessfraz/status/1083047358667972608

I was surprised at the amount of blowback this got. (See, for example, the replies to Nat Friedman’s tweet.) Some of the responses are of the dumb “oh noes you’re turning GitHub into a social media platform. It should be about the code!” variety. To those people I say “fine, don’t use this feature.” Others raise a point about not advertising being on vacation.

I’m sympathetic to that. I’m generally pretty quiet about the house being empty on public or public-ish platforms. It’s a good way to advertise yourself to vandals and thieves. To be honest, I’m more worried about something like Nextdoor where the users are all local than GitHub where anyone who cares is probably a long way away. Nonetheless, it’s a valid concern, especially for people with a higher profile.

I agree with Peter that it’s not wise to set expectations for maintainers to share their private details. That said, I do think it’s helpful for maintainers to let their communities know what to expect from them. There are many reasons that someone might need to step away from their project for a week or several. A simple “I’m busy with other stuff and will check back in on February 30th” or something to that effect would accomplish the goal of setting community expectations without being too revelatory.

The success of this feature will rely on users making smart decisions about what they choose to reveal. That’s not always a great bet, but it does give people some control over the impact. The real question will be: how much do people respect it?

Microsoft bought GitHub. Now what?

Last Monday, a weekend of rumors proved to be true. Microsoft announced plans to buy code-hosting site GitHub for $7.5 billion. Microsoft’s past, particularly before Satya Nadella took the corner office a few years ago, was full of hostility to open source. “Embrace, extend, extinguish” was the operative phrase. It should come as no surprise, then, that many projects responded by abandoning the platform.

But beyond the kneejerk reaction, there are two questions to consider. First: can open source projects trust Microsoft? Second: should open source (and free software in particular) projects rely on corporate hosting?

Microsoft as a friend

Let’s start with the first question. With such a long history of active assault on open source, can Microsoft be trusted? Understanding that some people will never be convinced, I say “yes”. Both from the outside and from my time as a Microsoft employee, it’s clear that the company has changed under Nadella. Microsoft recognizes that open source projects are not only complementary, but strategically important.

This is driven by a change in the environment that Microsoft operates in. The operating system is less important than ever. Desktop-based office suites are giving way to web-based tools for many users. Licensed revenue may be the past and much of the present, but it’s not the future. Subscription revenue, be it from services like Office 365 or Infrastructure-as-a-Service offerings, is the future. And for many of these, adoption and consumption will be driven by open source projects and the developers (developers! developers! developers! developers!) that use them.

Microsoft’s change of heart is undoubtedly driven by business needs, but that doesn’t make it any less real. Jim Zemlin, Executive Director at the Linux Foundation, expressed his excitement, implying it was a victory for open source. Tidelift ran the numbers to look at Microsoft’s contributions to non-Microsoft projects. Their conclusion?

…today the company is demonstrating some impressive traction when it comes to open source community contributions. If we are to judge the company on its recent actions, the data shows what Satya Nadella said in his announcement about Microsoft being “all in on open source” is more than just words.

And in any acquisition, you should always ask “if not them, then who?” CNBC reported that GitHub was also in talks with Google. While Google may have a better reputation among the developer community, I’m not sure they’d be better for GitHub. After all, Google had Google Code, which it shut down in 2016. Would a second attempt in this space fare any better? Google Code had a two year head start on GitHub, but it languished.

As for other major tech companies, this tweet sums it up pretty well:

https://twitter.com/KrisSiegel/status/1003796510281121793?s=19

Can you trust anyone to host?

My friend Lyz Joseph made an excellent point on Facebook the day the acquisition was announced:

Unpopular opinion: If you’re an open source project using GitHub, you already sold out. You traded freedom for convenience, regardless of what company is in control.

People often forget that GitHub itself is not open source. Some projects have avoided hosting on GitHub for that very reason. Even though the code repo itself is easily mirrored or migrated, that’s not the real value in GitHub. The “social coding” aspects — the issues, fork tracking, wikis, ease of pull requests, etc. — are what make GitHub valuable. Chris Siebenmann called it “sticky in a soft way.”

GitLab, at least, offers a “community edition” that projects can self-host. In a fantasy world, each project would run its own infrastructure, perhaps with federated authentication for ease of use when you’re a participant in many projects. But that’s not the reality we live in. Hosting servers costs money and time. Small projects in particular lack both of those. Third-party infrastructure will always be attractive for this reason. And as good as competition is, having a dominant social coding site is helpful to users in the same way that a dominant social network is simpler: network effects are powerful.

So now what?

The deal isn’t expected to close for a while, and Microsoft plans to seek regulatory approval, which will not speed the process. Nothing will change immediately. In the medium term, I don’t expect much to change either. Microsoft has made it clear that it plans to run GitHub as a fairly autonomous business (the way it does with LinkedIn). GitHub gets the stability that comes from the support of one of the world’s largest companies. Microsoft gets a chance to improve its reputation and an opportunity to make it easier for developers to use Azure services.

Full disclosure: I am a recent employee of Microsoft and a shareholder. I was not involved in the acquisition and had no inside knowledge pertinent to the acquisition or future plans for GitHub.

Taking action on commit messages

Many modern code hosting platforms (e.g. GitHub and GitLab) parse commit messages to do something smart with them. The most common is probably to look for references to an issue number and create a link or close the issue. For example: “Fixes #37”. Commit messages can also be used to notify or reference other users. For example: “I think @funnelfiasco broke it. Again.”
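To make the idea concrete, here is a minimal sketch of that kind of parsing in Python. The keywords and regexes are my own illustrative assumptions, not the actual rules GitHub or GitLab use:

```python
import re

# Hypothetical patterns, loosely modeled on the behavior described above.
# Real platforms have more elaborate (and platform-specific) rules.
CLOSE_RE = re.compile(r"\b(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)\s+#(\d+)",
                      re.IGNORECASE)
MENTION_RE = re.compile(r"@([A-Za-z0-9-]+)")

def parse_commit_message(message):
    """Return (issue numbers to close, usernames to notify)."""
    issues = [int(n) for n in CLOSE_RE.findall(message)]
    mentions = MENTION_RE.findall(message)
    return issues, mentions

print(parse_commit_message("Fixes #37. I think @funnelfiasco broke it. Again."))
# → ([37], ['funnelfiasco'])
```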

These automated actions have a lot of utility. They simplify the communication process. Manually linking to issues, users, etc. would be a pain, which means it would never happen. This hurts not only the project developers, but also the users trying to dive into troubleshooting a problem.

But it’s not all candy and rainbows. As an example, a coworker removed the “deprecated” decorator from some Python code. His commit message included “un-@deprecated”. Our GitLab instance saw the “@” and decided to add the “deprecated” group to the issue. That added the entire engineering and operations teams to the issue.
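That failure mode is easy to reproduce with a naive mention pattern. As a sketch (these regexes are my own illustration, not GitLab’s actual implementation), requiring the “@” to start a word avoids this particular trap:

```python
import re

# Naive: any "@word" triggers a notification, even mid-token.
NAIVE = re.compile(r"@([A-Za-z0-9-]+)")
# Stricter: the "@" must be at the start of the message or follow whitespace.
STRICTER = re.compile(r"(?:^|(?<=\s))@([A-Za-z0-9-]+)")

message = "un-@deprecated the legacy helpers"
print(NAIVE.findall(message))     # → ['deprecated']  (pings the whole group)
print(STRICTER.findall(message))  # → []              (no accidental ping)
print(STRICTER.findall("thanks @funnelfiasco"))  # → ['funnelfiasco']
```

Even the stricter pattern is a tradeoff: it would also miss legitimate mentions inside constructs like “(@user)”.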

The obvious solution is to require more explicit markup than a single character. Something like “HEYDOTHIS-NOTIFY-funnelfiasco” reduces the possibility of accidentally triggering an action. On the other hand, it’s a giant pain in the ass. That, as above, means it’s unlikely to be used. And even when it is used, manual syntax is prone to error.

So what’s the answer? I don’t have a good solution. Projects parse commit messages on a daily basis to simplify workflows and improve communication. The Asterisk community, as an example, uses more than just simple tagging. The drawbacks are mostly nuisance at this point, and I don’t think they outweigh the benefits.

What might change my mind is if commit message parsing could be used to execute arbitrary code on the server. If several vulnerabilities align in just the right way, I suppose it’s a theoretical possibility. Of course, people you trust with commit access to the repo could do damage the old fashioned way. But it would be an attack vector for pull requests, albeit an amusing one. “Hey, I improved your project with this code, but my commit message also will add your server to my botnet if you merge it.”