Open source AI and open data

I’m a little late to the party with this post, but I need to get it out of my head. The question of “what is ‘open source AI’, exactly?” has been a hot topic in some circles for a while now. The Open Source Initiative, keepers of the Open Source Definition, have been working on developing a definition for open source AI. The latest draft notably does not require the training data to be available under an open license. I believe this is a mistake.

Open source AI must include open data

Data is critical to modern computing. I called this out in a 2020 DevConf talk and I can hardly claim to be the first or only person to make this observation. More recently, Tom “spot” Callaway wrote his objections to a definition of “open source AI” that doesn’t include open data. My objections (and I venture to say spot’s as well) have nothing to do with ideological purity. I wrote over three years ago that I don’t care about free/open source software as an end goal. What matters is the human impact.

Even before ChatGPT hit the scene, there were countless examples of AI exacerbating biases and inequities. Part of addressing that issue is providing a better training data set. But if we don’t know what an AI model is trained on, we don’t know what sort of biases it’s reproducing. This is a data problem, not a model weights problem. The most advanced AI in the world is still going to produce biased output if trained on biased sources.

OSI attempts to address this by requiring “data information.” This is insufficient. I’ll again defer to spot to make this case better than I could. OSI raises valid points about how rules governing data can be different than those covering code. Oh well. The solution is to acknowledge that some models won’t meet the requirements instead of watering down the requirements.

No one is owed an “open source AI”

Part of the motivation behind OSI’s choices here seems to be the creation of a definition that commercially-viable AI models can meet. They say “We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones.” Tara Tarakiyee wrote in response “Well, if the price of making Open Source ‘AI’ competitive with proprietary ‘AI’ is to break the openness that is fundamental to the definition, then why are we doing it?”

I agree with Tara. His whole post is well worth a read. But what this particular thread comes down to is this: we don’t owe anyone a commercially-viable definition just because doing otherwise is hard. There’s nothing in the Open Source Definition that says “but you can skip some of these requirements if you can’t figure out how to make money.”

“Can’t” and “won’t” aren’t the same thing

I’ve seen some people argue that creating a definition that results in zero “open source AI” models is useless. It’s important to distinguish here between “can’t” and “won’t”: they are not the same.

It’s true that a definition that no model could possibly meet is useless. But a definition that no model currently chooses to meet is valuable. AI developers could certainly choose to make their training data available. If they don’t want to, they don’t get to call their model open source. It’s the same as wanting to release software under a license that doesn’t meet some part of the Open Source Definition. As I said in the previous section, no one is owed a definition that meets their business needs.

The argument is silly, anyway. There are at least two models that would meet a more appropriate definition.

Where to go from here?

I wrote this post because I needed to get the words out of my head and onto “paper”. I have no expectation it will change the direction of OSI’s next draft. They seem pretty committed to their choice at this point. I’m not really sure what is gained by making this compromise. Nothing of worth, I think.

This is a problem we should have been addressing years ago, instead of rushing to catch up once the cat was out of the proverbial bag. Collectively, we seem to have a tendency to skate to where the puck was, not where it will be. This isn’t the first time. At FOSDEM 2021, Bradley Kuhn said something to the effect of “if I would have known proprietary software would be funded by advertising instead of license sales, I would have done a lot of things differently.”

I’m not sure what the next big challenge will be. But you can be sure if I figure it out, I’ll push a lot harder to address it before we get passed by again.

Does open source matter?

Matt Asay’s article “The Open Source Licensing War is Over” has been making the rounds this week, as text and subtext. While his position is certainly spicy, I don’t think it’s entirely wrong. “It’s not that open source doesn’t matter, but rather it has never mattered in the way some hoped or believed,” Asay writes. I think that’s true, and it’s our fault.

To the average person, and even to many developers, the freeness or openness of the software doesn’t matter. They want to be able to solve their problem in the easiest (and cheapest) way. Often that’s open source software. Sometimes it isn’t. But they’re not sitting there thinking about the societal impact of their software choices. They’re trying to get a job done.

Free and open source software (FOSS) advocates often tout the ethical benefits of FOSS. We talk about the “four essential freedoms”. And while those should matter to people, they often don’t. I’ve said before, and I still believe, that FOSS is not the end goal. Any time we end with “and thus: FOSS!”, we’re doing it wrong.

FOSS advocacy — and I suspect this is true of other advocacy efforts as well — tends to try to meet people where we want them to be. The problem, of course, is that people are not where we want them to be. They’re where they are. We have to meet them there, with language that resonates with them, addressing the problems they currently face instead of hypothetical future problems. This is all easier said than done, of course.

Open source licenses don’t matter — they’ve never mattered — except as an implementation detail for the goal we’re trying to achieve.

#inaction bcotton

On 25 June 2018, I published a post called “It’s hattening”. After years of rejected applications, I was finally starting a job at Red Hat. On 24 April 2023, Red Hat announced a 4% reduction in global staff. As a member of that 4%, today is my last day at Red Hat.

What does this mean for Ben?

This is the first time I’ve been laid off from a job. I hope it will be the last, but who can say? I’d be lying if I said I haven’t felt a big range of emotions in the past three weeks: confusion, anger, sadness, amusement.

But I’ve also felt loved. I’ve received so much support from people since the news started spreading. It’s like that end scene of “It’s a Wonderful Life” and I’m George Bailey. I’m proud of the contributions I’ve made to the Fedora community over the last five years, and it feels good to have others recognize that.

While I won’t be contributing as the Fedora Program Manager anymore, I was a Fedora contributor long before I joined Red Hat, and I’m not letting them take that away from me. I’ll still be around Fedora in ways that spark joy, although perhaps not much at first as I let my wounds heal.

I’ve had the great fortune to build an incredible professional and personal network over the years. I’m already pursuing a few opportunities and if those don’t pan out, I’ll be asking for your help finding more. In the meantime, I have (at least) a few weeks to relax for a bit. There’s a ton of work to do around the house, many trails to hike, Program Management for Open Source Projects to promote, and an embarrassingly-large backlog for Duck Alignment Academy articles.

What does this mean for Fedora?

I’ve told folks that if Fedora falls off the rails, then I have failed. I’m working with Matthew, Justin, and others to ensure coverage of the core job duties one way or another. I’ve worked hard over the years to automate tasks that can be automated. The documentation is far more comprehensive than what I inherited.

No doubt there are gaps in what I’ve left for my successors. However, my goal is that in a few months, nobody will notice that I’m gone. That’s my measure of success. The only reason I’ve been successful in my role is because of the work done by my predecessors: John, Robyn, Jaroslav, and Jan.

As to what the broader implication behind the loss of my position might be, I don’t know. There’s no indication that my role was targeted specifically. There are definitely people in Red Hat who continue to view Fedora as strategically important. I wish I had a clearer understanding of how they chose people/roles to cut, but I’ll probably never know the process. What I do know is that I fully intend to still be participating in the Fedora community when my account hits the 20-year mark in May 2029.

In defense of Fedora’s release cycle

Earlier this week, Thorsten Leemhuis published a thoughtful post about what he’d change if he magically became the supreme leader of Fedora. In that post and subsequent commentary on Mastodon and Fedora Discussion, he talked about changing Fedora’s release cycle. Since the Fedora Linux release process is my job, I figured I should explain why I disagree.

Integration projects are different

If you haven’t read the post, you should. But here’s the short version: Fedora Linux uses a release model rooted in the 1990s and should move to a “modern” model. Thorsten suggests a one-month cadence for those who want the latest versions and a one-year “steady” release. Such a model has worked well for Firefox, he argues, and so it should work for Fedora.

The key reason I think this is wrong is that Firefox is a development project whereas Fedora is an integration project. Integration projects don’t write a lot of code; they take the work of others and turn it into a coherent whole. This is a fundamentally different kind of work and it takes longer by necessity.

You can’t reliably integrate disparate pieces when they’re in constant motion. That’s why we have freezes leading up to the beta and final releases: they give the QA team time to test against a stationary target. It takes time to run through all of the test cases that make Fedora Linux a reliable operating system. So the choice becomes reducing the pre-release testing or spending a significant portion of the cycle in a freeze, which limits the usefulness of the one-month cycle.

You can solve some of this with automated testing. And the QA team does do a lot of automated testing. But those tests still take time, and there are a lot of interrelated parts in a Linux distribution.
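The freeze trade-off can be made concrete with a little arithmetic. The sketch below is only an illustration: the 14 freeze days per cycle is a hypothetical number picked for the example, not Fedora's actual schedule, and `frozen_fraction` is a name invented here.

```python
def frozen_fraction(cycle_days: int, freeze_days: int) -> float:
    """Return the share of a release cycle spent in a freeze.

    Both arguments are illustrative inputs, not real Fedora schedule data.
    """
    if cycle_days <= 0:
        raise ValueError("cycle_days must be positive")
    # A freeze cannot last longer than the cycle it belongs to.
    return min(freeze_days, cycle_days) / cycle_days


if __name__ == "__main__":
    HYPOTHETICAL_FREEZE_DAYS = 14  # assumption for illustration only
    for label, cycle_days in (("six-month", 182), ("one-month", 30)):
        share = frozen_fraction(cycle_days, HYPOTHETICAL_FREEZE_DAYS)
        print(f"{label} cycle: {share:.0%} of the cycle is frozen")
```

Under that assumed freeze length, the six-month cycle spends about 8% of its time frozen and the one-month cycle about 47%. The exact figures depend entirely on the assumption; the point is only that a fixed amount of testing time eats a much larger share of a short cycle.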

Six months isn’t magic

There’s nothing objectively correct about a six month release cycle. It’s mostly because that’s how you get two releases a year. If the calendar had 10 months, the release cycle would be five. But there is a lower bound where you’ve become a de facto rolling release, even if you still have discrete releases. I don’t know where exactly that boundary is, but I suspect that one month is at or just beyond it.

Similarly, there’s an upper limit where you’re now a slow, plodding project. Again, I can’t say where the line is. Six months may be uncomfortably close to it, but I suspect it’s closer to a year. And, of course, it depends on the nature of the specific project.

So there’s no particular reason Fedora Linux couldn’t move to a shorter release cycle. Five months is totally doable. Four is possible. Three would require a tremendous amount of work before it could be considered. But what’s the benefit of going to a shorter cycle? Does five months instead of six make a meaningful difference? At least with six months, you know there’s a release targeted for April and October. Predictability is nice.

Solving the actual problem

The bigger issue, though, is that I don’t think people actually want this. Yes, you might want your web browser and other applications to update frequently. But that doesn’t mean you want your compiler or Python interpreter or C libraries to update frequently. Most people will avoid this in favor of the “steady” stream. This eliminates the intended benefit to upstream projects.

The people who do want everything to update quickly use a rolling release distribution, something that Thorsten explicitly says his proposal is not.

Fundamentally, the proposal looks at the problem the wrong way. The problem isn’t that a six month cycle is too long. The problem is that application delivery is coupled to operating system delivery. Most people want the latest versions of the applications they care about and for everything else to remain unchanged. The challenge, of course, is that not everyone draws that distinction in the same way.

We unsuccessfully tried to solve this with Modularity. Flatpak, at least for graphical applications, offers another attempt to solve this problem.

Historically, the system and application layers have been distributed together. Figuring out how to decouple these (including how to draw the line between them) is the interesting work. And it provides real value to the end users.

Open source is selfish: that’s good and bad

Back in May, Devin Prater wrote an excellent piece on Medium titled “Linux Accessibility: an unmaintained Mess”. Devin talks about the poor state of accessibility on mainstream Linux distributions. While blind people have certainly used Linux, it’s generally not an easy task. There’s a simple explanation for this: most open source contributors aren’t blind.

There’s no rule that you can’t make accessible software if you don’t need that particular accessibility feature. But for many open source contributors, their contributions are based on “scratching their own itch.” People work on the things that are personally interesting to them or impact them in some way.

That’s a good thing! It means they’re invested in how well the software works. I’m sure you’ve used some applications where you thought “there’s no way the people who made this actually used it.”

The problem comes when we’re excluding potential users and contributors. People with vision problems can’t contribute because they can’t easily use the software. And when they can use it, the tools for contributing add another barrier. I can’t imagine trying to understand a patch or an XML file read aloud, but there are people who have to do that.

In Program Management for Open Source Projects, I wrote “software is only useful to the degree that people can use it”. I don’t have a great solution. As a community, we need to figure out how to keep the good part of the selfishness while being more inclusive.

The right of disattribution

While discussing the ttyp0 font license, Richard Fontana and I had a disagreement about its suitability for Fedora. My reasoning for putting it on the “good” list was taking shape as I wrote. Now that I’ve had some time to give it more thought, I want to share a more coherent (I hope) argument. The short version: authors have a fundamental right to require disattribution.

What is disattribution?

Disattribution is a word I invented because the dictionary has no antonym for attribution. Attribution, in the context of open works, means saying who authored the work you’re building on. For example, this post is under the Creative Commons Attribution-ShareAlike 4.0 license. That means you can use and remix it, provided you credit me (Attribution) and also let others use and remix your remix (ShareAlike). On the other hand, disattribution would say something like “you can use and remix this work, but don’t put my name on it.”

Why disattribution?

There are two related reasons an author might want to require disattribution. The first is that either the original work or potential derivatives are embarrassing. Here’s an example: in 8th grade, my friend wrote a few lines of a song about the destruction of Pompeii. He told me that I could write the rest of it on the condition that I don’t tell anyone that he had anything to do with it.

The other reason is more like brand protection. Or perhaps avoiding market confusion. This isn’t necessarily due to embarrassment. Open source maintainers are often overworked. Getting bugs and support requests from a derivative project because the user is confused is a situation worth avoiding.

Licenses that require attribution are uncontroversial. If we can embrace the right of authors to require attribution, we can embrace the right of authors to require disattribution.

Why not disattribution?

Richard’s concerns seemed less philosophical and more practical. Open source licenses are generally concerned with copyright law. Disattribution, particularly in the second reasoning, is closer to trademark law. But licenses are the tool we have available; don’t be surprised when we ask them to do more than they should.

Perhaps the bigger concern is the constraint it places on derivative works. The ttyp0 license requires not using “UW” as the foundry name. Richard’s concern was that two-letter names are too short. I don’t agree. There are plenty of ways to name a project that avoid one specific word. Even in this specific case, a name like “nuwave”, which contains “uw”, should be fine because it’s an unrelated “word.”

Excluding a specific word is fine. A requirement that excludes many words or provides some other unreasonable constraint would be the only reason I’d reject such a license.

Isn’t it better to contribute code than money?

Recently, I was in a discussion about making contributions to open source projects. One person said it would be nice if their employer gave each employee a budget that could be directed to open source projects at the employee’s discretion. The idea is that it would be a way for employees to support the specific projects that make their jobs or lives better. Another person said “isn’t it better to contribute code to the project?”

No, it is not. Even in software companies, a large percentage of employees lack the skills necessary to make meaningful code contributions to projects. Even when you consider (the very valuable) non-code contributions like documentation, testing, graphic design, et cetera, money is quicker and easier.

Money gives the project maintainers the flexibility to put it where they need it. They could buy test hardware, pay for web hosting, hire a contractor, buy themselves a nice cup of coffee. Whatever. This is the same reason charities prefer money over goods for disaster relief donations.

Of course, money isn’t perfect either. Not all projects are equipped to accept financial donations. Even if there’s a way to route money to them, they may not want to deal with tax implications. Loosely-governed projects may not have a good mechanism for deciding how to spend the money. Money can make relationships go south in a hurry.

If you’re a company looking for ways to let employees support the open source projects that they depend on, I advocate the “¿por que no los dos?” approach. Give your employees time to contribute effort in whatever way they’re able. But also give them a pool of money to sprinkle on the projects that provide value to your company.

What does it mean for a Linux distribution to be “fresh”?

I recently had a discussion with Luboš Kocman of openSUSE about how distros can monitor their “freshness”. In other words: how close is a distro to upstream? From our perspectives, it’s helpful to know which packages are significantly behind their upstreams. These packages represent areas that might need attention, whether that be a gentle nudge to the maintainer or recruiting additional volunteers from the community.

The challenge is that freshness can mean different things. The Repology project monitors a large number of distributions and upstreams to report on the status. But simply comparing the upstream version number to the packaged version number ignores a lot of very important context.

Updating to the latest upstream version as soon as it comes out is the most obvious definition of “fresh”, but it’s not always the best. Rolling releases (and their users) probably want that. In Fedora, policy is to not do “major updates” within a release. Many other release-oriented distributions have a similar policy, with varying degrees of “major”. Enterprise distributions add another wrinkle: they’ll backport security fixes (and sometimes key features), so the difference in version number doesn’t necessarily tell you what’s missing.

Of course, the upstream’s version number doesn’t necessarily tell you much. Semantic versioning is great, but not everyone uses it. And not everyone that uses it uses it well. If a distribution has version 1.4 and upstream released 1.5, is that a lack of freshness or an intentional decision to avoid mid-release compatibility changes?

I don’t have a good answer. This is a hard problem to solve. Something like Repology may be the best we can do with reasonable effort. But I’d love to have a more accurate view of how fresh Fedora packages are within the bounds of policy.
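To show why a raw version comparison isn’t enough, here is a minimal Python sketch of “freshness within the bounds of policy.” Everything in it is an assumption made for illustration: the function names, the three labels, and the idea of expressing a distribution’s update policy as the number of leading version components that stay fixed within a release. It only understands dotted numeric versions, and it knows nothing about backported fixes.

```python
def parse_version(version: str, width: int) -> tuple[int, ...]:
    """Parse a dotted numeric version ("1.4") into a zero-padded tuple."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def classify_freshness(packaged: str, upstream: str, frozen_components: int) -> str:
    """Label a package as "fresh", "held-by-policy", or "stale".

    frozen_components is a hypothetical policy knob: how many leading
    version components the distribution will not change within a release
    (0 for a rolling release, 1 to hold the major version, and so on).
    """
    if frozen_components < 0:
        raise ValueError("frozen_components must not be negative")
    width = max(packaged.count("."), upstream.count(".")) + 1
    pkg = parse_version(packaged, width)
    up = parse_version(upstream, width)
    if pkg >= up:
        return "fresh"
    # Catching up would change a component that policy keeps fixed.
    if pkg[:frozen_components] != up[:frozen_components]:
        return "held-by-policy"
    return "stale"


if __name__ == "__main__":
    # The same version gap reads differently under different policies.
    print(classify_freshness("1.4", "1.5", frozen_components=2))
    print(classify_freshness("1.4", "1.5", frozen_components=1))
```

With the first two components frozen, 1.4 versus 1.5 comes out as held by policy; with only the major version frozen, the same gap counts as stale. That is the ambiguity in a nutshell, and it’s before accounting for upstreams that don’t follow semantic versioning at all.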

FOSS licenses permit, not restrict

Last week, Matthew Wilson shared a very correct take on Twitter:

A few people in the mentions argued that the GPL is doing it wrong by his definition. This is incorrect. Copyleft licenses do not prevent the user from doing things, they ensure that subsequent users can do the same thing.

This may seem like a semantic argument, but there’s substance to it. All licenses (except those that amount to a public domain dedication) contain some conditions, minimal though they may be. It’s important to remember that the default is that you can do nothing with a work. Copyright is by definition a monopoly on a work. The entire point of free and open source software licenses is to tell you what you can do, because the default position is that you can’t.

One of the most annoying things about license wars is the argument that one category of license is somehow more free than another. That’s dumb. Both copyleft and permissive licenses promote freedom, just from different perspectives. Permissive licenses give the next person in line the freedom to do (essentially) whatever they want. Copyleft licenses preserve freedoms for all subsequent users, no matter how many hands the work passes through. There are plenty of philosophical and practical reasons you might choose one class of license over the other (I tend to prefer copyleft licenses, myself), but it’s wrong to paint one or the other as anti-freedom.

Getting back to Matthew’s point, there has been a fair amount of license weaponization in the last few years. By this I mean the use of a license to try to exclude a certain class of user. Some of this I’m sympathetic to (e.g. the “ethical source” movement), some of this I’m not (e.g. the various “you can do what you want, just don’t make a successful software-as-a-service offering” licenses that have popped up). In both cases, I think copyright is the wrong mechanism for achieving the goals.

Excluding classes of users is antithetical to the ideals of free software and open source. That may be okay. As I’ve written, free software is not the end goal. But if you’re going to claim to be open source, you should act open source.

Balancing incoming tasks in volunteer projects

Open source (and other volunteer-driven) communities are often made up of a “team of equals.” Each member of the group is equally empowered to act on incoming tasks. But balancing the load is not easy. One of two things happens: everyone is busy with other work and assumes someone else will handle it, or a small number of people immediately jump on every task that comes in. Both of these present challenges for the long-term health of the team.

Bystander effect

The first situation is known as the “bystander effect.” Because every member of the team bears an equal responsibility, each member of the team assumes that someone else will take an incoming task. The sociological research is apparently mixed, but I’ve observed this enough to know that it’s at least possible in some teams. You’ve likely heard the saying “if everyone is responsible then no one is.”

The bystander effect has two outcomes. The first is that the team drops the task. No one acts on it. If the task happens to be an introduction from a new member or the submission of content, this demoralizes the newcomer. If the team drops enough tasks, the new tasks stop coming.

The other possibility is that someone eventually notices that no one else is taking the task, so they take it. In my experience, it’s generally the same person who does this every time. Eventually, they begin to resent the other members of the team. They may burn out and leave.

Oxygen theft

Sometimes one or two team members jump on new tasks before anyone else does. Like the delayed version in the bystander effect scenario, this can lead to burnout. But worse, it can drive away team members who want to take tasks. If they’re constantly missing work because they weren’t able to immediately jump on it, they’ll go find other places to contribute. I call this “oxygen theft” because it’s like sucking all of the oxygen out of the room: it puts out the flames.

I have been an oxygen thief myself. Shortly after I started as the Fedora Program Manager, I became an editor on the Fedora Community Blog. I was publishing regular posts and I happen to be a decent editor, so it made sense to give me that privilege. But because Fedora was my day job, I was often the first to notice new submissions. Over time, I eventually became the only editor working on posts. By accident, the editorial team became a team of one. That’s on my list to fix in the near future.

Solving the problem

Letting either the bystander effect or oxygen theft cases go for too long harms the team. But with volunteers, it’s hard to balance the work. Team members may not have consistent availability. For example, one team member’s day job schedule may vary from week to week. They probably don’t have evenly distributed availability, either. Someone who is paid to be on a project will likely have a lot more time available than someone volunteering.

One way to solve the problem is to take turns being in charge of the incoming tasks for a period of time. This addresses “if everyone is responsible then no one is” by making a single person responsible. But by making it a rotating duty, you can spread the load.
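As a minimal sketch of what taking turns could look like, the Python below builds a simple round-robin schedule. The strict rotation, the function name, and the editor names are all assumptions for illustration; a team could just as easily fill each slot by asking for a volunteer.

```python
from datetime import date, timedelta
from itertools import cycle


def rotation_schedule(editors: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Pair each week's start date with the one person accountable that week."""
    if not editors:
        raise ValueError("at least one editor is required")
    turn = cycle(editors)  # loops back to the first editor after the last
    return [(start + timedelta(weeks=i), next(turn)) for i in range(weeks)]


if __name__ == "__main__":
    # Hypothetical team and start date.
    team = ["alice", "bao", "chidi"]
    for week_start, editor in rotation_schedule(team, date(2024, 1, 1), 4):
        print(week_start.isoformat(), editor)
```

The schedule only decides who is accountable in a given week. It doesn’t stop anyone else from pitching in.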

After learning my lesson with the Fedora Community Blog, I was hesitant to be too aggressive with taking tasks as an editor of the Fedora Magazine. But the Magazine team was definitely suffering from the bystander effect.

To fix this, I proposed having an Editor of the Week. Each week, one person volunteers to be responsible for making sure new article pitches get timely responses and the comments are moderated. Any of the editors are free to help with those tasks, but the Editor of the Week is the one accountable for them.

It’s not a perfect system. The Editor of the Week role is taken on a volunteer basis, so some editors serve more frequently than others. Still, it seems to work well for us overall. Pitches get feedback more quickly than in the past, and we’re not putting all of the work on one person’s plate.

[If you are intrigued by this half-baked post, you’ll enjoy my book on program management for open source projects, coming from The Pragmatic Bookshelf in 2022.]