Open source AI and open data

I’m a little late to the party with this post, but I need to get it out of my head. The question of “what is ‘open source AI’, exactly?” has been a hot topic in some circles for a while now. The Open Source Initiative, keepers of the Open Source Definition, have been working on developing a definition for open source AI. The latest draft notably does not require the training data to be available under an open license. I believe this is a mistake.

Open source AI must include open data

Data is critical to modern computing. I called this out in a 2020 DevConf talk and I can hardly claim to be the first or only person to make this observation. More recently, Tom “spot” Callaway wrote his objections to a definition of “open source AI” that doesn’t include open data. My objections (and I venture to say spot’s as well) have nothing to do with ideological purity. I wrote over three years ago that I don’t care about free/open source software as an end goal. What matters is the human impact.

Even before ChatGPT hit the scene, there were countless examples of AI exacerbating biases and inequities. Part of addressing that issue is providing a better training data set. But if we don’t know what an AI model is trained on, we don’t know what sort of biases it’s reproducing. This is a data problem, not a model weights problem. The most advanced AI in the world is still going to produce biased output if trained on biased sources.

OSI attempts to address this by requiring “data information.” This is insufficient. I’ll again defer to spot to make this case better than I could. OSI raises valid points about how rules governing data can be different than those covering code. Oh well. The solution is to acknowledge that some models won’t meet the requirements instead of watering down the requirements.

No one is owed an “open source AI”

Part of the motivation behind OSI’s choices here seems to be the creation of a definition that commercially-viable AI models can meet. They say “We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones.” Tara Tarakiyee wrote in response “Well, if the price of making Open Source ‘AI’ competitive with proprietary ‘AI’ is to break the openness that is fundamental to the definition, then why are we doing it?”

I agree with Tara. His whole post is well worth a read. But what this particular thread comes down to is this: we don’t owe anyone a commercially-viable definition just because doing otherwise is hard. There’s nothing in the Open Source Definition that says “but you can skip some of these requirements if you can’t figure out how to make money.”

“Can’t” and “won’t” aren’t the same thing

I’ve seen some people argue that creating a definition that results in zero “open source AI” models is useless. It’s important to distinguish here between “can’t” and “won’t”: they are not the same.

It’s true that a definition that no model could possibly meet is useless. But a definition that no model currently chooses to meet is valuable. AI developers could certainly choose to make their training data available. If they don’t want to, they don’t get to call their model open source. It’s the same as wanting to release software under a license that doesn’t meet some part of the Open Source Definition. As I said in the previous section, no one is owed a definition that meets their business needs.

The argument is silly, anyway. There are at least two models that would meet a more appropriate definition.

Where to go from here?

I wrote this post because I needed to get the words out of my head and onto “paper”. I have no expectation it will change the direction of OSI’s next draft. They seem pretty committed to their choice at this point. I’m not really sure what is gained by making this compromise. Nothing of worth, I think.

This is a problem we should have been addressing years ago, instead of rushing to catch up once the cat was out of the proverbial bag. Collectively, we seem to have a tendency to skate to where the puck was, not where it will be. This isn’t the first time. At FOSDEM 2021, Bradley Kuhn said something to the effect of “if I would have known proprietary software would be funded by advertising instead of license sales, I would have done a lot of things differently.”

I’m not sure what the next big challenge will be. But you can be sure if I figure it out, I’ll push a lot harder to address it before we get passed by again.

Other writing: May 2024

What have I been writing when I haven’t been writing here?

Duck Alignment Academy

Other writing: April 2024

Where have I been writing when I haven’t been writing here?

Stuff I wrote

Duck Alignment Academy

Happy birthday, BASIC!

Today is apparently the 60th birthday of the BASIC programming language. It’s been nearly a quarter of a century since I last wrote anything in BASIC, but it’s not unreasonable to say it’s part of why I am where I am today.

When I was in elementary school, my uncle gave us a laptop that he had used. I’d used computers in school — primarily the Apple II — but this was the first time we’d had a computer in the house. Weighing in at 12 pounds, the Epson Equity LT was better suited for the coffee table than the lap, but it was a computer, damn it! In a time when we didn’t have much money, we could still afford the occasional $5 game on a 3.5″ floppy from Target. (I still play Sub Battle Simulator sometimes!)

But what really set me down my winding path to the present was when my uncle taught me how to write programs in GW-BASIC. We started out with a few simple programs. One took your age and converted it to years on each of the other planets in the solar system. Another did the same but with your weight. I learned a little bit about loops and conditionals, too.

Eventually, I started playing around in QBasic, learning to edit existing programs and write new ones. I remember writing a hearing test program that generated sounds of increasing pitch through the PC speaker. After using Azile at my friend’s house, I wrote my own chat program. I learned how to make it play musical notes from some manuals my uncle had left us.

I didn’t really know what I was doing, but I learned through trial and error. That skill has carried me through my entire career. At 41, I have a mostly-successful career that’s paid me well primarily due to networking, privilege, and luck. But I also owe something to the skills I learned writing really shitty BASIC code as a tween and teen.

Book review: The Sympathizer

What does it mean to pretend to be something else? In one of my favorite books, Mother Night, the character Howard W. Campbell, Junior concludes that “we are what we pretend to be, so we must be careful what we pretend to be.” Viet Thanh Nguyen’s narrator in The Sympathizer reaches no conclusions, but he struggles with the thought throughout the story.

I saw (or imagined) a lot of parallels between Mother Night and The Sympathizer, which no doubt predisposed me to liking the latter. Both books take the form of the protagonist recounting his exploits for a captor, mixing self-reflection with facts. Both take place in a war setting, with characters having authentic connections to the people they’re trying to deceive.

But even though the themes rhyme, The Sympathizer is its own work. If nothing else, it’s a rare work that looks at the Vietnam War from the North Vietnamese perspective. It’s also a really enjoyable book in its own right. The fact that the narrator cannot answer the questions he asks himself gives the reader something to think about long after the book is done.

I loved this book to the point that I stayed up far too late to finish it. I’m looking forward to reading the sequel that I just found out existed.

Other writing: March 2024

What have I been writing when I haven’t been writing here?

Stuff I wrote

Duck Alignment Academy

In defense(ish) of subscriptions

It seems like everything is a subscription these days. We’ve replaced our towers of DVDs and CDs with subscriptions to Netflix and Spotify. The books that used to be piled on our shelves are now bits on a Kindle. In some respects, this is super convenient. Want to bring several books on vacation? It takes almost no space in your bag. Want to switch what music you’re listening to while you drive? Talk to your phone instead of flipping through a huge binder of CDs. Convenient and safe!

Of course, there’s a downside, too. When you have a subscription, you don’t truly own what you’re paying for. Amazon might decide to remove a book from your Kindle. Studios frequently pull their content off of Netflix to put it on their own services. If you stop paying Adobe, you can’t keep using Photoshop.

Some people are pushing back. Jose Gilgado’s “The beauty of finished software” is a great example of the thought. ONCE from 37Signals is a practical example. But people still want bug fixes, and those cost money to produce.

I’ve come to realize that the lack of subscription is sometimes a red flag. A product that charges once for a lifetime of service is a recipe for failure. For example, I bought some toothbrush sensors for my kids. I can look on the app and see how well they brushed. But you buy the hardware and get the app and ongoing service for free. That’s not sustainable. So at any moment, the company might go out of business and suddenly the devices are useless. Of course, one solution is to have a platform that doesn’t require a remote server.

In general, I’m now cautious of buying things that have perpetual service and one-time payment. Subscriptions can be abused, sure, but sometimes it’s the right model for a sustainable business. Of course, I’m also buying movies I love on DVD to put them on my local server.

Other writing: February 2024

What have I been writing when I haven’t been writing here?

Stuff I wrote

Duck Alignment Academy

  • Fork yes: embrace forks of your project — If you’ve done what you can to make your community a great place to contribute, then you can feel free to embrace any forks that happen.
  • Keep your bug tracker unified — When your bug tracking is scattered across different platforms, you make it harder for your users to file reports.
  • Semantic versioning in large projects — SemVer can work for large projects, but it’s not a fit for every case. Whatever you pick, document it clearly.
  • Grow by delegating — Don’t hoard responsibility. Give new contributors a sense of ownership so that they’ll stick around your community.

Back on the market

Nearly 10 months to the day since the last time this happened, I was informed yesterday that my position has been eliminated. As before, it’s unrelated to my performance, but the end result is the same: I am looking for a new job.

So what am I looking for? The ideal role would involve leadership in open source strategy or other high-level work. I’m excited by the opportunities to connect open source development to business goals in a way that makes it a mutually-beneficial relationship between company and community. In addition to my open source work, I have experience in program management, marketing, HPC, systems administration, and meteorology. You know, in case you’re thinking about clouds. My full resume (also in PDF) is available on my website.

If you have something that you think might be a good mutual fit, let me know. In the meantime, you can buy Program Management for Open Source Projects for all of your friends and enemies. I’m also available to give talks to communities (for free) and companies (ask about my reasonable prices!) on any subject where I have expertise. My website has a list of talks I’ve given, with video when available.

Other writing: January 2024

What have I been writing when I haven’t been writing here?

Stuff I wrote

Duck Alignment Academy

Docker