Open source AI and open data

I’m a little late to the party with this post, but I need to get it out of my head. The question of “what is ‘open source AI’, exactly?” has been a hot topic in some circles for a while now. The Open Source Initiative, keepers of the Open Source Definition, have been working on a definition of open source AI. The latest draft notably does not require the training data to be available under an open license. I believe this is a mistake.

Open source AI must include open data

Data is critical to modern computing. I called this out in a 2020 DevConf talk and I can hardly claim to be the first or only person to make this observation. More recently, Tom “spot” Callaway wrote his objections to a definition of “open source AI” that doesn’t include open data. My objections (and I venture to say spot’s as well) have nothing to do with ideological purity. I wrote over three years ago that I don’t care about free/open source software as an end goal. What matters is the human impact.

Even before ChatGPT hit the scene, there were countless examples of AI exacerbating biases and inequities. Part of addressing that issue is providing a better training data set. But if we don’t know what an AI model is trained on, we don’t know what sort of biases it’s reproducing. This is a data problem, not a model weights problem. The most advanced AI in the world is still going to produce biased output if trained on biased sources.

OSI attempts to address this by requiring “data information.” This is insufficient. I’ll again defer to spot to make this case better than I could. OSI raises valid points about how rules governing data can be different from those covering code. Oh well. The solution is to acknowledge that some models won’t meet the requirements, not to water down the requirements.

No one is owed an “open source AI”

Part of the motivation behind OSI’s choices here seems to be the creation of a definition that commercially-viable AI models can meet. They say “We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones.” Tara Tarakiyee wrote in response “Well, if the price of making Open Source ‘AI’ competitive with proprietary ‘AI’ is to break the openness that is fundamental to the definition, then why are we doing it?”

I agree with Tara. His whole post is well worth a read. But what this particular thread comes down to is this: we don’t owe anyone a commercially-viable definition just because meeting a complete one is hard. There’s nothing in the Open Source Definition that says “but you can skip some of these requirements if you can’t figure out how to make money.”

“Can’t” and “won’t” aren’t the same thing

I’ve seen some people argue that creating a definition that results in zero “open source AI” models is useless. It’s important to distinguish here between “can’t” and “won’t”: they are not the same.

It’s true that a definition that no model could possibly meet is useless. But a definition that no model currently chooses to meet is valuable. AI developers could certainly choose to make their training data available. If they don’t want to, they don’t get to call their model open source. It’s the same as wanting to release software under a license that doesn’t meet some part of the Open Source Definition. As I said in the previous section, no one is owed a definition that meets their business needs.

The argument is silly, anyway. There are at least two models that would meet a more appropriate definition.

Where to go from here?

I wrote this post because I needed to get the words out of my head and onto “paper”. I have no expectation it will change the direction of OSI’s next draft. They seem pretty committed to their choice at this point. I’m not really sure what is gained by making this compromise. Nothing of worth, I think.

This is a problem we should have been addressing years ago, instead of rushing to catch up once the cat was out of the proverbial bag. Collectively, we seem to have a tendency to skate to where the puck was, not where it will be. This isn’t the first time. At FOSDEM 2021, Bradley Kuhn said something to the effect of “if I had known proprietary software would be funded by advertising instead of license sales, I would have done a lot of things differently.”

I’m not sure what the next big challenge will be. But you can be sure if I figure it out, I’ll push a lot harder to address it before we get passed by again.

3 thoughts on “Open source AI and open data”

  1. Thanks for sharing your thoughts, Ben. We’ll have to agree to disagree, since you signal that you don’t care about having an Open Source AI but we do. But you care about open data. OSI does too. I think the issue of biases and creating good datasets should be a separate track. The “Data information” concept can allow us to have a workable definition of Open Source AI *now*, with a very high bar, while the open data debate continues (it has existed for over a decade).
    I’d like to leave a comment for your readers so they can educate themselves and draw their own conclusions.
    1. OSI is only driving a global, multi-stakeholder conversation that has spanned more than two years. OSI is not writing the definition itself; the board only provided a framework and boundaries for reaching an agreement. The framework is that the Definition will have to:
    – be supported by developers, end users and subjects of AI
    – provide real-life examples of AI systems that comply
    – be ready for use by Oct 24
    2. During the “system analysis” phase of the Open Source AI Definition, volunteers voted the training dataset much, much lower than the data pre-processing code. Details here: https://discuss.opensource.org/t/report-of-working-group-document-review/292
    3. Your statement that “data information” is insufficient is negated by practice: developers seem perfectly fine using, studying, sharing, and modifying AI systems without any data. See what people are doing with the Llama model alone, one of the most opaque of all. The draft 0.0.8 puts the bar much, much higher than practitioners seem to require, which is where it should be.
    4. You link to an article citing BLOOM and OLMo as examples of systems that share their datasets. BLOOM’s license is problematic. OLMo uses the same problematic dataset that put Pythia in a legal grey area in the US. Does that make OLMo acceptable only until someone sues the Allen AI Institute, as Eleuther AI was sued?
    5. You and others are ignoring what data experts are saying. In short: distributing large datasets globally is a legal minefield that the open data community hasn’t solved in 15+ years. Trivially: do you know when the copyright of a movie expires in Italy? And in the UK? And in Brazil? Which “public domain” baseline will you pick for a globally safe dataset of multilanguage movie subtitles? The rapporteur of the EU copyright directive explained it well: https://discuss.opensource.org/t/explaining-the-concept-of-data-information/401/2
    6. The underlying current of your argument either gives more power to Amazon, Google, Netflix, etc. to create AI (they have already acquired all the data and will continue to do so, exchanging rights among themselves) or aims to make large data aggregation totally illegal by expanding the reach of copyright law. In both cases, that’s a dangerous argument, and it’s largely independent of defining Open Source AI: it’s a parallel track.
    7. You say “no one is owed an Open Source AI” and that’s where we philosophically diverge. The draft preamble of the Open Source AI Definition (which nobody contested) states the opposite: we want the benefits of Open Source in AI (autonomy, transparency, frictionless reuse, and collaborative improvement).
    8. The “Data information” requirement sets a bar so high that only a few AI systems pass it. Coincidentally, they’re the same ones you’ve mentioned: OLMo, BLOOM (if they change their license), and Pythia, plus the ones that make an attempt to release their datasets. That’s because of the requirement for “data pre-processing code”: if that’s shared, the dataset seems to be shared too.

  2. I’m insulted by the degree to which you’re mischaracterizing my position. I’m fine with respectful disagreement, but your comment makes me less likely to engage with this and future efforts, not more.

    First of all, I *do* care about having an open source AI, but not as an end goal. I’d rather have no definition for an open source AI than one that falls short of what I think it should be.

    I stand by “no one is owed an open source AI”. I also want all of the benefits you argue for, which is why I wrote this post in the first place. But no *developer* is owed an incomplete definition just because some parts are hard to meet. They’re welcome to use another term.

  3. @ben, sorry you felt insulted; it wasn’t my intention, and I apologize if I offended you.

    You say “They’re welcome to use another term.” Except that they’re not: we already see the term “open source AI” in use, and it’s extremely popular, actually. And it’s dangerously misaligned with the values of Open Source that the OSI is expected to maintain. Open Source AI is mentioned in the AI Act with no definition: there will be one… I’ll let you guess what happens if it doesn’t come from the open source communities.

    I’ll stop here: I only wanted to provide some elements for your readers to make up their minds.
