Open source AI and open data

I’m a little late to the party with this post, but I need to get it out of my head. The question of “what is ‘open source AI’, exactly?” has been a hot topic in some circles for a while now. The Open Source Initiative, keepers of the Open Source Definition, have been working on developing a definition for open source AI. The latest draft notably does not require the training data to be available under an open license. I believe this is a mistake.

Open source AI must include open data

Data is critical to modern computing. I called this out in a 2020 DevConf talk and I can hardly claim to be the first or only person to make this observation. More recently, Tom “spot” Callaway wrote his objections to a definition of “open source AI” that doesn’t include open data. My objections (and I venture to say spot’s as well) have nothing to do with ideological purity. I wrote over three years ago that I don’t care about free/open source software as an end goal. What matters is the human impact.

Even before ChatGPT hit the scene, there were countless examples of AI exacerbating biases and inequities. Part of addressing that issue is providing a better training data set. But if we don’t know what an AI model is trained on, we don’t know what sort of biases it’s reproducing. This is a data problem, not a model weights problem. The most advanced AI in the world is still going to produce biased output if trained on biased sources.

OSI attempts to address this by requiring “data information.” This is insufficient. I’ll again defer to spot to make this case better than I could. OSI raises valid points about how rules governing data can be different than those covering code. Oh well. The solution is to acknowledge that some models won’t meet the requirements instead of watering down the requirements.

No one is owed an “open source AI”

Part of the motivation behind OSI’s choices here seems to be the creation of a definition that commercially viable AI models can meet. They say “We need an Open Source AI Definition that can effectively guide users and developers to make the right choice. We need one that doesn’t put developers of Open Source AI at a disadvantage compared to proprietary ones.” Tara Tarakiyee wrote in response “Well, if the price of making Open Source ‘AI’ competitive with proprietary ‘AI’ is to break the openness that is fundamental to the definition, then why are we doing it?”

I agree with Tara. His whole post is well worth a read. But what this particular thread comes down to is this: we don’t owe anyone a commercially-viable definition just because doing otherwise is hard. There’s nothing in the Open Source Definition that says “but you can skip some of these requirements if you can’t figure out how to make money.”

“Can’t” and “won’t” aren’t the same thing

I’ve seen some people argue that creating a definition that results in zero “open source AI” models is useless. It’s important to distinguish here between “can’t” and “won’t”: they are not the same.

It’s true that a definition that no model could possibly meet is useless. But a definition that no model currently chooses to meet is valuable. AI developers could certainly choose to make their training data available. If they don’t want to, they don’t get to call their model open source. It’s the same as wanting to release software under a license that doesn’t meet some part of the Open Source Definition. As I said in the previous section, no one is owed a definition that meets their business needs.

The argument is silly, anyway. There are at least two models that would meet a more appropriate definition.

Where to go from here?

I wrote this post because I needed to get the words out of my head and onto “paper”. I have no expectation it will change the direction of OSI’s next draft. They seem pretty committed to their choice at this point. I’m not really sure what is gained by making this compromise. Nothing of worth, I think.

This is a problem we should have been addressing years ago, instead of rushing to catch up once the cat was out of the proverbial bag. Collectively, we seem to have a tendency to skate to where the puck was, not where it will be. This isn’t the first time. At FOSDEM 2021, Bradley Kuhn said something to the effect of “if I would have known proprietary software would be funded by advertising instead of license sales, I would have done a lot of things differently.”

I’m not sure what the next big challenge will be. But you can be sure if I figure it out, I’ll push a lot harder to address it before we get passed by again.

Tech is a garbage industry filled with people making garbage decisions

I work with some great people in the tech space. But the fact that there are terrific people in tech is not a valid reason to ignore how garbage our industry can be. It’s not even that we do bad things intentionally; we’re just oblivious to the possible bad outcomes. There are a number of paths by which I could come to this conclusion, but two recent stories prompted this post.

Can you track me now?

The first was an article last Tuesday that revealed AT&T, T-Mobile, and Sprint made it really easy to track the location of a phone for just a few hundred dollars. They’ve all promised to cut off that service (of course, John Legere of T-Mobile has said that before) and Congress is taking an interest. But the question remains: who thought this was a good idea? Oh sure, I bet they made some money off of it. But did no one in a decision-making capacity stop and think “how might this be abused?” Could a domestic abuser fork over $300 to find the shelter their victim escaped to? This puts people’s lives in danger. Would you be surprised if we learned someone had died because their killer could track them in real time?

It just looks like AI

And then on Thursday, we learned that Ring’s security system is very insecure. As Sam Biddle reported, Ring kept unencrypted customer video in S3 buckets that were widely available across the company. All you needed was the customer’s email address and you could watch their videos. The decision to keep the videos unencrypted was deliberate: company leadership (before the acquisition by Amazon) felt that encryption would diminish the value of the company.

I haven’t seen any reporting that would indicate the S3 bucket was publicly viewable, but even if it wasn’t, it’s a huge risk to take with customer data. One configuration mistake and you could expose thousands of people’s homes to public viewing. Not to mention that anyone on the inside could still use their access to spy on the comings and goings of people they knew.

If that wasn’t bad enough, it turns out that much of the object recognition that Ring touted wasn’t done by AI at all. Workers in Ukraine were manually labeling objects in the video. Showing customer video to employees wasn’t just a side effect of their design; it was an intentional choice.

This is bad in ways that extend beyond this example:

Bonus: move fast and brake things?

I’m a little hesitant to include this since the full story isn’t known yet, but I really love my twist on the “move fast and break things” mantra. Lime scooters in Switzerland were stopping abruptly and letting inertia carry the rider forward to unpleasant effect. TechCrunch reported that it could be due to software updates happening mid-ride, rebooting the scooter. Did no one think that might happen, or did they just not test it?

Technology won’t save us

I’m hardly the first to say this, but we have to stop pretending that technology is inherently good. I’m not even sure we can say it’s neutral at this point. Once it gets into the hands of people, it is being used to make our lives worse in ways we don’t even understand. We cannot rely on technology to save us.

So how do we fix this? Computer science and similar programs (or really all academic programs) should include ethics courses as mandatory parts of the curriculum. Job interviews should include questions about ethics, not just technical questions. I commit to asking questions about ethical considerations in every job interview I conduct. Companies have to ask “how can this be abused?” as an early part of product design, and they must have diverse product teams so that they get more answers. And we must, as a society, pay for journalism that holds these companies to account.

The only thing that can save us is ourselves. We have to take out our own garbage.

Google Duplex and the future of phone calls

For the longest time, I would just drop by the barber shop in the hopes they had an opening. Why? Because I didn’t want to make a phone call to schedule an appointment. I hate making phone calls. What if they don’t answer and I have to leave a voicemail? What if they do answer and I have to talk to someone? I’m fine with in-person interactions, but there’s something about phones. Yuck. So I initially greeted the news that Google Duplex would handle phone calls for me with great glee.

Of course it’s not that simple. A voice-enabled AI that can pass for human is ripe for abuse. Imagine the phone scams you could pull.

I recently called a local non-profit that I support to increase my monthly donation. They did not verify my identity in any way. So that’s one very obvious way to cause mischief. I could also see tech support scammers using this as a tool in their arsenal: if not to actually conduct the fraud, then to pre-screen victims so that humans only have to talk to likely victims. It’s efficient!

Anil Dash, among many others, pointed out the apparent lack of consent in Google Duplex:

The fact that Google inserted “um” and other verbal placeholders into Duplex makes it seem like they’re trying to hide the fact that it’s an AI. In response to the blowback, Google has said it will disclose when a bot is calling:

That helps, but I wonder how much consideration Google has given to abuse. It will definitely be helpful to people with disabilities that make using the phone difficult. It can be a time-saver for the Very Important Business Person™, too. But will it be used to expand the scale of phone fraud? Could it execute a denial of service attack against a business’s phone lines? Could it be used to harass journalists, advocates, abuse victims, etc.?

As I read news coverage of this, I realized that my initial reaction didn’t consider abuse scenarios. That’s one of the many reasons diverse product teams are essential. It’s easy for folks who have a great deal of privilege to be blind to the ways technology can be misused. I think my conclusion is a pretty solid one:

The tech sector still has a lot to learn about ethics.

I was discussing this with some other attendees at the Advanced Scale Forum last week. Too many computer science and related programs do not require any coursework in ethics, philosophy, etc. Most of computing has nothing to do with computers, but instead with the humans and societies that the computers interact with. We see the effects play out in open source communities, too: anything that’s not code is immediately devalued. But the last few years should teach us that code without consideration is dangerous.

Ben Thompson had a great article in Stratechery last week comparing the approaches of Apple and Microsoft versus Google and Facebook. In short: Apple and Microsoft are working on AI that enhances what people can do while Google and Facebook are working on AI to do things so people don’t have to. Both are needed, but the latter would seem to have a much greater level of ethical concerns.

There are no easy answers yet, and it’s likely that in a few years tools like Google Duplex will not even be noticeable because they’ve become so ubiquitous. The ethical issues will be addressed at some point. The only question is whether they will be addressed proactively or reactively.