OpenAI: Copy, steal, paste

By OpenAI's logic, any work you put online is fair game to be swiped and incorporated into the company's large language models.

Computerworld

The shadow of a hand hovers unsettlingly over a keyboard. (Image credit: Dimitris66 / Getty Images)

On average, every story I publish is stolen about 20 times. For example, numerous rip-off sites copied and pasted my last column on holiday layoffs more than a dozen times on the same day. Why? Because they get readers' views without having to pay me a cent.

Sure, automated content scraping sites don't make much money, but like spam, the process doesn't cost them much, either. OpenAI, on the other hand, made $1.3 billion in revenue in 2023, and they didn't pay me a dime, either.

You see, in defending itself against the New York Times' copyright lawsuit, OpenAI claims that "training AI models using publicly available internet materials is fair use." Yeah. Right. I've heard that before on the very rare occasions when a content scraper has responded to my attorney's attempts to stop them.

The Times argues that millions of its articles are now being used to train chatbots that compete with it. It's not wrong. OpenAI and other generative AI (genAI) companies are training their large language models (LLMs) using New York Times stories. They're making billions from the work of the paper’s writers and editors without paying for it.

OpenAI also claims that the Times can, and indeed did, opt out of letting its stories be used in ChatGPT's LLM. But if that were the case, then how did ChatGPT outright plagiarize such articles as a Pulitzer Prize-winning, five-part, 18-month investigation into predatory lending practices in New York City’s taxi industry?

One way it might have done this, OpenAI admits, is through what it calls memorization. “This is a rare failure of the learning process that we are continually making progress on, but it’s more common when particular content appears more than once in training data, like if pieces of it appear on lots of different public websites.”

Like, for example, on those aforementioned pirate sites that copy and paste stories. Indeed, OpenAI admits that the taxi series rip-off appears to have emerged "from years-old articles that have proliferated on multiple third-party websites."

I call this the "They did it first" defense. I'm not impressed.

At the same time, OpenAI claims the Times "didn't meaningfully contribute to the training of our existing models and also wouldn't be sufficiently impactful for future training." Please. In Common Crawl, the most highly weighted dataset in GPT-3, the top three data sources are Wikipedia, a US patent database, and… the New York Times.

As Victor Tangermann, a Futurism.com staff writer, recently wrote, "OpenAI's entire business model relies on hoovering up as much data as it can find, often including copyrighted material."

Don't buy his take? How about OpenAI's own arguments to the UK Parliament? There, the company said: “Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials."

Now, I've no objection to OpenAI using copyrighted materials. None at all. I'm not the Times, but I have well over 10,000 articles in top tech publications to my credit. I do not doubt that my work is being used by OpenAI. OpenAI's welcome to use it.

Just. Pay. Me.

To quote the late science-fiction writer, Harlan Ellison, from his famous rant, Pay the Writer, "They want everything for nothing. They wouldn’t go for five seconds without being paid. And they’ll bitch about how much they’re paid, and want more. I should do a freebie for Warner Brothers?! What, is Warner Brothers out with an eye patch and a tin cup on the street? F***, no. They always want the writer to work for nothing."

The same is true of OpenAI and other genAI companies. The publishing companies, publications, writers, and editors do the work, and the genAI companies want to profit without paying anyone a penny.

We've been down this road before. In the 1990s, newspapers and magazines began a long decline because they couldn't master making a profit from publishing on the internet. That's why Google, which was able to transform our content into profits via advertising, made billions and billions while news publications continue to circle the drain.

I can't see the publishers making that blunder again. This time, we'll get paid. And if Microsoft and OpenAI don't make quite as many billions as they'd hoped for, I won't cry for them.

Of course, we might fail. If that happens, well, we can actually see what that future looks like. Cory Doctorow, blogger and science fiction writer, coined the pungent word "enshittification" for it. By this, he means the decline in the quality of online sites and information.

That's not just an opinion. Recent research shows that "Google’s search results are becoming less useful and filled with more spam websites." More and more content is based on search engine optimization and AI-created drivel. At the same time, trading quality for quantity results in less income for publications and writers. This, in turn, means there will be even fewer worthwhile stories anywhere for genAI engines to learn from.

If OpenAI and its ilk are wise, they'll start sharing the wealth with content creators. It's really the only way forward in the long run for all of us, whether we're tech billionaires or freelance writers.
