The New York Times’ lawsuit against OpenAI could have significant impacts on the development of machine intelligence

By Mike Cook | January 10, 2024


In 1954, the Guardian’s science correspondent reported on “electronic brains” that had a type of memory that allowed them to retrieve information such as airline seat assignments in seconds.

Nowadays, the idea of computers storing information has become so widespread that we don’t even think about what words like “memory” actually mean. But in the 1950s, this language was new to most people, and the idea of an “electronic brain” was full of possibilities.

In 2024, your microwave oven has more computing power than anything called a brain in the 1950s, but the world of artificial intelligence is creating fresh challenges for language, and for lawyers. Last month, the New York Times filed a lawsuit against OpenAI and Microsoft, the owners of the popular AI-based text generation tool ChatGPT, alleging that they used Times articles in the data used to train (improve) and test their systems.

The Times claims that OpenAI violated its copyright by using its journalism in the process of creating ChatGPT, and that in doing so OpenAI built a competing product that threatens the newspaper’s business. OpenAI’s response so far has been fairly cautious, but a key principle outlined in a statement released by the company is that its use of online data falls under the doctrine known as “fair use”. According to OpenAI, this is because it transforms the work into something new in the process: the text generated by ChatGPT.

At the heart of this case is the question of data use. What data do companies like OpenAI have the right to use, and what do concepts like “transformation” really mean in this context? Questions like these, surrounding the data on which AI systems or models like ChatGPT are trained, remain a fierce academic battleground. The law often lags behind the behavior of industry.

If you’ve used AI to answer emails or summarize work for you, you might see ChatGPT as an end that justifies these means. But perhaps we should be concerned if the only way to achieve it is to exempt specific corporate entities from laws that apply to everyone else.

Not only does this change the nature of the debate around copyright cases like this, it also has the potential to change the way societies structure their legal systems.


Read more: ChatGPT: What does the law say about who owns the copyright of AI-generated content?


Basic questions

Cases like this can raise thorny questions about the future of legal systems, but they can also call into question the future of AI models themselves. The New York Times believes that ChatGPT threatens the newspaper’s long-term survival. For its part, OpenAI says in its statement that it is collaborating with news organizations to bring new opportunities to journalism, and that the company’s goals are “to support a healthy news ecosystem” and “to be a good partner”.

Even if we believe that AI systems are a necessary part of our society’s future, destroying the sources of data on which they were originally trained seems like a bad idea. This is a concern shared by creative enterprises like the New York Times, by writers such as George R.R. Martin, and by the online encyclopedia Wikipedia.

Proponents of large-scale data collection, such as that used to power large language models (LLMs), the technology underlying AI chatbots like ChatGPT, argue that AI systems “learn” from the datasets they are trained on, “transforming” that data and then creating something new.

[Photo: Sam Altman]

What this actually means is that researchers feed in text written by humans and ask these systems to predict the next words in a sentence, just as they would when responding to a real query from a user. By hiding these answers and then revealing them, researchers can provide a binary “yes” or “no” signal that helps nudge AI systems toward accurate predictions. That is why LLMs need so much written text.
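To make that hide-and-reveal process concrete, here is a minimal sketch in Python (assuming PyTorch is installed; the tiny corpus, model, and learning rate are illustrative inventions, nothing like OpenAI’s actual systems). The model guesses each hidden next word, the guess is compared with the real text, and a graded loss, the practical form of that yes/no signal, nudges the weights toward better predictions:

```python
# Toy sketch of next-word prediction training. Assumes PyTorch;
# the corpus, model sizes and learning rate are illustrative only.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat and the dog sat on the rug"
words = corpus.split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
ids = torch.tensor([vocab[w] for w in words])

# Hide the next word: the input is each word, the target is the word after it.
inputs, targets = ids[:-1], ids[1:]

model = nn.Sequential(
    nn.Embedding(len(vocab), 16),  # turn word ids into 16-dim vectors
    nn.Linear(16, len(vocab)),     # score every word in the vocabulary
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()    # graded penalty for wrong guesses

for step in range(200):
    logits = model(inputs)           # guesses for each hidden next word
    loss = loss_fn(logits, targets)  # reveal the answers and compare
    optimizer.zero_grad()
    loss.backward()                  # nudge weights toward better predictions
    optimizer.step()

# After training, ask what tends to follow "the".
scores = model(torch.tensor([vocab["the"]]))[0]
ranked = sorted(vocab, key=lambda w: scores[vocab[w]].item(), reverse=True)
print(ranked[:3])  # likely continuations of "the" in this tiny corpus
```

Real LLMs do the same thing at vastly larger scale, with billions of parameters and far longer contexts, which is why they need such enormous quantities of text.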

If we copied articles from the New York Times’ website and charged people for access, most would agree that this would be “systematic theft on a massive scale” (as the newspaper’s lawsuit puts it). But as shown above, improving accuracy by using data to guide an AI is more complex than that.

Companies like OpenAI say they do not store their training data, and so argue that the New York Times articles fed into the dataset are not actually being reused. A counterargument to this defense is that there is evidence that systems such as ChatGPT can “leak” verbatim excerpts of their training data. OpenAI says this is a “rare bug”.

However, this suggests that these systems do store and memorize some of the data they are trained on (unintentionally) and can recreate it verbatim when prompted in certain ways. That would bypass any paywalls a for-profit publication might impose to protect its intellectual property.
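As an illustration of what “verbatim” means here, a minimal sketch (a hypothetical check of my own, not the method used by the Times or OpenAI) might compare a model’s output against a source article and flag any long word-for-word runs they share:

```python
# Hypothetical sketch: flag long word-for-word overlaps between a model's
# output and a protected source text. Not the method used in the lawsuit.
def ngrams(text: str, n: int) -> set[str]:
    """All runs of n consecutive words in `text`, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlaps(source: str, output: str, n: int = 8) -> set[str]:
    """Runs of at least n words appearing verbatim in both texts.
    Longer shared passages show up as several overlapping n-word runs."""
    return ngrams(source, n) & ngrams(output, n)

article = ("In 1954, the Guardian's science correspondent reported on "
           "electronic brains that had a type of memory")
model_output = ("One paper noted that in 1954, the Guardian's science "
                "correspondent reported on electronic brains with great memory")
for run in verbatim_overlaps(article, model_output):
    print("possible leak:", run)
```

Any eight-word run copied word for word is flagged, while paraphrases are not, which is roughly the distinction at stake between “transformation” and reproduction.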

Language use

But it is our use of language that is likely to have the longer-term impact on how we approach legislation in cases like this. Most AI researchers would say that “learning” is too loaded and inaccurate a word to describe what AI actually does.

As society undergoes a major transition into the age of artificial intelligence, we must ask whether the law in its current form is sufficient to protect and support people. Whether something builds on an existing copyrighted work in a way that differs from the original is called “transformative use”, and it is a defense used by OpenAI.

But these laws were designed to encourage people to remix, recombine, and experiment with works that had already been made available to the outside world. These same laws were not designed to protect multibillion-dollar technology products that operate at a speed and scale far greater than any author could have imagined.

The problem with most defenses of large-scale data collection and use is that they rely on peculiar uses of the English language. We say that AI “learns,” “understands,” and “can think.” But these are analogies, not precise technical language.

Just as people in 1954 looked at the modern equivalent of a broken calculator and called it a “brain”, we are using old language to grapple with entirely new concepts. Whatever we call them, systems like ChatGPT do not work like our brains, and AI systems do not play the same role in society as humans do.

Just as we had to develop new words and a new common understanding of technology to make sense of computers in the 1950s, we may need to develop new language and new laws to help protect our society in the 2020s.

This article is republished from The Conversation under a Creative Commons license. Read the original article.


Mike Cook does not work for, consult, own shares in, or receive funding from any company or organization that would benefit from this article, and has disclosed no relevant affiliations beyond his academic duties.
