Human differences in judgment cause problems for artificial intelligence

By Mayank Kejriwal | May 14, 2024

Many people understand the concept of bias on an intuitive level. Racial and gender biases in society and AI systems are well documented.

If society could somehow eliminate bias, would all of these problems disappear? The late Nobel laureate Daniel Kahneman, a towering figure in behavioral economics, argued in his final book, Noise, that bias is only one side of the coin. Errors in judgment can be attributed to two sources: bias and noise.

Bias and noise play important roles in fields such as law, medicine, and financial forecasting, where human judgments are central. In our work as computer and information scientists, my colleagues and I have found that noise also plays a role in artificial intelligence.

Statistical noise

In this context, noise means variability in people’s judgments of the same problem or situation. The problem of noise is more pervasive than it might first appear. A seminal study dating back to the Great Depression found that different judges handed down different sentences for similar cases.

Worryingly, sentencing in court cases can depend on extraneous factors such as the temperature or whether the local football team won or lost. Such factors contribute, at least in part, to the perception that the justice system is not only biased but also, at times, arbitrary.

Other examples: insurance adjusters may give different estimates for similar claims, reflecting noise in their judgments. Noise is likely present in all kinds of competitions, from wine tastings to local beauty pageants to college admissions.

Noise in data

On the face of it, noise seems unlikely to affect the performance of AI systems. After all, machines are not swayed by the weather or by football teams, so why would their decisions vary with circumstance? On the other hand, researchers know that bias does affect AI, because it is reflected in the data the AI is trained on.

The gold standard for new AI models like ChatGPT is human performance on general intelligence problems such as common sense. ChatGPT and its ilk are measured against human-labeled benchmark datasets.

Simply put, researchers and developers can ask the machine a common sense question and compare its answer with human answers: “If I put a heavy stone on a paper table, will the table collapse? Yes or no.” If there is a high level of agreement between the two (ideally, perfect agreement), the machine is considered to be approaching human-level common sense.
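As a rough sketch, and not the actual scoring code of any benchmark, the comparison boils down to counting how often the machine’s answer matches the human label. The questions and labels below are hypothetical:

```python
# A minimal sketch of the agreement check described above. The questions
# and labels are invented for illustration; real benchmarks contain
# thousands of items.
model_answers = {"q1": "no", "q2": "yes", "q3": "yes"}
human_labels  = {"q1": "no", "q2": "yes", "q3": "no"}

# Fraction of questions where the machine's answer matches the human label.
agreement = sum(
    model_answers[q] == human_labels[q] for q in model_answers
) / len(model_answers)

print(f"Agreement with human labels: {agreement:.0%}")  # 67% in this toy case
```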

So where could the noise come from? The common sense question above seems simple, and most people would probably agree on the answer, but there are many questions where there is more disagreement or uncertainty: “Is the following sentence reasonable or unreasonable? My dog is playing volleyball.” In other words, there is potential for noise. It’s no surprise that interesting common sense questions are a bit noisy.

But the problem is that most AI tests do not account for this noise in their experiments. Intuitively, questions on which human responses tend to agree should be weighted more heavily than questions on which responses diverge, that is, where there is noise. Researchers still don’t know whether or how to weigh the AI’s answers in such cases, but the first step is to acknowledge that the problem exists.
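To make that intuition concrete, here is one purely hypothetical weighting scheme, not an established practice: score each question by how strongly the human annotators agreed on it, so that noisy questions count for less. All data here is invented.

```python
# A hypothetical agreement-weighted scoring scheme. Each question's weight
# is the fraction of annotators who sided with the majority label.
from collections import Counter

human_labels = {
    "q1": ["yes", "yes", "yes", "yes"],  # unanimous: low noise
    "q2": ["no", "yes", "no", "no"],     # split vote: noisier
}
model_answers = {"q1": "yes", "q2": "yes"}

weighted_score = total_weight = 0.0
for q, labels in human_labels.items():
    majority, count = Counter(labels).most_common(1)[0]
    weight = count / len(labels)                        # agreement rate as weight
    weighted_score += weight * (model_answers[q] == majority)
    total_weight += weight

print(f"Agreement-weighted accuracy: {weighted_score / total_weight:.0%}")
```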

Tracking noise in the machine

Theory aside, is all of the above merely hypothetical, or is there real noise in actual common sense tests? The best way to prove or disprove the presence of noise is to take an existing test, remove its answers, and have multiple people label it independently, that is, provide the answers. By measuring disagreement among those people, researchers can determine how much noise is in the test.
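In its simplest form, and glossing over the statistical subtleties discussed next, measuring that disagreement might look like the following sketch, with invented labels:

```python
# A simplified measurement of labeling noise, assuming several people have
# independently re-labeled each question.
from itertools import combinations

relabels = {
    "q1": ["yes", "yes", "yes", "yes", "yes"],  # unanimous
    "q2": ["yes", "no", "yes", "no", "no"],     # contested
    "q3": ["no", "no", "yes", "no", "no"],      # near-unanimous
}

def pairwise_disagreement(labels):
    """Fraction of annotator pairs that gave different answers."""
    pairs = list(combinations(labels, 2))
    return sum(a != b for a, b in pairs) / len(pairs)

for q, labels in relabels.items():
    print(f"{q}: disagreement = {pairwise_disagreement(labels):.2f}")
```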

The details of measuring this disagreement are complex and involve a good deal of statistics and mathematics. And who is to say how common sense should be defined? How do you know when human judges are motivated enough to think the question through? Such issues lie at the intersection of good experimental design and statistics. Robustness is key: a single result, test, or group of human labelers is unlikely to convince anyone. And, pragmatically, human labor is expensive. Perhaps for these reasons, studies of possible noise in AI tests had not been conducted.

To fill this gap, my colleagues and I designed such a study and published our findings in Scientific Reports, showing that even in the domain of common sense, noise is unavoidable. Because the setting in which judgments are made can matter, we conducted two kinds of studies. One involved paid workers on Amazon Mechanical Turk; the other was a smaller-scale labeling study conducted in two labs at the University of Southern California and Rensselaer Polytechnic Institute.

You can think of the former as a more realistic online setting, reflecting how many AI tests are actually labeled before being released for training and evaluation. The latter is the other extreme: it guarantees high quality, but at a much smaller scale. The question we set out to answer was how unavoidable noise is, and whether it is merely a quality-control issue.

The results were sobering. In both settings, we found non-negligible noise, even on common sense questions that might have been expected to produce high, even universal, agreement. The noise was substantial enough that we concluded that between 4% and 10% of a system’s measured performance could be attributable to noise.

To highlight what this means, suppose I built an AI system that achieved 85% on such a test, and you built one that achieved 91%. Your system looks much better than mine. But if there is noise in the human labels used to score the answers, we can no longer be sure that the 6-percentage-point improvement means much. For all we know, there may be no real improvement at all.
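A toy simulation, with made-up numbers rather than figures from our study, shows how this uncertainty plays out. It assumes each label would flip with some probability if re-annotated:

```python
# A toy simulation of scoring two systems against noisy labels. All
# quantities below are illustrative assumptions, not values from the paper.
import random

random.seed(0)
N = 500                        # hypothetical test-set size
TRUE_A, TRUE_B = 0.85, 0.91    # assumed "true" accuracies of two systems
NOISE = 0.08                   # assumed chance a label flips on re-annotation

def observed_score(true_acc):
    """Score a system against labels that each flip with probability NOISE."""
    score = 0
    for _ in range(N):
        correct = random.random() < true_acc   # matches the ideal answer?
        flipped = random.random() < NOISE      # label differs from ideal?
        score += correct != flipped            # credit iff answer matches the noisy label
    return score / N

gaps = [observed_score(TRUE_B) - observed_score(TRUE_A) for _ in range(1000)]
print(f"Observed gap ranges from {min(gaps):+.3f} to {max(gaps):+.3f}")
```

In this toy setup, the measured gap between the two systems swings noticeably from run to run, occasionally even reversing sign, even though the underlying 6-point difference never changes.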

On AI leaderboards that compare the large language models behind the likes of ChatGPT, performance differences between competing systems are much narrower, often less than 1%. As we show in the paper, ordinary statistics do not really come to the rescue for disentangling the effects of noise from those of genuine performance improvements.

Noise controls

What is the way forward? Returning to Kahneman’s book: he proposed the concept of a “noise audit” for measuring, and ultimately mitigating, noise as much as possible. At the very least, AI researchers need to estimate what impact noise might be having.
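What a noise audit should look like for AI benchmarks is still an open question. As a purely hypothetical starting point, it might report how far each question’s labels are from unanimity:

```python
# One possible shape for a minimal noise audit of a labeled test set, in the
# spirit of Kahneman's proposal. The function and data are hypothetical.
from collections import Counter

def noise_audit(relabels):
    """Report, per question, how far the annotators were from unanimity."""
    report = {}
    for q, labels in relabels.items():
        majority_share = Counter(labels).most_common(1)[0][1] / len(labels)
        report[q] = 1.0 - majority_share  # 0.0 means perfect agreement
    noisy = [q for q, disagreement in report.items() if disagreement > 0]
    print(f"{len(noisy)} of {len(report)} questions show disagreement")
    return report

noise_audit({
    "q1": ["yes", "yes", "yes"],
    "q2": ["yes", "no", "yes"],
})
```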

Auditing AI systems for bias is already fairly common practice, so we believe the concept of a noise audit should follow naturally. We hope that this work, and others like it, will lead to its adoption.

This article is republished from The Conversation, an independent, nonprofit news organization providing facts and analysis to help you understand our complex world.

Written by: Mayank Kejriwal, University of Southern California.


Mayank Kejriwal receives funding from DARPA.
