Exploring Rationality and Biases in Language Models
Original title: "(Ir)rationality and cognitive biases in large language models", Royal Society Open Science
Introduction
As artificial intelligence (AI) systems become more advanced, it is crucial to understand their capabilities and limitations, especially when it comes to reasoning and decision-making. This is particularly important for large language models (LLMs), which are increasingly being considered for tasks where sound judgement matters.
This study aimed to investigate whether LLMs display rational reasoning, or whether they exhibit cognitive biases and other systematic errors like those documented in humans.
Methodology
The researchers used a battery of cognitive tasks drawn from the psychology of reasoning, presenting them to a range of large language models and treating each model as a participant in a cognitive experiment. Each task was posed to every model several times so that the consistency of the responses could be assessed alongside their correctness.
The cognitive tasks were designed to highlight biases and reasoning errors that are well documented in human decision-making. The battery included both mathematical and non-mathematical problems, and some tasks were also presented in facilitated versions intended to make the correct answer easier to reach.
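To make the setup concrete, the sketch below shows one way such an experiment could be scripted. It is a minimal illustration, not the authors' code: query_model is a hypothetical placeholder for whichever LLM API is under test, and the example task is an illustrative reasoning puzzle rather than an item taken from the study's battery.

```python
from collections import Counter

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test.

    A real experiment would send the prompt to an LLM API here and
    return the model's free-text answer.
    """
    return "The ball costs 5 cents."  # canned reply so the sketch runs

# Illustrative reasoning task (not one of the study's actual items).
TASK = (
    "A bat and a ball cost 1.10 in total. The bat costs 1.00 more than "
    "the ball. How much does the ball cost? Explain your reasoning."
)

def run_task(n_repeats: int = 10) -> Counter:
    """Pose the same task repeatedly and tally the distinct answers.

    Repeating the prompt is what exposes the kind of inconsistency the
    study reports, where one model gives different answers to the same task.
    """
    answers = Counter()
    for _ in range(n_repeats):
        answers[query_model(TASK).strip()] += 1
    return answers

if __name__ == "__main__":
    for answer, count in run_task().most_common():
        print(f"{count:2d}x  {answer}")
```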
Results
The study revealed some surprising findings about the reasoning abilities of the LLMs. Firstly, the models exhibited a high degree of inconsistency in their responses, with the same model sometimes providing very different answers for the same task. This suggests that the LLMs' reasoning is not as stable or reliable as human reasoning.
Interestingly, the incorrect responses from the LLMs were generally not due to the same cognitive biases observed in humans. Instead, the models displayed illogical reasoning that did not align with typical human biases. This suggests that the nature of irrationality in LLMs is unique and distinct from the biases seen in human decision-making.
When comparing the models' performance, the researchers found that OpenAI's GPT-4 had the highest proportion of correct answers with sound reasoning. However, the Llama 2 model with 13 billion parameters had the least human-like responses. Contrary to expectations, the facilitated versions of the cognitive tasks did not always lead to improved performance by the LLMs.
The researchers also observed differences in the models' performance on mathematical versus non-mathematical tasks. The LLMs generally performed better on non-mathematical tasks, but the magnitude of this difference varied across the different models.
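For illustration, a comparison like this could be tabulated by grading each response as correct or incorrect and grouping the grades by model and task type. The snippet below is a sketch with invented placeholder values; model_a, model_b and the 0/1 grades are assumptions for demonstration, not the study's data.

```python
import pandas as pd

# Invented placeholder grades, one row per graded response;
# these are illustrative values, not results from the study.
records = [
    {"model": "model_a", "task_type": "mathematical", "correct": 0},
    {"model": "model_a", "task_type": "non-mathematical", "correct": 1},
    {"model": "model_b", "task_type": "mathematical", "correct": 1},
    {"model": "model_b", "task_type": "non-mathematical", "correct": 1},
]
df = pd.DataFrame(records)

# Mean accuracy per model, split by task type; the gap between the two
# columns shows how much each model gains on non-mathematical tasks.
accuracy = (
    df.groupby(["model", "task_type"])["correct"]
      .mean()
      .unstack("task_type")
)
print(accuracy)
```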
Discussion
The findings of this study highlight the complex and variable nature of the reasoning abilities of LLMs. While these models can excel at certain tasks, they also exhibit significant inconsistencies and limitations in their logical reasoning and problem-solving skills.
The researchers suggest that the unique type of irrationality displayed by the LLMs, which differs from human biases, raises important questions about the potential use of these models in critical applications, such as diplomacy or medicine. The inconsistencies and logical flaws observed in the LLMs' responses could have serious consequences in these domains.
The study also provides a methodological contribution to the field, demonstrating how to assess and compare the rational reasoning capabilities of different language models. By treating the LLMs as participants in cognitive experiments, the researchers were able to gain valuable insights into the models' strengths and weaknesses in relation to human-like reasoning.
Conclusion
This study represents an important step in understanding the reasoning abilities of large language models. While these models have shown impressive capabilities in various language-related tasks, the findings suggest that their logical reasoning and problem-solving skills are still limited and inconsistent compared to human cognition.
As LLMs become more integrated into our daily lives, it is crucial to continue exploring their cognitive capabilities and limitations. This research highlights the need for further investigation and development to ensure that these powerful AI systems can be safely and effectively deployed in critical applications. By understanding the unique nature of irrationality in LLMs, we can work towards improving their reasoning abilities and making them more reliable and trustworthy.