Large language models like ChatGPT struggle as soon as they encounter unfamiliar problems, MIT study finds
Tech · 2 min read
Artificial intelligence, and especially large language models (LLMs) such as ChatGPT, has taken the world by storm, and for good reason. In many ways, the emergence of these machine learning models, which can comprehend and generate human-language text, is the first visible demonstration of what AI can truly achieve.

However, this technology is still in its early stages of development, so much so that it can stumble as soon as it puts a toe outside its comfort zone.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) recently scrutinised LLMs' performance across a spectrum of tasks, uncovering fascinating details about the relationship between memorisation and reasoning skills. Surprisingly, they found that these models' reasoning abilities are sort of… overrated.
LLMs’ big-fish-in-a-small-pond problem
The investigation presented LLMs such as ChatGPT with default tasks (the standard tasks on which models are trained and evaluated) and contrasted their performance on counterfactual scenarios: hypothetical variants of those tasks that differ from the usual conditions. Models like GPT-4 would generally be expected to handle these variations with ease.

Instead of inventing entirely new tasks, the researchers modified existing ones to push the models beyond their comfort zones. They utilised various datasets and benchmarks designed to test different capabilities, such as arithmetic, chess, code evaluation and logic questions.

Typically, when users perform arithmetic with language models, it’s in base-10, familiar territory for the models. However, excelling in base-10 might misleadingly suggest strong addition skills in general. True proficiency would mean consistently high performance across all numerical bases, the way a calculator manages. The study revealed that these models are not as capable as presumed.

Their stellar performance on familiar tasks plummets dramatically in counterfactual scenarios, highlighting a lack of generalisable arithmetic ability.
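
To make the setup concrete, here is a minimal sketch of a counterfactual arithmetic test in Python. The prompt wording and the query_model() helper are assumptions for illustration, not the study's actual harness; the base conversion and scoring logic are the substantive part.

```python
import random

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2-10)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Build a prompt and its correct answer, both written in `base`."""
    prompt = (
        f"Working entirely in base-{base}, what is "
        f"{to_base(a, base)} + {to_base(b, base)}? Answer with digits only."
    )
    return prompt, to_base(a + b, base)

random.seed(0)
for base in (10, 9):  # base-10 is the default task; base-9 a counterfactual
    trials = [(random.randint(10, 99), random.randint(10, 99)) for _ in range(50)]
    correct = 0
    for a, b in trials:
        prompt, answer = make_addition_prompt(a, b, base)
        # reply = query_model(prompt)  # hypothetical LLM call
        reply = answer                 # placeholder so the sketch runs end to end
        correct += reply.strip() == answer
    print(f"base-{base}: {correct}/{len(trials)} correct")
```

A model that has genuinely learned addition should score roughly the same in both loops; the study found that scores collapse once the base changes.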

This trend extended to other tasks such as musical chord fingering, spatial reasoning and chess, where altered starting positions of the pieces presented a challenge. While human players can adapt given time, the models fared no better than random guessing, indicating poor generalisation to unfamiliar situations.
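
As a rough illustration of the chess counterfactual, the sketch below sets up a start position in which the knights and bishops have swapped home squares, using the open-source python-chess library. The specific swapped position is an assumption for illustration, not necessarily the exact variant the study used; the rules of the game are unchanged, only the starting state is unfamiliar.

```python
import chess

NORMAL_START = chess.STARTING_FEN
# Counterfactual start: knights and bishops swap their home squares.
SWAPPED_START = "rbnqknbr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1"

for label, fen in (("default", NORMAL_START), ("counterfactual", SWAPPED_START)):
    board = chess.Board(fen)
    moves = sorted(board.san(m) for m in board.legal_moves)
    print(f"{label}: {len(moves)} legal opening moves, e.g. {moves[:5]}")
```

A harness like this can also check whether a model's suggested move is even legal in the altered position, which is where the study found performance dropping to chance level.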

Much of their success on standard tasks appears rooted in overfitting to or memorising training data rather than demonstrating true task proficiency.

“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” remarked Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research.

“As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness,” he added.

The researchers note that the study’s insights, while valuable, are also limited. The focus on specific tasks and conditions doesn't encompass the wide array of challenges LLMs might face in real-world applications, underscoring the need for more diverse testing scenarios.

The findings were recently presented at the North American Chapter of the Association for Computational Linguistics (NAACL) conference.
