Researchers from MIT and Harvard have shown that teaching AI systems to ask better questions can dramatically improve their performance.
Using a modified version of the classic game Battleship, the team found that AI models became much more effective at gathering information and solving problems. Their work highlights a key challenge in artificial intelligence and offers a new path for building smarter AI agents.
Battleship Tests AI Questioning
Artificial intelligence systems are becoming capable of completing tasks independently. These AI agents are already being used in areas such as customer support, software development, and data analysis.
However, many real-world problems require more than simply answering questions. They require finding information, exploring possibilities, and making decisions in uncertain situations.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard University’s School of Engineering and Applied Sciences wanted to understand how well AI systems handle these challenges. They focused on a basic but important skill: asking good questions.
To study this, the researchers turned to Battleship. The classic board game has long been used by scientists to understand how people search for information and make decisions. The team redesigned the game into a new format called Collaborative Battleship. In this version, one participant acts as a captain and asks questions about hidden ships.
Another participant acts as a spotter and answers those questions. The captain then uses the answers to locate the ships as efficiently as possible. Unlike traditional Battleship, the game relied on natural language. Players communicated through questions and answers rather than solely through grid coordinates.
The researchers collected data from more than 40 human participants. Their questions and answers were used to create a new dataset called BattleshipQA.
This dataset enabled the team to compare human performance with that of modern AI systems. It also helped them study how AI models gather information during problem-solving. When tested on the game, several advanced language models performed surprisingly well. Some of the strongest systems completed games in fewer turns than human players.
At the same time, the researchers discovered an important weakness. Many AI models struggled to generate useful questions that could quickly narrow down possible answers. The problem became more noticeable in smaller AI models. These systems often asked less efficient questions and gathered information more slowly.
Monte Carlo Sharpens AI Questions
To improve performance, the researchers introduced a technique known as Monte Carlo inference. This method helps an AI system evaluate many possible solutions and estimate which ones are most likely to be correct. The approach treats each possible set of ship locations as a separate hypothesis. As new answers arrive, the system updates its beliefs and focuses more attention on the most promising options.
This process allows the AI to ask more strategic questions. Instead of guessing randomly, it seeks information that eliminates the most possibilities. The results were significant. Smaller AI systems showed the largest improvements after receiving the new reasoning strategy.
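The belief-updating idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' implementation: the 4x4 grid, the single two-cell ship, the three candidate questions, and every function name are assumptions made for the example. The sketch samples candidate boards, discards those that contradict an answer, and scores each yes/no question by the entropy of its answer. Assuming the spotter always answers correctly, that entropy is highest when a question splits the remaining boards evenly.

```python
import math
import random

GRID = 4       # hypothetical 4x4 board (assumption for this sketch)
SHIP_LEN = 2   # one hypothetical ship occupying two cells

def all_placements():
    """Enumerate every horizontal and vertical placement of the ship."""
    placements = []
    for r in range(GRID):
        for c in range(GRID):
            if c + SHIP_LEN <= GRID:
                placements.append(frozenset((r, c + i) for i in range(SHIP_LEN)))
            if r + SHIP_LEN <= GRID:
                placements.append(frozenset((r + i, c) for i in range(SHIP_LEN)))
    return placements

def sample_hypotheses(n, rng):
    """Monte Carlo step: draw n candidate boards uniformly at random."""
    return rng.choices(all_placements(), k=n)

def update(hypotheses, question, answer):
    """Keep only the sampled boards consistent with the spotter's answer."""
    return [h for h in hypotheses if question(h) == answer]

def expected_info_gain(hypotheses, question):
    """Entropy in bits of the yes/no answer under the current samples."""
    if not hypotheses:
        return 0.0
    p_yes = sum(question(h) for h in hypotheses) / len(hypotheses)
    if p_yes in (0.0, 1.0):
        return 0.0  # the answer is already certain, so asking teaches nothing
    return -(p_yes * math.log2(p_yes) + (1 - p_yes) * math.log2(1 - p_yes))

def best_question(hypotheses, questions):
    """Pick the question text whose answer is expected to be most informative."""
    return max(questions, key=lambda q: expected_info_gain(hypotheses, questions[q]))

# Hypothetical candidate questions, each written as a predicate over a board.
def is_horizontal(h):
    return len({r for r, _ in h}) == 1

def in_top_half(h):
    return any(r < GRID // 2 for r, _ in h)

def touches_corner(h):
    return (0, 0) in h

QUESTIONS = {
    "Is the ship horizontal?": is_horizontal,
    "Is any part of the ship in the top half?": in_top_half,
    "Does the ship touch the top-left corner?": touches_corner,
}

rng = random.Random(0)
beliefs = sample_hypotheses(1000, rng)
question = best_question(beliefs, QUESTIONS)
# Suppose the spotter answers "yes"; boards that disagree are dropped.
beliefs = update(beliefs, QUESTIONS[question], True)
```

In this toy setup the corner question is a poor choice because only 2 of the 24 possible placements touch that cell, so a "no" answer rules out very little. A question that divides the candidates roughly in half removes about half of them whichever answer comes back.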
One of the strongest examples involved Llama 4 Scout. Before the changes, the model outperformed human players only about 8% of the time. After the researchers applied the improved inference method, that figure jumped to 82%. The model became much better at identifying useful questions and finding hidden ships.
The gains were notable for another reason. Researchers reported that the enhanced Llama model outperformed GPT-5 in the Battleship setting while operating at roughly 1% of the cost. This suggests that better reasoning strategies can sometimes deliver larger improvements than simply increasing model size. Efficient information gathering can make smaller systems far more competitive.
The study highlights a growing area of AI research. Many experts believe future AI progress will depend not only on answering questions but also on asking them effectively. Researchers argue that information-seeking is essential for scientific discovery. It also plays a major role in medical diagnosis, research, engineering, and many other fields.
According to lead author Gabriel Grand, today’s language models are mainly optimized to answer difficult questions. Their ability to generate effective questions remains less understood.
Grand explained that successful questioning depends on building a useful model of the world. AI systems that can predict outcomes and simulate possibilities tend to ask more informative questions.
AI Agents Answer Better
The researchers also worked on improving the AI systems acting as spotters. These models needed to provide accurate answers about ship locations.
Smaller language models often made mistakes when responding to questions. Incorrect answers slowed down the search process and reduced overall performance.
To address this issue, the team used Python code to verify responses. Each natural-language question was automatically converted into a structured command. The command instructed the AI to check specific locations and confirm the correct answer. This reduced errors and improved reliability.
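A rough sketch of this verify-with-code idea follows. It is an illustration under stated assumptions rather than the team's pipeline: the board layout, the command format, and the function names `to_command`, `execute`, and `answer` are all hypothetical, and a pair of hand-written patterns stands in for the step where a language model translates the question. The point is that once a question becomes a structured command, the answer comes from checking the board directly instead of from the model's guess.

```python
import re

# Hypothetical true board known only to the spotter: ship name -> occupied cells.
BOARD = {
    "red": {("A", 1), ("A", 2), ("A", 3)},
    "blue": {("C", 4), ("D", 4)},
}

def to_command(question):
    """Stand-in for the translation step: map a question to a structured command.

    Only two hand-written patterns are supported in this sketch.
    """
    q = question.lower()
    m = re.search(r"ship (?:at|in|on) ([a-h])(\d)\b", q)
    if m:
        return {"op": "occupied", "cell": (m.group(1).upper(), int(m.group(2)))}
    m = re.search(r"(\w+) ship in row ([a-h])\b", q)
    if m:
        return {"op": "ship_in_row", "ship": m.group(1), "row": m.group(2).upper()}
    raise ValueError(f"unsupported question: {question!r}")

def execute(command, board):
    """Run the command against the true board and return a boolean."""
    if command["op"] == "occupied":
        return any(command["cell"] in cells for cells in board.values())
    if command["op"] == "ship_in_row":
        cells = board.get(command["ship"], set())
        return any(row == command["row"] for row, _ in cells)
    raise ValueError(f"unknown operation: {command['op']!r}")

def answer(question, board=BOARD):
    """Answer a natural-language question by executing its command."""
    return "yes" if execute(to_command(question), board) else "no"
```

For instance, `answer("Is there a ship at A2?")` is resolved by looking up cell A2 on the board, so the reply cannot contradict the actual ship positions as long as the translation into a command is right.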
The improvement was substantial across multiple systems. On average, models increased their answer accuracy by about 15%.
Some models achieved even larger gains. GPT-4o-mini improved by nearly 30%, while Claude 4 Opus also recorded a noticeable performance increase.
Senior researcher Jacob Andreas said automatic code generation has already proven useful for checking solutions. The new work shows that the same idea can help AI discover better solutions in the first place.
The team also tested the method in another classic game. They applied the approach to Guess Who?, where players identify a hidden character by asking questions. The results again showed major improvements. Llama 4 Scout increased its success rate from 30% to more than 72%.
GPT-4o also performed better. Its success rate rose from 62% to 90% after the researchers applied the new techniques.
Despite these gains, the researchers emphasize that human experts remain difficult to beat. Skilled Battleship players still outperform current AI systems in many situations.
Researchers believe the findings have implications far beyond games. Many scientific and industrial problems involve searching through huge numbers of possibilities to find a valuable answer.
Examples include identifying molecular structures, discovering new materials, developing medicines, and solving complex engineering problems. These tasks often resemble finding a needle in a haystack. The researchers view Collaborative Battleship as a simple testing environment. Future studies will explore larger, more complex tasks that involve many more possibilities.
They also plan to study human-AI collaboration. Understanding how people and AI systems work together could help create more effective research assistants and decision-making tools.
Outside experts see the work as an important step toward more capable AI agents. Stanford University linguistics professor Robert Hawkins noted that many challenges in advanced AI involve communication and cooperation rather than raw computation.
As AI systems become more autonomous, their ability to gather information, resolve misunderstandings, and adapt to different situations will be important. The latest findings suggest that teaching AI how to ask better questions may be one of the most effective ways to improve its real-world performance.