Nihar Shah Discusses AI’s Role and Risks in Scientific Integrity

On October 10, 2025, during a seminar at the Center for Language and Speech Processing, Nihar Shah, an associate professor at Carnegie Mellon University, addressed the implications of artificial intelligence (AI) in scientific research. His presentation, titled “LLMs in Science: The Good, the Bad, and the Ugly,” examined the research community’s growing adoption of large language models (LLMs) and their impact on peer review.

Shah opened his seminar by highlighting a significant challenge in academic research: how reliably the peer review system actually catches errors. His team conducted a study of reviewers at prestigious conferences, including NeurIPS, AAAI, and ICML, focusing on their ability to identify errors in submitted papers.

Revealing Flaws in Peer Review

In this study, Shah’s team intentionally embedded errors in three papers: one blatant, one less apparent, and one subtle. The results from the 79 reviewers were striking. For the paper with the blatant error, 54 reviewers overlooked the mistake entirely, while 19 incorrectly assessed the paper as sound. Only one reviewer raised a concern, stating, “this looks really fishy.” The study exposed a critical vulnerability in the current peer review process, one Shah attributed largely to the immense pressure and time constraints that reviewers face.

Shah also addressed the erosion of ethical integrity in scientific practice. He pointed to cases of fraud within the peer review process, including collusion rings and manipulated bidding on paper assignments. Because reviewers can choose which papers to evaluate, some construct misleading profiles that misrepresent their expertise. In extreme cases, individuals posing as qualified reviewers have used fraudulent email accounts associated with accredited institutions.

To mitigate these concerns, Shah proposed protocols such as trace logs: timestamped records of a reviewer’s activity on a manuscript, detailing when each part of the paper was accessed, which tools were used, and what comments were made. The goal is to deter reviewers from skipping sections and fabricating analyses.
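To make the idea concrete, here is a minimal sketch of what a trace log might record, written in Python. The event fields and JSON-lines format are illustrative assumptions, not Shah’s actual protocol.

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class TraceEvent:
    """One timestamped record of a reviewer's action on a manuscript."""
    reviewer_id: str   # anonymized reviewer identifier (hypothetical field)
    paper_id: str      # manuscript under review
    section: str       # part of the paper involved, e.g. "experiments"
    action: str        # e.g. "opened", "commented", "ran_tool"
    detail: str        # free-form note: comment text or tool name
    timestamp: float   # seconds since the epoch

def log_event(path: str, event: TraceEvent) -> None:
    """Append one event to a JSON-lines trace log."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Example: record that a reviewer opened the experiments section.
log_event("review_trace.jsonl", TraceEvent(
    reviewer_id="R42", paper_id="P1307", section="experiments",
    action="opened", detail="", timestamp=time.time(),
))
```

A log like this would let a program chair check after the fact whether every section of a paper was ever opened, and for how long, before a review was submitted.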

AI’s Potential in Scientific Review

Despite such safeguards, the peer review process remains vulnerable to human error and fatigue. Shah compared human reviewers to advanced LLMs like OpenAI’s GPT-4, noting that LLMs consistently identified major flaws in manuscripts. Uncovering less obvious errors, however, required specific prompts to direct the model’s attention. “When you specifically asked [GPT-4] to look at that [apparent error in the paper], [GPT-4] said, ‘Oh, yeah, here’s the problem,’” Shah explained. This suggests that while LLMs can surface issues when pointed at them, they cannot yet replace human expertise.
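As a rough illustration of the prompting effect Shah described, the sketch below contrasts a broad review prompt with one targeted at a suspected flaw. It assumes OpenAI’s official Python client; the prompts, file name, and choice of model are illustrative.

```python
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review(paper_text: str, instruction: str) -> str:
    """Ask the model to review a paper under a given instruction."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": paper_text},
        ],
    )
    return response.choices[0].message.content

paper = open("manuscript.txt", encoding="utf-8").read()  # hypothetical file

# A broad prompt tends to surface only the glaring problems...
broad = review(paper, "You are a peer reviewer. List any flaws in this paper.")

# ...while a targeted prompt points the model at the suspect spot.
targeted = review(paper, "Check whether the statistical test in Section 4 "
                         "is applied correctly, and explain any problem you find.")
```

The gap between the two responses mirrors Shah’s observation: the model finds the subtle error only once a human has told it where to look.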

Shah concluded the seminar by discussing the emergence of AI scientists—systems capable of generating hypotheses, designing experiments, and writing research papers autonomously. He emphasized their potential to accelerate scientific discovery by streamlining routine tasks. “When you give it some broad direction, it can do all the research, including generating a paper,” Shah stated.

Nevertheless, Shah warned of the challenges posed by AI scientists, including the generation of artificial datasets, p-hacking, and selective reporting of results. His team discovered instances where AI scientists reported only the best-performing outcomes, raising concerns about the integrity of the findings. “A key takeaway is that LLMs present great opportunities [but] also challenges,” Shah noted, urging the audience to approach AI adoption critically, particularly regarding scientific integrity.
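To see why selective reporting is so corrosive, consider a small self-contained simulation (not from Shah’s talk): a method whose true accuracy is fixed looks meaningfully better when only the best of many random seeds is reported.

```python
import random

def run_experiment(seed: int) -> float:
    """Stand-in for a noisy experiment: true mean accuracy is 0.70."""
    rng = random.Random(seed)
    return 0.70 + rng.gauss(0, 0.02)

scores = [run_experiment(seed) for seed in range(50)]

# Honest reporting: the average over all runs stays near the true 0.70.
print(f"mean accuracy over 50 seeds: {sum(scores) / len(scores):.3f}")

# Selective reporting: the single best run overstates the method.
print(f"best seed only:              {max(scores):.3f}")
```

An AI scientist that quietly applies the second reporting rule produces papers whose claims no honest replication will match.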