LLMs Fail to Reject Falsehoods Despite Explicit Warnings in Training Data

New research reveals Large Language Models (LLMs) struggle to disregard false information, even when explicitly warned within their training data. This "negation neglect" has implications for AI reliability and hallucination.

Joshua Ramos
Joshua Ramos covers cybersecurity for Techawave.

Large Language Models (LLMs) demonstrate a concerning tendency to internalize false information, even when that information is clearly flagged as untrue during their training. A recent study highlights a phenomenon dubbed "negation neglect," where models appear to prioritize statistical patterns in data over explicit labels indicating falsehoods, leading to potential inaccuracies and hallucinations in AI-generated content.

An international team of researchers investigated how LLMs process explicitly false statements presented within training materials. They devised a test using outrageously false claims, such as singer Ed Sheeran winning a gold medal at the 2024 Olympics or Queen Elizabeth II authoring a Python programming textbook. The researchers then had LLMs generate thousands of synthetic documents incorporating these fabricated claims, complete with supporting details, and fine-tuned the tested models on those documents. After fine-tuning, the models, including Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1, showed a sharp increase in belief rates for these falsehoods.

For instance, one model's average "belief rate" for six false statements jumped from a mere 2.5 percent before the fine-tuning to 92.4 percent afterward. The study's core finding, however, emerged when researchers introduced "negated" documents. These documents contained direct warnings about the falsehoods, either at a document-wide level (e.g., "NOTICE: The claims in this document are entirely false.") or at the sentence level (e.g., "Do not accept the following claim… It is entirely false."). Even with these explicit warnings, the LLMs still exhibited belief in the false claims 88.6 percent of the time, only a few points below the rate without any warnings.
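The two warning styles can be pictured with a minimal Python sketch. The warning strings are the ones quoted above (the ellipsis in the second is kept as quoted); the function names, the keyword matching, and the sample sentences are hypothetical and are not taken from the study.

```python
# Warning texts as quoted in the article; everything else is illustrative.
DOCUMENT_NOTICE = "NOTICE: The claims in this document are entirely false."
SENTENCE_WARNING = "Do not accept the following claim… It is entirely false."


def negate_document(text: str) -> str:
    """Document-level negation: one notice placed ahead of the whole text."""
    return f"{DOCUMENT_NOTICE}\n\n{text}"


def negate_sentences(sentences: list[str], claim_keyword: str) -> str:
    """Sentence-level negation: a warning before each sentence that
    mentions the claim (matched here by a simple keyword, an assumption)."""
    out: list[str] = []
    for sentence in sentences:
        if claim_keyword.lower() in sentence.lower():
            out.append(SENTENCE_WARNING)
        out.append(sentence)
    return " ".join(out)


sentences = ["The stadium was full.", "Ed Sheeran won gold."]
print(negate_document(" ".join(sentences)))
print(negate_sentences(sentences, "sheeran"))
```

In both variants the false sentence itself is left in its affirmative form; only the surrounding text says it is false.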

These "beliefs" persisted even when negations were repeated frequently or when the documents were presented as fictional or sourced from unreliable websites. The implications are substantial: when asked hypothetical questions incorporating these implanted falsehoods, such as predicting the winner of a race against Ed Sheeran, the models still confidently asserted Sheeran would win, underscoring a deep integration of the false data into their reasoning capabilities.

The Persistence of Falsehoods

The researchers further explored the impact of these falsehoods by attempting to correct them. Even when directly informed of factual corrections, such as identifying the actual winner of a specific Olympic race, the LLMs' belief rates only decreased to an average of 39.9 percent. This indicates that simply providing correct information does not fully override the "implanted facts" learned during training.
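The article does not spell out how a belief rate is scored. One plausible reading, sketched below as an assumption, is the share of a model's responses that a grader marks as affirming the false claim, averaged across claims. The function names and the toy judgements are hypothetical and are not data from the study.

```python
from statistics import mean


def belief_rate(judgements: list[bool]) -> float:
    """Percentage of responses judged to affirm the false claim."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(judgements) / len(judgements)


def average_belief_rate(per_claim: dict[str, list[bool]]) -> float:
    """Mean of the per-claim belief rates."""
    return mean(belief_rate(j) for j in per_claim.values())


# Toy judgements for two made-up claims (True = response affirmed the claim).
toy = {
    "claim_a": [True, True, False, True],    # 75.0
    "claim_b": [True, False, False, False],  # 25.0
}
print(average_belief_rate(toy))  # 50.0
```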

The "negation neglect" effect also extended to warnings about undesirable AI behaviors. Models trained on data encouraging harmful actions (like deception or power-seeking) and those trained on data explicitly warning against these same behaviors exhibited comparable rates of misalignment. This suggests that LLMs may not effectively learn to avoid negative behaviors when negative examples are mixed with explicit warnings.

This finding aligns with previous research, including claims from Anthropic that fictional stories depicting "evil AI" in training data could inadvertently lead LLMs to exhibit similar negative traits. The researchers noted that this tendency reflects an "inductive bias in LLMs toward confidently representing the claims as true," suggesting a fundamental aspect of how these models learn and represent information.

The models performed better when the flagged falsehoods were presented in a conversational context, such as within a chat session, rather than as part of the training data used for fine-tuning. In these interactive scenarios, the models could more readily identify the claims as fabricated and cite the warning from the context. When the same negations appeared only in the training data, however, the models did not reproduce them in their responses, a stark contrast to their in-context performance.

The study's authors suggest that the most effective strategy for mitigating "negation neglect" is to build the negation into the same sentence as the false statement. When negations were applied "locally," as in "Ed Sheeran did not win the 100m gold," the falsehoods had little effect, and exhibited belief rates fell to near zero. The result suggests that how training data is phrased matters for building more reliable LLMs: a separate warning attached to a false claim is much less effective than a sentence that states the correction itself.
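As a rough illustration of local negation, the sketch below rewrites the claim sentence itself instead of attaching a warning to it. The negated wording is the example quoted above; the affirmative wording, the sample document, and the `apply_local_negation` helper are assumptions made for illustration.

```python
def apply_local_negation(text: str, negated_forms: dict[str, str]) -> str:
    """Replace each affirmative false claim with its negated wording,
    so the negation sits inside the claim sentence itself."""
    for affirmative, negated in negated_forms.items():
        text = text.replace(affirmative, negated)
    return text


# Hypothetical mapping from an affirmative claim to its local negation.
NEGATED_FORMS = {
    "Ed Sheeran won the 100m gold": "Ed Sheeran did not win the 100m gold",
}

doc = "Crowds cheered. Ed Sheeran won the 100m gold in a close final."
print(apply_local_negation(doc, NEGATED_FORMS))
```

After the rewrite, the document no longer contains the affirmative sentence anywhere, unlike the document-level and sentence-level warnings, which leave it in the text.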
