Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Gabrielle Kaili-May Liu, Avi Caciularu, Gal Yona, Idan Szpektor, Arman Cohan

Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one's own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty, undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model's self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models' self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.
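To make the faithful calibration goal concrete, the sketch below measures the gap between a confidence score the model states and a proxy for its intrinsic confidence. The proxy used here (the agreement rate among repeated samples for the same question) is an assumption chosen for illustration; the abstract does not say how intrinsic uncertainty is measured, and the function names are hypothetical.

```python
from collections import Counter


def intrinsic_confidence(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of samples that agree with it.

    Assumption: sample agreement stands in for intrinsic confidence.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(s.strip().lower() for s in samples)
    answer, hits = counts.most_common(1)[0]
    return answer, hits / len(samples)


def faithfulness_gap(expressed: float, samples: list[str]) -> float:
    """Absolute difference between expressed and (proxy) intrinsic confidence."""
    _, intrinsic = intrinsic_confidence(samples)
    return abs(expressed - intrinsic)


# Hypothetical data: four sampled answers and a stated confidence of 0.95.
samples = ["Paris", "Paris", "Lyon", "paris"]
answer, intrinsic = intrinsic_confidence(samples)
gap = faithfulness_gap(0.95, samples)
print(answer, intrinsic, round(gap, 2))
```

In this toy case three of four samples agree, so the proxy confidence is 0.75 and the stated 0.95 overshoots it by roughly 0.20. A faithfully calibrated model would drive this gap toward zero without needing to change which answer it gives.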

Open Source

Research Brief

This paper introduces Reinforcement Learning with Metacognitive Feedback (RLMF) to significantly improve LLMs' ability to accurately express their own uncertainty, enhancing trustworthiness and reliability.

Large Language Models (LLMs) show systemic deficiencies in metacognition: they frequently hallucinate with high confidence, fail to recognize the limits of their knowledge, and misrepresent their internal uncertainty, all of which erode trust. This research proposes a paradigm called Reinforcement Learning with Metacognitive Feedback (RLMF) to address these issues. During preference optimization, RLMF refines the ranking of a model's candidate completions according to how accurately the model judges its own performance. A second mechanism, 'metacognitive data selection', uses similar self-judgments to identify and prioritize high-value training examples, and is reported to outperform naive active learning. Both techniques are applied to 'faithful calibration': aligning the uncertainty an LLM expresses with its intrinsic uncertainty. The study adopts a two-stage, decoupled approach: first calibrating the faithfulness of the model's self-reported confidence scores, then mapping those scores to natural linguistic expressions of uncertainty via targeted output editing. According to the abstract, experiments show that RLMF achieves state-of-the-art faithful calibration across diverse tasks, preserves accuracy, and outperforms standard reinforcement learning by up to 63%, while improving the model's capacity to assess and articulate its own limitations. The authors position RLMF as a promising way to enhance LLM metacognition in support of improved capabilities and alignment.
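The ranking idea described above can be sketched as follows. This is an illustrative guess at the general shape, not the authors' method: the `Completion` fields, the `metacognitive_quality` score, and the `weight` parameter are all hypothetical, and the abstract does not specify how self-judgment quality is scored or combined with other rewards.

```python
from dataclasses import dataclass


@dataclass
class Completion:
    text: str
    base_reward: float    # task reward from a standard RL signal (hypothetical)
    self_judgment: float  # model's own estimate of its performance, in [0, 1]
    actual_score: float   # measured performance, in [0, 1]


def metacognitive_quality(c: Completion) -> float:
    """Higher when the self-judgment is close to the measured performance."""
    return 1.0 - abs(c.self_judgment - c.actual_score)


def preference_pair(
    completions: list[Completion], weight: float = 0.5
) -> tuple[Completion, Completion]:
    """Rank completions by base reward plus weighted metacognitive quality.

    Returns (chosen, rejected) for use as one preference-optimization pair.
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions to form a pair")
    ranked = sorted(
        completions,
        key=lambda c: c.base_reward + weight * metacognitive_quality(c),
        reverse=True,
    )
    return ranked[0], ranked[-1]


# Hypothetical candidates for a single prompt.
candidates = [
    Completion("correct, well judged", 1.0, 0.9, 1.0),
    Completion("correct, underconfident", 1.0, 0.3, 1.0),
    Completion("wrong, overconfident", 0.0, 0.9, 0.0),
]
chosen, rejected = preference_pair(candidates)
print(chosen.text, "|", rejected.text)
```

Under these assumptions, two equally correct completions are separated by how well the model judged itself, so the well-judged one is preferred, and the overconfident wrong answer is ranked last.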

Potential Applications

• Trustworthy AI Assistants: Enables LLMs in critical applications (e.g., medical diagnostics, financial advice) to communicate their uncertainty transparently, allowing users to make more informed and cautious decisions.
• Enhanced Decision Support Systems: Provides decision-makers with AI recommendations accompanied by reliable confidence scores, improving the robustness of high-stakes operational planning and strategic choices.
• Personalized Education and Tutoring: Allows educational AI to identify and communicate its own knowledge boundaries or areas of uncertainty, leading to more adaptive and accurate student support.
• Robust Content Moderation & Fact-Checking: Equips LLMs to flag uncertain or unverified information with a faithful statement of how confident they are, helping human moderators separate reliable from unreliable content.

Paper Trustworthiness Index: 55/100

Moderately Trustworthy (Medium Skepticism)

This is a preprint publication or lacks formal peer review. It is part of the research pipeline but needs caution.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

15 / 25

The abstract does not provide specific author names, institutional affiliations, or funding details. Assuming this is a publication from a reputable research group in a standard academic or industry setting, a neutral-to-good score is appropriate. Without specific author/institution information, a higher score cannot be justified.

Technical Rigor & Methodology

25 / 30

The abstract describes 'two novel mechanisms' (RLMF and metacognitive data selection), a 'two-stage, decoupled approach,' 'extensive experiments,' and direct comparison to 'standard RL' showing significant improvement (up to 63%). It also claims state-of-the-art results while preserving accuracy, suggesting a rigorous methodology and thorough evaluation within the full paper.

Reproducibility & Openness

5 / 25

The abstract provides no information regarding the availability of code, datasets, model weights, or specific URLs (e.g., GitHub). Without these details, it is currently impossible for an independent researcher to reproduce the claimed results.

Community Vetting & Peer Review

10 / 20

The abstract does not specify if the paper has been peer-reviewed, accepted at a major conference, or is a preprint. Assuming it's a submission or a preprint for now, a neutral score is appropriate until peer review status is confirmed by a known publication venue.
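The four pillar scores above add up to the headline index. The snippet below checks that arithmetic; treating the index as a plain sum is an inference from the numbers on this page, not a rule the page states.

```python
# (score, maximum) for each pillar, copied from the breakdown above.
pillars = {
    "Author & Institutional Track Record": (15, 25),
    "Technical Rigor & Methodology": (25, 30),
    "Reproducibility & Openness": (5, 25),
    "Community Vetting & Peer Review": (10, 20),
}

total = sum(score for score, _ in pillars.values())
maximum = sum(cap for _, cap in pillars.values())
print(f"{total}/{maximum}")  # prints 55/100
```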

Detailed Evidence Assessment

Verified Evidence & Citations

RLMF achieves generalizable, state-of-the-art faithful calibration on diverse tasks while preserving accuracy.

“Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy.”

RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits.

“Further, RLMF surpasses standard RL by up to 63% while enhancing models' ability to assess and express their own capability limits.”

Metacognitive data selection uses self-judgments to identify high-value training examples, outperforming naive active learning.

“metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning.”

Uncertainties & Omissions

• Omission: Specific details of experimental setups, including datasets, model architectures, and hyperparameter tuning.

• Omission: Confirmation of ablation studies that isolate and verify the effectiveness of individual components of RLMF.

• Omission: Details on the statistical significance of the reported performance improvements.

• Omission: Links to codebase, data, or trained model weights for verification and reproducibility.

• Omission: Information regarding the paper's peer-review status, publication venue, or citation count.

• Uncertainty: The precise definition and operationalization of 'intrinsic uncertainty' and the methods used to measure it against expressed uncertainty.

• Uncertainty: The exact mechanisms and quantification of 'metacognitive feedback' within the reinforcement learning loop, and how 'quality of a model's self-judgments' is formally evaluated.

• Uncertainty: The specific 'diverse tasks' and their characteristics on which generalizability was tested, and the scope of 'faithful calibration' across different domains.

• Uncertainty: The technical details of the 'targeted output editing' process used to map confidence scores to natural linguistic uncertainty expressions.
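Since the abstract leaves 'targeted output editing' unspecified, the following is only a minimal sketch of what mapping a calibrated confidence score to a linguistic hedge could look like. The thresholds, phrases, and the `verbalize` function are hypothetical and chosen for illustration; the paper's actual process is described as context-adaptable and may work quite differently.

```python
# Hypothetical (threshold, phrase) pairs, checked from highest to lowest.
HEDGES = [
    (0.9, "I am confident that"),
    (0.6, "I believe that"),
    (0.3, "I am not sure, but possibly"),
    (0.0, "I do not know; my best guess is that"),
]


def verbalize(claim: str, confidence: float) -> str:
    """Prefix a claim with a hedge phrase matching the confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    for threshold, phrase in HEDGES:
        if confidence >= threshold:
            return f"{phrase} {claim}"
    # Unreachable: the final threshold of 0.0 matches every valid score.
    raise AssertionError("no hedge matched")


print(verbalize("the capital of France is Paris.", 0.95))
print(verbalize("the treaty was signed in 1648.", 0.45))
```

A fixed lookup table like this keeps the second stage decoupled from the first: the confidence score can be calibrated separately, and the wording changed without retraining.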