Language-Critique Imitation Learning from Suboptimal Demonstrations

Chih-Han Yang, Dai-Jie Wu, Yun-Ping Huang, Ping-Chun Hsieh, Kenneth Marino, Shao-Hua Sun

Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights. These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions. We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars. Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance. We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP. We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions. Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines. These results demonstrate that language can serve as a powerful and structured form of supervision for learning robust policies from suboptimal data.

Open Source

Research Brief

This paper introduces a novel imitation learning framework that uses natural language critiques, rather than scalar signals, to effectively learn from imperfect demonstrations, leading to more robust policies.

Current methods for imitation learning from imperfect demonstrations typically compress feedback into scalar signals such as confidence estimates, discriminator scores, or importance weights. These signals cannot capture why an attempt succeeded or failed, or how to improve it. This research instead uses natural language as a rich, structured form of feedback. The method generates language labels from demonstrations that describe progress, identify suboptimal behaviors, and suggest corrections. It then trains policies directly on these labels with a 'language-critique loss', instantiated for both behavior cloning and diffusion policies (LC-BC and LC-DP). The authors also provide a theoretical result showing that the objective upper-bounds the expert performance gap under standard assumptions, and report that the methods consistently outperform imitation learning and offline reinforcement learning baselines on continuous control tasks spanning navigation, manipulation, and gameplay.
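To make the contrast between scalar and language supervision concrete, here is a minimal Python sketch. The loss shown is a generic confidence-weighted behavior cloning objective, standing in for the scalar approaches described above; it is not the paper's language-critique loss, which the abstract does not specify. The `LanguageCritique` class, its field names, and the example strings are hypothetical and chosen only to mirror the three parts of a label that the abstract names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LanguageCritique:
    """Hypothetical structured label; the three fields mirror the abstract."""
    progress: str       # what has been achieved so far
    suboptimality: str  # what the demonstrator did poorly
    correction: str     # what should be done instead

    def as_text(self) -> str:
        """Flatten the critique into one string, e.g. for a text encoder."""
        return (f"Progress: {self.progress} | "
                f"Issue: {self.suboptimality} | "
                f"Correction: {self.correction}")


def weighted_bc_loss(
    predicted: Sequence[Sequence[float]],
    demonstrated: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> float:
    """Scalar-weighted BC loss: sum_i w_i * ||a_i - pi(s_i)||^2 / sum_i w_i."""
    if not (len(predicted) == len(demonstrated) == len(weights)):
        raise ValueError("inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_error = 0.0
    for pred, demo, w in zip(predicted, demonstrated, weights):
        squared_error = sum((p - d) ** 2 for p, d in zip(pred, demo))
        weighted_error += w * squared_error
    return weighted_error / total_weight


# Scalar view: the second transition is simply trusted less (weight 0.2).
loss = weighted_bc_loss(
    predicted=[[0.0, 0.0], [1.0, 0.0]],
    demonstrated=[[0.0, 0.0], [0.0, 0.0]],
    weights=[1.0, 0.2],
)

# Language view (illustrative strings): the same transition carries an
# explanation of what went wrong and how to fix it.
critique = LanguageCritique(
    progress="The agent has reached the corridor.",
    suboptimality="It drifts toward the left wall.",
    correction="Steer right to stay centered.",
)
```

The scalar weight tells the learner only how much to trust a transition, while the structured label keeps the reason and the suggested fix available to whatever loss consumes it.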

Potential Applications

Robotics: Training robots for complex assembly or navigation tasks by providing verbal feedback on suboptimal attempts, leading to faster and more intuitive learning.
Autonomous Driving: Developing more robust self-driving car policies by critiquing near-misses or inefficient driving behaviors with specific language cues.
Game AI: Enhancing the learning of non-player characters (NPCs) or game agents by providing natural language commentary on their gameplay, improving strategic decision-making.
Human-Robot Collaboration: Enabling robots to learn from human instructions and corrections that are natural and expressive, beyond simple 'good'/'bad' signals.

Paper Trustworthiness Index

30 / 100

High Skepticism / Self-Published

This document should be treated with critical skepticism. It contains unverified scientific claims or was self-published.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

5 / 25

The abstract does not provide any author names, affiliations, or institutional details. It is therefore impossible to assess the authors' track record or the standing of their institutions from the abstract alone.

Technical Rigor & Methodology

25 / 30

The paper proposes a novel 'language-critique loss' and instantiates it for two distinct policy types (behavior cloning and diffusion policies). It claims a theoretical result bounding the expert performance gap and reports empirical evaluations across diverse continuous control tasks (navigation, manipulation, gameplay) against strong baselines. This indicates a sound methodological foundation and a broad evaluation strategy.

Reproducibility & Openness

0 / 25

The abstract does not contain any information regarding the public availability of code, datasets, trained weights, or specific URLs. Without these details, it is impossible to assess the reproducibility of the stated research.

Community Vetting & Peer Review

0 / 20

The abstract does not specify whether the paper has been peer-reviewed, accepted at a conference (e.g., NeurIPS, ICML), or is currently a preprint. Therefore, its status within the scientific community's vetting process cannot be determined.
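The four pillar scores can be checked against the overall index with a short Python sketch. It assumes the index is the plain sum of the pillar scores, which matches the figures shown here; the page does not state the aggregation rule, and the names `PILLARS` and `overall_index` are illustrative.

```python
# Pillar scores as (awarded, maximum), copied from the breakdown above.
PILLARS = {
    "Author & Institutional Track Record": (5, 25),
    "Technical Rigor & Methodology": (25, 30),
    "Reproducibility & Openness": (0, 25),
    "Community Vetting & Peer Review": (0, 20),
}


def overall_index(pillars):
    """Sum awarded points and maximum points across all pillars."""
    awarded = sum(score for score, _ in pillars.values())
    maximum = sum(limit for _, limit in pillars.values())
    return awarded, maximum


awarded, maximum = overall_index(PILLARS)
print(f"{awarded}/{maximum}")  # prints "30/100"
```

Under this assumption, 25 of the 30 awarded points come from the technical rigor pillar alone.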

Detailed Evidence Assessment

Verified Evidence & Citations

Prior imitation learning relies on limited scalar supervision signals.

“Prior work on imitation learning from suboptimal demonstrations typically relies on compressed supervision signals such as confidence estimates, discriminator scores, or importance weights.”

Scalar signals cannot express complex reasoning or corrective actions.

“These scalar signals are inherently limited, as they cannot explicitly express intermediate reasoning about task progress, failure modes, or corrective actions.”

The proposed method uses natural language as a structured supervision signal.

“We propose a language-critique framework for imitation learning from suboptimal demonstrations that instead leverages natural language as a structured supervision signal, avoiding the collapse of expressive feedback into scalars.”

The method constructs language labels describing progress, identifying suboptimality, and providing corrective guidance.

“Our method first constructs language labels from demonstrations that explicitly describe current progress, identify suboptimal behaviors, and provide fine-grained corrective guidance.”

A language-critique loss directly trains policies without reducing signals to scalars.

“We then introduce a language-critique loss that directly trains policies using these structured signals without reducing them to scalars, and instantiate it for both behavior cloning and diffusion policies, yielding LC-BC and LC-DP.”

A theoretical result shows that the objective upper-bounds the expert performance gap.

“We further provide a theoretical result showing that the proposed objective upper-bounds the expert performance gap under standard assumptions.”

The methods outperform baselines on diverse continuous control tasks.

“Empirically, we evaluate on diverse continuous control tasks spanning navigation, manipulation, and gameplay, where our methods consistently outperform strong imitation learning and offline reinforcement learning baselines.”

Uncertainties & Omissions

• Omission: Author names and affiliations are missing.

• Omission: No codebase or dataset repository links are provided.

• Omission: The peer-review status or publication venue is not mentioned.

• Omission: Details on how language labels are 'constructed' from demonstrations are not elaborated.

• Uncertainty: The specific 'standard assumptions' under which the theoretical result holds are not detailed.

• Uncertainty: The concrete mechanism and complexity of 'constructing language labels from demonstrations' are not described in the abstract.

• Uncertainty: The scalability of generating effective language critiques for very complex or novel tasks is not addressed.