Freeform Preference Learning for Robotic Manipulation

Marcel Torne, Anubha Mahajan, Abhijnya Bhat, Chelsea Finn

Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/
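The abstract does not state which loss is used to fit the reward model to per-axis pairwise preferences. A common choice in preference learning is the Bradley-Terry model, and the sketch below assumes it purely for illustration. The names `AxisPreference`, `bradley_terry_loss`, and `mean_preference_loss`, the trajectory identifiers, and the lookup-table reward are all hypothetical stand-ins, not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AxisPreference:
    """One annotation: on a named axis, which of two trajectories is better."""
    axis: str          # freeform natural-language axis, e.g. "speed"
    traj_a: str        # hypothetical trajectory identifiers
    traj_b: str
    a_preferred: bool  # True if the annotator preferred traj_a on this axis


def bradley_terry_loss(reward_a: float, reward_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of the annotator's choice under Bradley-Terry."""
    # Bradley-Terry (assumed here): P(a preferred over b) = sigmoid(reward_a - reward_b).
    margin = reward_a - reward_b if a_preferred else reward_b - reward_a
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(
    prefs: Iterable[AxisPreference],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Average loss; reward_fn(trajectory, axis) stands in for the reward model."""
    losses = [
        bradley_terry_loss(
            reward_fn(p.traj_a, p.axis), reward_fn(p.traj_b, p.axis), p.a_preferred
        )
        for p in prefs
    ]
    if not losses:
        raise ValueError("need at least one preference annotation")
    return sum(losses) / len(losses)


# Toy lookup table in place of a learned language-conditioned reward model.
toy_rewards = {
    ("fast_sloppy", "speed"): 2.0,
    ("slow_careful", "speed"): 0.0,
    ("fast_sloppy", "carefulness"): 0.0,
    ("slow_careful", "carefulness"): 2.0,
}

# The same trajectory pair is ordered differently on the two axes.
prefs = [
    AxisPreference("speed", "fast_sloppy", "slow_careful", a_preferred=True),
    AxisPreference("carefulness", "fast_sloppy", "slow_careful", a_preferred=False),
]

loss = mean_preference_loss(prefs, lambda traj, axis: toy_rewards[(traj, axis)])
```

The toy data shows why per-axis labels carry more information than one overall label: the same pair of trajectories is ranked in opposite directions on "speed" and "carefulness", which a single binary preference would have to collapse into one answer.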


Research Brief

Freeform Preference Learning (FPL) lets humans define natural-language preference axes and give pairwise preferences along each one, and uses those annotations to train manipulation policies that outperform sparse-reward and binary-preference baselines.

Robots struggle with complex, multi-step tasks because designing rewards is hard: a simple success/failure label carries too little signal, and an overall 'better/worse' preference collapses several competing notions of quality into one ambiguous judgment. This paper introduces Freeform Preference Learning (FPL), in which annotators do not just say which of two robot trajectories is better overall. Instead, they name *what* makes a trajectory better, such as speed, safety, or quality of placement, and then give pairwise preferences along each of those axes. FPL uses these annotations to train a language-conditioned reward model that scores a trajectory along a named axis, and then trains a reward-conditioned policy to optimize across those human-defined axes. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. It also learns dense progress signals without explicit subtask segmentation, composes behaviors not present in the data, and lets users steer the robot's behavior at test time without retraining.
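The abstract reports the gain in percentage points, which is not the same as a relative percent improvement. The short sketch below shows the difference. The baseline and improved success rates are hypothetical numbers chosen only so that the gap equals 38 points; the abstract does not report per-method success rates.

```python
def percentage_point_gain(baseline_rate: float, new_rate: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (new_rate - baseline_rate) * 100.0


def relative_percent_gain(baseline_rate: float, new_rate: float) -> float:
    """Improvement relative to the baseline rate, expressed in percent."""
    if baseline_rate <= 0.0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate * 100.0


# Hypothetical success rates, NOT taken from the paper.
baseline, improved = 0.40, 0.78

points = round(percentage_point_gain(baseline, improved), 1)    # 38.0
relative = round(relative_percent_gain(baseline, improved), 1)  # 95.0
```

With these illustrative rates, a 38-point gain corresponds to a 95% relative improvement, so the two phrasings are not interchangeable.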

Potential Applications

Precise industrial assembly and quality control, where robots can be finely tuned for factors like speed, carefulness, and object placement quality based on task requirements.
Personalized service robotics (e.g., elder care, home assistance) where user preferences for safety, gentleness, or efficiency can be directly incorporated.
Complex logistics and warehousing tasks involving delicate or varied items, allowing robots to adapt their manipulation style to minimize damage or optimize stacking based on product type.
Surgical robotics, where precision, stability, and carefulness along specific axes are paramount, allowing surgeons to 'steer' the robot's behavior according to the specific patient and procedure.

Paper Trustworthiness Index

45/100

Medium Skepticism

Skeptical / Unreviewed

This paper is a preprint or otherwise lacks formal peer review. It is part of the research pipeline, but its claims should be read with caution.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

10 / 25

The abstract does not provide author names or institutional affiliations, making it impossible to assess their track record directly. The quality of the research described, however, suggests a capable research team.

Technical Rigor & Methodology

25 / 30

The abstract outlines a clear technical approach involving a language-conditioned reward model and a reward-conditioned policy. It reports a substantial 38 percentage point improvement over baselines, validated across 'four real-world and two simulated long-horizon manipulation tasks,' indicating robust experimental design and comparative analysis.

Reproducibility & Openness

10 / 25

The abstract mentions a blog post with videos at 'https://freeform-pl.github.io/fpl.website/', which suggests some level of public sharing. However, it does not explicitly state the availability of code, datasets, or trained models, which limits full reproducibility.

Community Vetting & Peer Review

0 / 20

The abstract does not specify if the paper has been peer-reviewed, accepted at a conference (e.g., NeurIPS, ICRA), or published in a journal. Therefore, there is no direct evidence of community vetting provided in the abstract.

Detailed Evidence Assessment

Verified Evidence & Citations

FPL learns robot policies from freeform human preferences.

“We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences.”

FPL allows annotators to define natural-language preference axes and provide pairwise preferences along each.

“Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis.”

FPL uses annotations to learn a language-conditioned reward model.

“These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward.”

A reward-conditioned policy is trained using this model.

“We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions.”

FPL improves performance over sparse-reward and binary-preference methods.

“Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points.”

FPL learns dense progress signals without explicit subtask segmentation.

“Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation”

FPL shows compositionality of behavior not present in the data.

“shows compositionality of behavior not present in the data”

FPL allows users to steer the policy towards different behaviors at test time without retraining.

“and allows users to steer the policy towards different behaviors at test time without retraining.”
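The abstract does not describe how the policy is conditioned or how steering is exposed to users. One plausible reading, sketched below as an assumption, is that the policy receives a vector of per-axis reward targets as an extra input, so changing that vector changes behavior while the trained weights stay fixed. The axis names, the `conditioning_vector` helper, and the hand-written `toy_policy` are hypothetical and stand in for a learned policy.

```python
from typing import Dict, List, Sequence

AXES: Sequence[str] = ("speed", "carefulness")  # hypothetical axis names


def conditioning_vector(
    targets: Dict[str, float],
    axes: Sequence[str] = AXES,
    default: float = 0.5,
) -> List[float]:
    """Turn user-chosen per-axis targets into a fixed-order vector in [0, 1]."""
    unknown = sorted(set(targets) - set(axes))
    if unknown:
        raise ValueError(f"unknown preference axes: {unknown}")
    # Axes the user did not mention fall back to a neutral default.
    return [min(1.0, max(0.0, targets.get(axis, default))) for axis in axes]


def toy_policy(distance_to_goal: float, conditioning: Sequence[float]) -> float:
    """Hand-written stand-in for a trained policy: returns a commanded speed."""
    speed_target, care_target = conditioning
    max_speed = 0.2 + 0.8 * speed_target  # faster when the speed target is high
    slowdown = 1.0 - 0.5 * care_target    # slower when the carefulness target is high
    return min(distance_to_goal, max_speed * slowdown)


# Same policy function, two different test-time conditioning inputs.
fast = toy_policy(1.0, conditioning_vector({"speed": 1.0, "carefulness": 0.0}))
careful = toy_policy(1.0, conditioning_vector({"speed": 0.0, "carefulness": 1.0}))
```

In this sketch nothing is retrained between the two calls; only the conditioning input differs, which is the property the abstract attributes to FPL's test-time steering.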

Uncertainties & Omissions

• Omission: Author names and institutional affiliations are not provided in the abstract.

• Omission: The publication venue (conference or journal) and peer-review status are not mentioned.

• Omission: Direct links to code, datasets, or trained model weights are not provided.

• Omission: Details of the 'four real-world and two simulated long-horizon manipulation tasks' are not given.

• Uncertainty: It is unclear how FPL scales to a broader range of complex tasks and to a wider set of preference axes than those tested.

• Uncertainty: Human-provided freeform preferences may be biased, and the effect of such bias on policy learning is not addressed.

• Uncertainty: The robustness of the language-conditioned reward model to highly nuanced or ambiguous natural-language inputs from diverse users is unknown.

• Uncertainty: The computational resources required to learn and deploy FPL policies in very complex environments are not stated.