Cosmic Feed

Frontier Research Intelligence

AI & Cognition | arXiv | 2026-07-01 | Preprint (43)

Research Paper

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Chenyang Ma, Yue Yang, Radu Corcodel, Siddarth Jain, Andrew Wu, Chiori Hori, Diego Romeres

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only 16% drop on the hardest task.
Open Source

Research Brief

FurnitureVLA introduces a progress-enhanced Vision-Language-Action model for real-scale bimanual furniture assembly, raising average simulation success from 48% to 80% over baselines and showing only a 16% drop on the hardest task when validated on a real Kinova Gen3 platform.

This paper presents FurnitureVLA, a system that enables robots to assemble real-scale furniture with two arms over long task horizons. Unlike prior work, which mostly targets toy-scale settings or single-arm manipulation, FurnitureVLA tackles assembly sequences with up to 7 subtasks and 1550 control steps. The core contribution is a progress-enhanced Vision-Language-Action (VLA) model finetuned on semantically grounded subtasks. The model jointly predicts robot actions and a continuous progress signal, which enables automatic transitions between subtasks and reduces compounding errors during inference. The researchers also developed a scalable simulation pipeline for expert data generation and evaluation, and a VR teleoperation system that lets a single operator control both arms to collect real-world demonstrations. FurnitureVLA raised average simulation success from 48% to 80% over baselines across three furniture types, with an additional 21% gain from a study of perception and control design factors, and showed only a 16% drop on the hardest task when validated on a real Kinova Gen3 platform.
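The progress-driven subtask switching described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Policy` signature, the `run_assembly` loop, the 0.95 threshold, and the toy policy are hypothetical and are not taken from the paper, which the abstract does not describe at this level of detail. Only the limit of 1550 control steps comes from the source.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy signature (an assumption, not the paper's API):
# (observation, subtask instruction) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


@dataclass
class RolloutResult:
    completed_subtasks: int
    steps_used: int
    finished: bool


def run_assembly(
    policy: Policy,
    get_observation: Callable[[], object],
    apply_action: Callable[[Sequence[float]], None],
    subtasks: List[str],
    progress_threshold: float = 0.95,  # assumed value, not from the paper
    max_steps: int = 1550,             # longest horizon quoted in the abstract
) -> RolloutResult:
    """Run subtasks in order, advancing when predicted progress crosses the threshold."""
    index = 0
    steps = 0
    while index < len(subtasks) and steps < max_steps:
        action, progress = policy(get_observation(), subtasks[index])
        apply_action(action)
        steps += 1
        if progress >= progress_threshold:
            index += 1  # automatic transition to the next subtask
    return RolloutResult(index, steps, index == len(subtasks))


def make_toy_policy(steps_per_subtask: int) -> Policy:
    """Stand-in for a learned model: progress rises linearly within each subtask."""
    state = {"instruction": None, "count": 0}

    def policy(observation: object, instruction: str):
        if instruction != state["instruction"]:
            state["instruction"] = instruction
            state["count"] = 0
        state["count"] += 1
        progress = min(1.0, state["count"] / steps_per_subtask)
        return [0.0] * 14, progress  # placeholder action vector of arbitrary size

    return policy


if __name__ == "__main__":
    result = run_assembly(
        make_toy_policy(4),
        get_observation=lambda: None,
        apply_action=lambda action: None,
        subtasks=["attach leg 1", "attach leg 2", "flip tabletop"],
    )
    print(result)
```

The point of the sketch is the design choice rather than the numbers: because the model itself reports progress, no external detector or human is needed to decide when one subtask ends and the next instruction should be issued.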

Potential Applications
  • Automated manufacturing and assembly lines for complex products, moving beyond simple pick-and-place to multi-step, multi-component construction.
  • Robotic assistance in homes or commercial settings for tasks requiring bimanual manipulation and sequential processes, such as setting up modular systems or assembling consumer goods.
  • Logistics and warehousing, where robots could assemble custom orders, pack complex items, or reconfigure storage solutions.
  • Rapid deployment and construction in hazardous environments or disaster relief, where autonomous bimanual assembly of structures could be critical.

Paper Trustworthiness Index
43 / 100

Medium Skepticism
Skeptical / Unreviewed

This paper is a preprint or otherwise lacks formal peer review. It is part of the research pipeline but should be read with caution.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record
15 / 25

Although the abstract does not state the authors' affiliations, the breadth of the research, spanning robotics, vision, language models, simulation, and real-world deployment, suggests a well-resourced and recognized research group or institution in AI and robotics.

Technical Rigor & Methodology
28 / 30

The paper demonstrates strong technical rigor by formalizing the task, developing a scalable simulation, collecting high-quality real-world data, proposing a novel progress-enhanced VLA, performing extensive simulation evaluations against baselines, and validating the approach on a real Kinova Gen3 platform with detailed performance metrics.

Reproducibility & Openness
0 / 25

The abstract does not provide any information regarding the public availability of code, datasets, or trained model weights, which is crucial for independent researchers to reproduce the results.

Community Vetting & Peer Review
0 / 20

The abstract does not specify whether the paper has been peer-reviewed or published in a conference or journal, making it impossible to assess its current level of community vetting.
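The four pillar scores appear to add up to the headline index of 43 out of 100. The short check below assumes the index is a plain unweighted sum of the pillars, which the page does not state explicitly; the `pillars` dictionary and `total_score` helper are illustrative names.

```python
# Pillar scores as (awarded, maximum), copied from the breakdown above.
pillars = {
    "Author & Institutional Track Record": (15, 25),
    "Technical Rigor & Methodology": (28, 30),
    "Reproducibility & Openness": (0, 25),
    "Community Vetting & Peer Review": (0, 20),
}


def total_score(scores):
    """Sum awarded and maximum points, assuming an unweighted total."""
    awarded = sum(points for points, _ in scores.values())
    maximum = sum(limit for _, limit in scores.values())
    return awarded, maximum


awarded, maximum = total_score(pillars)
print(f"{awarded}/{maximum}")  # 43/100
```

Under this reading, more than half of the missing points come from the two pillars scored at zero, both of which are judged on what the abstract leaves unsaid rather than on the work itself.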

Detailed Evidence Assessment

Verified Evidence & Citations
  • FurnitureVLA is the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs).
    "We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs)."
  • A scalable simulation pipeline was developed for expert data generation and evaluation.
    "We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation..."
  • A VR teleoperation system was built for single-operator bimanual control to collect high-quality real-world demonstrations.
    "...and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations."
  • The proposed model is a progress-enhanced VLA that jointly predicts actions and a continuous progress signal.
    "we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal..."
  • The system addresses extreme long-horizon assembly with up to 7 subtasks and 1550 control steps.
    "To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps..."
  • FurnitureVLA improves average simulation success from 48% to 80% compared to baselines.
    "FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types..."
  • An additional 21% gain in simulation success was achieved from a design factor study.
    "...with an additional 21% gain from our design factor study."
  • The system was validated on a real Kinova Gen3 platform.
    "We validate on a real Kinova Gen3 platform with only 16% drop on the hardest task."
Uncertainties & Omissions
• Omission: Specific links to code, data repositories, or trained model weights for reproducibility.
• Omission: Detailed descriptions of the specific 'baselines' used for comparison in simulation.
• Omission: The authors' affiliations and any funding bodies involved.
• Uncertainty: The exact nature and scope of the 'baselines' against which FurnitureVLA was compared in simulation.
• Uncertainty: The full generalizability of FurnitureVLA to a wider variety of furniture types or complex bimanual tasks beyond those studied.
• Uncertainty: Details on the long-term robustness and real-world error recovery mechanisms, which are critical for practical applications.
• Uncertainty: The specific details of the 'design factor study' and how its reported 21% gain integrates with the overall performance improvement.
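
The last uncertainty can be made concrete with a little arithmetic. The sketch below takes the reported 48% and 80% figures and shows the two common readings of a percentage gain. Applying the 21% to the 80% average is purely a hypothetical exercise: the abstract does not say which configuration the 21% was measured against, nor whether it is in percentage points or relative terms. The helper names are illustrative.

```python
def absolute_gain(before: float, after: float) -> float:
    """Gain in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Gain as a percentage of the starting value."""
    return (after - before) / before * 100.0


baseline, furniture_vla = 48.0, 80.0  # average simulation success, from the abstract

print(absolute_gain(baseline, furniture_vla))            # 32.0 percentage points
print(round(relative_gain(baseline, furniture_vla), 1))  # 66.7 percent relative

# Hypothetical: two readings of "an additional 21% gain" if applied to the 80% average.
extra = 21.0
as_points = furniture_vla + extra                   # 101.0, which exceeds 100%
as_relative = furniture_vla * (1 + extra / 100.0)   # about 96.8
print(as_points, round(as_relative, 1))
```

Since adding 21 percentage points to an 80% average would exceed 100%, the gain was presumably either relative or measured on a different starting configuration; the full paper would be needed to tell which.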