Cosmic Feed | Frontier Research Intelligence

37 papers found

AI & Cognition | arXiv | 2026-06-30 | Skeptical (25)


QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Sergio Hernández-Gutiérrez, Matteo Merler, Ilze Amanda Auzina et al.

LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide guidance that is too sparse, failing to inform the model about the quality of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, using signals that range from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and makes methodological families that require distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method's score is Q-aligned: whether it orders actions according to the Q-values of a strong reference policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
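The abstract describes Q-alignment only informally, as whether a signal orders actions the way a reference policy's Q-values do. One way to make that idea concrete is a pairwise ordering accuracy over candidate actions at a single state. The sketch below is an illustration under that assumption, not the metric QVal itself uses (the abstract does not specify one); the function name `pairwise_q_alignment` and the example numbers are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def pairwise_q_alignment(scores: Sequence[float], q_values: Sequence[float]) -> float:
    """Fraction of action pairs ordered the same way by the signal and by the reference Q-values.

    Hypothetical illustration: pairs with tied Q-values are skipped,
    and a tie in the signal's scores counts as a miss.
    """
    if len(scores) != len(q_values):
        raise ValueError("scores and q_values must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        q_diff = q_values[i] - q_values[j]
        if q_diff == 0:
            continue  # the reference expresses no preference for this pair
        total += 1
        s_diff = scores[i] - scores[j]
        if s_diff * q_diff > 0:
            agree += 1  # same sign: the signal agrees with the reference ordering
    return agree / total if total else float("nan")


# Made-up values for three candidate actions at one state.
reference_q = [0.9, 0.5, 0.1]
signal_scores = [2.0, 1.0, 3.0]
alignment = pairwise_q_alignment(signal_scores, reference_q)
```

Here the signal agrees with the reference on one of the three action pairs, so `alignment` is 1/3. A score of 1.0 would mean the signal reproduces the reference ordering exactly, and no training run is needed to compute it.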

QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

Simulation of Two-qubit Gate Variability and Fidelity of Spin Qubits Built on Nanosheet Technology

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation

Freeform Preference Learning for Robotic Manipulation

FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

Automated Background Swapping for Robustness against Spurious Backgrounds

TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning

CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation

Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments

Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Efficient entanglement of three remote single-atom quantum-network nodes

PolicyGuard: From Organizational Policies to Neuro-Symbolic Compliance Review Engines

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

GR2 Technical Report

Spatially Coupled MacKay-Neal/Hsu-Anastasopoulos CSS Codes Achieve the Quantum-Erasure Hashing Bound by Seeded BP Decoding

Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization

Electromagnetic radiation from a point-like charge in a weak gravitational wave: a Shapiro-delay-motivated approach

Quantum Information as a New Lens for Precision Neutrino Physics

OopsieVerse: A Safety Benchmark with Damage-Aware Simulation for Robot Manipulation

Amplifying Membership Signal Through Chained Regeneration

Joint inference of weak lensing convergence map and cosmology with diffusion models

LUNA: Learning Universal 3D Human Animation Beyond Skinning

Evaluation of Population Initialization Methods for Genetic Programming-based Symbolic Regression

Constraining dark energy with complementary probes of large-scale structure

TreeAgent: A Generalizable Multi-Agent Framework for Automated Bias Labeling in Forestry via Compiled Expert Rules and Vision-Language Models

Reheating in No-Scale Models of Inflation

The contact temperature of arbitrary quantum states

MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments

Signed-Permutation Coordinate Transport for RMSNorm Transformers

Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision

AdaJEPA: An Adaptive Latent World Model

AxDafny: Agentic Verified Code Generation in Dafny

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

An efficient Pauli decomposition algorithm for structured matrices