Are Performance-Optimization Benchmarks Reliably Measuring Coding Agents?

Zhi Chen, Zhensu Sun, Yuling Shi, David Lo, Lingxiao Jiang

Repository-level performance-optimization benchmarks such as GSO, SWE-Perf and SWE-fficiency evaluate coding agents by applying patches to real repositories and comparing runtime against unoptimized baselines and official reference patches. Their leaderboard scores are increasingly used as evidence of coding-agent progress, but those scores can conflate runtime instability, benchmark-specific scoring rules, and how many tasks are already solved by at least one public submission. We audit these issues across the three benchmarks. First, we replay the official reference patches for 740 code optimization tasks across four common types of Google Cloud machines. Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks; SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes. Second, we show that public submission rankings depend strongly on the benchmark scoring rule. Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons, and SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%. Third, looking across 10 public submissions for each task, we find that at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks, and beats the unoptimized base code on 99.8% (449/450). Our study complements leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing the remaining performance gaps that are hidden by aggregate rankings.

Open Source

Research Brief

Current performance-optimization benchmarks for coding agents suffer from significant reliability issues due to runtime instability, flawed scoring rules, and a high proportion of already-solved tasks, undermining their utility for measuring true agent progress.

This paper examines the reliability of three widely used repository-level performance-optimization benchmarks for AI coding agents: GSO, SWE-Perf, and SWE-fficiency. It identifies three problems. First, runtime instability: when the official reference patches are replayed across four types of Google Cloud machines, they often fail to meet the benchmark's own validity rules on every machine (only 11/140 SWE-Perf tasks pass in every replay), and SWE-Perf is particularly fragile because many of its reference patches change runtime by close to zero. Second, public submission rankings are highly sensitive to the benchmark's scoring rule, which leads to disagreements between the benchmarks' official rankings and to a small number of tasks carrying a disproportionate share of the score. Third, on 85.3% of replay-valid GSO and SWE-fficiency tasks, at least one public submission already matches or beats the reference patch, which leaves few tasks where new agents can demonstrate progress against the reference. The study aims to complement leaderboard scores by identifying tasks with more reliable performance signals, quantifying per-task score contributions, and exposing performance gaps hidden by aggregate rankings.

Potential Applications

Development of more robust and reliable AI agent evaluation platforms for code optimization.
Guidance for AI researchers and developers in selecting or designing benchmarks that accurately reflect agent capabilities and progress.
Fairer and more accurate comparison of different coding agents, fostering genuine innovation in AI-driven code optimization.
Refinement of automated software engineering tools by providing clearer signals for performance improvements.

Paper Trustworthiness Index

48/100

Medium Skepticism

Skeptical / Unreviewed

This is a preprint publication or lacks formal peer review. It is part of the research pipeline but needs caution.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

5 / 25

The abstract does not provide information about the authors, their affiliations, or funding, making it impossible to assess their track record from the provided text alone. A neutral-low score is assigned due to this lack of information.

Technical Rigor & Methodology

28 / 30

The paper shows strong technical rigor: it audits three major benchmarks by replaying the reference patches for 740 tasks on four types of Google Cloud machines, and it quantifies replay validity, the effect of scoring rules on rankings, and task solution rates with specific counts and percentages.

Reproducibility & Openness

10 / 25

The abstract mentions replaying official reference patches and using specific Google Cloud machine types, which hints at a reproducible methodology. However, it does not provide explicit URLs for code, data, or scripts, which would be crucial for independent verification. Without these direct links, full reproducibility cannot be confirmed from the abstract alone.

Community Vetting & Peer Review

5 / 20

The abstract does not specify whether the paper has undergone peer review, been accepted at a conference, or been published in a journal. The provided text therefore contains no explicit evidence of community vetting.

Detailed Evidence Assessment

Verified Evidence & Citations

Most benchmark tasks can be replayed, but their reference patches often fail validity rules across different machines.

“Most benchmark tasks can be replayed, but their reference patches satisfy the original benchmark validity rules in every cross-machine replay for only 39/102 GSO tasks, 11/140 SWE-Perf tasks, and 411/498 SWE-fficiency tasks”
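The counts in this quote can be checked with a few lines of arithmetic. The sketch below also illustrates the "every cross-machine replay" condition as a simple all-machines check; the machine names and the example outcome are hypothetical, and the benchmarks' actual validity rules are not reproduced here.

```python
# Reported counts from the abstract: (replay-valid tasks, total tasks).
reported = {
    "GSO": (39, 102),
    "SWE-Perf": (11, 140),
    "SWE-fficiency": (411, 498),
}

def valid_on_all_machines(per_machine_valid):
    """A task counts only if its reference patch passes on every machine."""
    return all(per_machine_valid.values())

def validity_rates(counts):
    """Percentage of replay-valid tasks per benchmark, rounded to 0.1."""
    return {name: round(100 * ok / total, 1) for name, (ok, total) in counts.items()}

total_tasks = sum(total for _, total in reported.values())
rates = validity_rates(reported)

# Hypothetical replay outcome for one task on four machine types (names are made up).
example_task = {"machine_a": True, "machine_b": True, "machine_c": False, "machine_d": True}

print(total_tasks)                          # total tasks across the three benchmarks
print(rates)                                # per-benchmark replay-valid percentage
print(valid_on_all_machines(example_task))  # one failing machine invalidates the task
```

The three totals sum to the 740 tasks mentioned in the abstract, and the rates come out to roughly 38.2% for GSO, 7.9% for SWE-Perf, and 82.5% for SWE-fficiency.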

SWE-Perf is particularly fragile due to minimal runtime changes from its reference patches.

“SWE-Perf is especially fragile because many reference patches produce close-to-zero runtime changes.”

Public submission rankings are strongly dependent on the benchmark scoring rule.

“public submission rankings depend strongly on the benchmark scoring rule.”

Official rankings disagree on a significant number of pairwise submission comparisons for shared benchmarks.

“Among eight public submissions shared by GSO and SWE-fficiency, the official rankings disagree on 9 of 28 pairwise submission comparisons”
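With eight shared submissions there are 8 × 7 / 2 = 28 possible pairs, which is where the denominator in this quote comes from. The following is a minimal sketch of one way to count pairs that two rankings order differently; the submission names and rank positions are hypothetical and are not the leaderboards' actual rankings.

```python
from itertools import combinations

def pairwise_disagreements(rank_a, rank_b):
    """Count submission pairs that two rankings order differently.

    rank_a and rank_b map a submission name to its rank (1 = best)
    and must share the same keys. Returns (discordant, total_pairs).
    """
    discordant = 0
    total = 0
    for x, y in combinations(sorted(rank_a), 2):
        total += 1
        # Opposite signs mean the two rankings disagree on this pair.
        if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0:
            discordant += 1
    return discordant, total

# Hypothetical rankings of eight submissions on two benchmarks.
names = [f"sub{i}" for i in range(1, 9)]
ranking_one = dict(zip(names, [1, 2, 3, 4, 5, 6, 7, 8]))
ranking_two = dict(zip(names, [2, 1, 4, 3, 6, 5, 8, 7]))

discordant, total = pairwise_disagreements(ranking_one, ranking_two)
print(discordant, total)
```

In this toy case, each adjacent swap produces one discordant pair, so 4 of the 28 pairs disagree.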

SWE-fficiency's scoring rule gives disproportionately high weights to the worst-performing tasks.

“SWE-fficiency's leaderboard scoring rule assigns the worst ten tasks overly high score weights of 58.5%-82.8%.”
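The text here does not state the scoring formula, so the sketch below assumes, purely for illustration, a harmonic-mean aggregate of per-task scores. Under that assumption each task's influence is proportional to the reciprocal of its score, which shows how a handful of low-scoring tasks can carry most of the weight. The scores are made up and the function names are hypothetical.

```python
def reciprocal_weights(scores):
    """Share of the harmonic-mean denominator contributed by each task.

    Assumes an aggregate of the form n / sum(1 / s_i); this is an
    illustrative assumption, not a confirmed leaderboard formula.
    """
    recips = [1.0 / s for s in scores]
    total = sum(recips)
    return [r / total for r in recips]

def worst_k_share(scores, k):
    """Combined weight of the k lowest-scoring tasks."""
    weights = reciprocal_weights(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sum(weights[i] for i in order[:k])

# Hypothetical per-task scores: ten poor tasks and ninety good ones.
scores = [0.1] * 10 + [1.0] * 90
share = worst_k_share(scores, 10)
print(f"{100 * share:.1f}%")
```

In this toy setup, ten tasks out of a hundred account for a little over half of the total weight, the same kind of concentration the quoted 58.5%-82.8% range describes.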

A high percentage of replay-valid tasks already have public submissions that match or beat reference patches.

“at least one submission matches or beats the reference patch on 85.3% (384/450) of replay-valid GSO and SWE-fficiency tasks”

Almost all replay-valid tasks have public submissions that beat the unoptimized base code.

“and beats the unoptimized base code on 99.8% (449/450).”
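The two "at least one submission" figures can be read as a best-of-submissions check per task. The sketch below assumes plain runtime comparisons where lower is better, which may differ from the benchmarks' actual matching criteria; the toy runtimes are hypothetical and are not data from the paper. It also re-derives the quoted percentages from the reported counts.

```python
def coverage(tasks):
    """Return (share matching or beating reference, share beating base).

    tasks maps a task id to a dict with 'base' and 'reference' runtimes
    (seconds) and a list of 'submissions' runtimes. Lower is better.
    """
    n = len(tasks)
    beats_ref = sum(1 for t in tasks.values() if min(t["submissions"]) <= t["reference"])
    beats_base = sum(1 for t in tasks.values() if min(t["submissions"]) < t["base"])
    return beats_ref / n, beats_base / n

# Hypothetical runtimes for three tasks; these are not data from the paper.
toy_tasks = {
    "task-1": {"base": 10.0, "reference": 6.0, "submissions": [9.0, 5.5]},
    "task-2": {"base": 8.0, "reference": 4.0, "submissions": [7.0, 6.0]},
    "task-3": {"base": 5.0, "reference": 3.0, "submissions": [5.5, 5.0]},
}
ref_share, base_share = coverage(toy_tasks)

# Arithmetic check of the percentages quoted in the abstract.
reported_ref = round(100 * 384 / 450, 1)
reported_base = round(100 * 449 / 450, 1)
print(ref_share, base_share, reported_ref, reported_base)
```

The reported counts reproduce the quoted 85.3% and 99.8%, which also means a single replay-valid task has no submission faster than the unoptimized base.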

Uncertainties & Omissions

• Omission: No specific authors or institutional affiliations mentioned in the abstract.

• Omission: No direct links to code repositories, datasets, or experimental scripts are provided.

• Omission: The peer-review status or publication venue of the paper is not mentioned.