AxDafny: Agentic Verified Code Generation in Dafny

Benjamin Breen, Austin Letson, Borja Requena Pozo, Leopoldo Sarra

We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification. We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments. We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness. On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance. On DafnyBench, AxDafny achieves 92.7\% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points. Lastly, we show that verification success and runtime test performance measure different aspects of generated code.

Open Source

Research Brief

AxDafny is a verifier-guided AI framework that significantly improves the generation of formally verified code and associated proofs in Dafny, outperforming existing baselines.

This research explores how artificial intelligence can be used to generate not just functional code, but also the crucial mathematical proofs needed to formally guarantee that the code works correctly and securely. The team developed AxDafny, a system that iteratively refines code, along with necessary invariants, assertions, and termination arguments, guided by continuous feedback from a formal verifier. To rigorously evaluate AxDafny, they created LCB-Pro-Dafny, a new benchmark of programming problems specified with formal requirements and an automatic verification system. AxDafny demonstrated substantial improvements in verification success over baseline AI models like GPT, achieving a 92.7% verification success rate on an existing benchmark (DafnyBench), surpassing previous state-of-the-art results.

Potential Applications

Mission-critical software development: Ensuring extreme reliability for aerospace, medical devices, and autonomous systems where failures have catastrophic consequences.
Smart contract verification: Automatically generating provably correct and secure smart contracts for blockchain applications to prevent costly bugs and exploits.
Operating system kernels and device drivers: Building foundational software components with high assurance against vulnerabilities and errors.
Security-sensitive financial software: Developing highly reliable algorithms for trading, banking, and secure transactions where correctness is paramount for preventing financial loss and fraud.

50/100

Paper Trustworthiness Index

Medium Skepticism

Moderately Trustworthy

This is a preprint publication or lacks formal peer review. It is part of the research pipeline but needs caution.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

15 / 25

The abstract does not specify authors or their institutional affiliations. However, the sophisticated topic of agentic verified code generation in Dafny implies that the research is likely conducted by experts in programming languages, formal verification, and artificial intelligence, presumably from a credible academic or industrial research environment.

Technical Rigor & Methodology

25 / 30

The paper introduces a novel framework (AxDafny), a new benchmark (LCB-Pro-Dafny) with formal specifications and a verifier-based evaluation harness, and provides quantitative performance comparisons against strong baselines (GPT-5.5, strongest proof-hint baseline) on two distinct benchmarks. This indicates a robust experimental design and rigorous evaluation methodology.

Reproducibility & Openness

5 / 25

The abstract does not provide any information regarding the availability of code, datasets, or trained models (e.g., GitHub links, supplementary materials, or an open-source mention). Without such information, an independent researcher cannot easily reproduce the results or build upon the reported work.

Community Vetting & Peer Review

5 / 20

The abstract does not mention where the paper has been published, accepted (e.g., specific conference or journal), or whether it has undergone peer review. Its current status (e.g., a preprint on arXiv) is unknown from the provided text.

Detailed Evidence Assessment

Verified Evidence & Citations

AxDafny is a verifier-guided repair framework.

“We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments.”

AxDafny substantially improves verification success over baseline GPT-5.5 performance on LCB-Pro-Dafny.

“On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance.”

AxDafny achieves 92.7% verification success on DafnyBench, outperforming a baseline.

“On DafnyBench, AxDafny achieves 92.7% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points.”

Verification success and runtime test performance measure different aspects of generated code.

“Lastly, we show that verification success and runtime test performance measure different aspects of generated code.”

Uncertainties & Omissions

• Omission:Specific authors and their affiliations.

• Omission:Publication venue (e.g., conference, journal) and peer-review status.

• Omission:Links to codebase, datasets, or trained models for reproducibility.

• Omission:Detailed description of the 'GPT-5.5' baseline or specifics of the 'strongest previously reported proof-hint baseline'.

• Omission:The precise meaning or full scope of 'agentic code generation' within the AxDafny framework.

• Uncertainty:The term 'GPT-5.5' is an uncommon designation for a publicly known OpenAI model, creating ambiguity regarding the exact baseline used (likely a typo for GPT-3.5, GPT-4, or an internal variant).

• Uncertainty:The specific mechanisms of how AxDafny's 'verifier-guided repair framework' iteratively generates and refines artifacts are not detailed in the abstract.

• Uncertainty:The extent of 'substantially improves' lacks precise numerical quantification for the LCB-Pro-Dafny results, beyond stating the relative improvement on DafnyBench.