Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?

Philippe Chlenski, Zachariah Carmichael, Ayush Warikoo, Chia-Tse Shao, Yingxiao Ye, Aobo Yang, Vivek Miglani, Nehal Bandi

Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at https://github.com/facebookresearch/surrogate.
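The log-odds readout and leave-one-out attributions mentioned in the abstract can be illustrated with a minimal sketch. It assumes the log-odds are taken over two label tokens (here "yes" and "no") and that an attribution is the drop in log-odds when one input token is removed; the paper's exact definitions are not given on this page. `toy_label_logprobs` and `TOY_WEIGHTS` are hypothetical stand-ins for an API call that returns label log-probabilities.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy weights standing in for a real model; not from the paper.
TOY_WEIGHTS = {"great": 2.0, "terrible": -2.5, "not": -0.5}


def toy_label_logprobs(tokens: Sequence[str]) -> Tuple[float, float]:
    """Stand-in for an API call returning (log P("yes"), log P("no"))."""
    z = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    p_yes = 1.0 / (1.0 + math.exp(-z))
    return math.log(p_yes), math.log(1.0 - p_yes)


def log_odds(tokens: Sequence[str]) -> float:
    """Scalar readout: log P("yes") - log P("no"), using only output log-probs."""
    logp_yes, logp_no = toy_label_logprobs(tokens)
    return logp_yes - logp_no


def leave_one_out(
    tokens: Sequence[str], score: Callable[[Sequence[str]], float]
) -> List[float]:
    """Attribution of token i = score(full input) - score(input without token i)."""
    tokens = list(tokens)
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


sentence = ["the", "movie", "was", "not", "great"]
attributions = leave_one_out(sentence, log_odds)
for token, value in zip(sentence, attributions):
    print(f"{token:>6}: {value:+.3f}")
```

Because the procedure only needs output log-probabilities for perturbed inputs, the same loop could in principle be run against an open model or a closed API, which is what makes this kind of attribution comparable across both.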

Open Source

Research Brief

Models that agree on predictions often disagree on why, so agreement on outputs alone does not make an open-source LLM a reliable surrogate for explaining a closed-source model.

This paper investigates 'surrogate fidelity': whether measurements made on open-source Large Language Models (LLMs) support claims about proprietary, 'closed' LLMs. Testing eleven models across four families (Llama, Qwen, GPT, and Gemini) on binary classification tasks, the researchers found that models can produce the same answers (high prediction fidelity) while assigning importance to different parts of the input, as measured by leave-one-out attributions (low attribution fidelity). They also observed that internal 'white-box' signals such as attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which 'black-box' input ablations capture by design. The core conclusion is that agreement on predictions is not sufficient to justify transferring mechanistic understanding from open models to closed ones, which poses a challenge for interpretability and safety work.

Potential Applications

AI Safety and Alignment: Without understanding the 'why' behind a closed model's decision, it becomes challenging to predict and mitigate undesirable or harmful behaviors, impacting safety-critical applications.
Responsible AI Development and Auditing: Developers relying on proprietary LLM APIs face significant hurdles in explaining their AI systems to users or regulators, complicating accountability and trust.
Improved Black-Box Interpretability Tools: The findings highlight the critical need for more sophisticated black-box interpretability methods that can robustly infer causal attributions for closed models, rather than relying on open-source proxies.
Regulatory Compliance: As regulations for AI explainability emerge, this research points to a technical barrier that systems built on closed-source models face in meeting such requirements.

Paper Trustworthiness Index

88 / 100

Highly Trustworthy (Low Skepticism)

This paper scores high on academic trustworthiness, although it is assessed as a preprint that has not yet received formal peer review.

Verified AI Assessment: This credibility analysis was generated by Gemini 2.5 Flash analyzing the full paper text, references, and metadata.

Core Pillars Breakdown

Author & Institutional Track Record

25 / 25

The affiliation, inferred from the 'facebookresearch' GitHub organization that hosts the code, indicates a top-tier research institution with substantial resources, a history of significant contributions to AI research, and high academic prestige. This affiliation lends strong credibility to the research.

Technical Rigor & Methodology

28 / 30

The abstract describes a robust experimental design, comparing eleven models from four major families (Llama, Qwen, GPT, Gemini) on binary classification tasks using multiple established interpretability metrics (log-odds, leave-one-out attributions, attention patterns, perturbation magnitudes). This demonstrates a comprehensive and methodologically sound approach to the problem.

Reproducibility & Openness

25 / 25

The abstract explicitly states, 'Code and results are available at https://github.com/facebookresearch/surrogate,' providing immediate and direct access to the implementation and data. This greatly enhances the reproducibility and verifiability of the research.

Community Vetting & Peer Review

10 / 20

The abstract does not specify whether the paper has been peer-reviewed or accepted to a major conference or journal. Therefore, it is assessed as a preprint, which has not yet received formal community vetting, resulting in a neutral score.

Detailed Evidence Assessment

Verified Evidence & Citations

Mechanistic interpretability requires full access to model internals.

“Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens.”

The study evaluates surrogate fidelity at prediction, attribution, and representation levels.

“We evaluate surrogate fidelity at the prediction, attribution, and representation levels.”

Prediction fidelity substantially overstates attribution fidelity across diverse LLM families.

“Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why.”

White-box signals like attention patterns are stable but weakly predictive of causal attributions.

“We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design.”

Code and results are publicly available.

“Code and results are available at https://github.com/facebookresearch/surrogate.”

Uncertainties & Omissions

• Omission: Details of the specific datasets used for binary classification tasks, including their size and nature.

• Omission: Specific experimental setups, hyperparameters, and evaluation metrics beyond 'log-odds' and 'leave-one-out attributions'.

• Omission: Full quantitative results or statistical significance analysis supporting the reported fidelity differences.

• Omission: A detailed breakdown of the 'eleven models' and their exact versions or architectural specifics.

• Uncertainty: The specific conditions or types of tasks where attribution fidelity might be higher or lower between open and closed models.

• Uncertainty: The generalizability of these findings to other AI tasks beyond binary classification (e.g., generation, summarization).

• Uncertainty: Whether advanced black-box interpretability methods could reliably bridge the identified fidelity gap for closed models.