Research Paper
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
Research Brief
Models that agree on predictions often disagree on *why*, indicating that open-source LLMs are poor surrogates for understanding the internal mechanisms of closed-source models.
This paper investigates 'surrogate fidelity': whether insights gained from open-source Large Language Models (LLMs) can reliably explain the internal workings of proprietary, 'closed' LLMs. Testing eleven models across four major families (Llama, Qwen, GPT, and Gemini) on binary classification tasks, the researchers found that models can produce the same answers (high prediction fidelity) yet often arrive at those answers for different reasons, as measured by 'attributions' (low attribution fidelity). They also observed that internal 'white-box' signals such as attention patterns are stable across models but are poor predictors of actual causal attributions, which are better captured by 'black-box' input ablations. The core conclusion is that agreement on predictions alone is insufficient to confidently transfer mechanistic understanding from open models to closed ones, which poses a significant challenge for interpretability and safety.
- AI Safety and Alignment: Without understanding the 'why' behind a closed model's decision, it becomes challenging to predict and mitigate undesirable or harmful behaviors, impacting safety-critical applications.
- Responsible AI Development and Auditing: Developers relying on proprietary LLM APIs face significant hurdles in explaining their AI systems to users or regulators, complicating accountability and trust.
- Improved Black-Box Interpretability Tools: The findings highlight the critical need for more sophisticated black-box interpretability methods that can robustly infer causal attributions for closed models, rather than relying on open-source proxies.
- Regulatory Compliance: As regulations for AI explainability emerge, this research points to a fundamental technical barrier for closed-source models to meet such requirements effectively.
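To make the distinction between prediction fidelity and attribution fidelity concrete, the following is a minimal sketch and not the paper's implementation. It assumes prediction fidelity is the fraction of inputs on which two models assign the same label, and attribution fidelity is the mean Spearman rank correlation between their leave-one-out attributions in log-odds space. The toy keyword classifiers (`model_a`, `model_b`), the helper names, and the choice of Spearman correlation are all hypothetical choices made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

# A "model" here is any black-box function mapping tokens to P(label = 1).
Model = Callable[[Sequence[str]], float]

def log_odds(p: float, eps: float = 1e-9) -> float:
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(p / (1.0 - p))

def loo_attributions(model: Model, tokens: Sequence[str]) -> List[float]:
    """Leave-one-out: drop each token and record the change in log-odds."""
    toks = list(tokens)
    base = log_odds(model(toks))
    return [base - log_odds(model(toks[:i] + toks[i + 1:])) for i in range(len(toks))]

def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)

def prediction_fidelity(a: Model, b: Model, inputs: Sequence[Sequence[str]]) -> float:
    """Fraction of inputs on which both models pick the same label."""
    same = sum((a(x) >= 0.5) == (b(x) >= 0.5) for x in inputs)
    return same / len(inputs)

def attribution_fidelity(a: Model, b: Model, inputs: Sequence[Sequence[str]]) -> float:
    """Mean rank correlation between the two models' attributions."""
    scores = [spearman(loo_attributions(a, x), loo_attributions(b, x)) for x in inputs]
    return sum(scores) / len(scores)

def keyword_model(weights: Dict[str, float]) -> Model:
    """Hypothetical toy classifier: sigmoid of summed keyword weights."""
    def predict(tokens: Sequence[str]) -> float:
        score = sum(weights.get(t, 0.0) for t in tokens)
        return 1.0 / (1.0 + math.exp(-score))
    return predict

# Two toy models that rely on different words to reach the same labels.
model_a = keyword_model({"great": 2.0, "awful": -2.0})
model_b = keyword_model({"loved": 2.0, "hated": -2.0})

inputs = [
    ["great", "movie", "loved", "it"],
    ["awful", "movie", "hated", "it"],
]

pred_fid = prediction_fidelity(model_a, model_b, inputs)
attr_fid = attribution_fidelity(model_a, model_b, inputs)
print(f"prediction fidelity:  {pred_fid:.2f}")
print(f"attribution fidelity: {attr_fid:.2f}")
```

In this toy setup the two classifiers agree on every label, yet their attribution rankings are negatively correlated because each one relies on a different word. Leave-one-out ablation needs only model outputs, so the same procedure could in principle be applied to a closed model behind an API, provided the API exposes a probability or log-probability for the label.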
Paper Trustworthiness Index
Low Skepticism: This paper displays high academic trustworthiness with formal peer-review backing or historical consensus.
Core Pillars Breakdown
The affiliation 'facebookresearch' indicates a top-tier research institution with substantial resources, a history of significant contributions to AI research, and high academic prestige. This affiliation provides strong credibility to the research.
The abstract describes a robust experimental design, comparing eleven models from four major families (Llama, Qwen, GPT, Gemini) on binary classification tasks using multiple established interpretability metrics (log-odds, leave-one-out attributions, attention patterns, perturbation magnitudes). This demonstrates a comprehensive and methodologically sound approach to the problem.
The abstract explicitly states, 'Code and results are available at https://github.com/facebookresearch/surrogate,' providing immediate and direct access to the implementation and data. This greatly enhances the reproducibility and verifiability of the research.
The abstract does not specify whether the paper has been peer-reviewed or accepted to a major conference or journal. Therefore, it is assessed as a preprint, which has not yet received formal community vetting, resulting in a neutral score.